Nice real data sets for testing DBSCAN?

https://datascience.stackexchange.com/questions/73704

11-12-2020

Question

I'm looking for real datasets on which I could test my DBSCAN implementation: either a set of points in (ideally 2-dimensional) space, or a set of nodes with the distances between them.

I have looked on SNAP and CRAWDAD for such datasets, for example road networks with distances or cities with GPS coordinates, but I haven't found any!

I know that DBSCAN is said to be one of the best algorithms of its kind on real data, but I can't seem to find the real datasets people use...

Suggestions?

Solution

If you want to test whether your algorithm works as expected, I'd use the scikit-learn dataset generators (sklearn.datasets). They let you create simple synthetic 2D data with known properties: circles, half moons, etc.
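As a minimal sketch of that sanity check, the snippet below generates a half-moons dataset and runs scikit-learn's own DBSCAN on it, giving a reference labeling to compare your implementation against. The values of eps, min_samples, and the noise level are assumptions chosen for illustration, not recommendations from the original answer.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic 2D data: two interleaving half moons with a little Gaussian noise.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# Reference clustering from scikit-learn. eps and min_samples are
# illustrative guesses and will need tuning for other datasets.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1, so exclude it from the count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```

Running your own implementation on the same X with the same parameters and comparing the resulting partitions (up to a renaming of cluster ids) is a quick way to spot bugs. Small differences on border points can be legitimate, since DBSCAN's assignment of border points depends on processing order.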

If you want "real" datasets, here is an interesting resource found after a brief search:

https://uni.hi.is/helmut/2019/06/20/datasets-for-dbscan-evaluation/

It seems to be a collection of datasets used in the literature.

Otherwise, I'd recommend looking at image segmentation datasets (for instance, maps), as they make good candidates for DBSCAN. Kaggle is a good place to search, and so is the Google Dataset Search tool.

OTHER TIPS

Kaggle has some nice datasets available, including the classic Iris dataset. Take a look and pick one that looks interesting.
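If downloading from Kaggle is inconvenient, a copy of the Iris data also ships with scikit-learn. The sketch below assumes that bundled copy (via load_iris) rather than the Kaggle file, and the eps and min_samples values are illustrative guesses. Features are standardized first because DBSCAN is distance-based.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# 150 flowers, 4 numeric features each (sepal/petal length and width).
X = load_iris().data

# Put all features on a comparable scale so that no single feature
# dominates the Euclidean distances DBSCAN relies on.
X_scaled = StandardScaler().fit_transform(X)

# Illustrative parameters only; tune eps and min_samples for your use case.
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X_scaled)

# Label -1 denotes noise; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}")
```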

There are some impactful real-world data sets there, including COVID-19 related data sets. Something on the lighter side might be this scrubbed Iris data set posted not long ago.

EDIT: to elaborate on COVID-19, Kaggle has the COVID-19 Open Research Dataset (CORD-19), a nice 2 GB data set created by the Allen Institute for AI (Allen as in Paul Allen of Microsoft fame) with many partners. It's a great place to start. They also have a nice COVID-19 data set from Johns Hopkins University. There must be 100+ COVID-19 data sets. This link should bring up the search feature.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange