Question

We want to test the performance of some fuzzy clustering algorithms that some collaborators have developed. Our interest lies in large 2D datasets on which we could benchmark these algorithms. Do you know where one can find such datasets?


Solution

One excellent dataset is the one provided by this very website: Stack Exchange publishes an anonymized dump of all publicly available data from its sites here: https://archive.org/details/stackexchange

You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

I have a copy of the data from a year ago; it contains over 16 million records for StackOverflow.com alone, and the dump covers all of their sites.
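
If you want to turn the dump into a 2D benchmark, the XML files stream-parse easily. Below is a minimal Python sketch, assuming the documented dump layout in which each <row> element of Posts.xml carries Score and ViewCount attributes; the file path is a placeholder for wherever you extracted the archive:

```python
import xml.etree.ElementTree as ET

import numpy as np

# Stream-parse Posts.xml from an extracted dump (the path is an assumption).
# Per the schema documentation linked above, each <row> carries attributes
# such as Score and ViewCount; rows missing either attribute are skipped.
points = []
for _, elem in ET.iterparse("Posts.xml", events=("end",)):
    if elem.tag == "row":
        score = elem.get("Score")
        views = elem.get("ViewCount")
        if score is not None and views is not None:
            points.append((float(score), float(views)))
        elem.clear()  # keep memory flat; the full dump has millions of rows

data = np.array(points)  # shape (n_points, 2), ready for 2D clustering
print(data.shape)
```

Using iterparse instead of loading the whole tree is what makes this workable at the dump's scale.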

OTHER TIPS

You can generate datasets at http://www.mockaroo.com/. It is pretty good, and it gives you many options.
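
If you would rather skip the web tool and generate synthetic 2D data locally, with a known cluster structure to check your results against, here is a minimal NumPy sketch; the centers, spreads, and cluster sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Three Gaussian blobs in 2D; centers, spreads, and sizes are arbitrary
# choices for illustration -- scale n_per_cluster up to stress-test speed.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]])
n_per_cluster = 100_000
data = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(n_per_cluster, 2)) for c in centers
])
rng.shuffle(data)  # remove the block ordering

np.savetxt("blobs_2d.csv", data, delimiter=",", header="x,y", comments="")
```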

There are many large "open data" collections with scientific data around the web. Some have rather, shall we say, nontrivial dataset sizes of well over a terabyte. So, depending on the size you need, take a look at bioinformatics sites like ProteomeCommons or the datasets from Stanford's Natural Language Processing group.

Smaller dumps can be found in geologists' projects like this one.
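
Whichever source you pick, it helps to have a reference implementation to time your collaborators' algorithms against. Below is a minimal sketch of standard fuzzy c-means (Bezdek's algorithm, not your collaborators' method) in pure NumPy with simple wall-clock timing; the input file name refers to the synthetic sketch above, and any two-column array works in its place:

```python
import time

import numpy as np


def fuzzy_cmeans(data, n_clusters, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Minimal reference fuzzy c-means (Bezdek); data has shape (n, 2)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Random fuzzy memberships, one column per point, normalized to sum to 1.
    u = rng.random((n_clusters, n))
    u /= u.sum(axis=0)
    for _ in range(max_iter):
        um = u ** m
        # Weighted means of the points give the cluster centers.
        centers = um @ data / um.sum(axis=1, keepdims=True)
        # Distances from every point to every center, shape (c, n).
        dist = np.linalg.norm(data[None, :, :] - centers[:, None, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Standard membership update: u_ij ∝ d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=0)
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    return centers, u


data = np.loadtxt("blobs_2d.csv", delimiter=",", skiprows=1)
start = time.perf_counter()
centers, u = fuzzy_cmeans(data, n_clusters=3)
print(f"{time.perf_counter() - start:.2f}s, centers:\n{centers}")
```

With the synthetic blobs above, the recovered centers should land near the known generating centers, which is a quick sanity check before benchmarking on messier real data.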

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow