Datasets for benchmarking a fuzzy clustering method with millions of data points

Question 1

One excellent dataset is the one provided by this website. StackExchange provides an anonymized dump of all publicly available data found on their sites here: https://archive.org/details/stackexchange

You can read about the data schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

My copy of the data, from a year ago, has over 16 million records for this site alone (StackOverflow.com), and the dump covers all of their sites.
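
The dump files are large XML documents in which each record is a `<row>` element whose fields are stored as attributes (see the schema link above), so they are best read as a stream rather than loaded whole. Here is a minimal sketch of one way to do that with Python's standard library. The function name `iter_rows` and the choice of the `Id`, `Score` and `ViewCount` attributes are my own assumptions for illustration; pick whichever numeric fields suit your clustering features.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, wanted=("Id", "Score", "ViewCount")):
    """Yield one dict per <row> element, keeping memory use roughly constant.

    `source` is a file path or a binary file-like object.
    Attributes missing from a row come back as None.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield {name: elem.attrib.get(name) for name in wanted}
            root.clear()  # drop rows that have already been processed


# Tiny in-memory stand-in for a real dump file, for demonstration only.
sample = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Score="5" ViewCount="100" />'
    b'<row Id="2" PostTypeId="2" Score="3" />'
    b'</posts>'
)
rows = list(iter_rows(io.BytesIO(sample)))
```

With a real dump you would pass the path of the extracted XML file instead of the `BytesIO` object and convert the string values to numbers before clustering.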

Question 2

You can generate a dataset at http://www.mockaroo.com/. It is pretty good and offers many options.
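
If you would rather generate the data locally, something like the following may also do. To be clear, this is not Mockaroo and does not use its API; it is just a rough sketch that draws 2-D points from overlapping Gaussian blobs, which gives a fuzzy clustering method some ambiguous points to work with. The name `make_blobs_stream` and all parameters are made up for illustration.

```python
import random


def make_blobs_stream(n_points, centers, spread=1.0, seed=0):
    """Lazily yield (x, y, label) tuples drawn around the given centers.

    A generator is used so that millions of points can be written to disk
    without holding them all in memory at once.
    """
    rng = random.Random(seed)  # fixed seed keeps benchmark runs repeatable
    for _ in range(n_points):
        label = rng.randrange(len(centers))
        cx, cy = centers[label]
        yield (rng.gauss(cx, spread), rng.gauss(cy, spread), label)


# Two centers close enough together that the blobs overlap.
points = list(make_blobs_stream(1000, [(0.0, 0.0), (3.0, 3.0)], spread=1.5))
```

The `label` column is only there so you can compare the memberships your method produces against the blob each point was drawn from.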

Question 3

There are many large "open data" collections with scientific data around the web. Some have rather, shall we say, nontrivial data set sizes of well over a terabyte. So, depending on the size you need, take a look at genome sites like Proteomecommons or the datasets from Stanford's Natural Language Processing group.

Smaller dumps can be found in the geologists' projects like this one.