Question

I'm always looking for large data sets to test various types of programs on. Does anyone have any suggestions?


Solution

Check out the Netflix Prize contest. I believe they released their database, or a large subset of it, to facilitate the contest.

UPDATE: Their FAQ says the subset you can download has 100 million entries.

OTHER TIPS

You might want to have a look at the data for the American Statistical Association Data Expo. It contains flight details for all commercial flights in the US for the last 20 years: 120 million records, 11 GB of data.

I've done some work with the Wikimedia download sets, which are huge XML files. Unfortunately, their download server currently appears to be having disk space issues, so many of the data sets aren't available. But when it is available, the entire English Wikipedia data set with full history is 2.8 TB (18 GB compressed).
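Files that size won't fit in memory, so they need to be parsed as a stream. Here is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. The `page` and `title` element names and the tiny in-memory sample are assumptions made for illustration; check the actual dump schema before relying on them.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_titles(stream):
    """Yield the <title> of each <page> without building the whole tree."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
            # Drop the finished page's children so they can be garbage collected.
            elem.clear()


# Tiny stand-in for a real dump; the namespace and element names are assumed.
SAMPLE = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>Alpha</title><revision><text>a</text></revision></page>
  <page><title>Beta</title><revision><text>b</text></revision></page>
</mediawiki>"""

titles = list(iter_page_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For a real file you would pass an open binary file object instead of the `BytesIO` sample. If the download is bz2-compressed, `bz2.open(path, "rb")` returns a stream that can be passed in the same way.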

A number of del.icio.us users (including myself) tag pages that contain public data using the "publicdata" tag. You can find that archive here and subscribe to an RSS feed for that tag here. Subscribe to the feed and you'll see a steady stream of interesting datasets that pop up on the web.

Not all of those datasets are large, but they're often interesting.

You might want to look at generating random data for fuzz testing. That would give you a practically unlimited amount of test data, and you're more likely to hit edge cases.
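As a rough sketch of that idea, the generator below mixes random printable strings with a short list of hand-picked edge cases and uses a seed so a failing run can be reproduced. The names `fuzz_inputs`, `EDGE_CASES`, and `parse_int_or_none` are made up for this example; the last one only stands in for whatever code you actually want to test.

```python
import random
import string

# Hand-picked inputs that often expose bugs; extend for your own program.
EDGE_CASES = ["", " ", "0", "-1", "\x00", "a" * 10_000, "ünïcödé", "'; --"]


def random_string(rng, max_len=64):
    """Return a random printable string of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length))


def fuzz_inputs(count, seed=0, edge_case_rate=0.2):
    """Yield `count` test strings; the same seed gives the same sequence."""
    rng = random.Random(seed)
    for _ in range(count):
        if rng.random() < edge_case_rate:
            yield rng.choice(EDGE_CASES)
        else:
            yield random_string(rng)


def parse_int_or_none(text):
    """Hypothetical function under test: parse an int, or return None."""
    try:
        return int(text)
    except ValueError:
        return None


# Feed the generated inputs to the code under test; it should never crash.
results = [parse_int_or_none(s) for s in fuzz_inputs(1000)]
print(len(results))
```

Seeding the generator is a deliberate choice: purely random input finds bugs, but you also need to replay the exact input that triggered one.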

Maybe some more information on what kind of test data you want, what format, and for what types of applications?

I don't know what your target platform is, but if you're developing against a Microsoft SQL Server database, check out Visual Studio for Database Professionals. It has a very cool feature that generates data for your schema using a data plan that you define.

Redgate also has a data generation tool, but I haven't used it.

The advantage is that you can create a data generation plan and use it to populate your database with large amounts of consistent data, which can be tuned to test specific areas of your schema.
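To illustrate what a data generation plan amounts to, here is a small sketch in Python. It is not how either of those tools works internally, and it uses an in-memory SQLite database rather than SQL Server so that it runs anywhere. The `orders` table, its columns, and the `PLAN` mapping are all invented for the example.

```python
import random
import sqlite3

# Hypothetical plan: column name -> function producing a value from an RNG.
PLAN = {
    "customer_id": lambda rng: rng.randint(1, 500),
    "amount_cents": lambda rng: rng.choice([0, 1, 999, 10_000, 2**31 - 1]),
    "status": lambda rng: rng.choice(["new", "paid", "refunded"]),
}


def generate_rows(plan, count, seed):
    """Yield `count` rows; a fixed seed makes the data set repeatable."""
    rng = random.Random(seed)
    for _ in range(count):
        yield tuple(make(rng) for make in plan.values())


def populate(conn, table, plan, count, seed=42):
    """Create `table` with the plan's columns and fill it with generated rows."""
    columns = ", ".join(plan)
    placeholders = ", ".join("?" for _ in plan)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        generate_rows(plan, count, seed),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
populate(conn, "orders", PLAN, count=10_000)
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Tuning the plan is where the value is: skewing `amount_cents` toward boundary values, for instance, exercises the parts of the schema most likely to break.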

You might also want to check out theinfo by Aaron Swartz.

From the site:

This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects.

If you're interested in customizing the type of data you're getting, check out Kimono Labs. It's web-scraping software you can use to scrape just about any site for free, with no limit on the number of rows returned. Just set up an API on it (you can use their URL generator to scrape a bunch of URLs at once) and then use your personal dataset as JSON, CSV, or RSS.

Licensed under: CC-BY-SA with attribution