Question

I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB.

The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.

The data itself can be of any kind. One idea is to take an initial set of data and duplicate it. But that's not good enough, because I need files that are different from one another (identical files are ignored).

Another idea is to write a program that will create files with dummy data.

Any other idea?

Solution

This may be a better question for the Statistics Stack Exchange site (see, for instance, my question on best practices for generating synthetic data).

However, if you're less interested in the properties of the data than in the infrastructure to manipulate and work with it, then you can ignore the statistics site. In particular, if you are not focused on statistical aspects of the data and merely want "big data", then we can focus on how to generate a large pile of it.

I can offer several answers:

  1. If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see this Wikipedia entry for more info). I prefer a known random number generator, as the results can be reproduced ad nauseam by anyone else. (The first sketch after this list shows the idea.)

  2. For structured data, you can look at mapping random numbers to indices and creating a table that maps indices to, say, strings, numbers, and so on, such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. repeated names), though perhaps you'd like to have a few collisions, as these occur in reality, too. (See the second sketch after this list.)

  3. Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.

  4. For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same goes for sentences, paragraphs, documents, etc.: just decide what properties you'd like to emulate, create a "blank" object, and fill it with text. (The second sketch after this list includes a simple random-word generator.)
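To make points 1 and 3 concrete, here is a minimal sketch in Java. It uses java.util.Random with a fixed seed as a stand-in for a Mersenne Twister (Apache Commons Math, for instance, ships a MersenneTwister class that can be dropped in the same way); the class name, file name, and sizes are all illustrative. Because the seed is fixed, anyone can regenerate exactly the same bytes, and the seed plus the number of bytes written is all the state you need to resume.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;

/** Streams reproducible pseudo-random bytes to a file in fixed-size chunks. */
public class RandomDataWriter {
    public static void main(String[] args) throws IOException {
        long seed = 42L;                          // fixed seed, so output is reproducible
        long targetBytes = 1L << 30;              // 1 GiB here; raise to 20 GB and beyond
        byte[] chunk = new byte[8 * 1024 * 1024]; // 8 MiB buffer, never the whole data set
        Random rng = new Random(seed);            // stand-in for a Mersenne Twister

        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("random.bin"))) {
            long written = 0;
            while (written < targetBytes) {
                rng.nextBytes(chunk);
                int n = (int) Math.min(chunk.length, targetBytes - written);
                out.write(chunk, 0, n);
                written += n;
            }
        }
    }
}
```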
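For points 2 and 4, the sketch below maps random indices into small lookup tables to emit structured records, with a random-length word standing in for free text. The tables, field layout, and file name are invented for illustration; in practice the tables would be much larger.

```java
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Random;

/** Emits structured "id,name,city,note" records built from random table lookups. */
public class FakeRecordGenerator {
    private static final String[] NAMES  = {"alice", "bob", "carol", "dave", "erin"};
    private static final String[] CITIES = {"london", "paris", "tokyo", "lima", "oslo"};

    public static void main(String[] args) throws FileNotFoundException {
        Random rng = new Random(7L); // seeded, so the data set is reproducible
        try (PrintWriter out = new PrintWriter("records.csv")) {
            for (long i = 0; i < 1_000_000L; i++) {
                // Random index -> table lookup; collisions (repeated names)
                // can occur, just as they do in real data.
                String name = NAMES[rng.nextInt(NAMES.length)];
                String city = CITIES[rng.nextInt(CITIES.length)];
                out.printf("%d,%s,%s,%s%n", i, name, city, randomWord(rng, 3, 12));
            }
        }
    }

    /** Random lowercase string whose length is drawn uniformly from [min, max]. */
    private static String randomWord(Random rng, int min, int max) {
        int len = min + rng.nextInt(max - min + 1);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rng.nextInt(26)));
        }
        return sb.toString();
    }
}
```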

OTHER TIPS

If you only need to avoid exact duplicates, you could try a combination of your two ideas: create corrupted copies of a relatively small data set. "Corruption" operations might include replacement, insertion, deletion, and character swapping.

I would write a simple program to do it. The program doesn't need to be too clever, as the speed of writing to disk is likely to be your bottleneck.
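A minimal sketch of such a program, assuming a text seed file and single-character mutations; the file names and the roughly 1% mutation rate are arbitrary choices. Running it repeatedly with different output names turns one small file into many distinct ones.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

/** Writes a mutated copy of a seed file so the copy is not an exact duplicate. */
public class FileCorrupter {
    public static void main(String[] args) throws IOException {
        Random rng = new Random();
        StringBuilder text = new StringBuilder(
                new String(Files.readAllBytes(Paths.get("seed.txt"))));

        int mutations = Math.max(1, text.length() / 100); // mutate about 1% of characters
        for (int m = 0; m < mutations && text.length() > 0; m++) {
            int pos = rng.nextInt(text.length());
            switch (rng.nextInt(4)) {
                case 0: text.setCharAt(pos, randomChar(rng)); break; // replacement
                case 1: text.insert(pos, randomChar(rng));    break; // insertion
                case 2: text.deleteCharAt(pos);               break; // deletion
                default:                                             // swap with neighbor
                    if (pos + 1 < text.length()) {
                        char tmp = text.charAt(pos);
                        text.setCharAt(pos, text.charAt(pos + 1));
                        text.setCharAt(pos + 1, tmp);
                    }
            }
        }
        Files.write(Paths.get("seed_copy.txt"), text.toString().getBytes());
    }

    private static char randomChar(Random rng) {
        return (char) ('a' + rng.nextInt(26));
    }
}
```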

Just about the "it takes a long time" comment: I recently extended a disk partition, and I know well how long it can take to move or create a great number of files. It would be much faster to ask the OS for a range of free space on disk and then create a new entry in the FAT for that range, without writing a single bit of content (reusing the previously existing information). This would serve your purpose (since you don't care about the file content) and would be as fast as deleting a file.

The problem is that this might be difficult to achieve in Java. I've found an open-source library named fat32-lib, but since it doesn't resort to native code I don't think it is useful here. For a given filesystem, and using a lower-level language (like C), I think it would be achievable if you have the time and motivation.
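That said, a rough, portable approximation in plain Java is to extend a file's length without writing any content. On filesystems that support sparse files this is nearly instantaneous, though whether space is actually reserved is filesystem-dependent; the file name and size here are illustrative.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Grows a file to a target size without streaming any content into it. */
public class Preallocate {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("big.dat", "rw")) {
            f.setLength(20L * 1024 * 1024 * 1024); // 20 GiB logical size, allocated lazily
        }
    }
}
```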

Have a look at TPC.org; they offer several database benchmarks with data generators and predefined queries.

The generators have a scale factor that lets you define the target data size.

There is also the Myriad research project (paper), which focuses on distributed "big data" generation. Myriad has a steep learning curve, so you might have to ask the authors of the software for help.
