Question

I have a server which generates some output, like this: http://192.168.0.1/getJPG=[ID]

I have to go through IDs 1 to 20M.

I see that most of the delay is in storing the files; currently I store every request result as a separate file in a folder, in the form [ID].jpg.

The server responds quickly (the generator server is really fast), but I can't handle the received data rapidly enough.

What is the best way of storing the data for later processing?

I can use any kind of storage: a DB, a SINGLE file that I parse later, etc.

I can code in .NET, PHP, C++, etc. No restrictions on programming language. Please advise.

Thanks

Solution 2

To do this efficiently, I would multi-thread your application (C++).

The main thread of your application will make these web-requests and push them to the back of a std::list. This is all your main application thread will do.

Spawn a pthread (my preferred threading method, even on Windows...) once and keep it running; do not spawn repeatedly. Set it up to check the same std::list in a while loop. In the loop, check the size of the list and, if there are items to be processed, pop the front item off the list and write it to disk (strictly speaking, access to the same std::list from two threads should be guarded by a mutex, even if it often appears to work without one... most of the time...).
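
A minimal sketch of that producer/consumer setup, assuming C++11 and a mutex plus condition variable around the shared list; the Image struct and the fetch_jpeg() helper are placeholders for whatever HTTP client you actually use:

```cpp
#include <condition_variable>
#include <fstream>
#include <list>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// One downloaded response: the ID and the raw JPEG bytes.
struct Image {
    int id;
    std::vector<char> data;
};

std::list<Image> queue;             // shared between the two threads
std::mutex queue_mutex;
std::condition_variable queue_cv;
bool done = false;

// Stand-in for the real HTTP fetch; replace with your HTTP client of choice.
std::vector<char> fetch_jpeg(int id) {
    return std::vector<char>(50 * 1024, static_cast<char>(id & 0xff));
}

// Writer thread: pops the front item off the list and writes it to disk.
void writer_loop() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return !queue.empty() || done; });
        if (queue.empty() && done) return;
        Image img = std::move(queue.front());
        queue.pop_front();
        lock.unlock();              // do the slow disk write outside the lock

        std::ofstream out(std::to_string(img.id) + ".jpg", std::ios::binary);
        out.write(img.data.data(), static_cast<std::streamsize>(img.data.size()));
    }
}

int main() {
    std::thread writer(writer_loop);            // spawned once, kept running
    for (int id = 1; id <= 20000000; ++id) {    // full ID range from the question
        Image img{id, fetch_jpeg(id)};
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            queue.push_back(std::move(img));    // main thread only pushes
        }
        queue_cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(queue_mutex); done = true; }
    queue_cv.notify_all();
    writer.join();
}
```

The 'waiting' mentioned next can be added by having the main thread block on the same condition variable while queue.size() is above some limit, so memory stays bounded.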

This will allow you to queue up the responses in memory and at the same time asynchronously save the files to disk. If your server really is as quick as you say it is, you might run out of memory. Then I would implement some 'waiting' if the number of items to be processed is over a certain threshold, but that will only run a little better than doing it serially.

The real way to 'improve' the speed of this is to have many worker threads (each with its own std::list and 'smart' pushing onto the list with the fewest items, or one std::list shared with a mutex) processing the files. If you have a multi-core machine with multiple hard drives, this will greatly increase the speed of saving these files to disk.


The other solution is to off-load the saving of the files to many different computers as well (if the number of disks on your current computer is limiting the writes). By using a message-passing system such as ZMQ/0MQ, you can very easily push off the saving of files to different systems (set up in a PULL fashion), with more hard drives accessible than just what is on one machine. ZMQ makes the round-robin style of message passing trivial, as a fan-out architecture is built in and takes literally minutes to implement.
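
A rough sketch of that fan-out, assuming the plain C API from libzmq (zmq.h); the port, the address, and the two-part name/data message framing are choices made here for illustration, not anything prescribed by ZMQ. On the downloading machine:

```cpp
#include <zmq.h>
#include <string>
#include <vector>

// PUSH socket: connected PULL workers receive messages round-robin.
void send_file(void* push, const std::string& name, const std::vector<char>& data) {
    zmq_send(push, name.data(), name.size(), ZMQ_SNDMORE);  // part 1: file name
    zmq_send(push, data.data(), data.size(), 0);            // part 2: JPEG bytes
}

int main() {
    void* ctx = zmq_ctx_new();
    void* push = zmq_socket(ctx, ZMQ_PUSH);
    zmq_bind(push, "tcp://*:5557");   // example port

    // ... for each downloaded image: send_file(push, "12345.jpg", bytes); ...

    zmq_close(push);
    zmq_ctx_term(ctx);
}
```

And on each saver machine (as many as you have spare disks for):

```cpp
#include <zmq.h>
#include <fstream>

int main() {
    void* ctx = zmq_ctx_new();
    void* pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_connect(pull, "tcp://192.168.0.2:5557");   // downloader's address (example)

    for (;;) {
        char name[256];
        int n = zmq_recv(pull, name, sizeof(name) - 1, 0);   // part 1: file name
        if (n < 0) break;
        if (n > static_cast<int>(sizeof(name)) - 1) n = sizeof(name) - 1;
        name[n] = '\0';

        zmq_msg_t msg;                                       // part 2: JPEG bytes
        zmq_msg_init(&msg);
        zmq_msg_recv(&msg, pull, 0);

        std::ofstream out(name, std::ios::binary);
        out.write(static_cast<const char*>(zmq_msg_data(&msg)),
                  static_cast<std::streamsize>(zmq_msg_size(&msg)));
        zmq_msg_close(&msg);
    }
    zmq_close(pull);
    zmq_ctx_term(ctx);
}
```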


Yet another solution is to create a ramdisk (easily done natively on Linux; for Windows... I've used this). This will allow you to parallelize the writing of the files with as many writers as you want without issue. You'd then need to make sure to copy those files to a real storage location before you restart, or you'd lose them. But while it is running, you'd be able to store the files in real time without issue.

Other tips

So you're downloading 20 million files from a server, and the speed at which you can save them to disk is a bottleneck? If you're accessing the server over the Internet, that's very strange. Perhaps you're downloading over a local network, or maybe the "server" is even running locally.

With 20 million files to save, I'm sure they won't all fit in RAM, so buffering the data in memory won't help. And if the maximum speed at which data can be written to your disk is really a bottleneck, using MS SQL or any other DB will not change anything. There's nothing "magic" about a DB -- it is limited by the performance of your disk, just like any other program.

It sounds like your best bet would be to use multiple disks. Download multiple files in parallel, and as each is received, write it out to a different disk, in a round-robin fashion. The more disks you have, the better. Use multiple threads OR non-blocking I/O so downloads and disk writes all happen concurrently.
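
A tiny sketch of the round-robin part, assuming each disk is reachable under its own directory (the paths below are made up); every download thread just asks for the next target path:

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mount points / drive letters, one per physical disk.
const std::vector<std::string> kDisks = {"D:/jpg/", "E:/jpg/", "F:/jpg/"};
std::atomic<std::size_t> next_disk{0};

// Full path for an image ID, cycling over the disks in round-robin fashion.
std::string target_path(int id) {
    std::size_t disk = next_disk.fetch_add(1) % kDisks.size();
    return kDisks[disk] + std::to_string(id) + ".jpg";
}
```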

It probably helps to access the disk sequentially. Here is a simple trick to do this: stream all incoming files into an uncompressed ZIP file (there are libraries for that). This makes all IO sequential, and there is only one file. You can also split off a new ZIP file after 10000 images or so to keep the individual ZIPs small.

You can later read all files by streaming out of the ZIP file. Little overhead there as it is uncompressed.
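
One way to do this, assuming libzip as the "library for that" (the answer doesn't name one), is to add each image as a stored, i.e. uncompressed, entry and roll over to a new archive every 10000 images or so:

```cpp
#include <zip.h>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Appends one image to the open archive as an uncompressed ("stored") entry.
// The data is copied into a malloc'd buffer that libzip frees after zip_close.
bool add_stored(zip_t* za, int id, const std::vector<char>& jpeg) {
    void* copy = std::malloc(jpeg.size());
    std::memcpy(copy, jpeg.data(), jpeg.size());
    zip_source_t* src = zip_source_buffer(za, copy, jpeg.size(), 1 /* free on close */);
    if (!src) { std::free(copy); return false; }

    std::string name = std::to_string(id) + ".jpg";
    zip_int64_t index = zip_file_add(za, name.c_str(), src, ZIP_FL_ENC_UTF_8);
    if (index < 0) { zip_source_free(src); return false; }

    // ZIP_CM_STORE = no compression, so writing stays cheap and sequential.
    return zip_set_file_compression(za, static_cast<zip_uint64_t>(index),
                                    ZIP_CM_STORE, 0) == 0;
}

int main() {
    int err = 0;
    zip_t* za = zip_open("images_000.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
    if (!za) return 1;

    // ... for each downloaded image: add_stored(za, id, bytes);
    //     after ~10000 entries, zip_close(za) and zip_open the next archive ...

    zip_close(za);   // libzip writes the archive out here
}
```

Note that libzip holds the added buffers in memory until zip_close() writes the archive out, which is another reason to keep each archive to a modest number of images.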

It sounds like you are trying to write an application which downloads as much content as you can as quickly as possible. You should be aware that when you do this, chances are people will notice as this will suck up a good amount of bandwidth and other resources.

Since this is Windows/NTFS, there are some things you need to keep in mind:

- Do not have more than 2k files in one folder.
- Use async/buffered writes as much as possible.
- Spread over as many disks as you have available for best I/O performance.

One thing that wasn't mentioned that is somewhat important is file size. Since it looks like you are fetching JPEGs, I'm going to assume an average file size of ~50k.

I've recently done something like this with an endless stream of ~1KB text files using .Net 4.0 and was able to saturate a 100 Mbit network controller on the local net. I used the TaskFactory to generate HttpWebRequest threads to download the data into memory streams. I buffered them in memory, so I did not have to write them to disk.

The basic approach I would recommend is similar: spin off threads that each make the request, grab the response stream, and write it to disk. The hardest part will be generating the sequential folders and file names. You want to do this as quickly as possible, make it thread-safe, and do your bookkeeping in memory to avoid hitting the disk with unnecessary calls for directory contents.
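
A sketch of that bookkeeping (in C++ here to match the earlier examples, although this answer used .NET): derive the folder and file name from an atomic counter so no thread ever has to list a directory. The 2000-files-per-folder bucket size echoes the NTFS advice above, and the "out" root is made up:

```cpp
#include <atomic>
#include <filesystem>
#include <mutex>
#include <set>
#include <string>

namespace fs = std::filesystem;

const int kFilesPerFolder = 2000;       // keeps each NTFS folder small
std::atomic<int> next_id{1};
std::set<int> created_buckets;          // in-memory bookkeeping only
std::mutex bucket_mutex;

// Hands out the next path, e.g. "out/3061/6123456.jpg", creating the bucket
// folder the first time it is seen; no directory listings are ever needed.
std::string next_path() {
    int id = next_id.fetch_add(1);
    int bucket = id / kFilesPerFolder;
    fs::path dir = fs::path("out") / std::to_string(bucket);
    {
        std::lock_guard<std::mutex> lock(bucket_mutex);
        if (created_buckets.insert(bucket).second)
            fs::create_directories(dir);   // only for the first file of a bucket
    }
    return (dir / (std::to_string(id) + ".jpg")).string();
}
```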

I would not worry about trying to sequence your writes. There are enough layers of the OS/NTFS that will try and do this for you. You should be saturating some piece of your pipe in no time.
