Question

I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape so that later pre-processing (via Python) is as simple as possible. My options are the following:

  1. Dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 KB each, OR
  2. Maintain 1 single large file, to which the scraper simply appends each new output, resulting in 1 file of approx. 1.9 GB after 3 months.

Does anyone have any recommendations or warnings from their experience about either approach and how this affected pre-processing? I am cautious that a pre-processing script working on one large file might take longer, but then again, opening and closing thousands of files will also be time-consuming.
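For concreteness, the two options could look roughly like the sketch below. Note that simply concatenating `json.dump` outputs does not yield a single valid JSON document, so option 2 in practice usually means one JSON object per line (JSON Lines). The `posts` argument here is a placeholder for whatever your scraper returns.

```python
import json
from datetime import datetime, timezone

def save_option_1(posts):
    """Option 1: one fresh file per scrape, timestamp as filename."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with open(f"scrape_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(posts, f)

def save_option_2(posts):
    """Option 2: append each scrape as one JSON line to a single growing file."""
    with open("all_scrapes.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(posts) + "\n")
```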


Solution

The danger of one large file is that you may not be able to read the whole thing into memory at once. Since you already know that 3 months of data will be ~2 GB, this should not be a problem, so I would recommend option 2.
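If the single file is stored one JSON object per line (as in the option 2 sketch above, with the assumed filename `all_scrapes.jsonl`), pre-processing can read it back either in full or line by line, which keeps memory use low even if the file grows beyond expectations:

```python
import json

# Stream the single appended file back for pre-processing.
records = []
with open("all_scrapes.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))
```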

Perhaps an even better option is to save the data as several medium-sized files. For example, you could combine all the observations for a week into a single file. This way, you eliminate nearly all of the file I/O overhead (~13,000 files down to 12-14) while also freeing yourself from potential memory problems. This is certainly what I would do if I were collecting data for a very long time (years instead of months).
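A minimal sketch of that compromise, assuming the same hypothetical `posts` payload and JSON Lines format as above: route each scrape to a file named by ISO year and week, then glob them all back for pre-processing.

```python
import glob
import json
from datetime import datetime, timezone

def save_weekly(posts):
    """Append each scrape to the file for the current ISO year/week."""
    year, week, _ = datetime.now(timezone.utc).isocalendar()
    with open(f"scrapes_{year}_w{week:02d}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(posts) + "\n")

def load_all():
    """Read every weekly file back into one list for pre-processing."""
    records = []
    for path in sorted(glob.glob("scrapes_*_w*.jsonl")):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    return records
```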

Licensed under: CC-BY-SA with attribution