Web Scraping: Multiple small files or one large file?
11-12-2020
Question
I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape so that pre-processing it later (via Python) is as simple as possible. My options are the following:
- Dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 KB each, OR
- Maintain a single large file, to which the scraper simply appends each new output, resulting in one file of approx. 1.9 GB after 3 months.
Does anyone have any recommendations or warnings from experience with either approach, and how it affected pre-processing? I am cautious that a pre-processing script might take longer to run on one large file, but then again, opening and closing thousands of files will also be time-consuming.
Solution
The danger of one large file is that you may not be able to read the whole thing into memory at once. Since you already know that 3 months of data will be ~2GB, this should not be a problem, so I would recommend option 2.
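If you do go with option 2, one convenient layout (a minimal sketch; `DATA_FILE`, `append_scrape`, and the shape of the `posts` payload are hypothetical placeholders, not from the question) is JSON Lines: append one JSON object per scrape, so the file can later be streamed line by line in Python rather than loaded into memory all at once:

```python
import json
from datetime import datetime, timezone

DATA_FILE = "scrapes.jsonl"  # hypothetical path; one JSON object per line

def append_scrape(posts):
    """Append one scrape's output as a single JSON line."""
    record = {
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "posts": posts,  # whatever structure the scraper returns
    }
    with open(DATA_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def iter_scrapes():
    """Stream scrapes back one at a time, without holding the whole file in memory."""
    with open(DATA_FILE, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```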
Perhaps an even better option is to save the data as several medium-sized files. For example, you could combine all the observations for a week into a single file. That way you eliminate nearly all of the file I/O overhead (~13,000 files down to 12-14) while also freeing yourself from potential memory problems. This is certainly what I would do if I were collecting data for a very long time (years instead of months).
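As a rough sketch of that middle ground (the function names and file-naming scheme here are just an illustration), the scraper could derive the filename from the ISO week of each scrape, so a week's worth of observations lands in one file:

```python
import json
from datetime import datetime, timezone

def weekly_filename(ts):
    """Bucket a scrape into one file per ISO week, e.g. scrapes_2020-W46.jsonl."""
    year, week, _ = ts.isocalendar()
    return f"scrapes_{year}-W{week:02d}.jsonl"

def append_scrape(posts):
    """Append one scrape's output to the current week's file."""
    ts = datetime.now(timezone.utc)
    record = {"scraped_at": ts.isoformat(), "posts": posts}
    with open(weekly_filename(ts), "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Switching between daily, weekly, or monthly buckets is then a one-line change in `weekly_filename`.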