Question

I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape so that later pre-processing (via Python) is as simple as possible. My options are the following:

  1. Dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 KB each, OR
  2. Maintain 1 single large file, to which the scraper simply appends each new output, resulting in 1 file of approx. 1.9 GB after 3 months.

Does anyone have any recommendations or warnings from their experience about either approach and how this affected pre-processing? I am cautious that a pre-processing script working on one large file might take longer, but then again, opening and closing thousands of files will also be time-consuming.
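For concreteness, the two options could look roughly like the sketch below. Note that simply concatenating `json.dump` outputs does not yield a single valid JSON document, so option 2 in practice usually means one JSON object per line (JSON Lines). The `posts` argument here is a placeholder for whatever your scraper returns.

```python
import json
from datetime import datetime, timezone

def save_option_1(posts):
    """Option 1: one fresh file per scrape, timestamp as filename."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with open(f"scrape_{stamp}.json", "w", encoding="utf-8") as f:
        json.dump(posts, f)

def save_option_2(posts):
    """Option 2: append each scrape as one JSON line to a single growing file."""
    with open("all_scrapes.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(posts) + "\n")
```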


Solution

The danger of one large file is that you may not be able to read the whole thing into memory at once. Since you already know that 3 months of data will be ~2 GB, this should not be a problem, so I would recommend option 2.
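If the single file is stored one JSON object per line (as in the option 2 sketch above, with the assumed filename `all_scrapes.jsonl`), pre-processing can read it back either in full or line by line, which keeps memory use low even if the file grows beyond expectations:

```python
import json

# Stream the single appended file back for pre-processing.
records = []
with open("all_scrapes.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))
```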

Perhaps an even better option is to save the data as several medium-sized files. For example, you could combine all the observations for a week into a single file. This way, you eliminate nearly all of the file I/O overhead (~13,000 files down to 12-14) while also freeing yourself from potential memory problems. This is certainly what I would do if I were collecting data for a very long time (years instead of months).
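A minimal sketch of that compromise, assuming the same hypothetical `posts` payload and JSON Lines format as above: route each scrape to a file named by ISO year and week, then glob them all back for pre-processing.

```python
import glob
import json
from datetime import datetime, timezone

def save_weekly(posts):
    """Append each scrape to the file for the current ISO year/week."""
    year, week, _ = datetime.now(timezone.utc).isocalendar()
    with open(f"scrapes_{year}_w{week:02d}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(posts) + "\n")

def load_all():
    """Read every weekly file back into one list for pre-processing."""
    records = []
    for path in sorted(glob.glob("scrapes_*_w*.jsonl")):
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    return records
```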

Licensed under: CC-BY-SA with attribution