Question

Using the modules PyPDF2 and urllib, I plan to do a fairly large-scale textual analysis of many .pdf files in Python. My current plan is to download the files with urllib, save them to my computer, and then open them and extract their text with PyPDF2.

The .pdf files range in size from 10 to 500 MB each, which (as there are ~16,000 of them) means the total volume of the project will be on the GB-to-TB scale. The extracted data will not be large, just a tagged set/count of word associations, but the .pdf files themselves will be an issue.

I'm not planning to download them all at once, but iteratively so that my system is not overwhelmed. Below is a high-level workflow:

for pdf_url in all_list:
    local_path = download_using_urllib(pdf_url)         # save the PDF to disk
    text = read_text(PyPDF2.PdfFileReader(local_path))   # extract its text
    store_word_assoc(text)                               # record the word associations
    delete_file(local_path)                              # remove the PDF again
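
For concreteness, the download step could look something like this (just a sketch, assuming Python 3, where urlretrieve lives in urllib.request):

import os
import urllib.request

def download_using_urllib(pdf_url, dest_dir='pdfs'):
    """Save the PDF at pdf_url to dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(pdf_url))
    urllib.request.urlretrieve(pdf_url, local_path)
    return local_path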

Much of the code is already written, and I can post it if it is relevant. My question is: will storing and then deleting up to 8 TB of data on my HD cause any issues with my computer? As you can see, I'm not storing it all at once, but I'm a little worried because I've never done anything at this scale before. If this will be an issue, how else can I structure my project to avoid it?

Thank you!


Solution

I would say you might look into just storing the PDFs in memory as you download them. A NamedTemporaryFile (or, to stay purely in RAM, an in-memory buffer such as io.BytesIO) might be a good way to handle this: you hold the file in memory, read from it, and then discard it. This would save your HD from doing a lot of write-intensive work.

You might also consider using requests rather than urllib; it is much more intuitive. Oh, and as a bonus, both work on Python 2 and 3.
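
As a rough sketch of how those two suggestions fit together (the extract_text_in_memory name is just for illustration, io.BytesIO is used instead of a NamedTemporaryFile so the PDF never touches disk, and the PdfFileReader/extractText names assume the older PyPDF2 API the question mentions; newer pypdf releases rename these):

import io
import requests
from PyPDF2 import PdfFileReader

def extract_text_in_memory(pdf_url):
    """Download a PDF, keep it in memory, and return its extracted text."""
    resp = requests.get(pdf_url)
    resp.raise_for_status()                      # bail out on HTTP errors
    reader = PdfFileReader(io.BytesIO(resp.content))
    pages = (reader.getPage(i).extractText() for i in range(reader.numPages))
    return '\n'.join(pages)

Once the function returns, the buffer goes out of scope and the memory is freed; nothing is ever written to disk.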

OTHER TIPS

Assuming you have a couple of GB of memory, I would recommend just keeping them in memory. Downloading that much data will be slow enough as it is; saving it to disk unnecessarily would only add to that painful process.

Since this will be a very long-running process, I would also recommend that you keep track of which files have already been extracted. That way, when it crashes, you can pick up where you left off.

I am going to use requests because it is very developer-friendly.

Pseudo Code:

import requests

for pdf_url in pdf_urls:
    if already_got_it(pdf_url):
        continue

    req = requests.get(pdf_url)
    if req.status_code < 400:           # skip URLs that could not be fetched
        text = read_text(req.content)
        store_word_assoc(text)
        mark_completed(pdf_url)
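
The already_got_it and mark_completed helpers above are left open; one simple way to implement that kind of bookkeeping (just an illustration, using a hypothetical completed_urls.txt checkpoint file) is:

import os

DONE_FILE = 'completed_urls.txt'    # hypothetical checkpoint file

def already_got_it(pdf_url):
    """Return True if this URL was finished in an earlier run."""
    if not os.path.exists(DONE_FILE):
        return False
    with open(DONE_FILE) as f:
        return pdf_url in {line.strip() for line in f}

def mark_completed(pdf_url):
    """Record the URL once its word associations have been stored."""
    with open(DONE_FILE, 'a') as f:
        f.write(pdf_url + '\n')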

If you do not have enough memory, your proposed solution will work and will not affect your disk much. It is a good bit of writing, but assuming you do not have an SSD, it should have little ill effect.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow