Question

Using the modules PyPDF2 and urllib, I plan to do a fairly large-scale textual analysis of many .pdf files in Python. My current plan is to download the files with urllib, save them to my computer, and then open them and extract their text with PyPDF2.

The .pdf files range in size from 10 to 500 MB each, which (as there are ~16,000 of them) means the total volume of the project will be on the GB-to-TB scale. The extracted data will not be large, just a tagged set/count of word associations, but the .pdf files themselves will be an issue.

I'm not planning to download them all at once, but iteratively so that my system is not overwhelmed. Below is a high-level workflow:

for pdf_url in all_list:
    local_path = download_using_urllib(pdf_url)         # save the PDF to disk
    text = read_text(PyPDF2.PdfFileReader(local_path))   # extract its text
    store_word_assoc(text)                               # record the word associations
    delete_file(local_path)                              # remove the PDF again
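
For concreteness, the download step could look something like this (just a sketch, assuming Python 3, where urlretrieve lives in urllib.request):

import os
import urllib.request

def download_using_urllib(pdf_url, dest_dir='pdfs'):
    """Save the PDF at pdf_url to dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(pdf_url))
    urllib.request.urlretrieve(pdf_url, local_path)
    return local_path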

Much of the code is already written, and I can post it if it is relevant. My question is: will storing and then deleting up to 8 TB of data on my HD cause any issues with my computer? As you can see, I'm not storing it all at once, but I'm a little worried because I've never done anything at this scale before. If this will be an issue, how else can I structure my project to avoid it?

Thank you!


Solution

I would say you might look into just storing the PDFs in memory as you download them. A NamedTemporaryFile (or, to stay purely in RAM, an in-memory buffer such as io.BytesIO) might be a good way to handle this: you hold the file in memory, read from it, and then discard it. This would save your HD from doing a lot of write-intensive work.

You might also consider using requests rather than urllib; it is much more intuitive. Oh, and as a bonus, both work on Python 2 and 3.
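
As a rough sketch of how those two suggestions fit together (the extract_text_in_memory name is just for illustration, io.BytesIO is used instead of a NamedTemporaryFile so the PDF never touches disk, and the PdfFileReader/extractText names assume the older PyPDF2 API the question mentions; newer pypdf releases rename these):

import io
import requests
from PyPDF2 import PdfFileReader

def extract_text_in_memory(pdf_url):
    """Download a PDF, keep it in memory, and return its extracted text."""
    resp = requests.get(pdf_url)
    resp.raise_for_status()                      # bail out on HTTP errors
    reader = PdfFileReader(io.BytesIO(resp.content))
    pages = (reader.getPage(i).extractText() for i in range(reader.numPages))
    return '\n'.join(pages)

Once the function returns, the buffer goes out of scope and the memory is freed; nothing is ever written to disk.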

OTHER TIPS

Assuming you have a couple of GB of memory, I would recommend just keeping them in memory. Downloading that much data will be slow enough as it is; saving it to disk unnecessarily would only add to that painful process.

Since this will be a very long-running process, I would also recommend that you keep track of which files have already been extracted. That way, when it crashes, you can pick up where you left off.

I am going to use requests because it is very developer-friendly.

Pseudo Code:

import requests

for pdf_url in pdf_urls:
    if already_got_it(pdf_url):
        continue

    req = requests.get(pdf_url)
    if req.status_code < 400:           # skip URLs that could not be fetched
        text = read_text(req.content)
        store_word_assoc(text)
        mark_completed(pdf_url)
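
The already_got_it and mark_completed helpers above are left open; one simple way to implement that kind of bookkeeping (just an illustration, using a hypothetical completed_urls.txt checkpoint file) is:

import os

DONE_FILE = 'completed_urls.txt'    # hypothetical checkpoint file

def already_got_it(pdf_url):
    """Return True if this URL was finished in an earlier run."""
    if not os.path.exists(DONE_FILE):
        return False
    with open(DONE_FILE) as f:
        return pdf_url in {line.strip() for line in f}

def mark_completed(pdf_url):
    """Record the URL once its word associations have been stored."""
    with open(DONE_FILE, 'a') as f:
        f.write(pdf_url + '\n')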

If you do not have enough memory, your proposed solution will work and will not affect your disk much. It is a good bit of writing, but assuming you do not have an SSD, it should have little ill effect.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow