Multi-threading for downloading NCBI files in Python

Question 1

To download files using multiple threads:

#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen

def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None

def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
           print("Can't download {url} to {filename}, "
                 "reason: {error}".format(**locals())

if __name__ == "__main__":
   main()

Question 2

You should use multithreading, it's the right way for downloading tasks.

"these files take more than 10seconds to download and I do not know how to handle stalling",

I don't think this would be a problem because Python's multithreading will handle this, or I'd rather say multithreading is just for this kind of I/O-bound work. When a thread is waiting for download to complete, CPU will let other threads do their work.

Anyway, you'd better at least try and see what happen.

Question 3

Two ways to effect your task. 1. Using process instead of thread, multiprocess is the module you should use. 2. Using Event-based, gevent is the right module.

502 error is not your script's fault. Simply, following pattern could be used to do retry

try_count = 3
while try_count > 0:
    try:
        download_task()
    except urllib2.HTTPError:
        clean_environment_for_retry()
    try_count -= 1

In the line of except, you can refine the detail to do particular things according to concrete HTTP status code.