Fastest way to download thousand files using python? [closed]

Question 1

Most likely the reason it takes so long is that it takes time to open a connection, make the request, get the file, and close the connection again.

A thousand files in an hour is 3.6 seconds per file, which is high, but the site you are downloading from may be slow.

The first thing to do is to use HTTP/1.1 and keep one connection open for all the files with Keep-Alive. The easiest way to do that is to use the Requests library with a session.
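A minimal sketch of the session approach. The function name `download_all`, the `dest_dir` argument and the `index.html` fallback name are my own choices for illustration. The function only relies on `get()`, `raise_for_status()` and `content`, which a Requests session and its responses provide, so the session is passed in rather than created here.

```python
import os
from pathlib import Path
from urllib.parse import urlsplit


def download_all(session, urls, dest_dir, timeout=30):
    """Fetch every URL through one shared session and save it under dest_dir.

    `session` is expected to behave like requests.Session: reusing it lets
    the underlying connection stay open between requests to the same host.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
        # Name the local file after the last path segment of the URL.
        name = os.path.basename(urlsplit(url).path) or "index.html"
        target = dest / name
        target.write_bytes(response.content)
        saved.append(target)
    return saved


# Usage, assuming the third-party Requests package is installed:
#   import requests
#   with requests.Session() as session:
#       download_all(session, list_of_urls, "downloads")
```

Two URLs that end in the same file name would overwrite each other here, so a real script probably needs a better naming scheme.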

If this isn't fast enough, then you need to do several parallel downloads with either multiprocessing or threads.
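For the parallel version, threads are usually enough, because the work is mostly waiting on the network. A rough sketch with `concurrent.futures` from the standard library follows. `download_parallel` and its `fetch` argument are hypothetical names: `fetch` is whatever function downloads a single URL in your script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_parallel(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a pool of threads.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

Collecting the errors instead of raising on the first one means a single bad URL does not throw away the other downloads. The value of `max_workers` is a guess; too many simultaneous requests may get you throttled by the site.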

Question 2

You should try using multithreading to download many files in parallel. Have a look at the multiprocessing module, especially its worker pools.
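As a sketch of the worker-pool idea: the multiprocessing package also ships a thread-backed pool, `multiprocessing.pool.ThreadPool`, with the same interface as the process pool. The names `fetch` and `fetch_all` and the pool size of 10 are assumptions for the example, not something from the question.

```python
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return (url, body)."""
    with urlopen(url, timeout=timeout) as response:
        return url, response.read()


def fetch_all(urls, worker=fetch, pool_size=10):
    """Hand the URLs to a pool of worker threads and collect {url: body}."""
    with ThreadPool(pool_size) as pool:
        # imap_unordered yields each result as soon as its worker finishes;
        # dict() consumes all of them before the pool is shut down.
        return dict(pool.imap_unordered(worker, urls))
```

This holds every file in memory at once, which is fine for a thousand small files but not for large ones.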

Question 3

The issue is very unlikely to be bandwidth (connection speed), because almost any network connection can sustain that rate. The issue is latency: the time it takes to establish a connection and set up each transfer. I know nothing about Python, but I would suggest you split your list and run the requests in parallel if possible, on multiple threads or processes, since the task is almost certainly neither CPU-bound nor bandwidth-bound. In other words, fire off multiple requests in parallel so that several setups proceed at the same time and the latency of each is hidden behind the others.
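To illustrate how overlapping requests hides latency, here is a small simulation rather than a real download. `fake_download` just sleeps to stand in for connection setup, and the 50 ms latency, 20 files and 10 workers are made-up numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, latency=0.05):
    """Stand-in for one small download: all of the cost is waiting."""
    time.sleep(latency)
    return url


def compare(n_files=20, latency=0.05, workers=10):
    """Return (sequential_seconds, parallel_seconds) for the fake downloads."""
    urls = [f"file-{i}" for i in range(n_files)]

    def job(url):
        return fake_download(url, latency)

    # One after another: the waits add up.
    start = time.perf_counter()
    for url in urls:
        job(url)
    sequential = time.perf_counter() - start

    # Several at once: the waits overlap.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(job, urls))
    parallel = time.perf_counter() - start

    return sequential, parallel
```

With these defaults the sequential run should take roughly `n_files * latency`, while the parallel run should take roughly that divided by the number of workers, plus some thread overhead.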

By the way, if your thousand files amount to 5MB, then they are around 5kB each, rather than the 20kB to 350kB you say.
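The arithmetic, spelled out with the figures mentioned in this thread (a thousand files, 5 MB in total, about an hour):

```python
n_files = 1000
total_bytes = 5 * 1000 * 1000   # 5 MB
elapsed_s = 60 * 60             # one hour

avg_file_kb = total_bytes / n_files / 1000        # average size per file, in kB
seconds_per_file = elapsed_s / n_files            # average time per file
throughput_kb_s = total_bytes / elapsed_s / 1000  # average transfer rate, in kB/s

print(avg_file_kb, seconds_per_file, round(throughput_kb_s, 2))
```

That works out to 5 kB and 3.6 seconds per file, or about 1.4 kB/s on average, far below what an ordinary connection can carry.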

Question 4

You are probably not going to be able to top that speed without either a) a faster internet connection for both you and the provider, or b) getting the provider to offer the files you need as a single zip or tar.gz archive.

The other possibility would be to use a cloud service such as Amazon to fetch the files to your cloud location, zip or compress them there, and then download the single archive to your local machine. Because the cloud service sits close to the internet backbone, it should have a faster connection to the provider than you do. The downside is that you may end up having to pay for this, depending on the service you use.
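The bundling step on the cloud machine could look something like this sketch, which uses only the standard library. `bundle_directory` and its arguments are hypothetical names, and it assumes the files have already been fetched into one directory there.

```python
import zipfile
from pathlib import Path


def bundle_directory(src_dir, archive_path):
    """Compress every file under src_dir into one zip archive.

    Returns the number of files written. Keep archive_path outside
    src_dir so the archive does not try to include itself.
    """
    src = Path(src_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive unpacks cleanly.
                archive.write(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

You then pay the connection setup cost once for the archive instead of once per file.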