Checking if a file is being downloaded by Python Requests library
21-12-2019
Question
I have been having problems with a script I am developing: it produces no output, and its memory usage keeps growing over time. I have traced the problem to some of the URLs I am checking with the Requests library. I expect to download a webpage, but I get a large file instead. All of that data is then held in memory, which causes my issues.
What I want to know is: is there any way with the Requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].
Is this information available in the headers with Requests? If so, is there a way of terminating the download once it turns out not to be an HTML webpage?
Thanks in advance.
Solution
Yes, the headers can tell you a lot about the response. Most responses include a Content-Length header, and the Content-Type header carries the media type (the [application/zip] part of the wget output).
By default, however, the response body is downloaded in its entirety before the .get() or .post(), etc. call returns. Set the stream=True keyword to defer loading the body:
import requests
response = requests.get(url, stream=True)
Now you can inspect the headers and just discard the response if you don't like what you find:
length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print('Response larger than 1MB, discarding')
    response.close()  # release the connection without reading the body
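The question also asks how to tell that the response is not an HTML page. One way to do that is to look at the Content-Type header as well as the length. The helper below is only a sketch: the name is_html and the set of media types it accepts are assumptions for illustration, not part of Requests.

```python
def is_html(headers):
    """Return True if the Content-Type header names an HTML media type.

    `headers` is any mapping with a .get() method, such as response.headers.
    The accepted media types below are an assumption; adjust them as needed.
    """
    content_type = headers.get('Content-Type', '')
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')
```

With a streamed response you could call is_html(response.headers) and then response.close() when it returns False. Keep in mind that servers can send a missing or wrong Content-Type, so this is a hint and not a guarantee.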
Subsequently accessing the .content or .text attributes, or calling the .json() method, will trigger a full download of the response body.
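Since Content-Length may be absent (the snippet above then falls back to 0), it can help to cap the download itself instead of touching .content. The following is a minimal sketch that reads a streamed response with iter_content() and gives up once a limit is passed. The name read_capped and its default values are assumptions made for this example.

```python
def read_capped(response, max_bytes=1048576, chunk_size=8192):
    """Read a streamed response body, giving up once it exceeds max_bytes.

    Returns the body as bytes, or None if the limit was exceeded.
    `response` is expected to come from requests.get(url, stream=True).
    """
    chunks = []
    received = 0
    try:
        for chunk in response.iter_content(chunk_size=chunk_size):
            received += len(chunk)
            if received > max_bytes:
                # Too large: stop reading and report failure.
                return None
            chunks.append(chunk)
    finally:
        # Always close the response so the connection is released.
        response.close()
    return b''.join(chunks)
```

It would be used as body = read_capped(requests.get(url, stream=True)). At most about max_bytes plus one chunk is held in memory before the download is abandoned.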