Question

I have been having problems with a script I am developing: I receive no output, and the script's memory usage grows larger and larger over time. I have figured out that the problem lies with some of the URLs I am checking with the Requests library. I expect to download a webpage, but instead I end up downloading a large file. All of that data is then held in memory, which is causing my issues.

What I want to know is: is there any way with the requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].

Is this information available in the headers with requests? If so, is there a way of terminating the download once I figure out it is not an HTML webpage?

Thanks in advance.


Solution

Yes, the headers can tell you a lot about the page; most pages will include a Content-Length header.

By default, however, the response is downloaded in its entirety before the .get() or .post(), etc. call returns. Set the stream=True keyword argument to defer downloading the response body:

response = requests.get(url, stream=True)

Now you can inspect the headers and just discard the request if you don't like what you find:

length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print('Response larger than 1MB, discarding')
    response.close()  # drop the connection without reading the body

Subsequently accessing the .content or .text attributes, or calling the .json() method, will trigger a full download of the response body.
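For completeness, here is a minimal sketch of the whole approach, assuming you also want to check the Content-Type header so that non-HTML responses (like that application/zip file) are discarded; the URL is just a placeholder:

import requests

url = 'http://example.com/some/page'  # placeholder

response = requests.get(url, stream=True)

content_type = response.headers.get('Content-Type', '')
length = int(response.headers.get('Content-Length', 0))

if 'text/html' not in content_type or length > 1048576:
    # Not an HTML page, or larger than 1MB: close the connection
    # without ever reading the body into memory.
    response.close()
else:
    # Safe to read; accessing .text downloads the full body now.
    html = response.text

Note that Content-Length is not guaranteed to be present (for example, with chunked transfer encoding), so treating a missing header as 0 is an assumption you may want to revisit.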

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow