download file over https query with python headless browser

https://stackoverflow.com/questions/12965562

09-07-2021

Question

I am trying to scrape a website in Python (using spynner and BeautifulSoup). At some point I want to test a zip file download, triggered by the following HTTPS GET request:

https://mywebsite.com/download?from=2011&to=2012

If entered directly in a browser (Chrome), this triggers the download of a zip file with a given name. I have not been able to reproduce this behavior with my headless browser. I know it's not the right way to do it, but using something like this with spynner:

from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
output = b.load("https://website.com/download?from=2011&to=2012")
print b.html
>> ...

does not work, of course (no zip file is downloaded). The last print statement shows that I end up on an error page with a Java exception stack trace.

Is there a way to

properly call the HTTP query without using the spynner load mechanism?
capture the resulting zip file?
download it with a chosen name?

Thanks for your help.

One last thing that came up after some testing in Chrome with the debugger: I get the following warning when doing it in the browser:

Resource interpreted as Document but transferred with MIME type application/zip "https://mywebsite.com/download?from=2011&to=2012"

Edited:

Found out that the call made was:

https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012

which can be used in a browser. In Python it should be replaced by

https://mywebsite.com/download?from=10/18/2011&to=10/18/2012

because the already-encoded form could not be used there: the URL encoding step would map %2F to %252F.
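The double-encoding effect described above can be reproduced with the Python 3 standard library. This is only an illustrative sketch: the date values come from the URLs quoted in the question, and how spynner or Qt encode the URL internally is not shown here.

```python
from urllib.parse import quote, unquote, urlencode

# Raw (unencoded) values, taken from the URLs quoted in the question.
raw_from, raw_to = "10/18/2011", "10/18/2012"

# Encoding the raw value once yields the form the browser sends.
encoded_once = quote(raw_from, safe="")

# Encoding an already-encoded value escapes the "%" itself,
# which is where %252F comes from.
encoded_twice = quote(encoded_once, safe="")

# Building the query string from raw values encodes each value exactly once.
query = urlencode({"from": raw_from, "to": raw_to})
url = "https://mywebsite.com/download?" + query

# Decoding first is one way to normalise input that may already be encoded.
normalised = unquote("10%2F18%2F2011")
```

Starting from the raw values and encoding exactly once avoids having to guess whether a given string has already been escaped.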

Solution

I'm not sure if this will handle your case, but give it a try:

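import spynner
from PyQt4.QtCore import QByteArray

# Assumes `b` is the spynner Browser from the question
# and `url` is the download URL.

def download_finished(reply):
    # Called when the network manager finishes a request.
    try:
        with open('filename.ext', 'wb') as downloaded_file:
            downloaded_file.write(reply.readAll())
    except Exception:
        pass  # note: errors while writing are silently ignored

    # Disconnect so that later requests do not trigger another write.
    b.manager.finished.disconnect(download_finished)

download_url = spynner.QUrl(url)
request = spynner.QNetworkRequest(download_url)

request.setRawHeader('Accept', QByteArray(
    'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'))

# Connect the callback before issuing the request, then wait for it.
b.manager.finished.connect(download_finished)
reply = b.manager.get(request)
b.wait_requests(1)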
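The callback above always writes to 'filename.ext'. If the server announces the archive name in a Content-Disposition response header (an assumption, since the question does not show the response headers), the name could be recovered with the standard library. The helper name `filename_from_content_disposition`, the default name, and the sample header value are hypothetical, and the sketch targets Python 3, unlike the Python 2 code above.

```python
import os
from email.message import Message


def filename_from_content_disposition(header_value, default="download.zip"):
    """Extract the filename parameter from a Content-Disposition value.

    Falls back to `default` when the header is missing or has no filename.
    """
    if not header_value:
        return default
    msg = Message()
    msg["Content-Disposition"] = header_value
    name = msg.get_filename()
    if not name:
        return default
    # Keep only the last path component so a server-supplied name
    # cannot point outside the target directory.
    return os.path.basename(name) or default


# Hypothetical header value, for illustration only.
example = filename_from_content_disposition(
    'attachment; filename="export_2011_2012.zip"')
```

In the Qt callback, the header value would presumably be read from the reply (for example via `reply.rawHeader('Content-Disposition')`) and converted to a string before being passed to this helper.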

OTHER TIPS

You've made a mistake with spynner.

The script should look like this:

from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
b.load("https://website.com/download?from=2011&to=2012")
# print b.html
# Binary mode, since the target is a zip archive.
f = open("/tmp/foo.zip", "wb")
f.write(b.html)
f.close()

See spynner doc
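Since the question reports ending up on an error page, it may be worth checking that whatever was captured is really a zip archive before trusting the saved file. A minimal Python 3 sketch using only the standard library follows; the helper names `looks_like_zip` and `list_zip_members` are hypothetical.

```python
import io
import zipfile


def looks_like_zip(payload):
    """Return True if the given bytes appear to be a zip archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def list_zip_members(payload):
    """Return the names of the files stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return archive.namelist()
```

An HTML error page fails the `looks_like_zip` check, so the script can report the failure instead of writing a broken .zip file.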

Does the following code work?

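import os
import urllib
import urlparse

url = YOUR_URL

# Python 2: URLopener.retrieve saves the response body to the given filename.
# The filename used here is the last segment of the URL path.
opener = urllib.URLopener()
print 'downloading:', url
opener.retrieve(url, os.path.basename(urlparse.urlparse(url).path))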
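A rough Python 3 sketch of the same idea, saving the response under a name chosen by the caller, could look like the following. The function name `download_to` and the timeout default are hypothetical. It assumes the download works without the session or cookies established in the headless browser, which the question does not confirm.

```python
import shutil
from urllib.request import urlopen


def download_to(url, dest_path, timeout=30):
    """Stream the response body of `url` into `dest_path` and return the path."""
    with urlopen(url, timeout=timeout) as response, open(dest_path, "wb") as out:
        # Copy in chunks so large archives are not held in memory at once.
        shutil.copyfileobj(response, out)
    return dest_path


# Usage (not executed here, the URL is the placeholder from the question):
# download_to("https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012",
#             "archive_2011_2012.zip")
```

The destination name is passed explicitly, so it does not depend on the URL path, which for the URL in the question would only yield "download".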

Licensed under: CC-BY-SA with attribution