Question

Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. It looks like something is wrong with the way urllib2 handles proxies, or with the way I use it! Anyway, here is the simplest possible code that retrieves files in a loop:

import urlgrabber

for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    # close_connection=1 tells urlgrabber to close the connection after each file
    urlgrabber.urlgrab(url,
                       proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                       keepalive=1, close_connection=1, throttle=0)

Hello all!

I am trying to write a very simple Python script to grab a bunch of files via urllib2.

This script needs to work through the proxy at work (the issue does not occur when grabbing files on the intranet, i.e. without the proxy).

Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? The proxy seems to be rejecting my authentication, but why? The first few urlopen requests went through correctly!

Edit: adding a 10-second sleep between requests, in case the proxy was throttling me, did not change the results.

Here is a simplified version of my script (with identifying information stripped, obviously):

import urllib2

# set up Basic authentication for the proxy
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

# route all http requests through the proxy
proxy_support = urllib2.ProxyHandler({"http" : "<proxy http address>"})
opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)

for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())

Thanks in advance!

Solution

You can minimize the number of connections by using the keepalive handler from the urlgrabber module.

import urllib2
from keepalive import HTTPHandler  # keepalive ships with urlgrabber

# the keepalive handler reuses one connection across requests instead of
# opening a new one per urlopen call
keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)

fo = urllib2.urlopen('http://www.python.org')

I am unsure whether this will work correctly with your proxy setup. You may have to hack the keepalive module.
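
If you do try it, a minimal sketch of the combination might look like the following, chaining the keepalive handler together with the proxy and auth handlers from the question. This is untested against an authenticating proxy, and the angle-bracket values are the same placeholders as above:

import urllib2
from keepalive import HTTPHandler  # ships with urlgrabber

# same proxy/auth setup as in the question
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)
proxy_support = urllib2.ProxyHandler({"http" : "<proxy http address>"})

# one opener holding all three handlers, so the proxy connection (and its
# authenticated state) can be reused across requests
opener = urllib2.build_opener(HTTPHandler(), authinfo, proxy_support)
urllib2.install_opener(opener)

for i in range(100):
    data = urllib2.urlopen("http://www.iana.org/domains/example/").read()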

OTHER TIPS

The proxy might be throttling your requests; it may have decided you look like a bot.

You could add a timeout and see if that gets you through.
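
That could mean either a socket timeout or a pause between requests; a minimal sketch of both, assuming Python 2.6+ (where urlopen accepts a timeout argument) and with arbitrary values:

import time
import urllib2

for i in range(10):
    # the timeout makes a stalled proxy connection fail fast instead of hanging
    f = urllib2.urlopen("http://www.iana.org/domains/example/", timeout=30)
    data = f.read()
    time.sleep(2)  # pause between requests in case the proxy rate-limits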
