How to know if urllib.urlretrieve succeeds?

https://stackoverflow.com/questions/987876

13-09-2019
|

Question

urllib.urlretrieve returns silently even if the file doesn't exist on the remote http server, it just saves a html page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

just returns silently, even if abc.jpg doesn't exist on google.com server, the generated abc.jpg is not a valid jpg file, it's actually a html page . I guess the returned headers (a httplib.HTTPMessage instance) can be used to actually tell whether the retrieval successes or not, but I can't find any doc for httplib.HTTPMessage.

Can anybody provide some information about this problem?

Solution

Consider using urllib2 if it possible in your case. It is more advanced and easy to use than urllib.

You can detect any HTTP errors easily:

>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found

resp is actually HTTPResponse object that you can do a lot of useful things with:

>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"

OTHER TIPS

I keep it simple:

# Simple downloading with progress indicator, by Cees Timmerman, 16mar12.

import urllib2

remote = r"http://some.big.file"
local = r"c:\downloads\bigfile.dat"

u = urllib2.urlopen(remote)
h = u.info()
totalSize = int(h["Content-Length"])

print "Downloading %s bytes..." % totalSize,
fp = open(local, 'wb')

blockSize = 8192 #100000 # urllib.urlretrieve uses 8192
count = 0
while True:
    chunk = u.read(blockSize)
    if not chunk: break
    fp.write(chunk)
    count += 1
    if totalSize > 0:
        percent = int(count * blockSize * 100 / totalSize)
        if percent > 100: percent = 100
        print "%2d%%" % percent,
        if percent < 100:
            print "\b\b\b\b\b",  # Erase "NN% "
        else:
            print "Done."

fp.flush()
fp.close()
if not totalSize:
    print

According to the documentation is is undocumented

to get access to the message it looks like you do something like:

a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')

b is the message instance

Since I have learned that Python it is always useful to use Python's ability to be introspective when I type

dir(b)

I see lots of methods or functions to play with

And then I started doing things with b

for example

b.items()

Lists lots of interesting things, I suspect that playing around with these things will allow you to get the attribute you want to manipulate.

Sorry this is such a beginner's answer but I am trying to master how to use the introspection abilities to improve my learning and your questions just popped up.

Well I tried something interesting related to this-I was wondering if I could automatically get the output from each of the things that showed up in the directory that did not need parameters so I wrote:

needparam=[]
for each in dir(b):
    x='b.'+each+'()'
    try:
        eval(x)
        print x
    except:
        needparam.append(x)

You can create a new URLopener (inherit from FancyURLopener) and throw exceptions or handle errors any way you want. Unfortunately, FancyURLopener ignores 404 and other errors. See this question:

How to catch 404 error in urllib.urlretrieve

I ended up with my own retrieve implementation, with the help of pycurl it supports more protocols than urllib/urllib2, hope it can help other people.

import tempfile
import pycurl
import os

def get_filename_parts_from_url(url):
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]
    return t

def retrieve(url, filename=None):
    if not filename:
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix = '.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    try:
        c.perform()
    except:
        filename = None
    finally:
        c.close()
        f.close()
    return filename

class MyURLopener(urllib.FancyURLopener):
    http_error_default = urllib.URLopener.http_error_default

url = "http://page404.com"
filename = "download.txt"
def reporthook(blockcount, blocksize, totalsize):
    pass
    ...

try:
    (f,headers)=MyURLopener().retrieve(url, filename, reporthook)
except Exception, e:
    print e

:) My first post on StackOverflow, have been a lurker for years. :)

Sadly dir(urllib.urlretrieve) is deficient in useful information. So from this thread thus far I tried writing this:

a,b = urllib.urlretrieve(imgURL, saveTo)
print "A:", a
print "B:", b

which produced this:

A: /home/myuser/targetfile.gif
B: Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Cache-Control: max-age=604800
Content-Type: image/gif
Date: Mon, 07 Mar 2016 23:37:34 GMT
Etag: "4e1a5d9cc0857184df682518b9b0da33"
Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
Server: ECS (hnd/057A)
Timing-Allow-Origin: *
X-Cache: HIT
Content-Length: 27027
Connection: close

I guess one can check:

if b.Content-Length > 0:

My next step is to test a scenario where the retrieve fails...

Results against another server/website - what comes back in "B" is a bit random, but one can test for certain values:

A: get_good.jpg
B: Date: Tue, 08 Mar 2016 00:44:19 GMT
Server: Apache
Last-Modified: Sat, 02 Jan 2016 09:17:21 GMT
ETag: "524cf9-18afe-528565aef9ef0"
Accept-Ranges: bytes
Content-Length: 101118
Connection: close
Content-Type: image/jpeg

A: get_bad.jpg
B: Date: Tue, 08 Mar 2016 00:44:20 GMT
Server: Apache
Content-Length: 1363
X-Frame-Options: deny
Connection: close
Content-Type: text/html

In the 'bad' case (non-existing image file) "B" retrieved a small chunk of (Googlebot?) HTML code and saved it as the target, hence Content-Length of 1363 bytes.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow