Question

I have a script, written using BeautifulSoup and urllib, that iterates through a list of URLs and downloads items of certain file types.

I iterate through a list of URLs, creating a soup object out of each and parsing for links.

The issue I'm experiencing is that I've found that sometimes links in the source are different, even though all the links I'm working through are within the same website. For example, sometimes it'll be '/dir/pdfs/file.pdf' or 'pdf/file.pdf' or '/pdfs/file.pdf'.

So, if there's a full URL, urlretrieve() knows how to handle it, but if it's just a subdirectory like those listed above, it returns an error. I can of course follow the link from the source manually, but urlretrieve() doesn't know what to do with it, so I have to add a base URL (like www.example.com/ or www.example.com/dir/) to the urlretrieve() call.
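To illustrate, this is roughly the failure I see with one of the relative links (the path here is made up):

>>> from urllib import urlretrieve
>>> urlretrieve('/dir/pdfs/file.pdf')
Traceback (most recent call last):
  ...
IOError: [Errno 2] No such file or directory: '/dir/pdfs/file.pdf'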

What I'm having trouble writing is logic so that, if a download fails, the script tries different base URLs until one works and prints the resulting URL, and if none of them work, prints an error message with the file in question so I can grab it manually.

Could someone point me in the right direction?

from re import compile
from urllib import urlopen, urlretrieve
from bs4 import BeautifulSoup   # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

URLs = []
BASEURL = []
FILETYPE = [r'\.pdf$', r'\.ppt$', r'\.pptx$', r'\.doc$',
            r'\.docx$', r'\.xls$', r'\.xlsx$', r'\.wmv$']

def main():
    for link in soup.findAll(href=compile(types)):
        file = link.get('href')
        filename = file.split('/')[-1]

        urlretrieve(file, filename)
        print file

if __name__ == "__main__":
    for url in URLs:
        html_data = urlopen(url)
        soup = BeautifulSoup(html_data)

        for types in FILETYPE:
            main()

Solution

A better option would be to build the correct absolute URLs to start with:

import urlparse  # compile, urlopen, urlretrieve and BeautifulSoup are imported as in the question

def main(soup, domain, path, types):
    for link in soup.findAll(href=compile(types)):
        file = link.get('href')

        # Make the file URL absolute here
        if '://' not in file and not file.startswith('//'):
            if not file.startswith('/'):
                file = urlparse.urljoin(path, file)
            file = urlparse.urljoin(domain, file)

        try:
            urlretrieve(file)  # pass a second argument to choose the local filename
        except IOError:
            print 'Error retrieving %s using URL %s' % (
                link.get('href'), file)

for url in URLs:
    html_data = urlopen(url)
    soup = BeautifulSoup(html_data)

    urlinfo = urlparse.urlparse(url)
    domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
    path = urlinfo.path.rsplit('/', 1)[0] + '/'  # keep the trailing slash so urljoin treats it as a directory

    for types in FILETYPE:
        main(soup, domain, path, types)

The urlparse function splits the source URL into two pieces: domain holds the URI scheme and the domain name, and path holds the "directory" of the source page on the server (kept with a trailing slash so urljoin treats it as a directory). For example:

>>> url = "http://www.example.com/some/web/page.html"
>>> urlinfo = urlparse.urlparse(url)
>>> urlinfo
ParseResult(scheme='http', netloc='www.example.com',
            path='/some/web/page.html', params='', query='', fragment='')
>>> domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
>>> domain
'http://www.example.com'
>>> path = urlinfo.path.rsplit('/', 1)[0] + '/'
>>> path
'/some/web/'

Then domain and path are used as the base for each href encountered (a short demonstration follows the list):

  • if the href contains "://" or starts with "//", assume it is absolute: no modification needed,
  • else if the href starts with "/", it is relative to the domain: prepend the domain,
  • otherwise the href is relative to the path: prepend the domain and the base path.
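For instance, continuing the interactive session above with a couple of illustrative hrefs (these particular paths are not from the question):

>>> urlparse.urljoin(domain, '/dir/pdfs/file.pdf')
'http://www.example.com/dir/pdfs/file.pdf'
>>> urlparse.urljoin(domain, urlparse.urljoin(path, 'pdf/file.pdf'))
'http://www.example.com/some/web/pdf/file.pdf'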

OTHER TIPS

Assuming the download method downloads the file and returns True if it was downloaded successfully, or False if it failed, this goes through all of the possible file paths given by urls and files.

def download(url, file):
    print url + file
    # Pretend the download failed and return False, so that for this demo
    # the loop below walks through every base URL for every file.
    return False

def main():
    urls = ["example.com/", "example.com/docs/", "example.com/dir/docs/", "example.com/dir/docs/files/"]

    files = ["file1.pdf", "file2.pdf", "file3.pdf"]

    for file in files:
        for url in urls:
            success = download(url, file)
            if success:
                break


main()
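For reference, a real download() could look something like the sketch below; the use of urllib2 and the local filename handling are assumptions rather than part of the answer above, and the base URLs would need a scheme ("http://...") for it to work.

import urllib2

def download(url, file):
    # Sketch: fetch url + file and save it under the file's own name.
    # urllib2 raises HTTPError/URLError on failure (unlike urllib.urlretrieve,
    # which may quietly save an HTTP error page instead).
    try:
        data = urllib2.urlopen(url + file).read()
    except (urllib2.HTTPError, urllib2.URLError, ValueError):
        print 'Failed to fetch %s%s' % (url, file)
        return False
    with open(file.split('/')[-1], 'wb') as f:
        f.write(data)
    return True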

You need to catch the exception and try the next base url. That said, you can also attempt to make the links absolute before issuing requests. I believe that is the best approach since it avoids making lots of unnecessary requests. lxml has a handy make_links_absolute() function for this purpose.

Also check out urlparse.urljoin for this. Continuing with the approach you are already using...

html_data = urlopen(url)
soup = BeautifulSoup(html_data)
for link in soup.findAll(href=compile(types)):
    file = link.get('href')
    for domain in (url, 'http://www.one.com', 'http://www.two.com'):
        path = urlparse.urljoin(domain, file)
        try:
            urlretrieve(path)
            break  # stop trying new domains
        except IOError:
            print 'Error downloading {0}'.format(path)
            # will go to the next domain

If I were doing this with lxml it would be something like:

import lxml.html

req = urlopen(url)
html = req.read()
root = lxml.html.fromstring(html)
root.make_links_absolute(url)  # resolve every link against the page URL
for a in root.iterlinks():     # yields (element, attribute, link, pos) tuples
    if a[2].endswith('.pdf'):
        # download link ending with .pdf
        req = urlretrieve(a[2])
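If you need the other extensions from the question's FILETYPE list as well, a possible variant of that loop (only a sketch; it reuses root from above and assumes urlretrieve is imported as in the question) is:

import re

# Single pattern covering the extensions from the question's FILETYPE list
wanted = re.compile(r'\.(pdf|pptx?|docx?|xlsx?|wmv)$')

for element, attribute, link, pos in root.iterlinks():
    if attribute == 'href' and wanted.search(link):
        urlretrieve(link, link.split('/')[-1])  # save under the file's own name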
Licensed under: CC-BY-SA with attribution