A better option would be to build the correct absolute URLs to start with:
def main(soup, domain, path, types):
    for link in soup.findAll(href=compile(types)):
        file = link.get('href')
        # Make the file URL absolute here
        if '://' not in file and not file.startswith('//'):
            if not file.startswith('/'):
                # Relative to the page's directory; the trailing slash makes
                # urljoin keep the last component of path
                file = urlparse.urljoin(path + '/', file)
            file = urlparse.urljoin(domain, file)
        try:
            urlretrieve(file)
        except:
            print 'Error retrieving %s using URL %s' % (
                link.get('href'), file)

for url in URLs:
    html_data = urlopen(url)
    soup = BeautifulSoup(html_data)
    urlinfo = urlparse.urlparse(url)
    domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
    path = urlinfo.path.rsplit('/', 1)[0]
    for types in FILETYPE:
        main(soup, domain, path, types)
The urlparse function is used to split the source URL into two segments: domain contains the URI scheme and the domain name, and path contains the "directory" of the target file on the server. For example:
>>> url = "http://www.example.com/some/web/page.html"
>>> urlinfo = urlparse.urlparse(url)
>>> urlinfo
ParseResult(scheme='http', netloc='www.example.com', path='/some/web/page.html', params='', query='', fragment='')
>>> domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
>>> domain
'http://www.example.com'
>>> path = urlinfo.path.rsplit('/', 1)[0]
>>> path
'/some/web'
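Note that path does not end with a slash. urljoin only treats its base as a directory when it does, which is why the code above joins against path + '/' to keep the last path component. A quick check in the interpreter (the file name is made up for illustration):

>>> urlparse.urljoin('/some/web', 'pic.jpg')
'/some/pic.jpg'
>>> urlparse.urljoin('/some/web' + '/', 'pic.jpg')
'/some/web/pic.jpg'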
Then domain and path are used as the base for each href encountered:
- if the href contains "://" or starts with "//", assume it is already absolute: no modification is needed,
- else if the href starts with "/", it is relative to the domain: prepend the domain,
- otherwise the href is relative to the path: prepend the domain and the base path.
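To make the three cases concrete, here is a small interactive sketch reusing the domain and path values from the example above; the hrefs themselves are made up purely for illustration:

>>> import urlparse
>>> domain = 'http://www.example.com'
>>> path = '/some/web'
>>> # contains '://' (or starts with '//'): already absolute, left untouched
>>> 'http://cdn.example.org/a.pdf'
'http://cdn.example.org/a.pdf'
>>> # starts with '/': relative to the domain
>>> urlparse.urljoin(domain, '/files/b.pdf')
'http://www.example.com/files/b.pdf'
>>> # plain relative href: relative to the page's directory
>>> urlparse.urljoin(domain, urlparse.urljoin(path + '/', 'images/c.jpg'))
'http://www.example.com/some/web/images/c.jpg'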