How to fix broken relative links in offline webpages?

https://stackoverflow.com/questions/3611961

26-09-2019

Question

I wrote a simple Python script to download a web page for offline viewing. The problem is that the relative links are broken. For example, the offline file "c:\temp\webpage.html" contains href="index.aspx", but when opened in a browser it resolves to "file:///C:/temp/index.aspx" instead of "http://myoriginalwebsite.com/index.aspx".

So I imagine I would have to modify my script to rewrite each relative link so that it points to the original website. Is there an easier way? If not, does anyone have some sample Python code that can do this? I'm a Python newbie, so any pointers will be appreciated.

Thanks.

Solution

If you just want your relative links to point to the original website, add a base tag in the head:

<base href="http://myoriginalwebsite.com/" />
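
Since the download script is already in Python, it could insert that tag itself before saving the page. Here is a minimal sketch using only the standard library. The function name add_base_tag and the sample page are made up for illustration, and the regex only looks for a plain opening <head> tag, so treat it as a starting point rather than a robust HTML rewriter.

```python
import html
import re

# Matches the opening <head> tag, with or without attributes, in any letter case.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)

def add_base_tag(page, base_url):
    """Return page with a <base> tag inserted right after the opening <head>."""
    base = '<base href="%s" />' % html.escape(base_url, quote=True)
    if HEAD_OPEN.search(page):
        # Insert the tag immediately after the first <head ...> tag only.
        return HEAD_OPEN.sub(lambda m: m.group(0) + base, page, count=1)
    # No <head> found: fall back to putting the tag at the very top.
    return base + page

page = "<html><head><title>t</title></head><body><a href='index.aspx'>home</a></body></html>"
fixed = add_base_tag(page, "http://myoriginalwebsite.com/")
```

Putting the tag first in the head means it also applies to relative stylesheet, script and image URLs, not only to links.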

OTHER TIPS

lxml makes this braindead simple!

>>> import lxml.html
>>> from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen
>>> url = 'http://www.google.com/'
>>> e = lxml.html.parse(urlopen(url))
>>> e.xpath('//a/@href')[-4:]
['/intl/en/ads/', '/services/', '/intl/en/about.html', '/intl/en/privacy.html']
>>> e.getroot().make_links_absolute()
>>> e.xpath('//a/@href')[-4:]
['http://www.google.com/intl/en/ads/', 'http://www.google.com/services/', 'http://www.google.com/intl/en/about.html', 'http://www.google.com/intl/en/privacy.html']

From there you can write the DOM out to disk as a file.

So you want to check every link: leave the ones that start with http:// alone, prepend http://myoriginalwebsite.com to any that don't, and then test the connection?

Sounds easy enough. Or is it the python code proper you're having issues with?
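
If it is the code itself, the prepend-to-relative-links idea could look roughly like the sketch below. It uses urljoin from the standard library instead of plain string concatenation, so links that are already absolute are left alone and paths like "/services/" resolve correctly. The names absolutize_links and HREF are invented for this example, the regex only handles quoted href attributes, and the connection test is left out.

```python
import re
from urllib.parse import urljoin

# Captures: the 'href=' prefix, the quote character, and the link itself.
HREF = re.compile(r"""(href\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)

def absolutize_links(page, base_url):
    """Rewrite every quoted href in page so it is absolute relative to base_url."""
    def fix(match):
        prefix, quote, link = match.groups()
        # urljoin returns links that already have a scheme unchanged.
        return prefix + quote + urljoin(base_url, link) + quote
    return HREF.sub(fix, page)

sample = '<a href="index.aspx">home</a> <a href="http://example.com/a">other</a>'
result = absolutize_links(sample, "http://myoriginalwebsite.com/")
```

The result can then be written to disk in place of the original download. Note that in-page anchors such as href="#top" will also end up pointing at the online site with this approach.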

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow