Question
Does anyone know of a library for fixing "broken" urls. When I try to open a url such as
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
urllib2.urlopen chokes and gives me an HTTPError traceback. Does anyone know of a library that can fix these sorts of things?
Solution
See also this question.
OTHER TIPS
What about something like...:
import re
import urlparse
urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()
def main():
for u in urls:
pieces = list(urlparse.urlparse(u))
pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
pieces[-1] = ''
print urlparse.urlunparse(pieces)
main()
it does emit, as you desire:
http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html
and would appear to roughly match your needs, if I understood them correctly.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow