Question

I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name I save under is the last part of the page's url. So, for instance, if I find a link to www.foobar.com/foo, I download that page and save it in a file called foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the url. (All pages are from a single site.)

It works well until I encounter a non-Latin character in a url. The site uses utf-8, so when I download the original page and decode it, everything works fine. But when I try to use the decoded url to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried calling .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and comes down to my not understanding encoding issues properly, but I've been banging my head against it for a long time. I've read Joel Spolsky's introduction to encodings several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg

Here's some code. I don't get any errors, but when I try to download the page using the pagename as part of the url, I'm told that the page doesn't exist. Of course it doesn't: there's no such page as abc/x54.

To clarify: I download the html of a page that includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but the link shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and return it to the site when I need to?

import os
import urllib
import urllib2

# headers, links and fullpath are defined elsewhere (not shown)

try:
    params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
    req = urllib2.Request(url='foobar.com', data=params, headers=headers)
    f = urllib2.urlopen(req)

    encoding = f.headers.getparam('charset')

    temp = f.read().decode(encoding)

    # lots of code to parse out the links

    for line in links:
        try:
            pagename = line
            pagename = pagename.replace('\n', '')
            print pagename

            newpagename = pagename.replace(':', '_')
            newpagename = newpagename.replace('/', '_')
            final = os.path.join(fullpath, newpagename)
            print final
            final = final.encode('utf-8')
            print final

            # only download the page if it hasn't already been downloaded
            if not os.path.exists(final + ".xml"):
                print "doesn't exist"
                save = open(final + ".xml", 'w')
                save.write(f.read())
                save.close()
        except Exception:
            pass  # error handling not shown in the original post
except Exception:
    pass  # error handling not shown in the original post

Solution 2

If you have a url containing, e.g., the escape sequence '%C5' and you want the actual character \xC5 back, call urllib.unquote() on the url.
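For example, a minimal round-trip sketch (assuming Python 2, as in the question, and reusing the name from the question):

import urllib

# the percent-encoded form as it appears in the scraped link
encoded = 'Mehmet_Kenan_Dalba%C5%9Far'

# unquote() turns the %XX escapes back into raw UTF-8 bytes
raw = urllib.unquote(encoded)       # 'Mehmet_Kenan_Dalba\xc5\x9far'
print raw.decode('utf-8')           # Mehmet_Kenan_Dalbaşar

# quote() re-encodes the bytes when you need to rebuild the url
print urllib.quote(raw)             # Mehmet_Kenan_Dalba%C5%9Far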

OTHER TIPS

As you said, you can use requests instead of urllib.

Let's say you get the url "www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far". Just pass it to requests (with an http:// scheme, which requests requires) as follows:

import requests
r = requests.get("http://www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far")

Now you can get the content using r.text.
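If you then want to save the page under the decoded name, a minimal sketch along the lines of the question's naming scheme might look like this (assuming Python 2 and requests; the directory name is just a placeholder):

import os
import urllib
import requests

fullpath = 'pages'  # placeholder directory for the saved files

encoded_name = 'Mehmet_Kenan_Dalba%C5%9Far'
r = requests.get('http://www.foobar.com/' + encoded_name)

# decode the percent-escapes to get the real (UTF-8) file name
filename = urllib.unquote(encoded_name)

# r.text is unicode, so encode it explicitly when writing the file
path = os.path.join(fullpath, filename + '.xml')
save = open(path, 'w')
save.write(r.text.encode('utf-8'))
save.close()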

Licensed under: CC-BY-SA with attribution