Unicode problem Django-Python-URLLIB-MySQL

https://stackoverflow.com/questions/1101715

12-09-2019

Question

I am fetching a webpage (http://autoweek.com) and trying to process it, but I am getting an encoding error. Autoweek declares the "iso-8859-1" encoding and contains the word "Nürburgring" (u with an umlaut).

I do:

# -*- encoding: utf-8 -*-
import urllib
webpage = urllib.urlopen(feed.crawl_url).read()
webpage.decode("utf-8")

It gives me the following error:

'utf8' codec can't decode bytes in position 7768-7773: unsupported Unicode code range

If I bypass the .decode step and do some parsing with the lxml library, it raises an error when I save the parsed title to the database:

'utf8' codec can't decode bytes in position 45-50: unsupported Unicode code range

My database has character set utf8 and collation utf8_general_ci.

My settings:
Django
Python 2.4.3
MySQL 5.0.22
MySQL-python 1.2.1
mod_python 3.2.8

Solution

autoweek.com seems confused about its own encoding. It declares conflicting charset definitions:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

and later...

<meta charset=iso-8859-1"/>.

iso-8859-1 is the correct one: it is what the web server returns in the HTTP header, it is what the .info() method reports, and the page actually decodes with it. This demonstrates that you can't necessarily rely on the Content-Type declaration inside a web page. You should follow the method described by lavinio.
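
A minimal sketch of that precedence, written for Python 3 (the question uses Python 2.4, where the syntax differs). The helper names declared_charsets and choose_charset, the regular expression, and the iso-8859-1 fallback are illustrative assumptions, not part of the original answer. The sample bytes are the two declarations quoted above.

```python
import re

# Loose pattern for "charset=..." inside <meta> tags (an assumption; real
# HTML sniffing is more involved than this).
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_charsets(html_bytes):
    """Return every charset name declared in the markup, in document order."""
    return [name.decode("ascii").lower() for name in META_CHARSET.findall(html_bytes)]


def choose_charset(header_charset, html_bytes, default="iso-8859-1"):
    """Prefer the HTTP header; fall back to the first in-page declaration."""
    if header_charset:
        return header_charset.lower()
    found = declared_charsets(html_bytes)
    return found[0] if found else default


page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
    b'<meta charset=iso-8859-1"/>'
)

conflicting = declared_charsets(page)          # both declarations are found
trusted = choose_charset("ISO-8859-1", page)   # the header wins
guessed = choose_charset(None, page)           # without a header, the wrong one is picked
```

Without the header, the first in-page declaration (utf-8) would be chosen, which is exactly the value that fails for this page.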

OTHER TIPS

If the webpage declares encoding iso-8859-1, can't you just do webpage.decode("iso-8859-1")?

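At that point, webpage is decoded into a unicode string for your app. Note that .decode() returns a new object rather than changing webpage in place, so keep the result. When it is written to the database, the database layer should handle encoding the characters to UTF-8.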
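
A small sketch of that round trip, in Python 3 syntax (in Python 2 the decoded value would be a unicode object instead of str). The byte string is a hand-made stand-in for what an ISO-8859-1 server sends for "Nürburgring", not data fetched from the site.

```python
# 0xFC is "ü" in ISO-8859-1; this stands in for the bytes read from the page.
raw = b"N\xfcrburgring"

# Decoding with the wrong codec fails: a lone 0xFC byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the declared codec gives proper text; keep the return value.
text = raw.decode("iso-8859-1")

# These are the bytes a utf8 database column ends up storing.
stored = text.encode("utf-8")
```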

To get the correct encoding, you have two options. Either tell the web server that you only accept, say, UTF-8, and then that is what you will (hopefully) always get, since just about everyone reads UTF-8 (or you could try it with ISO-8859-1). Or use .info() to inspect the encoding name of the stream that is returned.

See urllib2 - The Missing Manual and Quick reference to HTTP headers for details.
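
Both options can be sketched as follows, in Python 3 syntax. The function names fetch_text and charset_from_content_type and the iso-8859-1 fallback are my own assumptions. The request sends an Accept-Charset header, which a server is free to ignore, and then reads the charset from the response headers. In Python 2's urllib, the object returned by .info() plays the same role (as far as I know, commonly written as .info().getparam("charset")).

```python
import urllib.request
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the name lower-cased, or None if absent.
    return msg.get_content_charset() or default


def fetch_text(url, fallback="iso-8859-1"):
    """Fetch a page and decode it with the charset the server reports."""
    request = urllib.request.Request(url, headers={"Accept-Charset": "utf-8"})
    with urllib.request.urlopen(request) as response:
        # response.headers is the Python 3 counterpart of .info().
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)
```

For example, charset_from_content_type("text/html; charset=ISO-8859-1") yields "iso-8859-1", and a bare "text/html" falls back to the default.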

Licensed under: CC-BY-SA with attribution
