python ascii codes to utf

https://stackoverflow.com/questions/10172431

31-05-2021

Question

When I post a name or text in my native language to mod_python, I get:

&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;

And I also get:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

When I use:

hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))

How can I decode it?

Solution

It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of

Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)

In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character¹. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. That is fine for plain English, but it doesn't work for most other languages. There's a global list of code points (well over a hundred thousand are assigned so far) which is meant to hold every possible character, but clearly they don't fit in a byte.

The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.

To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
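As a minimal sketch of that round trip, here is the same idea in Python 3 (an assumption on my part, since the question uses Python 2), where str holds code points and bytes holds the encoded form. The sample text is the word from the question.

```python
# Python 3: str is a sequence of code points, bytes is encoded data.
text = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

data = text.encode("utf-8")      # code points -> bytes, encoding named explicitly
restored = data.decode("utf-8")  # bytes -> code points, using the same encoding

print(len(text), len(data))      # number of code points vs. number of bytes
print(restored == text)          # the round trip gives back the original string
```

The two lengths differ because each of these Cyrillic letters takes two bytes in UTF-8.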

Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.

¹Sort of.

EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.

Anyway, if you step through what you're doing in the shell you'll see

>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'

I'm using Python 2.7 here, so that's a Unicode string, i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like this:

>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'

But we could also pick a different encoding!

>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'

You'll need to decide what encoding you want to use.
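The choice matters because whoever reads the bytes has to decode them with the same encoding. A small Python 3 sketch (again an assumption, the session above is Python 2) of what happens when the two sides disagree:

```python
text = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")    # includes a 2-byte byte order mark
print(len(utf8), len(utf16))     # same code points, different byte sequences

# Decoding with the matching codec restores the text.
print(utf16.decode("utf-16") == text)

# Decoding with the wrong codec may not raise at all: latin-1 accepts
# any byte, so the result is silently wrong text rather than an error.
wrong = utf8.decode("latin-1")
print(wrong == text)
```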

What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try

>>> text.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

you just get an error, because you can't encode those code points in ASCII.
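To make the "first 128" limit concrete, here is a rough Python 3 sketch (not the Python 2 session above). The helpers fits_in_ascii and try_ascii are hypothetical names made up for illustration.

```python
text = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

def fits_in_ascii(s):
    """Hypothetical helper: True if every code point is below 128."""
    return all(ord(ch) < 128 for ch in s)

def try_ascii(s):
    """Return the ASCII bytes, or None if the string cannot be encoded."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError:
        return None

print(fits_in_ascii("Macedonia"), try_ascii("Macedonia"))
print(fits_in_ascii(text), try_ascii(text))

# An error handler can be passed instead of letting the exception escape.
# "xmlcharrefreplace" swaps each unencodable character for a numeric reference.
escaped = text.encode("ascii", "xmlcharrefreplace")
print(escaped)
```

The last line prints the same &#1084;... form that appears in the question, which is what non-ASCII text looks like once it has been squeezed into ASCII-only HTML.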

When you do req.write, you are trying to write a list of code points down the connection. But HTTP doesn't understand code points: it just carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.

So you need to do req.write(hparser.unescape(text).encode("some-encoding")), where "some-encoding" is whichever encoding you decided on, for example "utf-8".
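For comparison, a sketch of the same fix in Python 3, where html.unescape from the standard library replaces HTMLParser.unescape. The function write_unescaped is a hypothetical name, and io.BytesIO only stands in for the mod_python request object so the example can run on its own; UTF-8 is an assumed choice of encoding.

```python
import html
import io

def write_unescaped(out, text, encoding="utf-8"):
    """Hypothetical helper: unescape HTML references, then write bytes."""
    decoded = html.unescape(text)         # "&#1084;..." -> str of code points
    out.write(decoded.encode(encoding))   # str -> bytes before writing

# io.BytesIO stands in for the request object (an assumption for this sketch).
buf = io.BytesIO()
write_unescaped(buf, "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;")
print(buf.getvalue())
```

Whatever encoding you pass here should also be the one the response declares to the browser, otherwise the reader will decode the bytes with a different codec than the one used to write them.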

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow