Question

A co-worker asked me this earlier today and I couldn't figure out a reasonable answer. Stack Overflow seems to have a few close answers, but I couldn't find one that addresses this specific question.

If I run this with a Python 2.7.x interpreter on 64-bit Ubuntu 12.04, I get:

>>> u'\u1234'.decode('utf-8')
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in position 0: ordinal not in range(128)

The Python docs on the subject indicate that Python represents Unicode strings internally as 16- or 32-bit integers. When decoding this with UTF-8, does Python attempt to read those ints as it would 8-bit chars encoded with UTF-8? If so, why is the error a UnicodeEncodeError and not a UnicodeDecodeError?
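(An aside, from my own poking around rather than the docs page itself: you can check which internal representation a particular CPython 2.7 build uses by looking at sys.maxunicode.)

import sys
# 65535 means a narrow build (16-bit code units); 1114111 means a wide (32-bit, UCS-4) build
print(sys.maxunicode)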

I'd love to have a better understanding of this. What are the steps taken when decode is called on a Unicode string? And what does it even mean to decode, with UTF-8, a string that was already decoded from its UTF-8 encoding?


Solution

This is a Python 2 wart. Calling decode on a Unicode string first encodes it to bytes using the default encoding, and only then re-decodes the result with the encoding you asked for. Since the default encoding is usually ASCII, that implicit encode step fails for any non-ASCII character, which is why you get a UnicodeEncodeError rather than a UnicodeDecodeError.
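Roughly, assuming CPython 2.7 with the usual ASCII default, the two steps look like this (a sketch of what happens, not the exact codec-machinery code):

import sys

s = u'\u1234'

# Step 1 (implicit): encode the unicode object with the default codec.
# With the 'ascii' default this is the line that raises UnicodeEncodeError,
# exactly as in the traceback above.
as_bytes = s.encode(sys.getdefaultencoding())

# Step 2 (never reached here): decode those bytes as UTF-8.
result = as_bytes.decode('utf-8')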

OTHER TIPS

In Python 2, decode expects a byte string in the specified encoding, so calling it on a Unicode string first converts that string to bytes using the default encoding. In your case the default encoding is ASCII, and there is no way to represent u'\u1234' in ASCII, so that conversion fails.
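For contrast, here is the round trip that does make sense, where decode is only ever applied to bytes (an illustrative snippet, not from the original answer):

u = u'\u1234'
b = u.encode('utf-8')     # unicode -> bytes: '\xe1\x88\xb4'
u2 = b.decode('utf-8')    # bytes -> unicode: u'\u1234' again
assert u == u2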

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow