Question

Why does this work:

a = 'a'.encode('utf-8')
print unicode(a)
>>> u'a'

But this gives me an error:

b = 'b'.encode('utf-8_sig')
print unicode(b)

Saying:
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)


Solution

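Because you haven't told unicode() which encoding to use, so it falls back to the default, 'ascii'. Pass the encoding explicitly, as in the last call here: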

>>> a = 'a'.encode('utf-8')
>>> print unicode(a)
a
>>> b = 'b'.encode('utf-8_sig')
>>> print unicode(b)

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print unicode(b)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
>>> print unicode(b, 'utf-8_sig')
b

OTHER TIPS

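The message "'ascii' codec can't decode byte 0xef" tells you two things:

  1. unicode(b) decodes with the default character encoding, sys.getdefaultencoding(), which is 'ascii' in Python 2.
  2. The byte \xef is outside the ASCII range (0-127). It is the first byte of the UTF-8 BOM (\xef\xbb\xbf) that the 'utf-8-sig' encoding prepends; this signature is common in files written on Windows.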
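
To make the two points above concrete, here is a minimal sketch. It uses u'' and b'' literals so that it should behave the same on Python 2.6+ and Python 3.3+; the names data and ascii_failed are only illustrative.

```python
import codecs

# Encode a Unicode string with the BOM-prefixing codec.
data = u'b'.encode('utf-8-sig')

# The result starts with the three-byte UTF-8 BOM: EF BB BF.
assert data.startswith(codecs.BOM_UTF8)
assert data == b'\xef\xbb\xbfb'

# Decoding as ASCII fails on the first BOM byte (0xef is above 127).
# This is the same decode that unicode(b) performs implicitly in Python 2.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

After running this, ascii_failed is True, which mirrors the traceback in the question.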

The first example works because the bytestring 'a' is pure ASCII. In Python 2, 'a'.encode('utf-8') is equivalent to 'a'.decode(sys.getdefaultencoding()).encode('utf-8'), which in this case returns 'a' unchanged, so unicode(a) can then decode it as ASCII without error.

In general, bytestring.decode(character_encoding) returns a Unicode string, and unicode_string.encode(character_encoding) returns a bytestring. A bytestring is a sequence of bytes; a Unicode string is a sequence of Unicode code points.
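
As a rough illustration of that rule, the sketch below round-trips a string through 'utf-8-sig' and also shows what happens if the bytes are decoded with plain 'utf-8' instead. The variable names are hypothetical, and the literals are chosen so the snippet should run on both Python 2.6+ and Python 3.3+.

```python
text = u'b'                              # Unicode string (code points)
raw = text.encode('utf-8-sig')           # bytestring, BOM prepended

# Decoding with the matching codec strips the BOM again.
decoded_sig = raw.decode('utf-8-sig')

# Decoding with plain 'utf-8' succeeds too, but keeps the BOM
# as a leading U+FEFF character in the result.
decoded_plain = raw.decode('utf-8')
```

Here decoded_sig equals the original text, while decoded_plain carries an extra leading u'\ufeff', so the codec used to decode should match the one used to encode.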

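Do not call .encode() on bytestrings: in Python 2 that first decodes them implicitly with the default encoding, which fails for non-ASCII bytes. In Python 2, 'a' is a bytestring literal and u'a' is a Unicode literal.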

Licensed under: CC-BY-SA with attribution