Question

I have a tree structure in which keywords may contain some Latin characters. I have a function that loops through all leaves of the tree and adds each keyword to a list under certain conditions.

Here is the code I have for adding these keywords to the list:

print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list

If the keyword in this case is université, then my output is:

Adding: université
['universit\xc3\xa9']

It appears that print properly shows the Latin character, but when I add it to the list, it gets decoded.

How can I change this? I need to be able to print the list with the standard Latin characters, not the decoded version of them.


Solution

You don't have Unicode objects; you have byte strings containing UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.

When a list is converted to a string, its contents are shown as representations, that is, as the result of the repr() function. The representation of a string object uses escape codes for any bytes outside the printable ASCII range; newlines are replaced by \n, for example. Your UTF-8 bytes are represented by \xhh escape sequences.
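
The effect can be reproduced in a few lines. This is a minimal sketch, not the asker's actual code: the names keyword and leaf_list mirror the question, and the bytes are written with escapes so the example does not depend on the terminal or source encoding. It is written to run on both Python 2 and Python 3; on Python 3 the representation additionally carries a b prefix.

```python
# Assumption: these are the UTF-8 bytes for "université", spelled with
# escapes so the example does not depend on the source or terminal encoding.
keyword = b'universit\xc3\xa9'
leaf_list = [keyword]

item_repr = repr(keyword)    # escaped, programmer-facing form of one item
list_text = str(leaf_list)   # str() of a list calls repr() on each item

print(item_repr)
print(list_text)
```

The list's string form is simply the item's repr() wrapped in brackets, which is why the escapes appear only once the keyword is inside the list.
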

If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range outside ASCII; the rest are shown with \uhhhh and \Uhhhhhhhh escapes, depending on their codepoint. When printing, Python automatically encodes such values to the correct encoding for your terminal:

>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
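
The three escape forms mentioned above can be seen side by side. This sketch uses the unicode_escape codec rather than repr(), on the assumption that it follows the same \xhh, \uhhhh and \Uhhhhhhhh conventions, because its output is the same on Python 2 and Python 3. The sample characters are arbitrary choices for illustration.

```python
# Arbitrary sample codepoints: Latin-1 range, the rest of the BMP, beyond the BMP
samples = [u'\xe9', u'\u20ac', u'\U0001f600']

# unicode_escape produces ASCII-only bytes using Python's escape syntax
escaped = [s.encode('unicode_escape') for s in samples]

for item in escaped:
    print(repr(item))
```
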

Compare this to byte strings:

>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université

Note that the length of the byte string also reflects that the é codepoint is encoded as two bytes. Incidentally, it was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, as it is configured to use UTF-8; Python detected this and decoded the bytes when I defined a u'..' Unicode object literal.
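
The relationship between the two forms can be checked directly. This is a small sketch that assumes the data is UTF-8, as in the sessions above; the variable names raw and text are illustrative only.

```python
raw = b'universit\xc3\xa9'     # byte string: the é takes two bytes in UTF-8
text = raw.decode('utf-8')     # Unicode object: the é is a single codepoint

print(len(raw))    # number of bytes
print(len(text))   # number of codepoints

# Encoding the text again gives back the original bytes
round_trip = text.encode('utf-8')
```

Decoding as early as possible and working with Unicode objects internally avoids having to reason about byte counts at all.
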

I strongly recommend reading up on how Python handles Unicode, and on the difference between Unicode text and encoded byte strings.

OTHER TIPS

When you print a list, you get the repr of the items it contains, which for strings is different from their contents:

>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'

The output of repr is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, build the output string yourself, for example by joining the items:

>>> print '[', ', '.join(a), ']'
[ foo, bär ]
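
The same idea can be wrapped in a small helper that also decodes byte strings first, so the result is Unicode text whatever mix of items the list holds. This is a sketch under assumptions: format_keywords is a hypothetical name, and the byte strings are assumed to be UTF-8 encoded.

```python
def format_keywords(items, encoding='utf-8'):
    """Return a bracketed, comma-separated string of the items.

    Hypothetical helper: byte strings are decoded with the given encoding
    (assumed UTF-8 here); items that are already Unicode are used as-is.
    """
    decoded = [
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in items
    ]
    return u'[' + u', '.join(decoded) + u']'


# Escaped literals keep the example independent of the source encoding
line = format_keywords([b'foo', b'b\xc3\xa4r'])
```

Passing line to print then shows [foo, bär] on a terminal whose encoding can represent ä.
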
Licensed under: CC-BY-SA with attribution