Question

I'm trying to understand how python 2.5 deals with unicode strings. Although by now I think I have a good grasp of how I'm supposed to handle them in code, I don't fully understand what's going on behind the scenes, particularly when you type strings at the interpreter's prompt.

So python pre 3.0 has two types for strings, namely: str (byte strings) and unicode, which are both derived from basestring. The default type for strings is str.

str objects have no notion of their actual encoding; they are just bytes. Either you've encoded a unicode string yourself and therefore know what encoding the bytes are in, or you've read a stream of bytes whose encoding you (ideally) know beforehand. You can guess the encoding of a byte string whose encoding is unknown to you, but there just isn't a reliable way of figuring it out. Your best bet is to decode early, use unicode everywhere in your code, and encode late.
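As a minimal sketch of that "decode early, encode late" pattern: the function name shout and the sample input are made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
def shout(raw, in_encoding, out_encoding):
    """Decode early, work on unicode text, encode late."""
    text = raw.decode(in_encoding)      # bytes -> unicode at the input boundary
    text = text.upper()                 # text operations see characters, not bytes
    return text.encode(out_encoding)    # unicode -> bytes at the output boundary


# Stand-in for bytes read from a cp437 source (hypothetical input).
raw = u"ca\u00f1a".encode("cp437")
result = shout(raw, "cp437", "utf-8")
print(repr(result))
```

Upper-casing is done on the unicode string because a byte string only knows how to upper-case its ASCII letters; the ñ would be left untouched.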

That's fine. But are strings typed into the interpreter encoded for me behind my back? Assuming my understanding of strings in Python is correct, what method or setting does Python use to make this decision?

The source of my confusion is the differing results I get when I try the same thing on my system's python installation, and on my editor's embedded python console.

 # Editor (Sublime Text)
 >>> s = "La caña de España"
 >>> s
 'La ca\xc3\xb1a de Espa\xc3\xb1a'
 >>> s.decode("utf-8")
 u'La ca\xf1a de Espa\xf1a'
 >>> sys.getdefaultencoding()
 'ascii'

 # Windows python interpreter
 >>> s= "La caña de España"
 >>> s
 'La ca\xa4a de Espa\xa4a'
 >>> s.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 5: unexpected code byte
 >>> sys.getdefaultencoding()
 'ascii'

Solution

Let me expand on Ignacio's reply: in both cases there is an extra layer between you and Python. In one case it is Sublime Text and in the other it's cmd.exe. The difference in behaviour you see is not due to Python but to the different encodings used by Sublime Text (utf-8, it seems) and cmd.exe (cp437).

So, when you type ñ, Sublime Text sends '\xc3\xb1' to Python, whereas cmd.exe sends '\xa4'. [I'm simplifying here, omitting details that are not relevant to the question.]
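You can check those byte values without either console by encoding the character yourself. This is only a sketch; the variable names are mine, and the character is written as an escape so the snippet does not depend on how your editor saves the file.

```python
n_tilde = u"\u00f1"  # LATIN SMALL LETTER N WITH TILDE, written as an escape

utf8_bytes = n_tilde.encode("utf-8")    # two bytes under UTF-8
cp437_bytes = n_tilde.encode("cp437")   # a single byte under cp437

print(repr(utf8_bytes))
print(repr(cp437_bytes))
```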

Still, Python knows which encoding the terminal uses and exposes it as sys.stdin.encoding. From cmd.exe you'll probably get something like:

>>> import sys
>>> sys.stdin.encoding
'cp437'

whereas within Sublime Text you'll get something like:

>>> import sys
>>> sys.stdin.encoding
'utf-8'
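If you ever receive such bytes in a program, one way to use that attribute is sketched below. The helper names console_encoding and decode_typed_bytes are hypothetical, and the "ascii" fallback is my assumption for the case where stdin reports no encoding (for instance when input is piped).

```python
import sys


def console_encoding(default="ascii"):
    """Return the encoding reported for stdin, or a fallback if there is none."""
    # sys.stdin.encoding can be None (or stdin itself can be missing) when
    # input is not attached to a terminal, hence the fallback.
    return getattr(sys.stdin, "encoding", None) or default


def decode_typed_bytes(raw, encoding=None):
    """Decode bytes that came from the console into a unicode string."""
    return raw.decode(encoding or console_encoding())


# Simulate what cmd.exe would deliver for the question's string.
raw = u"La ca\u00f1a de Espa\u00f1a".encode("cp437")
text = decode_typed_bytes(raw, "cp437")
print(repr(text))
```

Passing the encoding explicitly, as in the last lines, keeps the example independent of the machine it runs on.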

OTHER TIPS

The interpreter uses your command prompt's native encoding for text entry. In your case it's CP437:

>>> print '\xa4'.decode('cp437')
ñ

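You're getting confused because the editor's console and the command prompt use different encodings. The Python interpreter takes whatever bytes the console hands it: under cmd.exe that is the console code page (in this case, cp437), while your editor uses utf-8. Note that sys.getdefaultencoding() is 'ascii' in both cases and plays no part here.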
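The mismatch can be reproduced on any machine by building both byte strings by hand. This is a rough sketch with made-up variable names. It shows the two ways a wrong guess goes bad: a loud error in one direction, silent garbage in the other.

```python
typed_in_cmd = u"Espa\u00f1a".encode("cp437")     # what a cp437 console sends
typed_in_editor = u"Espa\u00f1a".encode("utf-8")  # what a UTF-8 console sends

# The cp437 bytes are not valid UTF-8 here, so this fails loudly,
# just like the traceback in the question.
try:
    typed_in_cmd.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Every byte means something in cp437, so this "works",
# but the two UTF-8 bytes of the n-tilde become two unrelated characters.
garbled = typed_in_editor.decode("cp437")

print(failed)
print(len(garbled))
```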

Note that the difference disappears if you enter a unicode literal, because the interactive interpreter decodes it using the console's encoding (sys.stdin.encoding):

# Windows python interpreter
>>> s = "La caña de España"
>>> s
'La ca\xa4a de Espa\xa4a'
>>> s = u"La caña de España"
>>> s
u'La ca\xf1a de Espa\xf1a'

The moral of the story? Encodings are tricky. Be sure you know what encoding your source files are in, or play it safe by always using the escaped version of special characters.
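For instance, a sketch of the "escaped version" approach, with variable names of my own choosing. The source below is pure ASCII, so it should behave the same whatever encoding the file or console uses.

```python
# Two equivalent ways to spell the n-tilde without typing it directly.
escaped = u"La ca\u00f1a de Espa\u00f1a"
named = u"La ca\N{LATIN SMALL LETTER N WITH TILDE}a de Espa\N{LATIN SMALL LETTER N WITH TILDE}a"

same = escaped == named
print(same)

# Encode only when the text leaves the program, choosing the encoding explicitly.
print(repr(escaped.encode("utf-8")))
```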

Licensed under: CC-BY-SA with attribution