Question

The following code examines the behaviour of the built-in float() function when fed a non-ASCII symbol:

import sys

try:
  float(u'\xbd')
except ValueError as e:
  print sys.getdefaultencoding() # in my system, this is 'ascii'
  print e[0].decode('latin-1') # u'invalid literal for float(): ' followed by the 1/2 (one half) character
  print unicode(e[0]) # raises "UnicodeDecodeError: 'ascii' codec can't decode byte 0xbd in position 29: ordinal not in range(128)"

My question: why is the error message e[0] encoded in Latin-1? The default encoding is ASCII, and this seems to be what unicode() expects.

Platform is Ubuntu 9.04, Python 2.6.2

Solution

e[0] isn't encoded with latin-1; it just so happens that the byte \xbd, when decoded as latin-1, is the character U+00BD.
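
You can see that coincidence directly in the interpreter (a quick check of my own, on Python 2):

>>> '\xbd'.decode('latin-1')
u'\xbd'
>>> import unicodedata
>>> unicodedata.name(u'\xbd')
'VULGAR FRACTION ONE HALF'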

The conversion occurs in Objects/floatobject.c.

First, the unicode string must be converted to a byte string. This is performed using PyUnicode_EncodeDecimal():

if (PyUnicode_EncodeDecimal(PyUnicode_AS_UNICODE(v),
                            PyUnicode_GET_SIZE(v),
                            s_buffer,
                            NULL))
        return NULL;

which is implemented in unicodeobject.c. It doesn't perform any sort of character set conversion; it just writes out bytes whose values equal the unicode ordinals of the string. In this case, U+00BD -> 0xBD.
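
For illustration, here is a rough Python-level sketch of that behaviour (my own approximation, with a hypothetical helper name; the real C function also supports an error callback):

import unicodedata

def encode_decimal_sketch(u):
    # Hypothetical helper: a rough Python rendering of what
    # PyUnicode_EncodeDecimal does, not the real implementation.
    out = []
    for i, ch in enumerate(u):
        if ch.isspace():
            out.append(' ')                           # any unicode whitespace -> ' '
        elif unicodedata.decimal(ch, -1) >= 0:
            out.append(str(unicodedata.decimal(ch)))  # decimal digits -> ASCII digits
        elif 0 < ord(ch) < 256:
            out.append(chr(ord(ch)))                  # code point copied byte-for-byte
        else:
            raise UnicodeEncodeError('decimal', u, i, i + 1,
                                     'invalid decimal Unicode string')
    return ''.join(out)

print repr(encode_decimal_sketch(u'\xbd'))  # '\xbd' -- no charset conversion at all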

The statement formatting the error is:

PyOS_snprintf(buffer, sizeof(buffer),
              "invalid literal for float(): %.200s", s);

where s contains the byte string created earlier. PyOS_snprintf() writes a byte string, and s is a byte string, so it just includes it directly.
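
You can confirm that the raw byte ends up in the message by looking at its repr (my own check, on Python 2.6):

>>> try:
...     float(u'\xbd')
... except ValueError as e:
...     print repr(e[0])
...
'invalid literal for float(): \xbd'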

OTHER TIPS

Very good question!

I took the liberty of digging into Python's source code, which is a mere command away on a properly set up Linux distribution (apt-get source python2.5).

Damn, John Millikin beat me to it. That's right, PyUnicode_EncodeDecimal is the answer; it does this here:

/* (Loop ch in the unicode string) */
    if (Py_UNICODE_ISSPACE(ch)) {
        *output++ = ' ';
        ++p;
        continue;
    }
    decimal = Py_UNICODE_TODECIMAL(ch);
    if (decimal >= 0) {
        *output++ = '0' + decimal;
        ++p;
        continue;
    }
    if (0 < ch && ch < 256) {
        *output++ = (char)ch;
        ++p;
        continue;
    }
    /* All other characters are considered unencodable */
    collstart = p;
    collend = p+1;
    while (collend < end) {
        if ((0 < *collend && *collend < 256) ||
            !Py_UNICODE_ISSPACE(*collend) ||
            Py_UNICODE_TODECIMAL(*collend))
            break;
    }

See, it leaves all unicode code points < 256 in place; those are exactly the latin-1 characters, since Unicode's first 256 code points match latin-1 for backward compatibility.
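
Incidentally, the Py_UNICODE_TODECIMAL branch above is also why float() happily accepts decimal digits from other scripts; for instance (my own check, assuming Python 2.x):

>>> float(u'\u0661\u0662')  # ARABIC-INDIC DIGIT ONE, ARABIC-INDIC DIGIT TWO
12.0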


Addendum

With this in place, you can verify it by trying other non-latin-1 characters; float() will throw a different exception:

>>> float(u"ħ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character u'\u0127' in position 0: invalid decimal Unicode string

The ASCII encoding only includes the bytes with values <= 127. The range of characters represented by these bytes is identical in most encodings; in other words, "A" is chr(65) in ASCII, in latin-1, in UTF-8, and so on.

The one half symbol, however, is not part of the ASCII character set, so when Python tries to treat that character, or the corresponding byte 0xBD, as ASCII, it can do nothing but fail.

Update: Here's what happens (I assume we're talking CPython):

float(u'\xbd') leads to PyFloat_FromString in floatobject.c being called. This function, given a unicode object, in turn calls PyUnicode_EncodeDecimal in unicodeobject.c. From skimming over the code, I gather that this function turns the unicode object into a byte string by replacing every character with a unicode code point < 256 with the byte of that value; i.e. the one half character, having the code point 189 (0xBD), is turned into chr(189).

Then, PyFloat_FromString does its work as usual. At this point it's working with a regular byte string that happens to contain a byte outside the ASCII range. It doesn't care about that; it just finds a byte that isn't a digit, a period or the like, so it raises the ValueError.

The argument to this exception is a string

"invalid literal for float(): " + evil_string

That's fine; an exception message is, after all, a string. It's only when you try to decode this string, using the default encoding ASCII, that this turns into a problem.
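
Concretely, that decode step fails the same way as calling unicode() on the raw byte by itself (my own check):

>>> unicode('\xbd')  # equivalent to '\xbd'.decode(sys.getdefaultencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbd in position 0: ordinal not in range(128)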

From experimenting with your code snippet, it would seem I get the same behavior on my platform (Python 2.6 on OS X 10.5).

Since you established that e[0] is encoded with latin-1, the correct way to convert it to unicode is e[0].decode('latin-1'), not unicode(e[0]).

Update: So it sounds like e[0] does not have a well-defined encoding. Definitely not latin-1. Because of that, as mentioned elsewhere in the comments, you'll have to call repr(e[0]) if you need to display this error message without causing a cascading exception.
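
For example, a minimal sketch of displaying it defensively (my own suggestion, not from the original thread): repr() never raises, and decoding with the 'replace' error handler substitutes the undecodable byte instead of blowing up:

try:
    float(u'\xbd')
except ValueError as e:
    msg = e[0]
    print repr(msg)                        # 'invalid literal for float(): \xbd'
    safe = msg.decode('ascii', 'replace')  # undecodable bytes become U+FFFD
    print repr(safe)                       # u'invalid literal for float(): \ufffd'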

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow