Question

Due to a bug in a C extension, I'm getting unicode data in str instances, or in other words, a str with no encoding applied at all that holds the contents of a unicode literal.

So, for instance, this is a valid unicode literal:

>>> u'\xa1Se educado!'

And the UTF-8 encoded str would be:

>>> '\xc2\xa1Se educado!'

However, I get a str with the contents of the unicode literal:

>>> '\xa1Se educado!'

And I need to create a unicode instance from that. Using unicode() doesn't work, since it expects an encoding. I figured that ''.join(unichr(ord(x)) for x in s) does what I need, but it's really ugly. There has to be a better solution. Any ideas?


Solution

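As I suspected, there is a way to decode it with a codec that maps each byte to the code point of the same number, and that's raw_unicode_escape. Note that this codec also interprets any literal \uXXXX sequences in the input as escapes.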

>>> unicode('\xa1Se educado!', 'raw_unicode_escape')
u'\xa1Se educado!'
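A small sketch of how raw_unicode_escape compares with a plain ISO-8859-1 decode. It is written with b'' and u'' literals so it should run on both Python 2.6+ and Python 3.3+; the name `tricky` is a made-up input for illustration, not something from the question.

```python
# b'' is a byte string and u'' is a Unicode string on Python 2.6+ and 3.3+.
raw = b'\xa1Se educado!'

# For plain high bytes, both codecs give the same result.
via_escape = raw.decode('raw_unicode_escape')
via_latin1 = raw.decode('iso-8859-1')

# Hypothetical input: six bytes spelling a literal backslash-u sequence.
tricky = b'\\u00a1'
tricky_escape = tricky.decode('raw_unicode_escape')  # escape is interpreted
tricky_latin1 = tricky.decode('iso-8859-1')          # bytes kept one-to-one

print(via_escape == via_latin1)
print(len(tricky_escape), len(tricky_latin1))
```

If the data can contain backslashes followed by u, the two codecs disagree, so the ISO-8859-1 decode is probably the safer choice for a strict byte-to-code-point mapping.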

OTHER TIPS

I get an str with the unicode literal: '\xa1Se educado!'

Not really, \xa1 is not a Unicode-specific escape. \xa1 in a byte string means byte number 161 and \xa1 in a Unicode string means character (code point) number 161—same as \u00A1.
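To make the distinction concrete, here is a short sketch (variable names are illustrative only) that inspects the same escape in a byte string and in a Unicode string. It uses bytearray so that indexing yields an integer on both Python 2 and Python 3.

```python
byte_string = b'\xa1Se educado!'   # 12 bytes
text = u'\xa1Se educado!'          # 12 code points

# In the byte string, \xa1 is the byte with value 161.
first_byte = bytearray(byte_string)[0]

# In the Unicode string, \xa1 is the code point 161, same as \u00a1.
first_code_point = ord(text[0])
same_character = (u'\xa1' == u'\u00a1')

# The UTF-8 encoding of that code point takes two bytes, 0xC2 0xA1.
utf8 = text.encode('utf-8')

print(first_byte, first_code_point, same_character)
print(len(text), len(utf8))
```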

What you have is a byte string containing an ISO-8859-1 encoding of ¡Se educado! instead of the UTF-8 encoding. In the ISO-8859-1 encoding each byte number happens to match the Unicode character of the same code point number. To decode an ISO-8859-1 byte string to a Unicode string use:

>>> '\xa1Se educado!'.decode('iso-8859-1')
u'\xa1Se educado!'

although actually if you are using Windows then the encoding is likely to be code page 1252 ('windows-1252') rather than ISO-8859-1. They're similar encodings but not quite the same. Code page 1252 is the default ‘ANSI’ code page that Windows uses for non-Unicode applications in the Western European and US locales. If you are getting this data from a Windows non-Unicode application running on the same machine, you should decode it using the encoding 'mbcs' which corresponds to whatever the locale-specific default code page is.
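A quick sketch of where the two encodings differ, using a made-up byte string. Bytes 0x80 to 0x9F are control characters in ISO-8859-1 but mostly printable characters in code page 1252. The 'mbcs' codec is only available on Windows, so it is mentioned in a comment rather than called.

```python
# Hypothetical data: byte 0x80 is where the two encodings disagree,
# byte 0xA1 is where they agree.
data = b'\x80 and \xa1'

as_latin1 = data.decode('iso-8859-1')
as_cp1252 = data.decode('windows-1252')

# ISO-8859-1 maps every byte to the code point of the same number,
# which is what the unichr(ord(x)) loop in the question does by hand.
one_to_one = [ord(c) for c in as_latin1] == list(bytearray(data))

# Code page 1252 maps 0x80 to the euro sign U+20AC instead.
euro = as_cp1252[0]

# On Windows only, data.decode('mbcs') would use the locale's ANSI code page.
print(one_to_one, repr(as_latin1[0]), repr(euro))
```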

These are legacy encodings that cannot hold all Unicode characters. You will probably find the C extension cannot cope with characters outside of the current code page set at all.
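The limitation is easy to check from Python. The sketch below uses a hypothetical helper, `fits`, that reports whether a Unicode string can be encoded with a given codec; the sample characters are arbitrary picks for illustration.

```python
def fits(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

euro = u'\u20ac'    # euro sign
kanji = u'\u65e5'   # a CJK character

euro_in_latin1 = fits(euro, 'iso-8859-1')     # not in ISO-8859-1
euro_in_cp1252 = fits(euro, 'windows-1252')   # present in code page 1252
kanji_in_cp1252 = fits(kanji, 'windows-1252') # not in code page 1252
kanji_in_utf8 = fits(kanji, 'utf-8')          # UTF-8 covers all of Unicode

print(euro_in_latin1, euro_in_cp1252, kanji_in_cp1252, kanji_in_utf8)
```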

Licensed under: CC-BY-SA with attribution