Question

I was doing some work today and came across an issue where something "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.

So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:

D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'

It seems IronPython isn't a fan either:

D:\> ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'

If somebody could tell me what, exactly, is going on here, it'd be much appreciated.

Solution

This is in fact the correct behaviour. The character u'\u3030', encoded as UTF-16-LE, produces exactly the same two bytes as the ASCII (and UTF-8) encoding of the string '00'. It looks strange, but it's correct.
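A minimal check, assuming the same Python 2 interpreter as in the question, makes the coincidence visible at the byte level:

>>> import binascii
>>> binascii.hexlify(u"\u3030".encode("utf-16-le"))
'3030'
>>> binascii.hexlify("00")
'3030'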

The '\xff\xfe' you can see is just a byte order mark (BOM), which the plain 'utf-16' codec prepends because you didn't specify a byte order.
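You can confirm this against the codecs module's BOM constants (present in Python 2.6); the plain utf-16 codec picks the platform's native byte order, and the output below matches the little-endian session in the question:

>>> import codecs
>>> codecs.BOM_UTF16_LE
'\xff\xfe'
>>> u"\u3030".encode("utf-16")      # byte order unspecified, so a BOM is prepended
'\xff\xfe00'
>>> u"\u3030".encode("utf-16-le")   # byte order explicit, no BOM
'00'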

Are you sure you want a wavy dash and not some other character? If you were hoping for a different character, it may already have been mis-encoded before it entered your application.
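For what it's worth, here is a hypothetical illustration (the character chosen is arbitrary) of how a value can get mangled upstream by a wrong decode step:

>>> import unicodedata as ud
>>> raw = u"\u00e9".encode("utf-8")   # UTF-8 bytes for e-acute: '\xc3\xa9'
>>> mangled = raw.decode("latin-1")   # decoded with the wrong codec somewhere upstream
>>> [ud.name(ch) for ch in mangled]
['LATIN CAPITAL LETTER A WITH TILDE', 'COPYRIGHT SIGN']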

OTHER TIPS

But it decodes okay:

>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'

The point is that the UTF-16-LE encoding of that character happens to coincide with the ASCII code for '0', twice. You could just as well write it as '\x30\x30':

>>> '00' == '\x30\x30'
True
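
Another way to see why, sticking to the same session style: the code point U+3030 is made up of two 0x30 bytes, and 0x30 is exactly the ASCII code for '0':

>>> hex(ord('0'))
'0x30'
>>> hex(ord(u'\u3030'))
'0x3030'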

You are being confused by two things here (threw me off too):

  1. The utf-16 and utf-32 codecs emit a BOM unless you specify which byte order to use, via utf-16-le, utf-16-be and so on. That is the '\xff\xfe' in your .encode('utf-16') result.
  2. '00' is two DIGIT ZERO characters, not null characters. A null would print differently anyway (see the check after this list):

    >>> '\0\0'
    '\x00\x00'
    
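A quick check for point 2, assuming the same interpreter as above:

>>> u"\u3030".encode("utf-16-le") == '00'
True
>>> u"\u3030".encode("utf-16-le") == '\0\0'
False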

There is a basic error in your sample code above. Remember, you encode from Unicode to a byte string, and you decode from a byte string back to Unicode. So, you do:

'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')

which translates to the following steps:

'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!

u'\u3030' is indeed encoded as '00' (ASCII zero, twice) in UTF-16-LE, but you seem to be reading that as a pair of null bytes ('\0').
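Decoding with the matching codec instead (same data as in the question) gets the original code point back:

>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-16-le')
u'\u3030'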

Remember, you won't get back the same character if you encode with one encoding and decode with another:

>>> import unicodedata as ud
>>> c= unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'

In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
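With U+3030 the mismatch happens to produce printable characters; with many other characters it fails outright, because their UTF-16-LE bytes are not valid UTF-8 at all (the exact error message depends on the Python version):

>>> u"\u00e9".encode("utf-16-le")
'\xe9\x00'
>>> u"\u00e9".encode("utf-16-le").decode("utf-8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...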

Licensed under: CC-BY-SA with attribution