Question

I've got a problem with strings that I get from one of my clients over xmlrpc. He sends me UTF-8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now, until he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string>

This is converted into:

u'Rafa\xc5\x82'

The best we've come up with is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8") 

This results in the correct string, which is:

u'Rafa\u0142' 

This works, but it is ugly as hell and cannot be used in production code. If anyone knows how to fix this problem in a more suitable way, please write. Thanks, Chris


Solution

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>

OTHER TIPS

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
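For anyone on Python 3, the same steps can be sketched roughly as below. This assumes the wire data arrives as a bytes object; the variable names are just for illustration.

```python
# Bytes as seen in the tcp dump: UTF-8 that was encoded twice.
raw = b"Rafa\xc3\x85\xc2\x82"

# First decode: each resulting character holds what is really a UTF-8 byte value.
first_decode = raw.decode("utf-8")

# Go via the integer value of each character to rebuild the real UTF-8 bytes.
as_bytes = bytes(ord(ch) for ch in first_decode)

# Second decode: now the bytes are decoded as normal.
result = as_bytes.decode("utf-8")
print(ascii(first_decode), ascii(result))
```

Note that bytes(...) raises ValueError if any character has a code point above 255, so this only works on text that really was double-encoded.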

>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

Encoding to latin1 is just a shorthand for Richie's nuts'n'bolts method: it maps each code point in the range 0-255 to the byte with the same value.
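As a stopgap until the client is fixed, the latin1 round trip could be wrapped in a small helper that leaves strings alone when they don't look double-encoded. This is only a sketch in Python 3 syntax; fix_double_utf8 is a made-up name, not part of any library.

```python
def fix_double_utf8(text):
    """Hypothetical helper: undo one extra layer of UTF-8 encoding, if present."""
    try:
        # latin-1 turns code points 0-255 back into the bytes they came from.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is above U+00FF or the bytes are not valid UTF-8,
        # so the string was probably not double-encoded; return it unchanged.
        return text


print(ascii(fix_double_utf8("Rafa\xc5\x82")))
```

The fallback is a guess, not a guarantee: a correctly encoded string whose characters happen to form a valid UTF-8 byte sequence would still be altered, so this should go away once the sender is fixed.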

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.
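A quick check suggests the two codecs agree only while every character is at or below U+00FF. Above that, latin1 refuses to encode, while raw_unicode_escape writes a literal backslash-u escape into the output. The sketch below uses Python 3 syntax; the variable names are illustrative.

```python
weird = "Rafa\xc5\x82"      # the double-encoded case: every code point <= 0xFF
correct = "Rafa\u0142"      # an already-correct string containing U+0142

# For code points 0-255 both codecs produce the same bytes.
same_for_weird = weird.encode("latin-1") == weird.encode("raw_unicode_escape")

# Above 0xFF, raw_unicode_escape emits the six ASCII characters \u0142 ...
escaped = correct.encode("raw_unicode_escape")

# ... while latin-1 raises instead of silently producing something else.
try:
    correct.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(same_for_weird, escaped, latin1_failed)
```

So with raw_unicode_escape, mixed input would pass through with escape text embedded in it, whereas latin1 fails loudly, which is arguably the safer behaviour for a workaround.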

Licensed under: CC-BY-SA with attribution