Question

I receive a JSON string, pass it through json.loads, and end up with an array of unicode strings. That's all well and good. One of the strings in the array is:

u'\xc3\x85sum'

This should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:

UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>

To test what's wrong, I did the following:

'Åsum'.encode('utf8') 
'\xc3\x85sum'

print '\xc3\x85sum'.decode('utf8')
Åsum

So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:

print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>

I tried doing json.loads(jsonstring, encoding='utf8'), but that changes nothing.

Is there a way to solve this? Either make json.loads not produce unicode, or make it decode using 'utf8' as I ask it to.

Edit:

The original string I receive looks like this (or at least the part that causes trouble):

"\\u00c3\\u0085sum"

Solution

You already have a Unicode value, so calling decode() on it forces an implicit encode to bytes first, using the default codec. That implicit encode is what raises the UnicodeEncodeError.
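The hidden encode step can be made explicit. The sketch below assumes the implicit codec was cp1252, which is only a guess based on the 'charmap' wording in the traceback; the helper name `implicit_encode_error` is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: cp1252 is an assumed stand-in for the implicit default codec.
value = u'\xc3\x85sum'  # the unicode string json.loads returned


def implicit_encode_error(text, codec='cp1252'):
    """Return the UnicodeEncodeError raised by encoding text, or None."""
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc
    return None


err = implicit_encode_error(value)
# The encode fails at index 1, the U+0085 character, which matches the
# "character u'\x85' in position 1" part of the traceback in the question.
failing_char = value[err.start]
```

No decoding ever happens here: the error comes from the encode that has to run before decode('utf8') could see any bytes.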

It looks like you received malformed JSON; JSON strings are already Unicode. If UTF-8 bytes have ended up as code points in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 code points to bytes one-to-one), then decode the result as UTF-8:

>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
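Applied to a whole parsed array, the same round trip might look like the sketch below. The helper name `repair_double_encoded` and the sample payload are illustrative assumptions; only the first element comes from the question.

```python
import json


def repair_double_encoded(text):
    """Undo UTF-8 bytes that were stored as Latin-1 code points.

    Strings that do not fit the pattern are returned unchanged.
    """
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above U+00FF, or the bytes are not valid UTF-8.
        return text


# Hypothetical payload; the first element is the string from the question.
raw = '["\\u00c3\\u0085sum", "plain"]'
names = [repair_double_encoded(s) for s in json.loads(raw)]
# names[0] is now u'\xc5sum', i.e. 'Åsum'; "plain" is untouched.
```

The fallback is a heuristic: genuine Latin-1 text that happens to form a valid UTF-8 byte sequence would also be rewritten, so this is best applied only to data known to be double-encoded.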

The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:

json.dumps(u'Åsum')
'"\\u00c5sum"'
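To see how the two representations differ, the following sketch serializes the name once correctly and once after simulating the bug. How the real source produced the bad output is unknown; reinterpreting the UTF-8 bytes as Latin-1 is just one plausible reconstruction.

```python
import json

name = u'\xc5sum'  # u'Åsum'

# Correct: serialize the Unicode value directly.
correct = json.dumps(name)

# Assumed reconstruction of the bug: the UTF-8 bytes are misread as
# Latin-1 text before serialization.
double_encoded = json.dumps(name.encode('utf8').decode('latin1'))

# correct        is '"\\u00c5sum"'
# double_encoded is '"\\u00c3\\u0085sum"', the string from the question
```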
Licensed under: CC-BY-SA with attribution