Question

I am currently developing a Django app that calls a Java REST API and retrieves multilingual results (the results come from Elasticsearch). I can retrieve the results and store them in an object just fine, but displaying them in JavaScript gives me junk - this is supposed to be Russian:

(screenshot of the output: Latin-1 Mojibake characters instead of Cyrillic text)

When converting it to a string or trying to convert to unicode, I get:

UnicodeEncodeError at /getObjectArticles
'ascii' codec can't encode characters in position 23-24: ordinal not in range(128)

I know the API is returning good data because calling it from a Java app works fine. Any idea how to handle the incoming string so it displays as recognizable characters?

EDIT: My ingest code:

import requests

try:
    g = requests.post(baseUrl, query_string)
except requests.exceptions.RequestException as e:
    print e

obj = g.json()
articleTitle = obj['hit']['title']
str(articleTitle)               # This raises a UnicodeEncodeError
articleTitle.decode("UTF-8")    # This also raises a Unicode error

EDIT: My JavaScript/jQuery:

// Load article text
function getArticleText(articleId, index) {
    console.log($('#result_number').val());
    var es_url = gu.webapp_url + '/getArticle?articleId=' + encodeURIComponent(articleId) + "&index=" + encodeURIComponent(index);

    $.get(es_url).success(function(data) {
        console.log(data);
        var decodedText = $("<div/>").html(data.text).text();
        var decodedTitle = $("<div/>").html(data.articleTitle).text();

        // Close Article View Button
        $('#g2i2-article-info').html("<div id=\"closeArticleInfo\" class=\"closeWindow\">X</div>");

        // Article Info Table
        var articleTable = "<table class=\"table table-striped table-bordered table-condensed\">";
        articleTable = articleTable + "<tr><td>Title</td><td>" + decodedTitle + "</td></tr>";
        articleTable = articleTable + "<tr><td>Publication Date</td><td>" + data.pubDate + "</td></tr>";
        articleTable = articleTable + "<tr><td>Source Name</td><td>" + data.sourceName + "</td></tr>";
        articleTable = articleTable + "<tr><td>Location</td><td>" + data.locationName + "</td></tr>";
        articleTable = articleTable + "<tr><td>URL</td><td>" + data.url + "</td></tr>";
        articleTable = articleTable + "</table>"
        $('#g2i2-article-info').append(articleTable);

        // Article Text
        $('#g2i2-article-info').append(decodedText);
        $('#g2i2-article-info').css('display', 'block');

    }).error(function(jqXHR, textStatus, errorThrown) {
        console.log(textStatus + " " + errorThrown);
    });

}

Solution

You already have Unicode data on your server; response.json() produces Unicode values for any JSON string, so there is no need to try to decode or re-encode it.
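For example, assuming g is the response object from the requests.post call in your ingest code (a minimal sketch, using only calls you already make):

obj = g.json()
articleTitle = obj['hit']['title']
print type(articleTitle)        # <type 'unicode'> on Python 2: requests has decoded it already
# No str() and no .decode() needed; pass the unicode value straight on to your JSON response.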

It is the browser that is producing this Latin-1 Mojibake: it is sent UTF-8 (a multi-byte encoding) but interprets the individual bytes as Latin-1 characters instead. Your title, for example, starts with the Cyrillic text Со, which is encoded to UTF-8 and then misinterpreted as Latin-1:

>>> u'Со'
u'\u0421\u043e'
>>> u'Со'.encode('utf8')
'\xd0\xa1\xd0\xbe'
>>> print u'Со'.encode('utf8').decode('latin1')
Со

So the D0 A1 bytes, which together encode the single code point С in UTF-8, are printed as the two Latin-1 characters Ð and ¡ instead.

The Ñ character is the D1 byte, which can be followed by about 33 non-printable second UTF-8 bytes to make a character in the range р through to Ѡ. Next is Ð¸, which is really и, and so on.
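The same round trip for one of those D1-led characters, in the same Python 2 session style, makes that concrete:

>>> u'р'
u'\u0440'
>>> u'р'.encode('utf8')
'\xd1\x80'
>>> u'р'.encode('utf8').decode('latin1')
u'\xd1\x80'

The first byte decodes to Ñ and the second to the invisible control character U+0080, which is why the browser shows an Ñ apparently followed by nothing.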

You need to figure out why the browser thinks your data is Latin 1.

Usually this is determined by the Content-Type header sent to the browser; if it is set to text/html; charset=ISO-8859-1, the browser will treat all text as Latin-1. It could also be that the HTML page has a <meta> tag such as <meta charset="ISO-8859-1"> or <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">; there are several closely related encodings (ISO-8859-1, Latin-1, Windows-1252) that all produce this kind of Mojibake.
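If the JSON your jQuery code fetches comes from a Django view, the simplest way to rule the header out is to set the charset explicitly. A minimal sketch only: the getArticle name comes from the URL your JS calls, baseUrl and query_string from your own ingest code, and the upstream 'hit' fields are assumptions about your data:

import json
import requests
from django.http import HttpResponse

def getArticle(request):
    # Same upstream call as in the ingest code above
    g = requests.post(baseUrl, query_string)
    hit = g.json()['hit']
    payload = {'articleTitle': hit['title'], 'text': hit.get('text', u'')}
    # json.dumps escapes non-ASCII by default, and the explicit charset stops
    # the browser from guessing Latin-1
    return HttpResponse(json.dumps(payload),
                        content_type='application/json; charset=utf-8')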

Another option is that you encoded it to UTF-8 explicitly, then managed to decode it somewhere to Latin-1 again before sending it to the browser.
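In code that mistake looks something like this (hypothetical, but it reproduces the symptom exactly):

title = u'\u0421\u043e'              # u'Со', as returned by g.json()
body = title.encode('utf8')          # correct: UTF-8 bytes for the wire
broken = body.decode('latin1')       # wrong: re-decoding those bytes as Latin-1
print repr(broken)                   # u'\xd0\xa1\xd0\xbe' -- displays as Ð¡Ð¾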

A third option is that the JSON service itself produced the Mojibake: it decoded UTF-8 bytes as Latin-1 somewhere before building the JSON, so the unicode strings you receive are already garbled at the source. In that case you can still repair the text by encoding it to Latin-1 and then decoding it as UTF-8:

fixed = broken.encode('latin1').decode('utf8')

but do so only after you have verified that your data on the server is already Mojibaked.
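If you do apply it, guard the call, because genuine (non-Mojibake) text raises an error when encoded to Latin-1. A hedged sketch with a hypothetical helper name:

def maybe_fix_mojibake(value):
    # Repairs only strings whose characters all fit in Latin-1 and whose bytes
    # form valid UTF-8; everything else is returned unchanged.
    try:
        return value.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value

articleTitle = maybe_fix_mojibake(obj['hit']['title'])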

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow