Question

I am currently developing a Django app that calls a Java REST API and retrieves multilingual results (the results come from Elasticsearch). I can retrieve the results and store them in an object just fine, but displaying them in JavaScript gives me junk - this is supposed to be Russian:

(screenshot of the output: Latin-1 Mojibake characters instead of Cyrillic text)

When converting it to a string or trying to convert to unicode, I get:

UnicodeEncodeError at /getObjectArticles
'ascii' codec can't encode characters in position 23-24: ordinal not in range(128)

I know the API is returning good data because calling it from a Java app works fine. Any idea how to handle the incoming string so it displays as recognizable characters?

EDIT: My ingest code:

import requests

try:
    g = requests.post(baseUrl, query_string)
except requests.exceptions.RequestException as e:
    print e

obj = g.json()
articleTitle = obj['hit']['title']
str(articleTitle)               # This raises a UnicodeEncodeError
articleTitle.decode("UTF-8")    # This also raises a Unicode error

EDIT: My JavaScript/jQuery:

// Load article text
function getArticleText(articleId, index) {
    console.log($('#result_number').val());
    var es_url = gu.webapp_url + '/getArticle?articleId=' + encodeURIComponent(articleId) + "&index=" + encodeURIComponent(index);

    $.get(es_url).success(function(data) {
        console.log(data);
        var decodedText = $("<div/>").html(data.text).text();
        var decodedTitle = $("<div/>").html(data.articleTitle).text();

        // Close Article View Button
        $('#g2i2-article-info').html("<div id=\"closeArticleInfo\" class=\"closeWindow\">X</div>");

        // Article Info Table
        var articleTable = "<table class=\"table table-striped table-bordered table-condensed\">";
        articleTable = articleTable + "<tr><td>Title</td><td>" + decodedTitle + "</td></tr>";
        articleTable = articleTable + "<tr><td>Publication Date</td><td>" + data.pubDate + "</td></tr>";
        articleTable = articleTable + "<tr><td>Source Name</td><td>" + data.sourceName + "</td></tr>";
        articleTable = articleTable + "<tr><td>Location</td><td>" + data.locationName + "</td></tr>";
        articleTable = articleTable + "<tr><td>URL</td><td>" + data.url + "</td></tr>";
        articleTable = articleTable + "</table>"
        $('#g2i2-article-info').append(articleTable);

        // Article Text
        $('#g2i2-article-info').append(decodedText);
        $('#g2i2-article-info').css('display', 'block');

    }).error(function(jqXHR, textStatus, errorThrown) {
        console.log(textStatus + " " + errorThrown);
    });

}

Solution

You already have Unicode data on your server; response.json() produces Unicode values for any JSON string, so there is no need to try to decode or re-encode it.
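For example, assuming g is the response object from the requests.post call in your ingest code (a minimal sketch, using only calls you already make):

obj = g.json()
articleTitle = obj['hit']['title']
print type(articleTitle)        # <type 'unicode'> on Python 2: requests has decoded it already
# No str() and no .decode() needed; pass the unicode value straight on to your JSON response.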

It is the browser that is producing this Latin-1 Mojibake: it is sent UTF-8 (a multi-byte encoding) but interprets the individual bytes as Latin-1 characters instead. Your title, for example, starts with the Cyrillic text Со, which is encoded to UTF-8 and then misinterpreted as Latin-1:

>>> u'Со'
u'\u0421\u043e'
>>> u'Со'.encode('utf8')
'\xd0\xa1\xd0\xbe'
>>> print u'Со'.encode('utf8').decode('latin1')
Со

So the D0 A1 bytes, which together encode the single code point С in UTF-8, are printed as the two Latin-1 characters Ð and ¡ instead.

The Ñ character is the D1 byte, which can be followed by about 33 non-printable second UTF-8 bytes to make a character in the range р through to Ѡ. Next is Ð¸, which is really и, and so on.
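The same round trip for one of those D1-led characters, in the same Python 2 session style, makes that concrete:

>>> u'р'
u'\u0440'
>>> u'р'.encode('utf8')
'\xd1\x80'
>>> u'р'.encode('utf8').decode('latin1')
u'\xd1\x80'

The first byte decodes to Ñ and the second to the invisible control character U+0080, which is why the browser shows an Ñ apparently followed by nothing.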

You need to figure out why the browser thinks your data is Latin 1.

Usually this is determined by the Content-Type header sent to the browser; if it is set to text/html; charset=ISO-8859-1, the browser will treat all text as Latin-1. It could also be that the HTML page has a <meta> tag such as <meta charset="ISO-8859-1"> or <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">; there are several closely related encodings (ISO-8859-1, Latin-1, Windows-1252) that all produce this kind of Mojibake.
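If the JSON your jQuery code fetches comes from a Django view, the simplest way to rule the header out is to set the charset explicitly. A minimal sketch only: the getArticle name comes from the URL your JS calls, baseUrl and query_string from your own ingest code, and the upstream 'hit' fields are assumptions about your data:

import json
import requests
from django.http import HttpResponse

def getArticle(request):
    # Same upstream call as in the ingest code above
    g = requests.post(baseUrl, query_string)
    hit = g.json()['hit']
    payload = {'articleTitle': hit['title'], 'text': hit.get('text', u'')}
    # json.dumps escapes non-ASCII by default, and the explicit charset stops
    # the browser from guessing Latin-1
    return HttpResponse(json.dumps(payload),
                        content_type='application/json; charset=utf-8')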

Another option is that you encoded it to UTF-8 explicitly, then managed to decode it somewhere to Latin-1 again before sending it to the browser.
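In code that mistake looks something like this (hypothetical, but it reproduces the symptom exactly):

title = u'\u0421\u043e'              # u'Со', as returned by g.json()
body = title.encode('utf8')          # correct: UTF-8 bytes for the wire
broken = body.decode('latin1')       # wrong: re-decoding those bytes as Latin-1
print repr(broken)                   # u'\xd0\xa1\xd0\xbe' -- displays as Ð¡Ð¾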

A third option is that the JSON service itself produced the Mojibake: it decoded UTF-8 bytes as Latin-1 somewhere before building the JSON, so the unicode strings you receive are already garbled at the source. In that case you can still repair the text by encoding it to Latin-1 and then decoding it as UTF-8:

fixed = broken.encode('latin1').decode('utf8')

but do so only after you have verified that your data on the server is already Mojibaked.
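If you do apply it, guard the call, because genuine (non-Mojibake) text raises an error when encoded to Latin-1. A hedged sketch with a hypothetical helper name:

def maybe_fix_mojibake(value):
    # Repairs only strings whose characters all fit in Latin-1 and whose bytes
    # form valid UTF-8; everything else is returned unchanged.
    try:
        return value.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value

articleTitle = maybe_fix_mojibake(obj['hit']['title'])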

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow