Why does this Unicode / UTF-8 "En Dash" character in my JSON feed get mangled when I download it?

https://stackoverflow.com/questions/23532778

17-07-2023

Question

My JSON feed is here:

http://america.aljazeera.com/bin/ajam/api/story.json?path=/content/ajam/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland

It is a JSON representation of this HTML page; you can see the same En Dash character in the page's subtitle.

http://america.aljazeera.com/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland.html

The En Dash is in the 2nd key (description):

description: "In a North Dakota town that was once dying, oil and money are flowing – and bringing big-city problems",

after the word "flowing".

The page has the following HTTP header:

Content-Type: application/json;charset=UTF-8

which can be seen by requesting it via curl -v or curl -I
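If you would rather check the declared charset in code than by eye, a minimal sketch is to parse the header value yourself. The helper name `charset_from_content_type` is hypothetical (it is not part of HTTParty or Net::HTTP), and it assumes you already have the raw header value as a string.

```ruby
# Hypothetical helper: extracts the charset parameter from a Content-Type
# header value, or returns nil when no charset is declared.
def charset_from_content_type(header_value)
  match = header_value.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  match && match[1]
end

puts charset_from_content_type("application/json;charset=UTF-8") # prints UTF-8
```

This only tells you what the server claims; it does not prove the body bytes are valid for that charset.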

Downloading it in Ruby using HTTParty like so:

> r = HTTParty.get('http://america.aljazeera.com/bin/ajam/api/story.json?path=/content/ajam/watch/shows/america-tonight/articles/2014/4/28/the-dark-side-oftheoilboomhumantraffickingintheheartland')
> r['description']
 => "In a North Dakota town that was once dying, oil and money are flowing –\u0080\u0093 and bringing big-city problems"

mangles it, as seen above. After much research I realized that the extra \u0080\u0093 corresponds to the last two bytes of the character's UTF-8 hex encoding, as seen here:

http://www.fileformat.info/info/unicode/char/2013/index.htm

specifically, this:

UTF-8 (hex) 0xE2 0x80 0x93 (e28093)
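The same byte sequence can be confirmed from Ruby. This is a small sketch; the helper name `utf8_hex` is made up for illustration.

```ruby
# Returns the UTF-8 bytes of a string as space-separated uppercase hex.
def utf8_hex(str)
  str.encode("UTF-8").bytes.map { |b| format("%02X", b) }.join(" ")
end

puts utf8_hex("\u2013") # prints E2 80 93
```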

This data is later fed into an iPhone app and an Android app. On the Android app it looks like the attached image. On an iPhone it looks fine, I think because only the first character is rendered and that is a regular ASCII dash, and the next two characters are skipped.

Finally, downloading it in JavaScript using AJAX does seem to handle it correctly:

> r = json['description'].match(/flowing (.*) and/)[1]
> "–"
> r
> "–"
> r.length
> 3
> r.toString(16)
> "–"

So...what is going on? What can I do to fix it? Is the fault with the server or with my code?

Solution

The JSON feed you're using failed to interpret \u2013 correctly. Instead of generating the desired UTF-8 encoded byte sequence:

E2 80 93

it generated:

E2 80 93 C2 80 C2 93

The iPhone app may work fine because it ignores the control characters C2 80 and C2 93 (U+0080 and U+0093). The Android app, however, renders them as some special glyph.
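To see why those extra bytes show up as invisible or odd characters, here is a short sketch that inspects the mangled string from the question. The helper names `utf8_hex` and `code_points` are hypothetical.

```ruby
# Returns the UTF-8 bytes of a string as space-separated uppercase hex.
def utf8_hex(str)
  str.encode("UTF-8").bytes.map { |b| format("%02X", b) }.join(" ")
end

# Returns the code points of a string in U+XXXX notation.
def code_points(str)
  str.chars.map { |c| format("U+%04X", c.ord) }.join(" ")
end

# The en dash followed by the two stray characters seen in the feed.
mangled = "\u2013\u0080\u0093"

puts utf8_hex(mangled)    # prints E2 80 93 C2 80 C2 93
puts code_points(mangled) # prints U+2013 U+0080 U+0093
puts mangled.length       # prints 3
```

U+0080 and U+0093 fall in the C1 control range (U+0080 to U+009F), which is why some renderers drop them and others draw a placeholder. The length of 3 also matches what the JavaScript console reported.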

If you don't have control of the JSON feed, you'll need to strip those stray sequences manually.
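One possible cleanup, sketched under the assumption that the text is a UTF-8 Ruby string and that C1 control characters (U+0080 to U+009F) never appear legitimately in your content. The helper name `strip_c1_controls` is hypothetical.

```ruby
# Hypothetical helper: removes C1 control characters (U+0080..U+009F),
# which is the range the stray U+0080 and U+0093 fall into.
# Assumes `text` is a UTF-8 encoded string.
def strip_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/, "")
end

dirty = "oil and money are flowing \u2013\u0080\u0093 and bringing big-city problems"
clean = strip_c1_controls(dirty)

puts clean == "oil and money are flowing \u2013 and bringing big-city problems" # prints true
```

With HTTParty you would apply this to each affected string field (for example `r['description']`) after parsing. This is a workaround for the symptom; the proper fix is on the server that generates the feed.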

Licensed under: CC-BY-SA with attribution
