Question

I've been going around in circles on this problem where the JSON UTF-8 strings returned from a server contain Unicode escape pairs like this:

\u00c3\u00bc

which is being rendered as two individual characters. However, it should be rendered as a single character. According to a table I found at this link, here are some more examples:

0xc3,0xa0 agrave
0xc3,0xa1 aacute
0xc3,0xa2 acircumflex
0xc3,0xa3 atilde
0xc3,0xa4 adiaeresis
0xc3,0xa5 aring
0xc3,0xa6 ae
0xc3,0xa7 ccedilla
0xc3,0xa8 egrave
0xc3,0xa9 eacute
0xc3,0xaa ecircumflex
0xc3,0xab ediaeresis
0xc3,0xac igrave
0xc3,0xad iacute
0xc3,0xae icircumflex
0xc3,0xaf idiaeresis
0xc3,0xb0 eth
0xc3,0xb1 ntilde
0xc3,0xb2 ograve
0xc3,0xb3 oacute

(Every case where I see this in my data would convert to an appropriate single character.)

Many of these apparently are 'aliases' of singlet forms like '\uxxxx', but I receive them this way, as doublets. The raw data bytes show that this is actually how the server transmits them.

(Once I have received them in UTF-8, there is no reason for me to keep them that way in local representation in memory.)

I don't know what to call this, so I'm having difficulty finding much information on it, and I'm not able to communicate clearly on the subject. I would like to know why it's used and where I can find code that will convert it to something my UIWebView can render correctly, but knowing what it's called is the point of my question.

My question, then, is: what is this doublet or paired form called?

(If it's helpful, I am working in Objective-C and CocoaTouch.)


Solution

The notation '\u00c3\u00bc' denotes the two-character sequence "Ã¼", using the normal JavaScript (and JSON) escape notation: within a string literal, '\uhhhh' stands for the character (or, technically, the Unicode code unit) with Unicode number hhhh in hexadecimal.
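For instance, a minimal Objective-C sketch (the payload literal here is hypothetical, mimicking the server response) shows a JSON parser turning that escape pair into exactly two characters:

    #import <Foundation/Foundation.h>

    // Hypothetical JSON payload containing the escape pair from the question.
    NSData *json = [@"{\"s\":\"\\u00c3\\u00bc\"}" dataUsingEncoding:NSUTF8StringEncoding];
    NSDictionary *obj = [NSJSONSerialization JSONObjectWithData:json options:0 error:NULL];
    NSString *s = obj[@"s"];
    NSLog(@"%@ (length %lu)", s, (unsigned long)s.length);  // logs "Ã¼ (length 2)"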

This is a virtually certain sign of a character-encoding conversion error (informally, this kind of corruption is known as 'mojibake'). Such errors occur frequently when UTF-8 encoded data is misinterpreted as ISO-8859-1 (or as some other 8-bit encoding).

Probably the real, uncorrupted data contains u with umlaut, ü, U+00FC, whose UTF-8 encoding consists of the bytes 0xC3 and 0xBC; see http://www.fileformat.info/info/unicode/char/fc/index.htm
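You can verify those bytes directly in Objective-C; a quick sketch, using a string literal in place of real server data:

    // U+00FC (ü) encoded as UTF-8 yields exactly the two bytes in question.
    NSData *bytes = [@"\u00FC" dataUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"%@", bytes);  // logs <c3bc>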

The document you are referring to, http://cpansearch.perl.org/src/JANPAZ/Cstools-3.42/Cz/Cstocs/enc/utf8.enc, appears to show UTF-8 encoded representations of characters, presented in text format by displaying the bytes as hexadecimal numbers.

OTHER TIPS

\u00c3\u00bc

which is being rendered as two individual characters.

That does explicitly mean the two characters Ã¼. If you expected to see ü, then what you have is incorrect processing further upstream, either in the JSON generator or in the input fed into it. Someone has decoded a series of bytes as ISO-8859-1 where they should have used UTF-8.
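To make that upstream mistake concrete, here is a sketch that reproduces it (not the server's actual code, just the same class of error):

    // Encode "ü" correctly as UTF-8 (bytes C3 BC)...
    NSData *utf8Bytes = [@"\u00FC" dataUsingEncoding:NSUTF8StringEncoding];
    // ...then wrongly decode those bytes as ISO-8859-1.
    NSString *mangled = [[NSString alloc] initWithData:utf8Bytes
                                              encoding:NSISOLatin1StringEncoding];
    NSLog(@"%@", mangled);  // logs "Ã¼", the doublet from the question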

You can work around the problem by reading the JSON, encoding the resulting string to ISO-8859-1, and then decoding those bytes as UTF-8. But this will mangle any input that was actually correct, and it's impossible to tell from the example whether the 'wrong' charset is really ISO-8859-1 or Windows code page 1252; it could be either.
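Since you're in Objective-C, that round trip might look like the following sketch (assuming the mangled text has already been parsed into an NSString; note that initWithData:encoding: returns nil if the bytes are not valid UTF-8):

    // mangledString stands in for a value parsed from the JSON.
    NSString *mangledString = @"\u00c3\u00bc";
    NSData *latin1Bytes = [mangledString dataUsingEncoding:NSISOLatin1StringEncoding];
    NSString *repaired = [[NSString alloc] initWithData:latin1Bytes
                                               encoding:NSUTF8StringEncoding];
    NSLog(@"%@", repaired);  // logs "ü"
    // If the source was really Windows-1252, use NSWindowsCP1252StringEncoding instead.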

You really need to fix the source of the problem rather than trying to work around it, though. Is it your server generating the JSON? Where does the data come from? Because \u00c3\u00bc to mean ü is explicitly incorrect.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow