.NET 3.5 C# StreamReader Reading ISO-8859-1 Characters Incorrectly
14-03-2021
Question
In summary: I retrieve an HTTP web response containing JSON-formatted data with Unicode escape sequences such as "\u00c3\u00b1", which should translate to "ñ". Instead, those sequences are converted to "ñ" by the JSON parser I am using. The behavior I'm looking for is for them to be translated to "ñ".
Taking the following code and executing it...
// "\u00c3\u00b1" is the two-character string "Ã±" (the UTF-8 bytes of "ñ" read as two separate characters).
string nWithAccent = "\u00c3\u00b1";
Encoding iso = Encoding.GetEncoding("iso8859-1");
// Round-trip: take the ISO-8859-1 bytes (0xC3, 0xB1), then decode them as UTF-8.
byte[] isoBytes = iso.GetBytes(nWithAccent);
nWithAccent = Encoding.UTF8.GetString(isoBytes);
nWithAccent outputs "ñ", which is the result I am looking for. I took the above code and used it on the "response_body" variable below, which contained the same characters as above (from what I could see using the Visual Studio 2008 Text Analyzer), but did not get the same result: it does nothing with the characters "\u00c3\u00b1".
In my application I execute the following code against an external system, retrieving data in JSON format. Examining the "response_body" variable with the Text Analyzer in Visual Studio, I see "\u00c3\u00b1" instead of "ñ"; e.g. the word "niño" appears as "ni\u00c3\u00b1o".
using (HttpWResponse = (HttpWebResponse)this.HttpWRequest.GetResponse())
{
    using (StreamReader reader = new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
    {
        // token will expire 60 min from now.
        this.TimeTillTokenExpiration = DateTime.Now.AddMinutes(60);
        // read response data
        response_body = reader.ReadToEnd();
    }
}
I then use an open-source JSON parser, which replaces "\u00c3" with "Ã" and "\u00b1" with "±", with an end result of "ñ" instead of "ñ". Is something wrong with the JSON parser, or am I applying the wrong encoding to the response stream? The headers in the response indicate the charset is UTF-8. Thanks for any replies!
Solution
The JSON response you are receiving is invalid: "\u00c3\u00b1" is not the correct encoding for "ñ".
Instead it is a sort of double encoding: the character was first encoded as a UTF-8 byte sequence, and then each byte above 127 was escaped with a \u sequence.
Since a JSON response is usually UTF-8 anyway, there is no need to escape the two-byte sequence for "ñ" at all. If escaping is used, it must be applied not to the individual bytes of the UTF-8 sequence but to the Unicode character itself, which would result in "\u00f1".
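To make the diagnosis concrete, here is a short sketch (class and variable names are illustrative) showing that "ñ" is code point U+00F1, that UTF-8 encodes it as the bytes 0xC3 0xB1, and that escaping those two bytes one by one reproduces the broken sequence from the question:

```csharp
using System;
using System.Text;

class DoubleEncodingDemo
{
    static void Main()
    {
        // The correct JSON escape for "ñ" is its code point, U+00F1.
        Console.WriteLine(((int)'\u00f1').ToString("x4"));   // 00f1

        // UTF-8 encodes "ñ" as the two bytes 0xC3 0xB1.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00f1");
        Console.WriteLine(BitConverter.ToString(utf8));      // C3-B1

        // Escaping each byte separately reproduces the broken sequence.
        string broken = string.Concat(
            Array.ConvertAll(utf8, b => "\\u" + b.ToString("x4")));
        Console.WriteLine(broken);                           // \u00c3\u00b1
    }
}
```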
You can test it with an online JSON validator (such as JSONLint or JSON Format) by pasting the following JSON data:
{
"unescaped": "ñ",
"escaped": "\u00f1",
"wrong": "\u00c3\u00b1"
}
Other tips
Replace
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8)
with
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.GetEncoding("iso8859-1"))
What happens if you pass this string to the JSON parser?
string s = "\\u00c3\\u00b1";
I suspect you'll get "ñ".
Is there a way to tell your JSON parser to interpret the escaped characters in the string as UTF-8 bytes?
You're probably better off reading raw bytes from the response stream and passing those to the JSON parser.
I think the problem is that you're converting the raw bytes to a string that still contains the escaped characters. The JSON parser has no way of knowing whether "\u00c3\u00b1" is meant to be two separate characters or the UTF-8 byte sequence for a single character.
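If the server's output cannot be changed, the round-trip from the question can be wrapped in a small helper and applied to the parser's mojibake output. This is a workaround sketch (the helper name is mine), not a substitute for fixing the invalid JSON at the source:

```csharp
using System;
using System.Text;

class MojibakeFix
{
    // Hypothetical helper: reinterpret a double-decoded string such as
    // "niÃ±o" as the UTF-8 text it really is ("niño").
    static string FixDoubleEncoded(string s)
    {
        // ISO-8859-1 maps each character below U+0100 back to its original byte...
        byte[] bytes = Encoding.GetEncoding("iso8859-1").GetBytes(s);
        // ...and decoding those bytes as UTF-8 recovers the intended text.
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        Console.WriteLine(FixDoubleEncoded("ni\u00c3\u00b1o")); // niño
    }
}
```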