.NET 3.5 C# StreamReader Reading ISO-8859-1 Characters Incorrectly
14-03-2021
Question
In summary: I retrieve an HTTP web response containing JSON-formatted data with Unicode escape sequences such as "\u00c3\u00b1", which should translate to "ñ". Instead, those sequences are converted to "ñ" by the JSON parser I am using. The behavior I'm looking for is for them to be translated to "ñ".
Taking the following code and executing it...
// "\u00c3\u00b1" is the two-character string "Ã±" (the UTF-8 bytes of "ñ" read as two separate characters).
string nWithAccent = "\u00c3\u00b1";
Encoding iso = Encoding.GetEncoding("iso8859-1");
// Round-trip: take the ISO-8859-1 bytes (0xC3, 0xB1), then decode them as UTF-8.
byte[] isoBytes = iso.GetBytes(nWithAccent);
nWithAccent = Encoding.UTF8.GetString(isoBytes);
nWithAccent outputs "ñ", which is the result I am looking for. I took the above code and used it on the "response_body" variable below, which contained the same characters as above (from what I could see using the Visual Studio 2008 Text Analyzer), but did not get the same result: it does nothing with the characters "\u00c3\u00b1".
In my application I execute the following code against an external system, retrieving data in JSON format. Examining the "response_body" variable with the Text Analyzer in Visual Studio, I see "\u00c3\u00b1" instead of "ñ"; e.g. the word "niño" appears as "ni\u00c3\u00b1o".
using (HttpWResponse = (HttpWebResponse)this.HttpWRequest.GetResponse())
{
    using (StreamReader reader = new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8))
    {
        // token will expire 60 min from now.
        this.TimeTillTokenExpiration = DateTime.Now.AddMinutes(60);
        // read response data
        response_body = reader.ReadToEnd();
    }
}
I then use an open-source JSON parser, which replaces "\u00c3" with "Ã" and "\u00b1" with "±", with an end result of "ñ" instead of "ñ". Is something wrong with the JSON parser, or am I applying the wrong encoding to the response stream? The headers in the response indicate the charset is UTF-8. Thanks for any replies!
Solution
The JSON response you are receiving is invalid: "\u00c3\u00b1" is not the correct encoding for "ñ".
Instead it is a sort of double encoding: the character was first encoded as a UTF-8 byte sequence, and then each byte above 127 was escaped with a \u sequence.
Since a JSON response is usually UTF-8 anyway, there is no need to escape the two-byte sequence for "ñ" at all. If escaping is used, it must be applied not to the individual bytes of the UTF-8 sequence but to the Unicode character itself, which would result in "\u00f1".
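To make the diagnosis concrete, here is a short sketch (class and variable names are illustrative) showing that "ñ" is code point U+00F1, that UTF-8 encodes it as the bytes 0xC3 0xB1, and that escaping those two bytes one by one reproduces the broken sequence from the question:

```csharp
using System;
using System.Text;

class DoubleEncodingDemo
{
    static void Main()
    {
        // The correct JSON escape for "ñ" is its code point, U+00F1.
        Console.WriteLine(((int)'\u00f1').ToString("x4"));   // 00f1

        // UTF-8 encodes "ñ" as the two bytes 0xC3 0xB1.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00f1");
        Console.WriteLine(BitConverter.ToString(utf8));      // C3-B1

        // Escaping each byte separately reproduces the broken sequence.
        string broken = string.Concat(
            Array.ConvertAll(utf8, b => "\\u" + b.ToString("x4")));
        Console.WriteLine(broken);                           // \u00c3\u00b1
    }
}
```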
You can test it with an online JSON validator (such as JSONLint or JSON Format) by pasting the following JSON data:
{
"unescaped": "ñ",
"escaped": "\u00f1",
"wrong": "\u00c3\u00b1"
}
Other tips
Replace
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.UTF8)
with
new StreamReader(HttpWResponse.GetResponseStream(), Encoding.GetEncoding("iso8859-1"))
What happens if you pass this string to the JSON parser?
string s = "\\u00c3\\u00b1";
I suspect you'll get "ñ".
Is there a way to tell your JSON parser to interpret the escaped characters in the string as UTF-8 bytes?
You're probably better off reading raw bytes from the response stream and passing those to the JSON parser.
I think the problem is that you're converting the raw bytes to a string that still contains the escaped characters. The JSON parser has no way of knowing whether "\u00c3\u00b1" is meant to be two separate characters or the UTF-8 byte sequence for a single character.
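If the server's output cannot be changed, the round-trip from the question can be wrapped in a small helper and applied to the parser's mojibake output. This is a workaround sketch (the helper name is mine), not a substitute for fixing the invalid JSON at the source:

```csharp
using System;
using System.Text;

class MojibakeFix
{
    // Hypothetical helper: reinterpret a double-decoded string such as
    // "niÃ±o" as the UTF-8 text it really is ("niño").
    static string FixDoubleEncoded(string s)
    {
        // ISO-8859-1 maps each character below U+0100 back to its original byte...
        byte[] bytes = Encoding.GetEncoding("iso8859-1").GetBytes(s);
        // ...and decoding those bytes as UTF-8 recovers the intended text.
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        Console.WriteLine(FixDoubleEncoded("ni\u00c3\u00b1o")); // niño
    }
}
```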