I have a program that reads text and processes it with a number of functions. The text should be readable regardless of its encoding; however, when I import a file saved in an Extended ASCII encoding, any characters above 127 are lost. Looking around, I can't see how to overcome this. Files saved as UTF-8 or Unicode are read fine. I've tried converting the strings to UTF-8, but the letters in question still come up as question-mark-like shapes. I can see that the byte values are correct, 0xBF for û, but they aren't being interpreted as that character.

Can anyone help me here? I've not done much work with this sort of thing before. I'm working in C#, if that helps.

My current code for converting looks like this:

System.Text.UTF8Encoding u = new System.Text.UTF8Encoding();
byte[] asciiBytes = Encoding.UTF8.GetBytes(sd);
sd = u.GetString(asciiBytes);

Where sd is the string. When I import this string, I do not specify the text encoding:

string input = File.ReadAllText(fname);
...
parser(input);

Solution

"I can see that the values are correct: 0xBF for û"

That is not the UTF-8 encoding for û; in UTF-8 that character is a two-byte sequence, 0xC3 0xBB. Clearly the file encoding was guessed wrong: File.ReadAllText called without an encoding argument assumes UTF-8 unless it finds a byte order mark. In Windows code page 1252, which is common in Western Europe and the Americas (including the UK, your country of residence), that character is encoded as the single byte 0xFB. Did you reverse the digits?
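The byte values above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name EncodingDemo and the helper Hex are made up for illustration, and ISO-8859-1 is used as a stand-in for code page 1252 because the two agree on û and ISO-8859-1 is available on every .NET runtime without extra setup.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: the bytes of `text` in `encoding`, as hex (e.g. "C3-BB").
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // ISO-8859-1 stands in for Windows-1252 here; both map û (U+00FB) to 0xFB.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        Console.WriteLine(Hex("\u00FB", Encoding.UTF8)); // C3-BB
        Console.WriteLine(Hex("\u00FB", latin1));        // FB

        // A lone 0xFB is not valid UTF-8, so the decoder substitutes U+FFFD,
        // which is the "question-mark-like shape" seen in the question.
        string wrong = Encoding.UTF8.GetString(new byte[] { 0xFB });
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0); // True
    }
}
```

This also suggests why re-encoding the string to UTF-8 and back cannot help: by the time the string exists, the original byte has already been replaced by U+FFFD, and a UTF-8 round trip of any valid string just returns the same string.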

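Use Encoding.Default instead, passing it as the second argument when reading the file: File.ReadAllText(fname, Encoding.Default).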
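As a hedged illustration of reading the file with an explicit encoding, here is a small sketch that names code page 1252 directly rather than relying on Encoding.Default. It assumes a modern runtime (.NET Core 3.0 or later), where Encoding.Default is UTF-8 and code page 1252 is only available after registering CodePagesEncodingProvider; on the classic .NET Framework on Windows, Encoding.Default is already the system ANSI code page. The class name LegacyFileDemo and the helper ReadWindows1252 are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class LegacyFileDemo
{
    // Hypothetical helper: reads a file as Windows-1252 text.
    public static string ReadWindows1252(string path)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(1252));
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // "crû" as a Windows-1252 editor would save it: û is the single byte 0xFB.
            File.WriteAllBytes(path, new byte[] { 0x63, 0x72, 0xFB });

            string guessed = File.ReadAllText(path); // no encoding given: decoded as UTF-8
            string correct = ReadWindows1252(path);  // explicit code page 1252

            Console.WriteLine(guessed.IndexOf('\uFFFD') >= 0); // True: û was lost
            Console.WriteLine(correct == "cr\u00FB");          // True: û was preserved
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Naming the code page explicitly makes the result independent of the machine's regional settings, at the cost of assuming every input file really is Windows-1252.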

Licensed under: CC-BY-SA with attribution