UTF32 and C# problems

https://stackoverflow.com/questions/9987706

28-05-2021
Question

I'm having trouble with character encoding. When I put the following two characters into a UTF-32 encoded text file:

𩸕
鸕

and then run this code on them:

System.IO.StreamReader streamReader = 
    new System.IO.StreamReader("input", System.Text.Encoding.UTF32, false);
System.IO.StreamWriter streamWriter = 
    new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32);

streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

I get:

鸕
鸕

(the same character twice, i.e. the output file differs from the input file)

A few things that might help. The bytes (in hex) for the first character:

15 9E 02 00

And for the second:

15 9E 00 00

I'm using gedit to create the text file and Mono to run the C#, on Ubuntu.

It also doesn't matter whether I specify the encoding for the input or the output file; it fails whenever the input file is UTF-32 encoded. It works if the input file is UTF-8 encoded.

The input file is as follows:

FF FE 00 00 15 9E 02 00 0A 00 00 00 15 9E 00 00 0A 00 00 00

Is it a bug, or is it just me?

Thanks!

Solution

OK, I think I figured it out; it seems to work now. The first character is U+29E15 (bytes 15 9E 02 00), which lies above U+FFFF, so there's no way it can be held in one single UTF-16 char. (The second character, U+9E15, does fit in one.) Instead, UTF-16 uses a surrogate pair for such characters: two chars that together act as one 'element'. To get the elements, we can use:

StringInfo.GetTextElementEnumerator(fred);  // fred is the string to walk

This returns a TextElementEnumerator. Each call to GetTextElement() on it returns a string holding one text element, which for a character like the first one is the whole surrogate pair. Treat that string as one character.
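For illustration, here is a minimal sketch of walking a string by text element rather than by char. The class name `TextElementDemo` and the helper `CodePoints` are made up for this example; the two characters are the ones from the question, written as escapes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Returns the first Unicode code point of each text element in s.
    // (A text element can hold more than one code point, e.g. with
    // combining marks; only the first is reported here.)
    public static List<int> CodePoints(string s)
    {
        var result = new List<int>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            string element = e.GetTextElement();
            // ConvertToUtf32 combines a surrogate pair into one code point.
            result.Add(char.ConvertToUtf32(element, 0));
        }
        return result;
    }

    public static void Main()
    {
        string s = "\U00029E15\u9E15";   // the two characters from the question
        Console.WriteLine(s.Length);     // 3 UTF-16 chars: a surrogate pair plus one char
        foreach (int cp in CodePoints(s))
            Console.WriteLine("U+{0:X}", cp);   // U+29E15, then U+9E15
    }
}
```

The point is that `s.Length` counts UTF-16 code units, so indexing the string char by char would split the first character in half.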

See here:

http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx

http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.gettextelement.aspx
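As a rough sketch of why U+29E15 needs two chars, the surrogate pair can be computed by hand and compared with what the framework produces. `SurrogateDemo` and `ToSurrogates` are hypothetical names used only here; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```csharp
using System;

public static class SurrogateDemo
{
    // Splits a code point above U+FFFF into its UTF-16 surrogate pair.
    public static string ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new string(new[] { high, low });
    }

    public static void Main()
    {
        string pair = ToSurrogates(0x29E15);
        Console.WriteLine("{0:X4} {1:X4}", (int)pair[0], (int)pair[1]); // D867 DE15

        // The framework's own conversion should agree with the manual arithmetic.
        Console.WriteLine(pair == char.ConvertFromUtf32(0x29E15));      // True
    }
}
```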

Hope it helps someone :D

Other tips

I tried this and it works well on my PC.

System.IO.StreamReader streamReader = new System.IO.StreamReader("input", true);
System.IO.StreamWriter streamWriter = new System.IO.StreamWriter("output", false);

streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

Maybe the text you think is in UTF32 is not.

When writing, you're not specifying UTF-32, so it defaults to Encoding.UTF8.

From MSDN:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), so its GetPreamble method returns an empty byte array. To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).

I think you also need to specify the same encoding (Encoding.UTF32) for your StreamWriter.
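A minimal in-memory sketch of that idea, using the same encoding on both sides. `Utf32RoundTrip` and `Copy` are names invented for this example, and the bytes are the input file dump from the question. On a runtime that handles UTF-32 correctly, the copy should come out byte-for-byte identical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class Utf32RoundTrip
{
    // Decodes UTF-32 bytes with a StreamReader and re-encodes the text
    // with a StreamWriter that uses the same encoding.
    public static byte[] Copy(byte[] input)
    {
        using (var source = new MemoryStream(input))
        using (var target = new MemoryStream())
        {
            using (var reader = new StreamReader(source, Encoding.UTF32))
            using (var writer = new StreamWriter(target, Encoding.UTF32))
            {
                writer.Write(reader.ReadToEnd());
            }
            // ToArray still works after the writer has closed the stream.
            return target.ToArray();
        }
    }

    public static void Main()
    {
        // BOM, U+29E15, newline, U+9E15, newline (all little-endian UTF-32).
        byte[] input =
        {
            0xFF, 0xFE, 0x00, 0x00, 0x15, 0x9E, 0x02, 0x00, 0x0A, 0x00, 0x00, 0x00,
            0x15, 0x9E, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x00
        };
        byte[] output = Copy(input);
        // Expected to print True when the runtime decodes UTF-32 correctly.
        Console.WriteLine(input.SequenceEqual(output));
    }
}
```

If this prints False on a given runtime, that would point at the runtime's UTF-32 handling rather than at the file.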

EDIT:

Normally an explicit conversion is not needed between UTF encodings, but I would also try this:

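using System.Text;

Encoding utf8 = Encoding.UTF8;
Encoding utf32 = Encoding.UTF32;
// Encode the text as UTF-8, then transcode those bytes to UTF-32.
byte[] utf8Bytes = utf8.GetBytes(yourText);
byte[] utf32Bytes = Encoding.Convert(utf8, utf32, utf8Bytes);
// Decode the UTF-32 bytes back into a string.
string utf32Text = utf32.GetString(utf32Bytes);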

From the Remarks section of MSDN for StreamReader's constructor:

This constructor initializes the encoding as specified by the encoding parameter, and the internal buffer size to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

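Very likely the byte order mark at the beginning of your file is actually being read as UTF-16 (or something else), so your explicitly stated UTF-32 encoding is not being used. Note that the UTF-32 little-endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 little-endian BOM (FF FE).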
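To check which byte order mark a file really starts with, something like the following sketch could be used. `BomCheck`, `Describe`, and the labels it returns are invented for this example; the BOM bytes themselves come from each encoding's GetPreamble method. UTF-32 LE is tested before UTF-16 LE, because a check in the other order would match on the first two bytes alone.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomCheck
{
    // Returns a label for the BOM at the start of data, or "none".
    public static string Describe(byte[] data)
    {
        // Longer BOMs come first so FF FE 00 00 is not mistaken for FF FE.
        string[] names = { "UTF-32 LE", "UTF-32 BE", "UTF-8", "UTF-16 LE", "UTF-16 BE" };
        Encoding[] candidates =
        {
            new UTF32Encoding(false, true),    // FF FE 00 00
            new UTF32Encoding(true, true),     // 00 00 FE FF
            new UTF8Encoding(true),            // EF BB BF
            new UnicodeEncoding(false, true),  // FF FE
            new UnicodeEncoding(true, true),   // FE FF
        };

        for (int i = 0; i < candidates.Length; i++)
        {
            byte[] bom = candidates[i].GetPreamble();
            if (data.Length >= bom.Length && data.Take(bom.Length).SequenceEqual(bom))
                return names[i];
        }
        return "none";
    }

    public static void Main()
    {
        // First bytes of the input file from the question.
        byte[] fromQuestion = { 0xFF, 0xFE, 0x00, 0x00, 0x15, 0x9E, 0x02, 0x00 };
        Console.WriteLine(Describe(fromQuestion));                            // UTF-32 LE

        // Only the first two bytes: indistinguishable from a UTF-16 LE BOM.
        Console.WriteLine(Describe(new byte[] { 0xFF, 0xFE, 0x41, 0x00 }));   // UTF-16 LE
    }
}
```

For a real file, the bytes could come from `System.IO.File.ReadAllBytes("input")`.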

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow