Why does TextReader.Read return an int, not a char?

https://stackoverflow.com/questions/18884491

29-06-2022

Question

Consider the following code ( .Dump() in LinqPad simply writes to the console):

var s = "𤭢"; // code point U+24B62: 4 bytes in UTF-32, two 2-byte code units in UTF-16
s.Dump();
s.Length.Dump(); // 2
TextReader sr = new StringReader("𤭢");
int i;
while((i = sr.Read()) >= 0)
{
    // notice that Read() yields two 2-byte
    // UTF-16 code units, each returned as an int
    i.ToString("X").Dump(); // D852, DF62
}

Given the outcome above, why does TextReader.Read() return an int and not a char? Under what circumstances might it read a value greater than 2 bytes?

Solution

TextReader.Read() never returns more than one 2-byte char (a single UTF-16 code unit) per call; however, it returns -1 to mean "no more characters to read" (end of input). Every value from 0x0000 to 0xFFFF is a valid Char, so none is left over to serve as an end marker. The return type therefore has to widen from Char (2 bytes) to Int32 (4 bytes) to express the full Char range plus -1.
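As a minimal sketch of the point above, the following reads until -1 and only then casts each value back to char. It assumes C# 9 or later (top-level statements) and uses Console.WriteLine in place of LinqPad's .Dump(); ReadAllCodeUnits is a hypothetical helper name, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (illustration only): drains a TextReader and
// returns every UTF-16 code unit it produced.
static List<char> ReadAllCodeUnits(TextReader reader)
{
    var result = new List<char>();
    int next;
    // Read() yields 0..65535 for a character and -1 at end of input.
    while ((next = reader.Read()) != -1)
    {
        result.Add((char)next); // safe: next is within the char range here
    }
    return result;
}

// U+FFFF is a legal char value, so no char could double as an end marker.
var codeUnits = ReadAllCodeUnits(new StringReader("A\uFFFF"));
Console.WriteLine(codeUnits.Count);                        // 2
Console.WriteLine(((int)codeUnits[1]).ToString("X"));      // FFFF
Console.WriteLine(new StringReader("").Read());            // -1
```

Casting before checking for -1 would turn the end marker into the char 0xFFFF, which is why the comparison is done on the int.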

Other tips

TextReader.Read() probably uses int to allow returning -1 when reaching the end of the text:

The next character from the text reader, or -1 if no more characters are available. The default implementation returns -1.

The Length is 2 because a .NET String is a sequence of UTF-16 code units, and any code point above U+FFFF requires a surrogate pair (two code units) to represent it.

{ 0xD852, 0xDF62 } <=> U+24B62 (𤭢)
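The mapping above follows the standard UTF-16 arithmetic: each surrogate carries 10 bits, and 0x10000 is added back to the combined value. A small sketch of that calculation, assuming C# 9 or later and a hypothetical helper named CombineSurrogates:

```csharp
using System;

// Hypothetical helper (illustration only): combines a surrogate pair by hand.
static int CombineSurrogates(char high, char low)
{
    if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        throw new ArgumentException("Expected a high surrogate followed by a low surrogate.");

    // High surrogate holds the top 10 bits, low surrogate the bottom 10 bits.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

int codePoint = CombineSurrogates('\uD852', '\uDF62');
Console.WriteLine(codePoint.ToString("X"));                                // 24B62
Console.WriteLine(codePoint == char.ConvertToUtf32('\uD852', '\uDF62'));   // True
```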

You can get the UTF-32 code point from them with Char.ConvertToUtf32():

Char.ConvertToUtf32("𤭢", 0).ToString("X").Dump(); // 24B62
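If whole code points are wanted from a TextReader rather than individual code units, one possible approach is to pair surrogates while reading. This is only a sketch, assuming C# 9 or later and a reader that supports Peek() (StringReader does); ReadCodePoints is a hypothetical helper name, and unpaired surrogates are simply passed through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (illustration only): reads UTF-16 code units and
// merges each surrogate pair into a single UTF-32 code point.
static List<int> ReadCodePoints(TextReader reader)
{
    var codePoints = new List<int>();
    int next;
    while ((next = reader.Read()) != -1)
    {
        char unit = (char)next;
        int peeked = reader.Peek(); // -1 if nothing follows
        if (char.IsHighSurrogate(unit) && peeked != -1 && char.IsLowSurrogate((char)peeked))
        {
            char low = (char)reader.Read(); // consume the low surrogate
            codePoints.Add(char.ConvertToUtf32(unit, low));
        }
        else
        {
            codePoints.Add(unit); // BMP character, or an unpaired surrogate kept as-is
        }
    }
    return codePoints;
}

foreach (int cp in ReadCodePoints(new StringReader("a\uD852\uDF62")))
    Console.WriteLine(cp.ToString("X")); // 61, then 24B62
```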

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow