Question

Consider the following code (.Dump() in LINQPad simply writes to the console):

var s = "𤭢"; //3 byte code point. 4 byte UTF32 encoded
s.Dump();
s.Length.Dump(); // 2
TextReader sr = new StringReader("𤭢");
int i;
while((i = sr.Read()) >= 0)
{
    // Note that Read() yields two 2-byte values
    // (UTF-16 code units), each returned as an int
    i.ToString("X").Dump(); // D852, DF62
}

Given the outcome above, why does TextReader.Read() return an int and not a char? Under what circumstances might it return a value larger than 2 bytes?

Solution

TextReader.Read() never returns a character value larger than 2 bytes (one UTF-16 code unit); however, it returns -1 to mean "no more characters to read" (end of input). Every 16-bit value is already a valid Char, so none is left over to act as that marker. The return type is therefore widened from Char (2 bytes) to Int32 (4 bytes) so it can express the full Char range plus -1.
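
As a minimal sketch of this point (the variable names are illustrative, not from the question), the following reads a string containing the smallest and largest Char values and shows that both come back as ordinary non-negative ints, while only the end of input produces -1:

```csharp
using System;
using System.IO;

// U+0000 and U+FFFF are both legitimate Char values, so neither
// could double as an "end of input" marker if Read() returned char.
var reader = new StringReader("\0\uFFFF");

int first = reader.Read();   // U+0000, returned as 0
int second = reader.Read();  // U+FFFF, returned as 65535
int third = reader.Read();   // nothing left, returned as -1

Console.WriteLine($"{first}, {second}, {third}"); // 0, 65535, -1

// Typical usage: test for -1 first, then narrow the int to a char.
var text = new StringReader("hi");
int value;
while ((value = text.Read()) != -1)
{
    char c = (char)value;
    Console.WriteLine(c);
}
```

The cast to char is safe only after the -1 check, since (char)(-1) would silently become U+FFFF.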

OTHER TIPS

TextReader.Read() returns int so that it can return -1 when it reaches the end of the text. From the documentation:

The next character from the text reader, or -1 if no more characters are available. The default implementation returns -1.

The Length is 2 because a String is a sequence of UTF-16 code units, and code points above U+FFFF require a surrogate pair (two code units) to represent them.

{ 0xD852, 0xDF62 } <=> U+24B62 (𤭢)
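
The mapping above follows the standard UTF-16 surrogate arithmetic. A short sketch, using the two values from the question and illustrative variable names, that computes the code point by hand and compares it with the framework's two-char overload of Char.ConvertToUtf32:

```csharp
using System;

int high = 0xD852; // high (lead) surrogate, range D800..DBFF
int low = 0xDF62;  // low (trail) surrogate, range DC00..DFFF

// Each surrogate carries 10 bits; the result is offset by 0x10000.
int codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
Console.WriteLine(codePoint.ToString("X")); // 24B62

// The same conversion using the built-in helper.
int viaFramework = char.ConvertToUtf32((char)high, (char)low);
Console.WriteLine(viaFramework == codePoint); // True
```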

You can get the UTF-32 code point from them with Char.ConvertToUtf32():

Char.ConvertToUtf32("𤭢", 0).ToString("X").Dump(); // 24B62
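
If whole code points are wanted from a TextReader rather than individual UTF-16 code units, the two ideas can be combined. The helper below is a hypothetical sketch (ReadCodePoint is not a framework method): it reads one code unit and, when that unit is a high surrogate followed by a low surrogate, consumes both and returns the combined code point.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns the next Unicode code point, or -1 at end of input.
static int ReadCodePoint(TextReader reader)
{
    int first = reader.Read();
    if (first < 0) return -1; // end of input

    char c = (char)first;
    if (char.IsHighSurrogate(c))
    {
        int next = reader.Peek();
        if (next >= 0 && char.IsLowSurrogate((char)next))
        {
            reader.Read(); // consume the low surrogate
            return char.ConvertToUtf32(c, (char)next);
        }
    }

    // A BMP character, or an unpaired surrogate passed through unchanged.
    return first;
}

var sr = new StringReader("a\uD852\uDF62"); // "a" followed by U+24B62
int cp;
while ((cp = ReadCodePoint(sr)) >= 0)
{
    Console.WriteLine(cp.ToString("X")); // 61, then 24B62
}
```

Here an int return type is needed for two reasons: the -1 marker, and the fact that a combined code point can exceed the 16-bit range of char.
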
Licensed under: CC-BY-SA with attribution