string and 4-byte Unicode characters

https://stackoverflow.com/questions/14010736

12-12-2021

Question

I have one question about strings and chars in C#. I found that a string in C# is a Unicode string, and a char takes 2 bytes. So every char is in UTF-16 encoding. That's great, but I also read on Wikipedia that there are some characters that in UTF-16 take 4 bytes.

I'm writing a program that lets you draw characters for alphanumeric displays. The program also has a tester, where you can type a string and it draws the string so you can see how it looks.

So how should I work with strings when the user enters a character that takes 4 bytes, i.e. 2 chars? I need to go through the string char by char, find each character in the list, and draw it on the panel.

Solution

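You could do:

for( int i = 0; i < str.Length; ++i ) {
    // Decode the code point starting at index i (a surrogate pair is combined into one value).
    int codePoint = Char.ConvertToUtf32( str, i );
    if( codePoint > 0xffff ) {
        i++;    // the code point took two chars, so skip the low surrogate
    }
}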

codePoint then holds the full code point as a 32-bit integer, whether it came from a single char or from a surrogate pair.
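To tie this back to the glyph lookup in the question, the loop can be wrapped in a helper that yields the code points one at a time. This is only a sketch: `CodePointDemo` and `CodePoints` are hypothetical names, not part of the original program, and it assumes the string contains no unpaired surrogates (otherwise `Char.ConvertToUtf32` throws an `ArgumentException`).

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: yields each Unicode code point of str as an int.
    public static IEnumerable<int> CodePoints(string str)
    {
        for (int i = 0; i < str.Length; ++i)
        {
            // Combines a surrogate pair at i and i + 1 into one code point.
            int codePoint = Char.ConvertToUtf32(str, i);
            if (Char.IsHighSurrogate(str[i]))
            {
                i++; // the pair used two chars, so skip the low surrogate
            }
            yield return codePoint;
        }
    }

    static void Main()
    {
        // 'a', then U+1D49C (stored as two chars), then 'b'
        string text = "a\U0001D49Cb";
        foreach (int cp in CodePoints(text))
        {
            Console.WriteLine("U+{0:X4}", cp);
        }
    }
}
```

Each yielded value could then serve as the key into whatever glyph table the program keeps, for example a `Dictionary<int, ...>` keyed by code point.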

OTHER TIPS

Work entirely with String objects; don't use Char at all. Example using IndexOf:

var needle = "𝒜";    // U+1D49C, outside the Basic Multilingual Plane
var hayStack = "a code point outside basic multi lingual plane: 𝒜";
var index = hayStack.IndexOf(needle);

Most methods on the String class have overloads which accept either a Char or a String. Most methods on Char also have overloads which take a String and an index. Just don't use Char.
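If the goal is to walk the string piece by piece while staying with String, one option is `System.Globalization.StringInfo`, which splits a string into text elements and returns each one as a `string`. A minimal sketch follows; `TextElementDemo` and `Split` are hypothetical names. Note that a text element can cover more than one code point (for example a base letter followed by a combining mark), so this is not equivalent to iterating code points.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: splits str into text elements, each kept as a string.
    public static List<string> Split(string str)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(str);
        while (e.MoveNext())
        {
            // A surrogate pair comes back as one two-char string.
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        foreach (string element in Split("a\U0001D49Cb"))
        {
            Console.WriteLine("{0} ({1} char(s))", element, element.Length);
        }
    }
}
```

With this approach the glyph list would be keyed by `string` rather than by `char` or `int`, which keeps `Char` out of the calling code entirely.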

Licensed under: CC-BY-SA with attribution
