Question

I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.

To my knowledge, C# internally uses UTF-16, which (I thought) encompasses every Unicode code point in two 16-bit fields. So I wanted to make some character literals, and I intentionally chose 𝛃 and 얤 because the former is from the SMP (Supplementary Multilingual Plane) and the latter from the BMP (Basic Multilingual Plane). The results are:

char ch1 = '얤'; // No problem
char ch2 = '𝛃'; // Compilation error "Too many characters in character literal"

What's going on?

A corollary of this question: if I have the string "얤𝛃얤", it is displayed correctly in a MessageBox, but when I convert it to a char[] using ToCharArray I get an array with four elements rather than three. String.Length is also reported as four rather than three.
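Here's a minimal repro of that string behavior:

string s = "얤𝛃얤";
char[] chars = s.ToCharArray(); // chars.Length is 4, not 3
int len = s.Length;             // also 4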

Am I missing something here?


Solution 2

Your source file may not be saved in UTF-8 (which is recommended when using special characters in source code), so the compiler may actually see a sequence of bytes that confuses it. You can verify this by opening your source file in a hex editor: the bytes you'll see in place of your character will likely differ from the expected UTF-8 encoding.

If it's not already on, you can turn on that setting in Tools->Options->Documents in Visual Studio (I use 2008) - the option is Save documents as Unicode when data cannot be saved in codepage.

Typically, it's better to specify special characters using an escape sequence.

This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry lists all the various C# escape sequences; I'm including it because it mentions \xnnn. Avoid that format: it's a variable-length version of \u and it can cause issues in some situations (not in yours, though).
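To make that concrete, here's a sketch of my own (assuming 𝛃 is U+1D6C3, which encodes in UTF-16 as the surrogate pair D835/DEC3) showing the \u and \U escape forms:

char ch1 = '얤';            // BMP character: one UTF-16 code unit, fits in a char
// char ch2 = '\U0001D6C3'; // compile error: an SMP code point won't fit in a char
string s1 = "\U0001D6C3";   // 𝛃 via the 8-digit \U escape; yields two UTF-16 code units
string s2 = "\uD835\uDEC3"; // the same character spelled as an explicit surrogate pair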

The MSDN article also points out why the character assignment fails: the code point for the character in question is greater than U+FFFF, which is outside the range of the char type.

As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string; it involves StringInfo.GetTextElementEnumerator.
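A sketch of my own along those lines (not the code from the linked question): it enumerates the text elements of the asker's string and prints each one's code point.

using System;
using System.Globalization;

class CodePointDemo
{
    static void Main()
    {
        string s = "얤\U0001D6C3얤";  // 얤𝛃얤
        Console.WriteLine(s.Length);   // 4: Length counts UTF-16 code units

        // GetTextElementEnumerator groups surrogate pairs (and combining
        // sequences) into single text elements, so this loop runs 3 times.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            string element = (string)e.Current;
            Console.WriteLine("U+{0:X}", char.ConvertToUtf32(element, 0));
        }
    }
}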

OTHER TIPS

MSDN says that the char type can represent a single 16-bit Unicode character (thus only characters from the BMP).

If you use a character outside the BMP (encoded in UTF-16 as a surrogate pair, i.e. two 16-bit code units), the compiler treats it as two characters.
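You can see the pair at runtime; for example (a sketch of my own):

string beta = "𝛃";
Console.WriteLine(beta.Length);                   // 2: stored as a surrogate pair
Console.WriteLine(char.IsHighSurrogate(beta[0])); // True (0xD835)
Console.WriteLine(char.IsLowSurrogate(beta[1]));  // True (0xDEC3)
Console.WriteLine(char.ConvertToUtf32(beta, 0).ToString("X")); // 1D6C3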

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow