Converting Strings to their code points

https://stackoverflow.com/questions/17812427

03-06-2022

Question

I have to convert a large amount of characters to their Unicode Code Point equivalents. I was using the following code to do this conversion:

string sample = "b";
int utf32 = char.ConvertToUtf32(sample, 0);
string codePoint = string.Format("{0:X}", utf32);

This works for most characters, but I also have characters such as ǎ, where the actual string comprises two chars: a (U+0061) and '̌' (U+030C). ConvertToUtf32(string, int) then returns only the first char (or the second, depending on the index), whereas I was expecting U+0103. Using ConvertToUtf32(char, char) does not work, since that overload requires a surrogate pair, that is, chars from a higher code point range.

Is there another function that I can use to convert strings to their code points, or maybe a calculation that I can perform?

Solution

“I have to convert a large amount of characters to their Unicode Code Point equivalents.”

That does not seem to be what you're actually asking for. If you have characters from the Basic Multilingual Plane (BMP), then each char corresponds to exactly one code point. Converting to UTF-32 won't change anything about that.

The ConvertToUtf32() method, and especially the overload that takes two chars, is for handling code points outside the BMP, which UTF-16 encodes as a pair of surrogate chars. But that doesn't seem to be what you need.
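For contrast, here is a minimal sketch of the case ConvertToUtf32() is meant for: walking a string whose code points may lie outside the BMP. The class name CodePoints and the method Enumerate are hypothetical names chosen for illustration, and the sketch assumes the input is well-formed UTF-16 (ConvertToUtf32 throws on an unpaired surrogate).

```csharp
using System;
using System.Collections.Generic;

static class CodePoints
{
    // Hypothetical helper: yields each Unicode code point of a UTF-16 string.
    // Assumes well-formed input; an unpaired surrogate makes ConvertToUtf32 throw.
    public static IEnumerable<int> Enumerate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // For a BMP char this is just its value; for a high surrogate at
            // index i it combines s[i] and s[i + 1] into one code point.
            int codePoint = char.ConvertToUtf32(s, i);
            yield return codePoint;

            // Skip the low surrogate that was consumed as part of the pair.
            if (char.IsHighSurrogate(s[i]))
                i++;
        }
    }
}
```

With this sketch, CodePoints.Enumerate("a\U0001F600") yields two values, 0x61 and 0x1F600, even though the string contains three chars. A combining sequence such as a followed by U+030C still yields two code points, because composition is a separate step from decoding UTF-16.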

What you actually seem to be asking for is to normalize the string into “Normalization Form Canonical Composition” (NFC). To do that, use the string.Normalize() method:

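// a (U+0061) followed by COMBINING CARON (U+030C); \u escapes are fixed-width, unlike \x
string decomposed = "\u0061\u030C";
// NormalizationForm is declared in the System.Text namespace
string composed = decomposed.Normalize(NormalizationForm.FormC);
foreach (char c in composed)
    Console.WriteLine("U+{0:X4}", (int)c);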

This prints “U+01CE”, which is LATIN SMALL LETTER A WITH CARON and seems to be what you wanted: U+030C is COMBINING CARON, so a followed by U+030C composes to U+01CE. If you really wanted U+0103 LATIN SMALL LETTER A WITH BREVE, you would need to use U+0306 COMBINING BREVE instead.
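To make the difference between the two combining marks concrete, here is a small sketch that composes a string and formats the result. The class name ComposeDemo and the method Describe are hypothetical, and the sketch assumes all characters are in the BMP (it formats one char at a time) and that the runtime has Unicode normalization data available.

```csharp
using System;
using System.Text;

static class ComposeDemo
{
    // Hypothetical helper: normalizes to NFC, then formats each char as U+XXXX.
    // Assumes BMP-only input, so each char is exactly one code point.
    public static string Describe(string s)
    {
        string composed = s.Normalize(NormalizationForm.FormC);
        var sb = new StringBuilder();
        foreach (char c in composed)
        {
            if (sb.Length > 0)
                sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Under those assumptions, Describe("a\u030C") gives "U+01CE" and Describe("a\u0306") gives "U+0103". Not every combining sequence has a precomposed form, though: Describe("q\u030C") gives "U+0071 U+030C", because Unicode defines no single code point for q with caron. So normalization cannot guarantee one code point per visible character.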

Licensed under: CC-BY-SA with attribution
