I have to convert a large amount of characters to their Unicode Code Point equivalents.
That does not seem to be what you're actually asking for. If you have characters from the Basic Multilingual Plane (BMP), then each char corresponds to exactly one code point. Converting to UTF-32 won't change anything about that.
The char.ConvertToUtf32() method, and especially the overload that takes two chars, is for handling code points that are outside the BMP. But that doesn't seem to be what you need.
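For contrast, here is a minimal sketch of the case that overload is meant for. The class name and the choice of U+1D11E as a sample character are only illustrative assumptions, not something from the question:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP,
        // so a .NET string stores it as a surrogate pair (two chars).
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length); // 2

        // The two-char overload combines the high and low surrogate
        // into a single code point.
        int codePoint = char.ConvertToUtf32(clef[0], clef[1]);
        Console.WriteLine("U+{0:X4}", codePoint); // U+1D11E

        // For a BMP character a plain cast already gives the code point.
        Console.WriteLine("U+{0:X4}", (int)'a'); // U+0061
    }
}
```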
What you actually seem to be asking for is to normalize the string into “Normalization Form Canonical Composition” (NFC). To do that, use the string.Normalize() method:
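// NormalizationForm lives in the System.Text namespace.
// \u escapes always take exactly four hex digits, unlike the variable-length \x.
string decomposed = "\u0061\u030C"; // 'a' followed by COMBINING CARON
string composed = decomposed.Normalize(NormalizationForm.FormC);
foreach (char c in composed)
    Console.WriteLine("U+{0:X4}", (int)c);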
This prints “U+01CE”, which is LATIN SMALL LETTER A WITH CARON, and that seems to be what you wanted. (U+030C is COMBINING CARON, which is why the result is U+01CE. If you really wanted U+0103 LATIN SMALL LETTER A WITH BREVE, you would need to use U+0306 COMBINING BREVE instead.)
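As a quick check of the breve variant, here is a small sketch along the same lines. The class and variable names are illustrative, and it assumes a runtime with normal (non-invariant) globalization support:

```csharp
using System;
using System.Text;

class BreveDemo
{
    static void Main()
    {
        // 'a' followed by U+0306 COMBINING BREVE (instead of U+030C COMBINING CARON).
        string decomposed = "a\u0306";
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        // After composition the two chars collapse into one.
        Console.WriteLine(composed.Length); // 1
        foreach (char c in composed)
            Console.WriteLine("U+{0:X4}", (int)c); // U+0103
    }
}
```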