What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

StackOverflow https://stackoverflow.com/questions/9162595

Question

Updated question ¹

With regard to character classes, comparison, sorting, normalization and collations, what Unicode version or versions are supported by which .NET platforms?

Original question

I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:

string s = "\u1D7D9"; // ("Mathematical double-struck digit one") 

and it stores the string "ᵽ9".

I'm basically looking for definitive references or answers to the following:

  • If it isn't true UTF-16 in .NET, what is it?
  • What version of Unicode is supported by .NET?
  • If recent versions are not supported or planned in the near future, does anybody know of a (non)commercial library or how I can work around this issue?

¹) I updated the question because, with the passing of time, the new wording seems more appropriate with respect to the answers and to the larger community. I left the original question in place; parts of it have been answered in the comments. Also, the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions; .NET has always used UTF-16 (with surrogates) internally.

Solution

Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.

The reason people sometimes refer to .NET as UCS-2 is (I think, because I see few other reasons) that Char is strictly 16 bits wide, so a single Char can't represent a code point in the upper planes. Char does, however, have static method overloads that take a string and an index (e.g. Char.IsLetter(String, Int32)) and can operate on upper-plane characters stored as surrogate pairs inside a string. Strings are stored as true UTF-16.

You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
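As a minimal sketch of the difference between the two escapes (the class and variable names are made up for illustration), the following contrasts the lowercase \u literal from the question with the uppercase \U form, and uses the string-and-index overloads of Char to look at the whole code point:

```csharp
using System;

class SupplementaryPlaneDemo
{
    static void Main()
    {
        // Lowercase \u consumes exactly four hex digits, so this is
        // U+1D7D followed by the ordinary character '9'.
        string twoChars = "\u1D7D9";
        Console.WriteLine(twoChars.Length);                // 2
        Console.WriteLine((int)twoChars[0] == 0x1D7D);     // True
        Console.WriteLine(twoChars[1] == '9');             // True

        // Uppercase \U consumes eight hex digits and can name U+1D7D9 directly.
        string digitOne = "\U0001D7D9";
        Console.WriteLine(digitOne.Length);                             // 2 (one surrogate pair)
        Console.WriteLine(char.IsSurrogatePair(digitOne, 0));           // True
        Console.WriteLine(char.ConvertToUtf32(digitOne, 0) == 0x1D7D9); // True

        // The (string, int) overloads classify the code point that starts at the
        // given index; the result depends on the platform's Unicode data.
        Console.WriteLine(char.IsDigit(digitOne, 0));
        Console.WriteLine(char.GetUnicodeCategory(digitOne, 0));
    }
}
```

Note that Length counts UTF-16 code units, not code points, which is why both strings report 2.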

As for Unicode version, from the MSDN documentation:

"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."

Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported, either in Windows 7 or in .NET 4.0.

Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.

Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).

Update 3: Since .NET version 4.5, a new class SortVersion reports the version of Unicode used for comparing and sorting strings. You get an instance from the CompareInfo.Version property and read its FullVersion property (it is an instance property, not a static one). On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms and .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This differs slightly from the official "what is new" statement here, which talks of version 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
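A minimal sketch of reading that information, assuming .NET Framework 4.5 or later (the class name is hypothetical, and the printed values vary by runtime and operating system, so none are shown):

```csharp
using System;
using System.Globalization;

class SortVersionDemo
{
    static void Main()
    {
        // CompareInfo.Version returns the SortVersion for the culture's comparer.
        SortVersion version = CultureInfo.InvariantCulture.CompareInfo.Version;

        // FullVersion is an int and SortId is a Guid; both depend on the
        // runtime and on the operating system underneath it.
        Console.WriteLine("FullVersion: 0x{0:X8}", version.FullVersion);
        Console.WriteLine("SortId:      {0}", version.SortId);
    }
}
```

Comparing two SortVersion values for equality is one way to tell whether sort keys produced on one machine can be trusted on another.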

OTHER TIPS

That character is supported. One thing to note is that for Unicode characters above U+FFFF, which need more than one 16-bit code unit, you declare them with an uppercase '\U' followed by eight hex digits, like this:

string text = "\U0001D7D9";

If you create a WPF app with that character in a text block, it should render the double-struck digit one character perfectly.

MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx

I tried this:

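    using System;
    using System.IO;
    using System.Text;

    class Program {
        static void Main(string[] args) {
            // U+1D7D9 lies outside the BMP, so this string holds a surrogate pair.
            string someText = char.ConvertFromUtf32(0x1D7D9);
            using (var stream = new MemoryStream()) {
                // Swap Encoding.UTF32 for Encoding.UTF8 or Encoding.Unicode to compare.
                using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
                    writer.Write(someText);
                    writer.Flush();
                }
                // ToArray still works after the writer has closed the stream.
                var bytes = stream.ToArray();
                foreach (var oneByte in bytes) {
                    Console.WriteLine(oneByte.ToString("x2"));
                }
            }
        }
    }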

And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point, for these encodings:

  • UTF8
  • UTF32
  • Unicode (UTF-16)

So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2).
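For reference, here is a small sketch of the encoded bytes for that code point without the BOM (the class name is made up for illustration). The byte values in the comments follow from the UTF-8, UTF-16LE and UTF-32LE encoding rules applied to U+1D7D9:

```csharp
using System;
using System.Text;

class EncodingBytesDemo
{
    static void Main()
    {
        string digitOne = char.ConvertFromUtf32(0x1D7D9);

        // Encoding.GetBytes does not emit a BOM, so only the code point itself is shown.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(digitOne)));    // F0-9D-9F-99
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(digitOne))); // 35-D8-D9-DF
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(digitOne)));   // D9-D7-01-00

        // Round-tripping through UTF-8 gives back the same two UTF-16 code units.
        string roundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(digitOne));
        Console.WriteLine(roundTrip == digitOne); // True
    }
}
```

The UTF-16 output is the surrogate pair D835 DFD9 in little-endian byte order, which a pure UCS-2 implementation could not produce.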

  • .NET Framework 4.6, 4.5, 4, 3.5 and 3.0 - The Unicode Standard, Version 5.0
  • .NET Framework 2.0 and 1.1 - The Unicode Standard, Version 3.1

The complete answers can be found here under the section Remarks.

Licensed under: CC-BY-SA with attribution