Unicode Normalization in Windows

https://stackoverflow.com/questions/7041013

24-12-2020

Question

I've been using "Unicode strings" in Windows ever since I learned about Unicode (that is, since graduating). However, it has always mystified me that the Win32 API uses the term "Unicode" very loosely. In particular, the "Unicode" variant that MSDN refers to is UTF-16 (the "wide char" terminology dates from when it was UCS-2, which is not Unicode). Yet the documentation makes almost no mention of Unicode normalization.

MSDN has a few pages about Unicode and Unicode normalization forms, and about functions to change the normalization form. The page on normalization even says:

Win32 and the .NET Framework support all four normalization forms.

However, I haven't found anywhere in the docs what normalization form is used (or understood) by the Win32 API.

Question 1: what normalization form is used by default for user input (such as an Edit control) and conversion through MultiByteToWideChar()?

Question 2: must the strings passed to Win32 API functions be in a particular normalization form, or are the kernel and file system normalization-agnostic?

Solution

From the MSDN article Using Unicode Normalization to Represent Strings:

Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
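Since text imported from other platforms may not be in form C, one option is to normalize at the point where data enters the program. The following is only a minimal sketch in Python using the standard `unicodedata` module, not the Win32 normalization functions; the helper name `to_nfc` is made up for illustration.

```python
import unicodedata

def to_nfc(text):
    """Return text in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)

# "é" written as a base letter followed by a combining acute accent (form D),
# as might arrive from another platform.
decomposed = "e\u0301"
composed = to_nfc(decomposed)

# Show the code points before and after normalization.
print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in composed])
```

Normalizing text that is already in form C leaves it unchanged, so calling such a helper on keyboard input that is already composed should be harmless.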

Update: I've included some specific details relating to Question #2.

With regard to the file system, normalization is not required, according to the article Naming Files, Paths, and Namespaces:

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
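To illustrate what "an opaque sequence of WCHARs" implies, the sketch below (Python, standard library only) prints nothing from the file system itself; it only compares the UTF-16 code units that two spellings of the same visible name would produce. The helper names `utf16_units` and `normalized_name` are assumptions made for this example, and choosing NFC is an application decision, not something the file I/O API requires.

```python
import unicodedata

def utf16_units(name):
    """Return the UTF-16 code units of name, i.e. what a WCHAR buffer would hold."""
    raw = name.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def normalized_name(name):
    """Hypothetical helper: settle on one form (NFC here) before any file I/O call."""
    return unicodedata.normalize("NFC", name)

composed = "caf\u00e9.txt"     # é as a single code point
decomposed = "cafe\u0301.txt"  # e followed by a combining acute accent

# Different code unit sequences, so an opaque comparison sees two different names.
print(utf16_units(composed) == utf16_units(decomposed))

# After the application normalizes, both spellings refer to the same name.
print(normalized_name(composed) == normalized_name(decomposed))
```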

With regard to SQL Server, no normalization is required, nor is data normalized when it is saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside indexes, but I cannot find specific details on what that is. A SQL Server 2005 article states the same:

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

OTHER TIPS

what normalization form is used by default for user input

Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.

Keyboard layouts tend towards NFC because in the pre-Unicode days they'd've usually been outputting a single byte character in the local code page for each keypress. However there are exceptions.

For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex, â) and some are typed as a combining diacritical (e.g. grave, à). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.

This isn't in normal form C (which would be ầ U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ U+0061,U+0302,U+0300).
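The Vietnamese example can be checked with a short Python sketch. It assumes Python's built-in `cp1258` codec as a stand-in for the Windows code page conversion (it does not call `MultiByteToWideChar()`), and `code_points` is a helper invented here for display.

```python
import unicodedata

def code_points(text):
    """Format each character of text as U+XXXX (hypothetical display helper)."""
    return [f"U+{ord(c):04X}" for c in text]

# The two bytes the keyboard layout would produce in code page 1258.
typed = bytes([0xE2, 0xCC]).decode("cp1258")

nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

print("typed:", code_points(typed))
print("NFC:  ", code_points(nfc))
print("NFD:  ", code_points(nfd))

# The typed sequence matches neither normalization form.
print(typed == nfc, typed == nfd)
```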

There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.

are the kernel and file system normalization-agnostic?

Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have three files whose names all display as ầ.txt in the same folder: one spelled U+1EA7, one spelled U+00E2,U+0300 and one spelled U+0061,U+0302,U+0300.
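If look-alike names like these are a concern, an application can detect them itself by grouping names under a common normalization form. This is a rough sketch, not anything Windows provides: the function name `find_normalization_collisions` is hypothetical, and the list of names is hard-coded here rather than read from a real folder.

```python
import unicodedata
from collections import defaultdict

def find_normalization_collisions(names):
    """Group names that become identical under NFC; return groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

# Three spellings of the same visible name, plus an unrelated file.
names = [
    "\u1ea7.txt",          # precomposed (NFC)
    "\u00e2\u0300.txt",    # a-circumflex + combining grave (mixed)
    "a\u0302\u0300.txt",   # fully decomposed (NFD)
    "readme.txt",
]

collisions = find_normalization_collisions(names)
for group in collisions:
    print([[f"U+{ord(c):04X}" for c in name] for name in group])
```

In practice the names would presumably come from a directory listing such as `os.listdir()` instead of a literal list.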

First of all, thanks for an excellent question. I found the answer in Michael Kaplan's blog:

But since all of the methods of text input on Windows tend to use the same normalization form already (form C), ...

Licensed under: CC-BY-SA with attribution
