Question

If I apply Unicode Normalization Form C to a string, will the number of code points in the string ever increase?


Solution

Yes. Some code points expand to multiple code points under NFC, because their canonical decompositions are excluded from recomposition. Within the Basic Multilingual Plane, for example, 70 code points expand to 2 code points, and 2 code points (U+FB2C and U+FB2D, in the Alphabetic Presentation Forms block) expand to 3 code points.
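As a quick check of the three-code-point case, here is a minimal Python sketch using the standard library's `unicodedata` module. The helper name `nfc_code_points` is made up for this illustration and is not part of any library.

```python
import unicodedata

def nfc_code_points(s):
    """Return the code points of s after NFC normalization, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", s)]

# U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT is excluded from
# composition, so NFC leaves it fully decomposed.
print(nfc_code_points("\uFB2C"))  # ['U+05E9', 'U+05BC', 'U+05C1']

# A precomposed letter such as U+00E9 is already in NFC and stays one code point.
print(nfc_code_points("\u00E9"))  # ['U+00E9']
```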

One guarantee you have about this so-called "expansion factor" is that NFC never expands a string to more than 3 times its original length, measured in code units:

There is also a Unicode Consortium stability policy that canonical mappings are always limited in all versions of Unicode, so that no string when decomposed with NFC expands to more than 3× in length (measured in code units). This is true whether the text is in UTF-8, UTF-16, or UTF-32. This guarantee also allows for certain optimizations in processing, especially in determining buffer sizes.

Section 9, Detecting Normalization Forms. UAX #15: Unicode Normalization Forms.
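To make the 3× bound concrete, the following Python sketch counts code units before and after NFC in each encoding form. The function names `code_units` and `expansion_factors` are hypothetical helpers for this example; the counts come from Python's built-in codecs (the `-le` variants are used so that no byte order mark is added).

```python
import unicodedata

# Size in bytes of one code unit for each encoding form.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(s, encoding):
    """Number of code units s occupies in the given encoding form."""
    return len(s.encode(encoding)) // UNIT_SIZE[encoding]

def expansion_factors(s):
    """Map each encoding form to (code units before NFC, code units after NFC)."""
    nfc = unicodedata.normalize("NFC", s)
    return {enc: (code_units(s, enc), code_units(nfc, enc)) for enc in UNIT_SIZE}

print(expansion_factors("\uFB2C"))
# {'utf-8': (3, 6), 'utf-16-le': (1, 3), 'utf-32-le': (1, 3)}
```

Under the quoted policy, a buffer of 3 × the input length in code units should be enough to hold the NFC result in any of the three encoding forms.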

I have written a Java program to determine which code points within a Unicode block expand to multiple code points: http://ideone.com/9PUOCb
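The linked program is not reproduced here. As a rough Python counterpart (not the author's Java code), the sketch below scans a range of code points and reports those whose NFC form is longer than one code point. The function name `expanding_code_points` is an assumption of this example, and the results depend on the Unicode data bundled with the Python interpreter.

```python
import unicodedata

def expanding_code_points(first, last):
    """Yield (code point, NFC length) for each code point in [first, last]
    whose NFC form has more than one code point."""
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates; they are not Unicode scalar values
        n = len(unicodedata.normalize("NFC", chr(cp)))
        if n > 1:
            yield cp, n

# Example: the Alphabetic Presentation Forms block, U+FB00..U+FB4F.
for cp, n in expanding_code_points(0xFB00, 0xFB4F):
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name} -> {n} code points")
```

Calling `expanding_code_points(0x0000, 0x10FFFF)` scans all planes instead of a single block.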

Alternatively, Tom Christiansen's unichars utility, part of the Unicode::Tussle CPAN module, can be used. (Note: Mac users may see an error at the make test installation step saying that the Perl version is too old. If you see this error, you can install the module by running notest install Unicode::Tussle within a CPAN shell.)

Examples:

  • Print the code points in the BMP that expand to 3 code points:

    unichars 'length(NFC) == 3'
    ‭‭ שּׁ  U+FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT
    ‭ שּׂ  U+FB2D HEBREW LETTER SHIN WITH DAGESH AND SIN DOT
  • Count the number of code points in all planes that expand to more than one code point:

    unichars -a 'length(NFC) > 1' | wc -l
          85

See also the frequently asked question "What are the maximum expansion factors for the different normalization forms?"

Licensed under: CC-BY-SA with attribution