Question

I have a Unicode string encoded, say, as UTF-8. One Unicode string can have several different byte representations. I wonder: is there, or can there be created, a canonical (normalized) form of a Unicode string, so that we can e.g. compare such strings with memcmp(3)? Can e.g. ICU or any other C/C++ library do that?

Solution

You might be looking for Unicode normalization. There are essentially four different normal forms, each of which ensures that all equivalent strings end up with a common representation afterwards. However, in many instances you need to take the locale into account as well, so while this may be a cheap way of doing a byte-for-byte comparison (provided you ensure the same Unicode transformation format, such as UTF-8 or UTF-16, and the same normal form), it won't gain you much apart from that limited use case.
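
As a rough illustration, here is a minimal sketch using ICU's C++ Normalizer2 API (the string literals and the lack of error handling are my own simplifications, not part of the original answer): normalize both inputs to the same form, then compare the resulting bytes.

```cpp
// Minimal sketch: NFC-normalize two canonically equivalent strings with ICU's
// C++ API, then compare their UTF-8 bytes. Error handling is trimmed.
// Build with something like: g++ example.cpp -licuuc
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>
#include <cstdio>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);

    // "café" with a precomposed U+00E9 vs. a decomposed "e" + U+0301:
    // canonically equivalent, but different byte sequences as typed.
    icu::UnicodeString a = icu::UnicodeString::fromUTF8("caf\xC3\xA9");
    icu::UnicodeString b = icu::UnicodeString::fromUTF8("cafe\xCC\x81");

    std::string ua, ub;
    nfc->normalize(a, status).toUTF8String(ua);
    nfc->normalize(b, status).toUTF8String(ub);

    // After normalization to the same form, plain byte comparison works.
    std::printf("%s\n", ua == ub ? "equal after NFC" : "still different");
    return 0;
}
```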

OTHER TIPS

Comparing Unicode codepoint sequences:

UTF-8 is itself a canonical representation. Two Unicode strings that are composed of the same sequence of codepoints will always be encoded to exactly the same UTF-8 byte sequence and can therefore be compared with memcmp. This is a necessary property of the UTF-8 encoding; otherwise it would not be easily decodable. We can go further: this holds for all official Unicode encoding schemes, UTF-8, UTF-16 and UTF-32. They encode a string to different byte sequences from one another, but each of them always encodes the same string to the same sequence. If you consider endianness and platform independence, UTF-8 is the recommended encoding scheme, because you don't have to deal with byte order when reading or writing 16-bit or 32-bit values.

So the answer is: if two strings consisting of the same codepoints are encoded with the same encoding scheme (e.g. UTF-8) and the same endianness (not an issue with UTF-8), the resulting byte sequences will be the same.
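
To illustrate that point, here is a hypothetical snippet (not part of the original answer, assuming ICU's C++ UnicodeString API): building the same codepoint sequence in two different ways still yields byte-identical UTF-8 output.

```cpp
// Sketch: the same codepoint sequence (U+0063 U+0061 U+0066 U+00E9) built two
// different ways encodes to identical UTF-8 bytes, so memcmp reports equality.
#include <unicode/unistr.h>
#include <string>
#include <cstring>
#include <cstdio>

int main() {
    icu::UnicodeString a = icu::UnicodeString::fromUTF8("caf\xC3\xA9");  // "café"
    icu::UnicodeString b;
    b.append((UChar32)0x63).append((UChar32)0x61)
     .append((UChar32)0x66).append((UChar32)0xE9);                       // c a f é

    std::string ua, ub;
    a.toUTF8String(ua);
    b.toUTF8String(ub);

    int same = ua.size() == ub.size() &&
               std::memcmp(ua.data(), ub.data(), ua.size()) == 0;
    std::printf("%s\n", same ? "identical bytes" : "different bytes");
    return 0;
}
```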

Comparing Unicode strings:

There's another issue that is more difficult to handle. In Unicode, some glyphs (the characters you see on screen or paper) can be represented either by a single codepoint or by a base codepoint followed by one or more combining characters. This is typically the case for glyphs with accents, diacritical marks, etc. Because of the different codepoint representations, the corresponding byte sequences will differ. Comparing strings while taking these combining characters into account cannot be done with a simple byte comparison; you first have to normalize the strings.
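
A small illustration of that difference at the byte level (the string literals are my own example, with the UTF-8 bytes spelled out explicitly):

```cpp
// The same rendered glyph "é", written as a single precomposed codepoint and
// as "e" plus a combining acute accent, gives two different byte sequences.
#include <cstring>
#include <cstdio>

int main() {
    const char precomposed[] = "\xC3\xA9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    const char decomposed[]  = "e\xCC\x81";  // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
    std::printf("lengths: %zu vs %zu, byte-equal: %s\n",
                std::strlen(precomposed), std::strlen(decomposed),
                std::strcmp(precomposed, decomposed) == 0 ? "yes" : "no");
    return 0;
}
```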

The other answers mention some Unicode normalization techniques, canonical forms and libraries that you can use for converting Unicode strings to their normal form. Then you will be able to compare them byte-by-byte with any encoding scheme.

You're looking to normalize the string to one of the Unicode normalization forms. libicu can do this for you, but not directly on a UTF-8 string. You first have to convert it to UChar (ICU's UTF-16 representation), using e.g. ucnv_toUChars, then normalize with unorm_normalize, then convert back using ucnv_fromUChars. I think there's also a specific ucnv_* variant for the UTF-8 encoding.

If memcmp is your only goal, you can of course do that directly on the UChar array after unorm_normalize.
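
A minimal sketch of that workflow, assuming ICU's C API (the fixed buffer sizes and trimmed error handling are my own shortcuts; unorm_normalize is the older API, with unorm2_normalize as its newer replacement):

```cpp
// Sketch: UTF-8 -> UChar (UTF-16) -> NFC -> memcmp on the normalized UChar arrays.
// Real code should preflight the required lengths or use the C++ UnicodeString API.
#include <unicode/ucnv.h>
#include <unicode/unorm.h>
#include <cstring>
#include <cstdio>

static int32_t utf8_to_nfc(const char *utf8, UChar *out, int32_t cap) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("UTF-8", &status);               // UTF-8 converter
    UChar raw[256];
    int32_t rawLen = ucnv_toUChars(cnv, raw, 256, utf8, -1, &status);
    ucnv_close(cnv);
    int32_t nfcLen = unorm_normalize(raw, rawLen, UNORM_NFC, 0, out, cap, &status);
    return U_SUCCESS(status) ? nfcLen : -1;
}

int main() {
    const char a[] = "caf\xC3\xA9";   // precomposed U+00E9
    const char b[] = "cafe\xCC\x81";  // decomposed "e" + U+0301
    UChar na[256], nb[256];
    int32_t la = utf8_to_nfc(a, na, 256);
    int32_t lb = utf8_to_nfc(b, nb, 256);
    // After normalization, the UChar arrays can be compared directly with memcmp.
    int same = la >= 0 && la == lb &&
               std::memcmp(na, nb, (size_t)la * sizeof(UChar)) == 0;
    std::printf("%s\n", same ? "equal" : "different");
    return 0;
}
```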

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow