Compare Chinese unicode strings, when multiple code points are the same character?

https://stackoverflow.com/questions/9795415

25-05-2021

Question

I'm writing some Java code that deals with Chinese characters, and I got some unexpected results -- strings that should be equal were not. Here is one of the offending characters, which means "six" (pinyin: liù): 六. This character can be represented with either of two code points:

U+F9D1 in the block CJK Compatibility Ideographs
U+516D in the block CJK Unified Ideographs

Wikipedia has a page about these character ranges, and the short section on compatibility ideographs does mention some duplicates, but the list omits this specific character.

So I'm wondering:

Is there a list of duplicate Unicode characters somewhere, so I can transform strings before trying to compare them?
Is this normal when dealing with CJK characters, or have I done something else wrong?

Solution

Just normalize them. U+F9D1 has a canonical (singleton) decomposition to U+516D, so it becomes U+516D under any of the four normalization forms (NFD, NFC, NFKD, NFKC):

$ export PERL_UNICODE=S

$ perl -le 'print "\x{F9D1}\x{516D}"' | uniquote -v
\N{CJK COMPATIBILITY IDEOGRAPH-F9D1}\N{CJK UNIFIED IDEOGRAPH-516D}

$ perl -le 'print "\x{F9D1}\x{516D}"' | nfd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}

$ perl -le 'print "\x{F9D1}\x{516D}"' | nfc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}

$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}

$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}

Many essential Unicode tools, including the uniquote, nfd, nfc, nfkd, and nfkc commands used above, are available here.
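
Since the question is about Java, here is a minimal sketch of the same idea using the standard java.text.Normalizer class. The class name CjkNormalizeDemo and the helper equalsNormalized are made up for illustration, and NFC is chosen only as an example; per the output above, any of the four forms should work for this character.

```java
import java.text.Normalizer;

class CjkNormalizeDemo {
    // Hypothetical helper: normalize both strings to NFC, then compare.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String compat = "\uF9D1";   // CJK COMPATIBILITY IDEOGRAPH-F9D1
        String unified = "\u516D";  // CJK UNIFIED IDEOGRAPH-516D

        // A raw comparison sees two different code points.
        System.out.println(compat.equals(unified));

        // After normalization the two strings compare equal.
        System.out.println(equalsNormalized(compat, unified));

        // Show the code point each normalization form produces for U+F9D1.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String n = Normalizer.normalize(compat, form);
            System.out.printf("%s -> U+%04X%n", form, n.codePointAt(0));
        }
    }
}
```

If the strings are used as map keys or stored for later lookup, it is probably simpler to normalize once at input time rather than on every comparison.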

Licensed under: CC-BY-SA with attribution
