Question

The Unicode character U+FA8E CJK COMPATIBILITY IDEOGRAPH-FA8E is a compatibility character mapped to U+641C [CJK Unified Ideographs]. In Java 6, NFC normalization leaves it as U+FA8E, while in Java 7 it is converted to U+641C.

When running this small snippet:

import java.text.Normalizer;

// Inside any method, e.g. main:
String fancyChar = "\uFA8E";
String normalized = Normalizer.normalize(fancyChar, Normalizer.Form.NFC);
System.out.printf("%04x == %04x\n", (int)(fancyChar.charAt(0)), (int)(normalized.charAt(0)));
System.out.println(fancyChar.equals(normalized));

In Java 6 (latest versions of both Sun/Oracle and OpenJDK):

fa8e == fa8e
true

In Java 7 (latest versions of both Sun/Oracle and OpenJDK):

fa8e == 641c
false

So my question is, why has this changed?

From my reading of Unicode Normalization Forms (UAX #15), NFC should not decompose characters that have only a compatibility mapping.

But the fact that both Oracle and OpenJDK have switched this for Java 7 makes me wonder.

Solution

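The character U+FA8E has a canonical decomposition mapping to U+641C, not a compatibility one, despite the word COMPATIBILITY in its name. The authoritative reference on this is the UnicodeData.txt file in the Unicode Character Database, where a decomposition field without a <tag> prefix denotes a canonical mapping. NFC applies canonical decomposition followed by canonical composition, and a single-character (singleton) canonical mapping like this one is never recomposed. Thus, the correct NFC form of U+FA8E is U+641C.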
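
To see the difference between a canonical and a compatibility mapping in practice, here is a small sketch that compares NFC and NFKC for two characters. The class and method names are my own, and U+FB01 (LATIN SMALL LIGATURE FI, which has a compatibility mapping to "fi") is a contrast case I picked; it is not part of the original question.

```java
import java.text.Normalizer;

public class CanonicalVsCompat {
    // Formats every UTF-16 unit of s as U+XXXX, separated by spaces.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Shows the input next to its NFC and NFKC forms.
    static String describe(String s) {
        return hex(s)
                + ": NFC=" + hex(Normalizer.normalize(s, Normalizer.Form.NFC))
                + " NFKC=" + hex(Normalizer.normalize(s, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        // Canonical singleton mapping: changed by NFC and by NFKC.
        System.out.println(describe("\uFA8E"));
        // Compatibility mapping only: NFC keeps it, NFKC expands it.
        System.out.println(describe("\uFB01"));
    }
}
```

On Java 7 or later I would expect the first line to show U+641C for both forms, and the second to show U+FB01 unchanged under NFC but expanded to U+0066 U+0069 under NFKC. That is the distinction the question is asking about: NFC only leaves compatibility mappings alone, not canonical ones.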

So the Java 7 behavior is the correct one, and the change is apparently a bug fix. It seems to affect other characters in the same group, too.
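
One way to check which other characters are affected on a given JVM is to scan a range and list every code point whose NFC form differs from the input. This is only a sketch: the class and method names are hypothetical, and I am assuming "the same group" means the CJK Compatibility Ideographs block (U+F900..U+FAFF), which the answer does not actually state.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class NfcBlockScan {
    // Returns the code points in [first, last] whose NFC form differs from the input.
    static List<Integer> changedByNfc(int first, int last) {
        List<Integer> changed = new ArrayList<Integer>();
        for (int cp = first; cp <= last; cp++) {
            String s = new String(Character.toChars(cp));
            if (!Normalizer.normalize(s, Normalizer.Form.NFC).equals(s)) {
                changed.add(cp);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Assumed range: the CJK Compatibility Ideographs block.
        List<Integer> changed = changedByNfc(0xF900, 0xFAFF);
        System.out.println(changed.size() + " code points change under NFC");
        for (int cp : changed) {
            String nfc = Normalizer.normalize(
                    new String(Character.toChars(cp)), Normalizer.Form.NFC);
            // codePointAt handles targets outside the Basic Multilingual Plane.
            System.out.printf("U+%04X -> U+%04X%n", cp, nfc.codePointAt(0));
        }
    }
}
```

Running this under Java 6 and Java 7 and diffing the two outputs should show exactly which characters changed behavior between the releases.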

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow