Question

I've been testing alphabetical sorting in Chinese (if I may call it so). This is how Excel sorts some example words:

啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只

0<2<85<!<@<版本<标记<成员<错误<导出<导航<Excel 文件<访问<分类<更改<规则<HTML<基本<记录<可选<快捷方式<类别<历史记录<密码<目录<内联<内容<讨论<文件<页面<只读

and this is what came out of Collections.sort(list, simplified_chinese_collator_comparator) (in the first list the offending character is 馍, which ends up last):

啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍

!<@<0<2<85<Excel 文件<HTML<版本<标记<成员<错误<导出<导航<访问<分类<更改<规则<基本<记录<可选<快捷方式<类别<历史记录<密码<目录<内联<内容<讨论<文件<页面<只读

I don't know anything about Chinese. Does anyone know why the Collator output is different, or what it is based on?

Are there any other libraries for language-based sorting?

Solution

Why is it different? Because there are several different methods of sorting ideographic characters, or even entire words. The ones that stuck in my mind are:

  • by number of strokes
  • by using Latin transliteration and then ordering it "naturally" (according to rules specific for Chinese language of course)
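The second method can be sketched in a few lines of Java. This is only an illustration: the `PINYIN` table and the `sortByPinyin` helper are hypothetical names, the table is hand-written for six characters from the question, and a real application would need a complete transliteration dictionary plus rules for tones and multi-character words.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class PinyinKeySort {
    // Hypothetical hand-written table (assumption, not a library API):
    // each character is mapped to its common toneless pinyin reading.
    static final Map<String, String> PINYIN = Map.of(
            "啊", "a", "波", "bo", "了", "le", "馍", "mo", "呢", "ne", "只", "zhi");

    // Sorts by the Latin key; words without a table entry use themselves as key.
    static List<String> sortByPinyin(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparing((String w) -> PINYIN.getOrDefault(w, w)));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortByPinyin(List.of("只", "呢", "馍", "了", "波", "啊")));
    }
}
```

With these keys, 馍 (mo) lands between 了 (le) and 呢 (ne), which matches the position Excel gives it in the question.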

There are other methods as well. Unicode Technical Standard #35 (LDML), for example, mentions some of them (more by coincidence than on purpose), but you would need plenty of time to go through it.

To answer your question about why these sort orders differ: Java ships its own collation rules and does not rely on the operating system's rules (as Excel does), so the two sets of rules may simply be different. You might also want to try ICU, which is the source of the collation classes and rules in Java (and is usually a step ahead of the JDK).
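One way to see that the rules live inside the JDK is to ask the collator for its rule string. The sketch below assumes that `Collator.getInstance` returns a `RuleBasedCollator`, which is typical but not guaranteed, so it checks before casting; `rulesFor` is a hypothetical helper name.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class JdkCollationRules {
    // Returns the collator's rule string, or "" if the collator for this
    // locale is not rule-based (assumption: it usually is).
    static String rulesFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        return (collator instanceof RuleBasedCollator)
                ? ((RuleBasedCollator) collator).getRules()
                : "";
    }

    public static void main(String[] args) {
        String rules = rulesFor(Locale.SIMPLIFIED_CHINESE);
        System.out.println("zh_CN rule string length: " + rules.length());

        // The sign tells which of the two characters this JDK sorts first.
        Collator zh = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
        System.out.println("sign of compare(馍, 呢): "
                + Integer.signum(zh.compare("馍", "呢")));
    }
}
```

The output depends on the JDK version and its bundled locale data, so running this on the JVM you actually deploy on is more reliable than trusting any published order.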

Other tips

There isn't a Collator in Java 6 or 7 that sorts the Chinese characters in the same order as the first (Excel) sample.

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class FindChineseCollator {
    public static void main(String... args) {
        // Order produced by Excel.
        String text1 = "啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只";
        findLocaleForSortedOrder(text1);
        // Order produced by the JDK's Simplified Chinese collator.
        String text2 = "啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍";
        findLocaleForSortedOrder(text2);
    }

    // Prints every available Collator locale whose sort leaves the given order unchanged.
    private static void findLocaleForSortedOrder(String text) {
        System.out.println("For " + text + " found...");
        String[] preSorted = text.split("<");
        for (Locale locale : Collator.getAvailableLocales()) {
            String[] sorted = preSorted.clone();
            Arrays.sort(sorted, Collator.getInstance(locale));
            if (Arrays.equals(preSorted, sorted))
                System.out.println("Locale " + locale + " has the same sorted order");
        }
        System.out.println();
    }
}

prints

For 啊<波<词<的<俄<佛<歌<和<及<课<了<馍<呢<票<气<日<四<特<瓦<喜<以<只 found...

For 啊<波<词<的<俄<佛<歌<和<及<课<了<呢<票<气<日<四<特<瓦<喜<以<只<馍 found...
Locale zh_CN has the same sorted order
Locale zh has the same sorted order
Locale zh_SG has the same sorted order
License: CC-BY-SA with attribution
Not affiliated with StackOverflow