Question

I have searched a lot and hope that somebody can help me. I want to get the Unicode blocks for every language in Java. What I have found so far is:

  • Character.UnicodeBlock.ARABIC; Character.UnicodeBlock.CYRILLIC;
  • Character.UnicodeBlock.LATIN_1_SUPPLEMENT; ....

But this is not enough. I also want to know which letters are in the German, French, or Russian alphabet. So far I can only find out that a character is Latin or Cyrillic, not which language-specific alphabet it belongs to.

Solution

Check out the ICU class LocaleData. It gives access to CLDR elements such as exemplarCharacters, by locale.

Beware that exemplarCharacters is rather vaguely defined (the concept of a character being used in a language is inherently vague, too). Hence its values have not been defined on a solid basis, and many of the choices made there are rather arguable. But the data is probably still the best basis we have in general.
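
As a rough sketch of the LocaleData approach, assuming ICU4J (the com.ibm.icu:icu4j artifact) is on the classpath: the class name ExemplarCharacters and the helper standardExemplars are made up for illustration, while LocaleData.getExemplarSet, LocaleData.ES_STANDARD, ULocale.forLanguageTag and UnicodeSet.toPattern are ICU4J calls. The output depends on the CLDR data bundled with your ICU version, so treat it as a starting point rather than a definitive alphabet.

```java
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

// Hypothetical helper class, not part of ICU4J.
class ExemplarCharacters {

    // Returns the CLDR "standard" (main) exemplar characters for a language tag.
    // Options value 0 means: return the set exactly as stored in the locale data.
    static UnicodeSet standardExemplars(String languageTag) {
        ULocale locale = ULocale.forLanguageTag(languageTag);
        return LocaleData.getExemplarSet(locale, 0, LocaleData.ES_STANDARD);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"de", "fr", "ru"}) {
            UnicodeSet set = standardExemplars(tag);
            // toPattern(false) prints the set in [a-z ...] notation without escaping.
            System.out.println(tag + ": " + set.toPattern(false));
        }
    }
}
```

LocaleData.ES_AUXILIARY can be passed instead of ES_STANDARD to get characters that occur in the language mostly through foreign words or names.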

Also note that Unicode blocks are rather coarse units in this context. For example, the Latin-1 Supplement block contains characters used in many languages, but no language uses all the letters in it.
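
To illustrate how coarse blocks are, here is a small sketch using only the JDK's Character.UnicodeBlock.of. The class name BlockCoarseness, the helper sameBlock and the choice of sample characters are mine, picked for illustration. It shows that letters typical of different languages can share a block, while two letters of the same language can fall into different blocks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class using only java.lang.Character.
class BlockCoarseness {

    // True if both code points are assigned to the same Unicode block.
    static boolean sameBlock(int a, int b) {
        Character.UnicodeBlock blockA = Character.UnicodeBlock.of(a);
        return blockA != null && blockA == Character.UnicodeBlock.of(b);
    }

    public static void main(String[] args) {
        Map<String, Integer> samples = new LinkedHashMap<>();
        samples.put("German sharp s", 0x00DF);    // ß
        samples.put("French e acute", 0x00E9);    // é
        samples.put("Icelandic thorn", 0x00FE);   // þ
        samples.put("French oe ligature", 0x0153); // œ
        samples.put("Russian ya", 0x044F);        // я

        for (Map.Entry<String, Integer> e : samples.entrySet()) {
            int cp = e.getValue();
            System.out.printf("U+%04X %-20s %s%n",
                    cp, e.getKey(), Character.UnicodeBlock.of(cp));
        }

        // ß (German) and þ (Icelandic) share a block...
        System.out.println(sameBlock(0x00DF, 0x00FE));
        // ...while é and œ, both used in French, do not.
        System.out.println(sameBlock(0x00E9, 0x0153));
    }
}
```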

OTHER TIPS

"I also want to know, which letters are in the german, french, russian alphabet."

I don't think Unicode supports this. For example, nothing in Unicode says which Latin-based characters are used in which Western European language.

In fact, I have a feeling that it is not even possible to make that call definitively. For instance, I recall reading an edition of a 19th century English classic in which the author or publisher spelled the word "role" as "rôle". This happens quite a lot when languages borrow words from one another.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow