How to get character by its (unicode) name in Java? I need the reverse of Character.getName(int codePoint)

StackOverflow https://stackoverflow.com/questions/23671346

  •  23-07-2023

Question

How do I look up a character or int codepoint in Java using its Unicode name?

For example, if

Character.getName('\u00e4')

returns "LATIN SMALL LETTER A WITH DIAERESIS", how do I perform the reverse operation (i.e. go from "LATIN SMALL LETTER A WITH DIAERESIS" to '\u00e4') using "plain" Java?

Edit: To stop the torrent of comments about what I do or don't want, here is what I would do in Python:

"\N{LATIN SMALL LETTER A WITH DIAERESIS}" # this gives me what I want as a literal

unicodedata.lookup("LATIN SMALL LETTER A WITH DIAERESIS") # a dynamic version

Now, the question is: do the same in Java.

And, BTW, I don't want to "print unicode escapes" -- getting the hex value for a char is easy, but I want the char bearing a given name.

To put it in other words I want to do the reverse of what Character.getName(int) does.


Solution 2

For JDK 9 and later, the static method Character.codePointOf(String name) is the simplest approach:

public static int codePointOf(String name)

Returns the code point value of the Unicode character specified by the given Unicode character name.

This works for all Unicode characters, not just those in the Basic Multilingual Plane. For example, running this code on Java 12 ...

String s1 = "LATIN SMALL LETTER A WITH DIAERESIS";
int cp1 = Character.codePointOf(s1);
System.out.println("Unicode name \"" + Character.getName(cp1) + "\" => code point " + cp1 + " => character " + Character.toString(cp1));

String s2 = "EYES";
int cp2 = Character.codePointOf(s2);
System.out.println("Unicode name \"" + Character.getName(cp2) + "\" => code point " + cp2 + " => character " + Character.toString(cp2));

String s3 = "DNA Double Helix"; // Only works with JDK12 and later. Otherwise java.lang.IllegalArgumentException is thrown.
int cp3 = Character.codePointOf(s3);
System.out.println("Unicode name \"" + Character.getName(cp3) + "\" => code point " + cp3 + " => character " + Character.toString(cp3));

...produces this output...

Unicode name "LATIN SMALL LETTER A WITH DIAERESIS" => code point 228 => character ä
Unicode name "EYES" => code point 128064 => character 👀
Unicode name "DNA DOUBLE HELIX" => code point 129516 => character 🧬

To summarize the conversions:

  • For code point => Unicode name, use Character.getName(codepoint)
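  • For code point => character representation, use Character.toString(codepoint) (the int overload requires JDK 11 or later; on JDK 9 and 10 use new String(Character.toChars(codepoint)) instead)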
  • For Unicode name => code point, use Character.codePointOf(name)
  • For Unicode name => character representation, no JDK method currently exists. Instead, do it indirectly, using the code point of the Unicode name, as shown above. For example: Character.toString(Character.codePointOf("LATIN SMALL LETTER A WITH DIAERESIS"));.
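The last conversion in the list can be wrapped in a small helper. This is only a minimal sketch: the class name UnicodeNames and the method fromName are made up for illustration and do not exist in the JDK. It builds the String with Character.toChars so that it also compiles on JDK 9 and 10, where Character.toString(int) is not available.

```java
class UnicodeNames {
    /**
     * Hypothetical helper: returns the character with the given Unicode name
     * as a String. A String is used because characters outside the BMP
     * occupy two chars (a surrogate pair).
     */
    static String fromName(String name) {
        // JDK 9+; throws IllegalArgumentException for unrecognized names
        int codePoint = Character.codePointOf(name);
        // Convert the code point to its UTF-16 code units
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromName("LATIN SMALL LETTER A WITH DIAERESIS"));
        System.out.println(fromName("EYES").length()); // supplementary character: 2 chars
    }
}
```

Returning a String rather than a char avoids silently truncating supplementary characters such as EYES.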

Notes:

  • Be sure that the JDK release being used supports the specified Unicode names. For example, the character with the Unicode name "DNA Double Helix" was added in Unicode 11, which is supported only by JDK 12 and later. On an earlier JDK release, calling Character.codePointOf("DNA Double Helix") throws an IllegalArgumentException.
  • If a white square is being shown in place of the Unicode character then try changing the font (e.g. Segoe UI Emoji for rendering Emoji characters).
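When the names come from user input, or the Unicode version of the target JDK is not known in advance, it may be more convenient to turn the IllegalArgumentException into an empty result. A possible sketch follows; SafeNameLookup and tryCodePointOf are illustrative names, not JDK API.

```java
import java.util.OptionalInt;

class SafeNameLookup {
    /**
     * Hypothetical wrapper around Character.codePointOf (JDK 9+) that
     * reports an unknown name as an empty OptionalInt instead of throwing.
     */
    static OptionalInt tryCodePointOf(String name) {
        try {
            return OptionalInt.of(Character.codePointOf(name));
        } catch (IllegalArgumentException e) {
            // The name is not known to the Unicode version of this JDK
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Present on JDK 12 and later, empty on JDK 9 to 11
        System.out.println(tryCodePointOf("DNA Double Helix").isPresent());
        System.out.println(tryCodePointOf("NO SUCH CHARACTER NAME").isPresent());
    }
}
```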

OTHER TIPS

The ICU4J library can help you here. It has a class UCharacter with getCharFromName and other related methods that can map from various types of character name strings back to the int code points they represent.

However, if you are working with hard-coded character names (i.e. quoted string literals in the source code), it is far more efficient to do the translation once: use the \u escape in the source code, with a comment giving the full name if necessary, rather than incur the cost of parsing the name tables at runtime every time. If the character names come from a file or similar, then obviously you will have to convert at runtime.
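As an illustration of the hard-coded approach, the constants below carry the full name in a comment. The class and constant names are invented for this sketch. Note that a character outside the Basic Multilingual Plane does not fit in a single char, so it is written as a surrogate pair in a String.

```java
class HardCodedEscapes {
    // LATIN SMALL LETTER A WITH DIAERESIS
    static final char A_DIAERESIS = '\u00e4';

    // EYES (U+1F440): outside the BMP, so written as a surrogate pair
    static final String EYES = "\uD83D\uDC40";

    public static void main(String[] args) {
        // Character.getName confirms that the escapes match the comments
        System.out.println(Character.getName(A_DIAERESIS));
        System.out.println(Character.getName(EYES.codePointAt(0)));
    }
}
```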

Well, looking at the source code for Character.class:

public static String getName(int codePoint) {
    if (!isValidCodePoint(codePoint)) {
        throw new IllegalArgumentException();
    }
    String name = CharacterName.get(codePoint);
    if (name != null)
        return name;
    ...
}

CharacterName is a package-private class which lazily initializes a SoftReference<byte[]> pool of character names (I think). One line in particular is of interest though, buried inside a series of different input stream constructors:

private static synchronized byte[] initNamePool() {
    ...
        return getClass().getResourceAsStream("uniName.dat");
    ...
}

Now, I've been doing some digging, and for some reason this uniName.dat doesn't seem to exist in OpenJDK's source. I did find a uniName.dat -- as part of my TeX Live distribution, strangely enough. Opening it up in a hex editor reveals jumbles of bytes -- so the contents are encoded somehow. How, I have no clue. I'll take a second look at the source code, but it might take a while to decode, if I can figure it out at all.

In addition, the debugger in my copy of Eclipse appears to be broken (can't resolve variables for some reason or another), so I can't inspect the input stream to try to see where it's reading from.

So in short, it doesn't seem you can do this in native Java unless you feel like copy-pasting the name pool code from CharacterName, or rolling your own code that deciphers this file (assuming you can find it).


Edit: Found uniName.dat! On my machine it is located in resources.jar in the Java installation. It is still a bunch of bytes. So you can either parse this file yourself (not a lot of fun, involves a lot of bit twiddling), or use a library (suggested above). If you're restricted to native Java, you might want to take a look at the CharacterName class and see if you can get something into a HashMap<String, Character>.

I hope this class, which relies only on "plain" Java, will be useful to someone. It uses a lazily populated lookup table that can be cleared at any time with reset(false) to free memory; the table is refilled automatically the next time it is needed. If the characters being looked up are in the lower Unicode blocks (as is usually the case), the fill time is almost unnoticeable. The whole table can optionally be pre-filled with reset(true).

Also note that there is a known Unicode name collision between U+0007 and U+1F514. Java's Character.getName() still returns "BELL" for the former. The class presented here tries to fix this, at least for the reverse operation, by returning U+0007 for the approved unique name "ALERT" assigned to it.

import java.util.Map;
import java.util.HashMap;

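/**
 * Reverse lookup for Character.getName(int), backed by a lazily filled name table.
 * Not thread-safe: all state is static and unsynchronized.
 */
public class UnicodeTable {
    /** Returned when no character with the requested name exists. */
    public static final char INVALID_CHAR = '\uFFFF';
    // Unicode name -> code point, for every code point scanned so far
    private static final Map<String, Integer> charMap = new HashMap<>();
    // true while code points above lastLookup have not been scanned yet
    private static boolean incomplete;
    // last code point examined by the scan
    private static int lastLookup;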

    static {
        reset(false);
    }

    /** Returns the code point with the given name, or INVALID_CHAR (0xFFFF) if there is none. */
    public static int getCodePoint(String name) {
        Integer cp = charMap.get(name);
        if (cp == null && incomplete) {
            while (++lastLookup <= Character.MAX_CODE_POINT) {
                String uName = Character.getName(lastLookup);
                if (uName != null) {
                    charMap.put(uName, lastLookup);
                    if (uName.equals(name))
                        return lastLookup;
                }
            }
            incomplete = false;
        }
        return cp == null ? INVALID_CHAR : cp;
    }

    /** Returns the char with the given name, or INVALID_CHAR if it is unknown or outside the BMP. */
    public static char getChar(String name) {
        int cp = getCodePoint(name);
        return Character.isBmpCodePoint(cp) ? (char)cp : INVALID_CHAR;
    }

    private static final int ALERT = 0x000007;
    private static final int BELL = 0x01F514;

    public static void reset(boolean fillUp) {
        if (!fillUp) {
            charMap.clear();
            incomplete = true;
            lastLookup = Character.MIN_CODE_POINT - 1;
            charMap.put("ALERT", ALERT);
            String bName = Character.getName(BELL);
            if (bName.equals(Character.getName(ALERT))) {
                getCodePoint(bName);
                charMap.put(bName, BELL);
            }
        } else if (incomplete) {
            while (++lastLookup <= Character.MAX_CODE_POINT) {
                String uName = Character.getName(lastLookup);
                if (uName != null)
                    charMap.put(uName, lastLookup);
            }
            incomplete = false;
        }
    }
}
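Whether the BELL collision described above is visible depends on the JDK in use. It can be checked directly with a few lines. This is only a sketch; BellCollisionCheck and namesCollide are illustrative names, and no particular output is assumed for U+0007.

```java
class BellCollisionCheck {
    static final int ALERT = 0x0007;   // control character, named BELL in Unicode 1.0
    static final int BELL  = 0x1F514;  // emoji whose Unicode name is BELL

    /** True if the running JDK reports the same non-null name for both code points. */
    static boolean namesCollide(int cpA, int cpB) {
        String a = Character.getName(cpA);
        return a != null && a.equals(Character.getName(cpB));
    }

    public static void main(String[] args) {
        System.out.println("U+0007  -> " + Character.getName(ALERT));
        System.out.println("U+1F514 -> " + Character.getName(BELL));
        System.out.println("collision: " + namesCollide(ALERT, BELL));
    }
}
```

If a collision is reported, a name-to-code-point map built from Character.getName needs a tie-breaking rule like the one in reset(false) above.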
Licensed under: CC-BY-SA with attribution