Escaping unicode surrogate characters?

https://stackoverflow.com/questions/22921933

29-06-2023

Question

I have the following line of text (see the code below as well):

What I'm trying to do is escape that emoticon (the phone icon) as two \u escapes and then convert it back to the original phone icon. The first method below works fine, but I want to escape by range so that I can escape any characters like this one, and I don't see how to do that with the first method.

How can I do this range-based escape with UnicodeEscaper and get the same output as StringEscapeUtils (i.e. escape to two \uXXXX\uXXXX sequences, then unescape back to the phone icon)?

import org.apache.commons.lang3.text.translate.UnicodeEscaper;
import org.apache.commons.lang3.text.translate.UnicodeUnescaper;

    String text = "Unicode surrogate here-> 📱<--here";
    // Escape the entire string. Not what I want, because there could be
    // \n, \r or other escape chars that I want left intact (I just want a range).
    String text2 = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
    System.out.println(text2);   // "Unicode surrogate here-> \uD83D\uDCF1<--here"
    // Unescape it back to the phone emoticon.
    text2 = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text2);
    System.out.println(text2); // "Unicode surrogate here-> 📱<--here"

    // How do I do the same as above, but only for a range of chars (i.e. any Unicode surrogate),
    // which is what I want, rather than escaping the entire string?
    text2 = UnicodeEscaper.between(0x10000, 0x10FFFF).translate(text);
    System.out.println(text2); // "Unicode surrogate here-> \u1F4F1<--here"
    // unescape .... (need the phone emoticon here)
    text2 = (new UnicodeUnescaper().translate(text2));
    System.out.println(text2);// "Unicode surrogate here-> ὏1<--here"

Solution 2

Your string:

"Unicode surrogate here-> \u1F4F1<--here"

does not do what you think it does.

A char is basically a UTF-16 code unit, therefore 16 bits, so a \u escape takes exactly four hex digits. What you actually have here is \u1F4F followed by a literal 1, and that explains your output.

I don't know what you mean by "escape" here, but if it means replacing surrogate pairs with "\uXXXX\uXXXX", then have a look at Character.toChars(). It returns the char sequence necessary to represent one Unicode code point, whether it is in the BMP (one char) or not (two chars).

For code point U+1f4f1, it will return a two-element char array with characters 0xd83d and 0xdcf1 in that order. And this is what you want.
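
If you want to see what that looks like in practice, here is a minimal sketch using only the JDK. The class and method names (SurrogateEscapeSketch, escapeSupplementary) are made up for illustration; this is not how Commons Lang does it internally, just one way to escape only the code points above the BMP while leaving \n, \r and everything else alone.

```java
class SurrogateEscapeSketch {

    // Hypothetical helper: writes each supplementary code point (above U+FFFF)
    // as two backslash-u escapes, one per UTF-16 code unit, and copies every
    // other character through unchanged.
    static String escapeSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                // Character.toChars yields the high and low surrogate, in that order.
                for (char unit : Character.toChars(cp)) {
                    out.append(String.format("\\u%04X", (int) unit));
                }
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The phone icon, written as its surrogate pair.
        String text = "Unicode surrogate here-> \uD83D\uDCF1<--here";
        // Prints: Unicode surrogate here-> \uD83D\uDCF1<--here
        System.out.println(escapeSupplementary(text));
    }
}
```

Because the check is per code point, a newline or any BMP character in the input passes through untouched.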

OTHER TIPS

A late answer, but I've found that you need the

org.apache.commons.lang3.text.translate.JavaUnicodeEscaper

class instead of UnicodeEscaper.

Using it, it prints:

Unicode surrogate here-> \uD83D\uDCF1<--here

And the unescaping works well.
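
To see why the two-escape form round-trips while the single five-digit form from the question does not, here is a rough JDK-only sketch of a four-hex-digit unescaper. It is an assumption-laden simplification (the name UnicodeUnescapeSketch is invented, and it ignores escaped backslashes), not the Commons Lang implementation, but it reads exactly four hex digits per escape in the same way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescapeSketch {

    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: turns each four-digit escape into one UTF-16 code unit.
    // Simplification: it does not treat a doubled backslash specially.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Two escapes: the decoded surrogates sit side by side and form U+1F4F1.
        String good = unescape("here-> \\uD83D\\uDCF1<--here");
        System.out.println(Integer.toHexString(good.codePointAt(7))); // prints 1f4f1

        // One five-digit escape: only four digits are consumed, the last 1 stays literal.
        String bad = unescape("here-> \\u1F4F1<--here");
        System.out.println(Integer.toHexString(bad.charAt(7)) + " then '" + bad.charAt(8) + "'"); // prints 1f4f then '1'
    }
}
```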

Licensed under: CC-BY-SA with attribution
