Question

I tried to decode the following string,

String str  = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";

System.out.println(StringEscapeUtils.unescapeHtml(str));
try {
    System.out.println("res:"+java.net.URLDecoder.decode(str, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

Both methods fail as below,

AT%26amp%3BT%20Network%20Client%20%u2013%20IBM
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u2"
    at java.net.URLDecoder.decode(URLDecoder.java:173)
    at decrypt.DecryptHtml.main(DecryptHtml.java:19)

The source of the string is a VBS script that uses the Escape function. How can I decode this string?

Was it helpful?

Solution

Unfortunately, from reading the documentation, it appears that Microsoft Has Done It Again (tm): "non standard xxx", where here "xxx" is "escaping format".

Specifically, in the documentation of the VBScript function, it is said that:

[...]Unicode characters that have a value greater than 255 are stored using the %uxxxx format.

(Hey, MS: there is no such thing as "Unicode characters"; those are called code points)

Great. So you need your own decoding function.

Fortunately, we use Java. And since this proprietary escape sequence only covers Unicode code points in the Basic Multilingual Plane (U+0000 to U+FFFF), and since char is a UTF-16 code unit, and since there is a 1 to 1 mapping between BMP and UTF-16, this makes our job a little easier.

Here is the code:

public final class MSUnescaper
{
    private static final char PERCENT = '%';
    private static final char NONSTANDARD_PCT_ESCAPE = 'u';

    private MSUnescaper()
    {
    }

    public static String unescape(final String input)
    {
        final StringBuilder sb = new StringBuilder(input.length());
        final CharBuffer buf = CharBuffer.wrap(input);

        char c;

        while (buf.hasRemaining()) {
            c = buf.get();
            if (c != PERCENT) {
                sb.append(c);
                continue;
            }
            if (!buf.hasRemaining())
                throw new IllegalArgumentException();
            c = buf.get();
            sb.append(c == NONSTANDARD_PCT_ESCAPE
                ? msEscape(buf) : standardEscape(buf, c));
        }

        return sb.toString();
    }

    private static char standardEscape(final CharBuffer buf, final char c)
    {
        if (!buf.hasRemaining())
            throw new IllegalArgumentException();
        final char[] array = { c, buf.get() };
        return (char) Integer.parseInt(new String(array), 16);
    }

    private static char msEscape(final CharBuffer buf)
    {
        if (buf.remaining() < 4)
            throw new IllegalArgumentException();
        final char[] array = new char[4];
        buf.get(array);
        return (char) Integer.parseInt(new String(array), 16);
    }

    public static void main(final String... args)
    {
        final String input = "AT%26amp%3BT%20Network%20Client%20%u2013%20IBM";
        System.out.println(unescape(input));
    }
}

Output:

AT&amp;T Network Client – IBM

OTHER TIPS

String str = "AT%26amp%3BT%20Network%20Client%20%[here]u[here]2013%20IBM" I think this string is invalid. %u20 is not valid charecter. If you remove u from your string you can encode it. For reference: w3schools html url encodeing

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top