URL decoding in Java for non-ASCII characters

Question

Anv%E4ndare

As PopoFibo says, Anv%E4ndare is not a valid UTF-8 encoded sequence.

You can do some tolerant best-guess decoding:

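import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tries each charset in order and returns the first strict decode that succeeds.
public static String parse(String segment, Charset... encodings) {
  byte[] data = parse(segment);
  for (Charset encoding : encodings) {
    try {
      return encoding.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data))
          .toString();
    } catch (CharacterCodingException notThisCharset_ignore) {
      // The bytes are not valid in this charset; try the next candidate.
    }
  }
  // No candidate charset fit, so return the input untouched.
  return segment;
}

// Turns the percent-encoded segment into raw bytes.
private static byte[] parse(String segment) {
  ByteArrayOutputStream buf = new ByteArrayOutputStream();
  Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
                          .matcher(segment);
  int last = 0;
  while (matcher.find()) {
    // Copy the literal text before this escape, then the escaped byte itself.
    appendAscii(buf, segment.substring(last, matcher.start()));
    byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
    buf.write(hex);
    last = matcher.end();
  }
  appendAscii(buf, segment.substring(last));
  return buf.toByteArray();
}

// Literal (unescaped) text is assumed to be ASCII; any other character becomes '?'.
private static void appendAscii(ByteArrayOutputStream buf, String data) {
  byte[] b = data.getBytes(StandardCharsets.US_ASCII);
  buf.write(b, 0, b.length);
}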

This code will successfully decode the given strings:

for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
    "Anv%E4ndare")) {
  String result = parse(test, StandardCharsets.UTF_8,
      StandardCharsets.ISO_8859_1);
  System.out.println(result);
}

Note that this isn't some foolproof system that allows you to ignore correct URL encoding. It works here because v%E4n (the byte sequence 76 E4 6E) is not a valid sequence as per the UTF-8 scheme: E4 announces a three-byte character, but the next byte, 6E, is not a continuation byte. A strict decoder can detect this.
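To see that detection in isolation, here is a minimal sketch that only asks whether a byte sequence is well-formed UTF-8. The class name StrictUtf8Check and the helper isValidUtf8 are made up for this illustration; the decoder calls are the same ones used in parse above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {

  // Hypothetical helper: true if the bytes decode as UTF-8 without errors.
  static boolean isValidUtf8(byte[] data) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return true;
    } catch (CharacterCodingException e) {
      // The decoder reported a malformed sequence instead of substituting U+FFFD.
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] fromLatin1 = {0x76, (byte) 0xE4, 0x6E}; // v %E4 n
    byte[] fromUtf8 = {(byte) 0xC3, (byte) 0xA7};  // %C3%A7

    System.out.println(isValidUtf8(fromLatin1)); // false
    System.out.println(isValidUtf8(fromUtf8));   // true
  }
}
```

The REPORT action is what matters here: with REPLACE, the decoder would quietly substitute a replacement character and the fallback to the next charset would never happen.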

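If you reverse the order of the encodings, the first string can happily (but incorrectly) be decoded as ISO-8859-1, because every byte sequence is valid in that charset. The bytes C3 A7 then come out as the two characters Ã§ instead of ç.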
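A small sketch of why the order matters, decoding the same two bytes both ways. DecodeOrderDemo and its decode helper are names invented for this example, and the expected characters are written as Unicode escapes to avoid source-encoding surprises.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeOrderDemo {

  // Hypothetical helper: lenient decode of raw bytes with the given charset.
  static String decode(byte[] data, Charset charset) {
    return new String(data, charset);
  }

  public static void main(String[] args) {
    byte[] cedilla = {(byte) 0xC3, (byte) 0xA7}; // %C3%A7

    String asUtf8 = decode(cedilla, StandardCharsets.UTF_8);
    String asLatin1 = decode(cedilla, StandardCharsets.ISO_8859_1);

    // UTF-8 reads the pair as one character, U+00E7.
    System.out.println(asUtf8.equals("\u00E7"));         // true
    // ISO-8859-1 maps each byte to its own character, U+00C3 then U+00A7.
    System.out.println(asLatin1.equals("\u00C3\u00A7")); // true
  }
}
```

Since ISO-8859-1 never fails, it only makes sense as the last candidate in the list.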

Note: HTTP doesn't care about percent-encoding, and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8, but this was done retroactively. It is really up to the server to define what form its URIs take, and if you have to handle arbitrary URIs you need to be aware of this legacy.

I've written a bit more about URLs and Java here.