ESAPI XSS prevention for user supplied url property

Question

The problem that you're facing here, is that there are different rules for encoding different parts of a URL--to memory there's 4 sections in a URL that have different encoding rules. First, understand why in Java, you need to build URLs using the UriBuilder class. The URL specification will help with nitty-gritty details.

Now since the problem is that since the entire URL is coming as input from the user, I cannot simply parse out the Query parameters and sanitize them individually, since malicious input can be created combining the two query parameters and sanitizing them individually wont work in that case.

The only real option here is java.net.URI.

Try this:

URI dirtyURI = new URI("http://example.com/alpha?abc=def&phil=key%3dbdj");

String cleanURIStr = enc.canonicalize( dirtyURI.getPath() );

The call to URI.getPath() should give you a non-percent encoded URL, and if enc.canonicalize() detects double-encoding after that stage then you really DO have a double-encoded string and should inform the caller that you will only accept single-encoded URL strings. The URI.getPath() is smart enough to use decoding rules for each part of the URL string.

If its still giving you some trouble, the API reference has other methods that will extract other parts of the URL, in the event that you need to do different things with different parts of the URL. IF you ever need to manually parse parameters on a GET request for example, you can actually just have it return the query string itself--and it will have done a decoding pass on it.

=============JUNIT Test Case============

package org.owasp.esapi;

import java.net.URI;
import java.net.URISyntaxException;

import org.junit.Test;

public class TestURLValidation {

    @Test
    public void test() throws URISyntaxException {
        Encoder enc = ESAPI.encoder();
        String input = "http://example.com/alpha?abc=def&phil=key%3dbdj";
        URI dirtyURI = new URI(input);
        enc.canonicalize(dirtyURI.getQuery());
        
    }

}

=================Answer for updated question=====================

There's no way around it: Encoder.canonicalize() is intended to reduce escaped character sequences into their reduced, native-to-Java form. URLs are most likely considered a special case so they were most likely deliberately excluded from consideration. Here's the way I would handle your case--without a whitelist, and it will guarantee that you are protected by Encoder.canonicalize().

Use the code above to get a URI representation of your input.

Step 1: Canonicalize all of the URI parts except URI.getQuery() Step 2: Use a library parser to parse the query string into a data structure. I would use httpclient-4.3.3.jar and httpcore-4.3.3.jar from commons. You'll then do something like this:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Iterator;
import java.util.List;

import javax.ws.rs.core.UriBuilder;

import org.apache.http.client.utils.URLEncodedUtils;
import org.junit.Test;
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Encoder;

public class TestURLValidation
{

  @Test
  public void test() throws URISyntaxException {
    Encoder enc = ESAPI.encoder();
    String input = "http://example.com/alpha?abc=def&phil=key%3dbdj";
    URI dirtyURI = new URI(input);
    UriBuilder uriData = UriBuilder.fromUri(enc.canonicalize(dirtyURI.getScheme()));
    uriData.path(enc.canonicalize(enc.canonicalize(dirtyURI.getAuthority() + dirtyURI.getPath())));
    println(uriData.build().toString());
    List<org.apache.http.NameValuePair> params = URLEncodedUtils.parse(dirtyURI, "UTF-8");
    Iterator<org.apache.http.NameValuePair> it = params.iterator();
    while(it.hasNext()) {
      org.apache.http.NameValuePair nValuePair = it.next();
      uriData.queryParam(enc.canonicalize(nValuePair.getName()), enc.canonicalize(nValuePair.getValue()));
    }
    String canonicalizedUrl = uriData.build().toString();
    println(canonicalizedUrl);
  }

  public static void println(String s) {
    System.out.println(s);
  }
  
}

What we're really doing here is using standard libraries to parse the inputURL (thus taking all the burden off of us) and then canonicalizing the parts after we've parsed each section.

Please note that the code I've listed won't work for all url types... there are more parts to a URL than scheme/authority/path/queries. (Missing is the possibility of userInfo or port, if you need those, modify this code accordingly.)