Question

'Long time reader, first time poster' here.

I'm in the process of making a bot for a Spanish wiki I administer. I wanted to write it from scratch, since one of my reasons for making it is to practice Java. However, I ran into trouble when making GET requests with HttpClient to URIs that contain non-ASCII characters such as á, é, í, ó or ú.

String url = "http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categoría:Mejoras de las Botas";
method = new GetMethod(url);
client.executeMethod(method);

When I do the above, GetMethod complains about the URI:

Exception in thread "main" java.lang.IllegalArgumentException: Invalid uri 'http://es.pruebaloca.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categoría:Mejoras%20de%20las%20Botas&cmlimit=500&format=xml': Invalid query
    at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
    at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
    at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:69)
    at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:120)
    at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:38)
    at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
    at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)

Note that in the URI shown in the stack trace, the spaces are encoded as %20 and the ís are left as they are. That exact same URI works perfectly in a browser, but I can't get GetMethod to accept it.

I've also tried doing the following:

URI uri = new URI(url, false);
method = new GetMethod(uri.getEscapedURI());
client.executeMethod(method);

This way, URI escaped the ís, but double-escaped the spaces (%2520):

http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categor%C3%ADa:Mejoras%2520de%2520las%2520Botas&cmlimit=500&format=xml

Now, if I don't use any spaces in the query, there's no double escaping and I get the desired output. So if there wasn't any possibility of non-ASCII characters, I wouldn't need to use the URI class and wouldn't get the double escaping. In an attempt to avoid the first escaping of the spaces, I tried this:

URI uri = new URI(url, true);
method = new GetMethod(uri.getEscapedURI());
client.executeMethod(method);

But the URI class didn't like it:

org.apache.commons.httpclient.URIException: Invalid query
    at org.apache.commons.httpclient.URI.parseUriReference(URI.java:2049)
    at org.apache.commons.httpclient.URI.<init>(URI.java:167)
    at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:66)
    at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:121)
    at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:38)
    at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
    at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:39)
    at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
    at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)

Any input on how to avoid this double escaping would be greatly appreciated. I've searched all around with absolutely no luck.

Thanks!

Edit: The solution that works best for me is parsifal's one, but, as an addition, I'd like to say that setting the path with method.setPath(url) made HttpMethod reject a cookie I needed to save:

Aug 26, 2011 4:07:08 PM org.apache.commons.httpclient.HttpMethodBase processCookieHeaders
WARNING: Cookie rejected: "wikicities_session=900beded4191ff880e09944c7c0aaf5a". Illegal path attribute "/". Path of origin: "http://es.metroid.wikia.com/api.php"

However, if I send the URI to the constructor and forget about the setPath(url), the cookie gets saved without problem.

String url = "http://es.metroid.wikia.com/api.php";
NameValuePair[] query = { new NameValuePair("action", "query"), new NameValuePair("list", "categorymembers"),
            new NameValuePair("cmtitle", "Categoría:Mejoras de las Botas"), new NameValuePair("cmlimit", "500"),
            new NameValuePair("format", "xml") };
HttpMethod method = null;

...

method = new GetMethod(url);  // Or PostMethod(url)
method.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY); // It had been like this the whole time
method.setQueryString(query);
client.executeMethod(method);

Solution

Looking at the documentation of HttpMethodBase, it appears that all String parameters have to be pre-encoded. The simplest solution is to construct your URL in stages, with setPath() and the variant of setQueryString() that takes an array of name-value parameters.
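To make "in stages" concrete without depending on HttpClient, here is a minimal JDK-only sketch of the same idea: keep the base URL and the name-value pairs separate, encode each name and value on its own, and only then join them. The class StagedUrlBuilder and its methods are hypothetical names for illustration, not part of HttpClient, and the sketch assumes form-style encoding (spaces become +) is acceptable to the server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Commons HttpClient.
class StagedUrlBuilder {

    // Form-encodes a single name or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Appends the encoded pairs to the base URL, which is left untouched.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            sb.append(separator)
              .append(encode(entry.getKey()))
              .append('=')
              .append(encode(entry.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    // Builds the query from the question; the accented i is written as a
    // Unicode escape so the result does not depend on source file encoding.
    static String demo() {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("action", "query");
        params.put("list", "categorymembers");
        params.put("cmtitle", "Categor\u00eda:Mejoras de las Botas");
        params.put("cmlimit", "500");
        params.put("format", "xml");
        return buildUrl("http://es.metroid.wikia.com/api.php", params);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the separators ?, & and = are added after encoding, they are never escaped themselves, and no value is encoded twice.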

Other tips

I would recommend using URLEncoder to encode your query string values (not the whole query string).

URLEncoder.encode("Categoría:Mejoras de las Botas", "UTF-8");
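As a rough illustration of why only the values should go through the encoder, the sketch below compares encoding one value with encoding the whole query string. The class and method names are made up for this example. Note that java.net.URLEncoder produces form encoding, so spaces come out as + rather than %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; only java.net.URLEncoder is a real API here.
class EncodeValueNotQuery {

    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Only the value is encoded; the separators stay intact.
    static String valueOnly() {
        return "action=query&cmtitle=" + enc("Categor\u00eda:Mejoras de las Botas");
    }

    // The whole query is encoded, so '=' and '&' are escaped too.
    static String wholeQuery() {
        return enc("action=query&cmtitle=Categor\u00eda:Mejoras de las Botas");
    }

    public static void main(String[] args) {
        // action=query&cmtitle=Categor%C3%ADa%3AMejoras+de+las+Botas
        System.out.println(valueOnly());
        // action%3Dquery%26cmtitle%3DCategor%C3%ADa%3AMejoras+de+las+Botas
        System.out.println(wholeQuery());
    }
}
```

In the second result the server would see a single oddly named parameter instead of action and cmtitle.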

Why don't you try adding the params as NameValuePair? The problem here is that when you escape the URL, everything in the URL is escaped, including things like http://, which is why the system is complaining.
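The %2520 from the question can be reproduced in isolation. The sketch below uses the JDK's java.net.URI, not the Commons HttpClient URI class from the question, so it only illustrates the general mechanism: an escaper that is handed an already-escaped string treats the % in %20 as a literal character and escapes it to %25. The class DoubleEscapeDemo is a hypothetical name.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; uses java.net.URI from the JDK, which is a
// different class from org.apache.commons.httpclient.URI.
class DoubleEscapeDemo {

    // The multi-argument URI constructors quote illegal characters in each
    // component and always quote '%'; toASCIIString() then percent-encodes
    // any remaining non-ASCII characters as UTF-8.
    static String toAscii(String query) {
        try {
            URI uri = new URI("http", "es.metroid.wikia.com", "/api.php", query, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Raw spaces are escaped exactly once:
        // ...?cmtitle=Categor%C3%ADa:Mejoras%20de%20las%20Botas
        System.out.println(toAscii("cmtitle=Categor\u00eda:Mejoras de las Botas"));

        // Pre-escaped spaces are escaped again:
        // ...?cmtitle=Categor%C3%ADa:Mejoras%2520de%2520las%2520Botas
        System.out.println(toAscii("cmtitle=Categor\u00eda:Mejoras%20de%20las%20Botas"));
    }
}
```

Either pass fully raw text to something that escapes, or fully escaped text to something that does not; mixing the two is what produces the double escaping.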

You can also escape just the argument values using URLEncoder.encode(): pass each GET parameter value to it and append the return value to the URL. Don't pass the whole query string, or the = and & separators will be escaped as well.

String url = "http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle=" + URLEncoder.encode("Categoría:Mejoras de las Botas", "UTF-8");

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow