Question

I am currently trying to get HttpComponents to send HttpRequests and retrieve the response. On most URLs this works without a problem, but when I try to fetch a phpBB forum, namely http://www.forum.animenokami.com, the client takes much longer and the response entity contains some passages more than once, resulting in a broken HTML file.

For example, the meta tags appear six times. Since many other URLs work, I can't figure out what I am doing wrong. The page displays correctly in common browsers, so it is not a problem on their side.

Here is the code I use to send and receive.

    URI uri1 = new URI("http://www.forum.animenokami.com");
    HttpGet get = new HttpGet(uri1);
    get.setHeader(new BasicHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"));
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse response = httpClient.execute(get);
    HttpEntity ent = response.getEntity();
    InputStream is = ent.getContent();
    BufferedInputStream bis = new BufferedInputStream(is);
    byte[] tmp = new byte[2048];
    int l;
    String ret = "";
    while ((l = bis.read(tmp)) != -1){
        ret += new String(tmp);
    }

I hope you can help me. If you need any more information, I will try to provide it as soon as possible.


Solution

This code is completely broken:

String ret = "";
while ((l = bis.read(tmp)) != -1){
    ret += new String(tmp);
}

Three things:

  • This is converting the whole buffer into a string on each iteration, regardless of how much data has been read. (I suspect this is what's actually going wrong in your case.)
  • It's using the default platform encoding, which is almost never a good idea.
  • It's using string concatenation in a loop, which leads to poor performance.

Fortunately you can avoid all of this very easily using EntityUtils:

String text = EntityUtils.toString(ent);

That will use the appropriate character encoding specified in the response, if any, or ISO-8859-1 otherwise. (There's another overload which allows you to specify which character encoding to use if it's not specified.)
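For example, if the server doesn't declare a charset and you'd rather fall back to UTF-8 than ISO-8859-1, the overload looks roughly like this (a sketch; the UTF-8 fallback here is my assumption, not something the response guarantees):

// Uses the charset from the response's Content-Type if present;
// falls back to UTF-8 (assumed here) instead of ISO-8859-1 when it's absent.
String text = EntityUtils.toString(ent, "UTF-8");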

It's worth understanding what's wrong with your original code though rather than just replacing it with the better code, so that you don't make the same mistakes in other situations.
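For comparison, a manual version of the read loop that addresses all three points could look roughly like this (a sketch only; UTF-8 is assumed here, and in real code you would take the charset from the response's Content-Type header):

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] tmp = new byte[2048];
int l;
while ((l = bis.read(tmp)) != -1) {
    // Copy only the l bytes actually read in this iteration, not the whole buffer.
    baos.write(tmp, 0, l);
}
// Decode once, with an explicit charset, instead of concatenating
// platform-decoded partial buffers inside the loop.
String ret = baos.toString("UTF-8");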

OTHER TIPS

"It works fine but what I don't understand is why I see the same text multiple times only on this URL."

It will be because your client is seeing more incomplete buffers when it reads the socket. That could be:

  • because there is a network bandwidth bottleneck on the route from the remote site to your client,
  • because the remote site is doing some unnecessary flushes, or
  • some other reason.

The point is that your client must pay close attention to the number of bytes read into the buffer by the read call; otherwise it will end up inserting junk. Network streams in particular are prone to not filling the buffer.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow