Question

I'm writing a small crawler for English-only sites, and it fetches each page by opening a URL connection. I set the encoding to UTF-8 both on the request and on the InputStreamReader, but I still get gobbledygook for some requests, while others work fine.

The code below reflects all the research I did and the advice I found. I have also tried changing URLConnection to HttpURLConnection, with no luck. Some of the returned strings still look like this:

??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??@

What am I missing?

My code:

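import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public static String getDocumentFromUrl(String urlString) throws Exception {
    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();
    // Content-Type describes a request body; a plain GET has none, so this line does not affect the response.
    conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    conn.setRequestProperty("Accept-Charset", "utf-8");
    conn.setConnectTimeout(60 * 1000); // give up connecting after 60 seconds
    conn.setReadTimeout(60 * 1000);    // give up on a stalled read after 60 seconds

    // Start from an empty builder: appending to a null String would prefix the result with "null".
    StringBuilder wholeDocument = new StringBuilder();

    // try-with-resources closes the reader and the underlying stream exactly once.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // readLine() strips the line terminator, so put one back.
            wholeDocument.append(inputLine).append('\n');
        }
    }

    return wholeDocument.toString();
}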

Solution

The server is sending the document GZIP-compressed. You can set the Accept-Encoding HTTP header to ask it to send the document uncompressed:

conn.setRequestProperty("Accept-Encoding", "identity");

However, the HTTP client class handles GZIP compression for you, so you shouldn't have to worry about details like this. What seems to be going on here is that the server is buggy: it compresses the body but does not send the Content-Encoding header that would tell the client the content is compressed. This behavior seems to depend on the User-Agent, so the site works in regular web browsers but breaks when accessed from Java. Setting the user agent therefore also fixes the issue:

conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
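
If neither header helps, or a crawler has to cope with servers that compress without announcing it, one defensive option is to look at the first two bytes of the body: a GZIP stream always starts with 0x1f 0x8b. The following is only a minimal sketch of that idea, not part of the original answer. The class name GzipSniffer and the helpers maybeGunzip, readAll and gzip are hypothetical names chosen for illustration, and the demo feeds in-memory bytes instead of a real connection.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipSniffer {

    // Returns a stream that yields decompressed bytes if the input starts
    // with the GZIP magic number (0x1f 0x8b), and the input unchanged otherwise.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start so the peeked bytes can be re-read
        int b1 = in.read();
        int b2 = in.read();
        in.reset();            // rewind to the marked position
        boolean looksGzipped = (b1 == 0x1f && b2 == 0x8b);
        return looksGzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole stream and decodes it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Demo-only helper: compresses a string so there is something to sniff.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<html>compressed page</html>");
        byte[] plain = "<html>plain page</html>".getBytes(StandardCharsets.UTF_8);

        // Both bodies come out readable, with or without compression.
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(compressed))));
        System.out.println(readAll(maybeGunzip(new ByteArrayInputStream(plain))));
    }
}
```

In the crawler, the same check could be applied by passing conn.getInputStream() to maybeGunzip before wrapping it in the InputStreamReader. The trade-off is that the body is sniffed rather than trusted to the headers, which is only appropriate as a fallback for misbehaving servers.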
Licensed under: CC-BY-SA with attribution