Need help getting the HTML of a website in Java

https://stackoverflow.com/questions/3406289

25-09-2019
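Question

I took some code from "java httpurlconnection cuts off html" and I am using pretty much the same code to fetch the HTML of websites. It works everywhere except for one particular website:

I am trying to get the HTML from this website: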

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting garbage characters, although the same code works fine with any other website, such as http://www.google.com.

This is the code I am using:

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n"); 
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work for the URL I mentioned above.

Any help would be appreciated.

Solution

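The website incorrectly gzips the response regardless of what the client supports. Normally a server should gzip the response only when the client says it supports it (via Accept-Encoding: gzip). The "garbage characters" are the compressed bytes being decoded as text, so you need to un-gzip the response using GZIPInputStream.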

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));
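The one-liner above assumes the body is always gzipped, and GZIPInputStream will throw on a response that is not. A minimal sketch of a more defensive variant follows, assuming it is acceptable to peek at the first two bytes of the body and look for the gzip magic number (0x1f 0x8b). The class name GzipAwareReader and the methods maybeGunzip and readAll are hypothetical names made up for this illustration, and UTF-8 is assumed as the charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original answer.
class GzipAwareReader {

    // Wraps the stream in a GZIPInputStream only if it starts with the
    // gzip magic bytes 0x1f 0x8b; otherwise returns the bytes unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int first = in.read();
        if (first == -1) {
            return in; // empty body, nothing to decompress
        }
        int second = in.read();
        // Push the peeked bytes back in reverse order so they are read again.
        if (second != -1) {
            in.unread(second);
        }
        in.unread(first);
        boolean gzipped = first == 0x1f && second == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole (possibly gzipped) stream as UTF-8 text, one line at a time.
    static String readAll(InputStream raw) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzipped response body in memory instead of hitting the network.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.print(readAll(new ByteArrayInputStream(compressed.toByteArray())));
    }
}
```

With a real connection, the call would be readAll(connection.getInputStream()). Checking whether connection.getContentEncoding() equals "gzip" is an alternative to sniffing the bytes, provided the server sets that header correctly.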

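Note that I also added the appropriate charset to the InputStreamReader constructor; without it, the platform default charset is used. Normally you would want to extract it from the Content-Type header of the response.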
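Extracting the charset from the Content-Type header could look roughly like this. It is a minimal sketch, assuming the header has the usual "type/subtype; charset=name" shape and that falling back to UTF-8 is acceptable when no usable charset is present. ResponseCharset and charsetFrom are hypothetical names made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of the original answer.
class ResponseCharset {

    // Returns the charset named in a Content-Type header value such as
    // "text/html; charset=ISO-8859-1", or UTF-8 (an assumed default) if the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFrom(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String trimmed = param.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = trimmed.substring("charset=".length())
                        .replace("\"", "")
                        .trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFrom("text/html"));
    }
}
```

The result can be passed to the reader in place of the hard-coded "UTF-8", for example new InputStreamReader(stream, ResponseCharset.charsetFrom(connection.getContentType())).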

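For more hints, see also "How to use URLConnection to fire and handle HTTP requests?" If all you ultimately want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.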

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow