java.util.Scanner and Wikipedia

https://stackoverflow.com/questions/538999

22-08-2019

Question

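I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. It mostly works, but for some words it gives me errors. Looking at the code and running some checks, it turns out that for some words the encoding does not seem to be recognized, or something similar, and the content is no longer readable. This is the code used to fetch the page: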

// Begin

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    // "\\Z" matches the end of input, so next() returns the whole page
    scanner.useDelimiter("\\Z");
    content = scanner.next();
//  if(word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
// End

The problem shows up with words such as "pubblico" on the Italian Wikipedia. The output for the word pubblico looks like this (truncated): 我¿我¿½]Ksr>我¿½-E 我¿½1Aï¿验¿验¿½Eï¿½ER3tHZï¿½4vï¿验¿½&PZjtcï¿½¿验¿½Dï¿½7_|i¿验¿验¿验¿½=8ï¿验¿½Ø}

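Do you have any idea why? Looking at the page source and the headers, they are the same as for other pages, with the same encoding...

It turns out that the content is gzipped. Can I tell Wikipedia not to send me the page compressed, or is decompressing it the only way? Thank you.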

Solution

Try using a Reader instead of an InputStream. I think it works like this:

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
// e.g. "text/html; charset=UTF-8"; null if the server sends no Content-Type
String ctype = connection.getContentType();
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi >= 0)
    // Decode with the charset named in the Content-Type header
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
else
    // No charset declared: fall back to the platform default charset
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);

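You can also pass the charset directly to the Scanner constructor, as another answer points out.

Other tips

Try using a Scanner with a specified character set: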

public Scanner(InputStream source, String charsetName)

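For the default constructor:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com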
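As a rough illustration of why the charset argument matters, the sketch below decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. The class name ScannerCharsetDemo, the helper readAll, and the sample text are made up for this example; an in-memory stream stands in for the connection's input stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not part of the original question or answers.
class ScannerCharsetDemo {

    // Reads the whole stream as one string, decoding bytes with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\Z"); // "\\Z" = end of input, so next() returns everything
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "citta pubblica" with an accented a, written as an escape to avoid source-encoding issues
        String text = "citt\u00e0 pubblica";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        String good = readAll(new ByteArrayInputStream(utf8), "UTF-8");
        String bad = readAll(new ByteArrayInputStream(utf8), "ISO-8859-1");

        System.out.println("UTF-8 matches original:      " + good.equals(text));
        System.out.println("ISO-8859-1 matches original: " + bad.equals(text));
    }
}
```

With the wrong charset, the two bytes of the accented letter are decoded as two separate characters, which is the typical "wrong encoding" symptom. Note that this kind of damage is different from reading a compressed body as text, where nearly everything is unreadable.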

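You need to use a URLConnection so that you can inspect the content-type header in the response. This should tell you which character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the content-type header.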
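Extracting that parameter could look roughly like the following. This is only a sketch: the class ContentTypeCharset and the method charsetOf are hypothetical names, and the fallback charset is a choice the caller has to make, since the header may be missing or may name a charset the JVM does not support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original answers.
class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The value passed in would come from connection.getContentType(), and the result can be handed to an InputStreamReader wrapped around the connection's input stream.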

To inhibit gzip compression, set the Accept-Encoding header to "identity". See the HTTP specification for more information.
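Setting that header might be sketched as follows. The class IdentityRequest and the method openUncompressed are hypothetical names; the header must be set before the connection is actually opened, and whether a given server honours it is not guaranteed, so checking getContentEncoding() on the response is still advisable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of the original answers.
class IdentityRequest {

    // Prepares (but does not yet open) a connection that asks the server
    // for an uncompressed response body.
    static URLConnection openUncompressed(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            // "identity" means "no content coding"; the server may still ignore it
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = openUncompressed("http://it.wikipedia.org/wiki/pubblico");
        System.out.println("Accept-Encoding: " + c.getRequestProperty("Accept-Encoding"));
    }
}
```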

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding does not change. Why?

// Uses java.io.InputStream, java.util.Scanner and
// java.util.zip.{GZIPInputStream, Inflater, InflaterInputStream}
connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;            // Stream the downloaded page is read from
String encoding = connection.getContentEncoding();  // Content-Encoding of the response (identity, gzip, deflate)
// Choose the appropriate decompressor for reading the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    // Inflater(true) expects raw deflate data without a zlib header
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that reads the whole page from the stream into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = scanner.next();

So it works!
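For anyone who wants to try the decompress-then-decode idea without a network connection, here is a small in-memory sketch. The class GzipDecodeDemo and its methods are made up for this example, the body is assumed to be UTF-8 text, and only the "gzip" content encoding is handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original question or answers.
class GzipDecodeDemo {

    // Compresses text the way a server sending "Content-Encoding: gzip" would.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompresses the body if needed, then decodes it as UTF-8 (an assumption here).
    static String decode(byte[] body, String contentEncoding) {
        try {
            InputStream in = new ByteArrayInputStream(body);
            if ("gzip".equalsIgnoreCase(contentEncoding)) {
                in = new GZIPInputStream(in);
            }
            try (Scanner scanner = new Scanner(in, "UTF-8")) {
                scanner.useDelimiter("\\Z"); // read everything up to the end of input
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<p>pubblico \u00e8 una parola</p>";
        byte[] body = gzip(page);
        System.out.println("decompressed matches: " + decode(body, "gzip").equals(page));
        System.out.println("raw bytes match:      " + decode(body, null).equals(page));
    }
}
```

Reading the compressed bytes directly as text (the second call) cannot reproduce the page, which matches the unreadable output described in the question.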

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow