java.util.Scanner and Wikipedia

https://stackoverflow.com/questions/538999

22-08-2019
Question

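I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. Most of it works, but I get errors when reading some words. Looking into it, the encoding does not seem to be recognized for some words, and the content comes back unreadable. This is the code used to fetch the page: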

// -Begin-

try {
    // The URL literal was split across two lines and had lost the "/" before "wiki/".
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");   // read the whole response as a single token
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
//End

The problem occurs with the word "pubblico" on the Italian Wikipedia. The println output for the word pubblico looks like this (truncated): 世¿ ¿½½Ø}

Any idea why? I looked at the page source and the headers, and the encoding appears to be the same...

It turns out the content is gzip-compressed. Can I tell Wikipedia not to send the page compressed, or is decompressing it the only way? Thanks.

Solution

Try using a Reader instead of an InputStream. I think something like this should work:

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
String ctype = connection.getContentType();
int csi = ctype.indexOf("charset=");
Scanner scanner;
if (csi > 0)
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
else
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if(word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: "+ word);

You could also just pass the charset directly to the Scanner constructor, as shown in another answer.

Other tips

Try using a Scanner with the character set specified:

public Scanner(InputStream source, String charsetName)

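For the constructor without a charset argument, the documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.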

Scanner on java.sun.com: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html#Scanner(java.io.InputStream,%20java.lang.String)
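To illustrate why the charset argument matters, here is a minimal sketch that decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. The class and method names (`CharsetScannerDemo`, `readAll`) are made up for this example, and an in-memory stream stands in for the Wikipedia response.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class CharsetScannerDemo {
    // Hypothetical helper: reads the whole stream as one token,
    // decoding the bytes with the named charset.
    public static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\Z");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "perché è pubblico", written with escapes so the source file encoding does not matter.
        byte[] utf8 = "perch\u00e9 \u00e8 pubblico".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in: accents survive.
        System.out.println(readAll(new ByteArrayInputStream(utf8), "UTF-8"));

        // Decoded with a different charset: each accented letter turns into two wrong characters.
        System.out.println(readAll(new ByteArrayInputStream(utf8), "ISO-8859-1"));
    }
}
```

Note that a wrong charset only damages non-ASCII characters. Output that is garbage from the first byte, as in the question, points to compression rather than to the charset.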

You need to use a URLConnection so that you can inspect the Content-Type header of the response. That tells you which character encoding to use when creating your Scanner.

Specifically, look at the "charset" parameter of the Content-Type header.
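One possible way to pull the "charset" parameter out of a Content-Type value is sketched below. The class and method names are hypothetical, and falling back to a caller-supplied charset when the parameter is missing or unknown is an assumption of this sketch, not something the answer prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=UTF-8", or the fallback if none is usable.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The result could then be passed on as `new Scanner(stream, charset.name())`, with `connection.getContentType()` as the input string.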

To inhibit gzip compression, set the Accept-Encoding header to "identity". See the HTTP specification for details.
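A minimal sketch of setting that header follows. The class and method names are hypothetical. The header is only a request, and a server may still send a compressed body, so checking `getContentEncoding()` on the response remains worthwhile.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

public class IdentityRequest {
    // Hypothetical helper: prepares a connection that asks for an uncompressed body.
    // openConnection() does not contact the server yet, so the header can still be set.
    public static URLConnection open(String pageUrl) {
        try {
            URLConnection connection = new URL(pageUrl).openConnection();
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://it.wikipedia.org/wiki/pubblico");
        // Shows the header that would be sent; no network traffic happens here.
        System.out.println(connection.getRequestProperty("Accept-Encoding"));
    }
}
```

The sketch uses `setRequestProperty` with the literal value "identity" rather than `addRequestProperty` with an empty string, since "identity" is the value the answer names.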

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding doesn't change. Why?

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;              // Stream carrying the downloaded page
String encoding = connection.getContentEncoding();    // Content encoding of the response (identity, gzip, deflate)
// Choose the appropriate decompressor to read the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner to pull the page out of the stream and into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

So, it works!!!
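The working code above builds its Scanner without a charset, so it still decodes with the platform default. A sketch that combines the decompression step with an explicit charset might look like the following. The names `GzipBodyDemo`, `gzip` and `readBody` are invented for this example, only the gzip case is handled, and in-memory data replaces the network call so the behaviour can be checked offline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBodyDemo {
    // Test fixture: compresses text the way a server sending "Content-Encoding: gzip" would.
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: decompresses if the body is gzip-encoded,
    // then decodes the bytes with the given charset.
    public static String readBody(InputStream raw, String contentEncoding, String charsetName) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(raw)
                    : raw;
            try (Scanner scanner = new Scanner(in, charsetName)) {
                scanner.useDelimiter("\\Z");
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = gzip("perch\u00e9 \u00e8 pubblico");
        // With the Content-Encoding honoured, the original text comes back.
        System.out.println(readBody(new ByteArrayInputStream(body), "gzip", "UTF-8"));
    }
}
```

With a real connection, `connection.getInputStream()`, `connection.getContentEncoding()` and the charset from the Content-Type header would take the place of the in-memory values.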

License: CC-BY-SA with attribution

Not affiliated with StackOverflow