java.util.Scanner and Wikipedia

https://stackoverflow.com/questions/538999

22-08-2019

Question

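I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. Mostly it all works fine, but reading certain words gives errors. After looking at the code and running some checks, it seems that for some words the encoding is not recognized and the content is no longer readable. This is the code used to fetch the page: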

// -Start-

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z");   // "\\Z" matches end of input, so next() returns the whole page
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
// -End-

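The problem occurs with the word "Pubblico" on the Italian Wikipedia. The result of the println for the word pubblico looks like this: ½dï¿½7_ | ï¿½ï¿½ï¿½ = 8ï¿½ï¿½ø}

Do you have any idea why? I looked at the page source, and the headers are the same, with the same encoding...

It turns out the content is gzipped. Can I tell Wikipedia not to gzip its pages, or is decompressing it the only way? Thank you.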

Solution

Try using a Reader instead of an InputStream. I think it works like this:

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
String ctype = connection.getContentType();
int csi = ctype.indexOf("charset=");
Scanner scanner;
if (csi > 0)
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
else
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = scanner.next();
if(word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: "+ word);

As shown in the other answers, you can also pass the charset to the Scanner constructor.
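The indexOf/substring lookup above can misfire when the Content-Type header is missing, when the charset value is quoted, or when another parameter follows it. Below is a minimal sketch of a more defensive lookup. The class name ContentTypeCharset, the helper charsetOf, and the UTF-8 fallback are assumptions made for illustration; they are not part of the original answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or UTF-8 (an assumed fallback) when none is given or the name is unusable.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The result could then be passed to the reader, for example new InputStreamReader(connection.getInputStream(), ContentTypeCharset.charsetOf(connection.getContentType())).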

Other tips

Try using a Scanner with a specified character set:

public Scanner(InputStream source, String charsetName)

For the default constructor:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner at java.sun.com
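To illustrate why the charset argument matters, here is a small sketch that decodes the same UTF-8 bytes twice, once with the right charset and once with the wrong one. The class name ScannerCharsetDemo and the helper readAll are made up for this example, and an in-memory byte array stands in for the network stream.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ScannerCharsetDemo {
    // Hypothetical helper: reads all of the given bytes as text in the named charset.
    static String readAll(byte[] data, String charsetName) {
        try (Scanner scanner = new Scanner(new ByteArrayInputStream(data), charsetName)) {
            scanner.useDelimiter("\\Z");   // consume everything up to end of input
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "pubblicità" written with a Unicode escape so the source file stays ASCII
        byte[] utf8 = "pubblicit\u00e0".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(utf8, "UTF-8"));       // decoded correctly
        System.out.println(readAll(utf8, "ISO-8859-1"));  // accented letter comes out garbled
    }
}
```

Note that a wrong charset only mangles non-ASCII characters. Output that is unreadable from start to finish, as in the question, points to compressed bytes rather than a charset mismatch.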

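You should use a URLConnection, so that you can inspect the Content-Type header in the response. This should tell you which character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the Content-Type header.

To suppress gzip compression, set the Accept-Encoding header to "identity". See the HTTP specification for details.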
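A minimal sketch of setting that header might look like the following. The class name IdentityRequest and the helper open are hypothetical; the URL pattern is the one from the question. Opening the connection object does not contact the server, so the header can be set and inspected before any data is requested. Whether the server honors the header is up to the server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class IdentityRequest {
    // Hypothetical helper: prepares a connection that asks for an uncompressed response.
    static URLConnection open(String word) {
        try {
            URLConnection connection =
                    new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
            // "identity" means "no content coding"; must be set before the request is sent
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("pubblico");
        // Only inspects the prepared request; no network traffic happens here
        System.out.println(connection.getRequestProperty("Accept-Encoding"));
    }
}
```

Even with this header set, it seems prudent to check getContentEncoding() on the response and decompress when it reports gzip.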

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding doesn't change. Why?

connection =  new URL("http://it.wikipedia.org/wiki/"+word).openConnection();
//connection.addRequestProperty("Accept-Encoding","");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;       // Stream the downloaded page is read from
String encoding = connection.getContentEncoding();    // Content encoding of the response (identity, gzip, deflate)
// Choose the appropriate decompressor for reading the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that extracts the page from the stream and puts it into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

So it works!!!
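The working snippet above still decodes the decompressed bytes with the platform default charset. A sketch that combines the gzip check with an explicit charset could look like this. The class name GzipAwareReader and its helpers read and gzip are assumptions for illustration, and an in-memory gzip round trip stands in for the Wikipedia response so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipAwareReader {
    // Hypothetical helper: unwraps gzip when the Content-Encoding says so,
    // then decodes the bytes with the given charset.
    static String read(InputStream raw, String contentEncoding, String charsetName) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(raw)
                    : raw;
            try (Scanner scanner = new Scanner(in, charsetName)) {
                scanner.useDelimiter("\\Z");   // read up to end of input
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test fixture only: compresses text the way a gzip-encoding server would.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("perch\u00e9 pubblico");
        System.out.println(read(new ByteArrayInputStream(compressed), "gzip", "UTF-8"));
    }
}
```

With a real connection, the call would presumably be read(connection.getInputStream(), connection.getContentEncoding(), charsetName), where charsetName comes from the Content-Type header.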

License: CC-BY-SA with attribution

Not affiliated with StackOverflow