Question

My overall goal is to return only clean sentences from a Wikipedia article, without any markup. Obviously there are ways to return JSON, XML, etc., but those responses are still full of markup. My best approach so far has been to request what Wikipedia calls the raw format. For example, the following link returns the raw format for the page "Iron Man":

http://en.wikipedia.org/w/index.php?title=Iron%20Man&action=raw
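
For context (this fetch code is not part of the original question), pulling that raw text down from Java needs nothing beyond the standard library. Below is a minimal sketch; the class and method names are mine, and it requests https:// directly because Wikipedia redirects plain HTTP and HttpURLConnection will not follow a cross-protocol redirect on its own.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RawFetcher {

    // Download the raw wikitext of an article via index.php?action=raw.
    public static String fetchRaw(String title) throws Exception {
        String url = "https://en.wikipedia.org/w/index.php?title="
                + URLEncoder.encode(title, "UTF-8") + "&action=raw";
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = fetchRaw("Iron Man");
        // Print the first few hundred characters as a sanity check.
        System.out.println(raw.substring(0, Math.min(500, raw.length())));
    }
}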

Here is a snippet of what is returned:

...//I am truncating some markup at the beginning here. 
|creative_team_month =
|creative_team_year =
|creators_series =
|TPB =
|ISBN =
|TPB# =
|ISBN# =
|nonUS =
}}
'''Iron Man''' is a fictional character, a [[superhero]] that appears in
[[comic book]]s published by [[Marvel Comics]]. 
...//I am truncating here everything until the end. 

I have stuck to the raw format because I have found it the easiest to clean up. Although what I have written so far in Java cleans this up pretty well, a lot of cases slip by. These include markup for Wikipedia timelines, Wikipedia pictures, and other Wikipedia properties that do not appear on all articles. Again, I am working in Java (specifically, a Tomcat web application).

Question: Is there a better way to get clean, human-readable sentences from Wikipedia articles? Maybe someone already built a library for this which I just can't find?

I will be happy to edit my question to provide details about what I mean by clean and human-readable if it is not clear.

My current Java method for cleaning up the raw-formatted text is as follows:

public String cleanRaw(String input) {
    // The next three lines attempt to get rid of references.
    input = input.replaceAll("<ref>.*?</ref>", "");
    input = input.replaceAll("<ref .*?</ref>", "");
    input = input.replaceAll("<ref .*?/>", "");

    // Remove section headings such as == History ==.
    input = input.replaceAll("==[^=]*==", "");
    // I found that anything between curly braces is not needed.
    // Strip {{...}} templates from the inside out until nothing changes.
    while (input.indexOf("{{") >= 0) {
        int prevLength = input.length();
        input = input.replaceAll("\\{\\{[^{}]*\\}\\}", "");
        if (prevLength == input.length()) {
            break;
        }
    }
    // The next line gets rid of links to other Wikipedia pages,
    // keeping only the link's display text.
    input = input.replaceAll("\\[\\[([^]]*[|])?([^]]*?)\\]\\]", "$2");
    // Remove HTML comments.
    input = input.replaceAll("<!--.*?-->", "");
    // Finally, strip everything except letters, digits, and basic punctuation.
    input = input.replaceAll("[^A-Za-z0-9., ]", "");

    return input;
}
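
As an illustration of the kind of markup that slips by: embedded images such as [[File:Example.jpg|thumb|a caption with a nested [[link]]]] defeat the [^]]-based link regex above, because the caption itself can contain [[...]] pairs. Below is a hedged sketch of an extra pass that could run before the link replacement; the helper name stripFileLinks is mine, not part of the original method.

// Illustrative only: remove [[File:...]] and [[Image:...]] constructs, which may
// contain nested [[...]] links and therefore defeat a simple bracket regex.
private String stripFileLinks(String input) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < input.length()) {
        if (input.startsWith("[[File:", i) || input.startsWith("[[Image:", i)) {
            int depth = 0;
            // Scan forward, tracking nested [[ ]] pairs until the whole construct closes.
            while (i < input.length()) {
                if (input.startsWith("[[", i)) {
                    depth++;
                    i += 2;
                } else if (input.startsWith("]]", i)) {
                    depth--;
                    i += 2;
                    if (depth == 0) {
                        break;
                    }
                } else {
                    i++;
                }
            }
        } else {
            out.append(input.charAt(i));
            i++;
        }
    }
    return out.toString();
}

Calling stripFileLinks(input) at the top of cleanRaw would remove those constructs before the generic [[...]] handling runs.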

Solution

I found a couple of projects that might help. You might be able to use the first one by embedding a JavaScript engine in your Java code; a rough sketch of that approach follows after the links below.

txtwiki.js: a JavaScript library to convert MediaWiki markup to plain text. https://github.com/joaomsa/txtwiki.js

WikiExtractor: a Python script that extracts and cleans text from a Wikipedia database dump. http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

Source: http://www.mediawiki.org/wiki/Alternative_parsers
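
As a rough sketch of the "JavaScript engine" idea (this code is not from the original answer): the JDK's built-in Nashorn engine, available in Java 8 through 14, can evaluate txtwiki.js directly. The entry point shown here, txtwiki.parseWikitext, and the assumption that the library needs no additional scripts loaded first should be checked against the project's README; on newer JDKs you would need GraalJS or a similar engine instead.

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import java.io.FileReader;

public class TxtWikiBridge {

    // Load txtwiki.js into a script engine and call its plain-text converter.
    public static String toPlainText(String wikitext) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        engine.eval(new FileReader("txtwiki.js"));   // evaluate the library source
        Object txtwiki = engine.eval("txtwiki");     // grab the library's global object
        Invocable inv = (Invocable) engine;
        return (String) inv.invokeMethod(txtwiki, "parseWikitext", wikitext);
    }
}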

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow