Question

I am trying to use boilerpipe to extract news articles from non-English text. I have already seen this, and it's not working for me. I made the following changes: 1) Modified HTMLFetcher.java by adding the following lines just before the end of the fetch method:

byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line: convert the bytes to UTF-8
cs = Charset.forName("UTF-8"); // set the charset to UTF-8

And/or 2) changed my calling code to set the charset explicitly on the InputSource:

URL url = new URL(urls);
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

text = ArticleExtractor.INSTANCE.getText(is);

It still did not work.

Test URL: http://www.sandesh.com/article.aspx?newsid=2905443

Text: મુંબઈ, 30 જાન્યુઆરી

સલમાન ખાને ગુજરાતમાં આવીને નરેન્દ્ર મોદીના વખાણ શુ કર્યા તેની મુસીબતોમાં ખૂબ વધારો થઈ ગયો છે. સલમાન ખાન ફિલ્મ 'જય હો'ના પ્રમોશન માટે ઉત્તરાયણમાં અમદાવાદ આવ્યા હોવાથી અને તે સમયે તેણે નરેન્દ્ર મોદીના વખાણ કર્યા હોવાથી કોંગ્રેસ દ્વારા મુસ્લિમોને તેની ફિલ્મ 'જય હો' ના જોવાની અરજી કરવામાં આવી હતી અને હવે મુસ્લિમ મૌલવીઓ દ્વારા તેના સામે ફતવો જાહેર કરી દેવામાં આવ્યો છે.

Please help me.


Solution

You've clearly been able to get ArticleExtractor to parse UTF-8 text. The (likely) problem is that boilerpipe's algorithms are specifically tailored to English and don't work as well on a Gujarati (?) article. The algorithms use the verbosity of phrases (e.g., the number of words per phrase) as well as some specific phrases ("comments", "have your say", etc.) to determine the boundaries of the article and which pieces within it are content or non-content.

Have a look in the boilerpipe/filters/english directory of the library for more info on the algorithms. Unfortunately, to get the same level of accuracy in non-English languages, you would need to repeat their study for each language, or have a list of translated stop words and an idea of the typical verbosity for each language you use.
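
To make the idea concrete, here is a minimal sketch of a word-count heuristic of the kind described above. It is not boilerpipe's actual code: the class `BlockHeuristic`, its methods, and the thresholds passed in are all hypothetical and chosen only for illustration. It simply shows how a rule based on words per block and link density depends on numeric thresholds, which is why values tuned on English may need re-tuning for another language.

```java
final class BlockHeuristic {

    // Counts whitespace-separated tokens. This only approximates "words"
    // for scripts that separate words with spaces.
    static int countWords(String block) {
        String trimmed = block.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Hypothetical rule: a block looks like article content when it has at
    // least minWords words and at most maxLinkDensity of them sit inside links.
    static boolean looksLikeContent(String block, int linkedWords,
                                    int minWords, double maxLinkDensity) {
        int words = countWords(block);
        if (words < minWords) {
            return false;
        }
        double linkDensity = (double) linkedWords / words;
        return linkDensity <= maxLinkDensity;
    }

    public static void main(String[] args) {
        // Thresholds (10 words, 0.33 link density) are illustrative assumptions.
        String navigation = "Home About Contact";
        String paragraph = "This sentence stands in for a longer paragraph of article text here";
        System.out.println(looksLikeContent(navigation, 3, 10, 0.33));
        System.out.println(looksLikeContent(paragraph, 0, 10, 0.33));
    }
}
```

A language whose typical sentences are shorter or longer than English ones would need a different `minWords`, which is the re-tuning effort the answer refers to.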

Other tips

First, the accepted answer is correct: boilerpipe's algorithms are specifically tailored to English. However, that does not mean it cannot return rough content in other languages. Please read the complete accepted answer; the approach below can be a crapshoot, and you may not always get good content.

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeTest {

    public static void main(String[] args) {
        try{
            //some wrestling match in Russian from Russian newspaper
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}
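
Hard-coding `UTF-8` only helps when the page really is served as UTF-8. One possible alternative is to read the charset from the HTTP `Content-Type` header and decode the bytes yourself before handing the text to the extractor. The sketch below uses only the JDK; `CharsetSniffer` and its methods are hypothetical helper names, and falling back to UTF-8 when no charset is declared is an assumption, not something boilerpipe prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {

    // Matches the charset parameter in a header such as "text/html; charset=utf-8".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 (an assumed default) when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw response body using the charset named in the header.
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType));
    }

    public static void main(String[] args) {
        // The first Gujarati word of the question's sample text, written with
        // Unicode escapes so the source file's own encoding does not matter.
        byte[] body = "\u0AAE\u0AC1\u0A82\u0AAC\u0A88".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "text/html; charset=utf-8"));
    }
}
```

In real use the header value would come from `URLConnection.getContentType()`, and the decoded string could then be wrapped in a `StringReader` and passed to the extractor through an `InputSource`, so that no further byte-to-character conversion is involved.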

Next, if you are using Eclipse:

Click Run > Run Configurations, select the Common tab, set Encoding to Other (UTF-8), then click Run.
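
If changing the IDE setting is not an option, a similar effect can be approximated in code by wrapping the output stream in a `PrintStream` with an explicit encoding. This is a sketch using only the JDK; `Utf8Console` is a hypothetical name, and the console or terminal must still be able to display UTF-8 for the text to render correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {

    // Wraps any output stream so that text written to it is encoded as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // Gujarati sample word, written with Unicode escapes.
        out.println("\u0AAE\u0AC1\u0A82\u0AAC\u0A88");
    }
}
```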


License: CC-BY-SA with attribution
Not affiliated with StackOverflow