Boilerpipe to extract non-English news articles

Question 1

You've clearly been able to get ArticleExtractor to parse UTF-8 text. The (likely) problem is that boilerpipe's algorithms are specifically tailored to English and aren't working so well on a Gujarati (?) article. The algorithms use the verbosity of phrases (e.g. the number of words per phrase) as well as some specific phrases ("comments", "have your say", etc.) to determine the boundaries of the article, and which pieces within the article are content or non-content.

Have a look in the boilerpipe/filters/english directory of the library for more info on the algorithms. Unfortunately, to get the same level of accuracy in non-English languages, you would need to repeat their study on each language, or have a list of translated stop words and an idea about verbosity for each language you use.
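To make the language dependence concrete, here is a toy sketch of the kind of heuristic described above. It is not boilerpipe's implementation: the class name BlockHeuristicSketch, the LanguageProfile type, the threshold of 8 words, and the marker phrases are all assumptions made up for illustration. It only shows why a word-count threshold and a phrase list tuned for English would need to be re-derived for another language.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BlockHeuristicSketch {

    // Hypothetical per-language settings; the values are illustrative, not measured.
    static final class LanguageProfile {
        final int minWordsPerBlock;
        final Set<String> endOfArticleMarkers;

        LanguageProfile(int minWordsPerBlock, Set<String> endOfArticleMarkers) {
            this.minWordsPerBlock = minWordsPerBlock;
            this.endOfArticleMarkers = endOfArticleMarkers;
        }
    }

    // Counts whitespace-separated tokens. This already assumes a script
    // that separates words with spaces, which is itself language specific.
    static int countWords(String block) {
        String trimmed = block.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // True if the block contains a phrase that usually follows the article body.
    static boolean isEndMarker(String block, LanguageProfile profile) {
        String lower = block.toLowerCase(Locale.ROOT);
        for (String marker : profile.endOfArticleMarkers) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Keeps sufficiently wordy blocks and stops at the first end-of-article marker.
    static String extract(List<String> blocks, LanguageProfile profile) {
        StringBuilder content = new StringBuilder();
        for (String block : blocks) {
            if (isEndMarker(block, profile)) {
                break;
            }
            if (countWords(block) >= profile.minWordsPerBlock) {
                content.append(block.trim()).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        LanguageProfile english =
                new LanguageProfile(8, Set.of("comments", "have your say"));
        List<String> blocks = List.of(
                "Home | Sport | Weather",
                "The wrestlers met in the final round on Saturday evening in front of a large crowd.",
                "Have your say",
                "This block comes after the marker and is dropped even though it is long enough to pass.");
        System.out.print(extract(blocks, english));
    }
}
```

With the English profile, only the sentence about the wrestlers is printed. Feed the same profile a page in another language and neither the threshold nor the marker phrases are likely to match, which is the failure mode described above.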

Question 2

First, the accepted answer is correct: boilerpipe's algorithms are specifically tailored to English. That does not mean it cannot return rough content in other languages, however. Please read the complete accepted answer; the approach below can be a crapshoot, and you may not always get good content.

Java:

import java.io.InputStream;
import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeTest {

    public static void main(String[] args) {
        // some wrestling match in Russian from Russian newspaper
        // try-with-resources closes the network stream once extraction finishes
        try (InputStream in = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/").openStream()) {
            // Tell the parser explicitly that the page bytes are UTF-8
            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(in);

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Next, if you are using Eclipse:

Click Run > Run Configurations, select the Common tab, set Encoding to Other (UTF-8), then click Run.
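If you would rather not depend on the IDE setting for how the program encodes its output, one option is to wrap System.out in a PrintStream that always writes UTF-8. This is a minimal sketch: the class name Utf8ConsoleSketch and the helper utf8 are made up for illustration, and the console or terminal still has to be able to display UTF-8 for the text to look right.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleSketch {

    // Wraps any output stream in an auto-flushing PrintStream that encodes as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8(System.out);
        // Unicode escapes spell a short Russian word, so the source file
        // compiles the same way whatever encoding javac assumes.
        console.println("\u041f\u0440\u0438\u0432\u0435\u0442");
    }
}
```

In the extraction example you would then call console.println(text) instead of System.out.println(text).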