Question

I am using the https://code.google.com/p/language-detection Java library to detect the language of a given text. I use the profiles that ship with the library. However, the result is sometimes surprisingly different from what I expect. Is something wrong in my code, or should I regenerate the profiles?

I have tried with the line ld.detect("en"); both commented out and uncommented. Does whitespace affect language detection?

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    LanguageDetect ld = new LanguageDetect();
    // Load the profiles that came with the library.
    ld.init("C:\\James\\languageTest\\profiles");
    //ld.detect("en");

    // Run detection on each line of the file; try-with-resources closes the reader.
    try (BufferedReader br = new BufferedReader(new FileReader("C:\\James\\failcases.txt"))) {
        String textCurrentLine;
        while ((textCurrentLine = br.readLine()) != null) {
            System.out.println(ld.detect(textCurrentLine));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Below is what I get for a few words:

Communication - en
Timing - tl
none - it
user - it
No - pt
Yes - fr
user - no
generated - da
Diagnostic - it
not supported - en
supported - en
Bus Speed - en
Protocol - it

Solution

As the library's FAQ states:

Can langdetect handle short texts?

This library requires that a detection text has some length, almost 10-20 words over.

It may return a wrong language for very short text with 1-10 words.

You are running it on one- or two-word texts, which is not the use case this library is built for, so you are going to get wrong results.
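
One way to act on that limit is to check the length of each line before handing it to the detector. The following is only a sketch: the class ShortTextGuard, its methods, and the threshold of 10 words (the lower bound quoted in the FAQ) are assumptions for illustration, not part of the library.

```java
final class ShortTextGuard {
    // Assumed threshold: the lower end of the "10-20 words" range from the FAQ.
    static final int MIN_WORDS = 10;

    /** Counts whitespace-separated tokens; leading and trailing whitespace is ignored. */
    static int countWords(String text) {
        String trimmed = text == null ? "" : text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Returns true when the text has enough words for statistical detection to be worth trying. */
    static boolean isLongEnough(String text) {
        return countWords(text) >= MIN_WORDS;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Timing",
            "not supported",
            "This sentence is long enough that a statistical detector has a fair chance of being right"
        };
        for (String s : samples) {
            System.out.println(countWords(s) + " words, long enough: " + isLongEnough(s) + " -> " + s);
        }
    }
}
```

In the loop from the question, you would call the detector only when isLongEnough(textCurrentLine) is true and treat the shorter lines as "unknown" or handle them some other way.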

For single words without context, you can try matching them against dictionaries of the languages you are targeting.
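
A rough sketch of that dictionary approach follows. The class DictionaryLookup and the tiny word lists are hypothetical and exist only for illustration; a real setup would load full word lists for each target language.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DictionaryLookup {
    // Hypothetical, deliberately tiny word lists keyed by language code.
    static final Map<String, Set<String>> DICTIONARIES = new LinkedHashMap<>();
    static {
        DICTIONARIES.put("en", Set.of("yes", "no", "user", "timing", "none", "protocol"));
        DICTIONARIES.put("it", Set.of("no", "utente", "nessuno"));
        DICTIONARIES.put("fr", Set.of("oui", "non", "utilisateur"));
    }

    /** Returns every language whose dictionary contains the word (case-insensitive). */
    static Set<String> candidateLanguages(String word) {
        String key = word.trim().toLowerCase(Locale.ROOT);
        Set<String> matches = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : DICTIONARIES.entrySet()) {
            if (entry.getValue().contains(key)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Timing", "No", "user"}) {
            System.out.println(word + " -> " + candidateLanguages(word));
        }
    }
}
```

Note that a word such as "No" appears in more than one dictionary, so a lookup like this returns a set of candidates rather than a single answer. A single word without context can be genuinely ambiguous.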

Licensed under: CC-BY-SA with attribution