Question

Given this data (relative letter frequency from both languages):

spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,

And then computing the letter frequency for the string "this is a test" gives me:

"t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14

So, what would be a good approach for matching the string's letter frequencies against each language's profile (and so detecting the language)? I've seen (and tested) some examples using Levenshtein distance, and it seems to work fine until you add more languages.

"this is a test" gives (shortest distance:) [:english, 13] ...
"esto es una prueba" gives (shortest distance:) [:spanish, 13] ...

Solution

Have you considered using cosine similarity to determine the amount of similarity between two vectors? For two vectors A and B, cosine similarity is cos(θ) = (A · B) / (|A| |B|), i.e. the dot product divided by the product of the vector magnitudes.

The first vector would be the letter frequencies extracted from the test string (to be classified), and the second vector would be for a specific language.
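
A minimal Ruby sketch of that comparison, reusing the sample profiles from the question (real profiles would of course cover the full alphabet):

    def cosine_similarity(a, b)
      keys = a.keys | b.keys
      dot  = keys.sum { |k| a.fetch(k, 0.0) * b.fetch(k, 0.0) }
      mag  = ->(h) { Math.sqrt(keys.sum { |k| h.fetch(k, 0.0)**2 }) }
      dot / (mag.(a) * mag.(b))
    end

    profiles = {
      spanish: { 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83 },
      english: { 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80 }
    }
    test = { 't' => 21.43, 's' => 21.43, 'i' => 14.29, 'h' => 7.14, 'e' => 7.14, 'a' => 7.14 }

    profiles.max_by { |_, prof| cosine_similarity(test, prof) }.first
    # => :english (the profile whose vector points in the most similar direction)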

You're currently extracting single-letter frequencies (unigrams). I would suggest extracting higher-order n-grams, such as bigrams or trigrams (or even larger, if you have enough training data). For example, for bigrams you would compute the frequencies of "aa", "ab", "ac" ... "zz", which lets you capture more information than single-character frequencies alone.
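
A sketch of bigram extraction along those lines (stripping spaces first is a simplification; you may prefer to keep word boundaries as their own token):

    def bigram_frequencies(str)
      grams = str.downcase.delete('^a-z').chars.each_cons(2).map(&:join)
      total = grams.size.to_f
      grams.tally.transform_values { |n| (n / total * 100).round(2) }
    end

    bigram_frequencies("this is a test")
    # => {"th"=>10.0, "hi"=>10.0, "is"=>20.0, "si"=>10.0, "sa"=>10.0,
    #     "at"=>10.0, "te"=>10.0, "es"=>10.0, "st"=>10.0}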

Be careful, though: you need more training data when you use higher-order n-grams, otherwise you will have many zero values for character combinations you haven't seen before (additive smoothing is a common workaround).

A second possibility is to use tf-idf (term frequency, inverse document frequency) weighting instead of raw letter (term) frequencies.
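
One way that could look in Ruby, treating each language profile as a "document" for the idf computation (an interpretation on my part; other weighting schemes exist):

    def tf_idf(profiles)
      n  = profiles.size.to_f
      df = Hash.new(0)                        # document frequency per term
      profiles.each_value { |freqs| freqs.each_key { |t| df[t] += 1 } }
      profiles.transform_values do |freqs|
        freqs.to_h { |t, tf| [t, tf * Math.log(n / df[t])] }
      end
    end

With only the two sample profiles above, letters present in both (e, a, o, n) get idf = log(2/2) = 0 and drop out entirely, which is extreme; the weighting becomes useful once you have profiles for many languages, since it down-weights letters that are common everywhere and emphasizes the distinctive ones.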

Research

Here is a good slideshow on language identification for (very) short texts, which uses machine learning classifiers (but also has some other good info).

Here is a short paper, "A Comparison of Language Identification Approaches on Short, Query-Style Texts", that you might also find useful.

OTHER TIPS

The examples you gave consisted of a short sentence each. Statistically, if your input were longer (e.g. a paragraph), the distinctive frequencies should be easier to identify.

If you can't rely on the user providing longer input, perhaps also look for common words of each language (e.g. "is", "as", "and", "but" ...) when the letter frequencies are too close to call.
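
A sketch of such a tie-breaker (the stopword lists here are illustrative, not exhaustive):

    STOPWORDS = {
      english: %w[the is a and but of to],
      spanish: %w[el la es una y pero de]
    }

    def stopword_hits(str)
      words = str.downcase.scan(/[[:alpha:]]+/)
      STOPWORDS.transform_values { |list| (words & list).size }
    end

    stopword_hits("esto es una prueba")
    # => {:english=>0, :spanish=>2}   ("es" and "una" match)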

n-grams certainly help with short texts, and they help a great deal. With any text of reasonable length (a paragraph?), simple letter frequencies work well. As an example, I wrote a short demo of this; you may download the source at http://georgeflanagin.com/free.code.php

It's the last example on the page.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow