Question

I currently have a lot of comments and text in my database, mainly in English. However, if a text isn't in English, I want to translate it to English.

I know I can call a translation API to determine the language, but I don't want to make millions of translation API calls for text that most likely won't need translating.

I am looking for a way to determine if the text is English or not. I don't need to know what language it is, just that it isn't English, then if it isn't English I will send it to a translation service API.

Solution

The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.

If your only need is to detect whether or not something is English, then in theory you can use something even more compact.

Most good language detectors use trigram frequency (a gram being a single character), sometimes overlaid with word frequency. For your application, a hybrid approach seems suitable: a first pass that runs locally, has lower accuracy, and is tuned a bit aggressively so that it does not miss any potential English, and a second pass that calls an API such as Google Translate.
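
A minimal sketch of what such a local first pass could look like, assuming a character-trigram overlap score. The trigram set, the function names, and the 0.15 threshold are all illustrative assumptions, not taken from CLD3 or any other library; a real profile would be built from a large English corpus and the threshold tuned on your own data.

```python
# Hypothetical, hand-picked sample of frequent English character trigrams.
# A real profile would be derived from a large English corpus.
COMMON_ENGLISH_TRIGRAMS = {
    "the", "and", "ing", "ion", "ent", "her", "hat", "tha", "for", "tio",
    "ter", "ere", "his", "ver", "all", "ith", "thi", "wit", "you", "are",
    " th", "he ", " an", "nd ", " to", "ed ", " of", "of ", " in", "er ",
}


def char_trigrams(text):
    """Return character trigrams of the text, keeping only letters and single spaces."""
    letters_only = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    normalised = " " + " ".join(letters_only.split()) + " "
    return [normalised[i:i + 3] for i in range(len(normalised) - 2)]


def english_trigram_score(text):
    """Fraction of the text's trigrams that appear in the sample English set."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in COMMON_ENGLISH_TRIGRAMS)
    return hits / len(grams)


def probably_english(text, threshold=0.15):
    """First-pass filter; the threshold is an assumed value to be tuned."""
    return english_trigram_score(text) >= threshold
```

Only texts for which probably_english returns False would then be sent on to the translation API. Lowering the threshold makes the filter keep more borderline text as English, at the cost of more missed translations.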

The popularity of English and the amount of English data available are usually helpful when applying NLP solutions to it. In this case, unfortunately, you will often get false positives for English, because data sources that are labelled as English also contain other languages or non-language content such as garbage characters or URLs.

Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example@example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.

Other tips

You can use NTextCat to determine input language.

Research (by a certain Zipf) determined that, for the most part, a few words are used very frequently and a lot of words are used rarely.

If I were given this problem, I'd probably put together a list of the top X most-used English words. Then, for each comment, I would check whether it contains any of them.

It's not perfect (and if the text is very specialised, or misspelt, you've got an issue), but I think it's an acceptable heuristic.
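
A rough sketch of that heuristic, assuming a small hand-written word list and a ratio cut-off instead of a single match. The word list, the function names, and the 0.2 cut-off are illustrative assumptions; in practice you would use a longer frequency list and tune the cut-off on your own comments.

```python
import re

# Hypothetical short list of very frequent English words; extend as needed.
COMMON_ENGLISH_WORDS = frozenset({
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
    "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
    "this", "but", "is", "are", "was",
})


def common_word_ratio(text):
    """Fraction of the words in the text that are in the common-word list."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in COMMON_ENGLISH_WORDS)
    return hits / len(words)


def looks_english(text, min_ratio=0.2):
    """Treat the text as English if enough of its words are common English words."""
    return common_word_ratio(text) >= min_ratio
```

Using a ratio rather than "any match" helps with short words such as "a" or "he" that also occur in other languages, although very short comments will remain unreliable either way.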

See this post

More specifically, take a look at trigrams.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow