Determining if text is English

Question 1

The Chromium project (including its most popular implementation, Google Chrome) solves this problem with https://github.com/google/cld3.

If your only need is to detect whether or not something is English, then in theory you can use something even more compact.

Most good language detectors use trigram frequency (a gram being a single character), or trigram frequency overlaid with word frequency. For your application, it seems you could use a hybrid approach: a first pass that runs locally, has low accuracy, and is tuned to be aggressive enough not to miss any potential English, followed by a second pass that hits an API like Google Translate.
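As a rough illustration of that two-pass idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny ENGLISH_SAMPLE stands in for a profile that would normally be built from a large corpus, the 0.3 threshold is arbitrary, and remote_check is a hypothetical callable wrapping whatever API you use for the second pass.

```python
# Tiny illustrative sample; a real profile would be built from a large corpus.
ENGLISH_SAMPLE = (
    "the quick brown fox jumps over the lazy dog and then it runs away "
    "this is a simple example of some english text that we can use here "
    "there are many words that people use when they are writing in english"
)

def trigrams(text):
    """Return the character trigrams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# Set of trigrams seen in the sample (presence only, no frequencies).
ENGLISH_PROFILE = set(trigrams(ENGLISH_SAMPLE))

def english_score(text):
    """Fraction of the text's trigrams that also occur in the profile."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in ENGLISH_PROFILE)
    return hits / len(grams)

def maybe_english(text, threshold=0.3):
    """Cheap local first pass; the low threshold keeps it aggressive."""
    return english_score(text) >= threshold

def is_english(text, remote_check, threshold=0.3):
    """Two passes: local filter first, then the hypothetical remote_check."""
    if not maybe_english(text, threshold):
        return False
    return remote_check(text)
```

Only text that survives the local filter reaches remote_check, so the API is called far less often. A real detector would compare trigram frequencies rather than mere presence, and would need a much larger profile.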

The popularity of English and the amount of English data usually help when applying NLP solutions to it, but in this case you will unfortunately often get false positives for English, because data sources labelled as English also contain other languages or non-language content such as garbage characters or URLs.

Note also that for many queries there is no single correct answer. Good systems will return a weighted list of possibilities, but for a query like [dan], [a], [example@example.com] or [hi! como estas? i'm in class ahorita] the most correct answer will depend on your application and may not exist.

Question 2

You can use NTextCat to determine input language.

Question 3

Research (by a certain Zipf) determined that, for the most part, a few words are used very frequently and a lot of words are used rarely.

If I were given this problem, I'd probably put together a list of the top X most used words. Then for each comment I would check whether it contains any of them.

It's not perfect (and if the text is very specialised, or misspelt, you've got an issue) - but I think it's an acceptable heuristic.
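A minimal Python sketch of that heuristic could look like the following. The word list is a short hand-picked stand-in for a real "top X" list, and the 0.2 cut-off is an arbitrary assumption; both would need tuning on real comments.

```python
import re

# Hypothetical stand-in for a "top X" list; a real one would come from
# a word-frequency table for English.
COMMON_ENGLISH_WORDS = {
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "i",
    "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
    "this", "but", "is", "are", "was",
}

def common_word_ratio(text):
    """Fraction of the words in text that appear in the common-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    matches = sum(1 for w in words if w in COMMON_ENGLISH_WORDS)
    return matches / len(words)

def looks_english(text, min_ratio=0.2):
    """Treat the text as English if enough of its words are common ones."""
    return common_word_ratio(text) >= min_ratio
```

Using a ratio rather than a single match makes very short words that also exist in other languages (such as "a") less likely to trigger a false positive on longer comments.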

Question 4

See this post

More specifically, take a look at trigrams.