Question

I am writing a bot that checks thousands of websites to determine whether or not they are in English.

I am using Scrapy (a Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to check a website's language?

Any help would be appreciated.


Solution

Look into the Natural Language Toolkit (NLTK):

NLTK: http://nltk.org/

What you want to look at is using the words corpus to get the default English vocabulary that ships with NLTK:

nltk.corpus.words.words()

Then, compare your text with the above using difflib.

Reference: http://docs.python.org/library/difflib.html

Using these tools, you can define a threshold for how closely your text must match the English vocabulary provided by NLTK.
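
A minimal sketch of that idea: tokenize the page text you already extracted and compute the fraction of tokens found in NLTK's English word list. Exact set membership is used here for speed; difflib.get_close_matches could be swapped in for fuzzy matching. The 0.4 threshold is an assumption you would tune on sample pages.

    import re
    import nltk
    from nltk.corpus import words

    nltk.download("words")  # one-time download of the English word list

    english_vocab = set(w.lower() for w in words.words())

    def english_score(text):
        """Fraction of alphabetic tokens found in NLTK's English word list."""
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in english_vocab)
        return float(hits) / len(tokens)

    # Assumed threshold: treat the page as English if 40% of its tokens are known words.
    print(english_score("The quick brown fox jumps over the lazy dog") > 0.4)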

OTHER TIPS

Since you are using Python, you can try out NLTK. More precisely, you can look into NLTK.detect.

More information and the exact code snippet are here: NLTK and language detection

You can use the response headers to find out:

Wikipedia

If the sites are multilingual, you can send the "Accept-Language: en-US,en;q=0.8" header and expect the response to be in English. If it is not, you can inspect the response.headers dictionary and see if you can find any information about the language.
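
For example, in Scrapy you might send that header via the DEFAULT_REQUEST_HEADERS setting and then look for a Content-Language response header in your callback. A minimal sketch (the spider name and start URL are placeholders):

    import scrapy

    class LanguageCheckSpider(scrapy.Spider):
        name = "language_check"                # placeholder name
        start_urls = ["http://example.com"]    # placeholder URL

        custom_settings = {
            "DEFAULT_REQUEST_HEADERS": {
                "Accept-Language": "en-US,en;q=0.8",
            },
        }

        def parse(self, response):
            # Content-Language is optional; many sites do not send it at all.
            declared = response.headers.get("Content-Language")
            if declared:
                self.logger.info("%s declares language %r", response.url, declared)
            else:
                self.logger.info("%s sends no language header", response.url)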

If you are still unlucky, you can try mapping the IP to a country and then to a language in some way. As a last resort, try detecting the language (I don't know how accurate this is).

If you are using Python, I highly recommend the standalone LangID module (langid.py) written by Marco Lui and Tim Baldwin. The model is pre-trained and the detection is highly accurate. It can also handle XML/HTML documents.
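
A minimal usage sketch, assuming langid.py is installed (pip install langid); classify() returns a (language code, score) pair:

    import langid

    text = "This page appears to be written in English."
    lang, score = langid.classify(text)
    print(lang, score)        # e.g. ('en', <score>)
    print(lang == "en")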

You can use the Language Detection API at http://detectlanguage.com. It accepts a text string via GET or POST and returns JSON output with scores. There are free and premium plans.
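
A sketch of calling it with the requests library; the endpoint, parameter names, and response layout below are assumptions based on the service's 0.2 REST API, so check the current documentation at detectlanguage.com before relying on them.

    import requests

    API_KEY = "your-api-key"  # placeholder

    def detect_language(text):
        # Assumed endpoint and parameters; verify against the current API docs.
        resp = requests.post(
            "https://ws.detectlanguage.com/0.2/detect",
            data={"q": text, "key": API_KEY},
        )
        resp.raise_for_status()
        detections = resp.json()["data"]["detections"]
        return detections[0]["language"] if detections else None

    print(detect_language("Hello, world") == "en")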

If an HTML website uses non-English characters, this is indicated in the page's source code in a meta tag, which helps browsers know how to render the page.

Here is an example of an Arabic website, http://www.tanmia.ae, that has both an English page and an Arabic page.

The meta tag on the Arabic page is: <meta http-equiv="X-UA-Compatible" content="IE=edge">

The same page in English has: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe have the bot look at the meta tag: if it indicates English, proceed; otherwise ignore the page?
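
A sketch of that check using Scrapy selectors, callable from your parse() callback. Besides the meta tags shown above, it also looks at the Content-Language meta tag and the lang attribute on the html element, which are other common places a page declares its language (an addition to this answer, not part of it):

    def declared_language(response):
        """Return the language the page declares in its markup, or None."""
        content_language = response.xpath(
            '//meta[@http-equiv="Content-Language"]/@content').extract_first()
        html_lang = response.xpath('//html/@lang').extract_first()
        return content_language or html_lang

    def looks_english(response):
        lang = declared_language(response)
        return lang is not None and lang.lower().startswith("en")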

If you don't want to trust what the webpage tells you but want to check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but have a bit of English (enough to fool heuristics like "check if the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.

It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams function.
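
A sketch of the general idea: build a character-trigram frequency profile with nltk.util.trigrams and compare profiles by cosine similarity. Training the English profile on a real reference corpus (or NLTK's shipped tables) rather than the tiny sample below is assumed.

    import math
    from collections import Counter
    from nltk.util import trigrams  # yields sliding 3-character tuples from a string

    def trigram_profile(text):
        cleaned = " ".join(text.lower().split())
        return Counter(trigrams(cleaned))

    def cosine_similarity(p, q):
        dot = sum(p[t] * q[t] for t in set(p) & set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    english_profile = trigram_profile(
        "the quick brown fox jumps over the lazy dog this is a small english sample")
    page_profile = trigram_profile("some page text extracted by your crawler")

    # Higher similarity to the English profile suggests the page is in English.
    print(cosine_similarity(english_profile, page_profile))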

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow