How to crawl only English sites and avoid crawling other languages?

https://stackoverflow.com/questions/12275779

30-06-2021

Question

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of a page with plugins such as the language detector, but I need to prevent Nutch from crawling non-English sites. I know a page has to be crawled before its language can be detected, but I want to leave the site at the first chance we have to detect the language. Could you please tell me if this is possible? For example, if two or three pages of a site were fetched and they weren't English, Nutch should leave the site and abandon those pages and all of their URLs. Thanks for any help.

Solution

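If you have a quick look at the list of HTTP header fields (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields), you will see that a response can declare the language of its content. A server that sets this header answers with a line like this: "Content-Language: en". Not every server sends it, so treat a missing header as "unknown" rather than "not English".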

You do not need to do a GET request (and download the whole page); you can read this header from the response to a HEAD request, which downloads only the headers.
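As a rough illustration of the HEAD approach, here is a minimal Python sketch using only the standard library. It is not Nutch code, and the function names `head_content_language` and `is_english` are made up for this example. It assumes the server answers HEAD requests and actually sets Content-Language, which many servers do not.

```python
import urllib.request
from typing import Optional


def is_english(content_language: Optional[str]) -> Optional[bool]:
    """Interpret a Content-Language header value.

    Returns True or False when the header decides it, and None when the
    header is missing or empty (language unknown).
    """
    if content_language is None:
        return None
    # The header may list several language tags, e.g. "en-US, fr".
    tags = [t.strip().lower() for t in content_language.split(",") if t.strip()]
    if not tags:
        return None
    return any(t == "en" or t.startswith("en-") for t in tags)


def head_content_language(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the Content-Language header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Language")
```

A crawler could call `is_english(head_content_language(url))` before fetching a page and fall back to downloading the page and running a language detector whenever the result is `None`.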

About leaving a site once two or three fetched pages turn out not to be English: a site can be multilingual. The first three pages you fetch might be in Spanish (or any other language), so you would leave the site even though it has some pages in English.
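If you still want a "give up on this site" rule despite that caveat, the bookkeeping could look roughly like the sketch below. It is plain Python, not a Nutch plugin; the class name `HostLanguageFilter` and the `min_pages` threshold are assumptions for illustration. It abandons a host only when at least `min_pages` pages were checked and none of them was English, so a single English page keeps the host alive.

```python
from urllib.parse import urlsplit


class HostLanguageFilter:
    """Track per-host language results and decide when to abandon a host."""

    def __init__(self, min_pages: int = 3):
        self.min_pages = min_pages  # hypothetical threshold, tune as needed
        self.seen = {}              # host -> number of pages checked
        self.english = {}           # host -> number of English pages found

    @staticmethod
    def _host(url: str) -> str:
        return urlsplit(url).hostname or ""

    def record(self, url: str, page_is_english: bool) -> None:
        """Store the detected language result for one fetched page."""
        host = self._host(url)
        self.seen[host] = self.seen.get(host, 0) + 1
        if page_is_english:
            self.english[host] = self.english.get(host, 0) + 1

    def should_skip(self, url: str) -> bool:
        """True when enough pages were checked and none was English."""
        host = self._host(url)
        checked = self.seen.get(host, 0)
        return checked >= self.min_pages and self.english.get(host, 0) == 0
```

With a small `min_pages` this rule has exactly the weakness described above: a multilingual site whose first few fetched pages are not English gets dropped. Raising the threshold trades wasted fetches for fewer missed sites.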

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow