Question

Hi, I need to crawl only sites whose language is English. I know Nutch can detect the language of a page with plugins such as the language detector, but I need to prevent Nutch from crawling non-English sites. I understand that a page has to be fetched before its language can be detected, but I want to leave the site at the first point where the language is known. Could you please tell me if this is possible? For example, if two or three pages of a site were fetched and they weren't English, Nutch should leave the site and abandon those pages and all of their URLs. Thanks for any help.


Solution

If you have a quick look at the list of HTTP header fields (http://en.wikipedia.org/wiki/List_of_HTTP_header_fields), you will see that a server can state the content language in its response headers, and you will get an answer like this: "Content-Language: en".

You do not need to do a GET request (and download the whole page); you can read this header from a HEAD request, which downloads only the headers.
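
As a rough illustration of the HEAD idea, here is a minimal Python sketch (not Nutch code). The function names fetch_content_language and is_english are made up for this example. It assumes the server actually sends a Content-Language header; many servers do not, so a missing header is reported as "unknown" rather than "not English".

```python
import urllib.request
from typing import Optional


def fetch_content_language(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the Content-Language header, if present."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Language")


def is_english(content_language: Optional[str]) -> Optional[bool]:
    """Interpret a Content-Language value.

    Returns True or False when the header is present, and None when it is
    missing or empty (language unknown, the page would have to be fetched).
    """
    if not content_language or not content_language.strip():
        return None
    # The header may list several tags, e.g. "en-US, fr".
    tags = [tag.strip().lower() for tag in content_language.split(",")]
    return any(tag == "en" or tag.startswith("en-") for tag in tags if tag)
```

A caller would do something like is_english(fetch_content_language(url)) and fall back to fetching the page and running a language detector when the result is None.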

About "For example if two or three pages of a site were fetched and they weren't English nutch should leave the site and abandon those pages and all urls of them": a site can be multi-language. The first three pages you fetch could be in Spanish (or any other language), and you would leave the site even though it also has pages in English.
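
If you still want a per-site cutoff, one way to soften the multi-language problem is to abandon a host only when it has produced several non-English pages and no English page at all. The sketch below is plain Python, not a Nutch API; the class HostLanguageTracker and the threshold max_non_english are hypothetical names chosen for illustration, and the per-page language decision is assumed to come from elsewhere (a header check or a language detector).

```python
from collections import defaultdict
from urllib.parse import urlparse


class HostLanguageTracker:
    """Count English and non-English pages per host (illustrative only)."""

    def __init__(self, max_non_english: int = 3):
        # Hypothetical threshold: non-English pages tolerated before giving up.
        self.max_non_english = max_non_english
        self.english = defaultdict(int)
        self.non_english = defaultdict(int)

    @staticmethod
    def _host(url: str) -> str:
        return urlparse(url).netloc.lower()

    def record(self, url: str, page_is_english: bool) -> None:
        """Store the detected language of a fetched page."""
        host = self._host(url)
        if page_is_english:
            self.english[host] += 1
        else:
            self.non_english[host] += 1

    def should_skip(self, url: str) -> bool:
        """Skip a URL only if its host has shown no English page so far
        and has reached the non-English threshold."""
        host = self._host(url)
        return (
            self.english[host] == 0
            and self.non_english[host] >= self.max_non_english
        )
```

With this rule a host that has served even one English page is never abandoned, which costs some extra fetches on mostly non-English sites. A multi-language site whose first few fetched pages happen to be non-English would still be dropped, so the threshold is a trade-off rather than a fix.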

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow