Nutch issues with crawling a website where the URLs differ only in the parameters passed

StackOverflow https://stackoverflow.com/questions/1705808

  •  19-09-2019

Question

I am using Nutch to crawl websites, and strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other.

The URLs on my website are basically of this format:

http://mysite.com/index.php?main_page=index&params=12

http://mysite.com/index.php?main_page=index&category=tub&param=17

i.e. the URLs differ only in the parameters appended to the URL (the part "http://mysite.com/index.php?" is common to all URLs).
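
To make the shared prefix concrete, here is a small Python sketch (not part of Nutch; the helper name `split_url` is made up for illustration) that splits the two example URLs into their common base and their query parameters:

```python
from urllib.parse import urlsplit, parse_qs

# The two example URLs from the question.
URLS = [
    "http://mysite.com/index.php?main_page=index&params=12",
    "http://mysite.com/index.php?main_page=index&category=tub&param=17",
]

def split_url(url):
    """Return (base, params): the URL without its query string, and the parsed query."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base, parse_qs(parts.query)

if __name__ == "__main__":
    for url in URLS:
        base, params = split_url(url)
        print(base, params)
```

Both URLs give the same base, `http://mysite.com/index.php`, so everything that distinguishes one page from another sits after the `?`.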

Is Nutch unable to crawl such websites?

Which Nutch settings should I change in order to crawl such websites?


Solution

I got the issue fixed. It was caused entirely by this rule in the URL filter configuration:

# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

I commented out this filter and Nutch crawled all URLs :)
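
For illustration, here is a rough Python sketch of how I understand the regex URL filter to evaluate its rules: each non-comment line is `+` (accept) or `-` (reject) followed by a regular expression, the rules are tried in order, the first pattern found anywhere in the URL decides, and a URL that matches no rule is dropped. This is not Nutch code. `parse_rules` and `accepts` are hypothetical names, and the trailing `+.` "accept anything else" rule is an assumption about the rest of the filter file.

```python
import re

# Filter file with the rule active (the "+." catch-all is an assumption).
ACTIVE = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# Same file with the rule commented out, as in the fix above.
COMMENTED_OUT = """
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
+.
"""

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, ignoring blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern occurs in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected

if __name__ == "__main__":
    home = "http://mysite.com/"
    page = "http://mysite.com/index.php?main_page=index&params=12"
    for name, text in (("active", ACTIVE), ("commented out", COMMENTED_OUT)):
        rules = parse_rules(text)
        print(name, accepts(rules, home), accepts(rules, page))
    # active True False
    # commented out True True
```

With the rule active, every URL containing `?` or `=` is rejected before any later rule is consulted, which would explain why the parameterised `index.php` URLs were being skipped.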

Licensed under: CC-BY-SA with attribution