Is it Possible to Have Nutch Only Crawl Down a Certain File Path?

https://stackoverflow.com/questions/18731223

apache
web-crawler
nutch

28-06-2022

Question

I'm trying to use Apache Nutch to crawl only down a certain file path. For example, if my URL is:

www.foo.com/shoes/

I want to keep crawling URLs like www.foo.com/shoes/nike, www.foo.com/shoes/addidas, and www.foo.com/shoes/addidas/soccer, but NOT crawl other directories like www.foo.com/clothes or www.foo.com/watches. Is there any way Nutch can do this?

Solution

The only thing you have to do is write a regex that matches your pattern, something like

+.www.foo.com/shoes/

and skip everything else by using

-.*

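at the end of your crawl-urlfilter.txt (conf/regex-urlfilter.txt in later Nutch 1.x releases). The catch-all has to come last because the first rule that matches a URL decides whether it is kept.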
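
To check which URLs a pair of rules like this would let through before starting a crawl, the filter's behaviour can be emulated roughly in a few lines of Python. This is only a sketch, not Nutch code: it assumes the filter tries the rules top to bottom, treats each one as a regex searched for anywhere in the URL, and lets the first match decide. The helper names `parse_rules` and `accepts` are made up for illustration.

```python
import re

# The two rules from the answer, in file order.
# A leading '+' means accept, a leading '-' means reject.
FILTER_LINES = [
    "+.www.foo.com/shoes/",
    "-.*",
]

def parse_rules(lines):
    """Split each non-empty, non-comment line into (sign, compiled regex)."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0], re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

if __name__ == "__main__":
    rules = parse_rules(FILTER_LINES)
    for url in [
        "http://www.foo.com/shoes/nike",
        "http://www.foo.com/shoes/addidas/soccer",
        "http://www.foo.com/clothes",
        "http://www.foo.com/watches",
    ]:
        print("keep" if accepts(url, rules) else "skip", url)
```

The unescaped dots in the first rule match any character, so a tighter variant such as `+^https?://www\.foo\.com/shoes/` may be preferable if stray matches on other hosts are a concern.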

Licensed under: CC-BY-SA with attribution
