Question

I'm trying to use Apache Nutch to crawl only under a certain path. For example, if my URL is:

www.foo.com/shoes/

I want to keep crawling URLs like www.foo.com/shoes/nike, www.foo.com/shoes/addidas and www.foo.com/shoes/addidas/soccer, but NOT crawl other directories like www.foo.com/clothes or www.foo.com/watches. Is there any way Nutch can do this?

Solution

All you have to do is write a regex that matches your pattern, something like

+^https?://www\.foo\.com/shoes/

and skip everything else by using

-.*

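at the end of your crawl-urlfilter.txt. The rules are checked from top to bottom and the first one that matches decides whether the URL is kept, so the catch-all reject has to come last.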
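
If you want to check your rules before starting a crawl, here is a minimal Python sketch that imitates the "first matching rule decides" behaviour on the URLs from the question. It is not Nutch code. The `RULES` list and the `accepts` helper are made up for illustration, the accept pattern assumes the URLs start with http:// or https:// and use the www host, and Python's `re` is only an approximation of the Java regex engine Nutch uses.

```python
import re

# Rules in the order they would appear in the filter file.
# "+" keeps a URL, "-" drops it. The patterns are illustrative assumptions.
RULES = [
    ("+", r"^https?://www\.foo\.com/shoes/"),  # keep everything under /shoes/
    ("-", r".*"),                              # drop everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose regex is found in the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: treat the URL as dropped.
    return False


if __name__ == "__main__":
    for url in [
        "http://www.foo.com/shoes/nike",
        "http://www.foo.com/shoes/addidas/soccer",
        "http://www.foo.com/clothes",
        "http://www.foo.com/watches",
    ]:
        print("keep" if accepts(url) else "drop", url)
```

Swapping the two entries in `RULES` makes every URL get dropped, which is why the order in the filter file matters.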

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow