Question
I'm trying to use Apache Nutch to crawl only a certain path of a site. For example, if my URL is:
www.foo.com/shoes/
I want to keep crawling URLs like www.foo.com/shoes/nike, www.foo.com/shoes/addidas and www.foo.com/shoes/addidas/soccer, but NOT crawl other directories like www.foo.com/clothes or www.foo.com/watches. Is there any way Nutch can do this?
Solution
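All you have to do is write a regex that matches your pattern, something like
+^https?://www\.foo\.com/shoes/
and skip everything else by using
-.*
at the end of your crawl-urlfilter.txt. Each line is a + (include) or - (exclude) followed by a regex, and the first rule that matches a URL decides, so the catch-all exclusion has to come last. Escaping the dots and anchoring with ^ keeps the rule from matching look-alike hosts or URLs that merely contain the string.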
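To check a rule set before starting a crawl, the first-match behaviour can be imitated in a few lines of Python. This is only a rough sketch and not Nutch code: `parse_rules` and `accepts` are hypothetical helper names, the rule text uses an anchored, escaped variant of the www.foo.com/shoes/ pattern, and it assumes URLs arrive with an http or https scheme and that a URL matching no rule is dropped.

```python
import re

# Rules in the style of a Nutch URL filter file:
# a '+' (include) or '-' (exclude) prefix followed by a regex.
# Assumption: the host and path come from the question's example.
RULES = r"""
# accept anything under /shoes/ on www.foo.com
+^https?://www\.foo\.com/shoes/
# reject everything else
-.*
"""

def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is ignored

if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.foo.com/shoes/nike",
        "http://www.foo.com/shoes/addidas/soccer",
        "http://www.foo.com/clothes",
        "http://www.foo.com/watches",
    ):
        print(url, "->", "crawl" if accepts(rules, url) else "skip")
```

Swapping the order of the two rules makes every URL come back as "skip", which is why the -.* line belongs at the end of the file.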