Question

I have an issue where I try to issue a new crawl over something I've already crawled, but with some new URLs.

So first I have:

urls/urls.txt -> www.somewebsite.com

I then issue the command:

bin/nutch crawl urls -dir crawl -depth 60 -threads 50

I then update urls/urls.txt: remove www.somewebsite.com, add www.anotherwebsite.com.

I then issue the commands:

bin/nutch inject crawl urls

bin/nutch crawl urls -dir crawl -depth 60 -threads 50

What I would expect here is that www.anotherwebsite.com is injected into the existing 'crawl' db, and when the crawl is issued again it should only crawl the new website I've added, www.anotherwebsite.com (as the refetch interval for the original is set to 30 days).
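(For reference, the 30-day refetch comes from db.fetch.interval.default, which I have left at its stock value; in nutch-default.xml it looks roughly like this:)

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).</description>
</property>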

What I have experienced is that either:

1.) no website is crawled

2.) only the original website is crawled

'Sometimes', if I leave it for a few hours, it starts working: it picks up the new website and crawls both the old website and the new one (even though the refetch time is set to 30 days).

It's very weird and unpredictable behaviour.

I'm pretty sure my regex-urlfilter.txt file is set correctly, and my nutch-site.xml / nutch-default.xml are set up with defaults (near enough).
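(To illustrate what I mean by 'set correctly': the accept rules in regex-urlfilter.txt are along these lines, so both domains should pass the filter; the exact patterns may differ in my file.)

# accept the sites I seed
+^http://(www\.)?somewebsite\.com/
+^http://(www\.)?anotherwebsite\.com/
# reject anything else
-.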

Questions:

Can anyone explain simply (with commands) what is happening during each crawl, and how to update an existing crawl db with some new URLs?

Can anyone explain (with commands) how to force a recrawl of 'all' URLs in the crawl db? I have issued a readdb and checked the refetch times, and most are set to a month, but what if I want to refetch sooner?

Solution

The article linked here explains the crawl process in sufficient depth.
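In the meantime, here is a rough sketch of what one 'depth' iteration of your one-shot crawl command boils down to, and how new URLs get into an existing db. It assumes the layout your command created, i.e. the db at crawl/crawldb and segments under crawl/segments (the SEGMENT variable is just a helper to pick up the newest segment):

# add any new seeds from urls/ into the existing db
# (note: inject takes the crawldb path, not the top-level crawl dir)
bin/nutch inject crawl/crawldb urls

# build a fetch list from URLs that are due, as a new segment
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)

# fetch the segment, then parse it (skip parse if fetcher.parse is true)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT

# merge the results back into the db; this is where the next-fetch
# times (now + 30 days by default) are written
bin/nutch updatedb crawl/crawldb $SEGMENT

The one-shot crawl command repeats generate/fetch/parse/updatedb up to -depth times, then builds the linkdb (invertlinks) and, depending on version, an index on top.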
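For forcing an earlier recrawl, the usual trick is to make the generator believe the refetch time has already passed. A sketch, with the same crawl/crawldb layout assumed (dumpdir is just an arbitrary output directory):

# inspect what the db currently holds, including fetch times and intervals
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump dumpdir

# generate a fetch list as if 30 extra days had already elapsed, so URLs
# whose refetch is not yet due are selected anyway
bin/nutch generate crawl/crawldb crawl/segments -adddays 30

# then fetch / parse / updatedb the new segment exactly as above

Lowering db.fetch.interval.default in nutch-site.xml also works, but as far as I know it only affects entries the next time updatedb touches them, so -adddays is the quicker way to pull everything forward for a one-off recrawl.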

Licensed under: CC-BY-SA with attribution