Question

I am trying to write a crawler to collect information from a website that contains around 15 GB of data. I crawl the information and store it in my database. New pages are added to the site every week, and at the same time old pages get updated. Does that mean I have to re-crawl the whole 15 GB and rebuild my database from scratch every time an edit occurs? What is the simplest way to deal with this problem? How does Google News handle it, since they face a similar problem of information being updated across the globe? So far I have found the following research paper on the topic:

http://oak.cs.ucla.edu/~cho/papers/cho-tods03.pdf

Also, is it always necessary to write a custom crawler for this purpose? Can't I use Scrapy or Nutch?


OTHER TIPS

What you can do in Nutch is use AdaptiveFetchSchedule, as explained here. It lets Nutch re-crawl a page, detect whether it changed in the meantime, and adapt the schedule so that pages that change more often are re-crawled more frequently, and vice versa. Of course, you could also check the Last-Modified headers, if they exist and can be trusted, and simply skip the re-crawl when the date is earlier than your last crawl. I'm not sure whether Nutch already does this when detecting changes or whether it relies on a hash-based comparison, but it doesn't sound too hard to implement yourself if need be.
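In Nutch itself, enabling this usually means pointing db.fetch.schedule.class at org.apache.nutch.crawl.AdaptiveFetchSchedule in nutch-site.xml and tuning the db.fetch.schedule.adaptive.* properties (check nutch-default.xml for your version). If you want to roll the header/hash check yourself, outside Nutch or alongside it, a minimal Python sketch might look like the following. It assumes the third-party requests package and a hypothetical record dict supplied by your own database; the interval function only mimics the general idea behind an adaptive schedule, not Nutch's exact formula.

import hashlib
import time

import requests  # assumption: the third-party "requests" package is installed


def check_for_change(url, record, timeout=30):
    """Re-fetch a page only if it appears to have changed.

    `record` is a hypothetical dict loaded from your own database with the
    keys 'etag', 'last_modified' and 'content_hash'. Returns a tuple
    (changed, new_record, body); body is None when the server answered
    304 Not Modified.
    """
    headers = {}
    if record.get("etag"):
        headers["If-None-Match"] = record["etag"]
    if record.get("last_modified"):
        headers["If-Modified-Since"] = record["last_modified"]

    resp = requests.get(url, headers=headers, timeout=timeout)

    if resp.status_code == 304:
        # Server honoured the conditional request: nothing to re-process.
        return False, record, None

    resp.raise_for_status()

    # Last-Modified/ETag are often missing or unreliable, so fall back to
    # hashing the body and comparing it with what was stored last time.
    new_hash = hashlib.sha256(resp.content).hexdigest()
    changed = new_hash != record.get("content_hash")

    new_record = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "content_hash": new_hash,
        "checked_at": time.time(),
    }
    return changed, new_record, resp.content


def next_interval(current_seconds, changed,
                  min_interval=3600, max_interval=30 * 86400,
                  inc_rate=0.4, dec_rate=0.2):
    """Adjust the per-page re-crawl interval, loosely mimicking the idea
    behind an adaptive fetch schedule: check changing pages sooner and
    stable pages later. The rates here are arbitrary illustration values."""
    if changed:
        current_seconds *= (1.0 - dec_rate)
    else:
        current_seconds *= (1.0 + inc_rate)
    return max(min_interval, min(max_interval, current_seconds))

With something along these lines you only re-parse and re-store pages whose content actually changed, so the full 15 GB only has to be fetched once; each later pass touches a much smaller set of URLs.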

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow