Question

I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built-in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results and then access them later via Rails. I have a couple of concerns, which I describe below a rough sketch of the setup I have in mind.
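This is only a sketch based on my reading of the Anemone README; the URL is a placeholder, I am assuming the mongo gem is available, and I am assuming `Anemone::Storage.MongoDB` is the right way to point the crawler at MongoDB:

```ruby
require 'anemone'

Anemone.crawl('http://example.com/') do |anemone|
  # Swap the default in-memory page store for the MongoDB backend.
  anemone.storage = Anemone::Storage.MongoDB

  anemone.on_every_page do |page|
    # The storage engine persists each page; this is just progress output.
    puts page.url
  end
end
```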

1) At the end of this page, it says: "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." I would expect that behaviour with the default in-memory storage, but shouldn't records persisted to MongoDB stick around indefinitely, so that duplicate pages are not crawled the next time the task runs? If the data really is wiped "before beginning a new crawl", should I just run my Rails logic before kicking off the next crawl? If so, I would end up having to check for duplicate records from the previous crawl. (I sketch what I mean by "run my Rails logic" at the end of this question.)

2) This is the first time I have really thought about using MongoDB outside the context of Rails models. It looks like the records are created using Anemone's Page class, so can I later just query them with Mongoid as I normally would? I guess a collection is only considered a "model" once it has an ORM providing the fancy methods? The sketches below show the kind of thing I am hoping to be able to do.
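To make question 2 concrete, here is roughly how I imagine reading the crawled pages from Rails. Again, only a sketch: `CrawledPage` is a model I would define myself, the `pages` collection name and the field names are my guesses at what Anemone writes (and it may write to a different database than my Rails app uses), and `store_in collection:` assumes Mongoid 3+ syntax:

```ruby
# app/models/crawled_page.rb -- a plain Mongoid document mapped onto the
# collection Anemone writes to; no ActiveRecord involved.
class CrawledPage
  include Mongoid::Document

  store_in collection: 'pages'   # wherever Anemone is persisting pages (my assumption)

  field :url,   type: String
  field :body,  type: String
  field :code,  type: Integer    # HTTP status code, if Anemone stores it
  field :depth, type: Integer
end

# Queried like any other Mongoid model, e.g. from a controller or the console:
CrawledPage.where(code: 200).each do |page|
  puts "#{page.url} (#{page.body.to_s.length} bytes)"
end
```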
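And for question 1: if the data really is cleared before each new crawl, I assume my workflow would be to copy anything I want to keep into a collection of my own before the next crawl starts, upserting by URL so that re-runs do not create duplicates. Something like this, where `ArchivedPage` is a hypothetical model of my own and the rake task is purely illustrative:

```ruby
# app/models/archived_page.rb -- a hypothetical app-owned model, kept in a
# separate collection that Anemone never touches.
class ArchivedPage
  include Mongoid::Document

  store_in collection: 'archived_pages'

  field :url,        type: String
  field :body,       type: String
  field :crawled_at, type: Time

  index({ url: 1 }, { unique: true })
end

# lib/tasks/crawler.rake -- run before kicking off the next crawl, since that
# crawl would clear Anemone's own collection.
namespace :crawler do
  desc 'Copy crawled pages into an app-owned collection'
  task archive: :environment do
    CrawledPage.all.each do |page|
      # Upsert by URL so re-running the task does not duplicate records.
      archived = ArchivedPage.find_or_initialize_by(url: page.url)
      archived.body       = page.body
      archived.crawled_at = Time.now
      archived.save!
    end
  end
end
```

The unique index on `url` is just my way of making the copy step safe if the task happens to run twice.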
