Question

The crawler needs an extensible architecture so the internal processing pipeline can be changed, for example by implementing new steps (pre-parser, parser, etc.).
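
To illustrate what I mean by pluggable steps, here is a minimal sketch (the interface and class names are made up for this example, not taken from any particular project):

```java
import java.util.ArrayList;
import java.util.List;

// Shared state passed from one step to the next.
class CrawlContext {
    String url;
    String rawContent;  // filled by a fetch step
    String parsedText;  // filled by a parser step
}

// Each processing step (pre-parser, parser, ...) implements this interface.
interface CrawlStep {
    void process(CrawlContext context);
}

class CrawlPipeline {
    private final List<CrawlStep> steps = new ArrayList<>();

    // New steps are registered here instead of being hard-coded into the crawler.
    public CrawlPipeline addStep(CrawlStep step) {
        steps.add(step);
        return this;
    }

    public void run(CrawlContext context) {
        for (CrawlStep step : steps) {
            step.process(context);
        }
    }
}
```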

I found the Heritrix Project (http://crawler.archive.org/).

Are there other good projects like that?

Solution

Nutch is the best you can do when it comes to a free crawler. It is built on top of Lucene (in an enterprise-scale way) and is backed by Hadoop, which uses MapReduce (similar to Google) for large-scale data processing. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route, I suggest getting onto their technical review team to get an early copy of the title!
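
To give a feel for the MapReduce style that Nutch's Hadoop back end builds on, here is a minimal, hypothetical Hadoop job that counts discovered links per target host. The class and input format are illustrative only, not taken from Nutch itself:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HostLinkCount {

    // Map: each input line is assumed to be "sourceUrl targetUrl";
    // emit (targetHost, 1) for every link.
    public static class LinkMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\\s+");
            if (parts.length == 2) {
                try {
                    String host = java.net.URI.create(parts[1]).getHost();
                    if (host != null) {
                        context.write(new Text(host), ONE);
                    }
                } catch (IllegalArgumentException ignored) {
                    // skip malformed URLs
                }
            }
        }
    }

    // Reduce: sum the counts per host.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "host link count");
        job.setJarByClass(HostLinkCount.class);
        job.setMapperClass(LinkMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```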

These are all Java based. If you are a .NET person (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class and API-by-API ports to C#.

OTHER TIPS

You may also want to try Scrapy (http://scrapy.org/).

It makes it really easy to define and run your crawlers.

Abot is a good extensible web crawler. Every part of the architecture is pluggable, giving you complete control over its behavior. It's open source, free for commercial and personal use, and written in C#.

https://github.com/sjdirect/abot

I recently discovered one called Nutch.

If you're not tied down to a platform, I've had very good experiences with Nutch in the past.

It's written in Java and goes hand in hand with the Lucene indexer.
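
For a rough idea of that pairing, here is a minimal sketch of feeding a crawled page into a Lucene index. It assumes a reasonably recent Lucene version; the field names and the variables holding the crawled data are illustrative:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexPage {
    public static void main(String[] args) throws Exception {
        // pageUrl and pageText would normally come from the crawler's fetch/parse steps.
        String pageUrl = "http://example.com/";
        String pageText = "Example page content extracted by the parser.";

        try (Directory dir = FSDirectory.open(Paths.get("crawl-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("url", pageUrl, Field.Store.YES));   // stored as-is, not tokenized
            doc.add(new TextField("content", pageText, Field.Store.NO)); // tokenized for full-text search
            writer.addDocument(doc);
        }
    }
}
```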
