How do I tell Nutch to crawl through a URL without storing it?

https://stackoverflow.com/questions/18477167

solr
intranet
search-engine
nutch

26-06-2022

Question

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.

Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.

However, I do want Nutch to crawl all the other pages, looking for links to pages that match. I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).

What's the normal or least painful way to set up Nutch and Solr to work like this?

Solution

It looks like the only way to do this is to write your own IndexingFilter plugin (or find someone else's to copy from).

[Will add my sample plugin code here when it's working properly]
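In the meantime, here is a minimal sketch of the decision logic such a filter could use. It is plain Java with no Nutch dependency, so it is not the plugin itself. The class name `IndexDecision`, the `confluence.example.com` host, and the `/display/DOCS/` pattern are all made up for illustration. The idea, as far as I understand Nutch's indexing filter chain, is that a filter returning `null` for a document causes that page to be skipped at indexing time, while the crawler has still fetched and parsed it and followed its links.

```java
import java.util.regex.Pattern;

// Stand-in for the core of an indexing filter; NOT the real Nutch interface.
class IndexDecision {

    // Hypothetical pattern: only pages in the DOCS space should reach Solr.
    private static final Pattern INDEXABLE =
        Pattern.compile("^https?://confluence\\.example\\.com/display/DOCS/.*$");

    // True if the URL belongs to the subset that should be searchable.
    static boolean shouldIndex(String url) {
        return url != null && INDEXABLE.matcher(url).matches();
    }

    // Mirrors the shape of a filter step: pass the document through
    // unchanged if its URL matches, otherwise return null to drop it.
    static <T> T filter(T doc, String url) {
        return shouldIndex(url) ? doc : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://confluence.example.com/display/DOCS/Install+Guide",
            "https://confluence.example.com/display/HR/Holidays"
        };
        for (String url : urls) {
            // Print whether each crawled page would be kept or dropped.
            String result = filter("doc", url) == null ? "skip" : "index";
            System.out.println(result + " " + url);
        }
    }
}
```

In a real plugin the same check would go inside the `filter` method of a class implementing Nutch's `IndexingFilter`, using the URL that Nutch passes in. The crawl-time URL filters (for example `regex-urlfilter.txt`) would still need to accept all of the Confluence pages so that the non-matching ones get crawled for links.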

References:

http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
http://florianhartl.com/nutch-plugin-tutorial.html
How to filter URLs in Nutch 2.1 solrindex command

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
