Question

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.

Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.

However, I do want Nutch to crawl all the other pages and look for links to pages that match. I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).

What's the normal or least painful way to set up Nutch->Solr to work like this?


Solution

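It looks like the only way to do this is to write your own IndexingFilter plugin (or find someone else's to copy from). Nutch still fetches and parses every page, so links on non-matching pages are followed, but an indexing filter that returns null for a page keeps that page out of Solr.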

[Will add my sample plugin code here when it's working properly]
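In the meantime, here is a rough sketch of the decision such a plugin would make. It is not the plugin itself: `RegexIndexGate`, `shouldIndex`, and the `wiki.example.com` URL pattern are hypothetical names made up for illustration, and the wiring into Nutch is shown only in a comment, assuming the Nutch 1.x `IndexingFilter` interface.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Nutch): decides whether a crawled
 * URL should be passed on to Solr, based on a single regex.
 */
final class RegexIndexGate {
    private final Pattern allowed;

    RegexIndexGate(String regex) {
        this.allowed = Pattern.compile(regex);
    }

    /** Returns true only if the whole URL matches the configured pattern. */
    boolean shouldIndex(String url) {
        return url != null && allowed.matcher(url).matches();
    }

    // Assumption: in a Nutch 1.x plugin this check would sit inside the
    // IndexingFilter implementation, roughly like this:
    //
    //   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    //                               CrawlDatum datum, Inlinks inlinks) {
    //       return gate.shouldIndex(url.toString()) ? doc : null; // null = skip page
    //   }

    public static void main(String[] args) {
        // Example pattern only; replace with the regex for the pages you want indexed.
        RegexIndexGate gate =
            new RegexIndexGate("https://wiki\\.example\\.com/display/DOCS/.*");
        System.out.println(gate.shouldIndex("https://wiki.example.com/display/DOCS/Setup")); // true
        System.out.println(gate.shouldIndex("https://wiki.example.com/display/HR/Policy"));  // false
    }
}
```

For this to work, the crawl-side URL filters (for example `regex-urlfilter.txt`) would still need to accept all pages, since they control what Nutch fetches; only the indexing step is restricted to the matching subset.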


Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow