Question

I have a web-accessible file system containing many thousands of PDF files that I need Solr (with Lucidworks) to index.

I have an XML file containing data corresponding to each. The XML contains the ID, some simple metadata, and the URL of its corresponding PDF in the file system.

Currently, I am able to format the XML in such a way that Solr reads it and indexes all the metadata I need, including the URL of the PDF.

I would like Solr to, as it's parsing the files, actually follow the URL and index the referenced PDF data along with the XML-supplied metadata. Is this possible?


Solution

Your best bet (on pure Solr) would probably be a DataImportHandler with nested entities.

The outer entity's processor would be XPathEntityProcessor, and inside it you can nest an entity that uses TikaEntityProcessor with an appropriate data source. Use the outer entity's variables to construct the URL and pass it to the inner entity.

Remember to mark the outer (XPath) entity as rootEntity=false to ensure that Solr documents are created for the inner entities.
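
To make the nesting concrete, here is a minimal sketch of what such a data-config.xml might look like. The file path, the XPath expressions, the entity names (meta, pdf), and the column/field names (id, title, pdf_url, content) are all assumptions for illustration and need to match your own XML layout and Solr schema. BinURLDataSource is used for the inner entity on the assumption that the PDFs are fetched over HTTP.

```xml
<dataConfig>
  <!-- Reads the metadata XML from local disk (path below is hypothetical) -->
  <dataSource name="xml" type="FileDataSource" encoding="UTF-8"/>
  <!-- Fetches binary content (the PDFs) from a URL -->
  <dataSource name="bin" type="BinURLDataSource"/>

  <document>
    <!-- Outer entity: one row per record in the metadata XML.
         rootEntity="false" so that Solr documents are created
         for the inner entity instead. -->
    <entity name="meta"
            dataSource="xml"
            processor="XPathEntityProcessor"
            url="/path/to/metadata.xml"
            forEach="/records/record"
            rootEntity="false">
      <!-- Hypothetical XPaths: adjust to the real XML structure -->
      <field column="id"      xpath="/records/record/id"/>
      <field column="title"   xpath="/records/record/title"/>
      <field column="pdf_url" xpath="/records/record/url"/>

      <!-- Inner entity: follows the URL taken from the outer row
           and extracts the PDF text with Tika -->
      <entity name="pdf"
              dataSource="bin"
              processor="TikaEntityProcessor"
              url="${meta.pdf_url}"
              format="text"
              onError="skip">
        <!-- "text" is the column Tika exposes for the extracted body;
             "content" is an assumed field name in the Solr schema -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The `${meta.pdf_url}` variable is how the outer row's URL reaches the inner entity. With `onError="skip"`, a PDF that cannot be fetched or parsed should cause that document to be skipped rather than aborting the whole import; whether that is the behaviour you want depends on your data.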

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow