Question

I am new to Nutch and am using Nutch 1.7. I am looking for a way to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but it has been suspended since Tika took over. How do I configure the Tika parser embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched all the Nutch documentation and the wiki, but there is not much information there. Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with its tags based on the XPath. Where should I put the XPath information in the Nutch conf? Or do I have to override the Tika parser?

Any hints in the right direction would be much appreciated.

Thanks.

Solution

I don't think you can do this easily with Tika, but you can use custom plugins that parse XML files based on XPath.
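To give an idea of what such a plugin does at its core, here is a minimal sketch of XPath-based field extraction using only the JDK's `javax.xml.xpath` API. It is not Nutch or Tika code: the class name `XPathFieldExtractor`, the `extract` method, the sample XML, and the field-to-XPath mapping are all made up for illustration. In a real setup this logic would presumably live inside a custom Nutch parse plugin, with the XPath expressions read from configuration rather than hard-coded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
public class XPathFieldExtractor {

    // Evaluates each XPath expression against the XML and returns,
    // per field name, the text of every matching node.
    public static Map<String, List<String>> extract(
            String xml, Map<String, String> fieldXPaths) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldXPaths.entrySet()) {
            NodeList nodes = (NodeList) xpath.evaluate(
                    entry.getValue(), doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            result.put(entry.getKey(), values);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document standing in for the custom XML format.
        String xml = "<catalog><item id=\"1\"><title>First</title></item>"
                   + "<item id=\"2\"><title>Second</title></item></catalog>";

        // Assumed mapping of field name to XPath expression.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "/catalog/item/title");
        fields.put("id", "/catalog/item/@id");

        System.out.println(extract(xml, fields));
    }
}
```

The extracted values could then be attached to the parsed document as metadata by the plugin, so that they are stored and indexed as separate fields instead of being flattened into plain text.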

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow