Nutch: parse custom XML with Tika using XPath

https://stackoverflow.com/questions/20704068

nutch
apache-tika

20-09-2022

Question

I am new to Nutch and am using Nutch 1.7. I am looking for ways to parse custom XML files based on XPath and store the data. I did see the xml_parser plugin, but that has been suspended since Tika took over. How do I configure the Tika parser embedded in Nutch 1.7 to parse the URL content based on XPath? I have searched all of the Nutch documentation and wiki, but there is not much information there. Tika tries to parse and extract the content, which fails because of the custom format, but I want to store the XML with tags based on the XPath. Where should I put the XPath info in the Nutch conf? Or do I have to override the Tika parser?

Any hints in the right direction would be much appreciated.

Thanks.

Solution

I don't think you can easily do this with Tika, but you can use one of these custom plugins to parse XML files based on XPath:

https://github.com/BayanGroup/nutch-custom-search
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
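To give a rough idea of the kind of extraction such a plugin has to do, here is a minimal standalone sketch using only the JDK's built-in `javax.xml.xpath` API. It is not the API of either plugin linked above and it does not plug into Nutch by itself; the class name `XPathFieldExtractor`, the field names, and the sample XML are all made up for illustration. In a real setup you would presumably call something like this from your own Nutch parse plugin and copy the resulting values into the parse metadata.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or Tika.
class XPathFieldExtractor {

    /**
     * Evaluates each XPath expression against the XML document and
     * returns a map of field name to the trimmed string value.
     */
    static Map<String, String> extract(String xml, Map<String, String> fieldToXPath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Crawled content is untrusted, so refuse DOCTYPE declarations (XXE hardening).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : fieldToXPath.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match,
                // or an empty string when nothing matches.
                result.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc).trim());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML or evaluate XPath", ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document standing in for the custom XML format.
        String xml = "<catalog><item id=\"42\"><name>Widget</name>"
                   + "<price>9.99</price></item></catalog>";

        // Field name -> XPath; in a plugin this mapping would come from configuration.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "/catalog/item/@id");
        fields.put("name", "/catalog/item/name");
        fields.put("price", "/catalog/item/price");

        System.out.println(extract(xml, fields));
    }
}
```

Keeping the field-to-XPath mapping in a map rather than hard-coding it makes it straightforward to load the expressions from a config file, which is roughly what the question is asking for.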

Licensed under: CC-BY-SA with attribution
