How to enable logs or figure out which parsers are being called in Apache Tika

StackOverflow https://stackoverflow.com/questions/23355809

  •  11-07-2023

Question

I want to know what is going on inside this call:

java -jar tika-app-1.5.jar -j -v banana-gif.wbmp

I have tried all of the image parsers and the auto-detect parser from my own code, but none of them matches the output I get from this command:

 { "Chroma BlackIsZero":"true",
"Content-Length":63552,
"Content-Type":"image/vnd.wap.wbmp",
"Dimension ImageOrientation":"Normal",
"height":534,
"resourceName":"banana-gif.wbmp",
"tiff:ImageLength":534,
"tiff:ImageWidth":950,
"width":950 }

I want to enable full logging when running this command, or otherwise find out which parsers are being called.


Solution

Step one: work out what kind of file Tika thinks it is. You can either read that from the metadata if you already have it (the Content-Type entry), or ask the Tika App using the --detect option:

$ java -jar tika-app-1.5.jar --detect wireframe.pdf 
application/pdf
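
If you already have the JSON metadata from a -j run, as in the question, the detected type is the Content-Type entry. A minimal sketch of pulling it out with sed follows; content_type_from_json is a hypothetical helper name, and it assumes the simple "key":"value" layout shown in the question rather than arbitrary JSON:

```shell
# Hypothetical helper: read Tika's -j output on stdin and print the
# value of the "Content-Type" entry. Assumes the flat "key":"value"
# layout shown in the question; this is not a general JSON parser.
content_type_from_json() {
  sed -n 's/.*"Content-Type" *: *"\([^"]*\)".*/\1/p'
}
```

It would be used as: java -jar tika-app-1.5.jar -j banana-gif.wbmp | content_type_from_json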

Next up, you need to get the list of all the parsers that the Tika App version you're using knows about, along with the mime types that they handle:

$ java -jar tika-app-1.5.jar --list-parser-details | grep -B 2 -A 2 application/pdf
  application/vnd.oasis.opendocument.chart
org.apache.tika.parser.pdf.PDFParser
  application/pdf
org.apache.tika.parser.pkg.CompressorParser
  application/x-bzip

From that, we see that a PDF file would be handled by org.apache.tika.parser.pdf.PDFParser, the parser class listed directly above the application/pdf line.

Doing the same lookup for your specific case, image/vnd.wap.wbmp on Tika 1.5, shows that the parser being used is org.apache.tika.parser.image.ImageParser.
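
The grep window above has to be read by eye, so here is a rough sketch of automating the lookup. parser_for_mime is a hypothetical helper, not part of Tika. It assumes the listing looks like the excerpt above: each parser class name on its own line, followed by the MIME types it handles, and it treats any line containing a slash as a MIME type.

```shell
# Hypothetical helper: read --list-parser-details style text on stdin
# and print the parser class listed above the given MIME type.
parser_for_mime() {
  awk -v mime="$1" '
    { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }  # strip indentation
    line == ""   { next }                              # skip blank lines
    line !~ /\// { parser = line; next }               # no slash: a parser class line
    line == mime { print parser }                      # MIME type under the last parser seen
  '
}
```

It would be used as: java -jar tika-app-1.5.jar --list-parser-details | parser_for_mime image/vnd.wap.wbmp

If a MIME type appears under more than one parser, each owning parser is printed on its own line.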

(Note: all of this applies to the Tika App executable jar, where you can't change which parser gets used. If you were calling Tika from your own Java code you'd have more options available, but you would also have to make sure you include all of the dependency jars for it to work properly!)

License: CC-BY-SA with attribution