Apache Solr - indexing PDF files

https://stackoverflow.com/questions/9934043

27-05-2021

Question

Hi, I have tried doing this with the binary distribution and have also compiled the source code myself. I have tried running it under Apache Tomcat as well. But I always get the following error when I try to index a PDF file. I am using the post.jar provided in the example project that ships with Solr.

SimplePostTool: version 1.3
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file 4538a001.pdf
SimplePostTool: FATAL: Solr returned an error #400 Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)

I have also tried running this on both Windows 7 (JDK 1.7) and CentOS (JDK 1.6).

I searched the internet and found patched versions of the Jetty JAR files on the bug tracker, but the problem persists even after replacing them.

I would really appreciate help; I am stuck here and cannot proceed with further tasks.

Thanks

Answer

Solr's /update handler expects update messages in a specific XML format, so it rejects the raw PDF file.
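
For illustration, here is a minimal Python sketch of what such an XML update message looks like. The helper name build_add_message and the field names "id" and "title" are assumptions made for this example; the fields you actually send must match your own schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build a Solr XML <add> message for one document.

    `fields` maps field names to string values. The names used by the
    caller are hypothetical and must exist in the target schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_message({"id": "doc1", "title": "Hello"}))
```

A PDF is binary data rather than text in this shape, which is consistent with the UTF-8 decoding error in the log above.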

You can configure an extracting request handler that will parse the PDF file, then process the extracted text as an update.

See: http://wiki.apache.org/solr/ExtractingRequestHandler
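
As a rough sketch of how a client could send the PDF to the extracting handler instead of the plain update handler, the Python below posts the raw file bytes to /update/extract. It assumes the handler is registered at that path in solrconfig.xml (as in the Solr example configuration), that the unique key field is called "id", and that Solr runs at the base URL shown in the log above; the function names and the document id are hypothetical.

```python
import urllib.parse
import urllib.request


def build_extract_request(pdf_bytes, doc_id,
                          solr_base="http://localhost:8983/solr"):
    """Prepare a POST to the extracting request handler.

    Assumes the handler is mapped to /update/extract and that the
    schema's unique key field is named "id".
    """
    params = urllib.parse.urlencode({
        "literal.id": doc_id,  # supplies the document's unique key
        "commit": "true",      # commit right away so the doc is searchable
    })
    url = f"{solr_base}/update/extract?{params}"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def index_pdf(path, doc_id, solr_base="http://localhost:8983/solr"):
    """Read a PDF from disk, send it to Solr, and return the HTTP status."""
    with open(path, "rb") as fh:
        request = build_extract_request(fh.read(), doc_id, solr_base)
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # "doc1" is a made-up id; the file name is taken from the question's log.
    print(index_pdf("4538a001.pdf", "doc1"))
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and headers without a running Solr instance.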

License: CC-BY-SA with attribution

Not affiliated with StackOverflow