Content extraction from PDF files in Solr using Apache Tika

https://stackoverflow.com/questions/18767945

28-06-2022

Question

I am trying to index PDF files in Solr by following this tutorial: http://wiki.apache.org/solr/ExtractingRequestHandler. But every time I run the command

java -jar post.jar *.pdf

it fails with an error like org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3. Please help me index PDFs into the Solr server. Is there any integration other than Tika that could help?

Solution

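post.jar is just a utility for uploading files to Solr. By default it posts to the /update handler as XML, which is most likely why a binary PDF triggers the invalid UTF-8 error.
PDFs have to go through the extract handler (ExtractingRequestHandler), so you need to pass its URL and the content type explicitly, e.g.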

java -Durl=http://localhost:8983/solr/update/extract?literal.id=1 -Dtype=application/pdf -jar post.jar 1.pdf

For encrypted files, check link
For password-protected files, check link

Other tips

There is obviously some encoding issue here.

I remember doing something like this a few months ago, and it is fairly easy if you can write your own piece of Java code. These are mostly simple to write, and they work like a charm!
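To give an idea of what such a piece of Java code might look like, here is a minimal sketch that sends a PDF to the extract handler over plain HTTP, using only the JDK's built-in HttpClient (Java 11+). The class name SolrPdfPoster, the helper names, the base URL http://localhost:8983/solr and the commit=true parameter are my own assumptions for illustration, not something from the original setup; adjust them to your core and handler path.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class SolrPdfPoster {

    // Builds the extract-handler URL. literal.id sets the document's id field;
    // commit=true (an assumption here) asks Solr to commit right after indexing.
    public static URI extractUri(String baseUrl, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Builds a POST request whose body is the raw PDF bytes, sent as application/pdf
    // so Solr hands the stream to Tika instead of trying to parse it as XML.
    public static HttpRequest buildRequest(String baseUrl, String docId, Path pdf)
            throws IOException {
        return HttpRequest.newBuilder(extractUri(baseUrl, docId))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofFile(pdf))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java SolrPdfPoster.java <doc-id> <file.pdf>");
            return;
        }
        // Hypothetical local Solr instance; change to match your installation.
        HttpRequest request =
                buildRequest("http://localhost:8983/solr", args[0], Path.of(args[1]));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

You would run it as something like java SolrPdfPoster.java 1 1.pdf. The important part is the same as with post.jar: the request goes to /update/extract with a PDF content type, not to /update.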

License: CC-BY-SA with attribution

Not affiliated with StackOverflow