Solr tika not storing any data

https://stackoverflow.com/questions/18264162

24-06-2022
Question

I am facing a peculiar problem. I configured my data config and schema as described on the Solr wiki page for Tika DIH.

The data config looks like this:

<dataConfig>
    <dataSource type="BinURLDataSource" name="bin"/>
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="http://adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_open_parameters.pdf"
                dataSource="bin" format="text">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>

The schema looks like this:

<fields>
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="author" type="string" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="true"/>
</fields>
<uniqueKey>text</uniqueKey>

I also have an executable Tika jar, and it processes the document above perfectly when I run it from the command line. With Solr, however, the data import produces an empty set of fields: the import succeeds, but every field in the resulting document is empty. Where am I going wrong?

I tried the ExtractingRequestHandler as well. This is how my request handler is set up:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
    <lst name="defaults">
        <str name="fmap.Last-Modified">last_modified</str>
        <str name="uprefix">ignored_</str>
    </lst>
</requestHandler>

When I send the following request:

curl "http://localhost:3533/solr/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=/home/superq/Downloads/tutorial.html"

I get an empty response like:

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">13</int></lst></response>

The log files contain nothing helpful either, and the document is not indexed. It also seems that nothing is actually being processed: changing the target file name to a file that does not exist DOES NOT produce an error, as it should.

My questions are:

1) For Solr-Tika integration, is it enough to copy the relevant Tika files (build artifacts) into the Solr library path, or do I also need to install Tika as a service?

2) To convert files, do I need to create a binary version of the .doc/.pdf file and then feed it to Solr? The literature I found on this was rather confusing. Shouldn't Tika take care of this?
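
One possible explanation for the missing error, offered as a sketch rather than a confirmed diagnosis: curl's -F option uploads a file only when the value is prefixed with @. Without the prefix, curl sends the path itself as a literal string, which would be consistent with a nonexistent file name producing no error. The wrapper below is hypothetical (post_to_extract, SOLR_URL and DRY_RUN are names invented for this example); the URL and parameters are copied from the request above.

```shell
#!/bin/sh
# Base URL copied from the question; override SOLR_URL for your own core.
SOLR_URL="${SOLR_URL:-http://localhost:3533/solr/solr}"

# post_to_extract FILE ID
# Uploads FILE to the /update/extract handler with literal.id=ID.
# Set DRY_RUN=1 to print the curl command instead of running it.
post_to_extract() {
    file=$1
    id=$2

    # Fail early so that a wrong path is reported before any request is sent.
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi

    url="$SOLR_URL/update/extract?literal.id=$id&commit=true"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'curl "%s" -F "myfile=@%s"\n' "$url" "$file"
        return 0
    fi

    # The leading @ tells curl to send the file's contents,
    # not the path as a plain text form field.
    curl "$url" -F "myfile=@$file"
}
```

For example, `DRY_RUN=1 post_to_extract /home/superq/Downloads/tutorial.html doc1` prints the command that would be sent, so it can be compared with the original request before anything is posted.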

Answer

My article on setting up Tika and the ExtractingRequestHandler may be of use to you:

http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html

License: CC-BY-SA with attribution

Not affiliated with StackOverflow