Retrieving extracted text with Apache Solr

https://stackoverflow.com/questions/4948587

07-11-2019
|

質問

I'm new to Apache Solr, and I want to use it for indexing pdf files. I managed to get it up and running so far and I can now search for added pdf files.

However, I need to be able to retrieve the searched text from the results.

I found an xml snippet in the default solrconfig.xml concerning exactly that:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
<lst name="defaults">
  <!-- All the main content goes into "text"... if you need to return
       the extracted text or do highlighting, use a stored field. -->
  <str name="fmap.content">text</str>
  <str name="lowernames">true</str>
  <str name="uprefix">ignored_</str>

  <!-- capture link hrefs but ignore div attributes -->
  <str name="captureAttr">true</str>
  <str name="fmap.a">links</str>
  <str name="fmap.div">ignored_</str>
</lst>

From what I get from here (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika), I think I have to add a new field to schema.xml (e.g. "content") that has stored="true" and indexed="true". However, I'm not really sure how to accomplish this exactly?

any help appreciated, thx

正しい解決策はありません

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow