Question

I'm extracting content from Microsoft Word 97-2007 documents (.doc) and storing it in a Solr field (in order to show context snippets for highlighting). The extracted content does not seem to be properly filtered: lots of special characters are stored, while I only want the plain text. When I print the snippets, they look like this:

[Screenshot: context snippets from the MS Word .doc]

Is there any way to filter out or strip the special characters? It would also be nice - but not necessary - to remove the text that comes from Word field codes as well, like NUMPAGES.

This is the ExtractingRequestHandler I use:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
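For reference, one server-side way to scrub a stored value before it is written is an update request processor chain using Solr's RegexReplaceProcessorFactory (available since Solr 4.0). This is an untested sketch - the chain name is illustrative, and the pattern strips Unicode control (`\p{Cc}`) and format (`\p{Cf}`) characters:

```xml
<updateRequestProcessorChain name="strip-nonprintable">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">file_content</str>
    <str name="pattern">[\p{Cc}\p{Cf}]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be referenced from the /update/extract handler's defaults with `<str name="update.chain">strip-nonprintable</str>`.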

The RequestHandler is used via SolrJ, with these parameters:

up.setParam("fmap.content", "file_content");
up.setParam("fmap.title", "title_text");

and the file_content field is defined like this:

<field name="file_content" type="text_printable" indexed="false" stored="true"/>

Although I don't think the field type matters (since the field is not indexed), here it is anyway:

<fieldType name="text_printable" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

Edit: I forgot to mention that I'm using Solr 4.4.0, which ships with Tika 1.4.


The solution

It turns out this is partially fixed in Tika 1.5.

This is what it looks like now:

[Screenshot: the snippets after upgrading to Tika 1.5]

I say partially fixed, because there are still some special characters related to dynamic page numbering in the Table of Contents.
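Until that is fixed upstream, those leftover characters can be scrubbed client-side before the document is sent to Solr. Below is a minimal sketch (the class name is illustrative, not part of SolrJ or Tika) that strips Unicode control and format characters and collapses the resulting whitespace - Word field codes such as NUMPAGES would still need separate handling:

```java
import java.util.regex.Pattern;

public class PlainTextCleaner {
    // \p{Cc} = control characters, \p{Cf} = format characters (e.g. zero-width
    // marks); both categories cover most of the junk left in .doc extractions.
    private static final Pattern NON_PRINTABLE = Pattern.compile("[\\p{Cc}\\p{Cf}]+");
    private static final Pattern EXTRA_SPACE = Pattern.compile("\\s{2,}");

    public static String clean(String extracted) {
        String s = NON_PRINTABLE.matcher(extracted).replaceAll(" ");
        return EXTRA_SPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "Heading\u0001\u0013 PAGE \u0014 1 \u0015\nbody text";
        System.out.println(clean(raw)); // prints "Heading PAGE 1 body text"
    }
}
```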

According to the nice people in #solr on Freenode, Apache Tika 1.5 is supposed to ship with Solr 4.8.0. As a temporary fix until 4.8.0 is released, I simply downloaded Tika 1.5 and put tika-core-1.5.jar and tika-parsers-1.5.jar in Solr's contrib/extraction/lib directory. I also had to delete the old jars, namely tika-core-1.4.jar and tika-parsers-1.4.jar. It has worked flawlessly so far.
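The jar swap can be scripted; the sketch below assumes a default Solr 4.4 directory layout (adjust the path to your installation) and pulls the jars from their standard Maven Central coordinates:

```shell
cd /path/to/solr/contrib/extraction/lib

# remove the Tika 1.4 jars that Solr 4.4.0 ships with
rm tika-core-1.4.jar tika-parsers-1.4.jar

# fetch Tika 1.5 from Maven Central
curl -O https://repo1.maven.org/maven2/org/apache/tika/tika-core/1.5/tika-core-1.5.jar
curl -O https://repo1.maven.org/maven2/org/apache/tika/tika-parsers/1.5/tika-parsers-1.5.jar
```

Restart Solr afterwards so the new jars are picked up.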

Licensed under: CC-BY-SA with attribution