Question

I'm extracting content from Microsoft Word 97-2007 documents (.doc) and storing it in a Solr field (in order to show context snippets for highlighting). The extracted content does not seem to be properly filtered: lots of special characters are stored, while I only want the plain text. When I print the snippets, they look like this:

[Screenshot: context snippets from the MS Word .doc]

Is there any way to filter out or strip the special characters? It would also be nice - but not necessary - to remove the text that comes from Word field codes as well, like NUMPAGES.

This is the ExtractingRequestHandler I use:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
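For reference, one server-side way to scrub a stored value before it is written is an update request processor chain using Solr's RegexReplaceProcessorFactory (available since Solr 4.0). This is an untested sketch - the chain name is illustrative, and the pattern strips Unicode control (`\p{Cc}`) and format (`\p{Cf}`) characters:

```xml
<updateRequestProcessorChain name="strip-nonprintable">
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">file_content</str>
    <str name="pattern">[\p{Cc}\p{Cf}]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be referenced from the /update/extract handler's defaults with `<str name="update.chain">strip-nonprintable</str>`.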

The RequestHandler is used via SolrJ, with these parameters:

up.setParam("fmap.content", "file_content");
up.setParam("fmap.title", "title_text");

and the file_content field is defined like this:

<field name="file_content" type="text_printable" indexed="false" stored="true"/>

Although I don't think the field type matters (since the field is not indexed), here it is anyway:

<fieldType name="text_printable" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

Edit: I forgot to mention that I'm using Solr 4.4.0, which ships with Tika 1.4.


The solution

It turns out this is partially fixed in Tika 1.5.

This is what it looks like now:

[Screenshot: the snippets after upgrading to Tika 1.5]

I say partially fixed, because there are still some special characters related to dynamic page numbering in the Table of Contents.
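Until that is fixed upstream, those leftover characters can be scrubbed client-side before the document is sent to Solr. Below is a minimal sketch (the class name is illustrative, not part of SolrJ or Tika) that strips Unicode control and format characters and collapses the resulting whitespace - Word field codes such as NUMPAGES would still need separate handling:

```java
import java.util.regex.Pattern;

public class PlainTextCleaner {
    // \p{Cc} = control characters, \p{Cf} = format characters (e.g. zero-width
    // marks); both categories cover most of the junk left in .doc extractions.
    private static final Pattern NON_PRINTABLE = Pattern.compile("[\\p{Cc}\\p{Cf}]+");
    private static final Pattern EXTRA_SPACE = Pattern.compile("\\s{2,}");

    public static String clean(String extracted) {
        String s = NON_PRINTABLE.matcher(extracted).replaceAll(" ");
        return EXTRA_SPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "Heading\u0001\u0013 PAGE \u0014 1 \u0015\nbody text";
        System.out.println(clean(raw)); // prints "Heading PAGE 1 body text"
    }
}
```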

According to the nice people in #solr on Freenode, Apache Tika 1.5 is supposed to ship with Solr 4.8.0. As a temporary fix until 4.8.0 is released, I simply downloaded Tika 1.5 and put tika-core-1.5.jar and tika-parsers-1.5.jar in Solr's contrib/extraction/lib directory. I also had to delete the old jars, namely tika-core-1.4.jar and tika-parsers-1.4.jar. It has worked flawlessly so far.
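The jar swap can be scripted; the sketch below assumes a default Solr 4.4 directory layout (adjust the path to your installation) and pulls the jars from their standard Maven Central coordinates:

```shell
cd /path/to/solr/contrib/extraction/lib

# remove the Tika 1.4 jars that Solr 4.4.0 ships with
rm tika-core-1.4.jar tika-parsers-1.4.jar

# fetch Tika 1.5 from Maven Central
curl -O https://repo1.maven.org/maven2/org/apache/tika/tika-core/1.5/tika-core-1.5.jar
curl -O https://repo1.maven.org/maven2/org/apache/tika/tika-parsers/1.5/tika-parsers-1.5.jar
```

Restart Solr afterwards so the new jars are picked up.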

Licensed under: CC-BY-SA with attribution