Question

I'm using Solr's DataImportHandler to index data from a database. However, the table schema uses CHAR fields, which are fixed-width, so the values carry trailing spaces.

I'm trying to remove these trailing spaces (trimming them) by using the solr.TrimFilterFactory. In my Solr schema.xml I'm using the following field type to index the data:

<fieldType name="string" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.TrimFilterFactory" updateOffsets="true" />
    </analyzer>
</fieldType>

So now I'm adding a document like:

<add>
    <doc>
        <field name="test">Test       </field>
    </doc>
</add>

I expect the trailing spaces in the test field to be removed, but when I query for test:Test*, I get:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="test">Test       </str>
        </doc>
    </result>
</response>

As you can see, the trailing spaces are not removed. I must be doing something wrong, or I have misunderstood the concept of filters. My expectation was that the query would return:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="test">Test</str>
        </doc>
    </result>
</response>

So my question is how I can make sure that when indexing these documents, all trailing spaces get removed.

Solution

Solr analyzers and filters do not modify the stored value; they only affect the indexed terms. So the TrimFilterFactory leaves the stored value untouched, and a query returns it exactly as it was submitted.

If you are using DIH, look at ScriptTransformer to modify the value before it is fed to Solr.
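
As a rough illustration of the ScriptTransformer approach, a DIH data-config.xml might look something like the sketch below. The dataSource attributes, the table name my_table, the entity name item and the function name trimTest are all placeholders made up for this example; only the test column comes from the question. Some databases return column names in upper case, so the key passed to row.get may need adjusting.

```xml
<dataConfig>
  <!-- Placeholder dataSource: substitute your own driver, URL and credentials -->
  <dataSource type="JdbcDataSource" driver="org.example.Driver" url="jdbc:example://localhost/db" />
  <script><![CDATA[
    // Called once per row; 'row' behaves like a java.util.Map of column name to value.
    function trimTest(row) {
      var value = row.get('test');
      if (value != null) {
        // Strip the trailing whitespace left over from the fixed-width CHAR column.
        row.put('test', String(value).replace(/\s+$/, ''));
      }
      return row;
    }
  ]]></script>
  <document>
    <!-- Hypothetical entity; the query and table name are assumptions -->
    <entity name="item" query="SELECT test FROM my_table" transformer="script:trimTest">
      <field column="test" name="test" />
    </entity>
  </document>
</dataConfig>
```

Because the transformer runs before the row is handed to Solr, both the stored and the indexed value should end up trimmed.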

OTHER TIPS

With newer versions of Solr, you can use the TrimFieldUpdateProcessorFactory, which trims values before the document is stored and indexed:

<updateRequestProcessorChain name="skip-empty" default="true">

   <processor class="TrimFieldUpdateProcessorFactory" />
   <processor class="RemoveBlankFieldUpdateProcessorFactory" /> 

   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />    
</updateRequestProcessorChain>
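
With no selectors configured, the trim processor applies to every string value in the document. If only some fields should be trimmed, it can be restricted with a field selector. The following is a minimal sketch, assuming the field is called test as in the question; the chain name trim-test is made up for this example.

```xml
<updateRequestProcessorChain name="trim-test" default="true">
  <!-- Trim leading and trailing whitespace, but only for the field named "test" -->
  <processor class="solr.TrimFieldUpdateProcessorFactory">
    <str name="fieldName">test</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

A chain that is not marked default="true" has to be selected explicitly, for example through the update.chain request parameter.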

This elaborates on the solution above, based on Solr 8.4 and later versions, which make it very easy to implement.

I had the same problem: most of my fields had trailing spaces, and I have a great many such fields across millions of documents.

I added the line below to solrconfig.xml. Search for the existing tag shown below and add TrimFieldUpdateProcessorFactory inside it, like this. It is only one line.

<updateRequestProcessorChain .....>
  <processor class="TrimFieldUpdateProcessorFactory" />

</updateRequestProcessorChain>

Hope that makes it easy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow