Question

I'm using Solr 3.x with a focus on German text, and it works well. Searching for umlauts (öäüß) also works well.

The problem is: I received some archived text from the late 80s, when most computers/software did not support more than ASCII; in particular, German umlauts were not supported. Instead, an alternative notation was used:

ae instead of ä
oe instead of ö
ue instead of ü
ss instead of ß

That means the name Müller was saved as Mueller.

Back to Solr: I now need to find documents that contain ue, even if the user searched for ü.

Example: if I search for all text messages from the person called Müller, Solr has to find text containing both Mueller and Müller.

How can I handle this?

Is this the appropriate feature? --> http://wiki.apache.org/solr/UnicodeCollation (I'm not sure I understand the documentation completely.)

By the way, it's not an option to change the source text via "search and replace" (all oe to ö).


Solution

As Paige Cook already pointed out, you have found the relevant documentation. But since not every Solr user knows Java, I decided to write my own answer with a little more detail.

The first step is to add the filter to your field definition:

<!-- the fieldType name is an example; a fieldType also needs name and class attributes -->
<fieldType name="text_collated" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- BEGIN OF IMPORTANT PART -->
    <filter class="solr.CollationKeyFilterFactory"
        custom="customRules.dat"
        strength="primary"
    />
    <!-- END OF IMPORTANT PART -->
  </analyzer>
</fieldType>
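For completeness, here is how such a fieldType might be wired into the rest of schema.xml. This is a hypothetical sketch: the names text_collated, text_de, name, and name_collated are my own examples (assuming the fieldType above is declared with a name such as text_collated, and that your original German text field uses a type like text_de):

```xml
<!-- hypothetical names; adjust to your schema -->
<field name="name" type="text_de" indexed="true" stored="true"/>
<field name="name_collated" type="text_collated" indexed="true" stored="false"/>
<copyField source="name" dest="name_collated"/>
```

Queries that have to match both spellings can then be run against name_collated.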

The next step is to create the necessary customRules.dat file (it is resolved relative to your core's conf directory, next to schema.xml):

To follow the documentation you have to write a tiny Java program. Unfortunately, this is a little difficult for non-Java programmers, since the code snippet in the documentation only shows the important parts. It also uses a third-party library (Apache Commons IO) that is not distributed with the JDK.

Here's the full Java 7 code needed to write a customRules.dat without any external libraries:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class RulesWriter {
    public static void main(String[] args) throws Exception {
        // Start from the JDK's standard German collation rules
        RuleBasedCollator baseCollator = (RuleBasedCollator)
                Collator.getInstance(new Locale("de", "DE"));

        // DIN 5007-2 tailorings: treat ä/ö/ü (written as base letter
        // plus combining diaeresis \u0308) like ae/oe/ue.
        // Note the capital U in the last rule: "& UE , U\u0308".
        String DIN5007_2_tailorings =
                "& ae , a\u0308 & AE , A\u0308" +
                "& oe , o\u0308 & OE , O\u0308" +
                "& ue , u\u0308 & UE , U\u0308";

        RuleBasedCollator tailoredCollator = new RuleBasedCollator(
                baseCollator.getRules() + DIN5007_2_tailorings);
        String tailoredRules = tailoredCollator.getRules();

        // Write the rules as UTF-8 to the working directory;
        // try-with-resources closes the file automatically.
        try (Writer fw = new OutputStreamWriter(
                new FileOutputStream("customRules.dat"),
                StandardCharsets.UTF_8)) {
            fw.write(tailoredRules);
        }
    }
}

Disclaimer: The above code compiles and creates a customRules.dat file, but I didn't actually test the created file with Solr.
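Before wiring the generated file into Solr, you can sanity-check the tailorings directly in the JDK: a collator built from the same rules should treat Müller and Mueller as equal at primary strength. A small sketch (the class name RulesCheck is mine; the strength and decomposition settings are my assumption of what mirrors strength="primary" in the Solr filter):

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class RulesCheck {
    // Compares two strings with the same tailored rules the
    // customRules.dat file is generated from.
    static int compareTailored(String a, String b) throws Exception {
        RuleBasedCollator base = (RuleBasedCollator)
                Collator.getInstance(new Locale("de", "DE"));
        String tailorings =
                "& ae , a\u0308 & AE , A\u0308" +
                "& oe , o\u0308 & OE , O\u0308" +
                "& ue , u\u0308 & UE , U\u0308";
        RuleBasedCollator tailored =
                new RuleBasedCollator(base.getRules() + tailorings);
        // Decompose precomposed ü (\u00FC) into u + \u0308 so the
        // tailorings apply, and ignore case/tertiary differences.
        tailored.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        tailored.setStrength(Collator.PRIMARY);
        return tailored.compare(a, b);
    }

    public static void main(String[] args) throws Exception {
        // 0 means the two spellings collate as equal
        System.out.println(compareTailored("M\u00FCller", "Mueller"));
    }
}
```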

OTHER TIPS

From my interpretation of the Unicode Collation link you provided, this is exactly the feature you need, as it shows how to solve the very issue you are having.

Looks like you will need to write a little Java to generate your appropriate customRules.dat file.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow