Question

I need to do some synonym matching with Solr.

For instance, Swedish street names usually have the form Foogatan, where gatan means street. The name can also be written abbreviated as Foog. (kind of like writing st. for street in English).

I'm familiar with how synonyms.txt works, but I don't know how to create a synonym that checks for some letters before gatan or before g.

I would need a synonym that would match *g. and *gatan.

I ended up doing this (it seems to work as a rough draft of what I'm after):

public boolean incrementToken() throws IOException {

    // See http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/

    if (!input.incrementToken()) return false;

    String string = charTermAttr.toString();

    boolean containsGatan = string.contains("gatan");
    boolean containsG = string.contains("g.");

    if (containsGatan) {

        string = string.replace("gatan", "g.");

        char[] newBuffer = string.toCharArray();

        charTermAttr.setEmpty();
        charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);

        return true;
    }

    if (containsG) {

        string = string.replace("g.", "gatan");

        char[] newBuffer = string.toCharArray();

        charTermAttr.setEmpty();
        charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);

        return true;
    }

    // Pass all other tokens through unchanged; returning false here would end the token stream early.
    return true;
}

A similar problem I have is that phone numbers can be written as either 031-123456 or 031123456. When searching for a phone number like 031123456, it should also find 031-123456.

How can I achieve this in Solr?


Solution

For the first one, you could write a custom TokenFilter and hook it into your analyzers. It's not that hard; take a look at org.apache.lucene.analysis.ASCIIFoldingFilter for a simple example (in Lucene 4 and later the class lives in the org.apache.lucene.analysis.miscellaneous package).

The second one could possibly be solved with PatternReplaceCharFilterFactory: http://docs.lucidworks.com/display/solr/CharFilterFactories

You would have to remove the '-' character from the numbers at both index and query time, so that only digits are indexed and searched. Similar question: Solr PatternReplaceCharFilterFactory not replacing with specified pattern
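To get a feel for what such a pattern replacement does, here is a minimal plain-Java sketch. The class name PhoneNormalizer and the regular expression are assumptions made for illustration, not something taken from Solr itself. Since PatternReplaceCharFilterFactory is built on Java regular expressions, a pattern like this one, with an empty replacement, would be a reasonable starting point for its pattern attribute.

```java
import java.util.regex.Pattern;

// Hypothetical helper, only meant to preview the effect of the regex outside Solr.
final class PhoneNormalizer {

    // Assumed pattern: a hyphen that sits directly between two digits.
    private static final Pattern HYPHEN_BETWEEN_DIGITS =
            Pattern.compile("(?<=\\d)-(?=\\d)");

    // Removes hyphens between digits and leaves everything else untouched.
    static String normalize(String text) {
        return HYPHEN_BETWEEN_DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("031-123456")); // 031123456
        System.out.println(normalize("031123456"));  // 031123456
        System.out.println(normalize("Foo-gatan"));  // Foo-gatan
    }
}
```

Because the same replacement would run at index time and at query time, both spellings of the number end up as the same digits-only text.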

A simple example that removes gatan from the end of each token:

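import java.io.IOException;

// Assumption: StringUtils comes from Apache Commons Lang 2.x (in Commons Lang 3 the package is org.apache.commons.lang3).
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Gatanizer extends TokenFilter {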

    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);

    /**
     * Construct a token stream filtering the given input.
     */
    protected Gatanizer(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {

            final char[] buffer = termAttribute.buffer();
            final int length = termAttribute.length();

            String tokenString = new String(buffer, 0, length);
            tokenString = StringUtils.removeEnd(tokenString, "gatan");

            termAttribute.setEmpty();
            termAttribute.append(tokenString);

            return true;
        }

        return false;
    }

}
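The Gatanizer above only strips the gatan suffix. For Foog. and Foogatan to match each other, both spellings have to end up as the same token. The following plain-Java sketch shows one way the per-token logic might look, assuming the abbreviated form still carries its trailing period when the filter sees it (depending on the tokenizer, the period may already have been removed). The class name StreetSuffixNormalizer and the choice of the long form as the canonical one are assumptions for illustration.

```java
import java.util.Locale;

// Hypothetical helper: maps "Foog." and "Foogatan" to the same canonical token.
final class StreetSuffixNormalizer {

    private static final String ABBREVIATION = "g.";
    private static final String FULL_SUFFIX = "gatan";

    // Lowercases the token and expands a trailing "g." to "gatan",
    // but only if at least one character precedes the abbreviation.
    static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.length() > ABBREVIATION.length() && lower.endsWith(ABBREVIATION)) {
            String stem = lower.substring(0, lower.length() - ABBREVIATION.length());
            return stem + FULL_SUFFIX;
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foog."));    // foogatan
        System.out.println(normalize("Foogatan")); // foogatan
        System.out.println(normalize("g."));       // g.
    }
}
```

Inside a TokenFilter, the same logic would be applied to the term text in incrementToken() before writing the result back to the CharTermAttribute.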

I then registered the TokenFilter on a Solr field type:

    <fieldtype name="gatan" stored="false" indexed="true" multiValued="true" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="gatanizer.GatanizerFactory"/>
        </analyzer>
    </fieldtype>

You'll also need a simple GatanizerFactory, a token filter factory whose create method wraps the incoming TokenStream in your Gatanizer and returns it.

Licensed under: CC-BY-SA with attribution