Question

Requirement

One of my text fields contains (among other things) domain names. Given (e.g.) the text "www.docs.corp.com", I would like to be able to search for "www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", or "www.docs.corp.com", and find the relevant document containing "www.docs.corp.com".

What I currently do:

Currently I use a charFilter to change "." to space before tokenizing with StandardTokenizerFactory:

<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
</fieldType>

This kind of works, but a search for "corp.com" actually searches for "corp com", and will thus find some unrelated matches, such as "... corp. com.company.www will also ..." and, of course, many other false positives.

Hypothesis

What I think I need is a token filter: something that will take the token "www.docs.corp.com" and produce multiple tokens from it: ["www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", "www.docs.corp.com"].
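The generation step itself is straightforward. Here is a standalone sketch in plain Java (the class name is made up; this is an illustration of the sub-sequence logic, not the token filter itself):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: emit every dot-joined run of consecutive
// labels of a domain name, shortest runs first.
public class SubDomainDemo {
    static List<String> subDomains(String domain) {
        String[] parts = domain.split("\\.");
        List<String> out = new ArrayList<>();
        // runs of length 1, 2, ..., parts.length, at every start offset
        for (int len = 1; len <= parts.length; ++len)
            for (int start = 0; start + len <= parts.length; ++start)
                out.add(String.join(".", Arrays.copyOfRange(parts, start, start + len)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(subDomains("www.docs.corp.com"));
        // [www, docs, corp, com, www.docs, docs.corp, corp.com,
        //  www.docs.corp, docs.corp.com, www.docs.corp.com]
    }
}
```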

Question

Is this the right approach, or am I missing something elegant, like an existing filter that I can configure to do this?


Solution

Answering my own question, for the sake of those who might look for something like this in the future.

It appears as though my proposed solution is indeed the way to go. I have gone ahead and implemented it, and am posting it here. It consists of two classes: a token filter and a token filter factory. Usage should be obvious to anyone versed in Solr.

A link to a quick write-up I did for this: http://blog.nitzanshaked.net/solr-domain-name-tokenizer/

The files:

DomainNameTokenFilterFactory.java

package com.clarityray.solr.analysis;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class DomainNameTokenFilterFactory extends TokenFilterFactory {

    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilterFactory(Map<String,String> args) {
        super(args);
        withOriginal = getBoolean(args, "withOriginal", true);
        minLen = getInt(args, "minLen", 2);
        maxLen = getInt(args, "maxLen", -1);
        if (!args.isEmpty())
            throw new IllegalArgumentException("Unknown parameters: " + args);
    }

    @Override
    public TokenStream create(TokenStream ts) {
        return new DomainNameTokenFilter(ts, minLen, maxLen, withOriginal);
    }

}

DomainNameTokenFilter.java

package com.clarityray.solr.analysis;

import java.util.Queue;
import java.util.LinkedList;
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DomainNameTokenFilter extends TokenFilter {

    private CharTermAttribute charTermAttr;
    private PositionIncrementAttribute posIncAttr;
    private Queue<String> output;

    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilter(TokenStream ts, int minLen, int maxLen, boolean withOriginal) {
        super(ts);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
        this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
        this.output = new LinkedList<String>();
        this.minLen = minLen;
        this.maxLen = maxLen;
        this.withOriginal = withOriginal;
    }

    private String join(String glue, String[] arr, int start, int end) {
        if (end < start)
            return "";
        StringBuilder sb = new StringBuilder();
        sb.append(arr[start]);
        for (int i = start+1; i <= end; ++i) {
            sb.append(glue);
            sb.append(arr[i]);
        }
        return sb.toString();
    }

    @Override
    public boolean incrementToken() throws IOException {

        // first -- output and ready tokens
        if (!output.isEmpty()) {
            charTermAttr.setEmpty();
            charTermAttr.append(output.poll());
            posIncAttr.setPositionIncrement(0);
            return true;
        }

        // no tokens ready in output buffer? get next token from input stream
        if (!input.incrementToken())
            return false;

        // get the text for the current token
        String s = charTermAttr.toString();

        // if the input does not look like a domain name, we leave it as is
        if (s.indexOf('.') == -1)
            return true;

        // create all sub-sequences
        String[] subParts = s.split("[.]");
        int actualMaxLen = Math.min(
            this.maxLen > 0 ? this.maxLen : subParts.length,
            subParts.length
        );
        for (int currentLen = this.minLen; currentLen <= actualMaxLen; ++currentLen)
            for (int i = 0; i + currentLen - 1 < subParts.length; ++i)
                output.add(join(".", subParts, i, i + currentLen - 1));

        // preserve original if so asked (unless it was already generated
        // as the longest sub-sequence above)
        if (withOriginal && actualMaxLen < subParts.length)
            output.add(s);

        // nothing generated (e.g. fewer labels than minLen with
        // withOriginal=false)? pass the original token through unchanged
        if (output.isEmpty())
            return true;

        // output first of the generated tokens
        charTermAttr.setEmpty();
        charTermAttr.append(output.poll());
        posIncAttr.setPositionIncrement(1);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // discard any buffered sub-tokens when the stream is reused
        output.clear();
    }

}
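For reference, a field type using the filter might be declared like so (illustrative; the parameter names match the factory above, and the jar containing these classes must be on Solr's classpath — note that minLen="1" is needed if you also want single labels like "www" as tokens):

```xml
<fieldType name="text_domain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="com.clarityray.solr.analysis.DomainNameTokenFilterFactory"
            withOriginal="true" minLen="1" maxLen="-1"/>
  </analyzer>
</fieldType>
```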

Hope this helps someone.

OTHER TIPS

I would use the WordDelimiterFilterFactory with the preserveOriginal option, in combination with the WhitespaceTokenizerFactory. From the documentation:

preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced due to other options). The default is 0.

The WhitespaceTokenizerFactory will leave the periods in place. When you then apply the WordDelimiterFilterFactory with the preserveOriginal option, it should index both the component parts and the original token. I'd also consider adding a LowerCaseFilterFactory, otherwise you may get mixed case in your index, which is probably not what you are looking for.

So something like this although you'll need to play with it a bit:

<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

(Note that LowerCaseFilterFactory is a token filter, not a charFilter, so it is declared as a filter after the tokenizer; placing it after the WordDelimiterFilterFactory also keeps that filter's case-change splitting intact.)

This may not get you all the way there but it should give you a good start. I'd take a look at this page for more details on the WordDelimiterFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Licensed under: CC-BY-SA with attribution