Lucene SpanNearQuery partial matching

https://stackoverflow.com/questions/2021839

19-09-2019
|

Question

Given a document {'foo', 'bar', 'baz'}, I want to match using SpanNearQuery with the tokens {'baz', 'extra'}

But this fails.

How do I go around this?

Sample test (using lucene 2.9.1) with the following results:

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

Solution

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) – you could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

alt text http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png

In this sample text, Lucene is within 3 of Doug

But for your example, the only match I can see is that both your query and the target document have "cd" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.

Edit 3 - in response to latest comment, the answer is that you cannot use SpanNearQuery to do anything other than that which it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post it), but in the last case if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.

Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want as well as the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

(as an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA occur within 100 places of each other; adjust the position and boost as you like. This example is from the Solr cookbook so it may not parse in Lucene, or may give undesirable results. That's ok, because in the next section I show you how to build this using the API).

Programmatically, you would construct this as follows:

Query top = new BooleanQuery();

// Construct the terms since they will be used more than once
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);

Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow