Lucene SpanNearQuery partial match

https://stackoverflow.com/questions/2021839

19-09-2019

Question

Given a document {'foo', 'bar', 'baz'}, I want to match it using SpanNearQuery with the tokens {'baz', 'extra'}.

But this fails.

How can I get around this?

Sample test (using Lucene 2.9.1) with the following results:

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

Solution

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) - you could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

(Image: http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png)

In this sample text, Lucene is within 3 of Doug.
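
To make the idea of an ordered match within a slop concrete, here is a minimal plain-Java sketch that works on a list of tokens rather than on a Lucene index. It is not Lucene code: the class OrderedNearSketch, the method withinSlopInOrder, and the token list are all made up for illustration, and it assumes slop is measured as the number of tokens lying between the two terms.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of an ordered near-match on token positions.
// Hypothetical names; this does not use or reproduce Lucene's implementation.
final class OrderedNearSketch {

    // Returns true if some occurrence of `first` is followed by an occurrence
    // of `second` with at most `slop` other tokens between them
    // (assumed interpretation of slop).
    static boolean withinSlopInOrder(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up token list: three tokens separate "lucene" and "doug".
        List<String> tokens = Arrays.asList("lucene", "was", "created", "by", "doug");
        System.out.println(withinSlopInOrder(tokens, "lucene", "doug", 5)); // true
        System.out.println(withinSlopInOrder(tokens, "doug", "lucene", 5)); // false: wrong order
    }
}
```

Swapping the two terms fails because the check only looks forward from the first term, which is the effect the final `true` argument of the SpanNearQuery above is meant to have.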

But for your example, the only match I can see is that both your query and the target document have "cd" (and I am assuming that all of these terms are in a single field). In that case, you don't need any special type of query. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.

Edit 3 - in response to the latest comment, the answer is that you cannot use SpanNearQuery to do anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post them), but in the last case, if you only want to find out whether one or more of ("baz", "EXTRA") is in the document, a BooleanQuery will work just fine.
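
The difference between the two behaviours can be sketched in plain Java, without Lucene, using the document and query terms from the question. The names PartialMatchSketch, nearStyleMatch and anyTermMatch are hypothetical, and the model deliberately ignores distance (the question uses a slop of Integer.MAX_VALUE), so it only captures the "every clause must match" versus "at least one clause must match" distinction.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model, not Lucene code: contrasts a near-style match, which needs
// every query term to be present, with optional ("should") term clauses.
final class PartialMatchSketch {

    // Near-style match with an unlimited slop: all query terms must occur.
    static boolean nearStyleMatch(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms);
    }

    // Optional clauses only: one matching term is enough.
    static boolean anyTermMatch(Set<String> docTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("foo", "bar", "baz"));

        System.out.println(nearStyleMatch(doc, Arrays.asList("foo", "bar", "baz"))); // true
        System.out.println(nearStyleMatch(doc, Arrays.asList("baz", "extra")));      // false
        System.out.println(anyTermMatch(doc, Arrays.asList("baz", "extra")));        // true
    }
}
```

The second line of output corresponds to the failing givenSingleMatch_andExtraTerm test: "extra" is absent from the document, so a query that needs both terms cannot match.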

Edit 4 - now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like this:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

(As an example: this matches documents containing either "baz" or "EXTRA", but assigns a higher score to documents where the terms "baz" and "EXTRA" occur within 100 places of each other. Adjust the distance and the boost as you like. This example is taken from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because in the next section I show you how to build this using the API.)

Programmatically, you would construct this as follows:

// Top-level query that combines the optional clauses added below
BooleanQuery top = new BooleanQuery();

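// Construct the terms since they will be used more than once.
// TermQuery and SpanTermQuery do not analyze their text, so the term text must
// match the indexed form (StandardAnalyzer lowercases tokens, e.g. "baz").
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");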

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
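
To see why this combination gives a partial match plus a proximity bonus, here is a toy scoring model in plain Java. It is only a sketch under loud assumptions: the class BoostedProximitySketch and its methods are hypothetical, the score (1 point per term present, plus the boost when the terms appear in order within the slop) is invented for illustration, and real Lucene scores are computed quite differently.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Lucene code: two optional term clauses plus an optional,
// boosted proximity clause. A score of 0 stands for "no match".
final class BoostedProximitySketch {

    static double toyScore(List<String> tokens, String first, String second, int slop, double boost) {
        double score = 0.0;
        if (tokens.contains(first)) {
            score += 1.0;   // first optional term clause
        }
        if (tokens.contains(second)) {
            score += 1.0;   // second optional term clause
        }
        if (inOrderWithin(tokens, first, second, slop)) {
            score += boost; // boosted proximity clause
        }
        return score;
    }

    // True if `first` is followed by `second` with at most `slop` tokens between
    // them (assumed interpretation of slop).
    static boolean inOrderWithin(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && (j - i - 1) <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The document from the question: only "baz" is present.
        System.out.println(toyScore(Arrays.asList("foo", "bar", "baz"), "baz", "extra", 100, 5.0)); // 1.0
        // A made-up document with both terms close together.
        System.out.println(toyScore(Arrays.asList("baz", "and", "extra"), "baz", "extra", 100, 5.0)); // 7.0
        // A made-up document with neither term.
        System.out.println(toyScore(Arrays.asList("foo", "bar"), "baz", "extra", 100, 5.0)); // 0.0
    }
}
```

The document from the question still gets a non-zero score through the single term clause, while a document with both terms near each other ranks well above it.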

Hope that helps! In future, try to start by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow