Lucene SpanNearQuery partial match

https://stackoverflow.com/questions/2021839

19-09-2019

Question

Given a document {'foo', 'bar', 'baz'}, I want to match using SpanNearQuery with the tokens {'baz', 'extra'}.

But this fails.

How can I get around this?

Sample test (using Lucene 2.9.1) with the following results:

givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL

...

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;

public class SpanNearQueryTest {

    private RAMDirectory directory = null;

    private static final String BAZ = "baz";
    private static final String BAR = "bar";
    private static final String FOO = "foo";
    private static final String TERM_FIELD = "text";

    @Before
    public void given() throws IOException {
        directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
                directory,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.optimize();
        writer.close();
    }

    @After
    public void cleanup() {
        directory.close();
    }

    @Test
    public void givenSingleMatch() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenTwoMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenThreeMatches() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, FOO)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAR)),
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ))
                }, Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }

    @Test
    public void givenSingleMatch_andExtraTerm() throws IOException {

        SpanNearQuery spanNearQuery = new SpanNearQuery(
                new SpanQuery[] {
                        new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
                        new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
                },
                Integer.MAX_VALUE, false);

        TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);

        Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
    }
}

Solution

SpanNearQuery lets you find terms that are within a certain distance of each other.

Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):

Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

[Image: http://www.lucidimagination.com/blog/wp-content/uploads/2009/07/spanquery-dia1.png]

In this sample text, Lucene is within 3 of Doug.
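To make the proximity idea above concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class `ProximitySketch` and its method `near` are hypothetical names made up for illustration, and the way the gap is counted (positions strictly between the two terms) is an assumption about how slop behaves for single-term clauses; Lucene's own span matching is more general. The point it illustrates is that a "near" match needs every term to be present, so a missing term means no match regardless of slop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: a toy model of proximity matching.
final class ProximitySketch {

    /**
     * Returns true if both terms occur in tokens and some pair of occurrences
     * has at most `slop` other positions between them.
     * When inOrder is true, `first` must appear before `second`.
     */
    static boolean near(List<String> tokens, String first, String second,
                        int slop, boolean inOrder) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = 0; j < tokens.size(); j++) {
                if (i == j || !tokens.get(j).equals(second)) continue;
                if (inOrder && j < i) continue;
                int gap = Math.abs(j - i) - 1; // positions strictly between the two terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("foo", "bar", "baz");
        // true: only "bar" lies between "foo" and "baz"
        System.out.println(near(doc, "foo", "baz", 1, true));
        // false: "extra" never occurs, so no slop value can help
        System.out.println(near(doc, "baz", "extra", Integer.MAX_VALUE, false));
    }
}
```

The second call mirrors the failing test in the question: with one of the two terms absent from the document, there is nothing to measure a distance to.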

But for your example, the only match I can see is that both the query and the target document contain "CD" (and I'm assuming that all of these terms are in a single field). In that case, you don't need any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that both contain the same term in the same field.

Edit 3 - in response to the latest comment, the answer is that you cannot use SpanNearQuery to do anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of positions of each other. I can't tell what your specific use case or expected results are (feel free to post them), but in the last case, if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.

Edit 4 - now that you have posted your use case, I understand what you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want with the SpanNearQuery, and set a boost on the SpanNearQuery.

So, the query in text form would look like:

BAZ OR EXTRA OR "BAZ EXTRA"~100^5

(As an example, this matches all documents containing either "BAZ" or "EXTRA", but assigns a higher score to documents in which the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the distance and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's fine, because in the next paragraph I show how to build this using the API.)

Programmatically, you would construct this as follows:

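import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Declared as BooleanQuery (not Query) so that add(...) is available
BooleanQuery top = new BooleanQuery();

// Construct the terms since they will be used more than once.
// Note: TermQuery and SpanTermQuery do not analyze their text, so the term
// must match the indexed form (lowercase if indexed with StandardAnalyzer).
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");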

// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);

// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other.  The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
                              new SpanQuery[] { new SpanTermQuery(bazTerm), 
                                                new SpanTermQuery(extraTerm) }, 
                              100, true);

// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);

// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
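The intended effect of the combined query can also be sketched without a Lucene index. The toy model below is an assumption-laden illustration, not Lucene's scoring formula: the class `BoostedPartialMatchSketch`, its methods, and the "one point per term plus a flat boost" rule are all hypothetical. It only shows the shape of the behaviour: a document matches when either term is present, and ranks higher when both occur in order within the slop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model; real Lucene scores depend on term statistics and norms.
final class BoostedPartialMatchSketch {

    /** One point per query term present, plus `boost` if a precedes b within `slop`. */
    static double score(List<String> tokens, String a, String b, int slop, double boost) {
        double score = 0.0;
        if (tokens.contains(a)) score += 1.0;   // like a SHOULD clause on term a
        if (tokens.contains(b)) score += 1.0;   // like a SHOULD clause on term b
        if (inOrderWithin(tokens, a, b, slop)) score += boost; // like the boosted span clause
        return score;
    }

    /** True if some occurrence of a is followed by b with at most `slop` positions between. */
    static boolean inOrderWithin(List<String> tokens, String a, String b, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(b) && j - i - 1 <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> partial = Arrays.asList("foo", "bar", "baz");
        List<String> both = Arrays.asList("baz", "x", "extra");
        System.out.println(score(partial, "baz", "extra", 100, 5.0)); // 1.0: only "baz" present
        System.out.println(score(both, "baz", "extra", 100, 5.0));    // 7.0: both terms plus boost
    }
}
```

A score of zero corresponds to "no clause matched", which is the only case where the document would be left out of the results.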

Hope that helps! In the future, try to start by posting exactly what results you expect. Even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow