Lucene prohibited clause "fuzzyfied" where it shouldn't

https://stackoverflow.com/questions/15487463

24-03-2022

Question

I'm writing a filter based on Lucene: I get some results from an API and I would like to enforce that they match a certain query (the API sometimes doesn't work properly). Since the results come from an API, I basically store them in RAM, index them, and filter. If Lucene finds a doc in my index, I consider that doc to be OK; if not, it is filtered out.

Sometimes I want the matching to be fuzzy, sometimes I don't; there's an approximation switch. So I use a StandardAnalyzer for approximation = false, and a BrazilianAnalyzer for approximation = true.

The problem is that the BrazilianAnalyzer also approximates negated terms, which I don't think is a good approach. For example, if I need "greve -trabalhadores", a doc with "greve do trabalho" ends up matching the prohibited clause, but it shouldn't. If I use the StandardAnalyzer, it works fine; if I use the BrazilianAnalyzer, it ignores everything that contains "trabalh", because of the stemming.

My solution was to rewrite the prohibited clauses using the StandardAnalyzer, which doesn't do stemming/fuzzy matching. So the prohibited portion of the query uses the StandardAnalyzer, and the rest uses either the BrazilianAnalyzer or the StandardAnalyzer (depending on the approximation switch).

The problem is that it isn't working (sometimes).

A small test of my code is as follows:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.br.BrazilianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Lucene {

    private static Logger log = Logger.getLogger(Lucene.class.getName());

    private String[] fields = new String[] { "title" };

    private BrazilianAnalyzer analyzerBrazil = new BrazilianAnalyzer(Version.LUCENE_41, new CharArraySet(Version.LUCENE_41, Collections.emptyList(), true));
    private StandardAnalyzer analyzerStandard = new StandardAnalyzer(Version.LUCENE_41, new CharArraySet(Version.LUCENE_41, Collections.emptyList(), true));

    private MultiFieldQueryParser parserBrazil = new MultiFieldQueryParser(Version.LUCENE_41, fields , analyzerBrazil);
    private MultiFieldQueryParser parserStandard = new MultiFieldQueryParser(Version.LUCENE_41, fields , analyzerStandard);

    public void filter(String query, boolean fuzzy, List<Result> results) {
        Directory index = null;

        if (results == null || results.size() == 0) {
            return;
        }

        try {
            Analyzer analyzer = fuzzy ? analyzerBrazil : analyzerStandard;
            Query q = fuzzy ? parserBrazil.parse(query) : parserStandard.parse(query);

            // prohibited terms shouldn't be stemmed ("fuzzyfied")...
            if (fuzzy) {
                Query queryNoFuzzy = parserStandard.parse(query);

                if (q instanceof BooleanQuery && queryNoFuzzy instanceof BooleanQuery) {
                    BooleanClause[] clauses = ((BooleanQuery)queryNoFuzzy).getClauses();
                    if (clauses != null && clauses.length > 0) {
                        BooleanClause clause = null;
                        for (int i = 0; i < clauses.length; i++) {
                            clause = clauses[i];
                            if (clause.isProhibited()) {
                                ((BooleanQuery)q).clauses().set(i, clause);
                            }
                        }
                    }
                }
            }

            log.info(q.toString());

            index = index(results, analyzer);
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
            TopDocs resultsFoundDocs = searcher.search(q, results.size());

            List<Result> resultsFound = new ArrayList<Result>();
            for (ScoreDoc resultadoFiltro : resultsFoundDocs.scoreDocs) {
                log.info("Score " + resultadoFiltro.score);
                resultsFound.add(results.get(Integer.parseInt(searcher.doc(resultadoFiltro.doc).get("index"))));
            }

            for (Result result : results) {
                if (!resultsFound.contains(result)) {
                    result.setFiltered(true);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
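            try {
                // index() returns null on failure, so guard against a NullPointerException.
                if (index != null) {
                    index.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }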
        }
    }

    private Directory index(List<Result> resultados, Analyzer analyzer) {
        try {
            Directory index = new RAMDirectory();
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, analyzer);

            IndexWriter writer = new IndexWriter(index, config);
            indexResults(writer, analyzer, resultados);

            return index;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    private void indexResults(IndexWriter w, Analyzer analyzer, List<Result> resultados) throws IOException {
        try {
            Document resultado = null;

            for (int i = 0; i < resultados.size(); i++) {
                resultado = new Document();

                resultado.add(new TextField(fields[0], resultados.get(i).getTitle(), Field.Store.YES));
                resultado.add(new IntField("index", i, Field.Store.YES));

                w.addDocument(resultado, analyzer);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            w.close();
        }
    }   

    public static void main(String[] args) {
        List<Result> ocs = new ArrayList<Result>();

        Result rb = new Result("Vivo Celular - não instalação do produto");
        ocs.add(rb);

        System.out.println("ITEMS ____________________________");
        for (Result oc : ocs) {
            System.out.println(oc.getTitle());
        }
        System.out.println("ITEMS ____________________________");

        String query = "vivo -celular";

        System.out.println("\n >> QUERY " + query);

        new Lucene().filter(query, true, ocs);

        System.out.println("\nFOUND ____________________________");
        for (Result oc : ocs) {
            if (!oc.getFiltered()) {
                System.out.println(oc.getTitle());
            }
        }
        System.out.println("FOUND ____________________________");
    }

}

class Result {

    private String title;
    private Boolean filtered = false;

    public Result(String title) {
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public Boolean getFiltered() {
        return filtered;
    }

    public void setFiltered(Boolean filtered) {
        this.filtered = filtered;
    }

}

There is a single doc, with the title "Vivo Celular - não instalação do produto". I query for "vivo -celular", so, as the doc contains celular, it shouldn't be returned by the searcher.search(q, results.size()); call. It is returned anyway, but only with the approximation switch on, even though the printed query shows that only "vivo" was stemmed to "viv" (the query is "(title:viv) -(title:celular)").

Is that correct?

I'm using version 4.2.2. It also happens with 4.1.0.

Can anyone enlighten me on that one?

Many thanks in advance.

Solution 2

For those who are looking for the answer:

I found a better (and correct) way of doing this: the BrazilianAnalyzer (like most analyzers) has an overloaded constructor that accepts the stop words and a set of words that shouldn't be stemmed (or "fuzzyfied"). So what you have to do is:

Construct your Analyzer as follows:

new BrazilianAnalyzer(Version.LUCENE_41, stops, getNoStemmingSet(query));

Then, the getNoStemmingSet would be:

private CharArraySet getNoStemmingSet(String query) {
    // No query or no prohibited clause: nothing needs to be protected from stemming.
    if (query == null || !query.contains(" -")) {
        return new CharArraySet(Version.LUCENE_41, Collections.emptyList(), true);
    }

    List<String> prohibitedClauses = new ArrayList<String>();

    for (String clause : query.split("\\s")) {
        // Strip only the leading minus sign, so hyphens inside the term are kept.
        if (clause.startsWith("-") && clause.length() > 1) {
            prohibitedClauses.add(clause.substring(1));
        }
    }

    return new CharArraySet(Version.LUCENE_41, prohibitedClauses, true);
}

So if the query contains prohibited clauses (minus sign), we take each of them and build a new CharArraySet of terms that must not be stemmed.

Stops is another CharArraySet, holding the words you would like to use as stop words. If you don't need your own stop word set, you can use the default one:

BrazilianAnalyzer.getDefaultStopSet()

That's it.
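To check quickly which terms would end up in the exclusion set, the extraction step can be tried without any Lucene types. The following is only a minimal sketch: the class name ProhibitedTerms and the method extract are made up for this illustration, and it returns a plain List instead of a CharArraySet.

```java
import java.util.ArrayList;
import java.util.List;

class ProhibitedTerms {

    // Hypothetical helper: collects the bare terms of clauses that start with '-'.
    static List<String> extract(String query) {
        List<String> terms = new ArrayList<String>();
        if (query == null) {
            return terms;
        }
        for (String clause : query.trim().split("\\s+")) {
            // Skip a lone "-" so that no empty term is collected.
            if (clause.startsWith("-") && clause.length() > 1) {
                terms.add(clause.substring(1));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extract("vivo -celular"));        // prints [celular]
        System.out.println(extract("greve -trabalhadores")); // prints [trabalhadores]
    }
}
```

The resulting list is what would be passed on to the CharArraySet constructor, so "celular" and "trabalhadores" are the words kept out of the stemmer.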

OTHER TIPS

I believe the problem lies in the fact that you are mixing up analyzers.

If your fuzzy flag is set to true, you are indexing the documents using the BrazilianAnalyzer (which does stemming) but you are trying to rewrite part of the query with some non stemmed terms, using the StandardAnalyzer.

In other words, even if you have the query "(title:viv) -(title:celular)", which is correct, the term celular has most likely been stemmed in the directory (because you indexed with the BrazilianAnalyzer), and therefore the clause -celular will never match.

A possible workaround, although it adds some overhead, is to maintain two different indexes: a stemmed one and a non-stemmed one. To do this easily, you can create two different fields, say title (with StandardAnalyzer) and stemmedtitle (with BrazilianAnalyzer). Use a PerFieldAnalyzerWrapper to create an analyzer that handles the two fields differently. Then you can rewrite your query as stemmedtitle:viv -title:celular, and that should do the trick.
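The effect of the two-field idea can be illustrated without Lucene. The sketch below is a toy model under loud assumptions: toyStem just truncates tokens to five characters and is not what the BrazilianAnalyzer produces, and TwoFieldSketch, tokens and matches are names invented for this example. Each "field" is modeled as a set of tokens.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TwoFieldSketch {

    // Toy stand-in for a stemmer: truncates tokens to 5 characters.
    // This is NOT the Brazilian stemmer; it only makes related words collide.
    static String toyStem(String token) {
        return token.length() > 5 ? token.substring(0, 5) : token;
    }

    // Lowercases and splits on whitespace, optionally applying the toy stemmer.
    static Set<String> tokens(String text, boolean stem) {
        Set<String> out = new HashSet<String>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(stem ? toyStem(t) : t);
            }
        }
        return out;
    }

    // A doc matches if the required term is in one field and the prohibited term is absent from the other.
    static boolean matches(Set<String> requiredField, String required,
                           Set<String> prohibitedField, String prohibited) {
        return requiredField.contains(required) && !prohibitedField.contains(prohibited);
    }

    public static void main(String[] args) {
        String a = "greve do trabalho";
        String b = "greve dos trabalhadores";
        String required = toyStem("greve");
        String prohibited = "trabalhadores"; // deliberately left unstemmed

        // Single stemmed field: the unstemmed prohibited term is not in the index, so b slips through.
        System.out.println(matches(tokens(b, true), required, tokens(b, true), prohibited));  // prints true

        // Two fields: required term against the stemmed field, prohibited term against the plain one.
        System.out.println(matches(tokens(a, true), required, tokens(a, false), prohibited)); // prints true
        System.out.println(matches(tokens(b, true), required, tokens(b, false), prohibited)); // prints false
    }
}
```

In this model the first call reproduces the mismatch described above, while the two-field variant keeps "greve do trabalho" and drops "greve dos trabalhadores", which is the behavior the question asks for.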

For those who want to use -"some phrase", this code should do the trick (I haven't tested it thoroughly, but you can try it):

private CharArraySet getNoStemmingSet(String query) {
    // No query or no prohibited clause: nothing needs to be protected from stemming.
    if (query == null || !query.contains(" -")) {
        return new CharArraySet(Version.LUCENE_41, Collections.emptyList(), true);
    }

    List<String> prohibitedClauses = new ArrayList<String>();

    for (int i = 0; i < query.length(); i++) {
        if (query.charAt(i) != '-') {
            continue;
        }
        if (i + 1 < query.length() && query.charAt(i + 1) == '"') {
            // -"some phrase": protect every word of the quoted phrase.
            int end = query.indexOf('"', i + 2);
            if (end < 0) {
                end = query.length(); // unterminated quote: take the rest of the query
            }
            for (String quotedWord : query.substring(i + 2, end).split("\\s")) {
                prohibitedClauses.add(quotedWord);
            }
        } else {
            // -term: protect the term up to the next space (or the end of the query).
            int end = query.indexOf(' ', i + 1);
            prohibitedClauses.add(query.substring(i + 1, end > 0 ? end : query.length()));
        }
    }

    return new CharArraySet(Version.LUCENE_41, prohibitedClauses, true);
}
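As an alternative to scanning character by character, the same extraction could be sketched with a regular expression. This is a different technique from the loop above, not a drop-in from the answer: ProhibitedPhraseTerms, PROHIBITED and extract are hypothetical names, the result is a plain List rather than a CharArraySet, and a minus sign only counts when it starts the query or follows whitespace (so a word like anti-virus is left alone).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProhibitedPhraseTerms {

    // A '-' at the start or after whitespace, followed by either
    // a quoted phrase (group 1) or a bare term (group 2).
    private static final Pattern PROHIBITED =
            Pattern.compile("(?:^|\\s)-(?:\"([^\"]*)\"|([^\\s\"]+))");

    // Hypothetical helper: returns every word of every prohibited term or phrase.
    static List<String> extract(String query) {
        List<String> terms = new ArrayList<String>();
        if (query == null) {
            return terms;
        }
        Matcher m = PROHIBITED.matcher(query);
        while (m.find()) {
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            for (String word : body.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    terms.add(word);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extract("vivo -\"celular pre pago\" -tim")); // prints [celular, pre, pago, tim]
    }
}
```

An unterminated quote simply produces no match here instead of throwing, which may or may not be the behavior you want.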

Licensed under: CC-BY-SA with attribution