Lucene search using StopWords in StandardAnalyzer

https://stackoverflow.com/questions/19931921

30-07-2022
|

Question

I have the following issue using Lucene.NET 3.0.3.

My project analyze Documents using StandardAnalyzer with StopWord-List (combined german and english words).
While searching I create my searchterm by hand and parse it using MultiFieldQueryParser. The Parser is initialized with the same analyzer as indexing documents.
The parsed search query initialized a BooleanQuery. The BooleanQuery and a TopScoreDocCollector search in the Lucene index with IndexSearcher.

My code looks like:

using (StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, roxConnectionTools.getServiceInstance<ISearchIndexService>().GetStopWordList()))
{
    ...
    MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, analyzer);
    parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
    parser.AllowLeadingWildcard = true;
    ...
    Query searchQuery = parser.Parse(searchStringBuilder.ToString().Trim);
    ...
    BooleanQuery boolQuery = new BooleanQuery();
    boolQuery.Add(searchQuery, Occur.MUST);
    ...
    TopScoreDocCollector scoreCollector = TopScoreDocCollector.Create(SearchServiceTools.MAX_SCORE_COLLECTOR_SIZE, true);
    ...
    searcher.Search(boolQuery, scoreCollector);
    ScoreDoc[] scoreDocs = scoreCollector.TopDocs().ScoreDocs;
}

If I index a document field with value "Test- und Produktivumgebung" I can´t find this document by searching this term.
I get results if I correct the search term to "Test- Produktivumgebung".
The word "und" is in my StopWord-List.

My search query looks like the following:
Manually generated search query: (+*Test* +*und* +*Produktivumgebung*)
Parsed search query: +(title:*Test*) +(title:*und*) +(title:*Produktivumgebung*)

Why I can´t find the document searching for "Test- und Produktivumgebung"?

Solution

Wildcard Queries are not analyzed (See this question, for an example). Since you are (if I understand correctly), interpreting the query "Test- und Produktivumgebung" to (+*Test* +*und* +*Produktivumgebung*), the analyzer is not used for any of those wildcard queries, and stop words will not be eliminated.

If you eliminate the step that performs that translation, the query "Test- und Produktivumgebung" should be parsed to a phrase query and analyzed, and should work just fine. Another reason to eliminate that step, is that applying a leading wildcard to every term will cause your performance to become very poor. That's why leading wildcards must be manually enabled, because it is generally a bad idea to use them.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow