Is it best to use a Lucene KeywordAnalyzer to index text for an auto-suggest text box?

StackOverflow https://stackoverflow.com/questions/20710503

  •  20-09-2022

Question

I have a text box in a search form that I want to attach a combobox / autocomplete widget to. As users type, I want to auto-suggest relevant place names. So if a user types "Ca", suggest Cambodia, Cameroon, Canada, Cape Verde, etc, ranked higher than North Carolina and South Carolina. If a user types "Sea", return items such as Red Sea, Black Sea, etc, but perhaps not Chelsea (if at all, this should be scored lower). Our database of place names is very granular and complex, with a lot of data and a lot of alternate names / translations for places. The data is stored in SQL Server and we use Entity Framework as the data access layer. Needless to say, effectively querying our Places entity aggregate using LINQ to Entities is slow and inefficient.

Rather than hand-crafting custom SQL and indexes to optimize database searches, I am looking at Lucene.Net. Today is my first day testing it out. Most of the Lucene guidance I've read uses a StandardAnalyzer for indexing. I was having some trouble using that for a couple of my tests. For example, consider the following:

var searchTerms = new[] { "Ca", "China", "Sea" };
// Also try the lower-cased version of each term.
searchTerms = searchTerms.Concat(searchTerms.Select(x => x.ToLower())).ToArray();
var reader = IndexReader.Open(_directory, true);
foreach (var searchTerm in searchTerms)
{
    var searcher = new IndexSearcher(reader);
    var query1 = new WildcardQuery(new Term("OfficialName", string.Format("*{0}*", searchTerm)));
    var query2 = new TermQuery(new Term("OfficialName", searchTerm));
    var query3 = new QueryParser(Version.LUCENE_30, "OfficialName", new SimpleAnalyzer()).Parse(searchTerm);
    var query4 = new PrefixQuery(new Term("OfficialName", searchTerm));
    // OR all four query types together and take whatever matches.
    var query5 = new BooleanQuery();
    query5.Add(query1, Occur.SHOULD);
    query5.Add(query2, Occur.SHOULD);
    query5.Add(query3, Occur.SHOULD);
    query5.Add(query4, Occur.SHOULD);
    var queryToRun = query5;
    var results = searcher.Search(queryToRun, int.MaxValue);
    var hits = results.ScoreDocs;
    // ... inspect hits ...
}
The above code just tries out normal- and lower-cased versions of the terms. Interestingly, the "Ca" query returns no results, but "ca" returns a ton of them: Africa, North America, etc. I think I read somewhere that the standard analyzer treats terms differently based on case, so this may be why? The other search terms return what one might expect.

When the same data is indexed using a keyword analyzer, the results are quite different. One weird thing is that "china" only returns 1 result, "Uchinada-machi". I would have expected it to also return "China" and "East China Sea". Also "sea" returns results like "Royal Borough of Kensington and Chelsea" and "Swansea City and County", but none of the other expected results.

So how should I go about this? Should I have different indexes of the text for different analyzers? Do I need to query against a document field with lowercased text? I read about using NGram tokenizers, but they no longer seem to be in the Lucene.Net.Analysis namespace.


Solution

I think the answer to this question is "it depends, but probably not." According to Lucene in Action, the KeywordAnalyzer treats an entire string as a single analysis token. So it won't break up something like "East China Sea" into "East", "China", and "Sea" tokens to search on them separately. Knowing this, it makes sense that I got the above results for the kinds of queries I was trying out.
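
To make the difference concrete, you can dump the tokens each analyzer produces. The following is a minimal sketch against the Lucene.Net 3.0.3 API (the DumpTokens helper and the field name are mine, purely for illustration):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

class TokenDump
{
    static void DumpTokens(Analyzer analyzer, string text)
    {
        // Feed the text through the analyzer and print each token it emits.
        var stream = analyzer.TokenStream("OfficialName", new StringReader(text));
        var term = stream.AddAttribute<ITermAttribute>();
        while (stream.IncrementToken())
            Console.Write("[{0}] ", term.Term);
        Console.WriteLine();
    }

    static void Main()
    {
        // KeywordAnalyzer keeps the whole string as one token, casing intact:
        DumpTokens(new KeywordAnalyzer(), "East China Sea");                   // [East China Sea]

        // StandardAnalyzer splits on whitespace/punctuation and lowercases:
        DumpTokens(new StandardAnalyzer(Version.LUCENE_30), "East China Sea"); // [east] [china] [sea]
    }
}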

I am still not totally confident about my understanding of case sensitivity in Lucene, so please correct me if I'm wrong, but it seems that the casing of your search input has to match the casing the indexing analyzer actually wrote into the field. The only way I could really grasp this was by testing out different combinations of analyzers, document fields (normal and lower-cased), and field settings (ANALYZED versus NOT_ANALYZED). The Lucene in Action reference above calls this lowercasing process normalization.

I found that mixed-case input text (like "Ch") returned no results when the searched field had been indexed with the StandardAnalyzer. Having read that reference, this now makes more sense: the StandardAnalyzer normalizes tokens to lowercase at indexing time. A QueryParser, as in new QueryParser(Version.LUCENE_30, field, analyzer).Parse("Ch"), runs the same analyzer over the query text, so "Ch" is lowercased to "ch" and matches the lowercase tokens in the index. Queries built directly from a Term (TermQuery, PrefixQuery, WildcardQuery) bypass analysis entirely, which is why my mixed-case variants found nothing.
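
A quick way to convince yourself of this is a sketch like the one below, assuming a single "OfficialName" field written to an in-memory RAMDirectory with the StandardAnalyzer:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class CaseDemo
{
    static void Main()
    {
        var dir = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("OfficialName", "China", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        var searcher = new IndexSearcher(dir, true);

        // Hand-built queries are NOT analyzed: the casing must match the index.
        var miss = searcher.Search(new PrefixQuery(new Term("OfficialName", "Ch")), 10);
        var hit  = searcher.Search(new PrefixQuery(new Term("OfficialName", "ch")), 10);
        Console.WriteLine("PrefixQuery 'Ch': {0}, 'ch': {1}", miss.TotalHits, hit.TotalHits); // 0, 1

        // QueryParser runs the analyzer over the input, so "China" becomes "china".
        var parsed = new QueryParser(Version.LUCENE_30, "OfficialName", analyzer).Parse("China");
        Console.WriteLine("Parsed 'China': {0}", searcher.Search(parsed, 10).TotalHits);      // 1
    }
}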

For the OP's case, a good solution seems to be: normalize (lowercase) the user's input for queries that run against fields an analyzer has normalized, then union those results with the raw, non-normalized input run against NOT_ANALYZED fields (or fields indexed with a non-normalizing analyzer), possibly giving the latter a higher boost factor.
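
Putting that together, here is one way to sketch it; the second field name ("OfficialNameExact") and the boost value are my own placeholders, not anything prescribed by Lucene:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class AutoSuggestSketch
{
    static void Main()
    {
        var dir = new RAMDirectory();
        using (var writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var name in new[] { "Canada", "Cambodia", "North Carolina", "East China Sea" })
            {
                var doc = new Document();
                // Analyzed field: tokenized and lowercased by the StandardAnalyzer.
                doc.Add(new Field("OfficialName", name, Field.Store.YES, Field.Index.ANALYZED));
                // Verbatim field: a single token with the original casing preserved.
                doc.Add(new Field("OfficialNameExact", name, Field.Store.NO, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
            }
        }

        var searcher = new IndexSearcher(dir, true);
        var input = "Ca";

        // Lowercase the user's input for the analyzed field...
        var normalized = new PrefixQuery(new Term("OfficialName", input.ToLowerInvariant()));
        // ...and run the raw input against the verbatim field with a higher boost, so
        // "Canada" and "Cambodia" outrank names that merely contain a "ca..." token.
        var exact = new PrefixQuery(new Term("OfficialNameExact", input)) { Boost = 2.0f };

        var combined = new BooleanQuery();
        combined.Add(normalized, Occur.SHOULD);
        combined.Add(exact, Occur.SHOULD);

        foreach (var scoreDoc in searcher.Search(combined, 10).ScoreDocs)
            Console.WriteLine(searcher.Doc(scoreDoc.Doc).Get("OfficialName"));
    }
}

Note that PrefixQuery is also a better fit for autocomplete than the leading-wildcard query in the original snippet: a pattern like *ca* has to scan every term in the field, while a prefix query can seek directly in the term dictionary.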

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow