I think the answer to this question is "it depends, but probably not." According to Lucene in Action, the KeywordAnalyzer treats an entire string as a single token. So it won't break up something like "East China Sea" into "East", "China", and "Sea" tokens to search on separately. Knowing this, it makes sense that I got the above results for the kinds of queries I was trying out.
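To make that concrete, here is a minimal sketch (I'm using Lucene.NET 3.0 syntax, since the QueryParser snippet later in this answer looks like C#; the "name" field and sample text are just placeholders):

    using System;
    using Lucene.Net.Analysis;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    var dir = new RAMDirectory();
    using (var writer = new IndexWriter(dir, new KeywordAnalyzer(),
                                        IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("name", "East China Sea",
                          Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }

    var searcher = new IndexSearcher(dir, true);

    // KeywordAnalyzer indexed the whole string as one token, so a
    // single-word term matches nothing...
    Console.WriteLine(searcher.Search(
        new TermQuery(new Term("name", "China")), 10).TotalHits);          // 0
    // ...but the exact, full string matches.
    Console.WriteLine(searcher.Search(
        new TermQuery(new Term("name", "East China Sea")), 10).TotalHits); // 1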
I am still not totally confident about my understanding of case sensitivity in Lucene, so please correct me if I'm wrong, but it seems that the casing of your search input has to match the casing of the terms in the index, which is determined by the field settings and the analyzer used at index time. The only way I could really grasp this was by testing different combinations of analyzers, document fields (normal and lower-cased), and field settings (ANALYZED versus NOT_ANALYZED). The link referenced above calls this process of lowercasing text normalization.
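One of those combinations, continuing the sketch above but indexing the field as NOT_ANALYZED instead: the term is stored with its original casing, so only an exact-case TermQuery finds it.

    // Indexed instead with:
    //   doc.Add(new Field("name", "East China Sea",
    //                     Field.Store.YES, Field.Index.NOT_ANALYZED));

    // The stored term keeps its original casing, so case must match exactly:
    searcher.Search(new TermQuery(new Term("name", "East China Sea")), 10); // 1 hit
    searcher.Search(new TermQuery(new Term("name", "east china sea")), 10); // 0 hits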
I found that searching with mixed-case input text (like "Ch") returned no results when the field searched had been indexed with the StandardAnalyzer. Now that I've read the above link, this makes more sense: the StandardAnalyzer normalizes tokens to lowercase when indexing, so every term stored in the index is lowercase. And if you parse the search input with the same analyzer, e.g. new QueryParser(Version.LUCENE_30, field, analyzer).Parse("Ch"), most analyzers will lowercase it as well, so it lines up with the lowercase tokens in the index.
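Here is the same sketch with the "name" field re-indexed using StandardAnalyzer, so the tokens in the index are "east", "china", and "sea" (all lowercase):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Version = Lucene.Net.Util.Version;

    var analyzer = new StandardAnalyzer(Version.LUCENE_30);

    // A raw TermQuery bypasses analysis, so an uppercase term never
    // matches the lowercase token "china" in the index:
    searcher.Search(new TermQuery(new Term("name", "China")), 10);  // 0 hits

    // QueryParser runs the input through the analyzer, lowercasing
    // "China" to "china" before the lookup:
    var parser = new QueryParser(Version.LUCENE_30, "name", analyzer);
    searcher.Search(parser.Parse("China"), 10);                     // 1 hit

    // Prefix input like "Ch*" is lowercased too (the parser's
    // LowercaseExpandedTerms setting defaults to true):
    searcher.Search(parser.Parse("Ch*"), 10);                       // 1 hit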
For the OP, it seems like a good solution is to normalize (lowercase) the user's input for the queries that run against fields normalized by an analyzer. If need be, you can union those results with the non-normalized user input run against NOT_ANALYZED fields (or fields indexed with a non-normalizing analyzer), possibly giving the latter a higher boost factor.
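A rough sketch of that union, assuming the documents were indexed with both an analyzed "name" field and a NOT_ANALYZED duplicate field (I'm calling it "name_exact"; both field names are hypothetical):

    string userInput = "East China Sea";

    // Lowercased input against the analyzer-normalized field:
    var parser = new QueryParser(Version.LUCENE_30, "name",
                                 new StandardAnalyzer(Version.LUCENE_30));
    var analyzedQuery = parser.Parse(
        QueryParser.Escape(userInput.ToLowerInvariant()));

    // Untouched input against the NOT_ANALYZED field, boosted so
    // exact-case matches rank higher:
    var exactQuery = new TermQuery(new Term("name_exact", userInput));
    exactQuery.Boost = 2.0f;

    // Two SHOULD clauses make this a union: either side can match.
    var union = new BooleanQuery();
    union.Add(analyzedQuery, Occur.SHOULD);
    union.Add(exactQuery, Occur.SHOULD);

    var results = searcher.Search(union, 10);

The SHOULD clauses mean a document only needs to match one side, and the boost just ranks exact-case hits above normalized ones when both clauses match the same document.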