I figured out what solves the problem, though I haven't quite pinned down exactly why it was a problem.
Basically, the TokenFilter
implementation included in the question attempts to do too much and doesn't align with Lucene's expectations.
By limiting the IncrementToken
implementation to computing just the phonetic hash and replacing the ITermAttribute.Term
value with the generated hash, it works quite well.
TokenFilter
implementation:
public class SoundexFilter : TokenFilter
{
    private readonly ITermAttribute _termAttr;

    public SoundexFilter(TokenStream input)
        : base(input)
    {
        _termAttr = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (input.IncrementToken())
        {
            string currentTerm = _termAttr.Term;

            // Any phonetic hash calculation will work here.
            var hash = Soundex.For(currentTerm);

            // Replace the token's term with its phonetic hash.
            _termAttr.SetTermBuffer(hash);
            return true;
        }
        return false;
    }
}
The result requires the same filter to be applied at both index and query time, but it works extremely well.
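One way to guarantee the filter runs on both sides is to wrap it in a custom Analyzer and hand that same analyzer to both the IndexWriter and the QueryParser. This is only a sketch against the Lucene.Net 3.0.x API; the `SoundexAnalyzer` name and the exact tokenizer chain ahead of the filter are my assumptions, not part of the original solution:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Hypothetical analyzer: tokenize, lowercase, then apply SoundexFilter
// so every term is stored (and later queried) as its phonetic hash.
public class SoundexAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        return new SoundexFilter(stream);
    }
}
```

Passing one instance of this analyzer to both the IndexWriter and the QueryParser means a query term gets hashed exactly the same way the indexed terms were, which is what makes the matching work.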
As a side note, the performance of this filter doesn't match my expectations, so I'll be profiling the solution to identify possible enhancements. I'd recommend anyone adopting this solution do the same if they expect sub-second response times against an index of more than 2 million documents.