Dashes in words, keeping e-mail and wi-fi as single words in Lucene.Net's tokenizer

StackOverflow https://stackoverflow.com/questions/22486158

  16-06-2023

Question

I'm using Lucene.Net to tokenize blog posts.

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Version = Lucene.Net.Util.Version;

var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                select article).ToList();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
// 'directory' is an already-opened Lucene.Net Directory (e.g. an FSDirectory)
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    writer.Optimize();
}

The resulting term vectors contain unexpected word splits: for example, "e-mails" becomes the two terms "e" and "mails", and "wi-fi" becomes "wi" and "fi". Some answers (e.g. this one about Solr) suggest using a ClassicAnalyzer rather than a StandardAnalyzer, "as StandardAnalyzer now always treats hyphens as a delimiter", but I cannot see where ClassicAnalyzer lives, and the documentation suggests I'd need version 3.1 (Lucene.Net only goes up to 3.0, I think).
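To see the splits directly, here is a minimal token-dump sketch (assuming the Lucene.Net 3.0.3 attribute API):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

var analyzer = new StandardAnalyzer(Version.LUCENE_30);
TokenStream stream = analyzer.TokenStream("Title", new StringReader("e-mails and wi-fi"));
var term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
    Console.WriteLine(term.Term);
stream.End();
stream.Dispose();
// Prints: e, mails, wi, fi  ("and" is dropped as a stop word;
// the hyphens act as token boundaries)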

How do I avoid treating dashes in words as word boundaries?


Solution

Try WhitespaceAnalyzer, which splits tokens on whitespace only, leaving hyphenated words intact.
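For example, a minimal sketch of the swap (only the analyzer changes; the indexing code stays as in the question):

using Lucene.Net.Analysis;
using Lucene.Net.Index;

// WhitespaceAnalyzer breaks tokens on whitespace only, so "e-mails"
// and "wi-fi" survive as single terms.
var analyzer = new WhitespaceAnalyzer();
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // ... add documents exactly as before ...
}

Note the trade-off: whitespace-only splitting also keeps punctuation attached to tokens ("mails," indexes as mails,) and does no lower-casing, so queries must match case and punctuation exactly unless you add your own token filters.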

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow