Dashes in words, keeping e-mail and wi-fi as single words in Lucene.Net's tokenizer
16-06-2023
Question
I'm using Lucene.Net to tokenize blog posts:
var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                select article).ToList();

var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    writer.Optimize();
}
The resulting term vectors contain unexpected word splits: for example, "e-mails" becomes the two terms "e" and "mails", and "wi-fi" becomes "wi" and "fi". Some answers (e.g. this one about Solr) suggest that I need to use a ClassicAnalyzer rather than a StandardAnalyzer, because "StandardAnalyzer now always treats hyphens as a delimiter", but I cannot see where ClassicAnalyzer lives, and the documentation suggests I'd need Lucene 3.1 (Lucene.Net only goes up to 3.0, I think).
How do I avoid treating dashes in words as word boundaries?
Solution
Try WhitespaceAnalyzer, which splits at whitespace only, so hyphenated words pass through as single tokens.
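A minimal sketch of the difference, assuming Lucene.Net 3.0.3 (the sample text and field name are illustrative, not from the original post):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

class TokenizeDemo
{
    static void Main()
    {
        // WhitespaceAnalyzer splits only on whitespace, so hyphenated
        // words such as "e-mails" and "wi-fi" are kept as single terms.
        var analyzer = new WhitespaceAnalyzer();
        var ts = analyzer.TokenStream("Title", new StringReader("Sending e-mails over wi-fi"));
        var term = ts.AddAttribute<ITermAttribute>();
        while (ts.IncrementToken())
        {
            // Prints each token on its own line; "e-mails" and
            // "wi-fi" come through intact (unlike with StandardAnalyzer).
            Console.WriteLine(term.Term);
        }
        ts.Close();
    }
}
```

To apply this to the indexing code in the question, pass the WhitespaceAnalyzer to the IndexWriter constructor in place of the StandardAnalyzer. Note that WhitespaceAnalyzer also skips the lowercasing and stop-word removal that StandardAnalyzer performs, so searches become case- and punctuation-sensitive unless you add those filters yourself.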
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow