Indexing multilingual content with Lucene.NET

https://stackoverflow.com/questions/553404

23-08-2019

Question

I use Lucene.NET to index the content, documents, and so on of a website. The index is very simple and has this format:

LuceneId - unique id for Lucene (TypeId + ItemId)
TypeId   - the type of text (eg. page content, product, public doc etc..)
ItemId   - the web page id, document id etc..
Text     - the text indexed
Title    - web page title, document name etc.. to display with the search results

I have these options for adapting it to serve multilingual content:

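1. Create a separate index for each language, e.g. Lucene-enGB, Lucene-frFR, etc.
2. Keep a single index and add an extra "Language" field to it for filtering the results.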

Which is the best option, or is there another one? I haven't used multiple indexes before, so I'm leaning toward the second.

Solution

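I do [2], but one problem I've run into is that I can't use a different analyzer depending on the language. I've combined the stop words of the languages I need, but I lose the more advanced features an analyzer would provide, such as stemming.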
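As a rough illustration of option [2], here is a small plain-Python model of a single index whose entries carry a language field that searches filter on. It is not the Lucene.NET API; `IndexedItem`, `make_item`, and `search` are hypothetical names. Folding the language into the id is also an assumption of this sketch (the question's LuceneId is just TypeId + ItemId), made so that the same item in two languages does not collide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexedItem:
    lucene_id: str  # unique id; here TypeId + ItemId + language (assumption)
    type_id: str
    item_id: str
    language: str   # the extra field from option 2, e.g. "en-GB"
    text: str
    title: str


def make_item(type_id: str, item_id: str, language: str,
              text: str, title: str) -> IndexedItem:
    # Assumption: include the language in the id so translations stay distinct.
    lucene_id = f"{type_id}-{item_id}-{language}"
    return IndexedItem(lucene_id, type_id, item_id, language, text, title)


def search(items: List[IndexedItem], term: str, language: str) -> List[IndexedItem]:
    # Keep only entries in the requested language, then match whole words.
    term = term.lower()
    return [it for it in items
            if it.language == language and term in it.text.lower().split()]


items = [
    make_item("page", "42", "en-GB", "We chat online", "Home"),
    make_item("page", "42", "fr-FR", "Le chat noir", "Accueil"),
]
french_hits = search(items, "chat", "fr-FR")
```

The same word matches in both languages, and only the language filter separates the results. All text here goes through one tokenization step, which is the limitation the answer describes.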

其他提示

You can rule out both option 1 and option 2.
Use a single index, and for each field that may contain Arabic words, create two fields. For example, if the "Text" field may contain Arabic or English content:

Create two fields for "Text": one field, "Text", indexed and searched with your standard analyzer, and another, "Text_AR", indexed and searched with the ArabicAnalyzer. To achieve that, you can use PerFieldAnalyzerWrapper.
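The idea behind a per-field analyzer wrapper can be sketched in plain Python: a default analyzer handles most fields, and specific field names are routed to their own analyzer. This is a toy model, not the Lucene.NET class; `PerFieldAnalyzer`, `standard_analyze`, and `arabic_analyze` are hypothetical names, and the "Arabic" analyzer here only strips diacritics and a leading definite article, which is far cruder than a real Arabic analyzer.

```python
import re
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]

_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")
_AL_PREFIX = "\u0627\u0644"  # the Arabic definite article "al-"


def standard_analyze(text: str) -> List[str]:
    # Toy stand-in for a standard analyzer: lowercase and split into words.
    return re.findall(r"\w+", text.lower())


def arabic_analyze(text: str) -> List[str]:
    # Toy stand-in for an Arabic analyzer: drop diacritics, then strip a
    # leading definite article from tokens longer than three characters.
    tokens = re.findall(r"\w+", _ARABIC_DIACRITICS.sub("", text))
    return [t[len(_AL_PREFIX):] if t.startswith(_AL_PREFIX) and len(t) > 3 else t
            for t in tokens]


class PerFieldAnalyzer:
    """Routes each field to its own analyzer, falling back to a default."""

    def __init__(self, default: Analyzer) -> None:
        self._default = default
        self._by_field: Dict[str, Analyzer] = {}

    def add_analyzer(self, field: str, analyzer: Analyzer) -> None:
        self._by_field[field] = analyzer

    def analyze(self, field: str, text: str) -> List[str]:
        return self._by_field.get(field, self._default)(text)


wrapper = PerFieldAnalyzer(standard_analyze)
wrapper.add_analyzer("Text_AR", arabic_analyze)
```

With this setup the same source text can be stored in both "Text" and "Text_AR", each analyzed differently. The same wrapper should be used at index time and at query time so that the tokens match.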

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow