Indexing multi-lingual content with Lucene.net
I use Lucene.net to index content, documents, etc. on websites. The index is very simple and has this format:
- LuceneId: unique id for Lucene (TypeId + ItemId)
- TypeId: the type of text (e.g. page content, product, public doc, etc.)
- ItemId: the web page id, document id, etc.
- Text: the text indexed
- Title: web page title, document name, etc., to display with the search results
I've got these options to adapt it to serve multi-lingual content:
- Create a separate index for each language, e.g. Lucene-enGB, Lucene-frFR, etc.
- Keep the one index and add an additional 'language' field to it to filter the results.
Which is the best option - or is there another? I've not used multiple indexes before so I'm leaning toward the second.
You can eliminate both options 1 and 2.
You can use one index and, for each field that might contain Arabic words, create two fields. For example, if the field "Text" might contain Arabic or English content:
- Create two fields for "Text": one, "Text", indexed and searched with your standard analyzer, and another, "Text_AR", indexed and searched with the ArabicAnalyzer. To achieve that you can use PerFieldAnalyzerWrapper.
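A minimal sketch of that two-field setup, assuming Lucene.Net 3.0.x with the contrib analyzers package (which provides ArabicAnalyzer). The sample text, the "page-42" id and the query string are made up for illustration. In newer releases (4.8) the wrapper is, as far as I know, built from a dictionary of field-to-analyzer mappings rather than with AddAnalyzer, so treat this as a starting point, not a drop-in implementation.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.AR;        // ArabicAnalyzer (contrib analyzers package)
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class PerFieldExample
{
    static void Main()
    {
        // StandardAnalyzer is the default; only "Text_AR" uses ArabicAnalyzer.
        var wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        wrapper.AddAnalyzer("Text_AR", new ArabicAnalyzer(Version.LUCENE_30));

        using (var dir = new RAMDirectory())
        {
            using (var writer = new IndexWriter(dir, wrapper, IndexWriter.MaxFieldLength.UNLIMITED))
            {
                // Hypothetical sample content mixing English and Arabic.
                string text = "Welcome to the site مرحبا بكم في الموقع";

                var doc = new Document();
                doc.Add(new Field("LuceneId", "page-42", Field.Store.YES, Field.Index.NOT_ANALYZED));
                // The same text goes into both fields; each field is analyzed differently.
                doc.Add(new Field("Text", text, Field.Store.NO, Field.Index.ANALYZED));
                doc.Add(new Field("Text_AR", text, Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }

            // Parse with the same wrapper so each field's query terms
            // go through the analyzer that was used at index time.
            var parser = new MultiFieldQueryParser(
                Version.LUCENE_30, new[] { "Text", "Text_AR" }, wrapper);
            Query query = parser.Parse("الموقع");

            using (var searcher = new IndexSearcher(dir, true))
            {
                TopDocs hits = searcher.Search(query, 10);
                Console.WriteLine("Matches: " + hits.TotalHits);
            }
        }
    }
}
```

The point of passing the same wrapper to both the IndexWriter and the query parser is that a term is only found if it was normalized the same way at index time and at search time.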