Question

I'm wondering what the best practices or experiences are for multilingual indexing and search in Elasticsearch. I read through a number of resources, and as best as I can distill it, the available options for indexing are:

  1. separate index per language;

  2. multi field type for multilingual field;

  3. separate field for all the possible languages.

So, I'm wondering what the side effects are of choosing one or the other of these options (or some other that I've missed). I guess having more indices does not really slow down the cluster (unless there is a huge number of languages), so I'm not sure what I would gain from choosing 2 or 3, except perhaps easier maintenance.

Any help welcomed!

No correct solution

OTHER TIPS

This is a somewhat old question, but the info might be helpful anyway. The index/mapping structure mainly depends on your use case.
Do you need to use all the languages simultaneously, or is only one language used at a time?

  • Option 1: a multilingual website, for example, where users only see and search in the language they have currently chosen. In this case my experience is that index-per-language is a good solution, especially if you need to be able to add and remove languages easily. The data is split between the indices (a performance benefit). Setting up analyzers for each language is easy, especially if their settings differ only by the language name. Personally, I'm currently using this option for one of my projects.
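As a rough illustration of the index-per-language setup, here is a minimal Python sketch that derives one index name and one field mapping per language. The `books` base name, the `book_title` field, and the language codes are assumptions made up for this example. The `text` type assumes a recent Elasticsearch version (older versions, like the one in this thread, used `string`). `english`, `german`, and `italian` are built-in language analyzer names.

```python
# Hypothetical mapping from language code to a built-in Elasticsearch
# language analyzer name.
LANGUAGES = {"en": "english", "de": "german", "it": "italian"}


def per_language_indices(base, languages):
    """Return {index_name: field_mapping}, one entry per language.

    Only the analyzer differs between the indices, which is what makes
    adding or removing a language cheap with this layout.
    """
    return {
        f"{base}_{code}": {
            # "book_title" is an illustrative field name.
            "book_title": {"type": "text", "analyzer": analyzer}
        }
        for code, analyzer in languages.items()
    }


if __name__ == "__main__":
    for name, mapping in per_language_indices("books", LANGUAGES).items():
        print(name, mapping)
```

Removing a language then amounts to deleting its index, with no reindexing of the others.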

General notes for options 2 and 3: Either of these options lets you score documents differently based on the language, as you can define scoring for each language field. You can add new fields to a mapping if you need to add more languages, but there is no way to remove or change existing fields. Hence, to drop a language, you will have to reindex all your content and set the property for the removed language to empty. You will also need to add new analyzers for every new language, which requires closing the index first and reopening it after the changes are made.
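The close/update/open sequence mentioned above can be sketched as a list of REST calls. This is only an outline under assumptions: the index name, analyzer name, and analyzer definition are hypothetical, and the function only builds the requests rather than sending them. The `_close`, `_settings`, and `_open` endpoints are the standard index APIs.

```python
def add_analyzer_steps(index, analyzer_name, analyzer_definition):
    """Return the (method, path, body) calls needed to add an analyzer.

    Analysis settings cannot be changed on an open index, so the index
    is closed first and reopened once the settings are updated.
    """
    return [
        ("POST", f"/{index}/_close", None),
        (
            "PUT",
            f"/{index}/_settings",
            {"analysis": {"analyzer": {analyzer_name: analyzer_definition}}},
        ),
        ("POST", f"/{index}/_open", None),
    ]


if __name__ == "__main__":
    # Hypothetical custom analyzer for a newly added language.
    definition = {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]}
    for step in add_analyzer_steps("books", "my_french", definition):
        print(step)
```

Note that the index is unavailable for search and indexing while it is closed.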

  • Option 2: If you need to search in all languages at once the multi-field gives you the easiest access as you can address all its sub-fields at once:
    "book_title": {
        "type": "multi_field",
        "fields": {
            "english": {
                "type": "string"
            },
            "german": {
                "type": "string"
            },
            "italian": {
                "type": "string"
            }
        }
    }

Here you can search in a specific language (e.g. "book_title.english") or in all languages (using "book_title"). Be careful not to update the field using the "book_title" name; use "book_title.[language]" instead. Using "book_title" will update all the sub-fields with identical data, which is probably not what you want.

  • Option 3: Completely separate fields. You will need to list them all in the search query if you want to search as in option 2, but this is safer in terms of indexing, as you cannot overwrite all the languages by mistake.

  • Idea for option 4: use a type per language. This can be used if you have only one type of document, and it lets you have different fields per language. It is not useful if you have multiple document types.
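To illustrate the difference in query shape between options 2 and 3, here is a minimal Python sketch that builds a `multi_match` query over every language variant of a field. The `.` separator follows the "book_title.english" naming from option 2. The `_` separator and the resulting "book_title_english" style names are an assumption about how the separate fields in option 3 might be named.

```python
def multilingual_query(text, field, languages, separator="."):
    """Build a search body that matches `text` in every language field.

    separator="." addresses sub-fields (option 2);
    separator="_" addresses hypothetical standalone fields (option 3).
    """
    fields = [f"{field}{separator}{lang}" for lang in languages]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


if __name__ == "__main__":
    langs = ["english", "german", "italian"]
    print(multilingual_query("castle", "book_title", langs))
    print(multilingual_query("castle", "book_title", langs, separator="_"))
```

Either way, the list of fields has to be kept in sync with the languages present in the mapping.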

In case other people are looking for answers, here's a direct link to the documentation on the Elasticsearch site: https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html

I would go with option 1 (separate index per language), as suggested by the Elasticsearch documentation, since it avoids term-frequency issues.

If your document contains multiple languages, you can put it in multiple indices and use field collapsing at query time to avoid duplicates of the same document being returned.
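A minimal sketch of what such a request could look like, assuming each copy of a document carries a shared identifier in a field called `doc_id` (a hypothetical name; for collapsing it would need to be a keyword or numeric field) and that the per-language indices are named like `books_en`. The function only builds the path and body; it does not send anything.

```python
def collapsed_search(text, indices, field="book_title", collapse_field="doc_id"):
    """Build a multi-index search that returns one hit per document.

    Several indices are searched at once by joining their names with
    commas; the "collapse" clause keeps only the best hit for each
    distinct value of `collapse_field`.
    """
    path = "/" + ",".join(indices) + "/_search"
    body = {
        "query": {"match": {field: text}},
        "collapse": {"field": collapse_field},
    }
    return path, body


if __name__ == "__main__":
    print(collapsed_search("castle", ["books_en", "books_de"]))
```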

I think it all depends on the use case. Option 1 won't be optimal if we have multiple fields with mixed languages (locales), as there would be a lot of redundant data for non-localizable fields. Option 2 may be better in that case.

Licensed under: CC-BY-SA with attribution