Use multiple stemming languages with ElasticSearch

https://stackoverflow.com/questions/11042117

14-06-2021

Question

I'm building a search engine for a website where users can be from many different countries and post text content.

I'll assume that:
- A French user generates content in French and English
- A German user generates content in German and English
etc.

What I'd like to know is whether it is possible to run a search using several Snowball stemmer languages at the same time, so that we get appropriate results for every language in a single query.

Do we have to create one index per Snowball stemmer language?

Is there a known pattern for such a case?

Thanks

Solution 4

This new ElasticSearch plugin works fine:

https://github.com/yakaz/elasticsearch-analysis-combo

OTHER TIPS

Quick disclaimer: I'm not an expert in stemming or language morphology, but since no one else is responding, here's my understanding. Also, most of my experience is with Solr.

To query with stemming against multiple languages and get a single, mixed result set, you need a multilingual stemmer. I'm not sure what is available for Elasticsearch.

If you apply multiple stemmers designed for single languages to one index, they will step on each other's toes and likely not produce the expected results (stemming rules vary significantly from language to language).

Having an index per language with respective stemmers works for queries with single language results. Trying to combine results from multiple queries against multiple indices is usually fairly problematic (you have to attempt to normalize relevancy and deal with paging).
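To make the "normalize relevancy and deal with paging" problem concrete, here is a minimal sketch of merging hits from two per-language indices on the client side. The index names (`posts_fr`, `posts_en`), the hit format, and the choice of min-max normalization are all assumptions made for illustration; this is one possible approach, not something the search engine does for you.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, raw score from one index)


def normalize(hits: List[Hit]) -> List[Hit]:
    """Rescale one index's scores to [0, 1] with min-max normalization."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = high - low
    # If all scores are equal, treat every hit as equally relevant.
    return [
        (doc_id, 1.0 if span == 0 else (score - low) / span)
        for doc_id, score in hits
    ]


def merge_page(results_by_index: Dict[str, List[Hit]], page: int, size: int) -> List[Hit]:
    """Merge normalized hits from several indices and slice out one page."""
    merged: List[Hit] = []
    for hits in results_by_index.values():
        merged.extend(normalize(hits))
    # Stable sort: ties keep the order in which the indices were listed.
    merged.sort(key=lambda hit: hit[1], reverse=True)
    start = page * size
    return merged[start:start + size]


# Hypothetical raw results; the two indices score on very different scales.
results = {
    "posts_fr": [("fr-1", 12.0), ("fr-2", 6.0), ("fr-3", 3.0)],
    "posts_en": [("en-1", 2.0), ("en-2", 1.5), ("en-3", 1.0)],
}
first_page = merge_page(results, page=0, size=4)
print([doc_id for doc_id, _ in first_page])
```

Note the paging cost: to build page `n` correctly, each index has to be queried for its top `(n + 1) * size` hits, since any of them could end up on that page after merging.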

You can create two separate indices and search on both (or all) of them at the same time. As long as the indices use the same field names, you will get valid results.
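As a rough sketch of what searching several indices at once looks like, the snippet below only builds the request path and JSON body, so it runs without a cluster or client library. The index names (`posts_fr`, `posts_en`) and the field name `content` are assumptions for illustration.

```python
import json
from typing import Dict, List, Tuple


def build_multi_index_search(indices: List[str], field: str, text: str) -> Tuple[str, Dict]:
    """Return the request path and JSON body for one search over several indices."""
    # Elasticsearch accepts a comma-separated list of index names in the path.
    path = "/" + ",".join(indices) + "/_search"
    # The same field name must exist in every index for the match to apply.
    body = {"query": {"match": {field: text}}}
    return path, body


path, body = build_multi_index_search(["posts_fr", "posts_en"], "content", "chevaux")
print("GET", path)
print(json.dumps(body, indent=2))
```

Each index analyzes the query text with its own analyzer for that field, which is why the field names need to line up across indices.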

Earlier this year Kiju Kim from the Elasticsearch team published some good articles on the elastic.co blog about how to work with multiple languages.

Basically, you can use multiple fields for your content, one for each language you want to support (Part 2), each using a language-specific analyzer (Part 1). Part 3 adds an optimisation: an ingest pipeline (with an ingest plugin for language detection) detects the language and populates only the matching language field instead of all fields.
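One way to read the multiple-fields idea is sketched below: a single text field with one sub-field per language, each using a built-in language analyzer, and a `multi_match` query across all of them. The field name `content`, the sub-field suffixes, and the typeless mapping layout (Elasticsearch 7 or later) are assumptions on my part; the blog series may structure its fields differently.

```python
import json
from typing import Dict

# Built-in Elasticsearch language analyzers, keyed by a sub-field suffix.
# The suffixes themselves are arbitrary names chosen for this sketch.
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def build_mapping(field: str) -> Dict:
    """Index the same text once per language analyzer via multi-fields."""
    sub_fields = {
        suffix: {"type": "text", "analyzer": analyzer}
        for suffix, analyzer in ANALYZERS.items()
    }
    return {"mappings": {"properties": {field: {"type": "text", "fields": sub_fields}}}}


def build_query(field: str, text: str) -> Dict:
    """Search the base field plus every language sub-field."""
    fields = [field] + [f"{field}.{suffix}" for suffix in ANALYZERS]
    # most_fields adds up the scores of all fields that match.
    return {"query": {"multi_match": {"query": text, "fields": fields, "type": "most_fields"}}}


mapping = build_mapping("content")
query = build_query("content", "running horses")
print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

With this layout every document is analyzed by every language analyzer, which is the cost that the language-detection step in Part 3 is meant to avoid.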

You can combine stemmers in a single analyzer. I assume there will be conflicts and that the order of the filters will matter. I wonder how big a problem that is.

"settings": {
    "index": {
        "analysis": {
            "filter": {
                "german_stemmer": {
                    "type": "stemmer",
                    "name": "light_german"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "name": "english"
                },
                "french_stemmer": {
                    "type": "stemmer",
                    "name": "light_french"
                },
                "italian_stemmer": {
                    "type": "stemmer",
                    "name": "light_italian"
                }
            },
            "analyzer": {
                "asdfghjkl": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_stemmer",
                        "italian_stemmer",
                        "french_stemmer",
                        "german_stemmer"
                    ]
                }
            }
        }
    }
}
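To illustrate why order matters when stemmers are chained like this, here is a toy model in Python. The two "stemmers" are hypothetical suffix strippers invented for this sketch; they are not the real `english` or `light_german` rules. The point is only that each filter operates on the output of the previous one.

```python
from typing import Callable, List

Stemmer = Callable[[str], str]


def strip_suffixes(suffixes: List[str]) -> Stemmer:
    """Build a toy stemmer that removes the first matching suffix."""
    def stem(token: str) -> str:
        for suffix in suffixes:
            # Keep at least three characters so short words survive.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token
    return stem


# Hypothetical stand-ins, NOT the real Elasticsearch stemmer rules.
toy_english = strip_suffixes(["ing", "s"])
toy_german = strip_suffixes(["en", "e"])


def run_chain(token: str, chain: List[Stemmer]) -> str:
    """Apply each stemmer to the output of the previous one, like a filter chain."""
    for stem in chain:
        token = stem(token)
    return token


english_first = run_chain("tables", [toy_english, toy_german])
german_first = run_chain("tables", [toy_german, toy_english])
print(english_first, german_first)
```

In this toy model the same word ends up as two different tokens depending on the filter order. Against a real index, the `_analyze` API can be used to inspect the tokens an analyzer actually produces.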

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow