How to get word trigrams in elasticsearch

https://stackoverflow.com/questions/23356890

11-07-2023

Question

I have been trying to generate trigrams with Elasticsearch tokenizers, following the tutorials at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html and http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams

Following these docs and testing the analyzer with

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'

produces character-level nGrams like FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

What I want instead is whole-word trigrams.

For example, the trigrams for "the quick red fox jumps over the lazy brown dog" would be:

the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog

In a nutshell, how can I create trigrams like the ones above using Elasticsearch?

Solution

Found it. The answer lies in the shingle token filter. These index settings made it work:

{
   "settings": {
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "shingle",
               "max_shingle_size": 3,
               "min_shingle_size": 3,
               "output_unigrams": false
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   }
}

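The key attributes here are the filter type (shingle) and min_shingle_size and max_shingle_size, both set to 3. Setting output_unigrams to false drops the single-word tokens, so only three-word shingles are emitted.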
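
To make the expected output concrete without a running cluster, here is a minimal Python sketch that approximates what the nGram_analyzer chain above does: whitespace tokenization, lowercasing, a rough stand-in for asciifolding, then a sliding window of three tokens. The function names (fold_ascii, word_shingles) are made up for illustration and are not part of Elasticsearch, and the accent stripping is only an approximation of the real asciifolding filter.

```python
import unicodedata
from typing import List


def fold_ascii(token: str) -> str:
    """Rough approximation of asciifolding: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_shingles(text: str, size: int = 3) -> List[str]:
    """Emulate whitespace tokenizer -> lowercase -> asciifolding -> shingle.

    Mirrors min_shingle_size = max_shingle_size = size with
    output_unigrams = false: only full windows of `size` tokens are kept.
    """
    tokens = [fold_ascii(t.lower()) for t in text.split()]
    # Slide a window of `size` tokens across the token list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


trigrams = word_shingles("The quick red fox jumps over the lazy brown dog")
for shingle in trigrams:
    print(shingle)
```

Note that in this sketch an input with fewer than three words yields no tokens at all. As far as I know the real shingle filter behaves the same way when output_unigrams is false, unless output_unigrams_if_no_shingles is enabled, so short field values may need separate handling.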

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow