Question

I have been trying to generate trigrams with Elasticsearch tokenizers. I have followed the tutorials at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html and http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams

Following these docs and testing the analyzer with

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'

produces character nGrams like: FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

What I want instead is whole-word trigrams.

For example, the trigrams for "the quick red fox jumps over the lazy brown dog" would be:

the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog

In a nutshell, how can I create trigrams like the ones above using Elasticsearch?

Solution

Found it. The answer lies in the shingle token filter. These index settings made it work:

{
   "settings": {
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "shingle",
               "max_shingle_size": 3,
               "min_shingle_size": 3,
               "output_unigrams": false
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   }
}

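The key attributes here are "type": "shingle" and the min/max shingle sizes. Setting both min_shingle_size and max_shingle_size to 3 produces only three-word shingles, and "output_unigrams": false keeps the individual words out of the token stream.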
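
To make the effect of those settings concrete, here is a rough Python sketch of what a whitespace tokenizer followed by lowercasing and a size-3 shingle filter produces. It is only an illustration and does not call Elasticsearch; the function name word_shingles is made up for this example, and the asciifolding step is left out for brevity.

```python
def word_shingles(text, size=3):
    """Return every run of `size` consecutive whitespace-separated words.

    Hypothetical helper for illustration only; it approximates the
    whitespace tokenizer + lowercase filter + shingle filter chain
    (min = max = size, no unigrams). asciifolding is not reproduced.
    """
    # Split on whitespace and lowercase, like the tokenizer + lowercase filter.
    tokens = text.lower().split()
    # Slide a window of `size` tokens over the list and join each window
    # with a single space.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    for shingle in word_shingles("the quick red fox jumps over the lazy brown dog"):
        print(shingle)
```

Run on the sentence from the question, this prints the eight trigrams listed there, from "the quick red" through "lazy brown dog".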

Licensed under: CC-BY-SA with attribution