Question

I know that you can find the most used terms in an index by using facets.

For example, given the following inputs:

"A B C" 
"AA BB CC"
"A AA B BB"
"AA B"

a terms facet returns this:

B:3
AA:3
A:2
BB:2
CC:1
C:1

But I'm wondering whether it is possible to list counts for multi-word terms like the following:

AA B:2
A B:1
BB CC:1

....etc...

Is there such a feature in Elasticsearch?

Solution

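As mentioned in ramseykhalaf's comment, a shingle filter produces tokens that are "n" words long (word n-grams). If the field is analyzed with such a filter, a terms facet on it counts word sequences as well as single words. For example: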

"settings" : { 
   "analysis" : {
       "filter" : {
          "shingle":{
              "type":"shingle",
              "max_shingle_size":5,
              "min_shingle_size":2,
              "output_unigrams":"true"
           },
           "filter_stop":{
              "type":"stop",
              "enable_position_increments":"false"
           }
       },
       "analyzer" : {
           "shingle_analyzer" : {
               "type" : "custom",
               "tokenizer" : "whitespace",
               "filter" : ["standard", "lowercase", "shingle", "filter_stop"]
           }
       }
   }
},
"mappings" : {
   "type" : {
       "properties" : {
           "letters" : {
               "type" : "string",
               "analyzer" : "shingle_analyzer"
           }
       }
   }
}

See this blog post for full details.
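
To get a feel for what the shingle filter would feed into a terms facet, here is a rough pure-Python sketch that mimics it on the inputs from the question. It does not call Elasticsearch; the helper names `shingles` and `shingle_counts` are made up for illustration, and it assumes the facet counts each term once per document. It also keeps the original case, whereas the analyzer above would lowercase the tokens.

```python
from collections import Counter


def shingles(text, min_size=2, max_size=2):
    """Return the word n-grams (shingles) of a whitespace-tokenized string."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(min_size, max_size + 1)
        for i in range(len(words) - n + 1)
    ]


def shingle_counts(docs, min_size=2, max_size=2):
    """Count in how many documents each shingle occurs (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        # set() so a shingle is counted once per document (assumed facet behavior)
        counts.update(set(shingles(doc, min_size, max_size)))
    return counts


docs = ["A B C", "AA BB CC", "A AA B BB", "AA B"]
bigram_counts = shingle_counts(docs)

for term, count in bigram_counts.most_common():
    print(f"{term}: {count}")
```

With two-word shingles, "AA B" is the only sequence that occurs in two of the sample documents; every other pair occurs once. Calling `shingle_counts(docs, 1, 1)` reproduces the single-term counts from the question.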

OTHER TIPS

I'm not sure whether Elasticsearch will let you do this natively the way you want. But you might be interested in checking out Carrot2 (http://search.carrot2.org) to accomplish what you want, and probably more.

Licensed under: CC-BY-SA with attribution