Question

I have the following analyzer (a slight tweak to the way snowball would be setup):

  string_analyzer: {
    filter: [ "standard", "stop", "snowball" ],
    tokenizer: "lowercase"
  }

Here is the field it is applied to:

  indexes :title, type: 'string', analyzer: 'string_analyzer'

  query do
    match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
  end

I have a record in my index with the title "foo bar".

If I search for "foo bar", it appears in the results.

However, if I search for "foobar", it doesn't.

Can someone explain why, and if possible, how I could get it to work?

Can someone also explain how to get the reverse to work, so that if I had a record with the title "foobar", a user could search for "foo bar" and see it as a result?

Thanks

Solution

You can only search for tokens that are in your index, so let's look at what you are indexing. You're currently using the lowercase tokenizer (which splits the string on non-letter characters and lowercases the resulting tokens), then applying the standard token filter (redundant here, because you are not using the standard tokenizer), followed by the stop and snowball filters.

If we create that analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "string_analyzer" : {
               "filter" : [
                  "standard",
                  "stop",
                  "snowball"
               ],
               "tokenizer" : "lowercase"
            }
         }
      }
   }
}
'

and use the analyze API to test it out:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer' 
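
and likewise for the single word (the same call, with only the text value changed):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foobar&analyzer=string_analyzer'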

you'll see that "foo bar" produces the two terms ["foo","bar"], while "foobar" produces the single term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work: the term "foobar" is simply not in the index.

If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we use an ngram token filter inside a custom analyzer.

So delete the test index:

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 

and specify a new analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "ngrams" : {
               "max_gram" : 5,
               "min_gram" : 1,
               "type" : "ngram"
            }
         },
         "analyzer" : {
            "ngrams" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "ngrams"
               ],
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Now, if we test the new analyzer with the same _analyze API (pointing the analyzer parameter at ngrams):
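
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=ngrams'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foobar&analyzer=ngrams'

we get: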

"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar"  => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]

So if we index "foo bar" and search for "foobar" with the match query, the query looks for any of those tokens, some of which exist in the index.
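
To try this end to end, here is a sketch (the my_field name matches the search example below; the mapping, the document IDs and the explicit refresh are my own additions):

curl -XPUT 'http://127.0.0.1:9200/test/test/_mapping?pretty=1'  -d '
{
   "test" : {
      "properties" : {
         "my_field" : {
            "type" : "string",
            "analyzer" : "ngrams"
         }
      }
   }
}
'

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '{ "my_field" : "foo bar" }'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '{ "my_field" : "wear the fox hat" }'
curl -XPOST 'http://127.0.0.1:9200/test/_refresh?pretty=1'

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "my_field" : "foobar"
      }
   }
}
'

Both documents match this query.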

Unfortunately, the query also matches "wear the fox hat", because the two share short tokens like f, o, fo, a and ar. While the "foo bar" document will appear higher up the list of results because it has more tokens in common with the query, you will still get apparently unrelated results.

This can be controlled by using the minimum_should_match parameter, e.g.:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "my_field" : {
            "minimum_should_match" : "60%",
            "query" : "foobar"
         }
      }
   }
}
'

The exact value for minimum_should_match depends on your data; experiment with it.

Licensed under: CC-BY-SA with attribution