You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase
tokenizer (which tokenizes a string on non-letter characters and lowercases them) then applying the standard
filter (redundant, because you are not using the standard
tokenizer), the stop
and snowball
filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"string_analyzer" : {
"filter" : [
"standard",
"stop",
"snowball"
],
"tokenizer" : "lowercase"
}
}
}
}
}
'
and use the analyze
API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar"
produces the terms ["foo","bar"]
and "foobar"
produces the term ["foobar"]
. So indexing "foo bar"
and searching for "foobar"
currently cannot work.
If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we use the ngram
analyzer.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"settings" : {
"analysis" : {
"filter" : {
"ngrams" : {
"max_gram" : 5,
"min_gram" : 1,
"type" : "ngram"
}
},
"analyzer" : {
"ngrams" : {
"filter" : [
"standard",
"lowercase",
"ngrams"
],
"tokenizer" : "standard"
}
}
}
}
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar"
and we search for "foobar"
using the match
query, then the query becomes a query looking for any of those tokens, some of which exist in the index.
Unfortunately, it'll also overlap with "wear the fox hat"
(f
,o
,a
). While foobar
will appear higher up the list of results because it has more tokens in common, you will still get apparently unrelated results.
This can be controlled by using the minimum_should_match
parameter, eg:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
"query" : {
"match" : {
"my_field" : {
"minimum_should_match" : "60%",
"query" : "foobar"
}
}
}
}
'
The exact value for minimim_should_match
depends upon your data - experiment with it.