Question

In my Elasticsearch dataset we have unique IDs whose segments are separated by periods. A sample ID might look like c.123.5432.

Using an nGram tokenizer, I'd like to be able to search for: c.123.54

This doesn't return any results. I believe the tokenizer is splitting on the period. To account for this I added "punctuation" to the token_chars, but there's no change in results. My analyzer/tokenizer is below.

I've also tried "token_chars": [], which per the documentation should keep all characters.

"settings" : {
    "index" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "1",
                    "max_gram" : "10",
                    "token_chars": [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
                }
            }
        }
    }
},
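
A quick way to confirm what the tokenizer actually emits is the _analyze API. This is a sketch; the index name my_index is an assumption, not from the original:

GET /my_index/_analyze?tokenizer=my_ngram_tokenizer&text=c.123.5432

The response lists every gram produced, which makes it easy to see whether the periods survive tokenization.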

Edit (more info): This is the mapping of the relevant field:

"ProjectID":{"type":"string","store":"yes", "copy_to" : "meta_data"},

And this is the field I'm copying it into (which also has the nGram analyzer):

"meta_data" : { "type" : "string", "store":"yes", "index_analyzer": "my_ngram_analyzer"}

This is the command I'm using in Sense to check whether my search works (note that it searches the "meta_data" field):

GET /_search?pretty=true
{ 
    "query": {
        "match": {
            "meta_data": "c.123.54"
        }
    }
}

Solution

Solution from s1monw at https://github.com/elasticsearch/elasticsearch/issues/5120

When only an index_analyzer is specified, the query string is analyzed with the standard analyzer at search time, so it is never broken into the same nGrams as the indexed terms and nothing matches. To fix it I changed index_analyzer to analyzer, which applies my_ngram_analyzer at both index and search time. Keep in mind the number of results will increase greatly, so raising min_gram to a higher number may be necessary.
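
Applied to the mapping above, the fix is a one-key change (a sketch using the field names from the question; everything else stays the same):

"meta_data" : { "type" : "string", "store" : "yes", "analyzer" : "my_ngram_analyzer" }

With analyzer set, the query string c.123.54 is run through the same nGram analysis as the indexed text, so its grams line up with the stored ones and the match query returns the document.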

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow