Question

I'm using Elasticsearch 0.90.7, so I don't think the answer to "What exactly does the Standard tokenfilter do in Elasticsearch?" applies (although what I'm seeing is similar).

I build the following:

curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
  "mappings": {
    "article": {
      "properties": {
        "text": {
          "type": "string"
        }
      }
    }
  }
}'

I populate the following:

curl -XPUT "http://localhost:9200/testindex/article/1" -d'{
  "text": "file name. pdf"
}'

curl -XPUT "http://localhost:9200/testindex/article/2" -d'{
  "text": "file name.pdf"
}'

Search returns the following:

curl -XPOST "http://localhost:9200/testindex/_search" -d '{
  "fields": [],
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "\"file name\""
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "testindex",
        "_type": "article",
        "_id": "1",
        "_score": 0.30685282
      }
    ]
  }
}

... given this, I'm guessing that the standard tokenizer is changing document #2 from "file name.pdf" into "file namepdf".

My questions are:

  • am I guessing right here?
  • if so: any ideas what tokenizer I could use to handle these cases? (Or will I need to process the text in my client before submission?)

Solution

You can check for yourself using the Analyze API.
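For example, against the index above (a sketch; I'm assuming the default standard analyzer and the same local endpoint):

curl -XGET "http://localhost:9200/testindex/_analyze?analyzer=standard&pretty" -d 'file name. pdf'

curl -XGET "http://localhost:9200/testindex/_analyze?analyzer=standard&pretty" -d 'file name.pdf'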

This yields the tokens file, name, and pdf for "file name. pdf",

and the tokens file and name.pdf for "file name.pdf".

The StandardAnalyzer, or rather the StandardTokenizer, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, which says:

Do not break within sequences, such as “3.2”

So, "name.pdf" is considered a full word by the StandardTokenizer.

For your query, the SimpleAnalyzer would work, since it splits on every non-letter character (including the dot). You can use the Analyze API as well as the elasticsearch-inquisitor plugin to test the available analyzers.
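If you go that route, here is a minimal sketch of the mapping change (the analyzer is applied at index time, so the index has to be recreated and the documents reindexed):

curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
  "mappings": {
    "article": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "simple"
        }
      }
    }
  }
}'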

Licensed under: CC-BY-SA with attribution