Elasticsearch: is there a way to declare all (possibly dynamic) subfields of an object field as string?

https://stackoverflow.com/questions/20401709

29-08-2022

Question

I have a doc_type with a mapping similar to this very simplified one:

{
   "test":{
      "properties":{
         "name":{
            "type":"string"
         },
         "long_searchable_text":{
            "type":"string"
         },
         "clearances":{
            "type":"object"
         }
      }
   }
}

The field clearances should be an object, with a series of alphanumeric identifiers for filtering purposes. A typical document will have this format:

{
    "name": "Lord Macbeth",
    "long_searchable_text": "Life's but a walking shadow, a poor player, that...",
    "clearances": {
        "glamis": "aa2862jsgd",
        "cawdor": "3463463551"
    }
}

The problem is that sometimes, during indexing, the first value indexed for a new field inside the clearances object is entirely numeric, as with cawdor above. Elasticsearch then infers the type of that field as long. But this is an accident: the same field may be alphanumeric in another document. When a later document with an alphanumeric value in this field arrives, I get a parsing exception:

{"error":"MapperParsingException[failed to parse [clearances.cawdor]]; nested: NumberFormatException[For input string: \"af654hgss1\"]; ","status":400}

I tried to solve this with a dynamic template defined like this:

{
   "test":{
      "properties":{
         "name":{
            "type":"string"
         },
         "long_searchable_text":{
            "type":"string"
         },
         "clearances":{
            "type":"object"
         }
      }
   },
   "dynamic_templates":[
      {
         "source_template":{
            "match":"clearances.*",
            "mapping":{
               "type":"string",
               "index":"not_analyzed"
            }
         }
      }
   ]
}

But it keeps happening: if the first indexed document has a clearances.some_subfield value that can be parsed as an integer, the subfield is inferred as an integer, and all subsequent documents with alphanumeric values in that subfield fail to be indexed.

I could list all current subfields in the mapping, but there are many of them and I expect their number to grow in the future (triggering an update of the mapping and a need for a full reindexing...).

Is there a way to make this work without resorting to a full reindexing every time a new subfield is added?

Solution

You're almost there.

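Your dynamic template must use path_match on clearances.* rather than a plain match. match is tested against the field name alone (for example cawdor), whereas path_match is tested against the full dotted path (clearances.cawdor), so only path_match can ever match clearances.*. The dynamic_templates block also has to sit inside the type mapping (under test), as the curl example does, not next to it.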
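
To see why the original template never fired, the difference between the two options can be imitated with ordinary shell glob matching. This is only a sketch: template_applies is a hypothetical helper, not part of Elasticsearch, and it assumes that * in a template pattern behaves like a shell wildcard that matches any run of characters, dots included.

```shell
#!/bin/sh
# Hypothetical helper (not an Elasticsearch API): reports whether a
# dynamic-template pattern would apply to a field.
#   $1 = option name: "match" (tested against the leaf field name)
#                     or "path_match" (tested against the full dotted path)
#   $2 = pattern, e.g. "clearances.*"
#   $3 = full dotted path of the field, e.g. "clearances.cawdor"
template_applies() {
    option=$1 pattern=$2 path=$3
    case "$option" in
        match)      candidate=${path##*.} ;;  # keep only the last path segment
        path_match) candidate=$path ;;        # keep the whole dotted path
        *)          echo "unknown option: $option" >&2; return 2 ;;
    esac
    # An unquoted variable used as a case pattern is treated as a glob.
    case "$candidate" in
        $pattern) echo "applies" ;;
        *)        echo "skipped" ;;
    esac
}

template_applies match      "clearances.*" "clearances.cawdor"   # prints "skipped"
template_applies path_match "clearances.*" "clearances.cawdor"   # prints "applies"
```

With match, the pattern is compared against cawdor only, which can never contain the clearances. prefix, so the template is skipped and the type is inferred from the first value as before.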

Here's a runnable example: https://www.found.no/play/gist/df030f005da71827ca96

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create the index; the dynamic template maps every clearances.* subfield to a not_analyzed string

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {},
    "mappings": {
        "test": {
            "dynamic_templates": [
                {
                    "clearances_as_string": {
                        "path_match": "clearances.*",
                        "mapping": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            ]
        }
    }
}'


# Index two documents: numeric values first, then alphanumeric values in the same subfields
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"test"}}
{"clearances":{"glamis":1234,"cawdor":5678}}
{"index":{"_index":"play","_type":"test"}}
{"clearances":{"glamis":"aa2862jsgd","cawdor":"some string"}}
'

# Run a terms facet on clearances.cawdor to check that both documents were indexed

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
    "facets": {
        "cawdor": {
            "terms": {
                "field": "clearances.cawdor"
            }
        }
    }
}
'

Licensed under: CC-BY-SA with attribution
