Question

We have a classic Logstash + Elasticsearch + Kibana setup for log aggregation. We use it to aggregate logs across all servers and applications, and we've stumbled upon the following problem: the first time ES receives a log line (a JSON document in our case), it creates a mapping for that document (see http://bit.ly/1h3qwC9). Most of the time, properties are mapped as strings, but in some cases they're mapped as dates or numbers. In the latter case, if another log line (from a different application) has the same field but with a string value, ES will fail to index it (throwing an exception to its log and continuing as usual). As a workaround we've configured ES to ignore malformed documents (index.mapping.ignore_malformed: true), but it feels more like a hack.
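For illustration (the field names below are made up), suppose one application logs a numeric status field and another logs the same field as a string. Whichever document arrives first fixes the mapping for that index, so the second one is rejected with a mapping exception:

{ "timestamp": "2014-04-01T12:00:00Z", "app": "billing",  "status": 200 }
{ "timestamp": "2014-04-01T12:00:01Z", "app": "frontend", "status": "OK" }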

Any ideas how to solve this issue in an elegant way?


Solution

It sounds like you guys don't care about date types, or about types at all. I think the best solution would be to define a dynamic template that maps every type to string:

{
    "_default_" : {
        "dynamic_templates" : [
            {
                "long_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "long",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            },
            {
                "double_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "double",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            },
            {
                "float_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "float",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            },
            {
                "integer_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "integer",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            },
            {
                "date_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "date",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            },
            {
                "boolean_to_string" : {
                    "match" : "*",
                    "match_mapping_type": "boolean",
                    "mapping" : {
                        "type" : "string",
                        "index" : "analyzed"
                    }
                }
            }
        ]
    }
}

From here.
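As a sketch of how you might apply it (assuming Elasticsearch 1.x, the default logstash-* index naming, and a template name of logstash_strings chosen here for illustration), wrap the _default_ mapping above in an index template so every newly created daily index picks it up. Only the first rule is repeated below; the rest go in the same dynamic_templates array:

curl -XPUT 'http://localhost:9200/_template/logstash_strings' -d '
{
    "template" : "logstash-*",
    "mappings" : {
        "_default_" : {
            "dynamic_templates" : [
                {
                    "long_to_string" : {
                        "match" : "*",
                        "match_mapping_type": "long",
                        "mapping" : {
                            "type" : "string",
                            "index" : "analyzed"
                        }
                    }
                }
            ]
        }
    }
}'

Note that existing indices keep the mappings they already have; the template only takes effect from the next index that gets created.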

OTHER TIPS

After much research, I can sadly declare that an elegant solution for this doesn't currently exist. While you can declare that a field should not be analyzed, you can't tell Elasticsearch to change a field's type dynamically, nor can you have it automatically ignore type mismatches.

In practice that means whatever type you send first will be the only type you can index into that field. If you predeclared the field with a type, you won't be able to index anything other than that type. In either case, all documents with a mismatching type will be dropped. Also note that this tends to flood Elasticsearch's log file, so you should either set up log rotation or configure Elasticsearch not to log those errors in its logging yaml.

Your solution is indeed a potential hack (unless you're certain the un-indexed data is irrelevant). It is much like a try: something / except: pass in Python.

As a general rule (speaking from experience), I suggest that you don't index different kinds of data (not different Elasticsearch types) into fields with the same name, because it becomes extremely hard to analyze in Kibana when you try to run number- or string-based queries (you won't be able to sort, or display histograms or pie charts, on that specific field). Obviously, it's not always easy to change your code (or another application's code) to stop indexing into the same field. In that case, I would identify the originating app and use Logstash's grok (unless you're already sending JSON) and mutate filters to rename the field, as in the sketch below.
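For example, a minimal Logstash filter sketch (the app and field names are made up) that renames the conflicting field only for the offending application:

filter {
  if [app] == "frontend" {
    mutate {
      # keep the string variant under its own name so it no longer
      # collides with the numeric "status" sent by other applications
      rename => [ "status", "status_text" ]
    }
  }
}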

Wondering why ignore_malformed is considered a hack? They put the feature in, I guess, for the exact same reason - that sometimes a field may not evaluate to the declared data type. Does it really break anything?
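If you do stick with ignore_malformed, one way to keep it scoped (rather than turning it on globally in elasticsearch.yml) is to set it per index through an index template; a sketch, with the template name made up here:

curl -XPUT 'http://localhost:9200/_template/logstash_ignore_malformed' -d '
{
    "template" : "logstash-*",
    "settings" : {
        "index.mapping.ignore_malformed" : true
    }
}'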

Edit/Update:
For log ingestion/processing/indexing, my experience is that ingesting logs directly into a store like ES is a bad idea for many reasons (feel free to ask what those are if you're curious).

In short, I always use an ingest/parsing engine before I push data into any data repository like Elasticsearch or HDFS. You can use agents like Logstash or Flume to process/parse data using grok, or write a custom Spark app to ingest, validate and structure data before feeding it to ES.
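For the grok part, a minimal Logstash sketch (the pattern and field names are just an example) that parses the raw line and coerces the numeric field before it ever reaches ES; anything that doesn't match gets tagged _grokparsefailure and can be routed away:

filter {
  grok {
    # parse plain-text lines into typed fields; :int coerces status to a number
    match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{WORD:app} %{NUMBER:status:int} %{GREEDYDATA:msg}" ]
  }
}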

Typically, I build log pipelines like this: Producer (syslog, etc.) -> Kafka/Kinesis (topic-1) -> Spark Streaming app (applies parsing/structuring rules) -> Kafka/Kinesis (topic-2) -> multiple agents (one agent group per data repository). So, for example, I would deploy a group of Flume agents that subscribe to topic-2 and write to HDFS, and in parallel deploy a group of Logstash agents that write to ES.

It might look a bit involved, but the benefits of cleaner, more consistent data are manifold. Everyone from casual data explorers to data scientists will thank you :)

Licensed under: CC-BY-SA with attribution