Question

I'm using Elasticsearch (with Tire) and need to create a facet which outputs statistics about unique MAC addresses per day, week, and month. I've been running into problems getting it working correctly.

I need the mean and total figures and couldn't get it working with the date_histogram facet:

def self.search_stats params
  tire.search(page: params[:page], per_page: 50) do |s|
    filter = []
    filter << { :terms => { ... }}

    s.facet('uniques') do
      date :created_at, :interval => 'day', value_field: 'mac.sortable'
      facet_filter :and, filter
    end
  end
end

That gave an error: "Failed to parse source... "

In the end, I tried using a script to get it done, but that didn't really work out, as I can't figure out how to group the values.

I've been using:

date :created_at, value_script: "doc['mac.sortable'].values.size()", interval: 'day'

Obviously size() is wrong, as I need the unique values.

My mapping looks like this:

mapping do
  ...
  indexes :mac, type: 'multi_field', fields: {
    raw: {type: 'string', index: 'analyzed'},
    sortable: {type: 'string', index: :not_analyzed}
  }
  ...
end

I don't really want to use a script field at all because of the memory usage warning.

How can I get my macs grouped and sorted daily?


Solution

Counting unique values in a distributed environment is difficult. In order to get completely accurate counts, you need to count all unique values on every node, then merge all of those counts into a single list on one node.

For fields with low cardinality this approach can work just fine, but fields with high cardinality will end up using enormous amounts of memory and will more than likely fail.
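The exact approach amounts to something like this toy Ruby sketch (an illustration of the principle, not Elasticsearch internals): each node materializes its full set of unique values, and one node unions them all.

```ruby
require 'set'

# Toy illustration: each "node" holds a shard of documents with a MAC field.
shard_a = %w[aa:bb:01 aa:bb:02 aa:bb:03]
shard_b = %w[aa:bb:02 aa:bb:04]

# Each node must build its complete set of unique values in memory...
local_sets = [shard_a, shard_b].map(&:to_set)

# ...and ship it to a single node for the union. Memory grows linearly
# with cardinality, which is why high-cardinality fields blow up.
exact_count = local_sets.reduce(:|).size
```

For two shards sharing one value, the union yields 4 distinct MACs; the cost is holding every distinct value in memory at once.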

There are two options available, but you have to choose between speed and accuracy. You can either:

  1. Get slow accurate counts using map-reduce
  2. Get fast estimated counts using Elasticsearch

The estimation approach uses the HyperLogLog algorithm (PDF) which estimates how many unique items are in a set.
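To illustrate how HyperLogLog trades a fixed amount of memory for an approximate count, here is a minimal Ruby sketch (parameters and hashing are my own assumptions; this is not the implementation Elasticsearch or the plugins use):

```ruby
require 'digest'

# Toy HyperLogLog: b index bits select one of 2^b small registers.
class TinyHLL
  MASK64 = (1 << 64) - 1

  def initialize(b = 10)
    @b = b
    @m = 1 << b                       # number of registers
    @registers = Array.new(@m, 0)
  end

  def add(item)
    # 64-bit hash taken from the first 8 bytes of an MD5 digest.
    x = Digest::MD5.digest(item.to_s).unpack1('Q>')
    idx = x >> (64 - @b)              # top b bits pick a register
    rest = (x << @b) & MASK64         # remaining bits, left-aligned
    rank = 1                          # position of the first 1-bit
    while (rest & (1 << 63)).zero? && rank < 64 - @b
      rank += 1
      rest = (rest << 1) & MASK64
    end
    @registers[idx] = rank if rank > @registers[idx]
  end

  def estimate
    alpha = 0.7213 / (1 + 1.079 / @m)
    est = alpha * @m * @m / @registers.sum { |r| 2.0**-r }
    zeros = @registers.count(0)
    if est <= 2.5 * @m && zeros > 0   # small-range correction
      @m * Math.log(@m.to_f / zeros)
    else
      est
    end
  end
end
```

With 1024 registers (about 1 KB), the estimate is typically within a few percent of the true cardinality no matter how many distinct items are added, which is exactly the memory/accuracy trade being described.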

With the new aggregations framework available in Elasticsearch 1.0, there are plans to support HLL via the cardinality aggregation. Currently the code is not in the main repository, but it can be seen at: https://github.com/jpountz/elasticsearch/tree/feature/term_count_aggregations

A HyperLogLog facet is available as a plugin for Elasticsearch but it hasn't been updated for recent versions. There is also this newly released cardinality plugin which uses HLL. I haven't used either plugin so can't vouch for them, but these look like your only options until official support for HLL is added to Elasticsearch.

OTHER TIPS

You can read: http://www.elasticsearch.org/blog/count-elasticsearch/

POST /access/search/_search
{
    "size" : 0,
    "aggs" : {
        "daily" : {
            "date_histogram" : { "field" : "date", "interval" : "day" },
            "aggs" : {
                "query_count" : { "cardinality" : { "field" : "q" } }
            }
        }
    }
}
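Adapted to the question's schema, the same request would bucket by created_at and count distinct mac.sortable values per day (a sketch assuming Elasticsearch ≥ 1.0; the field names come from the question's mapping):

```
POST /access/search/_search
{
    "size" : 0,
    "aggs" : {
        "daily" : {
            "date_histogram" : { "field" : "created_at", "interval" : "day" },
            "aggs" : {
                "unique_macs" : { "cardinality" : { "field" : "mac.sortable" } }
            }
        }
    }
}
```

Changing "interval" to "week" or "month" would give the weekly and monthly groupings the question asks for.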
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow