Counting unique values in a distributed environment is difficult. To get completely accurate counts, you need to collect the unique values from every node and merge them into a single set on one node.
For low-cardinality fields this approach works just fine, but for high-cardinality fields it ends up using enormous amounts of memory and will more than likely fail.
There are two options available, but you have to choose between speed and accuracy. You can either:
- Get slow accurate counts using map-reduce
- Get fast estimated counts using Elasticsearch
The estimation approach uses the HyperLogLog algorithm (PDF), which estimates how many unique items are in a set by hashing each value and tracking the longest runs of leading zero bits it sees, all in a small, fixed amount of memory.
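To make the idea concrete, here's a minimal, illustrative HyperLogLog sketch in Python. This is my own toy version, not Elasticsearch's implementation (real implementations add bias correction, sparse encodings, and a proper 64-bit hash), but it shows the core mechanics, including why the approach suits a distributed environment: two sketches merge with a simple register-wise max, so each node can build its own sketch and ship a few kilobytes to the coordinating node.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch for illustration only."""

    def __init__(self, p=14):
        self.p = p                 # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Hash the value down to a 64-bit integer.
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is just an element-wise max of registers --
        # this is what makes HLL cheap to combine across nodes.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        # Raw HLL estimate: alpha * m^2 / sum(2^-register).
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting when
        # the estimate is low and some registers are still empty.
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print(hll.count())  # estimate should land close to 10000
```

With the default precision the estimate typically lands within a couple of percent of the true count, while the sketch itself stays a fixed 16 KB regardless of how many values you feed it.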
With the new aggregations framework available in Elasticsearch 1.0, there are plans to support HLL via a cardinality
aggregation. The code is not yet in the main repository, but it can be seen at: https://github.com/jpountz/elasticsearch/tree/feature/term_count_aggregations
A HyperLogLog facet is available as a plugin for Elasticsearch, but it hasn't been updated for recent versions. There is also a newly released cardinality plugin which uses HLL. I haven't used either plugin, so I can't vouch for them, but they look like your only options until official support for HLL is added to Elasticsearch.