Question

We're interested in working with high-cardinality indexes (which are known to be a problem for Elasticsearch).

We already know from you that for

select count(distinct high_cardinality_field) from my_table

you already have some optimizations in place for the count. Will it someday be possible to write something like:

select count_via_hyperloglog(high_cardinality_field) from my_table

with count_via_hyperloglog implemented as a UDF or something similar, as is possible right now in ES via ES plugins?


Solution

In Crate, this feature is on our backlog as an additional aggregation function that uses the HyperLogLog algorithm. We plan to derive the naming from Presto (http://prestodb.io/docs/current/functions/aggregate.html). Your example will then probably look like:

select approx_distinct(high_cardinality_field) from my_table

However, a possible performance improvement for one specific field per table is to cluster your table on the high-cardinality field, as described under https://crate.io/docs/current/sql/ddl.html#routing
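
As a rough sketch of what such clustering could look like, assuming Crate's CLUSTERED BY clause as described in the routing documentation linked above: the column names and types here are made up for illustration and are not taken from the question. Rows that share the same value in the routing column are stored on the same shard.

```sql
-- Hypothetical table definition; only the CLUSTERED BY clause matters here.
-- The column names and types are illustrative assumptions.
create table my_table (
    id integer,
    high_cardinality_field string
) clustered by (high_cardinality_field);
```

Only one routing column can be chosen per table, which is why this helps for one specific field only.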

Other suggestions

High-cardinality counting with HyperLogLog is planned for Elasticsearch 1.1.0, and the documentation is already up: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

Example:

{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author"
            }
        }
    }
}

As for something like a UDF, you can use scripts, e.g. by combining a filter aggregation with a script filter:

{
    "aggs": {
        "in_stock_products": {
            "filter": {
                "script": {
                    "script": "doc['price'].value > minPrice",
                    "params": {
                        "minPrice": 5
                    }
                }
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}
Licensed under: CC-BY-SA with attribution