Aggregation queries in Cassandra CQL

Question 1

Aggregation is available in Cassandra as of CASSANDRA-4914, which is included in the 2.2.0-rc1 release.

Question 2

In one particular application we're using Cassandra for the write speed and then have the app compact the data down to a more compressed, slightly aggregated summary form. Then we run an hourly job to copy the summary form to a Postgres table. This approach doesn't score highly for elegance, but it's simple, and it means that we can run ad-hoc analytic queries without having to complicate the primary data ingress path or build bespoke aggregation into the CQL app.
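A minimal sketch of what such an app-side roll-up might look like, assuming raw events arrive as (timestamp, key, value) tuples and are summarized per hour. The event shape, the function name and the summary fields are all made up for illustration; this is not the code from that application.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize_hourly(events):
    """Roll raw (timestamp, key, value) events up into per-hour, per-key summaries.

    The event shape and the hourly bucket size are assumptions for illustration.
    """
    buckets = defaultdict(
        lambda: {"count": 0, "total": 0.0, "min": None, "max": None}
    )
    for ts, key, value in events:
        # Truncate the timestamp to the start of its hour to get the bucket.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        summary = buckets[(hour, key)]
        summary["count"] += 1
        summary["total"] += value
        summary["min"] = value if summary["min"] is None else min(summary["min"], value)
        summary["max"] = value if summary["max"] is None else max(summary["max"], value)
    return dict(buckets)
```

Each resulting (hour, key) entry would then map to one row in the summary table that the hourly job copies across.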

Question 3

Cassandra has native aggregates built into CQL:

Native aggregates

Count

The count function can be used to count the rows returned by a query. Example:

SELECT COUNT (*) FROM plays;
SELECT COUNT (1) FROM plays;

It can also be used to count the non-null values of a given column:

SELECT COUNT (scores) FROM plays;

Max and Min

The max and min functions can be used to compute the maximum and the minimum value returned by a query for a given column. For instance:

SELECT MIN (players), MAX (players) FROM plays WHERE game = 'quake';

Sum

The sum function can be used to sum up all the values returned by a query for a given column. For instance:

SELECT SUM (players) FROM plays;

Avg

The avg function can be used to compute the average of all the values returned by a query for a given column. For instance:

SELECT AVG (players) FROM plays;

You can also create your own aggregates. More documentation on aggregates is here: http://cassandra.apache.org/doc/latest/cql/functions.html?highlight=aggregate
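To give a feel for how a user-defined aggregate is put together: CREATE AGGREGATE takes a state function (SFUNC) that is called once per row, an initial state (INITCOND), and an optional final function (FINALFUNC) that turns the last state into the result. The Python below only models that evaluation order. It is not CQL, and run_aggregate, avg_state and avg_final are hypothetical names for this sketch.

```python
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Model of how an aggregate is evaluated: fold sfunc over the rows,
    then apply finalfunc (if any) to the last state."""
    state = initcond
    for value in rows:
        state = sfunc(state, value)
    return finalfunc(state) if finalfunc else state


def avg_state(state, value):
    """State function: carry (count, total) forward.
    Skipping None here is a choice made for this sketch."""
    count, total = state
    if value is None:
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn (count, total) into an average, or None if empty."""
    count, total = state
    return total / count if count else None


# Example: an average over one column's values, with a null in the middle.
result = run_aggregate([10, None, 20, 30], avg_state, (0, 0), avg_final)
```

In real CQL the state and final functions would be user-defined functions created with CREATE FUNCTION; see the documentation linked above for the actual syntax.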

Question 4

This is just a suggestion, based on what we did in our case. To do aggregation on a Cassandra database, you can use languages like Pig or Hive, which internally generate MapReduce code that performs very well on large data in the cluster. For that you need to have a Hadoop environment set up. After the processing, you can write the processed data back to the Cassandra database or Sqoop it to a MySQL database.

Question 5

Depending on the nature of your data, if you need to perform aggregation on data such as time series, you should perhaps consider Kdb+.

I was also evaluating Cassandra for storing time series telemetry data. I thought it was a perfect fit. However, I came to find there were no aggregation functions. Perhaps this is solvable with Pig and Hive. However, if a solution exists that combines data ingest, storage and analytics into a single language, why wouldn't you consider it?

Question 6

I look at Cassandra as a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. The trade-off, of course, is flexibility and functionality. It's always going to be a trade-off between functionality and performance in the database world.

That being said, Cassandra plays very nicely with third-party software such as Spark. Spark may prove to be very helpful for your use case. There's an open source connector (https://github.com/datastax/spark-cassandra-connector) that helps Spark intelligently find and run analytics on Cassandra data.

Spark SQL lets you run your SELECT SUM queries as well as most Hive-compliant queries.

Question 7

You can create custom indexes in Cassandra using the Apache Lucene plugin (https://github.com/Stratio/cassandra-lucene-index), or you could use different software (a search engine data store) that fits your purpose, such as Elasticsearch (https://www.elastic.co/products/elasticsearch). Elasticsearch is also scalable and open source.

Elasticsearch can also be used alongside Kibana for data visualization based on your aggregated data.