Question

We are currently evaluating Cassandra as the data store for an analytical application. The plan was to dump raw data in Cassandra and then run mainly aggregation queries over it. Looking at CQL, it does not seem to support some traditional SQL operators like:

  • Typical aggregation functions such as average, sum, count distinct, etc.
  • GROUP BY and HAVING clauses

I did not find anything in the documentation that helps achieve the above. I also checked whether there are any hooks for providing such functions as extensions, like in-database map-reduce in MongoDB or user-defined functions in relational databases.

People do talk about the paid DataStax Enterprise edition, but even that achieves this not through plain Cassandra but through separate components like Hadoop-Hive-Pig-Hadoop etc. There are also suggestions to do the needed pre-aggregation before writing data to the DB, since Cassandra writes are fast.

That looks like too much overhead, at least for the basic stuff we need. Am I missing something fundamental here?

Would highly appreciate help on this.

Solution

Aggregation is available in Cassandra as part of CASSANDRA-4914, which is included in the 2.2.0-rc1 release.

Other suggestions

In one particular application we're using Cassandra for the write speed and then have the app compact the data down to a more compressed, slightly aggregated summary form. Then we run an hourly job to copy the summary form to a Postgres table. This approach doesn't score highly for elegance, but it's simple, and it means that we can run ad-hoc analytic queries without having to complicate the primary data ingress path or build bespoke aggregation into the CQL app.
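As a rough illustration of the compaction step described above, the sketch below rolls raw events up into one summary row per game and hour in plain Python. The event shape (game, timestamp, players) and the function names are hypothetical, and reading from Cassandra and writing to Postgres are deliberately left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a datetime to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def summarize(events):
    """Roll raw (game, timestamp, players) events up into one row per game and hour."""
    summary = defaultdict(lambda: {"count": 0, "sum": 0, "min": None, "max": None})
    for game, ts, players in events:
        row = summary[(game, hour_bucket(ts))]
        row["count"] += 1
        row["sum"] += players
        row["min"] = players if row["min"] is None else min(row["min"], players)
        row["max"] = players if row["max"] is None else max(row["max"], players)
    return dict(summary)


# Hypothetical raw events, as they might be read back from the ingest table.
raw_events = [
    ("quake", datetime(2015, 6, 1, 10, 5, tzinfo=timezone.utc), 4),
    ("quake", datetime(2015, 6, 1, 10, 40, tzinfo=timezone.utc), 8),
    ("quake", datetime(2015, 6, 1, 11, 15, tzinfo=timezone.utc), 6),
    ("doom", datetime(2015, 6, 1, 10, 30, tzinfo=timezone.utc), 2),
]

hourly = summarize(raw_events)
```

Keeping the count and the sum in each summary row, rather than a precomputed average, means averages over longer periods can still be derived correctly later on the SQL side.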

Check this out

Native aggregates

Count

The count function can be used to count the rows returned by a query. Example:

SELECT COUNT (*) FROM plays;
SELECT COUNT (1) FROM plays;

It can also be used to count the non-null values of a given column:

SELECT COUNT (scores) FROM plays;
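
To make the difference between the two forms of count concrete, here is a small Python model of the behaviour described above. It is only an illustration over in-memory rows, not driver code; the sample rows and helper names are made up.

```python
# Hypothetical rows from the plays table; None stands in for a null column value.
plays = [
    {"game": "quake", "scores": 120},
    {"game": "quake", "scores": None},
    {"game": "doom", "scores": 75},
]


def count_rows(rows):
    """Model of COUNT(*): every returned row is counted."""
    return len(rows)


def count_column(rows, column):
    """Model of COUNT(column): only rows with a non-null value are counted."""
    return sum(1 for row in rows if row.get(column) is not None)


total_rows = count_rows(plays)
scored_rows = count_column(plays, "scores")
```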

Max and Min

The max and min functions can be used to compute the maximum and the minimum value returned by a query for a given column. For instance:

SELECT MIN (players), MAX (players) FROM plays WHERE game = 'quake';

Sum

The sum function can be used to sum up all the values returned by a query for a given column. For instance:

SELECT SUM (players) FROM plays;

Avg

The avg function can be used to compute the average of all the values returned by a query for a given column. For instance:

SELECT AVG (players) FROM plays;

You can also create your own aggregates. More documentation on aggregates is here: http://cassandra.apache.org/doc/latest/cql/functions.html?highlight=aggregate
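
A user-defined aggregate is assembled from a state function that is applied to each value, an initial state, and an optional final function that turns the accumulated state into the result. The sketch below models that flow in Python for an average. It is not CQL, all names are hypothetical, and a real aggregate would be declared with CREATE FUNCTION and CREATE AGGREGATE as described in the linked documentation.

```python
from functools import reduce


def avg_state(state, value):
    """State function: fold one column value into the running (count, total) pair."""
    count, total = state
    if value is None:  # assumption in this sketch: nulls leave the state unchanged
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: convert the accumulated state into the aggregate result."""
    count, total = state
    return None if count == 0 else total / count


def run_aggregate(values, sfunc, initcond, finalfunc):
    """Apply sfunc to every value starting from initcond, then call finalfunc once."""
    return finalfunc(reduce(sfunc, values, initcond))


# Hypothetical column values, including one null.
players = [4, 8, None, 6]
average = run_aggregate(players, avg_state, (0, 0), avg_final)
```

The same three pieces cover other aggregates as well: only the state type and the two functions change.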

This is just a suggestion, based on what we did in our case. To do aggregation on a Cassandra database, you can use languages like Pig or Hive, which internally generate map-reduce code that performs very well on large data in the cluster. For that you need to have a Hadoop environment set up. After the processing, you can write the processed data back to the Cassandra database or export it with Sqoop to a MySQL database.

Depending on the nature of your data, if you need to perform aggregation on data such as time series, you should perhaps consider Kdb+.

I was also evaluating Cassandra for storing Timeseries Telemetry data. I thought it was a perfect fit. However, I came to find there are no aggregation functions. Perhaps this is solvable with Pig and Hive. However, if a solution exists that combines data ingest, storage and analytics into a single language, why wouldn't you consider it?

I look at Cassandra as a storage engine that has solved the problems of distribution and availability while maintaining scale and performance. The trade-off, of course, is flexibility and functionality. It's always going to be a trade-off between functionality and performance in the database world.

That being said, Cassandra plays very nicely with third-party software such as Spark. Spark may prove to be very helpful for your use case. There's an open source connector (https://github.com/datastax/spark-cassandra-connector) that helps Spark intelligently find and run analytics on Cassandra data.

SparkSQL lets you run your SELECT SUM queries as well as most Hive-compliant queries.

You can create custom indexes in Cassandra using the Apache Lucene plugin (https://github.com/Stratio/cassandra-lucene-index). Alternatively, you could use a different piece of software (a search engine data store) that fits your purpose, such as Elasticsearch (https://www.elastic.co/products/elasticsearch), which is also scalable and open source.

Elasticsearch can also be used alongside Kibana to visualize your aggregated data.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow