Question

We have a 32 node Cassandra cluster with around 100Gb per node using Murmur3 partitioner. It has time series data and we have build secondary indexes on two columns to perform range queries. Currently, the cluster is stable with all the data bulk loaded and all the secondary indexes rebuilt. The issue occurs when we are performing range queries using cql client or hector, just the query for count of rows takes a huge amount of time and it most cases causes nodes to fail due to memory issues. The nodes have 8gb memory, Cassandra MAX Heap is allotted to 4 GB. Has anyone else faced such an issue ? Is there a better way to do count queries ?

Was it helpful?

Solution

I've had similar issues and most often this can be solved by redesigning the schema bearing in mind the queries that you plan to execute against the data in Cassandra. For a timeseries data it is better to have wide tables with granularity depending on your queries. If your query requires data at a granularity of 1 hour, then it is best to have a wide table with all timestamped data points stored within a single row for every hour so you can get all the required data for 1 hour by reading just 1 row.

Since you say the data is bulk loaded, I am assuming that you may have put all the data into a single table which is why the get_count query is taking an enormous amount of time. We have a a cluster with 8GB RAM but have set the heap size to 3 GB because at 4GB, the RAM utilization is almost always at 8GB [full utilization].

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top