Question

I am inserting streaming data into two separate keyspaces: into two column families (both standard) in the first keyspace, and into three column families (two standard and one counter) in the second keyspace.

The insert rate into these column families is well controlled, and pure writes work just fine [about 60% CPU utilization and a load average of roughly 8-10]. Next, I continuously read data from these column families via the Pycassa API while the writes happen in parallel, and I notice a severe degradation in write performance.

What system settings are recommended for parallel writes and reads across two keyspaces? Currently the data directory sits on a single RAID 10 volume on each node.

RAM: 8 GB

Heap size: 4 GB

Quad-core Intel Xeon processor @ 3.00 GHz

concurrent_writes = concurrent_reads = 16 (in cassandra.yaml)

Data Model

Keyspace 1: I am inserting time-series data with the timestamp (T) as the column name in a wide row that stores 24 hours' worth of data in a single row (a write sketch follows the CF layouts below).

CF1:

    Col1    |   Col2    |   Col3 (DateType)  |   Col4 (UUIDType)

RowKey1

RowKey2

:

:

CF2 (Wide column family):

RowKey1 (T1, V1) (T2, V3) (T4, V4) ......

RowKey2 (T1, V1) (T3, V3) .....

:

:
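
For reference, here is a minimal Pycassa write sketch for this wide-row layout. The host, row-key scheme, and value are placeholders assumed for illustration; the post only states that the column name is the timestamp and that one row holds 24 hours of data.

    import time
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Placeholder host; adjust to the actual cluster.
    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf2 = ColumnFamily(pool, 'CF2')

    def day_row_key(source_id, ts):
        # Assumption: one row per source per UTC day (24 hours of data per row).
        return '%s:%s' % (source_id, time.strftime('%Y%m%d', time.gmtime(ts)))

    # Timestamp as the column name, reading as the value
    # (assuming a LongType comparator on CF2).
    ts = int(time.time())
    cf2.insert(day_row_key('sensor42', ts), {ts: 'V1'})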

Keyspace2:

CF1:

    Col1    |   Col2    |   Col3 (DateType)  |   Col4 (UUIDType)  |  ...  |  Col10

RowKey1

RowKey2

:

:

CF2 (Wide column family):

RowKey1 (T1, V1) (T2, V3) (T4, V4) ......

RowKey2 (T1, V1) (T3, V3) .....

:

:

CF3 (Counter Column family):

Counts occurrences of every event stored in CF2.
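
A hedged sketch of how this counter column family might be incremented with Pycassa; the row key and column name here are placeholders, not taken from the post.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Keyspace2', ['localhost:9160'])
    counts = ColumnFamily(pool, 'CF3')

    # Increment the per-event counter whenever an event is written to CF2.
    counts.add('RowKey1', 'event_type', 1)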

Data is continuously read from CF2 only (the wide column families) in Keyspaces 1 and 2. To reiterate, the reads and writes happen in parallel. The amount of data queried with multiget increases incrementally from 1 to 8 row keys, and this process repeats.
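
The read side looks roughly like the Pycassa sketch below; the row keys and column_count are placeholders, but the 1-to-8 key ramp with multiget mirrors what is described above.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf2 = ColumnFamily(pool, 'CF2')

    keys = ['RowKey1', 'RowKey2', 'RowKey3', 'RowKey4',
            'RowKey5', 'RowKey6', 'RowKey7', 'RowKey8']

    # Ramp from 1 to 8 row keys per multiget, then repeat.
    for n in range(1, len(keys) + 1):
        rows = cf2.multiget(keys[:n], column_count=1000)
        for row_key, columns in rows.items():
            pass  # process the (timestamp, value) pairs here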


Solution

Possible ways to overcome the issue:

  1. Increased the heap space allocated to the young generation, as recommended in this blog post: http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads (see the config sketch after this list).

  2. Made small schema updates and dropped unnecessary secondary indexes, which reduced compaction overhead.

  3. Reduced the write timeout to 2 s in cassandra.yaml, as recommended in my previous post: Severe degradation in Cassandra Write performance with continuous streaming data over time (also shown in the sketch below).
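
For items 1 and 3, rough sketches of the relevant settings. The values are examples for this 8 GB RAM / 4 GB heap setup, assuming Cassandra 1.2+ key names; the exact new-generation size should follow the linked blog post rather than this illustration.

In conf/cassandra-env.sh:

    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="800M"

In conf/cassandra.yaml:

    write_request_timeout_in_ms: 2000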

The read client still needs to be updated to avoid using multiget at high workloads. The changes above have significantly improved performance.

Licensed under: CC-BY-SA with attribution