I am inserting streaming data into two separate keyspaces: into two standard column families in the first keyspace, and into three column families (two standard and one counter) in the second.
The insert rate into these column families is well controlled, and pure writes work just fine (60% CPU utilization and a load average of about 8-10). However, when I continuously read data from these column families via the Pycassa API while the writes happen in parallel, I notice a severe degradation in write performance.
What system settings are recommended for parallel writes and reads across two keyspaces? Currently the data directory sits on a single RAID10 volume on each node.
RAM: 8GB
HeapSize: 4GB
Quad core Intel Xeon Processor @3.00 GHz
Concurrent Writes = Concurrent Reads = 16 (in cassandra.yaml file)
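For reference, those two knobs live in cassandra.yaml (the fragment below mirrors my current settings; the commonly cited rule of thumb is 16 * number_of_drives for concurrent_reads and 8 * number_of_cores for concurrent_writes, so with a quad-core box and a single data volume these defaults may be worth revisiting):

```yaml
# cassandra.yaml (current settings on each node)
concurrent_reads: 16
concurrent_writes: 16
```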
Data Model
Keyspace1: I am inserting time-series data with the timestamp (T) as the column name, using wide rows that each store 24 hours' worth of data.
CF1:
Col1 | Col2 | Col3(DateType) | Col4(UUIDType) |
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V2) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
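To make the wide-row scheme concrete, the bucketing can be sketched like this (a minimal Python illustration, assuming the row key is the calendar day so that one row covers a 24-hour window; the function and key format here are my own illustration, not the actual schema):

```python
from datetime import datetime

def wide_row_location(ts: datetime):
    """Map a sample timestamp to (row key, column name).

    One row per 24-hour window: the row key is the calendar day,
    and the column name is the full timestamp T of the sample.
    """
    return ts.strftime("%Y%m%d"), ts.isoformat()

# A sample taken at 12:30 on 2013-05-01 lands in that day's row,
# under a column named by its timestamp.
row_key, col_name = wide_row_location(datetime(2013, 5, 1, 12, 30, 0))
```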
Keyspace2:
CF1:
Col1 | Col2 | Col3(DateType) | Col4(UUIDType) | ... Col10
RowKey1
RowKey2
:
:
CF2 (Wide column family):
RowKey1 (T1, V1) (T2, V2) (T4, V4) ......
RowKey2 (T1, V1) (T3, V3) .....
:
:
CF3 (Counter Column family):
Counts the occurrences of every event stored in CF2.
Data is continuously read from CF2 only (the wide column families) in both Keyspace 1 and Keyspace 2. To reiterate, the reads and writes happen in parallel. The number of row keys queried per multiget increases incrementally from 1 to 8, and then the cycle repeats.
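The read loop follows roughly the pattern below (a pure-Python sketch of the access pattern, with an in-memory dict standing in for CF2; with Pycassa the equivalent call is ColumnFamily.multiget(keys), which likewise takes a list of row keys and returns a dict of the rows found — the row/column names here are placeholders):

```python
# In-memory stand-in for the wide column family CF2:
# 8 rows, each holding (timestamp, value) columns.
cf2 = {f"row{i}": {f"T{j}": f"V{j}" for j in range(3)} for i in range(8)}

def multiget(cf, keys):
    # Return only the rows that exist, keyed by row key,
    # mirroring the shape of pycassa's ColumnFamily.multiget().
    return {k: cf[k] for k in keys if k in cf}

all_keys = sorted(cf2)
batches = []
for n in range(1, 9):  # batch size grows from 1 to 8 row keys
    batches.append(multiget(cf2, all_keys[:n]))
# after the batch of 8, the cycle starts over at 1
```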