Question

Here is the situation:

I am trying to fetch around 10k keys from a column family (CF).

Size of cluster: 10 nodes
Data per node: 250 GB
Heap allotted: 12 GB
Snitch used: PropertyFileSnitch, with 2 racks in the same data center
No. of SSTables for the CF per node: around 8 to 10

I am using the supercolumn approach. Each row contains around 300 supercolumns, each of which in turn contains 5-10 columns. I am firing a multiget with 10k row keys and 1 supercolumn.

When I fire the call the first time, it takes around 30 to 50 seconds to return the result. After that Cassandra serves the data from the key cache, and it returns the result in 2-4 seconds.

So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I can get results faster?

Does the supercolumn approach affect read performance?

Solution

Use of super columns is best suited for use cases where the number of sub-columns is relatively small, because Cassandra does not index sub-columns: reading any sub-column deserializes the entire supercolumn. Read more here: http://www.datastax.com/docs/0.8/ddl/column_family

OTHER TIPS

Just in case you haven't done this already: since you're using the phpcassa library, make sure that you've compiled the Thrift C extension. Per the "INSTALLING" text file in the phpcassa library folder:

Using the C Extension

The C extension is crucial for phpcassa's performance.

You need to configure and make to be able to use the C extension.

cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install

Add the following line to your php.ini file:

extension=thrift_protocol.so
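
After restarting your web server (or PHP-FPM), you can confirm the extension actually loaded. This check is our addition, not part of the INSTALLING file:

php -m | grep thrift_protocol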

After doing a lot of R&D on this stuff, we figured out there is no way to get this working optimally: when Cassandra fetches these 10k rows for the first time, it has to go to disk, and there is no way to optimize that cold read away. However, the following changes helped.

1) In practice, however, the probability of people accessing the same records is high, so we take maximum advantage of the key cache. The default setting for the key cache is 2 MB, so we could afford to increase it to 128 MB with no memory problems. After loading the data, run the expected queries to warm up the key cache, as shown below.
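
For example, in cassandra.yaml (the setting name assumes Cassandra 1.0 or later, where the key cache is sized globally):

key_cache_size_in_mb: 128

On Cassandra 0.8 the key cache is sized per column family instead, via cassandra-cli ("MyCF" and the key count are illustrative):

update column family MyCF with keys_cached=200000;

While replaying the warm-up queries, nodetool cfstats reports the key cache hit rate per column family, so you can see whether the cache is actually absorbing the reads.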

2) The JVM works optimally with a heap of 8-10 GB (we don't have numbers to prove this; it is just an observation).
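
On a package install the heap is pinned in conf/cassandra-env.sh; a minimal sketch with the sizes above (our observation, not a tested recommendation):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"    # young generation; the usual rule of thumb is ~100 MB per CPU core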

3) Most importantly, if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using and set it to NOOP. That is good for Cassandra because it reads all keys from one section, reducing disk head movement.
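
To check the active scheduler and switch it at runtime (assuming the data disk is sda; adjust the device name for your machine):

cat /sys/block/sda/queue/scheduler           # the scheduler in brackets is the active one
echo noop > /sys/block/sda/queue/scheduler   # run as root; takes effect immediately

The change does not survive a reboot; add elevator=noop to the kernel boot parameters to make it permanent.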

The above changes helped bring the time required for querying down to within acceptable limits.

Along with the above changes, if you have CFs which are small in size but frequently accessed, enable row caching for them, for example as follows.
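
On Cassandra 0.8/1.0 this is a per-CF setting via cassandra-cli ("MyCF" and the row count are illustrative; from 1.1 onwards the row cache is instead sized globally with row_cache_size_in_mb in cassandra.yaml):

update column family MyCF with rows_cached=10000;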

Hope the above info is useful.

Licensed under: CC-BY-SA with attribution