Question

I am a complete newbie with Cassandra.

Right now I have managed to get my code working for my problem scenario on a relatively small set of data.

However, when I try to do a multiget on 1 million row keys, it fails with the message "Retried 6 times. Last failure was timeout: timed out".

e.g. colfam.multiget([rowkey1, ..., rowkey_Million])

Basically, the column family I am trying to query has 1 million rows with 28 columns each.

I am running a 2-node Cassandra cluster on a single Ubuntu VirtualBox VM with the following system configuration:

RAM: 3 GB, Processor: 1 CPU

So how do I handle a multiget on so many row keys efficiently, and then bulk insert the results into another Cassandra column family?

Thanks in advance :) :)

Answer

I responded to this on the pycassa mailing list as well (please try not to post in multiple places), but I'll copy the answer for anybody else who sees this:

multiget is a very expensive operation for Cassandra: each row in the multiget can require a couple of disk seeks. pycassa automatically splits the query up into smaller chunks, but it's still a really expensive operation.
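If you do need multiget, the usual workaround is to split the key list into much smaller batches yourself, so that each request finishes well within the RPC timeout. A minimal sketch, assuming a pycassa ConnectionPool and placeholder keyspace/column family names (not from the post):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])  # placeholder names
    colfam = ColumnFamily(pool, 'MyColumnFamily')

    def chunked_multiget(cf, keys, chunk_size=100):
        """Fetch rows in small batches so no single request can time out."""
        for i in range(0, len(keys), chunk_size):
            # multiget returns an OrderedDict of {row_key: {column: value}}
            for key, columns in cf.multiget(keys[i:i + chunk_size]).items():
                yield key, columns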

If you're trying to read the whole column family, use get_range() instead.
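Since the question also asks about bulk inserting the same rows into another column family, get_range() combines naturally with pycassa's batch Mutator, which queues mutations and sends them to the server in groups. A sketch under the same placeholder names:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    source = ColumnFamily(pool, 'SourceCF')  # placeholder column families
    dest = ColumnFamily(pool, 'DestCF')

    # get_range() is a generator that pages through the rows internally,
    # so the full million rows are never held in memory at once.
    batch = dest.batch(queue_size=100)  # auto-sends every 100 queued mutations
    for row_key, columns in source.get_range():
        batch.insert(row_key, columns)
    batch.send()  # flush whatever is still queued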

If you're just trying to read a subset of the rows in that column family (based on some attribute) and you need to do this frequently, you need to use a different data model.

Since you're new to this, I would spend some time learning about data modeling in Cassandra: http://wiki.apache.org/cassandra/DataModel. (Note: most of these examples will use CQL3, which pycassa does not support. If you want to work with CQL3 instead, use the new DataStax python driver: https://github.com/datastax/python-driver)
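For completeness, a first connection with the DataStax driver looks roughly like this (the keyspace and table names are made up for illustration):

    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])  # contact points for your cluster
    session = cluster.connect('mykeyspace')  # placeholder keyspace
    for row in session.execute('SELECT * FROM mytable LIMIT 10'):
        print(row)
    cluster.shutdown()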

Licensed under: CC-BY-SA with attribution