Question

Is it possible to retrieve random rows from Cassandra (using Python/Pycassa)?

Update: By random rows I mean randomly selected rows!

Solution

You might be able to do this by making a get_range request with a random start key (any random string) and a row_count of 1.

From memory, I think the finish key would need to be the same as the start key, so that the query 'wraps around' the keyspace. That would normally return all rows, but row_count limits the result.

I haven't tried it, but this should give you a single result without your having to know the exact row keys.
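A minimal sketch of the random-start-key idea, assuming `cf` behaves like a pycassa ColumnFamily whose get_range accepts `start` and `row_count` and yields (key, columns) pairs. The helper names `random_key` and `get_random_row` are made up for illustration. Because the wrap-around behaviour described above is unverified, this version leaves the finish key at its default and simply retries with a new start key when nothing comes back. Note that rows are probably not picked with equal probability: a row that follows a large gap in the key ordering is more likely to be hit.

```python
import random
import string


def random_key(length=16, alphabet=string.ascii_lowercase + string.digits):
    """Build a random string to use as the start key (hypothetical helper)."""
    return "".join(random.choice(alphabet) for _ in range(length))


def get_random_row(cf, attempts=5):
    """Return (key, columns) for the first row found from a random start key.

    `cf` is assumed to expose get_range(start=..., row_count=...) the way a
    pycassa ColumnFamily does. Returns None if every attempt comes back empty.
    """
    for _ in range(attempts):
        start = random_key()
        # row_count=1 stops the range scan after the first row.
        for key, columns in cf.get_range(start=start, row_count=1):
            return key, columns
    return None
```

With a real connection this would be called as get_random_row(cf), where cf is the ColumnFamily object; adapt random_key to the type and shape of your own row keys.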

OTHER TIPS

I'm not sure what you mean by random rows. If you mean random access to rows, then you can do that very easily:

import pycassa.pool
import pycassa.columnfamily

# Connect to the keyspace and open the column family.
pool = pycassa.pool.ConnectionPool('keyspace', ['localhost:9160'])
cf = pycassa.columnfamily.ColumnFamily(pool, 'cfname')
# Fetch a single row by its key.
row = cf.get('row_key')

That will give you any row whose key you know. If you mean that you want a randomly selected row, I don't think you can do that very easily without knowing what the keys are. You could maintain an index row, select a random column from it, and use that to fetch a row from another column family. Basically, you'd create a new row in which each column value is a row key from the column family you want to select from. Then you could pick a column at random from that row, and you have the key to a random row.
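The index-row approach could be sketched roughly as follows. This assumes both arguments behave like pycassa ColumnFamily objects whose get(key, column_count=...) returns a dict of column names to values, and that you keep the index row up to date yourself whenever rows are added or removed. The function name, the 'all_keys' index key, and the max_keys limit are hypothetical and only serve the illustration.

```python
import random


def random_row_via_index(data_cf, index_cf, index_key="all_keys", max_keys=10000):
    """Pick a random row key from an index row, then fetch that row.

    Assumes each column value in the index row holds one row key of data_cf.
    """
    # column_count is raised because a get() normally returns only a
    # limited number of columns per call.
    index_columns = index_cf.get(index_key, column_count=max_keys)
    row_key = random.choice(list(index_columns.values()))
    return row_key, data_cf.get(row_key)
```

Unlike the random-start-key trick, every indexed key is equally likely to be chosen here, at the cost of one extra read and of maintaining the index row. An index with more than max_keys columns would need to be read in pages.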

I don't think pycassa offers any support to grab a random, non-indexed row.

This works for my case:

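import random
# A random number serves as the start key; get_range returns a generator of (key, columns) pairs.
ini = random.randint(0, 999999999)
rows = col_fam.get_range(str(ini), row_count=1, column_count=0, filter_empty=False)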

You'll have to adapt this to your row key type (a string in my case).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow