Question

How do the following statements help in improving program efficiency when handling a large number of rows, say 500 million?

Random Partitioner:

get_range()

Ordered Partitioner:

get_range(start='rowkey1',finish='rowkey10000')

Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?

Thanks

No correct solution

OTHER TIPS

Also, how many rows can be handled at a time when using get_range with the ordered partitioner on a column family that has more than a million rows?

pycassa's get_range() method will work just fine with any number of rows because it automatically breaks the query up into smaller chunks. However, your application needs to use the method the right way. For example, if you do something like:

rows = list(cf.get_range())

your Python program will probably run out of memory. The correct way to use it is:

for key, columns in cf.get_range():
    process_data(key, columns)

This method only pulls in 1024 rows at a time by default. If needed, you can lower that with the buffer_size parameter to get_range().
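For example, here is a minimal sketch with a smaller buffer. The keyspace and column family names (MyKeyspace, Users) and the server address are placeholders, and process_data() stands in for your own handler:

import pycassa

# Placeholder keyspace/column family names; substitute your own.
pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'Users')

# Stream the rows, fetching 256 at a time from Cassandra instead of the
# default 1024, which keeps per-request memory and latency lower.
for key, columns in cf.get_range(buffer_size=256):
    process_data(key, columns)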

EDIT: Tyler Hobbs points out in his comment that this answer does not apply to the pycassa driver, which apparently already takes care of everything I mention below.

==========

If your question is whether you can select all 500M rows at once with get_range(), then the answer is "no" because Cassandra will run out of memory trying to answer your request.

If your question is whether you can query Cassandra for all rows in batches of N rows at a time when the random partitioner is in use, then the answer is "yes". The difference from the order-preserving partitioner is that you do not know what the first key of your next batch will be, so you have to use the last key of the current batch as the starting key and skip that row when iterating over the new batch. For the first batch, simply use the "empty" key as both range limits. Also, there is no way to tell how far you have progressed in relative terms by looking at a returned key, because the order is not preserved.
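A minimal sketch of that batching loop follows. The fetch_batch() helper is hypothetical and stands in for whatever raw range query your client exposes (e.g. a Thrift get_range_slices call); as the edit above notes, pycassa users do not need any of this because get_range() already pages internally.

def fetch_batch(start_key, count):
    # Hypothetical helper: issue a raw range query (e.g. Thrift
    # get_range_slices) returning up to count (key, columns) pairs
    # starting at start_key, in token order. Replace with your client's call.
    raise NotImplementedError

def iterate_all_rows(batch_size=1000):
    start_key = ''        # the "empty" key marks the start of the range
    skip_first = False    # nothing to skip in the very first batch
    while True:
        batch = fetch_batch(start_key, batch_size)
        if skip_first and batch:
            batch = batch[1:]        # drop the row we already yielded
        if not batch:
            break                    # no new rows left
        for key, columns in batch:
            yield key, columns
        start_key = key              # last key of this batch starts the next one
        skip_first = True

Used as "for key, columns in iterate_all_rows(): ...", this yields every row exactly once, but it gives no indication of overall progress because keys come back in token order rather than key order.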

As for the number of rows: start small. Try 10, then 100, then 1000. Depending on the number of columns you are fetching, index sizes, available heap, etc., you will see noticeable performance degradation for a single query beyond a certain threshold.
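One rough way to find that threshold, assuming you have implemented the hypothetical fetch_batch() helper from the sketch above, is to time a single query at each size and watch where latency stops growing roughly linearly:

import time

# Probe increasing batch sizes; a disproportionate jump in elapsed time
# suggests you have crossed the comfortable single-query threshold.
for n in (10, 100, 1000, 10000):
    t0 = time.time()
    rows = fetch_batch('', n)
    print('%6d rows in %.3f s' % (len(rows), time.time() - t0))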
