Question

I have a two-node Cassandra cluster. To test Cassandra, I built a File table (Fid Integer, Sid Integer), in which Fid is the key. I built an index on Sid. The insert rate is about 10,000 rows per second. But when I select from the table, performance is terrible, and even for a low limit such as 1000 the query raises an error. Below is my sample code:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myk')
rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
for user_row in rows:
    print(user_row)

The error message is:

Traceback (most recent call last):
  File "Test.py", line 5, in <module>
    rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')
  File "build\bdist.win32\egg\cassandra\cluster.py", line 1065, in execute
  File "build\bdist.win32\egg\cassandra\cluster.py", line 2427, in result
cassandra.OperationTimedOut: errors={}, last_host=172.16.47.130

By changing

rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')

to

rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000',timeout=20.0)

the error goes away, but why is performance so slow for fetching 1000 rows from a table of 800,000 records? Any hints?


Solution

"I built index on Sid"

The key to the poor performance here is your use of a secondary index in place of what should be either a clustering key or part of a composite key. Secondary indexes in Cassandra are meant to assist full table scans (an expensive operation) for batch analytics, or for early development testing. They are not analogous to relational indexes: each node indexes only the data it holds, so a query on the indexed column alone has to consult nodes across the cluster instead of going straight to a single partition.

So if you want to execute queries like

rows = session.execute('SELECT * FROM File WHERE sid = 1 limit 1000')

then you need a table whose primary key starts with sid, that is, with sid as the partition key (and fid as a clustering column, so that rows sharing a sid remain distinct). If you would also like to query by fid, then you need two complementary tables, one keyed on fid and one on sid. At insert time you write the information to both tables.
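
The two-table layout described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the table names file_by_fid and file_by_sid and the helper functions are hypothetical, not part of the original schema, and `session` is assumed to be a session from the Python driver, as in the question.

```python
# Sketch only: the table names file_by_fid / file_by_sid and the helper
# functions below are hypothetical and not taken from the original schema.

# Lookup table keyed on fid (one row per file).
CREATE_BY_FID = """
CREATE TABLE IF NOT EXISTS file_by_fid (
    fid int PRIMARY KEY,
    sid int
)
"""

# Query table: sid is the partition key and fid a clustering column,
# so all rows for one sid live in a single partition and stay distinct.
CREATE_BY_SID = """
CREATE TABLE IF NOT EXISTS file_by_sid (
    sid int,
    fid int,
    PRIMARY KEY (sid, fid)
)
"""

INSERT_BY_FID = "INSERT INTO file_by_fid (fid, sid) VALUES (%s, %s)"
INSERT_BY_SID = "INSERT INTO file_by_sid (sid, fid) VALUES (%s, %s)"
SELECT_BY_SID = "SELECT sid, fid FROM file_by_sid WHERE sid = %s LIMIT %s"


def create_tables(session):
    """Create both tables if they do not exist yet."""
    session.execute(CREATE_BY_FID)
    session.execute(CREATE_BY_SID)


def insert_file(session, fid, sid):
    """Write the same record to both tables at insert time."""
    session.execute(INSERT_BY_FID, (fid, sid))
    session.execute(INSERT_BY_SID, (sid, fid))


def files_for_sid(session, sid, limit=1000):
    """Read rows for one sid from the sid-keyed table, without a secondary index."""
    return list(session.execute(SELECT_BY_SID, (sid, limit)))
```

With a session obtained as in the question (cluster.connect('myk')), one would call create_tables(session) once, insert_file(session, fid, sid) for each record, and files_for_sid(session, 1) in place of the indexed query. Note that the two inserts are separate statements, so this sketch does not make them atomic.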

Licensed under: CC-BY-SA with attribution