Question

I am testing a deployment of Cassandra 2.0 across 4 DCs with NetworkTopologyStrategy and PropertyFileSnitch. The replication factor is 1 for each DC, meaning that each DC holds a complete copy of the database. My reads use consistency level ONE, meaning (as I understand it) that the client can get the data locally, if it is available, without any cross-DC quorum.
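For reference, the keyspace is defined roughly like this (the keyspace and DC names below are placeholders; the real DC names must match those in cassandra-topology.properties):

-- one full copy of the data in each of the four DCs
CREATE KEYSPACE my_keyspace
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 1, 'DC2': 1, 'DC3': 1, 'DC4': 1
  };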

Unfortunately, my test results suggest otherwise. If I artificially increase the latency to one of the DCs (using Mininet), I can see that reads in the other DCs slow down significantly (proportionally to the delay, and by far more than dynamic_snitch_badness_threshold should allow).

During this test I am not writing any data; I am only performing reads. Note that if I completely disconnect one of the nodes, performance returns to 100%.

Therefore I have two questions:

1. Why does one DC slow down the performance of the entire system when I am performing consistency ONE reads?
2. Why does the dynamic snitch not reroute communication away from the badly performing node (default settings, tested for 20+ minutes)?

Regards.

EDIT: This is my set of actions so far. When creating the table I am adding: WITH read_repair_chance = 0 AND speculative_retry = 'NONE';
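Concretely, something like this (table name and columns are placeholders; an existing table can be changed the same way with ALTER TABLE):

CREATE TABLE my_keyspace.my_table (
  id int PRIMARY KEY,
  payload text
) WITH read_repair_chance = 0     -- never trigger proactive read repair
  AND speculative_retry = 'NONE'; -- never send speculative extra reads

-- or, for an existing table:
ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0 AND speculative_retry = 'NONE';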

Problem: when working in the cqlsh console, I can read the current consistency level and set it to LOCAL_ONE as per the documentation. But the new setting is not persistent: when I exit cqlsh and enter it again, I see the default consistency ONE again. It seems the setting is per session?
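For the record, here is what I am doing in cqlsh; the CONSISTENCY command really does apply only to the current session, so it has to be re-issued every time cqlsh starts:

-- show the session's current level (defaults to ONE)
CONSISTENCY;
-- set it for this session only
CONSISTENCY LOCAL_ONE;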

I ran nodetool netstats on the slow node and I see that there are no read repair attempts, but there are some responses??

Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0<<-----------
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         0              0
Responses                       n/a         0           3807<<------------

Solution

1) Depending on your read load, even if you use LOCAL_ONE or LOCAL_QUORUM, you may cause some load on the nodes in the other datacenters from read repair. Try watching the output of nodetool tpstats and see if the nodes are doing lots of read repair. If so, try turning off read_repair_chance for your CFs by setting it to zero.
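For example, on each node (ReadRepairStage is the thread pool that handles background read repair; non-zero Completed counts there mean read repair is happening):

nodetool tpstats | grep -E 'Pool Name|ReadRepairStage'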

To observe the above behavior, enable DEBUG logging and look for lines like this:

ReadCallback.java (line 79) Blockfor is ....

It should tell you whether the request is blocking on responses from nodes in other DCs, possibly due to read repair.
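In Cassandra 2.0 logging goes through log4j, so one way to get those lines (assuming the stock conf/log4j-server.properties) is to raise the log level for just the ReadCallback class:

# conf/log4j-server.properties
# emit the "Blockfor is ..." DEBUG lines from ReadCallback
log4j.logger.org.apache.cassandra.service.ReadCallback=DEBUG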

2) The dynamic snitch has a reset interval: regardless of past history, it periodically resets the latency scores it has captured for each node. You might observe queries being routed to the slow node again right after a snitch reset.
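The relevant knobs live in cassandra.yaml (the values shown are the defaults):

# how often latency scores are recalculated
dynamic_snitch_update_interval_in_ms: 100
# every 10 minutes all scores are thrown away and re-learned
dynamic_snitch_reset_interval_in_ms: 600000
# how much worse a node must score before traffic is routed away from it
dynamic_snitch_badness_threshold: 0.1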

OTHER TIPS

The answer to 1) is that ONE requires a single response but does not require the coordinator to route the request to the local DC. For that, use LOCAL_ONE, which guarantees your reads will not go cross-DC.

LOCAL_ONE (available in Cassandra 1.2.11 and 2.0.2 and later): a write must be sent to, and successfully acknowledged by, at least one replica node in the local datacenter. In multiple data center clusters, a consistency level of ONE is often desirable, but cross-DC traffic is not; LOCAL_ONE accomplishes this. For security and quality reasons, you can use this consistency level in an offline datacenter to prevent automatic connection to online nodes in other data centers if an offline node goes down.

http://www.datastax.com/documentation/cassandra/2.0/webhelp/cassandra/dml/dml_config_consistency_c.html

Try tracing some requests to get more information about how C* is executing your queries. http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2
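A quick way to do that from cqlsh (the table name is a placeholder; each traced query prints which nodes and DCs the coordinator contacted and how long each step took):

-- enable tracing for this cqlsh session
TRACING ON;
SELECT * FROM my_keyspace.my_table LIMIT 10;
TRACING OFF;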
