Question

I have a Cassandra question. Do you know how Cassandra does updates/increments of counters?

I want to use a storm bolt (CassandraCounterBatchingBolt from storm-contrib repo on github) which writes into cassandra. However, I'm not sure how some of the implementation of the incrementCounterColumn() method works .. and there is also the limitations with cassandra counters (from: http://wiki.apache.org/cassandra/Counters) which makes them useless for my scenario IMHO:

  • If a write fails unexpectedly (timeout or loss of connection to the coordinator node) the client will not know if the operation has been performed. A retry can result in an over count CASSANDRA-2495.

  • Counter removal is intrinsically limited. For instance, if you issue very quickly the sequence "increment, remove, increment" it is possible for the removal to be lost

Anyway, here is my scenario:
I update the same counter faster than the updates propagate to other Cassandra nodes.

Example:
Say I have 3 cassandra nodes. The counters on each of these nodes are 0.
Node1:0, node2:0, node3:0

An increment comes: 5 -> Node1:0, node2:0, node3:0

Increment starts at node 2 – still needs to propagate to node1 and node3
Node1:0, node2:5, node3:0

In the meantime, another increment arrives before previous increment
is propagated: 3 -> Node1:0, node2:5, node3:0

Assuming 3 starts at a different node than where 5 started we have:
Node1:3, node2:5, node3:0

Now if 3 gets propagated to the other nodes AS AN INCREMENT and not as a new value (and the same for 5) then eventually the nodes would all equal 8 and this is what I want.

If 3 overwrites 5 (because it has a later timestamp) this is problematic – not what I want.

Do you know how these updates/increments are handled by Cassandra?

Note, that a read before a write is still susceptible to the same problem depending from which replica node the read executes (Quorum can still fail if propagation is not far along)

I'm also thinking that maybe putting a cache b/w my storm bolt and Cassandra might solve this issue but that's a story for another time.

Was it helpful?

Solution

Counters in C* have a complex internal representation that avoids most (but not all) problems of counting things in a leaderless distributed system. I like to think of them as sharded counters. A counter consists of a number of sub-counters identified by host ID and a version number. The host that receives the counter operation increments only its own sub-counter, and also increments the version. It then replicates its whole counter state to the other replicas, which merge it with their states. When the counter is read the node handling the read operation determines the counter value by summing up the total of the counts from each host.

On each node a counter increment is just like everything else in Cassandra, just a write. The increment is written to the memtable, and the local value is determined at read time by merging all of the increments from the memtable and all SSTables.

I hope that explanation helps you believe me when I say that you don't have to worry about incrementing counters faster than Cassandra can handle. Since each node keeps its own counter, and never replicates increment operations, there is no possibility of counts getting lost by race conditions like a read-modify-write scenario would introduce. If Cassandra accepts the write, your're pretty much guaranteed that it will count.

What you're not guaranteed, though, is that the count will appear correct at all times unless. If an increment is written to one node but the counter value read from another just after, there is not guarantee that the increment has been replicated, and you also have to consider what would happen during a network partition. This more or less the same with any write in Cassandra, it's in its eventually consistent nature, and it depends on which consistency levels you used for the operations.

There is also the possibility of a lost acknowledgement. If you do an increment and loose the connection to Cassandra before you can get the response back you can't know whether or not your write got though. And when you get the connection back you can't tell either, since you don't know what the count was before you incremented. This is an inherent problem with systems that choose availability over consistency, and the price you pay for many of the other benefits.

Finally, the issue of rapid remove, increment, removes are real, and something you should avoid. The problem is that the increment operation will essentially resurrect the column, and if these operations come close enough to each other they might get the same timestamp. Cassandra is strictly last-write-wins and determines last based on the timestamp of the operation. If two operations have the same time stamp, the "greater" one wins, which means the one which sorts after in a strict byte order. It's real, but I wouldn't worry too much about it unless you're doing very rapid writes and deletes to the same value (which is probably a fault in your data model).

Here's a good guide to the internals of Cassandra's counters: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf

OTHER TIPS

The current version of counters are just not a good fit for a use case that requires guarantees of no over-counting and immediate consistency.

There are increment and decrement operations, and those will not collide with each other, and, barring any lost mutations or replayed mutations, will give you a correct result.

The rewrite of Cassandra counters (https://issues.apache.org/jira/browse/CASSANDRA-6504) might be interesting to you, and it should address all of the current concerns with getting a correct count.

In the meantime, if I had to implement this on top of a current version of Cassandra, and an accurate count was essential, I would probably store each increment or decrement as a column, and do read-time aggregation of the results, while writing back a checkpoint so you don't have to read back to the beginning of time to calculate subsequent results.

That adds a lot of burden to the read side, though it is extremely efficient on the write path, so it may or may not work for your use case.

To understand updates/increments i.e write operations, i will suggest you to go through Gossip, protocol used by Cassandra for communication. In Gossip every participant(node) maintains their state using the tuple σ(K) = (V*N) where σ(K) is the state of K key with V value and N as version number.

To maintain the single version of truth for a data packet Gossip maintains a Reconciliation mechanism namely Precise & Scuttlebutt(current). According to Scuttlebutt Reconciliation, before updating any tuple they communicate with each other to check who is holding the highest version (newest value) of the key. Whosoever is holding the highest version is responsible for the write operation.

For further information read this article.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top