Question

I've been reading into CvRDTs and I'm aware that Riak has already added a few to Riak 2.

My question is: why would Riak implement a gcounter when it sounds like the underlying vclock that is associated with every object records the same information? Wouldn't the result be a gcounter stored with a vclock, each containing the same essential information?

My only guess right now would be that Riak may garbage-collect the vclocks, trimming information that would actually be important for the purpose of a gcounter (i.e. the number of increments).

I cannot read Erlang particularly well, so maybe I've wrongly assumed that Riak stores vclocks with these special-case data types. However, the question still applies to the homegrown solutions that are written on top of standard Riak (and hence inherit vclocks with each object persisted).

EDIT:

I have since written the following article to help explain CvRDTs in a more practical manner. This article also touches on the redundancy I have highlighted above:

Conflict-free Replicated Data Types (CRDT) - A digestible explanation with less math.

Était-ce utile?

La solution

  1. Riak prunes version vectors, no big deal for causality (false concurrency, more siblings, safe) but a disaster for counters.

  2. Riak's CRDT support is general. We "hide" CRDTs inside the regular riak object.

  3. Riak's CRDT support is in it's first wave, we'll be optimising further as we make further releases.

We have a great mailing list for questions like this, btw. Stack Overflow has it's uses but if you want to talk to the authors of an open source DB why not use their list? Since Riak is open source, you can submit a pull request, we'd love to incorporate your ideas into the code base.

Autres conseils

Quick answer: Riak's counters are actually PN-Counters, ie they allow both increments and decrements, so can't be implemented like a vclock, as they require tracking the increments and decrements differently.

Long Answer:

This question suggests you have completely misunderstood the difference between a g-counter and a vector clock (or version vector).

A vector clock (vclock) is a system for tracking the causality of concurrent updates to a piece of data. They are a map of {actor => logical clock}. Actors only increment their logical clocks when the data they're associated with changes, and try to increment it as little as possible (so at most once per update). Two vclocks can either be concurrent, or one can dominate the other.

A g-counter is a CvRDT with what looks like the same structure as a vclock, but with important differences. They are implemented as a map of {actor => counter}. Actors can increment their own counter as much as they want. A g-counter has the concept of a "counter value", and the concept of a "merge", so that when concurrent operations are executed by different actors, they can work out what the actual "counter value" should be.

Importantly, g-counters can't track causality, and vclocks have no idea what their "counter value" is.

To conflate the two in a codebase would not only be confusing, but could also bring in errors.

Add this to the fact that riak actually implements pn-counters. The difference is that a g-counter can only be incremented, but pn-counters can be both incremented and decremented. Pn-counters work by being a map of {actor => (increment count, decrement count)}, which more obviously has a different structure to a vclock. You can only increment both those counts, hence why there are two and not just one.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top