Question

When is it a good idea to use something like CRDT instead of paxos or raft?


Solution

If you can use something like a CRDT, do so. You should get much better performance. It also enables interesting use cases such as working offline and then merging later. However, it is not always possible to design things such that a CRDT will work for you. In that case, Paxos can solve the hard problems for you.

But even if you've decided to use Paxos, you should generally limit how much work is done directly through the Paxos algorithm. Instead, for performance reasons, you want to reserve Paxos for necessary operations such as master election, and then let a replicated master setup handle most decisions. (In a high-throughput environment the master is likely to do something like delegate responsibility for specific shards to specific children, which replicate off each other. Do not let the master become a bottleneck...)

That said, it is much easier to claim that you'll wave the magic wand of Paxos than it is to actually do it in practice. In that light you may find http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf to be an interesting description of the difficulties that a real-world Paxos implementation is likely to encounter.

Other suggestions

I think this guy knows what he is talking about:

Blog

Video

Conclusion about distributed systems

CRDTs and Paxos have different goals and are used in different scenarios. What they have in common is that they help programmers deal with concurrency/replication. CRDTs are data types that assume concurrent updates will occur. Paxos is a protocol that enforces that they won't, by enforcing a total order on them. Let's see this in more detail.

Let's say we have a set that is replicated at two different places.

Using Paxos guarantees that writes to the set will be executed by every replica in the same order. More generally, it guarantees that all replicas AGREE on how the state of the set evolves.

If, for example, user1 performs update1 at replica1, adding element1 to the replicated set, while user2 simultaneously performs update2, adding element2 at replica2, Paxos will make the replicas agree on a given order for those updates, or possibly agree on choosing one of the two updates and discarding the other, depending on how you use it and what you want to achieve. If the Paxos outcome is, say, that update1 comes before update2, every replica will update the set in that order. As a consequence, users reading the set concurrently with those updates can observe, regardless of where (at which replica) they read, ONLY the following states of the set (assuming the set was empty at the beginning):

{} (empty set)

{element1}

{element1, element2}

Furthermore, these states can be seen ONLY in that order, meaning that once the state of the set is {element1, element2} (at every replica), no subsequent read will return {} or {element1}.
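To make the total-order guarantee concrete, here is a minimal Python sketch. It is not a Paxos implementation; it simply assumes agreement has already produced a single log, and shows that replicas applying the same agreed log pass through exactly the same sequence of states:

```python
# Illustrative sketch (not real Paxos): once replicas agree on a single
# total order of updates, every replica applies the same log and passes
# through the same sequence of states.

def apply_log(agreed_log):
    """Apply an agreed-upon, totally ordered log of 'add' operations
    and record every intermediate state a reader could observe."""
    state = set()
    states_observed = [frozenset(state)]
    for element in agreed_log:
        state.add(element)
        states_observed.append(frozenset(state))
    return states_observed

# Both replicas apply the log in the agreed order: update1 before update2.
agreed_log = ["element1", "element2"]
replica1_states = apply_log(agreed_log)
replica2_states = apply_log(agreed_log)

# Every replica sees exactly {}, {element1}, {element1, element2}, in order.
assert replica1_states == replica2_states
```

The point of the sketch is that the hard part (agreeing on `agreed_log`) is exactly what Paxos provides; applying it is trivial.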

Positive side: This set is simple to reason about, as it is equivalent to a set that is not replicated.

Negative side: Unavailability: if replicas can't talk to each other (network partition), your set can't be updated, as there can be no agreement. Low performance, high latency: agreement requires that replicas synchronize before replying to the client, which incurs latency proportional to the latency between replicas.

CRDTs have weaker guarantees. A CRDT set is not equivalent to a sequential, single-copy one. It assumes that there is no agreement or total order on how replicas are updated.

CRDTs guarantee that if both replicas of the set have seen the same updates (regardless of the order in which they see them), then they will exhibit the same state; replicas will converge.

In our example of two users performing updates concurrently, a system that does not run Paxos to order operations on the set (this happens, e.g., under eventual or causal consistency) will allow replica1 to add element1 while replica2 is adding element2:

so, the state at replica1 will be: {element1}

and the state at replica2 will be: {element2}

At this point in time, replicas diverge. Later, when replicas synchronise, they will exchange their updates, finally exhibiting this state:

state at replica1 will be: {element1, element2}

state at replica2 will be: {element2, element1}

At this point in time, replicas have converged.
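The diverge-then-converge behaviour above can be sketched with the simplest CRDT, a grow-only set (G-Set), whose merge operation is set union. The class and method names here are illustrative, not from any particular library:

```python
# Minimal grow-only set (G-Set) CRDT sketch. Merge is set union, which is
# commutative, associative, and idempotent, so replicas converge once they
# have exchanged their updates, regardless of delivery order.

class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, element):
        # Local update: applied immediately, no coordination needed.
        self.elements.add(element)

    def merge(self, other):
        # Called when replicas synchronize; union never loses an add.
        self.elements |= other.elements

replica1, replica2 = GSet(), GSet()
replica1.add("element1")   # user1's update at replica1
replica2.add("element2")   # user2's update at replica2, concurrently

# Replicas have diverged: {element1} vs {element2}. Now they synchronize:
replica1.merge(replica2)
replica2.merge(replica1)
assert replica1.elements == replica2.elements == {"element1", "element2"}
```

Note that a G-Set supports only adds; that restriction is exactly what makes union a sufficient merge.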

Users reading the set concurrently with those updates can observe, depending on where (at which replica) they read, the following states of the set (assuming the set was empty at the beginning):

{} (empty set)

{element1} (if they read from replica1)

{element2} (if they read from replica2)

{element1, element2}

{element2, element1}

Negative side: This set is hard to reason about, as it shows states that could not occur in a sequential set. In our example, we have observed only the case of two concurrent adds to a set, which is straightforward. Concurrent adds and removes are more complex. There are many datatypes with different issues:

A comprehensive study of Convergent and Commutative Replicated Data Types

Positive side: High availability: if replicas can't talk to each other (network partition), your set CAN be updated; replicas will sync when they reconnect. High performance, low latency: replicas immediately reply to clients and synchronize in the background, after replying to the client.
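To illustrate why concurrent adds and removes are trickier than concurrent adds alone, here is a sketch of one common design, the observed-remove set (OR-Set): each add carries a unique tag, and a remove cancels only the tags it has observed, so a concurrent add survives. The class and method names are my own, not from a specific library:

```python
import uuid

# Sketch of an observed-remove set (OR-Set). Each add is paired with a
# unique tag; remove deletes only the tags observed locally. An add that
# is concurrent with a remove therefore survives ("add wins").

class ORSet:
    def __init__(self):
        self.adds = set()      # set of (element, tag) pairs
        self.removes = set()   # set of removed tags (tombstones)

    def add(self, element):
        self.adds.add((element, uuid.uuid4()))

    def remove(self, element):
        # Cancel only the add-tags this replica has actually observed.
        self.removes |= {tag for (e, tag) in self.adds if e == element}

    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

    def value(self):
        return {e for (e, tag) in self.adds if tag not in self.removes}

r1, r2 = ORSet(), ORSet()
r1.add("x")
r1.merge(r2); r2.merge(r1)   # both replicas now contain "x"
r2.remove("x")               # remove at replica2, concurrent with...
r1.add("x")                  # ...a fresh add at replica1
r1.merge(r2); r2.merge(r1)
assert r1.value() == r2.value() == {"x"}   # the concurrent add wins
```

Other set CRDTs make the opposite choice (remove wins, or last-writer wins), which is exactly the kind of per-datatype design issue the survey linked above catalogues.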

There is a flaw with the CRDT Treedoc example. Each node requires a disambiguator for the case when two systems insert at the same time with the same key.

After this happens it is no longer possible for systems to insert between entries that have identical keys but different disambiguators, as that would require inserting another identical key while controlling the disambiguator ordering. The disambiguators are not dense, so this is not always possible. If the disambiguators were yet another tree, you would solve one problem but then need another conflict resolution mechanism a depth further down, etc.

This unmentioned problem, plus the fact that you need a two-phase commit to tidy up the metadata, makes me think CRDTs are still a work in progress.

There are multiple metrics to consider:

  • throughput (CRDT and Paxos are the same, because in the end all requests are replicated on all replicas either way);
  • latency (CRDT is better than Paxos, because it writes to a smaller number of replicas);
  • reliability (CRDT is weaker than Paxos, because it writes to fewer than a majority of replicas, which may result in state loss);
  • consistency (CRDT is weaker than Paxos, because it allows concurrent writes without a synchronization point (basically no overlapping replicas), while Paxos writes always require an overlapping replica to do the serialization).

My suggestion is to use Paxos when the replicas are not far from each other (e.g., within a data center), and to use CRDT when network partitioning is normal (e.g., disconnected mobile devices).

Comment on a response by @btilly:

The question in the topic is related to different models of consistency and thus design patterns:

CRDT and Paxos are very different things, and the choice between them must be based on your architecture requirements. I believe the best way of handling it is to review use cases where those algorithms have been successfully applied.

Use patterns:

CRDT may be used for data sync between clients (i.e. mobile devices) and servers, real-time collaborative editing, value sync in distributed-database implementations, and all other cases where eventual consistency is fine.

PAXOS is mostly used in proprietary systems supporting systems infrastructure (i.e. Chubby), or for implementing sync in distributed database systems like BigTable, Datastore, Spinnaker, and Spanner.

RAFT is more popular in OSS infrastructure projects like etcd, Consul, ...

(don't remember what CockroachDB and TiDB are based on)

There is also BFT, but it's used less.

p.s. PAXOS != ZooKeeper. ZooKeeper uses a different algorithm called ZAB, and is not a replicated state machine but a replicated snapshot machine with a single writer and a relaxed read model. Google's Chubby is based on Paxos, but it's proprietary.

p.p.s. It's interesting how Paxos has kept evolving. In the last 20 years, a lot of variants were invented to handle various optimizations of edge cases, quorum sizes, and cluster reconfiguration.

Whenever it is appropriate. However, Paxos is not that bad: its throughput is typically the same as with CRDT, not to mention that its reliability is much higher (CRDT may result in state loss), and its latency is not that bad either, as it only requires replies from a majority of the replicas instead of all of them.
