Question

I am going to implement a key-value store with Multi-Paxos. I would have several nodes, one of which is the primary node. This primary node receives update requests and replicates values to the slave nodes.

My question is: how is the primary node (or leader) selected? Can I still use the Paxos algorithm? If so, do you think it is necessary to abstract the Paxos implementation into a single unit that could be used not only by the replication unit but also by the leader-election unit?

What if I use the node with the lowest id as the leader? How would I implement the master lease?

Thanks for any answers.


Solution

Before I get to the actual question, I would suggest that for a Paxos-like system, you don't think of it as a master-slave relationship, but rather as an equal-peer relationship. Basic Paxos doesn't even have a leader concept. Multi-Paxos tacks on a leader as a performance optimization, and electing that leader is part of the protocol itself.

Multi-Paxos boils down to Paxos underneath: there is a prepare phase and an accept phase. The insight of Multi-Paxos is that once a node wins an accept round, it has simultaneously won leader election; after that, the prepare phase isn't necessary for that leader until it detects that another node has taken over leadership.
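
To make that concrete, here is a minimal sketch in Go of the proposer side of that loop. This is not a complete implementation; `majorityPromised` and `majorityAccepted` are hypothetical stand-ins for the real quorum message exchange with acceptors:

```go
// Sketch of a Multi-Paxos proposer: once it wins a round it stays
// leader and skips the prepare phase for later instances.
package multipaxos

// Proposal identifies a value proposed for one slot of the replicated log.
type Proposal struct {
	Instance int    // Multi-Paxos slot / log index
	Ballot   int    // proposal (ballot) number
	Value    []byte // the value being replicated
}

type Proposer struct {
	ballot   int
	isLeader bool // set after winning a round; cleared when preempted
}

// propose drives one value through a Paxos instance.
func (p *Proposer) propose(instance int, value []byte) bool {
	if !p.isLeader {
		// Phase 1 (prepare): winning this doubles as leader election.
		p.ballot++
		if !majorityPromised(p.ballot) {
			return false // preempted by a higher ballot
		}
		p.isLeader = true
	}
	// Phase 2 (accept): as leader, keep issuing accepts for new
	// instances without repeating the prepare phase.
	ok := majorityAccepted(Proposal{Instance: instance, Ballot: p.ballot, Value: value})
	if !ok {
		p.isLeader = false // a higher ballot took over leadership
	}
	return ok
}

// Hypothetical stand-ins for the quorum exchange with acceptors.
func majorityPromised(ballot int) bool  { panic("not implemented") }
func majorityAccepted(pr Proposal) bool { panic("not implemented") }
```

The point to notice is that once `isLeader` is set, later calls skip straight to the accept phase, which is exactly the optimization described above.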


And now some practical advice. I have many years of experience working on several Paxos, Multi-Paxos, and other consensus systems.

I first suggest not implementing either Paxos or Multi-Paxos yourself. Optimizing a Paxos system for performance while keeping it correct is very hard, especially if you are still working through questions like these. I would instead look into implementing the Raft protocol.

Taking both protocols as-is, straight from their papers, Raft can have much better throughput than Multi-Paxos. The Raft authors (and others) suggest that Raft is easier to understand and implement.

You may also look into using one of the open-source Raft implementations. I don't have enough experience with any of them to tell you how easy they are to maintain. I have heard, though, of pain in maintaining ZooKeeper instances, and I have also heard complaints about ZooKeeper's correctness proof. (Note that ZooKeeper is built on its own protocol, Zab, rather than Raft.)

Next, it has been proven that every consensus protocol can loop forever (this is the FLP impossibility result). Build a time-out mechanism into your system, and add randomized backoffs where appropriate. This is how practical engineers get around theoretical impossibilities.
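
For example, here is one common way to compute such a timeout, sketched in Go with exponential growth plus random jitter; the duration constants are illustrative, not tuned values:

```go
// A sketch of a retry timeout with exponential growth and random
// jitter, so competing proposers are unlikely to keep colliding.
package retry

import (
	"math/rand"
	"time"
)

const (
	baseTimeout = 150 * time.Millisecond
	maxTimeout  = 5 * time.Second
)

// nextTimeout returns the backoff to wait before retry number `attempt`.
func nextTimeout(attempt int) time.Duration {
	if attempt > 20 {
		attempt = 20 // avoid overflowing the shift below
	}
	d := baseTimeout << uint(attempt) // exponential growth
	if d > maxTimeout {
		d = maxTimeout
	}
	// Full jitter: pick uniformly in [baseTimeout, d].
	return baseTimeout + time.Duration(rand.Int63n(int64(d-baseTimeout)+1))
}
```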

Finally, examine your throughput needs. If your throughput is high enough, you will need to figure out how to partition across several consensus clusters. And that's a whole 'nother ball of wax.
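
As a rough illustration of what that partitioning might look like, here is a sketch that hashes each key to one of several independent consensus groups; the cluster addresses are made up:

```go
// A sketch of hash-partitioning keys across independent consensus
// groups. The cluster addresses are hypothetical.
package partition

import "hash/fnv"

// clusters holds one independent consensus group per partition.
var clusters = [][]string{
	{"a1:9000", "a2:9000", "a3:9000"},
	{"b1:9000", "b2:9000", "b3:9000"},
	{"c1:9000", "c2:9000", "c3:9000"},
}

// clusterFor maps a key to the group that owns it. Note that any
// operation touching keys in two partitions now needs coordination
// across clusters, which consensus alone doesn't give you.
func clusterFor(key string) []string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return clusters[h.Sum32()%uint32(len(clusters))]
}
```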

OTHER TIPS

You can solve this with a parallel Multi-Paxos instance that manages the configuration of your cluster. Consider a replicated JSON object, updated via Multi-Paxos, that contains the following information (a sketch follows the list):

  • Serial number
  • Current leader ID
  • Timestamp of current leader's lease expiration
  • List of peer IDs
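
One way to picture that object is as a struct with a JSON encoding. This is just an illustration; the field names are my own, not prescribed by anything above:

```go
// A sketch of the replicated configuration object as a Go struct with
// its JSON encoding. Field names are illustrative.
package config

import "time"

type ClusterConfig struct {
	Serial      int64     `json:"serial"`       // bumped on every configuration change
	LeaderID    string    `json:"leader_id"`    // ID of the current leader
	LeaseExpiry time.Time `json:"lease_expiry"` // when the current leader's lease runs out
	Peers       []string  `json:"peers"`        // IDs of all peers in the cluster
}

// Encoded, an instance might look like:
//   {"serial": 42, "leader_id": "node-1",
//    "lease_expiry": "2015-06-01T12:00:05Z",
//    "peers": ["node-1", "node-2", "node-3"]}
```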

You can use a stock Paxos implementation and put all of the necessary logic in your networking layer (a code sketch follows the list):

  • Drop all Prepare and Accept messages received from any node other than the leader until the lease expires.
  • Proactively bump the serial number and extend the leader's lease shortly before the lease expires.
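
Here is a sketch of what those two rules might look like in code, reusing the configuration fields from the earlier list; all type and function names are hypothetical, and `writeConfig` stands in for the replicated write through the configuration Multi-Paxos instance:

```go
// Sketch of the networking-layer rules: lease-based message filtering
// plus proactive lease renewal by the leader.
package netfilter

import "time"

// Msg is a hypothetical wire message; only the fields used here are shown.
type Msg struct {
	Type   string // "prepare", "accept", ...
	Sender string // node ID of the sender
}

// Config mirrors the replicated configuration object sketched earlier.
type Config struct {
	Serial      int64
	LeaderID    string
	LeaseExpiry time.Time
}

// shouldDeliver implements the first rule: drop Prepare/Accept traffic
// from any node other than the leader while the lease is still valid.
func shouldDeliver(m Msg, cfg Config, now time.Time) bool {
	if m.Type != "prepare" && m.Type != "accept" {
		return true // other message types pass through untouched
	}
	if now.Before(cfg.LeaseExpiry) && m.Sender != cfg.LeaderID {
		return false // a non-leader is proposing inside the lease window
	}
	return true
}

// renewLease implements the second rule: shortly before expiry, the
// leader bumps the serial and extends its own lease by replicating the
// updated configuration.
func renewLease(cfg *Config, now time.Time, leaseLen time.Duration) {
	cfg.Serial++
	cfg.LeaseExpiry = now.Add(leaseLen)
	writeConfig(*cfg)
}

// writeConfig is a placeholder for the configuration Multi-Paxos write.
func writeConfig(cfg Config) { panic("not implemented") }
```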