Question

I am going to implement a key-value store with Multi-Paxos. I would have several nodes, one of which is the primary node. This primary node receives update requests and replicates values to the slave nodes.

My question is: how is the primary node (or leader) selected? Can I still use the Paxos algorithm? If so, do you think it is necessary to abstract the Paxos implementation into a single unit that could be used not only by the replication unit but also by the leader-election unit?

What if I use the node with the lowest id as the leader? How would I implement the master lease?

Thanks for any answers.


Solution

Before I get to the actual question, I would suggest that for a Paxos-like system, you don't think of it as a master-slave relationship, but rather as an equal-peer relationship. Basic Paxos doesn't even have a leader concept. Multi-Paxos tacks on a leader as a performance optimization, and electing that leader is part of the protocol itself.

Multi-Paxos boils down to Paxos underneath: there is a prepare phase and an accept phase. The insight of Multi-Paxos is that once a node wins an accept round, it has simultaneously won leader election; after that, the prepare phase isn't necessary for that leader until it detects that another node has taken over leadership.
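
To make that concrete, here is a minimal sketch in Go of the proposer side of that loop. This is not a complete implementation; `majorityPromised` and `majorityAccepted` are hypothetical stand-ins for the real quorum message exchange with acceptors:

```go
// Sketch of a Multi-Paxos proposer: once it wins a round it stays
// leader and skips the prepare phase for later instances.
package multipaxos

// Proposal identifies a value proposed for one slot of the replicated log.
type Proposal struct {
	Instance int    // Multi-Paxos slot / log index
	Ballot   int    // proposal (ballot) number
	Value    []byte // the value being replicated
}

type Proposer struct {
	ballot   int
	isLeader bool // set after winning a round; cleared when preempted
}

// propose drives one value through a Paxos instance.
func (p *Proposer) propose(instance int, value []byte) bool {
	if !p.isLeader {
		// Phase 1 (prepare): winning this doubles as leader election.
		p.ballot++
		if !majorityPromised(p.ballot) {
			return false // preempted by a higher ballot
		}
		p.isLeader = true
	}
	// Phase 2 (accept): as leader, keep issuing accepts for new
	// instances without repeating the prepare phase.
	ok := majorityAccepted(Proposal{Instance: instance, Ballot: p.ballot, Value: value})
	if !ok {
		p.isLeader = false // a higher ballot took over leadership
	}
	return ok
}

// Hypothetical stand-ins for the quorum exchange with acceptors.
func majorityPromised(ballot int) bool  { panic("not implemented") }
func majorityAccepted(pr Proposal) bool { panic("not implemented") }
```

The point to notice is that once `isLeader` is set, later calls skip straight to the accept phase, which is exactly the optimization described above.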


And now some practical advice. I have many years of experience working on several Paxos, Multi-Paxos, and other consensus systems.

I first suggest not implementing either Paxos or Multi-Paxos yourself. Optimizing a Paxos system for performance while keeping it correct is very hard, especially if you are still working through questions like these. I would instead look into implementing the Raft protocol.

Taking both protocols as-is, straight from their papers, Raft can have much better throughput than Multi-Paxos. The Raft authors (and others) suggest that Raft is easier to understand and implement.

You may also look into using one of the open-source Raft implementations. I don't have enough experience with any of them to tell you how easy they are to maintain. I have heard, though, of pain in maintaining ZooKeeper instances, and I have also heard complaints about ZooKeeper's correctness proof. (Note that ZooKeeper is built on its own protocol, Zab, rather than Raft.)

Next, it has been proven that every consensus protocol can loop forever (this is the FLP impossibility result). Build a time-out mechanism into your system, and add randomized backoffs where appropriate. This is how practical engineers get around theoretical impossibilities.
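
For example, here is one common way to compute such a timeout, sketched in Go with exponential growth plus random jitter; the duration constants are illustrative, not tuned values:

```go
// A sketch of a retry timeout with exponential growth and random
// jitter, so competing proposers are unlikely to keep colliding.
package retry

import (
	"math/rand"
	"time"
)

const (
	baseTimeout = 150 * time.Millisecond
	maxTimeout  = 5 * time.Second
)

// nextTimeout returns the backoff to wait before retry number `attempt`.
func nextTimeout(attempt int) time.Duration {
	if attempt > 20 {
		attempt = 20 // avoid overflowing the shift below
	}
	d := baseTimeout << uint(attempt) // exponential growth
	if d > maxTimeout {
		d = maxTimeout
	}
	// Full jitter: pick uniformly in [baseTimeout, d].
	return baseTimeout + time.Duration(rand.Int63n(int64(d-baseTimeout)+1))
}
```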

Finally, examine your throughput needs. If your throughput is high enough, you will need to figure out how to partition across several consensus clusters. And that's a whole 'nother ball of wax.
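
As a rough illustration of what that partitioning might look like, here is a sketch that hashes each key to one of several independent consensus groups; the cluster addresses are made up:

```go
// A sketch of hash-partitioning keys across independent consensus
// groups. The cluster addresses are hypothetical.
package partition

import "hash/fnv"

// clusters holds one independent consensus group per partition.
var clusters = [][]string{
	{"a1:9000", "a2:9000", "a3:9000"},
	{"b1:9000", "b2:9000", "b3:9000"},
	{"c1:9000", "c2:9000", "c3:9000"},
}

// clusterFor maps a key to the group that owns it. Note that any
// operation touching keys in two partitions now needs coordination
// across clusters, which consensus alone doesn't give you.
func clusterFor(key string) []string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return clusters[h.Sum32()%uint32(len(clusters))]
}
```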

OTHER TIPS

You can solve this with a parallel Multi-Paxos instance that manages the configuration of your cluster. Consider a replicated JSON object, updated via Multi-Paxos, that contains the following information (a sketch follows the list):

  • Serial number
  • Current leader ID
  • Timestamp of current leader's lease expiration
  • List of peer IDs
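
One way to picture that object is as a struct with a JSON encoding. This is just an illustration; the field names are my own, not prescribed by anything above:

```go
// A sketch of the replicated configuration object as a Go struct with
// its JSON encoding. Field names are illustrative.
package config

import "time"

type ClusterConfig struct {
	Serial      int64     `json:"serial"`       // bumped on every configuration change
	LeaderID    string    `json:"leader_id"`    // ID of the current leader
	LeaseExpiry time.Time `json:"lease_expiry"` // when the current leader's lease runs out
	Peers       []string  `json:"peers"`        // IDs of all peers in the cluster
}

// Encoded, an instance might look like:
//   {"serial": 42, "leader_id": "node-1",
//    "lease_expiry": "2015-06-01T12:00:05Z",
//    "peers": ["node-1", "node-2", "node-3"]}
```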

You can use a stock Paxos implementation and put all of the necessary logic in your networking layer (a code sketch follows the list):

  • Drop all Prepare and Accept messages received from any node other than the leader until the lease expires.
  • Proactively bump the serial number and extend the leader's lease shortly before the lease expires.
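
Here is a sketch of what those two rules might look like in code, reusing the configuration fields from the earlier list; all type and function names are hypothetical, and `writeConfig` stands in for the replicated write through the configuration Multi-Paxos instance:

```go
// Sketch of the networking-layer rules: lease-based message filtering
// plus proactive lease renewal by the leader.
package netfilter

import "time"

// Msg is a hypothetical wire message; only the fields used here are shown.
type Msg struct {
	Type   string // "prepare", "accept", ...
	Sender string // node ID of the sender
}

// Config mirrors the replicated configuration object sketched earlier.
type Config struct {
	Serial      int64
	LeaderID    string
	LeaseExpiry time.Time
}

// shouldDeliver implements the first rule: drop Prepare/Accept traffic
// from any node other than the leader while the lease is still valid.
func shouldDeliver(m Msg, cfg Config, now time.Time) bool {
	if m.Type != "prepare" && m.Type != "accept" {
		return true // other message types pass through untouched
	}
	if now.Before(cfg.LeaseExpiry) && m.Sender != cfg.LeaderID {
		return false // a non-leader is proposing inside the lease window
	}
	return true
}

// renewLease implements the second rule: shortly before expiry, the
// leader bumps the serial and extends its own lease by replicating the
// updated configuration.
func renewLease(cfg *Config, now time.Time, leaseLen time.Duration) {
	cfg.Serial++
	cfg.LeaseExpiry = now.Add(leaseLen)
	writeConfig(*cfg)
}

// writeConfig is a placeholder for the configuration Multi-Paxos write.
func writeConfig(cfg Config) { panic("not implemented") }
```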