Question

I'm running on Ubuntu and the program is written in C++. I have two processes running on different hosts, one acting as master and one as slave (there is no priority between them; the only difference is that just one of them handles requests). Only one process can be master and handle requests at any time. Both processes are always up, and if one crashes, a watchdog restarts it.

The hosts are connected by network cable.

My plan is for each process to ask the other for a keep-alive. If the slave stops getting keep-alives from the master, it changes its state to master. When the old master starts up again, it first waits for a keep-alive: if none arrives, it takes the master role; if one arrives, it takes the slave role.

I will be happy to get your opinion on:

How do I prevent both processes from being master at the same time? This is my MAJOR concern. At startup and during a connectivity failure, how do you prevent two masters at the same time?

Do you think it is better to query for a keep-alive or to send one? (In my opinion it is better to ask for a keep-alive than to push it.)

Any other good advice or pitfalls to watch for would be more than welcome.


Solution

The way I've done this is to have each process spawn a heartbeat thread that sends out a UDP packet once a second and listens for incoming UDP packets from the other process. If the heartbeat thread doesn't receive any UDP packets from the other process for a specified amount of time (e.g. 5 seconds), it assumes the other process is down and notifies the parent thread that it should become the master now.
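As a rough illustration of the timeout part only, here is a minimal sketch of the bookkeeping the heartbeat thread might do. The class name `HeartbeatMonitor` and its methods are hypothetical, the UDP send/receive code is left out, and the current time is passed in explicitly so the logic can be exercised without real sockets or sleeping. It assumes the object is owned by the heartbeat thread alone, so it does no locking.

```cpp
#include <chrono>

// Hypothetical helper (not from the original answer): remembers when the
// last heartbeat arrived and decides whether the peer should be presumed down.
// Assumed to be used from the heartbeat thread only, so it is not thread-safe.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    // `timeout` is the silence interval after which the peer is presumed down
    // (e.g. 5 seconds); `now` is the time at which monitoring starts.
    HeartbeatMonitor(Clock::duration timeout, Clock::time_point now)
        : timeout_(timeout), last_seen_(now) {}

    // Call whenever a heartbeat packet from the peer has been received.
    void on_heartbeat(Clock::time_point now) { last_seen_ = now; }

    // True once no heartbeat has been seen for at least `timeout`.
    bool peer_timed_out(Clock::time_point now) const {
        return now - last_seen_ >= timeout_;
    }

private:
    Clock::duration timeout_;
    Clock::time_point last_seen_;
};
```

In a real heartbeat loop, `on_heartbeat(Clock::now())` would be called after each successful receive, and `peer_timed_out(Clock::now())` would be checked once per iteration to decide when to notify the parent thread. A monotonic clock is used here so that wall-clock adjustments cannot trigger a spurious failover.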

The heartbeat sending/listening is done in a dedicated thread so that heartbeat UDP packets keep going out even when the main thread is busy with a lengthy calculation. That way the algorithms in the main thread don't need to be real-time in order to avoid triggering spurious failovers.

There is another issue to think about here: what happens if a network problem temporarily cuts communication between the two hosts (e.g. some joker or QA tester unplugs the Ethernet cable for 1 minute, then plugs it back in)? In that case, both processes will stop receiving UDP packets from the other process, so both will think the other has gone away, and both will become the master. Then, when the network cable is reconnected, you have two master processes running at once, which is not what you want. So you need some way for two master processes to decide which of the two should demote itself back to slave status, to satisfy the Highlander Principle ("there can be only one!"). This could be as simple as "the host with the smallest IP address should remain master", or you could have each heartbeat packet contain the sending process's uptime and let the host with the larger uptime remain master, or some similar rule.
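A minimal sketch of such a demotion rule, combining the two suggestions: the larger uptime wins, and the smaller IP address breaks a tie. The `PeerInfo` struct, its field names, and `resolve_role` are hypothetical, and the sketch assumes each heartbeat packet carries the sender's role, uptime, and IPv4 address.

```cpp
#include <cstdint>

enum class Role { Master, Slave };

// Hypothetical heartbeat payload; the field names are illustrative only.
struct PeerInfo {
    Role role;
    std::uint64_t uptime_seconds;   // how long the process has been running
    std::uint32_t ipv4_host_order;  // IPv4 address as a host-order integer
};

// Returns the role `self` should hold after receiving `peer`'s heartbeat.
// Only the "both are master" conflict is resolved here; in every other
// case the current role is kept.
Role resolve_role(const PeerInfo& self, const PeerInfo& peer) {
    if (self.role != Role::Master || peer.role != Role::Master) {
        return self.role;  // no conflict to resolve
    }
    if (self.uptime_seconds != peer.uptime_seconds) {
        // The process that has been up longer stays master.
        return self.uptime_seconds > peer.uptime_seconds ? Role::Master
                                                         : Role::Slave;
    }
    // Equal uptimes: the smaller IP address stays master.
    return self.ipv4_host_order < peer.ipv4_host_order ? Role::Master
                                                       : Role::Slave;
}
```

Both sides have to reach opposite conclusions from the same rule. Uptimes sampled at slightly different moments could disagree when the two values are nearly equal, so a value that never changes (the IP address, or a fixed process start time) is probably the safer primary key if that case matters.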

OTHER TIPS

The typical way to solve this problem is to hold an election. Everyone in the system shares the data that they'll use as input to the algorithm so that everyone can come to the same conclusion.

For example: the peers all (both, in your case) send each other some unique identifier (e.g. MAC address, pid, or high-precision process start time). Then each peer uses the same comparison to determine the winner (e.g. the greatest value). Then they inform each other of the results.
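The comparison step could look like the sketch below. The function names `elect` and `i_am_master` are hypothetical, and it assumes the identifiers have already been exchanged and fit in a 64-bit integer; the exchange itself and the final "inform each other" step are not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the winning identifier: the greatest value among this peer's own
// identifier and all identifiers received from the other peers.
std::uint64_t elect(std::uint64_t own_id,
                    const std::vector<std::uint64_t>& peer_ids) {
    std::uint64_t winner = own_id;
    for (std::uint64_t id : peer_ids) {
        winner = std::max(winner, id);
    }
    return winner;
}

// A peer becomes master only if its own identifier won the election.
bool i_am_master(std::uint64_t own_id,
                 const std::vector<std::uint64_t>& peer_ids) {
    return elect(own_id, peer_ids) == own_id;
}
```

Because every peer runs the same deterministic comparison over the same set of values, they all arrive at the same winner, provided the identifiers really are unique and every peer has received every identifier.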

For the problem regarding transient connectivity faults, see the Byzantine Generals problem.


Licensed under: CC-BY-SA with attribution