Bully Algorithm - Detecting Failure

https://stackoverflow.com/questions/16277125

13-04-2022

Question

Descriptions of the bully algorithm usually do not cover the actual detection of a failure.

I have a working implementation of the bully algorithm that uses the elections themselves to detect failures, rather than have failures trigger elections.

In short, elections in my implementation are performed on a scheduled basis, rather than upon a failure detection.

Clearly this means network traffic is generated, but it seems like a simple solution to something that otherwise might become complicated (e.g. having a separate failure detection mechanism, which will have its own network traffic).

Can anyone see a problem with this?

Solution

Assume your distributed system has four nodes, A, B, C and D, and that A is the current leader. An election occurs only if one of B, C or D notices that coordinator A is not responding. A's failure is inferred from message timeouts or from the coordinator failing to initiate a handshake. Unlike in your implementation, the standard bully algorithm holds an election only when the coordinator fails or when a new node with a higher process ID joins.
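The two triggers described here can be sketched in a few lines. This is only an in-memory illustration, not a network implementation: the function name `elect`, the numeric ids (D=1, C=2, B=3, A=4) and the `alive` set standing in for "nodes that answer an ELECTION message" are all assumptions made for the example.

```python
def elect(alive_ids, initiator):
    """Simulate one bully election and return the winning node id.

    `alive_ids` is a hypothetical stand-in for the set of nodes that
    would answer an ELECTION message over the network.
    """
    candidate = initiator
    while True:
        # Higher-id nodes that answer the candidate's ELECTION message.
        responders = sorted(n for n in alive_ids if n > candidate)
        if not responders:
            # Nobody higher answered: the candidate announces itself.
            return candidate
        # A higher node answered and takes over the election.
        candidate = responders[0]


# Assumed ids: D=1, C=2, B=3, A=4, with A the current coordinator.
alive = {1, 2, 3, 4}
coordinator = 4

# Trigger 1: C's request to A times out, so C starts an election.
alive.discard(4)
coordinator = elect(alive, initiator=2)
print(coordinator)  # 3

# Trigger 2: a node with a higher id (5) joins and starts an election.
alive.add(5)
coordinator = elect(alive, initiator=5)
print(coordinator)  # 5
```

Note that no election runs while the coordinator keeps responding, which is the difference from a scheduled-election design.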

OTHER TIPS

Usually, leader election starts when a member suspects that there is no longer a leader, i.e. after a (local) timeout. Frequently a local timeout alone is not sufficient, and the member additionally relies on an expected action of the leader failing to occur.

Applying this scheme, there is no need for periodic re-election or for a separate failure-detection mechanism.
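As a rough sketch of such a local timeout, a member might simply remember when it last saw the expected leader action and suspect the leader once that is too long ago. The class `LeaderMonitor`, its method names and the 3-second timeout are hypothetical; timestamps are passed in explicitly so the example is deterministic.

```python
class LeaderMonitor:
    """Suspects the leader when no expected action arrives within `timeout`."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_seen = now

    def leader_acted(self, now):
        """Record that the expected leader action (assumed here to be
        something like a heartbeat or a reply) arrived at time `now`."""
        self.last_seen = now

    def suspects_leader(self, now):
        """True once the leader has been silent for longer than `timeout`."""
        return now - self.last_seen > self.timeout


monitor = LeaderMonitor(timeout=3.0, now=0.0)
monitor.leader_acted(now=2.0)
print(monitor.suspects_leader(now=4.0))  # False: only 2.0s of silence
print(monitor.suspects_leader(now=5.5))  # True: 3.5s of silence, start an election
```

In a real node the timestamps would typically come from a monotonic clock such as `time.monotonic()`, and a `True` result would be the point at which the member starts an election.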

Licensed under: CC-BY-SA with attribution
