Failover: For Software error or Hardware or Both?

https://stackoverflow.com/questions/17276840

01-06-2022
Question

I am designing a system in which programs run in Nominal/Redundant mode: one instance on one machine and one on another. Should the Nominal program fail (a failover event), the Redundant one should take over and assume operations as the new Nominal process. This should be transparent to the user.

My question is: should a failover be triggered only by a hardware failure, or are software errors also sufficient cause?

More generally, is there an industry standard for deciding what should cause a failover, or is that up to the system architect/designer?

Solution

From the cluster's point of view, it makes no difference whether the error is in software or hardware. The point is that you cannot rely on any "I am failing" event from a failing node.

The cluster (in your case, the "Redundant" role) simply notices that a node has stopped sending heartbeats (or has stopped responding to pings). The "Redundant" node then makes itself "master" and starts processing incoming requests. That's all, I think.
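As a rough illustration of the heartbeat idea, here is a minimal Python sketch of the logic that could run on the Redundant node. The class name `StandbyMonitor`, the 3-second timeout, and the role strings are all assumptions made for this example; they do not come from any particular cluster product, and a real system would also need the network code that receives the heartbeats.

```python
import time


class StandbyMonitor:
    """Runs on the Redundant node; promotes it when heartbeats stop arriving."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s    # hypothetical threshold; tune per system
        self.clock = clock            # injectable so the logic can be tested
        self.last_heartbeat = clock()
        self.role = "redundant"

    def on_heartbeat(self):
        """Call whenever a heartbeat from the Nominal node is received."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Call periodically; promotes this node if the Nominal has gone silent."""
        silent_for = self.clock() - self.last_heartbeat
        if self.role == "redundant" and silent_for > self.timeout_s:
            # The cause of the silence (software or hardware) is unknown here.
            self.role = "nominal"
        return self.role
```

Note that the monitor never learns why the heartbeats stopped: a crashed process, a hung process, and a powered-off machine all look the same from the Redundant side, which is why the distinction does not drive the failover decision.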

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow