Dealing with master failure in a master-slave DB setup

https://softwareengineering.stackexchange.com/questions/377952

system-reliability

07-02-2021
Question

I'm learning about system design for the first time and am really intrigued by reliability. Given a setup where you have a master that replicates and writes data through to a slave, how do you persist/maintain availability of data to users if the master goes down? Lots of these architectures I find online seem to have a single link from the master to the slave. Is it implied that there is some arbiter that can elect a new master when the current master goes down? In the picture below from: https://docs.rightscale.com/cm/designers_guide/cm-cloud-computing-system-architecture-diagrams.html, it appears as if the applications are only being linked to the master. This looks like a single point of failure.

Solution

"how do you persist/maintain availability of data to users if the master goes down?"

If the master goes down, the slaves keep running and can still serve the data they already hold; they just stop receiving new data from the now-inactive master. Writes stay unavailable until one of them is promoted.

"Is it implied that there is some arbiter that can elect a new master when the current master goes down?"

Correct. SQL Server has "Witnesses", Oracle's Data Guard uses "Observers", and other DBMSs will, presumably, have something similar. Both do the same thing: they detect a failed master and coordinate the process of promoting a standby to be the new primary.
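To make the arbiter idea concrete, here is a minimal sketch of what such a component does. It is not how SQL Server or Data Guard implement it; the names Node, Observer, max_missed and applied_lsn are all made up for illustration, and a real arbiter also has to deal with network partitions and fencing the old master.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "primary" or "standby"
    healthy: bool = True   # result of the latest heartbeat (simulated here)
    applied_lsn: int = 0   # hypothetical replication position


@dataclass
class Observer:
    nodes: list
    max_missed: int = 3    # assumed threshold: missed heartbeats before failover
    missed: int = 0

    def primary(self):
        return next(n for n in self.nodes if n.role == "primary")

    def tick(self):
        """Run one heartbeat check and return the current primary's name."""
        current = self.primary()
        if current.healthy:
            self.missed = 0
            return current.name
        self.missed += 1
        if self.missed < self.max_missed:
            return current.name      # not yet declared dead
        candidates = [n for n in self.nodes
                      if n.role == "standby" and n.healthy]
        if not candidates:
            return current.name      # nothing to promote
        # Promote the healthy standby that has applied the most changes.
        new_primary = max(candidates, key=lambda n: n.applied_lsn)
        current.role = "standby"     # old primary rejoins as a standby later
        new_primary.role = "primary"
        self.missed = 0
        return new_primary.name
```

Calling tick() on a timer and marking the primary unhealthy would, after max_missed consecutive failures, hand the primary role to the most up-to-date standby. Waiting for several missed heartbeats is a deliberate choice in this sketch so that a single dropped check does not trigger a failover.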

"it appears as if the applications are only being linked to the master"

It depends on the architecture.

SQL Server clusters expose a single virtual IP address to which all clients connect; the clustering software works out where the primary is and "joins the dots".

Oracle takes a different approach, using its TNS technology to tell clients about all of the available nodes (primary and standby) and letting the Oracle client software on each client work out which database it needs to talk to.
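The client-side approach can be sketched roughly as follows. This is not Oracle's client code or the TNS format; FakeNode, NotPrimaryError and write_with_failover are hypothetical stand-ins that only show the idea of a client holding the full node list and finding the primary itself.

```python
class NotPrimaryError(Exception):
    """Raised by a node that is up but is not currently the primary."""


class FakeNode:
    """Stand-in for a database endpoint; not a real driver."""

    def __init__(self, name, is_primary, up=True):
        self.name = name
        self.is_primary = is_primary
        self.up = up
        self.rows = []

    def write(self, row):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        if not self.is_primary:
            raise NotPrimaryError(f"{self.name} is a standby")
        self.rows.append(row)
        return self.name


def write_with_failover(nodes, row):
    """Try each known node in order until one accepts the write."""
    errors = []
    for node in nodes:
        try:
            return node.write(row)
        except (ConnectionError, NotPrimaryError) as exc:
            errors.append(str(exc))
    raise RuntimeError("no primary reachable: " + "; ".join(errors))
```

With a node list like [db1 (down), db2 (promoted to primary)], the write skips db1 and lands on db2 without the application needing to know a failover happened. The virtual IP approach gets the same result by moving this lookup out of the client and into the cluster.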

Other answers

The image you refer to shows a simplified relationship between the applications and the DB. In my experience with Oracle and its Data Guard solution (1 master and 1..n 'slaves'), the 'master' is not fixed: at any time you can switch roles between a 'slave' and the 'master'. This is made transparent to the applications by a DNS-like solution, which means the master is not a single point of failure.

I believe that diagram is misleading. As you correctly identified, if the only connection were from the application to the master, a master-slave relationship would be pointless and the application would fail if the master database went down.

As an example, a MongoDB replica set consists of several nodes, one being the primary and the others being secondaries. The application connects to the entire replica set (e.g. the connection string is mongodb://user:pass@host1.local,host2.local,host3.local) and reads from and writes to the primary. Should the primary go down, the remaining secondaries hold an election to determine a new primary.
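As a rough illustration of the election step, the sketch below picks a new primary only when a majority of the set is still reachable, and prefers the most up-to-date member. It is a deliberate simplification and not MongoDB's actual election protocol; elect_primary and the "last applied operation" numbers are assumptions made for the example.

```python
def elect_primary(members, reachable):
    """Pick a new primary, or return None if no majority is reachable.

    members:   dict of member name -> last applied operation number
    reachable: set of member names that can still see each other
    """
    majority = len(members) // 2 + 1
    if len(reachable) < majority:
        return None  # without a majority, nobody may become primary
    # Prefer the member with the newest data; sort first so ties are stable.
    return max(sorted(reachable), key=lambda name: members[name])
```

For a three-node set where host1 (the old primary) is down, host2 and host3 still form a majority, so one of them is elected. If two of the three nodes are down, the survivor stays a secondary, which is what stops two isolated halves of a cluster from both accepting writes.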

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange