Why is there automatic failover for HA and manual failover for DR?

https://dba.stackexchange.com/questions/278297

09-03-2021
Question

I am reading this summary from the book 'Pro SQL Server 2019 Administration' by Carter:

And it specifies:

hot → automatic failover → used for high availability (HA)

warm → manual failover → used for disaster recovery (DR)

Since 'high availability' is usually planned and 'disaster recovery' is not, shouldn't we aim for manual failover in 'high availability' scenarios?

What's the point of having automatic failover for high availability?

It makes sense to me that if a disaster occurs, failover should be automatic, and if we have something planned (patching, etc.) and we have HA, we can do the failover manually...

Solution

Generalizations and simplifications are tricky here. However, here's an attempt:

My feeling is that HA partners are generally closer geographically. At short distances it is more realistic to use synchronous methods, like sync Availability Groups. With a synchronous partner, every committed transaction is already on the secondary, so it is reasonable to fail over automatically.

But when we talk DR, it is reasonable to imagine longer distances, since having some distance is part of the resilience architecture itself. And with longer distances, it is likely that you use async methods to get changes across, like async AGs. Log shipping is by its very nature async. In other words, if you fail over with an async solution, you can lose whatever had not yet reached the secondary! That is not something you want to happen "behind the scenes".
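To make the sync/async difference concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class `ReplicatedPair` and its methods are hypothetical names invented for illustration, and the model does not reproduce SQL Server's actual commit protocol. It only shows why a failover on an async pair can drop acknowledged transactions.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedPair:
    """Toy model of a primary with one secondary (hypothetical, not SQL Server's protocol)."""
    synchronous: bool
    primary_log: list = field(default_factory=list)
    secondary_log: list = field(default_factory=list)

    def commit(self, txn: str) -> None:
        # The primary always records the transaction locally.
        self.primary_log.append(txn)
        if self.synchronous:
            # Sync: the commit is acknowledged only once the secondary has it too.
            self.secondary_log.append(txn)
        # Async: the commit is acknowledged immediately; shipping happens later.

    def ship_pending(self) -> None:
        # Background shipping brings the secondary up to date.
        self.secondary_log = list(self.primary_log)

    def fail_over(self) -> list:
        # The primary is gone: return acknowledged transactions the secondary never received.
        return self.primary_log[len(self.secondary_log):]


sync_pair = ReplicatedPair(synchronous=True)
async_pair = ReplicatedPair(synchronous=False)
for pair in (sync_pair, async_pair):
    pair.commit("order-1")
    pair.ship_pending()
    pair.commit("order-2")  # the primary fails before this one is shipped

lost_sync = sync_pair.fail_over()
lost_async = async_pair.fail_over()
print("lost with sync: ", lost_sync)
print("lost with async:", lost_async)
```

In this model the sync pair loses nothing on failover, while the async pair loses the transaction that was acknowledged but not yet shipped, which is the kind of loss a human should sign off on.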

Other answers

This is going to change wildly based on each business's tolerance for data loss and recovery times. But, here are my thoughts on the matter.

The SQL Servers do not live in a vacuum, and our HA and DR plans are part of a concerted whole that includes app servers, BI and other resources. Add in a healthy mix of geographic dispersal, and synchronization across dispersed regions isn't feasible, since we don't want temporary lags in the connection to cause lag for the users connected to the primary site.

Given all of that, our business decided long ago that any failover has to be executed manually so that we can move the entire environment as a whole. We do have synchronous partners for our SQL Servers, but we still elected to use manual failover only.

Because in a DR scenario you need to check what really happened. E.g. your DR site loses contact with the primary. Should it activate? Well, the primary might actually be fine and someone’s just put a backhoe through the fibre; this happens all the time. Or the primary might have been totally destroyed by a natural disaster or a terrorist atrocity. There is no way for SQL Server to figure this out for itself.

HA is easy: it’s typically within the same data centre, so the range of failure modes is much more tractable for automated decisions.
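The ambiguity the DR site faces can be sketched with a small Python model. This is an illustration only: the function names and the two scenarios are hypothetical, and the "heartbeat" is a stand-in for whatever health signal a real setup uses.

```python
def heartbeat_seen(primary_alive: bool, link_up: bool) -> bool:
    """What the DR site can observe: a heartbeat arrives only if both conditions hold."""
    return primary_alive and link_up


def primaries_after_auto_activation(primary_alive: bool, link_up: bool) -> int:
    """Count sites accepting writes if the DR site activates itself on a missed heartbeat."""
    dr_active = not heartbeat_seen(primary_alive, link_up)
    return int(primary_alive) + int(dr_active)


# Hypothetical scenarios, each given as (primary_alive, link_up).
fibre_cut = (True, False)        # primary is healthy, only the link is down
site_destroyed = (False, False)  # primary is really gone

# From the DR site, both scenarios produce the same observation: no heartbeat.
same_observation = heartbeat_seen(*fibre_cut) == heartbeat_seen(*site_destroyed)

writers_if_destroyed = primaries_after_auto_activation(*site_destroyed)
writers_if_fibre_cut = primaries_after_auto_activation(*fibre_cut)
print(same_observation, writers_if_destroyed, writers_if_fibre_cut)
```

Automatic activation is harmless when the primary is really gone, but in the cut-fibre case it leaves two sites accepting writes at once, and the DR site has no way to tell the two cases apart from its own observations.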

Since 'high availability' is usually planned and 'disaster recovery' is not ...

Not all High Availability "events" are planned.

If the primary database server crashes at 2am, why not leave the databases down until you get in at 9am and start them up on the other node? Your business would lose money.

So you have an H.A. solution to automatically move the databases to the other node, keeping the business running, until you come in to figure out what went wrong, which is secondary.

If your entire data centre goes up in a puff of smoke, that's a Disaster.

You don't really know what's what and you don't really know what's likely to work when you turn it on. That's why D.R. tends to be more hand-cranked.

From my perspective, High Availability relates to the ability to access the resource in the event of a network outage, power failure, DNS outages/attacks, etc. Non-volatile events, in other words. These types of events can usually be managed by preparing hot stand-by systems, setting up alternative configurations, and through monitoring and automation.

Disaster Recovery is getting a system back online after it has been taken offline or corrupted by something out of the ordinary, like a malware attack or a natural disaster. These types of events generally require human triage and higher-level resolutions.

One problem with automatic disaster recovery that’s not yet been mentioned is that your disaster may have been replicated to your failover sites. If the disaster is a DBA thinking he’s on a test database and typing “DELETE FROM CUSTOMERS;” on the live database, then your DR has to include a manual rollback on the replicated database copies. If the DR site automatically goes live, then you’ll have new database transactions before the rollback happens, and so you’ll have a big mess.
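A rough Python sketch of that mess, assuming the rollback is a simple point-in-time restore to just before the bad statement. The log entries and the helper `restore_to_before` are hypothetical and only illustrate the ordering problem.

```python
def restore_to_before(log: list, bad_entry: str) -> list:
    """Point-in-time restore: keep only the entries recorded before the bad one."""
    return log[: log.index(bad_entry)]


BAD = "DELETE FROM CUSTOMERS"

# The mistake has already been replicated to the DR copy.
replica_log = ["insert customer A", "insert customer B", BAD]

# Manual DR: roll back first, then go live. Nothing else is affected.
restored_manual = restore_to_before(replica_log, BAD)

# Automatic DR: the site goes live first, so new work lands after the bad entry.
auto_log = replica_log + ["insert order 1001", "insert order 1002"]
restored_auto = restore_to_before(auto_log, BAD)

# Everything written after the bad entry is discarded by the same restore.
discarded = auto_log[len(restored_auto) + 1:]
print(restored_manual)
print(discarded)
```

With manual DR the restore discards only the mistake. With automatic go-live, the same restore also throws away the new transactions, or they have to be reconciled by hand.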

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange