Longer time to bring SQL online after node failover

https://dba.stackexchange.com/questions/284296

14-03-2021

Question

We had a strange issue after our regular OS patching of our SQL Servers.

Per best practice, we patch the passive node, then fail over so that the current passive node becomes active (and vice versa) to complete the patching.

A node failover is generally seamless and completes in under a minute. Recently, however, it took 4 minutes to bring SQL online after the node failover.

I am checking the logs and events but could not find the reason. Below are the findings so far:

Note: The SQL Server is running on a VM

No increase in SQL Server activity
Databases are part of DB mirroring
No increase in user connections or in user queries running longer during that period
VLFs under 500 for all databases
No CPU/memory pressure, and LPIM is enabled
The process that seemed to be killed (timed out as long-running) during failover was EXEC sp_server_diagnostics 20, which had been running for the past 86745234 secs

What else should I be checking to find the root cause?

Edit: I tried analyzing the cluster log and can see that the SQL offline was initiated, but I am not sure where it spent at least 4 minutes internally to actually shut down SQL and bring it back. After those 4 minutes, the SQL error log had all the entries for the databases being brought up within approx. 10 secs. So it looks like the databases might not have any involvement in slowing down the process.

Edit- some VLF info when checked currently

Solution

Most likely your problem is caused by the databases going through recovery to redo or undo transactions that haven't been hardened to data files.

Avoid recovery altogether

Before doing planned server reboots or failovers of an FCI, particularly one with a lot of memory, I like to run a CHECKPOINT on every database. This minimizes the time spent on a clean shutdown of all databases, and (when databases aren't shut down cleanly) minimizes crash recovery time on restart.

I use sp_ineachdb from the First Responder Kit to do this:

EXEC DBA.dbo.sp_ineachdb 'CHECKPOINT;';

If you hate free code that makes your life easier, you could do something with dynamic SQL:

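-- CHECKPOINT does not accept a database name, so switch into each database first.
-- Skip databases that are not online (for example, mirror copies in a restoring
-- state), because USE would fail on them.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'USE ' + QUOTENAME(name) + N'; CHECKPOINT; '
FROM sys.databases
WHERE state_desc = N'ONLINE';

EXEC sys.sp_executesql @stmt = @sql;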

But make recovery faster by doing less work

And of course, as Erik Darling mentioned in the comments, make sure your VLFs are in order and properly sized. Scanning those VLFs during crash recovery is what can cause all the pain you're seeing. For planned maintenance you can CHECKPOINT to minimize or eliminate crash recovery. But if you have... uhhh... a crash and fail over unexpectedly, that crash recovery is still going to happen, and there's not much you can do about it.
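A quick way to sanity-check VLF counts is something like the sketch below. It assumes SQL Server 2016 SP2 or later, where sys.dm_db_log_info is available; the column aliases are just my own naming.

```sql
-- Count VLFs per online database, highest first.
-- Assumes SQL Server 2016 SP2+ (sys.dm_db_log_info does not exist before that).
SELECT d.name AS database_name,
       COUNT(*) AS vlf_count
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
WHERE d.state_desc = N'ONLINE'
GROUP BY d.name
ORDER BY vlf_count DESC;
```

Counts alone don't tell the whole story, since VLF size matters too, but a database that stands out at the top of this list is a reasonable place to start looking.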

Be less direct

I've also had tons of luck with indirect checkpoints. We've rolled this out across our entire environment with much success.
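For reference, indirect checkpoints are turned on per database by setting a non-zero target recovery time. A minimal sketch, where [YourDatabase] is a placeholder name and 60 seconds is only an example value (it matches the default for databases newly created on SQL Server 2016 and later), so pick and test a value that suits your workload:

```sql
-- [YourDatabase] is a placeholder; 60 seconds is an example target, not a recommendation.
ALTER DATABASE [YourDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- A value of 0 here means the database still uses automatic (not indirect) checkpoints.
SELECT name, target_recovery_time_in_seconds
FROM sys.databases
ORDER BY name;
```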

And lobby the powers that be for an upgrade

SQL Server 2019 includes a feature called Accelerated Database Recovery, which can speed up the recovery process, particularly when there are long-running, large transactions. ADR is not just for recovery after a crash; it also helps in other scenarios where the transaction log needs to be recovered, including Availability Group secondary redo and Failover Cluster Instance failovers.
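If you do get to 2019, ADR is enabled per database. A minimal sketch, again with [YourDatabase] as a placeholder name; try it in a test environment first, since ADR keeps a version store inside the database and changes its space usage:

```sql
-- Requires SQL Server 2019 or later. [YourDatabase] is a placeholder.
ALTER DATABASE [YourDatabase] SET ACCELERATED_DATABASE_RECOVERY = ON;

-- Check which databases currently have ADR enabled.
SELECT name, is_accelerated_database_recovery_on
FROM sys.databases
ORDER BY name;
```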

Licensed under: CC-BY-SA with attribution