SQL Server Availability Group LeaseTimeout and slow IO

https://dba.stackexchange.com/questions/284934

15-03-2021
|

Question

Our 5 main databases runs on a physical (2 * 8 cores, 512GB, Hypertreading) SQL Server 2016 SP2 Enterprise in a single Availability Gruop and sometimes we get errors that the lease-timeout has expired. My understanding is that if the lease can't get updated there is a system-wide problem.

When I check the output of sp_server_diagnostics (*SQLDIAG*.xel files), in the log folder of the primary replica, around the time of the timeout I always find pending IO operations.

<ioSubsystem ioLatchTimeouts="0" intervalLongIos="0" totalLongIos="1">
<longestPendingRequests>
<pendingRequest duration="26566" filePath="\?\F:\SqlLogs\db1.ldf" offset="80824832" handle="0x8d10" /> <pendingRequest duration="1987" filePath="\?\O:\SqlLogs\db2.ldf" offset="3880740352" handle="0x1330" /> <pendingRequest duration="1093" filePath="\?\O:\SqlLogs\db3.ldf" offset="288143360" handle="0x132c" /> <pendingRequest duration="974" filePath="\?\O:\SqlLogs\db3.ldf" offset="288145408" handle="0x132c" /> <pendingRequest duration="937" filePath="\?\O:\SqlLogs\db3.ldf" offset="288146944" handle="0x132c" />
</longestPendingRequests>
</ioSubsystem>

This is what I find in the clusterlog of the primary replica:

WARN [RES] SQL Server Availability Group: [hadrag] Failed to retrieve data column. Return code -1
ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Resource Alive result 0.
ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, diagnostics heartbeat is lost
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel
ERR [RES] SQL Server Availability Group <AG_Name>: [hadrag] Resource Alive result 0.
WARN [RHS] Resource AG_Name IsAlive has indicated failure.

This are the erros in the SQL Server errorlog:

Error: 19407, Severity: 16, State: 1
SQL Server hosting availability group 'AG_Name' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.

Error: 19407, Severity: 16, State: 1
The lease between availability group 'AG_Name' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

Always On: The local replica of availability group 'AG_Name' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

This is the output from SELECT @@version:

Microsoft SQL Server 2016 (SP2-CU15) (KB4577775) - 13.0.5850.14 (X64) Sep 17 2020 22:12:45 Copyright (c) Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2012 R2 Standard 6.3 (Build 9600: )

In our monitoring there are no signs of high cpu usage. Also no memory dumps are created at the time of the problem.

As a result of this timeout the WSFC-service restarts the cluster resource 'AG_Name'. After that this resource is restarted everything works perfect again.
What I don't understand is: how can slow IO-requests cause a lease-timeout? Can to many pending IO-requests cause a lease-timeout?

Solution 2

After checking the wait stats in our monitoring tool I noticed that on moment of the issue there were two leading wait types with a wait time of 526000 ms/s, PREEMPTIVE_SP_SERVER_DIAGNOSTICS and PREEMPTIVE_HADR_LEASE_MECHANISM.

If I interprete this correctly, the PREEMPTIVE part means that a thread outside of SQLOS is executing the commands. In this case, executing SP_SERVER_DIAGNOSTICS and renewing the lease.
The high wait time shows that SQL Server was waiting for those threads to finish. So I think this was an issue of the OS that wasn't responding.

Our system administrator also mentioned that on the moment of the timeout there were several warnings with event-id 153 in the system log:

Device\MPIODisk0 is currently in a degraded state. One or more paths have failed, though the process is now complete.

So my conclusion is due to the disk problems the OS was not responding within the defined timeout settings and caused the cluster resource to restart.

OTHER TIPS

how can slow IO-requests cause a lease-timeout? Can to many pending IO-requests cause a lease-timeout?

No, slow I/O requests can't cause a lease timeout directly.

However, if the server is completely overloaded (CPU at 100%), than can cause pending I/O requests and lease timeouts. The default lease timeout is 20 seconds, and your pending I/O is 26 seconds. High CPU or some other server / OS level problem is more likely the issue here.

Another cause is that SQL Server encountered a serious error, and is generating dump files (which causes the process to pause, potentially long enough for WSFC to think the lease timed out).

See the documentation for a couple more possibilities:

OS not responding, low virtual memory, working set paging, generating dump, pegged CPU, WSFC down (loss of quorum)

You should review the SQL Server error log to see if there are dumps being created. If you have monitoring from the time of these incidents, you could also check for maxed out CPU.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange