Question

The following error appeared in the cluster events, and the availability group failed, which left the databases in a non-synchronizing state.

A component on the server did not respond in a timely fashion. This caused the cluster resource 'AG' (resource type 'SQL Server Availability Group', DLL 'hadrres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource.

Please help me find the root cause of "A component on the server did not respond in a timely fashion".


Solution

That error means the AG failed one of the health detection timeouts (lease, session, or health check) summarized in this table from the docs:

Mechanics and guidelines of lease, cluster, and health check timeouts for Always On availability groups - Summary of Timeout Guidelines

Your first stop should be to review the SQL Server error log on each node in the AG, to see if there was something going on that might have caused the instance to stop responding to the cluster AG resource. For instance, you might have crash dumps, deadlocked schedulers, etc.

To dig into the cluster-related details, you'll need to run the PowerShell Get-ClusterLog cmdlet to get the cluster log file from each node in the AG. Then find the time associated with the failure error message you mentioned in the question. Review the cluster log files around that time period for errors (search for "ERR") or any messages of interest from the AG resource DLL itself (which will include "[RES]" or "[hadrag]").

I would expect you'll find messages that include "timed out" or "IsAlive" showing the timing of the failures. At this point, you should be able to figure out which server was not responding, and investigate that directly.
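Cluster logs are large, so it can help to filter them mechanically before reading. The following is a minimal Python sketch of the search described above, not an official tool. It assumes each cluster log line looks like "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message" and that the failure time you pass in uses the same time zone as the log; the function name, the keyword list, and the five-minute window are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message".
# Adjust the pattern if your cluster log is formatted differently.
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)

# Strings the answer suggests looking for (matched case-insensitively).
KEYWORDS = ("[hadrag]", "timed out", "isalive")


def find_suspect_lines(lines, failure_time, window_minutes=5):
    """Return (timestamp, level, message) for lines near failure_time that
    are ERR-level or mention one of KEYWORDS."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines, and wrapped text
        ts = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
        if abs(ts - failure_time) > window:
            continue
        level = match.group("level")
        msg = match.group("msg")
        if level == "ERR" or any(k in msg.lower() for k in KEYWORDS):
            hits.append((ts, level, msg))
    return hits
```

To use it, open each node's log file (with whatever encoding it was written in), pass the file object as `lines` along with the time of the failure event, and compare the hits across nodes to see which server stopped responding first.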

Figuring these things out can be difficult. Microsoft support has released a tool that can be used to categorize some types of failovers, which might be useful to you:

Failover Detection Utility - Availability Group Failover Analysis Made Easy

OTHER TIPS

Unfortunately, I couldn't find much information out there, but you might want to review this Microsoft article on Failover Clustering (it seems to have helped other people with similar issues, and it goes into considerable depth): https://techcommunity.microsoft.com/t5/failover-clustering/failover-clustering-networking-basics-and-fundamentals/ba-p/1706005

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange