Why are these two HADR DMVs reporting different states?
-
28-09-2020 - |
Question
SQL Server 2012 (11.0.5058.0) Enterprise Edition
We have 8 Availability Groups in a 2(HA)+1(DR) cluster and our monitoring DMVs are reporting results that confuse me. 6 Availability Groups are configured for HA and DR, 1 is configured for HA only, and 1 is configured for DR only.
Each of the 6 HA/DR Availability Groups have "SQLB" as a primary and "SQLA" as a secondary (synchronous) HA replica and "SQLC" as a secondary (async) replica.
On both secondaries:
SELECT dhags.group_id, dhags.synchronization_health_desc
FROM sys.dm_hadr_availability_group_states dhags
reports that all Availability Group replication sync health are NOT_HEALTHY
and
select replica_id,synchronization_health_desc
from sys.dm_hadr_availability_replica_states
reports that all replicas have a sync health of HEALTHY
.
The primary replica reports all Availability Groups and replicas with a sync health of HEALTHY
.
While I understand that one reports on replica sync health and the other reports on AG sync health, it seems logical to me that if the more granular (AG) state was not healthy, that would affect the overall health of the broader context (replica). I cannot find MSDN documentation that describes how the health is determined at each level.
Why would the secondaries report NOT_HEALTHY
for Availability Group sync health, but HEALTHY
for replica sync health, and why does this differ from the primary's report?
Solution
Sadly, sys.dm_hadr_availability_replica states is not a reliable indicator of replica health. Here's the Connect item on one of the bugs we've run into where that DMV stops refreshing - note in the comments that log_send_queue_size in the DMV sys.dm_hadr_database_replica_states shows 0 even when there's log data to be sent.
Note that the Connect item is marked as Won't Fix. Sad trombone.