Question

We have a SQL Server 2016 Availability Group with 3 servers. Twice in the past 5 days we have had performance issues. 8a is our primary server, and 8b and 8c are the secondaries.

We started having issues right before 8PM. We saw timeouts as well as 502 Bad Gateway errors. On 8b and 8c we see a huge spike in DPT_ENTRY_LOCK waits, as well as an increase in LCK_M_IS, LCK_M_S, and HADR_SYNC_COMMIT on the primary at the same time. Unfortunately we don't know which of the 3 servers was timing out, because all 3 sit behind a load balancer for read-only connections.

We are currently trying to track down what was locking stuff up but can't figure out what the DPT_ENTRY_LOCK type is on the secondaries. When SQL Skills doesn't have the answer I start to worry.

Wait Chart


Solution

DPT_ENTRY_LOCK waits occur when the workload is constantly modifying the same pages (so constant updates would cause this to show up). Practically nothing has been written on this subject, so getting definitive answers (along with supporting evidence) is more than a challenge. As such, I'll give you what I've seen from my own experience. You would probably need to open a CSS call at this point for something more definitive.

When you have a large number of transactions on your primary that have to be flushed to the secondary, and a high number of reads on a secondary that access the same tables, you end up with this wait, or a DIRTY_PAGE_TABLE_LOCK wait.
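To see whether the secondaries are falling behind on redo during one of these episodes, something like the following could be run on the primary. This is a rough sketch, not a definitive diagnostic: it only reads the documented sys.dm_hadr_database_replica_states and sys.availability_replicas views, and what counts as a "large" queue is something you would have to judge against your own baseline.

```sql
-- Run on the primary (8a in the question). On a secondary this view only
-- returns rows for the local replica.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)       AS database_name,
    drs.synchronization_state_desc,
    drs.log_send_queue_size,       -- KB of log not yet sent to the secondary
    drs.log_send_rate,             -- KB/sec
    drs.redo_queue_size,           -- KB of log received but not yet redone
    drs.redo_rate,                 -- KB/sec
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY drs.redo_queue_size DESC;
```

A redo queue that keeps growing on 8b or 8c while the waits spike would be consistent with the pattern described above, though it does not prove it.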

These happen when there is a lot of work going on, meaning there are a large number of dirty pages that need to be flushed. Individual wait times are usually very short. You wouldn't expect this thanks to RCSI on the secondary, but it happens, and I've only seen it when there are a large number of writes on the primary and read calls on the secondary.
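To check how long these waits actually are on your secondaries, a query along these lines against sys.dm_os_wait_stats may help. It is only a sketch; the list of wait types is limited to the ones named in this answer (PARALLEL_REDO_TRAN_TURN is mentioned further down), and the average is a simple total divided by count.

```sql
-- Run on each readable secondary (8b, 8c). The counters are cumulative since
-- the last restart (or since the stats were cleared), so compare two samples
-- taken a few minutes apart to see what is happening right now.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms,
    wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'DPT_ENTRY_LOCK',
                    N'DIRTY_PAGE_TABLE_LOCK',
                    N'PARALLEL_REDO_TRAN_TURN')
ORDER BY wait_time_ms DESC;
```

A very high waiting_tasks_count with a low avg_wait_ms would match the "many very short waits" behaviour described above.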

As for a way around this, you could try using indirect checkpoints (if not already enabled), which use sorted, partitioned lists to track dirty pages rather than the random ordering used by prior versions. See SQL 2016 – It Just Runs Faster: Indirect Checkpoint Default.

There is a possibility that this could cause unforeseen performance issues on your primary, so you should test it under load before implementing it on a production system, and there is no guarantee that it will work. I'm actually about to try this very thing on some systems where I'm experiencing the same issues.
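For reference, here is a minimal sketch of checking and enabling indirect checkpoints. [YourDatabase] is a placeholder name, and 60 seconds is simply the default that SQL Server 2016 applies to newly created databases; databases upgraded from earlier versions keep a value of 0 unless it is changed. I am assuming the ALTER DATABASE has to be run on the primary, since that is the only replica where the database is writable.

```sql
-- A target_recovery_time_in_seconds of 0 means the database still uses
-- automatic checkpoints rather than indirect checkpoints.
SELECT name, target_recovery_time_in_seconds
FROM sys.databases;

-- Assumption: run on the primary replica. [YourDatabase] is a placeholder.
ALTER DATABASE [YourDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;
```

Setting the value back to 0 reverts the database to automatic checkpoints, which gives you a way out if the change makes things worse under load.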

You might expect to see PARALLEL_REDO_TRAN_TURN waits as well, which are caused by page splits.

Licensed under: CC-BY-SA with attribution