Pregunta

We are experiencing a slow growth of Stolen Memory on our database servers over the space of several days. It appears to plateau around 130-140GB, at which point we start having larger problems such as out-of-memory errors, multi-second freezes & AG failovers. Problems start to manifest about a week after a reboot. I have started recording history of stolen memory, which is shown below:

Stolen memory over time

Looking at sys.dm_os_memory_clerks, it appears as if the bulk of this is coming from non-page memory recorded against the buffer pool on NUMA node 0:

Buffer pool memory clerks

Tracking the total pages_kb for the buffer pool over time shows the decline in number of pages as virtual_memory_committed_kb grows. (On April 13th, the server was rebooted for windows updates. The buffer pool fills to 400GB in around an hour)

Buffer pool pages

Has anyone seen this behaviour before?

We are running SQLServer 2016 CU12 13.0.5698.0 The server is a 64-core AWS EC2 i3.16xlarge instance. We have a number of other clusters of the same size that are all showing this issue. We also have a few clusters on 32-core i3.8xlarge instances that also show the growth in stolen memory, but they don't end up stalling / throwing out-of-memory errors. The only difference (other than scale) is that the 64-core servers have 2 NUMA nodes.

Server config

Update: MS indicated that the bugfix in KB4536005 is not being back-ported to SQL2016.

¿Fue útil?

Solución

i have a suspicion. First - are you able to open support tickets with Microsoft?

Easiest way to check my suspicion is capture [\SQLServer:Memory Node(*)\Stolen Node Memory (KB)] for both NUMA nodes, and compare the sum to [\SQLServer:Memory Manager\Stolen Server Memory (KB)]. If my suspicion is correct, when trouble is brewing the discrepancy between the two - which seem like they should always agree - will be quite high. The other tell-tale characteristic: up to N-1 SQLOS NUMA nodes may have show this relationship (where N is the count of NUMA nodes) [database node memory] + [stolen node memory] + [free node memory] > [total node memory]

i describe the problem somewhat in these blog posts.

https://sql-sasquatch.blogspot.com/2018/07/sql-server-2016-memory-accounting.html
https://sql-sasquatch.blogspot.com/2018/10/sql-server-2016-memory-accounting-part.html

The basic accounting problem is that sometimes buffer pool growth occurs in a way that buffer descriptor blocks get allocated from SQLOS node A but the pages referenced in the bdbs actually come from SQLOS node B. The result of this condition is that a portion of physical memory controlled by SQLOS gets double-counted: the same memory is accounted on node A (where the bdbs live) as [Database Node Memory] AND accounted on SQLOS node B as [Stolen Node Memory]. That situation is confusing and inefficient... but its not the full bloom of the problem yet.

The problem fully blooms when so much Node B [stolen node memory] is also Node A [database node memory] that Node B [database node memory] drops to ~2% of Node B [target node memory]. When that happens, the rate of [\SQLServer:Buffer Manager\Free list stalls/sec] skyrockets - we saw 2000/sec when this happened to us. SQL Server is trying to correct the issue (too little [database node memory]) on Node B by trimming various types of cache on Node B. But it can't!! Because the [stolen node memory] isn't in any of the various expected cache types.

Temporary resolution: when [total node memory] approaches [target node memory] but [database node memory] approaches 2% of [target node memory], execute DBCC DROPCLEANBUFFERS.

kb4536005 resolves this issue in SQL Server 2017 CU20 and SQL Server 2019 CU2. https://support.microsoft.com/en-us/help/4536005/improvement-fix-incorrect-memory-page-accounting-that-causes-out-of-me

There is a similar sounding fix in SQL Server 2016 SP2 CU5, kb4470916. https://support.microsoft.com/en-ca/help/4470916/fix-out-of-memory-error-occurs-when-database-node-memory-kb-drops-belo

However, I don't believe kb4470916 resolves the issue with double-accounting. So while it may improve SQL Server response to a single SQLOS node having [database node memory] at the ~2% threshold, i think it leaves open the possibility of poking the bear due to this double-counting. And that may be the situation you are in.

However, if the sum of [stolen node memory] across the two nodes always aligns with [stolen server memory] in the instance, you can forget all about this as if it were a bad dream. :-)

Licenciado bajo: CC-BY-SA con atribución
No afiliado a dba.stackexchange
scroll top