Question

I have a production system running SQL Server 2019 Standard edition. It recently became unresponsive three times in one day, each time requiring a reboot to recover. Errors seem to point to memory limitations. 32GB is installed and the machine is dedicated to MSSQL, with Max Memory set to 26GB.

(edited 2020-11-23 to remove some erroneous info re: MEMORYCLERK_SQLGENERAL.)

The most damning info from the logs is a series of FAIL_PAGE_ALLOCATION errors.

Note: Since having the problem I've bumped up to a larger instance size (64GB)

Update 2020-11-22: The problem has reoccurred on the new larger instance. The memory errors began a few minutes after a scheduled backup of several databases. That may or may not be coincidence. It became nonresponsive and I forcibly restarted it.

  • I will start the MSFT support process as suggested by Josh and sqL_handLe.
  • I've made an Agent job (sproc here) to collect the results from dm_os_memory_objects every 5 minutes, based on sqL_handLe's query (a rough sketch is shown after this list).
  • I've shared recent logs here. (I replaced the nl tags with actual newlines to make it easier to read.)
  • Disabled SSAS as suggested by Jana Sattainathan. I did confirm it was running and had 13GB "Min Working Set" as shown in the linked article, even though its actual working set was small. I think it's possible this subtle issue is the culprit, but if it's as dangerous as it seems, I'm very surprised I haven't run into it before. Jana, consider converting your comment to an answer?
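
Roughly, the collection job just snapshots the memory-object DMV output into a logging table on each run. A minimal sketch of that idea follows; the table and procedure names are placeholders, not the actual objects:

-- Hypothetical logging table and procedure; the real job/sproc may differ.
CREATE TABLE dbo.memory_object_snapshot
(
    collect_time    datetime2(0)  NOT NULL DEFAULT (sysdatetime()),
    clerk_type      nvarchar(60)  NOT NULL,
    object_type     nvarchar(60)  NOT NULL,
    pages_in_bytes  bigint        NOT NULL,
    memory_node_id  int           NOT NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.collect_memory_objects
AS
BEGIN
    SET NOCOUNT ON;

    -- Same join as sqL_handLe's query below, written into the logging table.
    INSERT INTO dbo.memory_object_snapshot (clerk_type, object_type, pages_in_bytes, memory_node_id)
    SELECT omc.type, omo.type, omo.pages_in_bytes, omo.memory_node_id
    FROM sys.dm_os_memory_objects AS omo
    JOIN sys.dm_os_memory_clerks AS omc
        ON omo.page_allocator_address = omc.page_allocator_address;
END;
GO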

The non-SQL load on this server is near-zero. The backup agent is just Ola Hallengren's TSQL scripts. The default SQL services are running but not in use. As mentioned, I've stopped SSAS. We occasionally use SSIS but not since the instance upgrade.


Solution

Identifying large memory objects associated with MEMORYCLERK_SQLGENERAL may make this situation on November 18 more clear.

-- memory objects in MEMORYCLERK_SQLGENERAL
SELECT omo.type, omo.pages_in_bytes, omo.partition_type_desc, omo.memory_node_id
FROM sys.dm_os_memory_objects omo
JOIN sys.dm_os_memory_clerks omc 
ON omo.page_allocator_address = omc.page_allocator_address
WHERE omc.type = 'MEMORYCLERK_SQLGENERAL'
ORDER BY pages_in_bytes DESC;

Recently I saw a case where automatic seeding of an availability group replica was generating errors on the remote side due to insufficient disk space. In that case, memory object MEMOBJ_DBCOPY grew to be quite large over time (MEMOBJ_DBCOPY is in MEMORYCLERK_SQLGENERAL).
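
If automatic seeding is in the picture and you want a quick yes/no on that specific object, the same query can be narrowed down (assuming the type value appears as MEMOBJ_DBCOPY, as it did in that case):

-- quick check on the DBCOPY memory object specifically
SELECT omo.type, omo.pages_in_bytes, omo.memory_node_id
FROM sys.dm_os_memory_objects omo
WHERE omo.type = N'MEMOBJ_DBCOPY'
ORDER BY omo.pages_in_bytes DESC;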

Editing this to add ~~ Obviously, if MEMORYCLERK_SQLGENERAL just keeps growing, it'll consume memory that could otherwise be used for other activities, and performance will degrade over time. At some point OOM errors will occur - for example, there may not be enough memory available in a resource pool, and SQL Server will report that with an error.

And there's one more threshold/behavior to be aware of. Once SQL Server "total server memory" is no longer in its initial growth phase after startup, if the database cache declines to 2% of "target server memory" on any one SQLOS memory node (in this case it looks like there is only one memory node), most if not all of the user worker threads will likely try to correct the situation, and a large amount of CPU will be expended performing "free list stalls".

Microsoft mentions this briefly in this kb article (although the kb article doesn't shed any light on what might be causing MEMORYCLERK_SQLGENERAL to be growing so large in this case).

KB4536005 - Improvement: Fix incorrect memory page accounting that causes out-of-memory errors in SQL Server https://support.microsoft.com/en-us/help/4536005/kb4536005-improvement-fix-incorrect-memory-page-accounting-that-causes
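
To watch for that database-cache-near-2%-of-target / free list stall condition without waiting for the error log, the relevant perfmon counters are also exposed through sys.dm_os_performance_counters. A quick check (the object names assume a default instance; named instances use a different prefix):

-- database cache vs. target, plus free list stalls
-- (cntr_value for the stalls counter is cumulative; sample twice and diff for a rate)
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Target Server Memory (KB)',
                       N'Total Server Memory (KB)',
                       N'Database Cache Memory (KB)',
                       N'Free Memory (KB)',
                       N'Free list stalls/sec');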

In some cases where SQL Server memory gets into internal trouble, using DBCC DROPCLEANBUFFERS, DBCC FREEPROCCACHE, or DBCC FREESYSTEMCACHE ('ALL') can rescue the instance for the time being. Unfortunately, that's not the case here. You have to actually identify the memory objects involved in order to bring MEMORYCLERK_SQLGENERAL back under control.

Edited to discuss November 22 SQL Server log... ~~

In the November 22 memorystatus output, memoryclerk_sqlgeneral doesn't seem to be of immediate concern: immediately after each of the many errors, memoryclerk_sqlgeneral is at or about the same size of 22040 kb. Other extremely interesting details: [system physical memory high] is 1 (hinting SQL Server memory could grow) and [current committed] is less than [target committed] (also hinting SQL Server memory should be able to grow).
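
(Those same signals can be cross-checked live from DMVs rather than DBCC MEMORYSTATUS: sys.dm_os_sys_memory reports the physical-memory signal state, and sys.dm_os_sys_info reports committed vs. target committed.)

-- high/low physical memory signals (the [system physical memory high/low] indicators)
SELECT system_high_memory_signal_state, system_low_memory_signal_state,
       available_physical_memory_kb
FROM sys.dm_os_sys_memory;

-- current vs. target committed memory for the instance, in KB
SELECT committed_kb, committed_target_kb
FROM sys.dm_os_sys_info;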

The full SQL Server log shows the increase of RAM in the instance.

11/18/2020 10:15:11,Server,Unknown,Detected 32409 MB of RAM. This is an informational message; no user action is required.

11/18/2020 15:26:01,Server,Unknown,Detected 64723 MB of RAM. This is an informational message; no user action is required.

We also see the modifications to [Max Server Memory].

11/18/2020 10:33:03,spid96,Unknown,Configuration option 'max server memory (MB)' changed from 300000 to 280000. Run the RECONFIGURE statement to install.

11/19/2020 11:34:10,spid94,Unknown,Configuration option 'max server memory (MB)' changed from 280000 to 60000. Run the RECONFIGURE statement to install.

Both 300000 mb and 280000 mb were too large for SQL Server to attain on a VM with 32409 mb of memory. So SQL Server chose lower values of [target server memory] on the VM that it could attain, and managed memory based on those values.

What is likely less intuitive is that on the new VM, for reasons currently unknown, SQL Server also chose a lower target than the new [Max Server Memory] value of 60000 mb. That can be seen in memorystatus sections like the following.

11/22/2020 09:06:59,spid69,Unknown,Memory Manager               KB                             
---------------------------------------- ----------
VM Reserved                               101157820
VM Committed                                 636280
Locked Pages Allocated                     51513180
Large Pages Allocated                             0
Emergency Memory                               1024
Emergency Memory In Use                          16
Target Committed                           58468304
Current Committed                          52149464
Pages Allocated                            50371824
Pages Reserved                               273864
Pages Free                                    35352
Pages In Use                                7855040
Page Alloc Potential                       50512960
NUMA Growth Phase                                 0
Last OOM Factor                                   3
Last OS Error                                     0

The [Target Committed] value here of 58468304 kb corresponds to what appears in perfmon as [Target Server Memory]. It's approximately 57098 mb. Many of the errors have this value. The other value I see in there for [Target Committed] is 57882440 kb, or approximately 56526 mb.

If the non-SQL Server workload on this VM is as expected (e.g. the backup agent if there is one, other SQL Server related stuff like SSIS, SSAS, SSRS, and any other apps including SSMS or ADS), I recommend setting [Max Server Memory] no higher than 56500 mb for this instance. That may result in better memory management from SQL Server.
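
For reference, the change itself is the same sp_configure option that shows up in the log entries above:

-- cap max server memory at 56500 MB (adjust to whatever value you settle on)
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', 56500;
RECONFIGURE;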

However, there's still the question: if memorystatus results show [system physical memory high], why hasn't SQL Server grown [Current Committed] closer to [Target Committed], rather than triggering OOM-type errors? I don't have an answer for that yet.

~~ Aha! I missed this before. Parsing MEMORYSTATUS results from the error log can be tricky: the newest lines of the error log are at the top. But the memorystatus results within each error log line, separated with the nl tags, should not be reversed like the error log lines. First reverse the full lines of the error log, then replace the nl tags with newlines.

Correct order is: clerk, VM Reserved, VM Committed, Locked Pages Allocated, SM Reserved, SM Committed, Pages Allocated.

11/22/2020 09:06:59,spid69,Unknown,MEMORYCLERK_SQLBUFFERPOOL (node 0)               KB
---------------------------------------- ----------
VM Reserved                                25686200
VM Committed                                 141312
Locked Pages Allocated                      1072140
SM Reserved                                       0
SM Committed                                      0
Pages Allocated                            44537424
11/22/2020 09:06:59,spid69,Unknown,MEMORYCLERK_SQLGENERAL (node 0)                  KB
---------------------------------------- ----------
VM Reserved                                       0
VM Committed                                      0
Locked Pages Allocated                            0
SM Reserved                                       0
SM Committed                                      0
Pages Allocated                               22040

The easiest way to confirm this ordering is by using sqlcmd against the instance. This should give the details for each clerk in the correct order.

sqlcmd -Q "dbcc memorystatus"
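
If you want to keep a copy around for comparison over time, sqlcmd's -o switch writes the output to a file (the file name here is just an example):

sqlcmd -Q "dbcc memorystatus" -o memorystatus.txt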

So scrolling all the way back to November 18 in the logs on Google Drive we find:

11/18/2020 15:10:48,spid51,Unknown,MEMORYCLERK_SQLBUFFERPOOL (node 0)               KB
---------------------------------------- ----------
VM Reserved                                12867644
VM Committed                                  75776
Locked Pages Allocated                       546740
SM Reserved                                       0
SM Committed                                      0
Pages Allocated                            22821672
11/18/2020 15:10:48,spid51,Unknown,MEMORYCLERK_SQLGENERAL (node 0)                  KB
---------------------------------------- ----------
VM Reserved                                       0
VM Committed                                      0
Locked Pages Allocated                            0
SM Reserved                                       0
SM Committed                                      0
Pages Allocated                               22416

During the events on November 18, it wasn't MEMORYCLERK_SQLGENERAL that was over 21.75 gb in size. That was MEMORYCLERK_SQLBUFFERPOOL, the database cache. MEMORYCLERK_SQLGENERAL was a much more reasonable 22416 kb in size on November 18 during the OOMs.
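
(If this recurs, the same comparison can be made live, without parsing the error log, by looking at clerk sizes directly - a quick sketch against the clerk DMV:)

-- largest memory clerks right now, in MB; with locked pages in memory,
-- some buffer pool memory is reported under awe_allocated_kb
SELECT TOP (10)
       omc.type,
       SUM(omc.pages_kb + omc.virtual_memory_committed_kb + omc.awe_allocated_kb) / 1024 AS clerk_mb
FROM sys.dm_os_memory_clerks AS omc
GROUP BY omc.type
ORDER BY clerk_mb DESC;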

There's some kind-of-good news in this: although this is something that will still take Microsoft's involvement to fix (because SQL Server really shouldn't throw OOMs when there's room to grow into [Max Server Memory]), if the problem persists these are the kind of conditions in which clearing caches and dropping clean buffers can prevent needing to restart the service or having to reboot the VM. I saw something similar - very similar, actually - on SQL Server 2017 CU18. Once a day we cleared caches and dropped clean buffers while working with Microsoft to try to understand what was going on. That's not ideal, but it's better than crashing. Unfortunately, the problem magically went away before we could diagnose what was really going wrong.

This is pretty drastic for a band-aid, but if the system becomes desperate...

CHECKPOINT
DBCC DROPCLEANBUFFERS -- drop all clean buffers from database cache
DBCC FREEPROCCACHE -- clear plan/procedure cache 
DBCC FREESYSTEMCACHE('ALL') -- clear any other system caches

It's probably redundant to run FREEPROCCACHE in addition to FREESYSTEMCACHE('ALL') - I never checked whether that's the case or if FREEPROCCACHE lends some benefit of its own in that scenario. It's not something you'd want to do any more than necessary, since plans would have to be recompiled and it kicks a lot of stuff out of the database cache. But if you get desperate while waiting to resolve the condition...
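
If the situation repeats, one way to answer that question is to snapshot the plan-cache stores before and after each command and see whether FREEPROCCACHE actually releases anything further - for example:

-- plan cache stores: ad hoc/prepared SQL plans, object plans, bound trees (MB)
SELECT type, SUM(pages_kb) / 1024 AS cache_mb
FROM sys.dm_os_memory_clerks
WHERE type IN (N'CACHESTORE_SQLCP', N'CACHESTORE_OBJCP', N'CACHESTORE_PHDR')
GROUP BY type;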

Licensed under: CC-BY-SA with attribution