How can I Debug a Buffer Issue?

https://dba.stackexchange.com/questions/115671

29-09-2020

Question

I have a production "Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64)" that is showing weird buffer and page life expectancy (PLE) symptoms.

I am running this every minute on my server (to track this issue):

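-- Declarations added so the snippet runs on its own; in the monitoring
-- job, @runCountString presumably comes from the loop counter.
DECLARE @ple VARCHAR(20), @usedBufferPages VARCHAR(20), @runCountString VARCHAR(20) = '1'

SELECT @ple = CAST([cntr_value] AS VARCHAR(20))
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
AND [counter_name] = 'Page life expectancy'

-- 128 pages of 8 KB each make 1 MB, so this value is in MB
SELECT @usedBufferPages = CAST(COUNT(*) / 128 AS VARCHAR(20))
FROM sys.dm_os_buffer_descriptors

DECLARE @StartDate VARCHAR(8) = CONVERT(VARCHAR(8), GETDATE(), 14)
RAISERROR ('%s. PLE at %s and Used Buffers at %s at %s ', 0,
            1, @runCountString, @ple, @usedBufferPages, @StartDate) WITH NOWAIT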

This is some example output:

16. PLE at 858 and Used Buffers at 7290 at 09:51:42 
17. PLE at 918 and Used Buffers at 7342 at 09:52:42 
18. PLE at 978 and Used Buffers at 7408 at 09:53:43 
19. PLE at 1039 and Used Buffers at 7547 at 09:54:43 
20. PLE at 1100 and Used Buffers at 7697 at 09:55:44 
21. PLE at 1160 and Used Buffers at 7901 at 09:56:45 
22. PLE at 1221 and Used Buffers at 7961 at 09:57:46 
23. PLE at 1282 and Used Buffers at 8012 at 09:58:46 
24. PLE at 11 and Used Buffers at 313 at 09:59:46 
25. PLE at 31 and Used Buffers at 966 at 10:00:46 
26. PLE at 90 and Used Buffers at 1580 at 10:01:47 
27. PLE at 151 and Used Buffers at 3072 at 10:02:47 
28. PLE at 211 and Used Buffers at 3152 at 10:03:47 
29. PLE at 271 and Used Buffers at 3729 at 10:04:47

At item #24 SQL Server reports the PLE going from 1,282 to 11. SQL Server also reports that the used buffers go from 8,012 to 313.
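
Collapses like the one at item #24 can be flagged automatically by post-processing the monitor's output. The following is a minimal Python sketch, not part of the original monitoring job. The helper name `find_buffer_drops` and the 50% threshold are assumptions chosen for illustration.

```python
import re

# Matches lines such as "24. PLE at 11 and Used Buffers at 313 at 09:59:46"
LINE_RE = re.compile(
    r"(\d+)\.\s+PLE at (\d+) and Used Buffers at (\d+) at (\d{2}:\d{2}:\d{2})"
)

def find_buffer_drops(log_text, max_drop_ratio=0.5):
    """Return samples whose used-buffer value fell by more than
    max_drop_ratio compared with the previous sample.
    Hypothetical helper; the threshold is an assumption."""
    drops = []
    previous = None
    for match in LINE_RE.finditer(log_text):
        run, ple, used_mb, stamp = match.groups()
        # The monitor divides the page count by 128, so the value is in MB.
        sample = {"run": int(run), "ple": int(ple),
                  "used_mb": int(used_mb), "time": stamp}
        if previous and sample["used_mb"] < previous["used_mb"] * (1 - max_drop_ratio):
            drops.append(sample)
        previous = sample
    return drops

# A few lines copied from the example output above.
sample_log = """
22. PLE at 1221 and Used Buffers at 7961 at 09:57:46
23. PLE at 1282 and Used Buffers at 8012 at 09:58:46
24. PLE at 11 and Used Buffers at 313 at 09:59:46
25. PLE at 31 and Used Buffers at 966 at 10:00:46
"""

for drop in find_buffer_drops(sample_log):
    print(drop)
```

Run against the full log, this would give a list of timestamps to compare against job schedules, backups, or host-level events.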

First I looked for poorly running queries, and I found and fixed a few (which had no effect on the issue). But I am not finding any problem queries that correlate with the times that I have PLE/Buffer issues. Also, if it were a poorly running query, I would expect the buffers to be full of that query's data, not empty/missing/errored.

Next I thought that the Virtual Machine was getting its memory restricted when this happened. But I have asked my System Admin and he assures me that the memory is not dynamic or shared in any way. (What it is assigned, it gets, all the time.) Also, I run this script every 10 minutes, and additionally whenever the PLE reports less than 50:

  SELECT * FROM sys.dm_os_sys_memory

And it reports the same/similar values when the PLE/Buffers are high and when they are low. For completeness, here is an example of the values before and after #24 above:

total_physical_memory_kb    available_physical_memory_kb    total_page_file_kb  available_page_file_kb  system_cache_kb kernel_paged_pool_kb    kernel_nonpaged_pool_kb   system_high_memory_signal_state   system_low_memory_signal_state   system_memory_state_desc
20970996                    4758672                         24378868            7929404                 4844160         686076                  182752                    1                                 0                                Available physical memory is high
20970996                    4743468                         24378868            7892632                 4845000         686580                  182688                    1                                 0                                Available physical memory is high
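
The raw KB figures are easier to compare once converted. This short Python sketch uses only the two rows shown above, and the variable names are illustrative. It shows how little the OS-level view changes across the drop.

```python
KB_PER_GB = 1024 * 1024

# The two sys.dm_os_sys_memory rows from the question (before and after item #24).
samples = [
    {"total_physical_memory_kb": 20970996, "available_physical_memory_kb": 4758672},
    {"total_physical_memory_kb": 20970996, "available_physical_memory_kb": 4743468},
]

for s in samples:
    total_gb = s["total_physical_memory_kb"] / KB_PER_GB
    avail_gb = s["available_physical_memory_kb"] / KB_PER_GB
    print(f"{avail_gb:.2f} GB available of {total_gb:.2f} GB")

# Difference in available physical memory between the two samples, in MB.
delta_mb = (samples[0]["available_physical_memory_kb"]
            - samples[1]["available_physical_memory_kb"]) / 1024
print(f"Change in available memory: {delta_mb:.2f} MB")
```

A change of a few MB in available physical memory is tiny compared with the several GB that disappeared from the buffer pool, which is consistent with the observation that Windows itself is not reporting memory pressure.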

I have checked the System Health Session and it shows nothing related. (All it has are impersonation failures, and their times do not correlate with the times the PLE/Buffers show issues.)

I have tracked how often this occurs, but I cannot see a pattern or connect it to any jobs or scheduled activities.

Here is a graph that shows PLE and Buffers over 21 hours:

So I am stumped. I think the core of the issue is the buffers not the PLE. (I think PLE is getting a false report of low because all the buffers are somehow gone.)

But I can't think of any way that this could happen, or what to do next.

I would love advice on additional things to check or suggestions of what this issue might be.

Updates from questions in the comments:

So, how much memory is the server given? The VM has 20 GB of memory.
What is max server memory?

name                    value   value_in_use  description
max server memory (MB)  13000   13000         Maximum size of server memory (MB)
min server memory (MB)  0       16            Minimum size of server memory (MB)

NOTE: I have done a bit of reading on this just now, and it seems these settings are wrong for my server.

How large is the database? There are two transactional databases running on this server (I am in the process of getting servers to isolate them.) Their sizes are 383 GB and 378 GB.

What other applications and services are running on that server? This server hosts the data for my application. There are no other things hitting it. (I have a replicated Operational Data Store for reports and such.)

What is the VM technology? VMware.
Is this VM running on a host that only hosts VMs with similar resource allocation? We have many VMs at our company. All of varying size. This is one of the largest though.

Can you confirm what your System Admin is telling you about memory allocation without just having to believe him? I cannot. I don't have access to those tools.

(In my experience, System Admins will say a lot of things to pass the buck and blame the app or anyone else if it means they don't have to do anything.) I can fully understand that sentiment.

That pattern certainly seems like severe memory pressure. I agree. I was hoping to find something to prove that SQL Server is feeling memory pressure, so I can send it back to the System Admins for more research.

Wait Time Statistics

WaitType               Wait_S      Resource_S  Signal_S  WaitCount  Percentage   AvgWait_S  AvgRes_S  AvgSig_S 
---------------------- ----------- ----------- --------- ---------- ------------ ---------- --------- ---------
PAGEIOLATCH_SH         16250.10    16219.14    30.96     2171649    29.59        0.0075     0.0075    0.0000   
CXPACKET               14214.03    13238.56    975.47    1187935    25.88        0.0120     0.0111    0.0008   
PAGEIOLATCH_EX         6814.59     6806.21     8.38      638725     12.41        0.0107     0.0107    0.0000   
WRITELOG               5157.42     4873.44     283.98    3588476    9.39         0.0014     0.0014    0.0001   
BACKUPIO               2569.51     2538.12     31.39     1704119    4.68         0.0015     0.0015    0.0000   
LCK_M_IX               2477.15     2477.10     0.05      113        4.51         21.9217    21.9213   0.0004   
ASYNC_IO_COMPLETION    2079.99     2079.66     0.33      836        3.79         2.4880     2.4876    0.0004   
BACKUPBUFFER           1807.75     1759.11     48.64     380189     3.29         0.0048     0.0046    0.0001   
IO_COMPLETION          986.23      985.84      0.39      116112     1.80         0.0085     0.0085    0.0000
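
To read the table at a glance, it helps to group related wait types. The Python sketch below copies the Percentage column from the table above. The helper `share_by_prefix` is a hypothetical name used only for illustration. PAGEIOLATCH waits occur while a session waits for a data page to be read from disk into the buffer pool, so their combined share indicates how much waiting is spent refilling memory.

```python
# Percentages copied from the wait statistics table in the question.
wait_percentages = {
    "PAGEIOLATCH_SH": 29.59,
    "CXPACKET": 25.88,
    "PAGEIOLATCH_EX": 12.41,
    "WRITELOG": 9.39,
    "BACKUPIO": 4.68,
    "LCK_M_IX": 4.51,
    "ASYNC_IO_COMPLETION": 3.79,
    "BACKUPBUFFER": 3.29,
    "IO_COMPLETION": 1.80,
}

def share_by_prefix(waits, prefix):
    """Sum the percentage of all wait types whose name starts with prefix."""
    total = sum(pct for name, pct in waits.items() if name.startswith(prefix))
    return round(total, 2)

print("Data page reads from disk (PAGEIOLATCH_*):",
      share_by_prefix(wait_percentages, "PAGEIOLATCH"))
print("Backup related (BACKUP*):",
      share_by_prefix(wait_percentages, "BACKUP"))
```

With these figures, the PAGEIOLATCH group accounts for roughly 42% of the recorded wait time.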

Solution

As discussed on This SE thread and confirmed by OP.

The issue is due to a bug in SQL Server 2012. This bug was fixed in SQL Server 2012 SP1 CU4. To be on the safe side, I would recommend you apply SQL Server 2012 SP2 instead of going for CU4.

As per the Microsoft bug fix details:

You may experience slow performance in SQL Server 2012. When you check SQL Server Performance Monitor tools, you see the following:

• A rapid decline in the SQLServer:Buffer Manager\Page life expectancy performance counter values. When this issue occurs, the counter is near 0.

OTHER TIPS

Your buffer pool is only 13 GB, while your databases are 383 GB and 378 GB, and you have classified them as OLTP: small transactions running very frequently.
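
As a rough back-of-the-envelope check, the mismatch can be quantified. This Python sketch uses only figures from the question. It assumes that max server memory approximates the buffer pool size and that database size approximates the data volume, both of which are simplifications.

```python
# Figures taken from the question: max server memory and the two database sizes.
MAX_SERVER_MEMORY_MB = 13000
DATABASE_SIZES_GB = [383, 378]

total_data_gb = sum(DATABASE_SIZES_GB)
total_data_mb = total_data_gb * 1024

# Upper bound on the share of the data that can be cached at any one time.
cacheable_fraction = MAX_SERVER_MEMORY_MB / total_data_mb

print(f"Total data: {total_data_gb} GB")
print(f"Share that fits in memory at once: {cacheable_fraction:.2%}")
```

Under these assumptions less than 2% of the data can be in memory at once, so any query that touches a large cold range has to evict pages that other queries may still need.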

You have to understand how SQL Server stores information:

SQL Server stores information in memory in a structure called a memory cache. The information in the cache can be data, index entries, compiled procedure plans, and a variety of other types of SQL Server information. To avoid re-creating the information, it is retained in the memory cache as long as possible and is ordinarily removed from the cache when it is too old to be useful, or when the memory space is needed for new information. The process that removes old information is called a memory sweep. The memory sweep is a frequent activity, but is not continuous.

You are almost certainly experiencing memory starvation due to the sheer size of your databases and your inadequate buffer pool. Refer to: How to determine ideal memory for instance?

Collect wait stats and check for performance issues that arise from wasted buffer pool memory.

Recommendation:

Add more memory to server instance and separate the two databases on different VMs with adequate memory.

There is very little to debug here: you need to add memory, logically split your databases across multiple VMs, or accept that the shuffling you have to do with limited memory will lead to performance issues and volatile PLE. Trying to fit roughly 760 GB of data into 13 GB of memory is like trying to stow away in a backpack.

Look closer at the queries being executed. Memory usage alone on databases is normally too coarse a metric to improve things. Assuming you cannot affect the queries (black box application), it is still worth understanding what is affecting the memory usage. For instance a batch process might go and use all the buffer space in a single hit by querying all data on a massive table.

In particular look for any missing indexes that cause full table scans - as they can effectively flush the cache on the server.

SQL Server has an excellent set of analyser tools that can monitor it in real time, and I suspect you'll see something stick out like a sore thumb once you delve into it.

Not that I'm suggesting changing the database schema, but one thing to look out for is overly large varchar fields - they can really suck up cache space on a large database.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange