Buffer Size Changing?

https://dba.stackexchange.com/questions/115136

29-09-2020

Question

I have a production database that is experiencing wildly fluctuating Page Life Expectancy (PLE) issues. (It crashes to zero at random times.)

I have been researching the PLE issue and have found something that seems to point to a VMWare issue, but I am not sure I am using the data right. It seems like I am losing buffer/cache pages.

I am using this query:

SELECT  COUNT(*) AS cached_pages_count, 
        CASE database_id
            WHEN 32767 THEN 'ResourceDb'
            ELSE DB_NAME(database_id)
        END AS database_name
FROM    sys.dm_os_buffer_descriptors
GROUP BY DB_NAME(database_id), database_id
ORDER BY cached_pages_count DESC;

(Found here)

I am totaling the results (the count) before and after my PLE crashes. An example is 1,097,820 before and 131,394 after. So I seem to "lose" 966,426 pages.

My guess is that the hardware for all the virtual machines is under stress, so it will randomly swap out some memory from the server for a while. (This is just a guess.) When that happens all the pages are lost, so the PLE plummets.

So, am I using the sys.dm_os_buffer_descriptors view correctly? From what I read it always shows used buffer/cached pages. So if it is empty (or significantly reduced), I either don't have the memory anymore, or it is empty. (I would love a way to confirm this conclusion.)

Or is there another explanation as to why the count drops so much?

Information below the line was added from the OP's comments

Our System Admins manage the VMs. I am hoping to understand my query before I go to them with this data. The timing of the PLE crashes seems random from the database point of view. (No re-indexing or other high performance stuff happening during the PLE Crashes)

I have done a ton of work to see if it was work load related. And while there is one poorly performing query, it is not enough to use up all the cache. [There is] no rebuilding or other non-routine user activity on the server when the buffer counts go down. And even if it was, would I not see that being used in my query above? (Meaning if it was a SQL Server action, wouldn't the counts stay the same, just with different stuff?)

I don't have access to the VMWare settings. I was hoping to understand my findings better before involving those that do. The point of this question was to ensure I was using the view correctly first.

At the end of the comment chain:

I was trying to say that the PLE issue led me to the loss of buffer pages issue. The query I was using to get PLE would show a low PLE because the pages were being lost, so what was in them was gone. It was a false reading because the amount of memory was reduced.

Here is my @@Version:

Microsoft SQL Server 2012 (SP1) - 11.0.3128.0 (X64) 
    Dec 28 2012 20:23:12 
    Copyright (c) Microsoft Corporation
    Enterprise Edition (64-bit) on Windows NT 6.2 <X64> (Build 9200: ) (Hypervisor)

Solution

Q: I have a production database that is experiencing wildly fluctuating Page Life Expectancy (PLE) issues. (It crashes to zero at random times.)

Let me ask: what is the output of SELECT @@VERSION? To what SP and CU level is your SQL Server patched? I ask because there was a bug in SQL Server 2012 that forced PLE to plummet the way you are observing. This bug was fixed in SQL Server 2012 SP1 CU4. To be on the safer side, I would recommend applying SQL Server 2012 SP2 instead of going for CU4.
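A minimal sketch of how the patch level could be read, assuming a login that can query server properties. `SERVERPROPERTY` with `'ProductVersion'`, `'ProductLevel'`, and `'Edition'` is standard; the column aliases are only illustrative.

```sql
-- Full version banner, including the build number and the (Hypervisor) flag.
SELECT @@VERSION AS version_banner;

-- Build number and service pack level as separate values.
SELECT
    SERVERPROPERTY('ProductVersion') AS product_version,  -- build number, 11.0.x for SQL Server 2012
    SERVERPROPERTY('ProductLevel')   AS product_level,    -- RTM, SP1, SP2, ...
    SERVERPROPERTY('Edition')        AS edition;
```

The build number can then be compared against Microsoft's published build list to see which cumulative update, if any, is installed.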

It is sometimes normal for PLE to fluctuate on a system with high activity. This follows from how the PLE code works in SQL Server. But the fact that it is plummeting to zero quite frequently makes me believe you might be hitting the bug I mentioned above.

As per Microsoft Bug fix detail

You may experience slow performance in SQL Server 2012. When you check SQL Server Performance Monitor tools, you see the following:

• A rapid decline in the SQLServer:Buffer Manager\Page life expectancy performance counter values. When this issue occurs, the counter is near 0.

PLE is a measure of how volatile your buffer pool is; it is also a measure of the amount of I/O activity going on in your SQL Server. MSDN says:

Page life expectancy - Indicates the number of seconds a page will stay in the buffer pool without references

Believe me, this definition is incomplete. It describes PLE only in terms of time. In my experience it is really a measure of I/O activity on the server: the greater the I/O activity, the more volatile the buffer pool, and the more PLE fluctuates.
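One way to read PLE directly from inside SQL Server is the `sys.dm_os_performance_counters` DMV. The sketch below assumes the default counter naming; on a named instance the object name carries an `MSSQL$InstanceName` prefix, which is why it is matched with `LIKE`. The column aliases are illustrative.

```sql
-- Page life expectancy in seconds.
-- Buffer Manager gives the instance-wide value; Buffer Node gives one row per NUMA node.
SELECT
    RTRIM([object_name])   AS counter_object,
    RTRIM(instance_name)   AS numa_node,      -- empty for the Buffer Manager row
    cntr_value             AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND ([object_name] LIKE N'%Buffer Manager%'
       OR [object_name] LIKE N'%Buffer Node%');
```

Sampling this on a schedule and storing the results makes it easier to line up PLE drops with other events on the server.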

Q: My guess is that the hardware for all the virtual machines is under stress, so it will randomly swap out some memory from the server for a while.

If you believe this is the case and you do not want SQL Server to be a victim of such issues, make sure the SQL Server service account has the Locked Pages in Memory (LPIM) privilege. This prevents the OS from forcing SQL Server to page out its memory. If the account running the SQL Server service is Local System, SQL Server will have this privilege by default in SQL Server 2012.

Note:

This is a workaround. The real solution is to find out what is putting the VM under stress and fix that. If you suspect VMware ballooning is the issue, you can use the RAMMap tool to track the memory consumed under Driver Locked. If RAMMap shows Driver Locked taking a huge amount of memory, that is a sign of VMware ballooning. Ask the team that manages the VMs to configure or disable ballooning for the virtual machine on which SQL Server is running.
Before granting LPIM, make sure you have set an optimum value for max server memory and have left ENOUGH memory for the OS to perform efficiently.
If you do not follow the above two points and the OS comes under severe memory pressure, OS processes will be paged out, because with LPIM the OS cannot force SQL Server to release its memory (it is locked and non-pageable). This leads to tremendous slowness of OS processes.
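As a rough check of both points, the following sketch looks at whether locked pages are in use and what the memory limits are set to. It assumes VIEW SERVER STATE permission; the interpretation in the comments is a rule of thumb, not a guarantee.

```sql
-- Memory held by the SQL Server process.
-- A non-zero locked_page_allocations_kb suggests LPIM is in effect.
SELECT
    physical_memory_in_use_kb,
    locked_page_allocations_kb,
    memory_utilization_percentage,
    process_physical_memory_low,   -- 1 = process is responding to low physical memory
    process_virtual_memory_low
FROM sys.dm_os_process_memory;

-- Configured (value) and running (value_in_use) memory limits for the instance.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN (N'min server memory (MB)', N'max server memory (MB)');
```

If max server memory is still at its default (effectively unlimited), that is worth fixing before LPIM is granted.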

Q: So, am I using the sys.dm_os_buffer_descriptors view correctly? From what I read it always shows used buffer/cached pages. So if it is empty (or significantly reduced), I either don't have the memory anymore, or it is empty. (I would love a way to confirm this conclusion.)

As already mentioned, sys.dm_os_buffer_descriptors returns information about all the data pages that are currently in the SQL Server buffer pool. IMHO the buffered page counts are affected by I/O activity on the server and are thus indirectly related to PLE. If there is a request to fetch a large number of pages from disk into memory, it is quite possible that SQL Server will evict existing data pages (writing any dirty ones to disk first) to make room in the buffer pool for the new pages, thus decreasing the number of data pages in memory for a particular database.

So what you are seeing via sys.dm_os_buffer_descriptors is not incorrect, but I would not suggest using the buffer descriptors DMV to gauge PLE on the server. That would not be a correct approach.
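To tell "pages were replaced" apart from "the buffer pool itself shrank", it may help to watch the memory targets alongside the page counts. For scale, at 8 KB per page the 1,097,820 pages in the question are about 8.4 GB and the 131,394 pages afterwards about 1.0 GB. The sketch below reads counters that exist in SQL Server 2012; treating a falling Target Server Memory as a sign of external pressure is an assumption to verify, not a rule.

```sql
-- Total vs. target memory for the instance, plus buffer pool page counts.
-- If Target Server Memory (KB) drops at the same time as Database pages,
-- SQL Server was asked to give memory back rather than simply reusing pages.
SELECT
    RTRIM([object_name]) AS counter_object,
    RTRIM(counter_name)  AS counter_name,
    cntr_value
FROM sys.dm_os_performance_counters
WHERE ([object_name] LIKE N'%Memory Manager%'
       AND counter_name IN (N'Total Server Memory (KB)', N'Target Server Memory (KB)'))
   OR ([object_name] LIKE N'%Buffer Manager%'
       AND counter_name IN (N'Database pages', N'Target pages'));
```

Capturing these values before and after a drop gives the system administrators something concrete to compare with the hypervisor's own memory statistics.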

OTHER TIPS

This was a group effort and my role is mostly as a curator.

There are many reasons why you could be seeing the results you're seeing.

Zane offered a few potential causes when he commented:

Is the VM overcommitted on memory? Are other activities peaking during this time, so that Windows has to take memory back from SQL Server? Does this happen during high load times? What other processes run on this machine?

Tom V also offered some potential causes in his comment:

Do you have index maintenance at that time? If you think it's a vmware issue, do you have access to the vmware console? If so what is the ballooning status? What does MCTLSZ say in esxtop?

swasheck mentioned the importance of investigating the workload too:

In addition to the vmware implications that have rightly been raised, you've also not told us anything about your workload, meaning are you rebuilding indexes, writing to pages, etc.

Since the VM/memory pressure seems to be a likely suspect you should ask the sysadmins some basic questions.

Some suggested questions to ask in a non-accusatory manner include:

Ask the System Admins if they have allocated fixed or dynamic memory to your VM. - Aaron Bertrand
If they are ballooning or over-allocating [memory]. - Zane

It also seems like you are confusing PLE and the number of Buffer Pages in memory

Several people mentioned this issue including swasheck initially and Max Vernon who said:

As @swasheck said, the numbers you reference in your question are not PLE. They are the number of buffer pages in memory. PLE is "Page Life Expectancy", which can go up or down without any change in the number of buffer pages in memory. PLE is a measure of how long the average data page will stay in memory. I've seen servers where this fluctuates from tens of thousands down to 0 without any loss in the number of pages allocated in memory. If PLE truly is low, that indicates a totally different problem than the number of buffer pages decreasing unexpectedly.

Zane clarified the role of PLE when he said:

Yeah, the issue with using PLE here is that it doesn't indicate an actual loss in memory available to the buffer pool. It's more about measuring the turnaround of how often pages are flushed out to make way for new data.

Better options for checking out memory problems

Max Vernon suggested using the following query:

SELECT * FROM sys.dm_os_sys_memory ORDER BY system_memory_state_desc

Kin also suggested that:

The system_health session will give you a clear picture of whether it was internal or external memory pressure, with the low memory notification.

That is an Extended Events session that runs in the background without noticeably affecting performance.
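A hedged sketch of how the system_health data could be inspected: it reads the session's ring_buffer target and lists the captured events by name and time. The session and target names are the SQL Server defaults; which event names point to memory pressure on a given build is left for the reader to confirm, so the query deliberately does not filter on them.

```sql
-- List the events currently held in the system_health ring buffer, newest first.
WITH target AS (
    SELECT CAST(t.target_data AS xml) AS target_xml
    FROM sys.dm_xe_sessions AS s
    JOIN sys.dm_xe_session_targets AS t
        ON t.event_session_address = s.address
    WHERE s.name = N'system_health'
      AND t.target_name = N'ring_buffer'
)
SELECT
    e.ev.value('(@name)[1]', 'nvarchar(128)')   AS event_name,
    e.ev.value('(@timestamp)[1]', 'datetime2')  AS event_time_utc,
    e.ev.query('.')                             AS event_xml   -- full payload for inspection
FROM target
CROSS APPLY target.target_xml.nodes('/RingBufferTarget/event') AS e(ev)
ORDER BY event_time_utc DESC;
```

The ring buffer only keeps recent events, so this needs to be run reasonably soon after a drop in buffer pages to catch anything relevant.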

Licensed under: CC-BY-SA with attribution
