Excessive CPU use in vendor SQL Server DB

https://dba.stackexchange.com/questions/106246

28-09-2020
|

Question

Background: I've got a SQL Server 2012 DB provided by a vendor so modification of queries and tables is limited. We do own the DB though, so we can add and maintain indexes.

Indexes haven't been maintained or rebuilt so there are hundreds sitting at 30%+ fragmentation... this is my initial suspect of massive and constant CPU use, but while we work on fixing this I'm investigating other issues.

I'm not seeing any significant memory or disk IO pressure. This is a relatively lightly used OLTP system and has been well provisioned for resources... it really shouldn't be having any issues, or at least should only have noticeable spikes, no constant CPU use.

Two questions:

Can out of date statistics and highly fragmented indexes throughout the DB cause the excessive CPU use?
Does the combination of wait stats listed below from this system discredit the index fragmentation explanation?

Information:

WaitType                                    Wait_S
---------------------------------           -----------    
CXPACKET                                    773345.21
PAGELATCH_UP                                737295.83
SOS_SCHEDULER_YIELD                         140425.24
LATCH_EX                                    69877.95
RESOURCE_SEMAPHORE_QUERY_COMPILE            60985.48
LCK_M_SCH_S                                 39488.17

Source query for the wait results:

WITH [Waits] AS
(
 SELECT
     [wait_type],
     [wait_time_ms] / 1000.0 AS [WaitS],
     ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
     [signal_wait_time_ms] / 1000.0 AS [SignalS],
     [waiting_tasks_count] AS [WaitCount],
     100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
     ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
 FROM 
     sys.dm_os_wait_stats
 WHERE 
     [wait_type] NOT IN (... common waits )
     AND [waiting_tasks_count] > 0)
SELECT
    MAX ([W1].[wait_type]) AS [WaitType],
    CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S],
    CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S],
    CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S],
    MAX ([W1].[WaitCount]) AS [WaitCount],
    CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage],
    CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S],
    CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S],
    CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum]
HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95;

Solution

Indexes haven't been maintained or rebuilt so there are hundreds sitting at 30%+ fragmentation... this is my initial suspect of massive and constant CPU use...

and

Can out of date statistics and highly fragmented indexes throughout the DB cause the excessive CPU use?

This is partially true. Index fragmentation wont cause HIGH CPU. Internal Fragmentation means that you have lots of free space on the pages and it will take more time to scan the index. This will incur more disk IO and will require more memory to store the index (due to free spaces in the index pages) which means more wasted space in buffer pool.

Bad statistics will cause the query optimizer to generate inefficient(bad) plans causing a degraded performance e.g. queries that took 2 secs to complete will take 2 mins or 2 hours, etc as sql server will make a bad guess (e.g. will estimate 1 row as opposed to actual 2M rows) and may choose an inapproprate join performing high number of reads or can choose a bad join e.g. nested loop where a hash or merge join would have been a better choice. Bad (outdated or old) stats will peg your CPU at a much higher level.

So keeping your Statistics and Indexes defragmented will definitely help. Instead of crafting your own solution, I would recommend to use Ola's Index Maintenance solution.

Refer to Kendra's excellent post : Why Index Fragmentation and Bad Statistics Aren’t Always the Problem (Video) ?

Remus Rusanu has a really good blog post on : The Bizzaro Guide to SQL Server Performance (caution : Don't follow it !)

Does the combination of wait stats listed below from this system discredit the index fragmentation explanation?

I believe that there might be more things to address at your sql server configuration level than just to worry about Index fragmentation. Also CXPACKET wait in itself is not a problem.

Things to check :

MAXDOP - set it away from default to a more sensible value.
Make sure you have TF 1117 and 1118 turned on for addressing tempdb contention. and have multiple tempdb data files as per kb-2154845
Start looking at your queries - Top 20 CPU Consumers to see if they are using proper indexes , not doing any implicit conversions, etc.
Turn ON optimize for ad hoc workloads
Refer to MS KB 2964518: Recommended updates and configuration options for SQL Server 2012 and SQL Server 2014 with high-performance workloads
Make sure that the server is using High Performance power plan option.

You can use Glenn Berry's diagnostic queries - 2012 version

-- Signal Waits for instance  (Query 27) (Signal Waits)
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2)) 
AS [% Signal (CPU) Waits],
CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM (wait_time_ms) AS NUMERIC(20,2)) 
AS [% Resource Waits]
FROM sys.dm_os_wait_stats WITH (NOLOCK) OPTION (RECOMPILE);

-- Signal Waits above 15-20% is usually a sign of CPU pressure

and

-- Top Cached SPs By Total Worker time (SQL Server 2012). Worker time relates to CPU cost  (Query 44) (SP Worker Time)
SELECT TOP(25) p.name AS [SP Name], qs.total_worker_time AS [TotalWorkerTime], 
qs.total_worker_time/qs.execution_count AS [AvgWorkerTime], qs.execution_count, 
ISNULL(qs.execution_count/DATEDIFF(Second, qs.cached_time, GETDATE()), 0) AS [Calls/Second],
qs.total_elapsed_time, qs.total_elapsed_time/qs.execution_count 
AS [avg_elapsed_time], qs.cached_time
FROM sys.procedures AS p WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS qs WITH (NOLOCK)
ON p.[object_id] = qs.[object_id]
WHERE qs.database_id = DB_ID()
ORDER BY qs.total_worker_time DESC OPTION (RECOMPILE);

-- This helps you find the most expensive cached stored procedures from a CPU perspective
-- You should look at this if you see signs of CPU pressure

Additionally Joe Sack talks about Troubleshooting Methodology for SQL Server CPU Performance Issues

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange