Is it normal to share CPUs on a VM of SQL Server?

https://dba.stackexchange.com/questions/246589

07-02-2021
|

Question

Our IT set up a SQL Server as a VM on a big VMWare box that contains other VMs. The CPUs are setup as shared. As a result, any queries that might need multiple CPUs take 30x longer than if I limited it to a single CPU. Example:

SELECT TOP 2000 lwa.Message INTO #foo
FROM dbo.LogWidgetsAPI lwa (NOLOCK)
ORDER BY lwa.TimeStamp

SELECT TOP 2000 lwa.Message INTO #foo
FROM dbo.LogWidgetsAPI lwa (NOLOCK)
ORDER BY lwa.TimeStamp
OPTION (MAXDOP 1)  ------------- Force it to run on a single CPU

1st example uses parallelism and takes about 30 seconds or so. 2nd one forces use of a single CPU and take 20 milliseconds.

Note: after running the single CPU query I go back to the running the multi-cpu one and the timing & plan is the same - so I don't think the issue is related to "cold cache" vs "warm cache"

So my theory is that because first query uses multiple CPUs, it must wait till all CPUs in question are idle, and therefore it just waits.

So my question. Should SQL Server VMs have dedicated CPUs or shared ones are normal?

Here is the plan which uses parallelism. Here is the plan which forces a single CPU.

La solution

Is it normal to share CPUs on a VM of SQL Server?

Yeah, it's pretty common. A lot of times VMs are used to consolidate lots of SQL Servers (especially ones that don't have extreme performance requirements) onto one host. This can save on licensing costs, since SQL Server can be licensed at the host level.

Is it a good idea? I mean, it depends a lot on how oversubscribed the VM is, and how CPU intensive the workloads are.

Looking at the screenshots of the two execution plans, they are basically the same except for the parallelism. One problematic area in the parallel plan is the serial zone where the "Top" operator lives:

There is some overhead associated with bringing all the rows together onto one thread, and then redistributing them for the parallel insert. I wouldn't expect that overhead to be 30 seconds though.

So my theory is that because first query uses multiple CPUs, it must wait till all CPUs in question are idle, and therefore it just waits.

No, that's not how parallelism works in SQL Server. The threads that are scanning the clustered index at the top-right of the plan can do very uneven levels of work depending on how busy the different CPUs are.

Now, if this instance of SQL Server is so busy that all of the available threads are being used for other queries, then the parallel query might be waiting on THREADPOOL. Which brings me to my next point:

The parallel query is most likely waiting on some resource. I would start by looking at the "WaitStats" portion of the execution plan in SSMS:

That'll be in the "Properties" window of the left-most operator in your plan. As one example, a really high SOS_SCHEDULER_YIELD value in this case could be a sign of this SQL Server instance not getting a turn on the hosts CPU. Jonathan Kehayias has a really good post on that topic here:

CPU Ready Impact on SOS_SCHEDULER_YIELD

You could also compare the ratio of elapsed time to CPU time in the two queries. Those numbers are in the same properties window:

If the ratio is significantly different between the two queries, that's another sign that the parallel query is waiting on some resource.

If you have access to the host / virtualization, you could look into stats directly there to see if guests are waiting long periods to be scheduled on the CPU. Jonathan has another post about that here, which is specific to VMWare: CPU Ready Time in VMware and How to Interpret its Real Meaning

Autres conseils

_{Answer content originally left in comments}

No, because VMWare doesn’t even see SQL Server, it only sees Windows. It schedules CPUs at the level of the VM, not at the level of a process in a VM. – Gaius

You license SQL by the core... why are you sharing that with anything other than SQL? Yes, oversubscribing CPU's are normal practice for virtualization, but not typically done for SQL. – Jonathan Fite

Compare the Logical IOs for the two queries. If they are in the same ratio as the elapsed time, then the hypervisor is not a problem. You can't just compare CPU time here, as a VM thread can be scheduled in the guest, but waiting on the host.– David Browne - Microsoft

You need to monitor the cpu ready statistics on the vm host to know if there is oversubscribing happening to answer that question.

One thing to keep in mind with VMware and shared cpus is that it has to obtain the number of vcpus that the vm has been assigned before it will execute the instructions against the cpu.

In other words if your vm is assigned 8 vcpus but only 4 are available at the time it’s has instructions to process it will have to wait until it gets all 8 before continuing.

On an oversubscribed host you will see how often the vm is waiting for its turn on the cores in the cpu ready metrics. That metric with values regularly over 5% is an strong indicator that there is a cpu constraint. – Aaron

Licencié sous: CC-BY-SA avec attribution

Non affilié à dba.stackexchange