Scheduler Yields = CPU Pressure…or do they?
I have been on a number of support calls where the customer insists there is a CPU bottleneck, in some cases pointing to my blog posts or those of others that describe SQL Server scheduler yields as an indication of CPU pressure, even though perfmon’s OS CPU counters don’t support that observation. Once we get to the root of the issue and resolve the problem, the customer (very rightly) wants an explanation of why there were scheduler yields (SOS_SCHEDULER_YIELD) without any noticeable CPU pressure.
In some recent cases I have pointed customers to Mario Broodbakker’s excellent article.
In essence, the article illustrates what can happen when two “killer” queries are executed on the same scheduler. Because the SQLOS attempts to ensure that no one query will monopolize a CPU, queries are forced to yield when appropriate.
For background and sample scripts relating to the SQL Server scheduler, you can look at Slava Oks’s posts SQLOS – unleashed and SQLOS’s DMVs Continued. Both benefit from first understanding the User Mode Scheduler (UMS), which was best explained by the late (and great) Ken Henderson (see Inside the SQL Server 2000 User Mode Scheduler).
Mario points out how:
- when CPU-intensive queries are assigned to different schedulers, yields are not observed; however, when multiple CPU-intensive queries are assigned to the same scheduler, they cause one another to yield, producing a situation that appears to be CPU pressure but is in fact what he refers to as SOS scheduler pressure
- when the overall CPU load on the system is high, a similar phenomenon is observed; however, not only do scheduler yields increase, so do the waits on other events
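The first point can be sketched with a toy model. What follows is not SQLOS code, only a minimal Python simulation under stated assumptions: a fixed 4 ms quantum, tasks that are purely CPU-bound, and a task that yields at the end of its quantum only when another task is runnable on the same scheduler. The names `simulate`, `QUANTUM_MS`, `q1`, and `q2` are made up for the illustration.

```python
from collections import deque

QUANTUM_MS = 4  # assumed quantum length for this toy model


def simulate(assignments, work_ms):
    """Run CPU-bound tasks on cooperative schedulers and count yields.

    assignments maps a task name to a scheduler id; every task needs
    work_ms of CPU time. Returns {scheduler_id: yield_count}.
    """
    queues = {}
    for task, sched in assignments.items():
        queues.setdefault(sched, deque()).append([task, work_ms])

    yields = {sched: 0 for sched in queues}
    for sched, runnable in queues.items():
        while runnable:
            task = runnable.popleft()
            task[1] -= min(QUANTUM_MS, task[1])  # burn one quantum of CPU
            if task[1] == 0:
                continue  # task finished, so there is nothing to yield
            if runnable:
                # Another task is waiting on this scheduler: yield to it.
                yields[sched] += 1
                runnable.append(task)
            else:
                # Nobody else is runnable: keep the CPU without waiting.
                runnable.appendleft(task)
    return yields


# Same two queries, same total CPU work; only the placement differs.
separate = simulate({"q1": 0, "q2": 1}, work_ms=40)
shared = simulate({"q1": 0, "q2": 0}, work_ms=40)
print("separate schedulers:", separate)
print("shared scheduler:   ", shared)
```

In this model the two placements consume exactly the same CPU time, yet only the shared placement records any yields. That is the gist of the distinction between scheduler pressure and CPU pressure.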
One factor that I really appreciate about this article is the clear explanation and examples provided for what I usually refer to as cascading resource bottlenecks. The article shows how external CPU pressure can make it difficult for the SQLOS scheduler to schedule processes, which results in increased wait times across all events. In these cases, looking at the processor run queue length (this is a great example of when this counter truly is relevant) and at the increased signal wait times in the [sys].[dm_os_wait_stats] DMV can expose the root issue. In these cases it “takes time to get scheduled on the CPU, and you only hope that your thread is not pushed off of the CPU, preemptively, by a higher priority thread that is ready to run. This is something to be aware of: on very CPU bound systems, sometimes the waits are not what they seem.”
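As a rough illustration of the signal wait check, the arithmetic can be sketched in Python. The sketch assumes the rows have already been exported from [sys].[dm_os_wait_stats] with its wait_type, wait_time_ms, and signal_wait_time_ms columns, where wait_time_ms is inclusive of the signal wait time. The sample numbers and the helper name `signal_wait_share` are invented for the example and are not measurements from any real system.

```python
def signal_wait_share(rows):
    """Return the fraction of total wait time spent waiting for a CPU.

    Each row is a dict with wait_time_ms and signal_wait_time_ms keys.
    wait_time_ms already includes the signal wait, so the ratio of the
    two sums lies between 0 and 1.
    """
    total_ms = sum(row["wait_time_ms"] for row in rows)
    signal_ms = sum(row["signal_wait_time_ms"] for row in rows)
    return signal_ms / total_ms if total_ms else 0.0


# Hypothetical snapshot, for illustration only.
sample_rows = [
    {"wait_type": "SOS_SCHEDULER_YIELD", "wait_time_ms": 1200, "signal_wait_time_ms": 1200},
    {"wait_type": "PAGEIOLATCH_SH", "wait_time_ms": 3000, "signal_wait_time_ms": 300},
    {"wait_type": "WRITELOG", "wait_time_ms": 800, "signal_wait_time_ms": 100},
]

share = signal_wait_share(sample_rows)
print(f"signal wait share: {share:.0%}")
```

What counts as a high share is a judgment call. The more useful signal is usually the trend between two snapshots, read alongside the processor run queue length.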
As many SQL Server pros like to point out, generalizations and quickly jumping to a conclusion can be dangerous – and this is a perfect example of how a generalization that’s not supported by the facts of the case can lead to drawing the wrong conclusion. To use a cheesy, but very relevant catch phrase from my youth (a la GI Joe) – Now you know, and knowing is half the battle.