CPU Steal | Blue Matador - Troubleshooting

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. As such, it only happens in virtualized environments like AWS, GCP, Azure, vSphere, and Xen.

To see the steal time in Linux, run top on the command line and look for the st (steal) value in the CPU summary line, shown as %st in some versions. Seeing the steal time on Windows depends on the hypervisor but usually requires installing the guest additions package for that particular hypervisor.
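For scripted checks, the same figure can be derived from /proc/stat, where steal is the eighth counter on the aggregate cpu line. The following is a minimal sketch, not a monitoring agent: the function names are hypothetical, and it assumes a Linux kernel recent enough to report the steal column.

```python
import time


def parse_cpu_line(line):
    """Return (steal, total) tick counts from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight columns are summed for the total.
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Steal time as a percentage of all CPU time between two samples."""
    steal0, total0 = parse_cpu_line(before)
    steal1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal1 - steal0) / elapsed


def sample_steal(interval=1.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return the steal percentage."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return steal_percent(first, second)
```

On a Linux VM, calling sample_steal() returns the steal percentage over roughly one second, which should be close to what top reports for the same interval.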

Note for AWS: Amazon has a concept of CPU credits for its burstable instance types. A VM earns CPU credits at a fixed hourly rate and spends them as it requests CPU time above its baseline. Once the VM’s credits are depleted, it is throttled back to the baseline and CPU will be stolen until more credits are earned. You can view CPU credits per VM on the AWS web console.
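To get a feel for how quickly credits run out, here is a rough sketch of the arithmetic. It relies on AWS's definition that one CPU credit equals one vCPU running at 100% for one minute. The function name is hypothetical, and the balance and earn rate used in the usage note are illustrative numbers, not figures from this article.

```python
def minutes_until_depleted(balance, earn_per_hour, utilization, vcpus=1):
    """Estimate minutes until a credit balance reaches zero at steady load.

    balance        current CPU credits
    earn_per_hour  credits earned per hour (depends on the instance type)
    utilization    average CPU utilization per vCPU, from 0.0 to 1.0
    Returns None when credits are earned at least as fast as they are spent.
    """
    # One credit = one vCPU at 100% for one minute.
    spend_per_minute = utilization * vcpus
    earn_per_minute = earn_per_hour / 60.0
    net_drain = spend_per_minute - earn_per_minute
    if net_drain <= 0:
        return None
    return balance / net_drain
```

For example, with an assumed balance of 144 credits, an earn rate of 6 credits per hour, and one vCPU pinned at 100%, minutes_until_depleted(144, 6, 1.0) gives about 160 minutes before throttling begins.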

Effects

The impact of stolen CPU always manifests as slowness, but it can have more profound effects on your infrastructure. Here are some examples:

Slower page load times
Slower database query times
Slower processing of reports
Increased queue size of asynchronous tasks because of an inability to process them quickly
Increased IaaS bill due to launching more servers to handle the same amount of load

There are two possible causes of steal time:

The VM needs more CPU than its share of the physical server can offer. AWS credits fall into this category.
The CPU on the physical server is oversubscribed.

Under no circumstances should you tolerate high steal time on a server. It means you’re getting worse performance than what you’re paying for. Moving and upgrading servers is quick and painless and solves the problem at its root.

Quick Fix

Manually terminate the virtual machine and launch a replacement.

Thorough Fix

If money is no object, then upgrading the VM is the easiest guaranteed solution.

Otherwise, finding the cause is best done through trial and error. Terminating the VM and relaunching it will typically move it to another physical server. If steal time persists through multiple moves, then the VM itself needs more CPU and it’s time to upgrade it.

An automated solution where high steal time kicks off a relaunch can be effective but can also mask scaling issues.
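The trial-and-error procedure and the automated relaunch can be sketched as a small decision function. This is only an illustration under assumed values: the 10% threshold, the five-sample window, the limit of two relaunches, and the function names are all hypothetical, and the actual relaunch or upgrade would be done through your provider's tooling.

```python
def steal_is_sustained(samples, threshold=10.0, min_consecutive=5):
    """True when the last `min_consecutive` steal readings all exceed `threshold` percent."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(s > threshold for s in recent)


def next_action(samples, relaunches_so_far, max_relaunches=2,
                threshold=10.0, min_consecutive=5):
    """Decide what to do with a VM based on its recent steal readings.

    Returns "none", "relaunch", or "upgrade".
    """
    if not steal_is_sustained(samples, threshold, min_consecutive):
        return "none"
    # Relaunching moves the VM to another physical server; try that first.
    if relaunches_so_far < max_relaunches:
        return "relaunch"
    # Steal survived several moves, so the VM itself likely needs more CPU.
    return "upgrade"
```

Capping the number of relaunches is what keeps the automation from masking a scaling issue: once the cap is reached, the function reports that an upgrade is needed instead of relaunching again.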

Resources

Understanding CPU Steal Time - when should you be worried? (Scout App)
Is there a Windows equivalent of Unix 'CPU steal time'? (Server Fault)
AWS CPU Credits and Baseline Performance (AWS Documentation)
Azure Monitoring CPU Steal time/Wait Time (Microsoft MSDN)