Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. It therefore occurs only in virtualized environments, such as cloud VMs on AWS, GCP, and Azure, or guests running on hypervisors like vSphere and Xen.

To see steal time on Linux, run top on the command line and look for the st value (shown as %st in older versions) in the CPU summary line. Seeing steal time on Windows depends on the hypervisor but usually requires installing the guest additions package for that particular hypervisor.
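The figure that top reports can also be derived by hand, which is useful when feeding steal time into a monitoring script. The sketch below assumes a Linux guest where the first line of /proc/stat lists cumulative CPU ticks in the order user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. The function names are illustrative and do not come from any library.

```python
# Field order of the aggregate "cpu" line in /proc/stat on modern Linux kernels.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:])))


def steal_percent(before, after):
    """Percentage of CPU ticks elapsed between two samples that were stolen."""
    # guest and guest_nice are already included in user and nice, so they are
    # left out of the total to avoid counting them twice.
    keys = FIELDS[:8]
    total = sum(after.get(k, 0) - before.get(k, 0) for k in keys)
    if total <= 0:
        return 0.0
    stolen = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * stolen / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Calling read_cpu_sample() twice, a few seconds apart, and passing both results to steal_percent() gives the share of that interval lost to steal. The sampling interval is a judgment call: short intervals are noisy, long ones hide brief spikes.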

Note for AWS: Amazon uses CPU credits for its burstable instance types (the T family). A VM earns credits continuously at a fixed hourly rate and spends them whenever it uses CPU. Once the VM’s credits are depleted, it is throttled to its baseline level, and the shortfall appears as stolen CPU until more credits are earned. You can view each VM’s credit balance in the AWS web console.
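To make the credit mechanism concrete, here is a deliberately simplified model of a credit balance over time. It assumes one credit equals one vCPU running at 100% for one minute. The earn rate and starting balance in the usage note are hypothetical values, not figures for any specific instance type, and the model ignores details such as balance caps and launch credits, so treat it as an illustration rather than AWS's actual accounting.

```python
def simulate_credits(balance, earn_per_hour, usage_percent_by_minute, vcpus=1):
    """Track a simplified CPU credit balance minute by minute.

    usage_percent_by_minute holds the average CPU utilization (0-100) for each
    minute. Returns the balance after every minute; it never drops below zero.
    """
    history = []
    for usage in usage_percent_by_minute:
        balance += earn_per_hour / 60.0      # credits accrue steadily
        balance -= vcpus * usage / 100.0     # credits spent during this minute
        balance = max(balance, 0.0)          # zero means the VM is being throttled
        history.append(balance)
    return history
```

For example, simulate_credits(2.0, 6, [100, 100, 100]) starts with 2 credits, earns 6 per hour (hypothetical), and runs flat out for three minutes. The balance reaches zero in the third minute, which is the point where steal time would start to appear.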


EFFECTS


Stolen CPU always shows up as slowness, but it can also have more profound effects on your infrastructure. Here are some examples:

  • Slower page load times
  • Slower database query times
  • Slower processing of reports
  • Increased queue size of asynchronous tasks because they cannot be processed quickly enough
  • Increased IaaS bill due to launching more servers to handle the same amount of load

There are two possible causes of steal time:

  • The VM needs more CPU than the physical server can offer. Depleted AWS CPU credits fall into this category.
  • The CPU on the physical server is oversubscribed.

Under no circumstances should you tolerate high steal time on a server. It means you’re getting worse performance than what you’re paying for. Moving and upgrading servers is quick and painless and solves the problem at its root.

 

QUICK FIX


Manually terminate the virtual machine and launch a replacement.

 

THOROUGH FIX


If money is no object, then upgrading the VM is the easiest guaranteed solution.

Otherwise, finding the cause is best done through trial and error. Terminating the VM and relaunching it will usually move it to another physical server. If steal time persists through multiple moves, then it’s time to upgrade the VM to have more CPU.

An automated solution where high steal time kicks off a relaunch can be effective but can also mask scaling issues.
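One way to automate the relaunch-then-upgrade procedure without hiding a scaling problem is to cap the number of automatic relaunches and escalate once the cap is reached. The sketch below shows only the decision logic. The threshold, the number of sustained samples, and the relaunch limit are hypothetical defaults to tune for your workload, and the actual relaunch or upgrade call is left to whatever tooling you use.

```python
def next_action(steal_samples, relaunch_count,
                threshold=10.0, sustained=3, max_relaunches=2):
    """Decide what to do with a VM given its recent steal percentages.

    steal_samples is ordered oldest to newest. Returns "ok", "relaunch",
    or "upgrade".
    """
    recent = steal_samples[-sustained:]
    # Only act on sustained steal, not on a single noisy reading.
    high = len(recent) == sustained and all(s >= threshold for s in recent)
    if not high:
        return "ok"
    if relaunch_count >= max_relaunches:
        # Steal time followed the VM across hosts, so another move is unlikely
        # to help; the VM probably needs more CPU.
        return "upgrade"
    return "relaunch"
```

Returning "upgrade" instead of relaunching indefinitely surfaces the case where the VM itself is undersized, which an unbounded relaunch loop would otherwise mask.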


RESOURCES