High Load | Blue Matador - Troubleshooting

CPU load is a measure of how many processes are currently running on or waiting for a processor. Processes sleeping while they wait for an event (like accepting network connections) don’t count in the calculation. On Linux, processes in uninterruptible sleep (typically waiting on disk I/O) do count. A load of 1 indicates that exactly 1 process needs a CPU. If you have at least 1 CPU, your server is keeping up.

The instantaneous number fluctuates wildly from second to second, so it is generally averaged over 1 minute, 5 minutes, and 15 minutes. We call these the 1-minute, 5-minute, and 15-minute load averages. To see your load averages, run top or uptime and look for “load average”.
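If you would rather read the load averages from a script than from top, the following is a minimal sketch using Python's standard library. The function name read_load_averages and the dictionary keys are illustrative choices, not part of any tool mentioned here. os.getloadavg() is only available on Unix-like systems, so this will not work on Windows.

```python
import os


def read_load_averages():
    """Return the 1-, 5-, and 15-minute load averages as a dict.

    os.getloadavg() is Unix-only and raises OSError if the load
    average cannot be obtained.
    """
    one, five, fifteen = os.getloadavg()
    return {"1m": one, "5m": five, "15m": fifteen}


if __name__ == "__main__":
    for window, value in read_load_averages().items():
        print(f"{window} load average: {value:.2f}")
```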

Comparing raw load between servers and clusters is only valid if every server in the comparison has the same number of CPUs. A load of 2 on a server that has 4 CPUs is “healthy.” A load of 2 on a server that has 1 CPU is “sick.” We recommend using normalized load, which doesn’t have this problem. To calculate normalized load, divide the load by the number of processors. This is the metric we recommend you use in all load calculations. A normalized load greater than 1 means sick. A normalized load less than 1 means healthy.
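The normalized load calculation can be sketched in a few lines of Python. The helper names normalized_load and is_healthy are hypothetical, and treating a normalized load of exactly 1 as not healthy is an assumption, since only the cases above and below 1 are defined here.

```python
def normalized_load(load, cpu_count):
    """Divide the raw load by the number of processors."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


def is_healthy(load, cpu_count):
    """Healthy means a normalized load below 1.

    Assumption: a normalized load of exactly 1 is reported as not
    healthy, because the server has no headroom left.
    """
    return normalized_load(load, cpu_count) < 1.0


# The two examples from the text: the same raw load of 2
# on a 4-CPU server and on a 1-CPU server.
print(normalized_load(2, 4), is_healthy(2, 4))  # 0.5 True
print(normalized_load(2, 1), is_healthy(2, 1))  # 2.0 False
```

On a live Unix-like server, the inputs could come from os.getloadavg() and os.cpu_count(). Note that os.cpu_count() may return None when the count cannot be determined, so check for that before dividing.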

Servers can handle spikes in normalized load, even over 1, and still be healthy. If the load spike is a one-time spike, the CPU will naturally catch up. The problems occur when spikes are prolonged or abnormally high such that the normalized load becomes unrecoverable.
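One rough way to tell a one-time spike from a prolonged one is to compare the three normalized load averages against each other. The sketch below is only an illustration under stated assumptions: the function name classify_load, the four labels, and the use of 1 as the only threshold are all choices made for this example, not rules from any monitoring tool.

```python
def classify_load(one_min, five_min, fifteen_min, cpu_count):
    """Label the load situation from the three load averages.

    Assumed labels (illustrative only):
      "healthy"    - all normalized averages are at or below 1
      "spike"      - 1-minute is above 1, 15-minute is not
      "sustained"  - both 1-minute and 15-minute are above 1
      "recovering" - 1-minute is back at or below 1, but a longer
                     window is still above 1
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    n1 = one_min / cpu_count
    n5 = five_min / cpu_count
    n15 = fifteen_min / cpu_count

    if n1 > 1.0:
        return "sustained" if n15 > 1.0 else "spike"
    if n5 > 1.0 or n15 > 1.0:
        return "recovering"
    return "healthy"


# A 4-CPU server whose 1-minute load just jumped to 6.
print(classify_load(6.0, 2.0, 1.0, cpu_count=4))  # spike
# The same server after the load has stayed at 6 for a long time.
print(classify_load(6.0, 6.0, 6.0, cpu_count=4))  # sustained
```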

Effects

When you have an unrecoverable load, you can expect issues like the following:

Slowness across the board, with some endpoints or queries taking the brunt of it
Unresponsive terminals and RDP sessions
Missing data and absent alerts in monitoring tools
Failed automatic remediation when kicked off by the server itself

Quick Fix

Add CPU capacity by adding servers or upgrading the ones you have.

Thorough Fix

If the server is running a third-party application, verify your configuration and add more capacity as appropriate. The most common cause of high load on web servers and database servers is allowing too many active connections. Allowing fewer connections means fewer processes compete for the CPU at once, which reduces overall congestion.

If the server is running a custom application, determine the cause of the load:

Normal scaling operations. If your normalized load is growing steadily with your user base, it’s time to add more capacity.
Bad code release. You will see a sudden spike in normalized load at the time the code was released. Involve the developers to find and fix the code path responsible.
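If you keep a history of normalized load, the two causes above can be told apart roughly by how the increase is distributed over time. The following is a minimal sketch, not a definitive method: the function name classify_growth, the jump_share parameter, and its default of 0.5 are assumptions made for this example, and the sample values are invented for illustration.

```python
def classify_growth(samples, jump_share=0.5):
    """Classify a series of normalized load samples, oldest first.

    Assumption: if a single step between consecutive samples accounts
    for at least `jump_share` of the total rise, the increase is
    treated as a sudden spike; otherwise as steady growth.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    total_rise = samples[-1] - samples[0]
    if total_rise <= 0:
        return "no growth"

    # Largest increase between any two consecutive samples.
    largest_jump = max(b - a for a, b in zip(samples, samples[1:]))
    if largest_jump / total_rise >= jump_share:
        return "sudden spike"
    return "steady growth"


# Load creeping up evenly, as with a growing user base.
print(classify_growth([0.4, 0.5, 0.6, 0.7, 0.8]))      # steady growth
# Load flat, then jumping at one point, as after a release.
print(classify_growth([0.4, 0.4, 0.41, 1.3, 1.3]))     # sudden spike
```

Lining up the time of the largest jump with your release history is still a manual step. The sketch only tells you which of the two patterns the numbers resemble.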

Resources

Understanding Linux CPU Load - when should you be worried? (Scout App)
Load Definition (Wikipedia)
Linux Load Averages: Solving the Mystery (Brendan Gregg's Blog)
Understanding the Load Average on Linux and Other Unix-like Systems (How-To Geek)
Interpreting CPU Utilization on Windows for Performance Analysis (Microsoft TechNet)
How to Fix High CPU Usage in Windows (MakeUseOf)
Windows Performance Counters Explained (AppAdminTools)