Swapping | Blue Matador - Troubleshooting

Swap space (the page file on Windows) is disk space that the OS’s virtual memory manager uses as an extension of RAM. If the server runs out of RAM, the OS moves memory pages into this additional space. If both RAM and swap space are exhausted, the OS is forced either to reject further requests for memory or to select a process to kill to reclaim memory.

A common rule of thumb is to have at least as much swap space as you have RAM. Running without swap space is dangerous unless your memory allocation and management are highly tuned. Even then, disk space is cheap, and it’s better to have extra swap on the off chance something blows up.
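On Linux, the rule of thumb above can be checked against the `MemTotal` and `SwapTotal` fields of `/proc/meminfo`. The sketch below is a minimal illustration, not a definitive tool: the function names `parse_meminfo` and `swap_shortfall_kib` are hypothetical, and it assumes the usual `Name:   value kB` layout of that file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most values are reported in kB; a few counters (e.g. HugePages_Total)
    have no unit. Only the leading integer of each line is kept.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_shortfall_kib(meminfo):
    """Return how much swap is missing to match RAM (0 if swap >= RAM)."""
    return max(0, meminfo["MemTotal"] - meminfo["SwapTotal"])


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On a Linux host, `swap_shortfall_kib(read_meminfo())` returns 0 when swap is at least as large as RAM, and the size of the gap otherwise.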

Most operating systems proactively copy sections of memory to swap. This happens asynchronously and causes no slowdown in your applications. When memory is oversubscribed, the OS can evict one of those sections from RAM, because a copy already exists in swap, and assign the freed memory to another program. Seeing some swap space in use is therefore not a big deal on its own; it just means this asynchronous, proactive swapping has taken place.

The problem occurs during the synchronous read-in from swap space. If any process needs to access its swapped-out memory, the OS must retrieve from disk what was expected to be in memory. This retrieval is blocking I/O: the process is stopped until the memory is reloaded. It wreaks havoc on your databases, load balancers, web servers, caches, and other services.
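Because the harmful part is the read-in, the swap-in rate is more telling than the amount of swap in use. On Linux, `/proc/vmstat` exposes the cumulative counters `pswpin` and `pswpout` (pages swapped in and out since boot). The sketch below turns two samples into rates; the helper names are hypothetical, and any alert threshold is left to you.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ('name value' per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Pages per second swapped in and out between two counter samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "in_per_s": (after["pswpin"] - before["pswpin"]) / seconds,
        "out_per_s": (after["pswpout"] - before["pswpout"]) / seconds,
    }
```

To use it, read `/proc/vmstat` twice a few seconds apart and pass both parsed samples along with the elapsed time. A sustained non-zero `in_per_s` means processes are waiting on disk for their own memory.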

When your server spends more time swapping than processing application logic, it is said to be thrashing. This dangerous state can often be fixed only by restarting the server.

Effects

During times when your server is swapping data into memory, you can expect problems like:

Spotty responsiveness in network connections and processing power
Timeouts on SSH and RDP connections
Failed health checks on endpoints that recover on their own
Spikes in disk throughput and I/O wait times

Swapping data into memory can be completely avoided by correctly allocating memory to each running process, while reserving a small amount for OS overhead, periodic tasks, and interactive logins.

Quick Fix

Restart the server and consider increasing memory capacity.

Thorough Fix

Calculate the right amount of memory for every process running on the server. Record your calculations in a spreadsheet or document and share it with your team. Often you’ll find that either a single process was allowed to do something bad or that your server needs more memory than it was given.

Some processes don’t support configured memory limits. For those, monitor actual usage and use the observed figures in your calculations. If one server has multiple processes that expand to fill the available memory, split those processes onto multiple servers.
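For processes without a configurable limit, one way to observe actual usage on Linux is the per-process file `/proc/<pid>/status`, which reports resident memory as `VmRSS` and swapped-out memory as `VmSwap`. The following is a rough sketch with a hypothetical function name; kernel threads have no `Vm*` lines, so missing fields are treated as zero.

```python
def parse_status_memory(text):
    """Extract resident and swapped memory (in kB) from /proc/<pid>/status text."""
    wanted = {"VmRSS": "rss_kib", "VmSwap": "swap_kib"}
    usage = {"rss_kib": 0, "swap_kib": 0}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and name in wanted and fields:
            usage[wanted[name]] = int(fields[0])
    return usage
```

Sampling this periodically over a representative period and recording the peak `rss_kib` gives a number to use in the memory calculation. A non-zero `swap_kib` shows which processes have pages that may need to be read back in.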

Measure the OS overhead by running the OS without any applications for 24 hours. Remember to include space for periodic tasks and upgrades. Finally, make sure you leave about 3% to 5% of total memory as a safety buffer.
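The calculation described above amounts to simple arithmetic, which the sketch below illustrates. Every name and figure in it is a made-up example: `memory_budget` is a hypothetical helper, and the process limits, OS overhead, and periodic-task allowance are placeholders for the numbers you measure.

```python
def memory_budget(total_mib, process_limits_mib, os_overhead_mib,
                  periodic_mib, buffer_fraction=0.05):
    """Sum planned memory use and report what is left over.

    buffer_fraction is the safety buffer as a share of total memory
    (the text suggests about 3% to 5%).
    """
    buffer_mib = total_mib * buffer_fraction
    committed_mib = (sum(process_limits_mib.values())
                     + os_overhead_mib + periodic_mib + buffer_mib)
    return {
        "buffer_mib": buffer_mib,
        "committed_mib": committed_mib,
        # Negative headroom means the server is oversubscribed.
        "headroom_mib": total_mib - committed_mib,
    }


# Illustrative figures only, for a server with 16 GiB of RAM.
example = memory_budget(
    total_mib=16384,
    process_limits_mib={"database": 8192, "web": 1024, "app": 4096},
    os_overhead_mib=1024,
    periodic_mib=512,
)
```

With these example figures the plan commits roughly 15,667 MiB and leaves about 717 MiB of headroom. If `headroom_mib` comes out negative, either lower a process limit or give the server more memory.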

Some of the processes that are the most difficult to configure are databases and web servers. Spend the time to configure them correctly. Read the docs online to find configuration best practices.

Resources

Linux Swappiness (Wikipedia)
Windows Page File: The Definitive Guide (Microsoft TechNet)
How to Tell if Windows Server is Swapping (Server Fault StackExchange)
Commands to Monitor Swap Space Usage in Linux (Tecmint)
Linux OOM Killer (Linux Memory Management)
Thrashing (Wikipedia)