Disk IO | Blue Matador - Troubleshooting

Read and write operations are measured in IOPS (input/output operations per second). A single operation from your application’s point of view may translate to zero, one, or many operations on the disk. For example, creating a 2GB file will generate thousands of disk operations. The exact number depends on caching, the size of each requested I/O operation, sector size, file system alignment on disk, and a variety of other factors.
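To make the arithmetic concrete, here is a rough sketch of how many disk operations a 2GB write could turn into at a few request sizes. It treats 2GB as 2 GiB and ignores caching, request merging, and file system metadata, so the result is only a lower bound. The request sizes and the function name are illustrative assumptions, not measurements.

```python
def operations_for_write(total_bytes: int, io_size_bytes: int) -> int:
    """Lower-bound count of disk operations needed to write total_bytes
    when each operation carries at most io_size_bytes."""
    if total_bytes < 0 or io_size_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and io_size_bytes > 0")
    # Integer ceiling division: a partial final request still costs one operation.
    return (total_bytes + io_size_bytes - 1) // io_size_bytes


TWO_GIB = 2 * 1024**3  # assumption: "2GB" read as 2 GiB

# Hypothetical request sizes: 4 KiB, 128 KiB, and 1 MiB.
for io_size in (4 * 1024, 128 * 1024, 1024 * 1024):
    ops = operations_for_write(TWO_GIB, io_size)
    print(f"{io_size // 1024:>5} KiB requests -> {ops:,} operations")
```

The same file costs a few thousand operations with large requests and hundreds of thousands with small ones, which is why request size matters as much as file size.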

To see your disk throughput on Linux, use iostat -xd. On Windows, open the Performance tab of the Task Manager. To test your throughput, use dd on Linux and DiskSpd on Windows.
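If you would rather collect IOPS from a script than read them off a tool, one possible approach on Linux is to sample the kernel's cumulative counters in /proc/diskstats twice and divide the difference by the interval. The sketch below assumes the standard field layout (device name in the third column, reads completed in the fourth, writes completed in the eighth). The function names are hypothetical.

```python
import time


def parse_diskstats(text: str) -> dict:
    """Map device name -> (reads_completed, writes_completed)."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # fields: major, minor, name, reads completed, reads merged,
        # sectors read, ms reading, writes completed, ...
        counters[fields[2]] = (int(fields[3]), int(fields[7]))
    return counters


def iops_between(before: dict, after: dict, interval_seconds: float) -> dict:
    """Map device name -> (read IOPS, write IOPS) between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for name, (reads_after, writes_after) in after.items():
        if name not in before:
            continue  # device appeared between samples
        reads_before, writes_before = before[name]
        rates[name] = (
            (reads_after - reads_before) / interval_seconds,
            (writes_after - writes_before) / interval_seconds,
        )
    return rates


def sample_iops(interval_seconds: float = 1.0, path: str = "/proc/diskstats") -> dict:
    """Take two samples of the counters file and return per-device IOPS."""
    with open(path) as stats:
        before = parse_diskstats(stats.read())
    time.sleep(interval_seconds)
    with open(path) as stats:
        after = parse_diskstats(stats.read())
    return iops_between(before, after, interval_seconds)
```

Calling sample_iops() on a Linux host returns a dictionary keyed by device name. The parsing and rate functions take plain text and dictionaries, so they can be checked without touching a real disk.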

The type of disk plays a big role in total throughput. Magnetic disks must move the head (seek time) and wait for the platter to rotate into position (rotational latency), while solid-state disks have near-immediate random access. While more factors are at play, you can expect a magnetic disk to have 50-500 IOPS, while an SSD will have 3,000-40,000. For magnetic disks, fragmentation increases the time spent seeking for random access.
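A common back-of-the-envelope estimate for a magnetic disk's random IOPS is one operation per average seek plus average rotational latency, where the latter is the time for half a revolution. The sketch below applies that rule of thumb. The RPM and seek figures are illustrative assumptions, not the specifications of any particular drive, and the estimate ignores queuing, caching, and transfer time.

```python
def estimate_hdd_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random-access IOPS for a spinning disk:
    1 / (average seek time + average rotational latency)."""
    if rpm <= 0 or avg_seek_ms < 0:
        raise ValueError("rpm must be positive and avg_seek_ms non-negative")
    # On average the platter must turn half a revolution to reach the data.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


# Hypothetical drives: (label, RPM, average seek time in milliseconds).
for label, rpm, seek_ms in (
    ("7,200 RPM desktop drive", 7200, 8.5),
    ("15,000 RPM enterprise drive", 15000, 3.5),
):
    print(f"{label}: about {estimate_hdd_iops(rpm, seek_ms):.0f} IOPS")
```

Both estimates land inside the 50-500 range quoted above. Fragmentation effectively raises the average seek time, which pushes the result lower.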

Ignoring random spikes and seasonality of data, a change in IOPS usually means a change in usage or a forthcoming error.
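One simple way to look past random spikes is to compare the median of a recent window of IOPS samples against the median of the samples before it. The sketch below is a minimal illustration of that idea, not how any particular monitoring product detects anomalies. It does not account for seasonality, and the window length and 50% threshold are arbitrary assumptions.

```python
from statistics import median


def relative_shift(samples: list, recent_len: int) -> float:
    """Relative change of the recent median versus the baseline median.

    The last recent_len samples form the recent window; everything
    before them forms the baseline.
    """
    if recent_len <= 0 or len(samples) <= recent_len:
        raise ValueError("need at least one baseline sample and one recent sample")
    baseline = median(samples[:-recent_len])
    recent = median(samples[-recent_len:])
    if baseline == 0:
        return float("inf") if recent > 0 else 0.0
    return (recent - baseline) / baseline


def has_shifted(samples: list, recent_len: int, threshold: float = 0.5) -> bool:
    """True when the recent median moved by at least threshold (0.5 = 50%)."""
    return abs(relative_shift(samples, recent_len)) >= threshold


# Hypothetical IOPS samples: one spike early on, then a sustained rise.
history = [100, 102, 98, 5000, 101, 99, 100, 250, 260, 255]
print(has_shifted(history, recent_len=3))
```

Because medians are used, the single 5000 IOPS spike in the baseline barely moves the comparison, while the sustained rise at the end is flagged.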

Effects

While searching for problems, make sure you look at some of the most common causes:

The server dropped out of a load balancer
A new code release fundamentally changed the traffic patterns
An administrator changed a firewall or network configuration
A critical scaling trigger was hit, so more servers need to be added

If you haven’t spent time optimizing the size and alignment of file system pages to disk sectors, we recommend learning about it. Disks are divided into sectors, and file systems are divided into pages (often called blocks). In a perfect world, a single page lines up exactly with a single sector. In a surprisingly common set of environments, it doesn’t: the page straddles a sector boundary, so every read or write becomes two reads or writes and your throughput is halved. Fixing your alignment and sizing can substantially reduce the IOPS you need.
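A quick alignment check is to confirm that a partition's starting byte offset is a multiple of the disk's physical block size. The sketch below shows that check as plain arithmetic. The sysfs paths in the helper are an assumption about a typical Linux layout (partition start reported in 512-byte units), and the function names are hypothetical.

```python
def is_partition_aligned(
    start_sector: int,
    physical_block_bytes: int,
    sector_unit_bytes: int = 512,
) -> bool:
    """True when the partition's byte offset is a multiple of the
    disk's physical block size."""
    if start_sector < 0 or physical_block_bytes <= 0 or sector_unit_bytes <= 0:
        raise ValueError("sizes must be positive and start_sector non-negative")
    return (start_sector * sector_unit_bytes) % physical_block_bytes == 0


def check_linux_partition(disk: str, partition: str) -> bool:
    """Read the values from sysfs (assumed paths, e.g. disk='sda',
    partition='sda1') and run the alignment check."""
    with open(f"/sys/block/{disk}/{partition}/start") as start_file:
        start_sector = int(start_file.read())
    with open(f"/sys/block/{disk}/queue/physical_block_size") as block_file:
        physical_block_bytes = int(block_file.read())
    return is_partition_aligned(start_sector, physical_block_bytes)


# A partition starting at sector 63 on a disk with 4096-byte physical
# blocks is misaligned; one starting at sector 2048 is aligned.
print(is_partition_aligned(63, 4096))
print(is_partition_aligned(2048, 4096))
```

A start sector of 63 was a common default in older partitioning tools, which is one reason misalignment shows up on disks with 4096-byte physical blocks.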

Quick Fix

Verify that the critical applications on this server are still responsive.

Thorough Fix

Compare server vitals against deployments, configuration changes, infrastructure changes, user logins, and automated remediation. For every inflection point, identify the root cause. If one seems out of place, investigate and remedy as appropriate.

Resources

IOPS (Wikipedia)
Disk Sector (Wikipedia)
Storage performance: IOPS, latency and throughput (Rickard Nobel)
Linux and Unix Test Disk I/O Performance With dd Command (nixCraft)
10 Free Tools to Measure Hard Drive and SSD Performance (Raymond Tech)
Know Your Storage Constraints: IOPS and Throughput (Green House Data)
Windows Performance Counters Explained (AppAdminTools)
Hard disk drive performance characteristics (Wikipedia)
Amazon EBS Volume Types (AWS Documentation)
DiskSpd, PowerShell, and Storage Performance on Windows (Microsoft TechNet)
Configuring Windows Performance Monitor to Capture Disk I/O Activity (SmarterTools)
5 Tools for Monitoring Disk Activity in Linux (OpsDash)