Read and write operations are measured in IOPS (input/output operations per second). A single operation from your application’s point of view may translate to zero, one, or many operations on the disk. For example, creating a 2GB file can generate thousands of disk operations. This is due to caching, the size of each requested I/O operation, sector size, file system alignment on disk, and a variety of other factors.
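
As a rough illustration of the 2GB example, the sketch below counts how many fixed-size requests it takes to move a file of that size. It assumes 2 GiB, ignores caching and request merging, and the request sizes are illustrative choices rather than values from any particular system.

```python
def io_operations(total_bytes: int, io_size_bytes: int) -> int:
    """Number of fixed-size I/O requests needed to move total_bytes (rounded up)."""
    return -(-total_bytes // io_size_bytes)  # ceiling division


GIB = 1024 ** 3
KIB = 1024

file_size = 2 * GIB  # assumption: "2GB" treated as 2 GiB

# Illustrative request sizes; real sizes depend on the application and kernel.
for io_size in (4 * KIB, 128 * KIB, 1024 * KIB):
    ops = io_operations(file_size, io_size)
    print(f"{io_size // KIB:>5} KiB requests: {ops:,} operations")
```

Under these assumptions, the same file costs anywhere from a couple of thousand operations (1 MiB requests) to over half a million (4 KiB requests), which is why request size matters as much as file size.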

    To see your disk throughput on Linux, use iostat -xd. For Windows, open the Performance tab of the Task Manager. To test your throughput, use dd on Linux and Diskspd on Windows.
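
If you want the same number from a script, one possible approach on Linux is to sample /proc/diskstats twice and divide the change in completed operations by the interval. This is a minimal sketch, not a replacement for iostat: the function names and the default device name "sda" are assumptions made for illustration.

```python
import time


def read_completed_ops(diskstats_text: str, device: str) -> int:
    """Return reads completed + writes completed for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # /proc/diskstats columns: major, minor, name, reads completed,
        # reads merged, sectors read, ms reading, writes completed, ...
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise ValueError(f"device {device!r} not found")


def iops_between(before: str, after: str, device: str, seconds: float) -> float:
    """Average IOPS for `device` between two /proc/diskstats snapshots."""
    delta = read_completed_ops(after, device) - read_completed_ops(before, device)
    return delta / seconds


def sample_iops(device: str = "sda", interval: float = 1.0) -> float:
    """Linux only: take two snapshots `interval` seconds apart."""
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = f.read()
    return iops_between(before, after, device, interval)
```

Calling sample_iops("sda") on a Linux host returns the average IOPS over one second. The device name will differ between systems.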

    The type of disk plays a big role in total throughput. Magnetic disks must move the head and wait for the platter to rotate into position before each random access, while solid-state disks have no moving parts and offer near-immediate random access. While more factors are at play, you can expect a magnetic disk to deliver 50-500 IOPS, while an SSD will deliver 3,000-40,000. For magnetic disks, fragmentation increases the time spent seeking for random access.
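
To see why magnetic disks land in the low hundreds at best, a common back-of-the-envelope estimate is one random operation per average seek plus half a rotation. The sketch below applies that estimate. The seek times used are assumed, typical-looking figures, not measurements of any specific drive.

```python
def estimated_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS for a magnetic disk: 1 / (seek + half a rotation)."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # average wait is half a turn
    return 1000 / (avg_seek_ms + rotational_latency_ms)


# Assumed drive characteristics, for illustration only.
print(round(estimated_random_iops(rpm=7200, avg_seek_ms=8.5)))
print(round(estimated_random_iops(rpm=15000, avg_seek_ms=3.5)))
```

With these assumed figures the estimate comes out to roughly 79 IOPS for the 7,200 RPM drive and 182 IOPS for the 15,000 RPM drive, both inside the 50-500 range quoted above.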

    Ignoring random spikes and seasonality of data, a change in IOPS usually means a change in usage or a forthcoming error.
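
One simple way to decide whether a reading counts as "a change" is to compare it with a recent baseline. The sketch below flags a value that sits more than a chosen number of standard deviations from the mean of recent samples. The function name and the threshold of 3 are assumptions, and the check does nothing to account for seasonality, so it only suits a window in which usage is expected to be steady.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """True if `current` is more than `threshold` std devs from the history mean."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat baseline: any change counts
    return abs(current - mu) / sigma > threshold


recent_iops = [100, 105, 95, 102, 98]  # hypothetical samples
print(deviates_from_baseline(recent_iops, 101))
print(deviates_from_baseline(recent_iops, 300))
```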


    While searching for problems, make sure you look at some of the most common causes:

    • The server dropped out of a load balancer
    • A new code release fundamentally changed the traffic patterns
    • An administrator changed a firewall or network configuration
    • A critical scaling trigger was hit, so more servers need to be added

    If you haven’t spent time optimizing the size and alignment of file system pages to disk sectors, we recommend you learn about it. Disks are divided into sectors, and file systems are divided into pages. In a perfect world, a single page fits on a single sector. In a surprisingly common set of environments, it doesn’t. In that case, every read/write is actually 2 reads/writes, so your throughput is halved. You can reduce your IOPS needs substantially by fixing your alignment and sizing.
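
The doubling described above can be checked with simple arithmetic: count how many physical sectors a page touches given where it starts. This is a minimal sketch. The 4096-byte sector and page sizes and the two partition offsets are assumptions chosen to show one misaligned and one aligned layout.

```python
def sectors_touched(offset_bytes: int, length_bytes: int, sector_bytes: int) -> int:
    """Number of physical sectors covered by a span starting at offset_bytes."""
    first = offset_bytes // sector_bytes
    last = (offset_bytes + length_bytes - 1) // sector_bytes
    return last - first + 1


def is_aligned(partition_start_bytes: int, sector_bytes: int) -> bool:
    """True if the partition starts on a physical sector boundary."""
    return partition_start_bytes % sector_bytes == 0


SECTOR = 4096  # assumed physical sector size
PAGE = 4096    # assumed file system page size

misaligned_start = 63 * 512   # partition starting at 512-byte sector 63
aligned_start = 1024 * 1024   # partition starting at 1 MiB

for start in (misaligned_start, aligned_start):
    print(start, is_aligned(start, SECTOR), sectors_touched(start, PAGE, SECTOR))
```

In the misaligned case each page straddles two physical sectors, so every page read or write costs two disk operations. In the aligned case it costs one.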


    Quick Fix

    Verify that the critical applications on this server are still responsive.


    Thorough Fix

    Compare server vitals against deployments, configuration changes, infrastructure changes, user logins, and automated remediation. For every inflection point, identify the root cause. If one seems out of place, investigate and remedy as appropriate.
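
The comparison step can be partly automated by listing the known events that fall close to each inflection point. The sketch below is one possible approach, not a prescribed tool: the function name, the 15-minute window, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta


def events_near(inflection, events, window=timedelta(minutes=15)):
    """Return (timestamp, description) events within `window`, nearest first."""
    nearby = [(ts, desc) for ts, desc in events if abs(ts - inflection) <= window]
    return sorted(nearby, key=lambda event: abs(event[0] - inflection))


# Hypothetical change log, for illustration only.
change_log = [
    (datetime(2024, 1, 1, 9, 0), "deployment"),
    (datetime(2024, 1, 1, 11, 55), "firewall rule change"),
    (datetime(2024, 1, 1, 12, 10), "automated remediation run"),
]

inflection_point = datetime(2024, 1, 1, 12, 0)
for ts, desc in events_near(inflection_point, change_log):
    print(ts.isoformat(), desc)
```

An inflection point that returns no nearby events is a good candidate for the "out of place" case that needs manual investigation.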