
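    Every packet on the network carries a cyclic redundancy check (CRC), a checksum that lets the receiver verify data integrity across potentially faulty network hardware. If any portion of the packet changes in transit, the CRC no longer matches and the receiver throws away the whole packet. Reliable protocols such as TCP then retransmit the missing data; UDP does not, so that data is simply lost.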
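
    The check described above can be illustrated with a minimal sketch. It uses Python's zlib.crc32 and appends the result to a payload; the helper names (frame_with_crc, is_intact, flip_bit) and the big-endian byte layout are assumptions made for illustration, not the exact on-wire format used by any particular network hardware.

```python
import zlib


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (illustrative layout, not a real frame format)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def is_intact(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored value."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Return a copy of the frame with one bit inverted, simulating corruption in transit."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


frame = frame_with_crc(b"hello, network")
damaged = flip_bit(frame, 10)

print(is_intact(frame))    # True
print(is_intact(damaged))  # False: a single flipped bit invalidates the whole frame
```

    A receiver that sees the second result has no way to tell which bit changed, which is why the whole packet is discarded rather than repaired.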

    Packets can be corrupted by voltage differences, timing mismatches between network devices, and even cosmic rays inducing single-event upsets (SEUs). Most of these causes are harmless because they occur so rarely. Faulty hardware is the single largest cause of consistent errors on the network.

    Zero and near-zero error counts are tolerable, even normal. Errors become a concern when they are sustained over a long period or become frequent enough to degrade performance.
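
    One way to make "sustained" concrete is to turn periodic readings of a cumulative error counter into a rate and check whether that rate stays high. The sketch below assumes readings taken at a fixed interval; the function names, the threshold of 0.5 errors per second, and the three-interval window are hypothetical values chosen for illustration, not recommended limits.

```python
from typing import List, Sequence


def error_rates(samples: Sequence[int], interval_s: float) -> List[float]:
    """Convert cumulative error counter readings into errors per second.

    Assumes the counter only grows; a counter reset would show up as a negative rate.
    """
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]


def is_sustained(rates: Sequence[float], threshold: float, min_intervals: int) -> bool:
    """True if each of the last `min_intervals` rates exceeds `threshold`."""
    if len(rates) < min_intervals:
        return False
    return all(rate > threshold for rate in rates[-min_intervals:])


# Hypothetical counter readings taken every 60 seconds.
steady_growth = error_rates([100, 100, 101, 160, 220, 290], 60)
one_off_spike = error_rates([100, 100, 160, 160, 160], 60)

print(is_sustained(steady_growth, threshold=0.5, min_intervals=3))  # True
print(is_sustained(one_off_spike, threshold=0.5, min_intervals=3))  # False
```

    The second series contains a burst of 60 errors but then goes quiet, so it is not flagged; only errors that keep arriving trip the check.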

    To detect network errors on Linux, run ifconfig (or ip -s link on systems where ifconfig is not installed) and look for “errors” for each network device. In Windows PowerShell, use the Get-NetAdapterStatistics cmdlet and look for “ReceivedPacketErrors.” Every network device keeps its own internal error counter.
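
    For scripted checks on Linux, the same per-device counters are also exposed as files under /sys/class/net/<interface>/statistics/. The following is a minimal sketch that reads rx_errors and tx_errors for every interface; the function name read_error_counters and its sysfs_root parameter are assumptions introduced here so the example can be pointed at a test directory.

```python
import os
from typing import Dict


def read_error_counters(sysfs_root: str = "/sys/class/net") -> Dict[str, Dict[str, int]]:
    """Return {interface: {"rx_errors": n, "tx_errors": n}} read from sysfs."""
    counters: Dict[str, Dict[str, int]] = {}
    for iface in sorted(os.listdir(sysfs_root)):
        stats_dir = os.path.join(sysfs_root, iface, "statistics")
        if not os.path.isdir(stats_dir):
            continue  # skip entries that are not network interfaces
        values: Dict[str, int] = {}
        for name in ("rx_errors", "tx_errors"):
            try:
                with open(os.path.join(stats_dir, name)) as f:
                    values[name] = int(f.read().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable; leave it out
        if values:
            counters[iface] = values
    return counters


if __name__ == "__main__":
    # Only meaningful on Linux, where /sys/class/net exists.
    if os.path.isdir("/sys/class/net"):
        for iface, values in read_error_counters().items():
            print(iface, values)
```

    Running this periodically and storing the results gives the raw data needed to tell an occasional error from a persistent one.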


    Effects


    As network errors increase, you may see issues such as:

    • Increased number of timeouts to databases, services, and caches
    • Missing data in syslog, statsd, or other UDP monitoring tools
    • Increased 5xx HTTP status codes due to unexpected timeouts
    • Difficulties connecting to the server

     

    Quick Fix


    If you’re in a cloud environment, terminate your server and relaunch it, which typically moves it to a different physical server.

    If you’re in a physical environment, correlate error reports to identify the faulty hardware.
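
    Correlating error reports can be as simple as asking which shared component sits in front of the servers that report errors. The sketch below is one possible heuristic, not a definitive method: it ranks each component by the fraction of its attached servers that show errors. The function name rank_suspects, the server and component names, and the error counts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_suspects(
    errors_by_server: Dict[str, int],
    components_by_server: Dict[str, List[str]],
) -> List[Tuple[str, float]]:
    """Rank components by the fraction of attached servers that report errors."""
    attached: Dict[str, int] = defaultdict(int)
    affected: Dict[str, int] = defaultdict(int)
    for server, components in components_by_server.items():
        for component in components:
            attached[component] += 1
            if errors_by_server.get(server, 0) > 0:
                affected[component] += 1
    ranked = [(c, affected[c] / attached[c]) for c in attached]
    # Highest fraction first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical inventory: which switch and router each server connects through.
errors = {"web1": 120, "web2": 95, "db1": 0, "db2": 0}
topology = {
    "web1": ["switch-a", "router-1"],
    "web2": ["switch-a", "router-1"],
    "db1": ["switch-b", "router-1"],
    "db2": ["switch-b", "router-1"],
}

print(rank_suspects(errors, topology))
# [('switch-a', 1.0), ('router-1', 0.5), ('switch-b', 0.0)]
```

    Here every server behind switch-a reports errors while the servers behind switch-b report none, so switch-a (or its cabling) is the first thing to inspect.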

     

    Thorough Fix


    Start with the network components: switches, routers, cables, and gateways. Replace them systematically, one at a time, while watching network errors on servers within the network, so that a drop in errors points to the component you just replaced.

    After all network components have been verified, upgrade the firmware and device drivers on the affected servers. Consider upgrading the OS.

     

    Resources