Each network device keeps a counter of the packets it has dropped. When packets are dropped, any retransmission is the responsibility of the transport layer (layer 4 of the OSI model).
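
As an illustration of reading that counter, the sketch below assumes a Linux host, where per-interface statistics are exposed as text in /proc/net/dev. The function name parse_dropped is hypothetical, and other operating systems expose these counters differently.

```python
def parse_dropped(proc_net_dev_text):
    """Return {interface: (rx_dropped, tx_dropped)} from /proc/net/dev text.

    Assumes the Linux /proc/net/dev layout: two header lines, then one
    line per interface with 8 receive columns followed by 8 transmit columns.
    """
    drops = {}
    # Skip the two header lines at the top of the file.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:
            continue
        # Receive columns: bytes packets errs drop ...   -> drop is index 3
        # Transmit columns start at index 8, so its drop -> index 11
        drops[name.strip()] = (int(fields[3]), int(fields[11]))
    return drops
```

On a Linux server, passing the contents of /proc/net/dev to parse_dropped at regular intervals shows whether the counters are still climbing.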

UDP is quick but unreliable: the protocol never retransmits a lost packet. Datagrams discarded above the interface, for example when a socket's receive buffer overflows, are also not included in the interface's dropped-packet counter. If you are dropping a lot of packets, and also sending a lot of UDP traffic, the problem may be worse than the network interface is letting on.

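While TCP will retransmit packets, it can take multiple seconds to do so, which may cause your applications to time out first. RFC 6298 defines the retransmission timeout (RTO) calculation: the timeout starts at 1 second before any round-trip time has been measured, and it doubles each time a retransmission itself times out. In short, your application will fare better with a reliable network infrastructure.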
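
To make the RFC 6298 calculation concrete, here is a minimal sketch of the estimator it describes, using the constants the RFC recommends (alpha = 1/8, beta = 1/4, K = 4, a 1 second floor). The class and method names are hypothetical, the 60 second cap is the smallest maximum the RFC permits, and real TCP stacks may use different floors and caps.

```python
class RtoEstimator:
    """Sketch of the RFC 6298 retransmission timeout (RTO) calculation."""

    ALPHA = 1 / 8    # gain for the smoothed RTT
    BETA = 1 / 4     # gain for the RTT variance
    K = 4
    MIN_RTO = 1.0    # seconds; floor recommended by the RFC
    MAX_RTO = 60.0   # seconds; assumed cap (the RFC allows any cap >= 60 s)

    def __init__(self, clock_granularity=0.001):
        self.g = clock_granularity
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0   # initial RTO before any RTT measurement

    def on_rtt_sample(self, r):
        """Update the RTO from a new round-trip time measurement (seconds)."""
        if self.srtt is None:
            # First measurement.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR must be updated using the old SRTT, so it goes first.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = min(max(rto, self.MIN_RTO), self.MAX_RTO)
        return self.rto

    def on_timeout(self):
        """Back off: double the RTO after a retransmission timer expires."""
        self.rto = min(self.rto * 2, self.MAX_RTO)
        return self.rto
```

With a 1 second floor, three consecutive losses of the same segment already add up to 1 + 2 + 4 = 7 seconds of waiting, which is longer than many application-level timeouts.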

Dropped packets are most often a symptom of a hardware failure in the network card, network cables, or network devices like switches and routers. They can also be a throughput issue, where you’re sending more data than the network components can handle. When network devices receive traffic faster than they can forward it, they hold the excess in buffers. When the buffers are full, new traffic is dropped.
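
The buffering behavior can be sketched as a fixed-size queue that discards new arrivals once it is full (often called tail drop). This is a simplified model with hypothetical names, not a description of any particular device; real hardware may use other queueing disciplines.

```python
from collections import deque


class TailDropBuffer:
    """Fixed-capacity packet queue that drops new arrivals when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, packet):
        """Queue a packet, or count it as dropped if the buffer is full."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(packet)
        return True

    def forward(self, n):
        """Forward (remove) up to n queued packets; return how many were sent."""
        sent = 0
        while self.queue and sent < n:
            self.queue.popleft()
            sent += 1
        return sent


def simulate(arrivals_per_tick, forward_per_tick, capacity):
    """Return the number of packets dropped for a given arrival pattern."""
    buf = TailDropBuffer(capacity)
    next_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            buf.enqueue(next_id)
            next_id += 1
        buf.forward(forward_per_tick)
    return buf.dropped
```

In this model a device that forwards 2 packets per tick with a 4-packet buffer absorbs a steady 2 packets per tick without loss, but a sustained 3 packets per tick fills the buffer and starts dropping.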


EFFECTS


Possible issues caused by dropped packets include:

  • Increased number of timeouts to databases, services, and caches
  • Missing data in syslog, statsd, or other UDP monitoring tools
  • Increased 5xx HTTP status codes due to unexpected timeouts
  • Difficulties connecting to the server


QUICK FIX


If you’re in a cloud environment, terminate your server and relaunch it, which typically moves it to a different physical server.

If you’re in a physical environment, correlate dropped packet reports across servers to find the component they have in common, which is the likely faulty hardware.
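
One way to do that correlation is sketched below. It assumes you can list, for each server, the components its traffic passes through; the function name and the component labels are hypothetical. It returns the components shared by every server that reports drops and used by no healthy server.

```python
def suspect_components(paths, dropping):
    """Narrow down faulty hardware from dropped-packet reports.

    paths:    {server: iterable of components its traffic passes through}
    dropping: servers currently reporting dropped packets
    """
    dropping = set(dropping)
    if not dropping:
        return set()
    # Components common to every server that is dropping packets.
    shared = set.intersection(*(set(paths[s]) for s in dropping))
    # Components that at least one healthy server also depends on.
    healthy = set()
    for server, components in paths.items():
        if server not in dropping:
            healthy |= set(components)
    return shared - healthy
```

This is only a first pass: a switch with a single bad port, for example, would be excluded because healthy servers also use it, so an empty result does not clear the shared hardware.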


THOROUGH FIX


Start with network components including switches, routers, cables, and gateways. Replace them one at a time while watching the dropped packet counters on servers within the network, so that any change in the drop rate can be attributed to a single component. If throughput is the issue, the network may only drop packets at certain times of day, when traffic is at its peak.
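
To check whether drops cluster around peak hours, you can record the cumulative drop counter at regular intervals and total the increases per hour of day. The sketch below assumes samples are (datetime, counter) pairs sorted by time; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def drops_by_hour(samples):
    """Total newly dropped packets per hour of day.

    samples: list of (datetime, cumulative_drop_count), sorted by time.
    Each increase is attributed to the hour in which the later sample was taken.
    """
    per_hour = defaultdict(int)
    for (_, prev_count), (ts, count) in zip(samples, samples[1:]):
        delta = count - prev_count
        if delta < 0:
            # The counter went backwards, e.g. after a reboot reset it;
            # treat the new value as the drops since the reset.
            delta = count
        per_hour[ts.hour] += delta
    return dict(per_hour)
```

If most of the increase lands in the same few hours every day, throughput is a more likely cause than failing hardware.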

After all network components have been verified, upgrade the firmware and device drivers on the affected servers. Also consider upgrading the OS.

Always implement your own retry logic inside your application so that you, rather than TCP's retransmission timer, control how long a request waits before it is retried.
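
A minimal sketch of such retry logic follows, assuming the operation accepts a timeout in seconds and raises an OSError subclass (such as TimeoutError or ConnectionError) on failure. The function name, parameter names, and default values are illustrative, not recommendations.

```python
import random
import time


def call_with_retries(operation, attempts=3, timeout=0.5, base_delay=0.1,
                      sleep=time.sleep):
    """Call operation(timeout) until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation(timeout)
        except OSError as exc:
            # socket.timeout, TimeoutError and ConnectionError are all
            # subclasses of OSError.
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with jitter, so that many clients
                # do not all retry at the same moment.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    raise last_error
```

For example, operation could be a function that opens a connection with socket.create_connection(address, timeout=timeout). Only retry operations that are safe to repeat, since a request that timed out may still have been processed by the server.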


RESOURCES