Elastic Load Balancers route traffic to your application. You can generally expect a steady stream of requests to your load balancers; even some 400s and 500s are normal. However, the rate at which your load balancers serve requests or produce 400s or 500s can be a good indicator of application health. An anomaly in these metrics can signal problems before they become apparent in other parts of the application. Blue Matador automatically detects these anomalies and alerts you about them.


Request Count


A healthy application should see a relatively stable request rate. An anomalous increase or decrease in request count can each signal a malfunctioning application. Possible causes of changes in request count include:

  • Errors in your server-side code causing retry logic in your clients to make many requests to a failing endpoint
  • A release of buggy client code causing erroneous API requests
  • A malfunctioning cache layer

If the request count increased for legitimate reasons (an increase in users or a new feature), you may need to add targets to your load balancer to handle the increased load.
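The baseline idea can be sketched in a few lines. The example below (plain Python, with made-up per-minute counts and an arbitrary 3-sigma threshold) flags a request count that deviates sharply from the recent baseline; a production detector, like Blue Matador's, also has to account for seasonality and trends.

```python
from statistics import mean, stdev

def is_anomalous(counts, latest, threshold=3.0):
    """Flag a request count more than `threshold` standard
    deviations away from the recent baseline."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical per-minute request counts, then a sudden spike
# (e.g. client retry logic hammering a failing endpoint):
baseline = [1000, 1020, 980, 1010, 995, 1005]
print(is_anomalous(baseline, 1015))  # → False, normal fluctuation
print(is_anomalous(baseline, 5000))  # → True, anomalous spike
```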

 

400s


When the rate of 4xx response codes increases, the problem most likely lies with a buggy client making requests to your ELBs. Possible causes include:

  • Typos in URLs resulting in a spike of 404 errors
  • Parameter names and types changed for REST APIs resulting in 400 errors
  • A bug in authentication code resulting in 401s and 403s
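Because each cause above maps to a different status code, breaking the 4xx responses down by code points at the likely client bug. A minimal sketch, assuming you have a list of status codes pulled from your access logs (the sample codes here are made up):

```python
from collections import Counter

def error_breakdown(status_codes):
    """Group 4xx responses by status code: a spike of 404s suggests
    URL typos, 400s suggest bad parameters, 401/403s suggest auth bugs."""
    counts = Counter(c for c in status_codes if 400 <= c < 500)
    total = len(status_codes)
    return {code: n / total for code, n in counts.items()}

codes = [200, 200, 404, 404, 404, 401, 200, 200, 200, 200]
print(error_breakdown(codes))  # → {404: 0.3, 401: 0.1}
```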

 

500s


When the rate of 5xx response codes increases, the problem is most likely a bug in your server-side code. The increase can often be tied to a specific code release, so your release schedule is the first place to look for clues as to what went wrong. Other possible causes include:

  • Network problems between microservices. 500s often cascade from upstream services down to the client.
  • Bugs in software dependencies
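To check whether a deploy is the culprit, compare the 5xx rate before and after the release time. A rough sketch with hypothetical (minute, was_5xx) samples and a hypothetical release at minute 3:

```python
def error_rate_by_window(samples, release_time):
    """Split (timestamp, is_5xx) samples into before/after a release
    to see whether the deploy correlates with the 5xx spike."""
    before = [e for t, e in samples if t < release_time]
    after = [e for t, e in samples if t >= release_time]
    rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return rate(before), rate(after)

# Hypothetical samples: errors start right at the release
samples = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]
print(error_rate_by_window(samples, release_time=3))  # → (0.0, 1.0)
```

A jump that lines up exactly with the release time strongly suggests rolling back; a jump at an unrelated time points toward network problems or a dependency instead.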

 

Access Logs


Access logs are very helpful when diagnosing issues with ELBs. By default, ELB does not collect access logs, but it can be configured to send them to S3. You can then configure your log management tool (or download the files and use grep) to look for the endpoints that are causing problems.
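If you end up processing the files yourself, the log lines are space-delimited with the request quoted, so they are easy to parse. A sketch for the Classic Load Balancer access log format (field positions per the AWS log format documentation; the sample line is illustrative):

```python
import shlex

def parse_elb_log_line(line):
    """Pull the ELB status code and request out of a Classic ELB
    access log line; shlex.split keeps quoted fields together."""
    fields = shlex.split(line)
    return {
        "elb_status_code": int(fields[7]),
        "request": fields[11],
    }

line = ('2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
        '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
        '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -')
print(parse_elb_log_line(line))
```

Counting parsed lines by request path and status code then surfaces the problem endpoints directly.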

 

Latency


For Classic Load Balancers, the Latency metric in CloudWatch measures the time it takes for a registered instance to send response headers after receiving a request from the load balancer. For Application Load Balancers, the equivalent CloudWatch metric is TargetResponseTime.

An increase in latency can indicate a performance issue with your application. If traffic patterns for your load balancer have not changed significantly, check whether a downstream service such as a database is experiencing high latency and propagating that delay to your web servers. If you have seen an increase in traffic, it is possible that your instances are overloaded, and adding targets to the load balancer may help. For low-traffic load balancers, it is also possible that the average latency is thrown off by a few requests taking a very long time.
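That last effect is easy to demonstrate: compare the average to a percentile. In the made-up sample below, two very slow requests out of a hundred inflate the average more than sixfold while the median stays flat, which is why percentiles are usually more telling than averages on low-traffic load balancers.

```python
from statistics import mean, quantiles

# Hypothetical latencies: 98 fast requests, 2 very slow ones
latencies_ms = [20] * 98 + [5000, 6000]

avg = mean(latencies_ms)
p50 = quantiles(latencies_ms, n=100)[49]  # 50th percentile
print(avg)  # → 129.6: average dominated by two outliers
print(p50)  # → 20.0: the typical request is still fast
```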

 

Bytes Processed


For Application load balancers, the ProcessedBytes metric measures the total number of bytes going in and out of the load balancer. A change in this metric can be caused by two things:

  • Request rate has increased/decreased
  • Request or response sizes have increased/decreased

Anomalies with bytes processed are mostly useful for correlating other issues.
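Dividing ProcessedBytes by RequestCount for the same period distinguishes the two causes: if the ratio is flat, traffic volume changed; if the ratio moved, payload sizes changed. A trivial sketch with made-up numbers:

```python
def avg_payload_size(processed_bytes, request_count):
    """Bytes per request over a window: separates a traffic change
    from a payload-size change behind the same ProcessedBytes jump."""
    return processed_bytes / request_count

# Same ProcessedBytes, two different explanations:
print(avg_payload_size(10_000_000, 10_000))  # → 1000.0 bytes/request
print(avg_payload_size(10_000_000, 2_000))   # → 5000.0: payloads grew
```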

 
