Top 4 must-monitor API Gateway metrics

Intro

Marc Andreessen famously said, “Software is eating the world.” Dr. Steve Willmott subsequently retorted, “APIs are eating software.” That is because API-based architectures let companies move away from monoliths and toward microservices. Innovative, disruptive companies like Netflix, Airbnb, Uber, Square, and Slack all build their infrastructure and technology using APIs.

Managing APIs brings several challenges: handling multiple versions, monitoring third-party developers’ access, authorization, traffic spikes, and more. That’s why many teams use an API gateway instead of handling all of that manually.

Amazon didn’t invent API gateways, but Amazon API Gateway has seen a steady increase in popularity since launching in 2015 due to its serverless nature.

If you’re building a serverless infrastructure and standardizing on Amazon API Gateway, here are the four key metrics you should monitor to ensure optimal API Gateway performance.

4xx responses in API Gateway

When clients send requests to an API served by API Gateway, they may receive client errors, indicated by a 4xx HTTP response code. A 4xx response means there is a problem with the client request, like an authentication failure or a missing required parameter. The API Gateway documentation provides details on these error codes and how to handle them.

Getting some 4xx responses from an API is normal, but when your API responds with more 4xx errors than usual, it could signal major issues in your application. CloudWatch exposes client-side errors through the 4XXError metric and provides the Sum (i.e., the total count of 4xx responses in a given period) and the Average (i.e., the 4xx error rate in a given period).
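To make the Sum and Average statistics concrete, here is a minimal sketch of a CloudWatch query for the 4XXError metric. It only builds the request parameters and tidies up the returned datapoints, so it runs without AWS credentials. The API name "orders-api", the three-hour window, and the helper names are assumptions made for illustration, and the ApiName dimension assumes a REST API.

```python
from datetime import datetime, timedelta, timezone


def build_4xx_query(api_name, end_time, hours=3, period_seconds=300):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    api_name is a hypothetical REST API name; replace it with your own.
    """
    return {
        "Namespace": "AWS/ApiGateway",
        "MetricName": "4XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_seconds,
        # Sum = number of 4xx responses, Average = 4xx error rate.
        "Statistics": ["Sum", "Average"],
    }


def summarize(datapoints):
    """Return (timestamp, count, rate) tuples sorted by time.

    CloudWatch does not guarantee datapoint order, so sort explicitly.
    """
    rows = [(d["Timestamp"], int(d["Sum"]), d["Average"]) for d in datapoints]
    return sorted(rows)


params = build_4xx_query(
    "orders-api", datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
)
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("cloudwatch").get_metric_statistics(**params)["Datapoints"]
```

Passing the resulting datapoints to summarize gives one row per period, which makes it easier to compare the error count and the error rate side by side.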

You’ll want to watch for anomalies in these metrics to ensure the count and rate of 4xx errors are not increasing.

To troubleshoot anomalous 4xx responses, enable CloudWatch Logs for your API and determine which endpoints are returning 4xx responses.
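Once logs are flowing, finding the offending endpoints is mostly a matter of grouping. The sketch below assumes access logging is enabled with a JSON log format that includes the httpMethod, resourcePath, and status fields (populated from the $context variables of the same names). The function name and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def errors_by_endpoint(log_lines, status_class="4"):
    """Count responses per (method, path) whose status starts with status_class."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON access-log entries
        if not isinstance(entry, dict):
            continue
        if str(entry.get("status", "")).startswith(status_class):
            counts[(entry.get("httpMethod"), entry.get("resourcePath"))] += 1
    # Most frequent offenders first.
    return counts.most_common()


# Hypothetical access-log lines in the assumed JSON format.
sample = [
    '{"httpMethod": "GET", "resourcePath": "/orders/{id}", "status": "404"}',
    '{"httpMethod": "GET", "resourcePath": "/orders/{id}", "status": "404"}',
    '{"httpMethod": "POST", "resourcePath": "/orders", "status": "400"}',
    '{"httpMethod": "GET", "resourcePath": "/orders", "status": "200"}',
]
print(errors_by_endpoint(sample))
# [(('GET', '/orders/{id}'), 2), (('POST', '/orders'), 1)]
```

Calling errors_by_endpoint(lines, status_class="5") gives the same breakdown for server errors.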

5xx responses in API Gateway

5xx responses indicate server errors. You could be seeing these because of a bug released in the code behind your API or because your endpoint is timing out.

As with 4xx responses, seeing a few 5xx responses is probably normal for your API. But you’ll want to know if there is an anomalous number of 5xx responses for a sustained amount of time. CloudWatch provides the same Sum and Average statistics for the 5XXError metric.
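One way to express "anomalous for a sustained amount of time" is a CloudWatch alarm that only fires after several consecutive periods breach a threshold. The sketch below builds the parameters for such an alarm. The 5% rate, the 15-minute window, the API name, and the alarm name are placeholder assumptions, and a static threshold like this is only a starting point that you would tune to your own baseline.

```python
def build_5xx_alarm(api_name, rate_threshold=0.05, minutes=15, period_seconds=300):
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    The alarm fires only when the 5xx error rate stays above
    rate_threshold for every period in the window.
    """
    periods = max(1, (minutes * 60) // period_seconds)
    return {
        "AlarmName": f"{api_name}-5xx-rate",  # hypothetical naming scheme
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Average",  # Average of 5XXError is the error rate
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,  # every period must breach
        "Threshold": rate_threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an error
    }


alarm = build_5xx_alarm("orders-api")
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring every period in the window to breach trades slower detection for fewer alerts on brief blips.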

And just as with 4xx responses, to troubleshoot anomalous 5xx responses, enable CloudWatch Logs and figure out which endpoints are returning 5xx responses.

API Gateway request count

API Gateway counts the requests made to your API and publishes the total to CloudWatch as the Count metric. This metric includes every request, including those that result in an error response. Because request volume helps determine your bill, it’s important to monitor for any major changes.

If you are experiencing more requests than usual, the following might be true:

A bug in your application code is causing erroneous requests to a particular endpoint or set of endpoints.
A bug in an endpoint is returning error responses, causing a large number of retries.

The following are some potential causes of fewer requests than expected:

The application code that calls an endpoint may be malfunctioning and not making requests.
Permissions issues may be keeping your application from being able to call your API.

In each case, you’ll want to watch for anomalous changes so you know about potentially expensive fluctuations in request count. Enabling CloudWatch Logs helps you find the endpoint that is receiving the anomalous number of requests.
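Because request count can be a problem in either direction, a useful check flags both spikes and drops relative to a recent baseline. Here is a minimal sketch using a z-score over past request counts. The threshold of 3 standard deviations and the sample numbers are assumptions, and real traffic with daily or weekly seasonality would need a baseline taken from comparable time windows.

```python
from statistics import mean, stdev


def classify_request_count(history, current, z_threshold=3.0):
    """Label the current request count as 'high', 'low', or 'normal'.

    history holds request counts from comparable past periods.
    """
    if len(history) < 2:
        return "unknown"  # not enough data to estimate a spread
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat baseline: any deviation counts as a change.
        if current == baseline:
            return "normal"
        return "high" if current > baseline else "low"
    z = (current - baseline) / spread
    if z > z_threshold:
        return "high"
    if z < -z_threshold:
        return "low"
    return "normal"


# Hypothetical request counts per five-minute period.
history = [1000, 1040, 980, 1010, 995, 1020]
print(classify_request_count(history, 2500))  # high
print(classify_request_count(history, 100))   # low
print(classify_request_count(history, 1015))  # normal
```

A "high" result points toward the retry and erroneous-request causes listed above, while a "low" result points toward a malfunctioning caller or a permissions issue.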

Request latency in API Gateway

Latency measures the amount of time between when your API receives a request and when it responds to the request. This metric is important to monitor because you likely have performance requirements for your application, and higher latency often points to bugs.

When debugging latency issues, first determine whether your code is the source of the latency. To do so, compare the IntegrationLatency and Latency metrics in CloudWatch.

IntegrationLatency measures only the time it takes your API endpoint to return a result, while Latency measures the end-to-end time for the request. If the two metrics are mostly the same, your code is the source of the latency. If IntegrationLatency is much lower than Latency, the extra time is being spent inside API Gateway rather than in your code, and you will likely have to wait for AWS to fix the issue.
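The comparison boils down to simple arithmetic: the gap between Latency and IntegrationLatency is the time not spent in your endpoint. The sketch below labels where most of the time goes. The 50% cutoff and the function name are arbitrary assumptions for illustration, so pick a ratio that matches what "much lower" means for your API.

```python
def latency_source(latency_ms, integration_latency_ms, overhead_ratio=0.5):
    """Return (source, overhead_ms, overhead_share) for one pair of metrics.

    source is 'integration' when most of the time is spent in your
    endpoint and 'api_gateway' when most of it is spent outside it.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    # Overhead is the part of the end-to-end time not spent in the endpoint.
    overhead_ms = max(latency_ms - integration_latency_ms, 0.0)
    share = overhead_ms / latency_ms
    source = "api_gateway" if share >= overhead_ratio else "integration"
    return source, overhead_ms, share


# Hypothetical average values in milliseconds.
print(latency_source(820, 790)[0])  # integration
print(latency_source(900, 200)[0])  # api_gateway
```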

If the latency is coming from your application, use CloudWatch Logs to find the endpoint that is taking longer to execute. The log format you enable should contain the $context.responseLatency variable so you can view how long the requests took.
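As a rough illustration, the sketch below defines a JSON access log format that includes $context.responseLatency and then ranks endpoints by average latency from logged lines. The choice of fields, the function name, and the sample lines are assumptions, and your own log format may carry different fields.

```python
import json
from collections import defaultdict
from statistics import mean

# A possible access log format; API Gateway fills in the $context variables.
ACCESS_LOG_FORMAT = json.dumps({
    "requestId": "$context.requestId",
    "httpMethod": "$context.httpMethod",
    "resourcePath": "$context.resourcePath",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
})


def slowest_endpoints(log_lines, top=3):
    """Return [((method, path), avg_ms, max_ms)], slowest average first."""
    latencies = defaultdict(list)
    for line in log_lines:
        entry = json.loads(line)
        key = (entry["httpMethod"], entry["resourcePath"])
        latencies[key].append(float(entry["responseLatency"]))
    ranked = sorted(
        ((mean(values), max(values), key) for key, values in latencies.items()),
        reverse=True,
    )
    return [(key, avg, worst) for avg, worst, key in ranked[:top]]


# Hypothetical log lines in the format above.
sample_lines = [
    '{"httpMethod": "GET", "resourcePath": "/orders", "responseLatency": "120"}',
    '{"httpMethod": "GET", "resourcePath": "/orders", "responseLatency": "180"}',
    '{"httpMethod": "POST", "resourcePath": "/checkout", "responseLatency": "950"}',
    '{"httpMethod": "POST", "resourcePath": "/checkout", "responseLatency": "1050"}',
]
print(slowest_endpoints(sample_lines))
# [(('POST', '/checkout'), 1000.0, 1050.0), (('GET', '/orders'), 150.0, 180.0)]
```

Reporting the maximum alongside the average helps separate an endpoint that is consistently slow from one with occasional outliers.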

Conclusion

Monitoring these four key metrics for Amazon API Gateway will put you in the driver’s seat for application performance. While it is possible to set up alerts for these metrics with a traditional monitoring tool, the process will be laborious and the result could be suboptimal. For instance, manually setting up an alert for request count with a static threshold could result in noise and false positives.

That’s why we recommend giving Blue Matador a try. Out of the box, Blue Matador’s pre-configured dynamic alerts and sophisticated algorithms automatically monitor these key metrics for Amazon API Gateway, in addition to hundreds of other events in your AWS environment. Get started with your 14-day trial and see how easy it is to monitor Amazon API Gateway with Blue Matador.