There’s a lot out there about Amazon CloudWatch already, but since Amazon’s default EC2 monitoring service pushes regular updates, most of what you’ll find will be dated (last updated: 9/26/2018). Let us know what you think about our take on CloudWatch 101. The plan is to get you up to speed in a hurry.
What is CloudWatch?
Amazon CloudWatch monitors your Amazon Web Services (AWS) resources. CloudWatch is configured for EC2 out of the box. Add the CloudWatch Agent to collect system-level metrics and log files from EC2 instances and on-premises servers.
Essentially, CloudWatch is an archive built to store AWS metrics’ time series data. CloudWatch converts raw data feeds into digestible, actionable information. CloudWatch provides a set of pre-defined EC2 variables for free. The free tier also lets you graph and alert on these metrics.
Their paid service allows you to access, graph, and alert on additional metrics—including your own custom metrics—through the console, command line, or API. (The free tier limits you to console access only.) If you’re on AWS and haven’t taken detailed monitoring (paid version) out for a spin yet, it’s definitely worth the drive.
What’s the latest from CloudWatch?
Here are some of the latest features that CloudWatch has released:
- VPC Support - AWS now supports private connections between your VPC and CloudWatch.
- Metric Math - Metric Math gives you the ability to create derived metrics from existing metrics. Graph derived metrics’ time series data and add it to your dashboard. Yeah, math!
- “M of N” Trigger - This update adds the ability to configure the number of datapoints to alarm on, M, and the number of evaluation periods, N, separately.
- CloudWatch Agent - This new agent unifies system metrics and log files from EC2 instances and On-Prem servers. The agent supports both Windows and Linux.
You can get the latest feature updates from CloudWatch here.
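Of the features listed above, Metric Math is the easiest to picture with a small example. The sketch below derives a new series from two existing ones on plain Python lists. It does not call CloudWatch, and the metric names (`errors`, `requests`) and the `derive_error_rate` helper are hypothetical.

```python
from typing import List, Optional


def derive_error_rate(errors: List[float], requests: List[float]) -> List[Optional[float]]:
    """Return the error percentage for each aligned data point.

    Both lists are assumed to cover the same periods in the same order.
    A period with zero requests yields None instead of dividing by zero.
    """
    if len(errors) != len(requests):
        raise ValueError("series must be aligned and the same length")
    return [e * 100 / r if r else None for e, r in zip(errors, requests)]


# Hypothetical per-period values for two existing metrics.
errors = [2, 0, 5, 1]
requests = [100, 80, 0, 50]
error_rate = derive_error_rate(errors, requests)
```

The derived series can then be graphed or alarmed on like any other metric, which is the point of the feature.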
How does CloudWatch work?
Amazon CloudWatch serves as a metrics repository for other AWS services. By default, EC2 pushes metrics to CloudWatch for later retrieval and real-time analysis. CloudWatch can also store and retrieve statistics passed from custom EC2 variables, other AWS services, and, most recently, On-Prem servers.
Here’s a visual representation of how CloudWatch operates within the larger AWS ecosystem. (This diagram is from the official CloudWatch User Guide).
Important CloudWatch Concepts
CloudWatch’s core concepts are worth learning first, because much of the CloudWatch documentation assumes that you’re already somewhat familiar with the product.
Metrics are the most basic building block of CloudWatch. A metric is a variable that stores a time series data set. AWS services push metrics to CloudWatch. You can then get useful information about those metrics from CloudWatch.
Metrics:
- are specific to an AWS region.
- can’t be deleted.
- expire after 15 months of no new data points.
- are defined by a unique name/namespace/dimension combination.
CloudWatch retains:
- High-resolution metric data for 3 hours.
- Detailed standard metric data for 15 days.
- Basic standard metric data for 63 days.
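The retention figures above can be kept as a small lookup table. This is a minimal sketch, not a CloudWatch API: the tier names are taken from the list, and the `is_still_retained` helper is hypothetical.

```python
from datetime import timedelta

# Retention windows as listed above; the tier names are labels for this sketch only.
RETENTION_BY_TIER = {
    "high-resolution": timedelta(hours=3),
    "detailed": timedelta(days=15),
    "basic": timedelta(days=63),
}


def is_still_retained(tier: str, age: timedelta) -> bool:
    """Return True if a data point of the given age is still inside its tier's window."""
    return age <= RETENTION_BY_TIER[tier]
```

For example, `is_still_retained("high-resolution", timedelta(hours=4))` is false, so a query for that data at full resolution would come back empty.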
It’s hard to have a time series without a timestamp for each metric data point. CloudWatch allows for timestamps from two weeks in the past to two hours into the future. If you don’t send a timestamp with your metric data points, CloudWatch creates a timestamp for you and sets it to the current time (UTC).
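If you build your own data points, a local check like the following can catch bad timestamps before you send them. This is a sketch of the rule described above, not CloudWatch code; `resolve_timestamp` and its `now` parameter (there to make the check testable) are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_PAST = timedelta(weeks=2)    # oldest accepted timestamp, per the text above
MAX_FUTURE = timedelta(hours=2)  # furthest accepted future timestamp


def resolve_timestamp(ts: Optional[datetime] = None,
                      now: Optional[datetime] = None) -> datetime:
    """Return a usable timestamp, defaulting to the current UTC time.

    Timestamps must be timezone-aware. Raises ValueError when ts falls
    outside the accepted window.
    """
    now = now or datetime.now(timezone.utc)
    if ts is None:
        return now
    if ts < now - MAX_PAST or ts > now + MAX_FUTURE:
        raise ValueError("timestamp outside the accepted window")
    return ts
```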
A namespace is a CloudWatch metrics container. Namespaces are useful if you want to avoid aggregating two different metrics with the same name. Every metric data point needs to be assigned to a namespace. CloudWatch won’t assign metrics to a default namespace for you. Namespaces for AWS services use AWS/service as their naming convention (AWS/EC2, for example).
Alarms invoke actions only for sustained state changes: the state must change and stay changed for a specified period of time. CloudWatch won’t invoke an action simply because an alarm is in a particular state.
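One way to picture this persistence requirement is the “M of N” rule mentioned earlier: the alarm fires only if enough recent data points breach the threshold. The sketch below is a simplified model, assuming a greater-than comparison and no missing data; `is_alarming` is a hypothetical helper, not how CloudWatch implements evaluation.

```python
from typing import Sequence


def is_alarming(values: Sequence[float], threshold: float, m: int, n: int) -> bool:
    """Return True if at least m of the last n data points exceed threshold."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    recent = values[-n:]
    breaching = sum(1 for v in recent if v > threshold)
    return breaching >= m


# A single spike does not trip a 3-of-5 alarm, but a sustained breach does.
single_spike = is_alarming([10, 10, 10, 10, 99], threshold=90, m=3, n=5)
sustained = is_alarming([10, 95, 20, 91, 97], threshold=90, m=3, n=5)
```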
Here are more attributes that all alarm types have in common:
- CloudWatch Alarm Limits: 5,000 alarms per region per AWS account; one-day maximum monitoring period
- CloudWatch Alarm Properties: state (enabled, disabled); history (stored for two weeks)
- CloudWatch Alarm Access: list configured alarms; filter by state, time range
- CloudWatch Alarm Testing: temporary state change for a single alarm comparison period
Below are some important differences in alarm types.
Status Check Alarms
Status Check Alarms trigger when a status check fails (the status check metric is set to 1; a value of 0 means the check passed). A Status Check Alarm can watch the system status check, the instance status check, or both.
High-Resolution Alarms
A High-Resolution Alarm is tied to a high-resolution metric. Because high-resolution metrics can update as often as every second, High-Resolution Alarms can be triggered based on metric values within a ten-second period. For more information about high-resolution metrics, see AWS CloudWatch Configuration Guide: CloudWatch Custom Metrics.
Percentile-Based CloudWatch Alarms
By default, alarms rely on a sound statistical assessment of the metric being monitored. Percentile-based CloudWatch Alarms address the challenge of monitoring a metric when there’s not enough data for a good statistical assessment.
A dimension is a piece of metric metadata in the form of a name/value pair. Metrics can have up to ten dimensions. When you set dimensions, AWS services send both data and metadata to CloudWatch.
Dimensions can be useful for filtering data and aggregating statistics. CloudWatch treats metrics across different namespaces as different metrics even if they have the same dimensions. (CloudWatch can’t aggregate across dimensions for custom metrics.)
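A metric’s identity can be modeled as a key built from its namespace, name, and dimensions. This is a minimal sketch of that idea, not CloudWatch’s internal representation; `metric_key`, the `MyApp` namespace, and the instance ID are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

MetricKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]

MAX_DIMENSIONS = 10  # limit stated in the text above


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> MetricKey:
    """Build a hashable identity for a metric; dimension order does not matter."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError("a metric can have at most ten dimensions")
    return (namespace, name, frozenset(dimensions.items()))


# Same name and dimensions, different namespaces: two distinct metrics.
aws_cpu = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-123"})
app_cpu = metric_key("MyApp", "CPUUtilization", {"InstanceId": "i-123"})
```

Because the keys differ, data points for `aws_cpu` and `app_cpu` would never be aggregated together.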
Percentiles are useful in identifying outliers and periods of high demand. A standard approach for finding outliers is to look for data points three standard deviations from a metric’s average. Persistent metric data points above the 95th percentile point to a period of high use, regardless of what resource utilization looks like.
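Both ideas fit in a few lines of standard-library Python. The sketch below uses the nearest-rank method for percentiles, which is one common definition and not necessarily the one CloudWatch uses; `percentile` and `outliers` are hypothetical helpers.

```python
import math
import statistics
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def outliers(values: Sequence[float], sigmas: float = 3.0) -> List[float]:
    """Return values more than `sigmas` population standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > sigmas * spread]


latencies = [10] * 20 + [100]   # made-up sample: steady values plus one spike
p95 = percentile(range(1, 101), 95)
spikes = outliers(latencies)
```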
You can use percentiles with the following AWS services:
- Amazon EC2
- Amazon RDS
- Application Load Balancer
- Elastic Load Balancing
- API Gateway
There are some limits to percentile statistics. You can’t compute percentiles if any metric data point in the time series has a negative value. Also, percentiles don’t work on pre-aggregated data sets (statistic sets) pushed to CloudWatch.
Statistics aggregate time series data points across a specified time period. Available statistics include: minimum, maximum, sum, average, count, and percentile. You can also push your own statistics to CloudWatch.
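To make the aggregation step concrete, this sketch groups raw data points into fixed-length periods and computes the basic statistics for each one. It is a local illustration, not CloudWatch’s implementation; `aggregate` is a hypothetical helper, and timestamps are plain epoch seconds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(points: Iterable[Tuple[int, float]],
              period: int = 60) -> Dict[int, Dict[str, float]]:
    """Group (timestamp, value) points by period start and summarize each group."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % period].append(value)  # align to the start of the period
    return {
        start: {
            "minimum": min(vals),
            "maximum": max(vals),
            "sum": sum(vals),
            "average": sum(vals) / len(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }


points = [(0, 1.0), (30, 3.0), (60, 10.0), (90, 20.0), (119, 30.0)]
stats = aggregate(points, period=60)
```

Here the five points collapse into two periods, starting at 0 and at 60 seconds.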
The default period is 60 seconds. Valid values for a period are 1, 5, 10, 30, or any multiple of 60. All statistic time period requests use seconds as the unit of time. The default time range is the last hour.
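The period rule above is easy to check before you make a request. A minimal sketch, with `is_valid_period` as a hypothetical helper:

```python
def is_valid_period(seconds: int) -> bool:
    """Return True for 1, 5, 10, 30, or any positive multiple of 60 seconds."""
    return seconds in (1, 5, 10, 30) or (seconds > 0 and seconds % 60 == 0)
```

So `is_valid_period(300)` is true, while `is_valid_period(45)` and `is_valid_period(90)` are false.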
If you’re getting two statistics for metrics with the same name, namespace and dimension(s), you might want to check to make sure your units are the same across all metric data points. (If you don’t set units for custom metrics, CloudWatch sets the unit value to “None”.)
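The sketch below shows how mixed units can split one metric into several results. It assumes, as the paragraph above suggests, that data points with different units are aggregated separately; `sum_by_unit` is a hypothetical helper and the values are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def sum_by_unit(points: Iterable[Tuple[float, Optional[str]]]) -> Dict[str, float]:
    """Sum values per unit; a missing unit is recorded as "None"."""
    totals: Dict[str, float] = defaultdict(float)
    for value, unit in points:
        totals[unit or "None"] += value
    return dict(totals)


# One logical metric published with inconsistent units.
mixed = sum_by_unit([
    (1.0, "Seconds"),
    (2.0, "Seconds"),
    (500.0, "Milliseconds"),
    (3.0, None),
])
```

Three separate totals come back instead of one, which is the symptom to look for.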