Kubernetes disaster prevention and recovery

Yeah, Kubernetes is great at making sure your workloads run as needed. But another of its amazing benefits is its ability to recover from failure all by itself.

On an everyday basis, Kubernetes takes care of the complicated task of container orchestration. However, as with any complicated system, there is always the chance that you’ll experience failures and downtime.

The failure could come from a hardware problem on a node, a bug in your code, operator error, data loss on the etcd cluster, a natural disaster, or even an alien invasion.

Don't let aliens take you down.

Thus, it's a good idea to have a plan at the ready to recover the working state of your Kubernetes cluster just in case.

Tips to prevent a disaster on Kubernetes

Ok, so it's good to have a backup plan. That's clear. But it is even more important to reduce the likelihood of a disaster happening in the first place. Here are a few tips for keeping your Kubernetes deployment reliable.

Before you deploy Kubernetes, we recommend that you talk with other companies about how they did it and problems to watch out for. Check Stackshare.io to find other companies running Kubernetes. Another great resource is the Kubernetes community, which has further links to Twitter, Stack Overflow, and forums.

We also suggest that you use a managed Kubernetes service like Amazon EKS. You can create a highly available Kubernetes system on your own from scratch, but it is going to take a lot of work. EKS also takes care of many of the backup and recovery tasks that we cover below.

In Kubernetes, all the data about your cluster’s state is stored in etcd, so keeping the etcd cluster reliable should be a priority. Managed services like EKS take care of that for you.

Avoid a single point of failure (SPOF). This is especially important on key components like etcd or the master nodes hosting the control plane. It is best practice to replicate your system in odd numbers. Replicating your control plane across three nodes is considered the minimum configuration for high availability. Your etcd replicas should be isolated and placed on dedicated nodes. We recommend at least a 5-node etcd cluster in production.
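The odd-number advice follows from how etcd reaches consensus: every write needs agreement from a majority (a quorum) of members. Here is a minimal Python sketch of that arithmetic. The function names are ours, chosen for illustration; they are not part of any Kubernetes or etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still has a quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

A 4-member cluster tolerates one failure, exactly like a 3-member cluster, so the extra even member adds cost without adding resilience. Going to 5 members raises the tolerance to two failures.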

What should I back up on Kubernetes?

Once you’ve done all you can to make your app highly reliable in the first place, it is time to prepare for the worst. Let’s start by backing up everything you need to get back up and running without losing anything. Here is a list of the things to consider backing up in case of a failure.

etcd
Storage and data
Worker nodes
Security

How to back up etcd on Kubernetes

Data about the configuration and state of the cluster lives in a key-value store called etcd, which acts as the backing store for your control plane. A complete failure of etcd is super rare, but it's still a good idea to back up etcd. If you are running Kubernetes in a managed service like Amazon EKS, you probably won’t have direct access to etcd or even the disks that are backing etcd; these services take care of all things etcd for you. If you run etcd yourself on cloud instances, you can back it up by simply taking a snapshot of the storage volume of each etcd node.

If you're on-prem, then backing up etcd depends on how you set up etcd in your Kubernetes environment. Essentially, there are two ways it can be set up: as an internal etcd cluster running as containers and pods in your environment, or as an external cluster.

We recommend you store as much as you can outside the cluster so you can recreate it all when necessary. You can take snapshots of etcd using the etcdctl snapshot save command or by copying the member/snap/db file from an etcd data directory.
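As a rough illustration of the etcdctl snapshot save approach, here is a small Python wrapper that builds and runs the command with a timestamped file name. The endpoint and certificate paths are assumptions based on kubeadm's default layout, and the function names are hypothetical; adjust them to match how etcd is set up in your environment.

```python
import os
import subprocess
from datetime import datetime, timezone


def snapshot_command(dest_dir, endpoint="https://127.0.0.1:2379",
                     pki_dir="/etc/kubernetes/pki/etcd", now=None):
    """Build the argv for `etcdctl snapshot save` with a timestamped target.

    The default endpoint and pki_dir are assumptions (kubeadm defaults).
    """
    now = now or datetime.now(timezone.utc)
    target = os.path.join(dest_dir, f"etcd-{now:%Y%m%dT%H%M%SZ}.db")
    return [
        "etcdctl", "snapshot", "save", target,
        f"--endpoints={endpoint}",
        f"--cacert={os.path.join(pki_dir, 'ca.crt')}",
        f"--cert={os.path.join(pki_dir, 'server.crt')}",
        f"--key={os.path.join(pki_dir, 'server.key')}",
    ]


def take_snapshot(dest_dir):
    """Run the snapshot and return the snapshot path.

    Raises subprocess.CalledProcessError if etcdctl exits non-zero.
    """
    cmd = snapshot_command(dest_dir)
    env = dict(os.environ, ETCDCTL_API="3")  # use the v3 API
    subprocess.run(cmd, check=True, env=env)
    return cmd[3]
```

Calling take_snapshot("/var/backups/etcd") on an etcd node would write a file such as etcd-20240102T030405Z.db. Copy the result off the node afterward, since a snapshot that lives only on the failed machine won't help you.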

How to back up Kubernetes storage and data

Persistent volumes

It wouldn’t do any good to recover Pods if the persistent data associated with those Pods was lost. Backing up this data depends on the environment you are running Kubernetes in. If you are using a cloud provider, it may be as simple as reattaching any persistent volumes to their respective pods.

To back up persistent volumes in AWS, you should back up the EBS volumes that back the persistent volumes. You can do that by taking snapshots of your EBS volumes, which are stored in Amazon S3. See the AWS documentation on EBS snapshots for details.
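To snapshot the right volume, you first need the EBS volume ID behind each persistent volume. The sketch below assumes you have the PersistentVolume object as a dictionary (for example, parsed from kubectl get pv NAME -o json) and that the AWS CLI is installed. The helper names are hypothetical.

```python
import subprocess


def ebs_volume_id(pv: dict):
    """Return the EBS volume ID behind a PersistentVolume, or None."""
    spec = pv.get("spec", {})
    csi = spec.get("csi")
    if csi and csi.get("driver") == "ebs.csi.aws.com":
        return csi.get("volumeHandle")
    in_tree = spec.get("awsElasticBlockStore")
    if in_tree:
        # In-tree IDs may look like aws://us-east-1a/vol-0abc...;
        # keep only the last path segment.
        return in_tree["volumeID"].rsplit("/", 1)[-1]
    return None  # not an EBS-backed volume


def create_snapshot_command(volume_id, description):
    """Build the argv for `aws ec2 create-snapshot`."""
    return ["aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", description]


def snapshot_pv(pv: dict):
    """Snapshot the EBS volume behind `pv`; returns False if it has none."""
    volume_id = ebs_volume_id(pv)
    if volume_id is None:
        return False
    name = pv.get("metadata", {}).get("name", "unknown")
    subprocess.run(create_snapshot_command(volume_id, f"backup of PV {name}"),
                   check=True)
    return True
```

Snapshots taken while an application is writing may not be application-consistent, so for databases you may want to pause or flush writes first.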

Local data

This is probably going to be a more common situation for bare-metal deployments where critical data has been persisted to a node’s local disk. This can even happen without your knowledge. Avoid relying on local storage for anything in your cluster that must be preserved. Always use a separate data store as the source of truth.

How to back up Kubernetes worker nodes

Worker nodes are replaceable in Kubernetes, but when it comes to disaster recovery, you should have a process in place to recreate worker nodes. If you're using a cloud provider, you should be able to just spin up a new instance with the parameters you need and join it to the control plane. For bare-metal environments, it will be somewhat more difficult, so you should create a well-thought-out strategy beforehand.

Other stuff to back up: Security

Kubernetes keeps most of the cluster’s operational state in the etcd cluster. However, this isn’t the only state we need to recover. The other assets we need to worry about are:

All PKI assets used by the Kubernetes API server. You can find these in the directory /etc/kubernetes/pki. Run the sudo cp -r /etc/kubernetes/pki backup/ command on your master node to copy the folder containing all the certificates that kubeadm creates. These are the certificates used in your cluster for secure communication between components.
Any Secret encryption keys. These keys are stored in a static file specified with the --encryption-provider-config flag on the API server (--experimental-encryption-provider-config on older Kubernetes releases). If these keys are lost, any Secret data encrypted with them is not recoverable.
Any administrator credentials. Most deployment tools create static administrator credentials and provide them in a kubeconfig file. Although these may be recreated, securely storing them off-cluster will reduce recovery time.
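If you would rather have a single timestamped file than a loose copy of the PKI folder, something like the following Python sketch could work. The function name and archive prefix are made up for illustration, and it must run with enough privileges to read the source directory.

```python
import os
import tarfile
from datetime import datetime, timezone


def archive_directory(src_dir, dest_dir, prefix="k8s-pki", now=None):
    """Pack src_dir into a timestamped .tar.gz in dest_dir; return its path."""
    now = now or datetime.now(timezone.utc)
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, f"{prefix}-{now:%Y%m%dT%H%M%SZ}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, e.g. pki/ca.crt
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    # The archive contains private keys: restrict it to the owner.
    os.chmod(archive, 0o600)
    return archive


if __name__ == "__main__":
    print(archive_directory("/etc/kubernetes/pki", "backup"))
```

Because the archive holds private keys, treat it like any other secret: encrypt it and store it off-cluster with tightly controlled access.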

Kubernetes disaster recovery

At Blue Matador, we run our own production environment on Kubernetes using AWS, and we have all of it configured in Terraform. This makes recreating our environment relatively easy should something happen. We recommend using an infrastructure configuration tool like Terraform, CloudFormation, Chef, Puppet, etc., that fits your needs and helps you do the same.

We also advise you to test your disaster recovery plan at least annually. Having theoretical backups, locations of files, and the items we outlined above is great, but if you never have a dry run, then you risk failure when the real time comes.
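A full dry run means restoring into a scratch environment, and the details depend on your setup. One small piece you can automate in between dry runs is checking that backups actually exist and are recent. The sketch below assumes backups land in a single directory and share a file suffix; both the layout and the function names are hypothetical.

```python
import os
import time


def newest_backup_age(backup_dir, suffix=".db", now=None):
    """Age in seconds of the newest file ending in `suffix`, or None."""
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file() and entry.name.endswith(suffix)
    ]
    if not mtimes:
        return None  # no backups found at all
    return now - max(mtimes)


def backup_is_fresh(backup_dir, max_age_seconds=24 * 3600,
                    suffix=".db", now=None):
    """True if at least one backup exists and is newer than the limit."""
    age = newest_backup_age(backup_dir, suffix=suffix, now=now)
    return age is not None and age <= max_age_seconds
```

A check like this only tells you a file was written recently. It does not prove the backup can be restored, which is why the dry run still matters.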

Another option to consider, although it is expensive, is a “hot” replica. This involves running either a skeleton environment (control plane running, secrets set up, everything but the pods) or a full environment, pods included, in another region or availability zone that you can switch traffic to in a disaster. We would only recommend this for mission-critical clusters where you can actually switch clusters easily, meaning there is no local storage and no persistent volumes.

Summary

We hope this at least helps you start down the right path toward a disaster recovery plan for your Kubernetes environment. Another way to help you keep your Kubernetes environment running smoothly is by using Blue Matador. Our tool offers fully automated Kubernetes monitoring with zero configuration and zero maintenance. It is the fastest and easiest way to get actionable insights into problems with your cluster.