A Kubernetes pod is a group of containers with shared storage, network, and cgroup that are always scheduled to run on the same node. A pod is also the smallest deployable unit of compute that can be created and managed by Kubernetes.

Waiting Container

When a pod cannot be scheduled or has containers that are waiting, the pod will be stuck in the Pending state. Common reasons for a container to be waiting include:

  • ImagePullBackOff: The image defined in a container is not available. This is usually either a typo in the name or tag of the image, or an issue authenticating to a private Docker registry.
  • Large Image: If the image for the container is large and the network is slow, it could take a significant amount of time to download the image from the repository.
  • readinessProbe: If a readiness probe is defined in the container spec, the container will not be marked ready until the probe succeeds.
  • Failed Mount: If the pod was unable to mount all of the volumes described in the spec, it will not start. This can happen if the volume is already being used, or if a request for a dynamic volume failed.

When troubleshooting a waiting container, make sure the spec for its pod is defined correctly.  Use kubectl describe pod <pod name> to get detailed information on how and when the containers in that pod changed states.
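
The same waiting reasons can also be read programmatically from the pod's status. The following is a minimal sketch, assuming the pod has been exported with kubectl get pod <pod name> -o json and parsed into a dict. The function name waiting_containers and the sample data are hypothetical and only illustrate the idea.

```python
def waiting_containers(pod):
    """Return (name, reason, message) for every container in the Waiting state.

    `pod` is a dict parsed from `kubectl get pod <pod name> -o json`.
    """
    found = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting is not None:
            found.append(
                (cs["name"], waiting.get("reason", ""), waiting.get("message", ""))
            )
    return found


# Hypothetical sample, trimmed to the fields read above.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "web",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": 'Back-off pulling image "example/web:latset"',
                    }
                },
            },
            {
                "name": "sidecar",
                "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
            },
        ]
    }
}

for name, reason, message in waiting_containers(sample_pod):
    print(f"{name}: {reason} ({message})")
```

In this sample the misspelled tag (latset) is the kind of typo that typically produces ImagePullBackOff.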


Container Out of Memory

When a container runs out of memory (OOM), it is killed and then restarted according to its pod's restart policy. With the default restart policy, restarts are delayed with an increasing back-off if the container restarts many times in a short time span.

To investigate a container that is going OOM, check that the pod spec memory request and limit are high enough for the running application. You can also debug your application's memory usage to figure out if there is a slow memory leak, or if there are other ways to reduce memory usage on the container. 
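
To confirm that a restart was caused by memory, it can help to look at why the container last terminated. The sketch below assumes the pod was exported with kubectl get pod <pod name> -o json and checks each container's current and previous state for the OOMKilled reason. The function name oom_killed_containers and the sample data are hypothetical.

```python
def oom_killed_containers(pod):
    """Return (name, restart count) for containers whose current or
    previous termination reason is OOMKilled."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                hits.append((cs["name"], cs.get("restartCount", 0)))
                break  # report each container at most once
    return hits


# Hypothetical sample, trimmed to the fields read above.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "restartCount": 4,
                "state": {"running": {"startedAt": "2024-01-01T00:10:00Z"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "proxy",
                "restartCount": 0,
                "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
                "lastState": {},
            },
        ]
    }
}

print(oom_killed_containers(sample_pod))
```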


Container Restarts

In most cases, a container is expected to be long-lived. A restarting container can indicate problems with memory (see the Container Out of Memory section), CPU usage, or just an application exiting prematurely.

If a container is being restarted because of CPU usage, try increasing the request and limit for CPU in the pod spec. Note that a container that exceeds its CPU limit is throttled rather than killed, so CPU-related restarts usually show up indirectly, for example as failing liveness probes. Remember that 1000m equals one virtual CPU on most providers. If the container has a bursty workload and does not need much CPU all the time, you can set the request lower (e.g. 100m) and the limit higher (e.g. 500m) so that the container can use the CPU it needs without permanently reserving a large amount of schedulable CPU.
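
The millicore arithmetic can be illustrated with a small helper. This is only a sketch: cpu_to_millicores is a hypothetical name, and it handles just the two common forms of CPU quantity (a millicore value such as 100m, or a plain number of CPUs such as 1 or 0.5).

```python
def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "100m", "1", or "0.5" to millicores."""
    q = str(quantity).strip()
    if q.endswith("m"):
        return int(q[:-1])
    # A plain number is a count of CPUs; 1 CPU equals 1000 millicores.
    return int(round(float(q) * 1000))


request = cpu_to_millicores("100m")  # amount reserved on the node for scheduling
limit = cpu_to_millicores("500m")    # ceiling the container may burst up to

# Burst headroom: CPU the container may use beyond what it reserves.
headroom = limit - request
print(f"request={request}m limit={limit}m headroom={headroom}m")
```

With the example values from the text, the container reserves 100m for scheduling purposes but may burst by up to a further 400m when the node has spare CPU.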

To debug an application exiting prematurely, it can be helpful to override the command the container is started with. You can set the command to something like sleep 10000 so that the container stays up, connect to it with kubectl exec -it <pod name> -c <container name> -- sh (assuming the image includes a shell), and then manually run the application and check its exit code.


Unschedulable Pod

Blue Matador will detect when a pod could not be scheduled by checking for events in the Kubernetes cluster. A pod may be unschedulable for several reasons:

  • Resource Requests: If the pod is requesting more resources than any node can provide, it will not be scheduled. This can be solved by adding nodes, increasing node size, or reducing the resource requests of pods.
  • Persistent Volumes: If the pod requests persistent volumes that are not available, it may not be able to schedule. This can happen when using dynamic volumes, or when referring to a persistent volume claim that cannot be fulfilled, e.g. requesting an EBS volume without permissions to create it.
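
Besides cluster events, the scheduler also records its decision in the pod's PodScheduled condition, which usually includes a message naming the missing resource. The sketch below reads that condition from a pod exported with kubectl get pod <pod name> -o json. It is not how Blue Matador performs its detection; the function name unschedulable_message and the sample data are hypothetical.

```python
def unschedulable_message(pod):
    """Return the scheduler's message if the pod is marked Unschedulable,
    otherwise None."""
    for cond in pod.get("status", {}).get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return cond.get("message", "")
    return None


# Hypothetical sample, trimmed to the fields read above.
sample_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient cpu.",
            }
        ],
    }
}

print(unschedulable_message(sample_pod))
```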


Pod Terminating

In rare cases, it is possible for a pod to get stuck in the terminating state. This is detected by finding any pods where every container has been terminated, but the pod is still running. Usually, this is caused when a node in the cluster gets taken out of service abruptly, and the cluster scheduler and controller-manager do not clean up all of the pods on that node. 

Solving this issue is as simple as manually deleting the pod using kubectl delete pod <pod name>.
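
The detection rule described in this section (every container terminated while the pod still reports itself as running) can be sketched as follows. This is an illustration only, not Blue Matador's actual implementation; it assumes a pod exported with kubectl get pod <pod name> -o json, and the function name looks_stuck_terminating and the sample data are hypothetical.

```python
def looks_stuck_terminating(pod):
    """Return True if the pod phase is Running although every container
    in it has already terminated."""
    status = pod.get("status", {})
    statuses = status.get("containerStatuses", [])
    all_terminated = bool(statuses) and all(
        "terminated" in cs.get("state", {}) for cs in statuses
    )
    return status.get("phase") == "Running" and all_terminated


# Hypothetical samples, trimmed to the fields read above.
stuck_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "state": {"terminated": {"exitCode": 0}}},
        ],
    }
}
healthy_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "app", "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}},
        ],
    }
}

print(looks_stuck_terminating(stuck_pod), looks_stuck_terminating(healthy_pod))
```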