When a pod cannot be scheduled or has containers that are waiting, the pod will be in a stuck in the pending state. Common reasons for a container to be waiting include:
When troubleshooting a waiting container, make sure the spec for its pod is defined correctly. Use kubectl describe pod <pod name> to get detailed information on how and when the containers in that pod changed states.
When a container is out of memory, or OOM, it is restarted by its pod according to the restart policy. The default restart policy will eventually back off on restarting the pod if it restarts many times in a short time span.
To investigate a container that is going OOM, check that the pod spec memory request and limit are high enough for the running application. You can also debug your application's memory usage to figure out if there is a slow memory leak, or if there are other ways to reduce memory usage on the container.
In most cases, a container is expected to be long-lived. A restarting container can indicate problems with memory (see the Out of Memory section), cpu usage, or just an application exiting prematurely.
If a container is being restarted because of CPU usage, try increasing the requested and limit amounts for CPU in the pod spec. Remember that 1000m equals one virtual CPU on most providers. If the container does not always need tons of CPU, but has a bursty workload, you can set the requested value to be smaller (e.g. 100m) and the limit to be higher (e.g. 500m) to allow the container to use the CPU it needs without permanently taking up lots of scheduled CPU.
To debug an application exiting prematurely, it can be helpful to modify the command the container is started with. You can set the command to something like sleep 10000 so that you can connect to the container with kubectl exec -it <pod name> <container name> and then manually run the application, and check its exit code.
Blue Matador will detect when a pod could not be scheduled by checking for events in the kubernetes cluster. A pod may be unschedulable for several reasons:
In rare cases, it is possible for a pod to get stuck in the terminating state. This is detected by finding any pods where every container has been terminated, but the pod is still running. Usually, this is caused when a node in the cluster gets taken out of service abruptly, and the cluster scheduler and controller-manager do not clean up all of the pods on that node.
Solving this issue is as simple as manually deleting the pod using kubectl delete pod <pod name>.