A Kubernetes deployment manages the scheduling and lifecycle of pods. Deployments provide several key features for managing pods, including rolling updates, the ability to roll back, and easy horizontal scaling.

    Unhealthy Deployments

    When a deployment has fewer available pods than desired pods for at least several minutes, it is considered unhealthy. Often there is an issue with the pod spec that must be addressed to fix the deployment.
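
    The comparison between desired and available pods can be checked from the command line. The sketch below is a point-in-time check only (it does not track the "several minutes" window), and the function name deployment_health is a hypothetical helper, not part of kubectl.

```shell
# Compare desired and available replicas for one deployment (point-in-time check).
deployment_health() {
  # $1: deployment name (hypothetical example: "web")
  desired=$(kubectl get deployment "$1" -o jsonpath='{.spec.replicas}') || return
  available=$(kubectl get deployment "$1" -o jsonpath='{.status.availableReplicas}')
  # availableReplicas is omitted from the status when it is zero, so default to 0.
  available=${available:-0}
  if [ "$available" -lt "${desired:-0}" ]; then
    echo "unhealthy: $available/$desired pods available"
  else
    echo "healthy: $available/$desired pods available"
  fi
}
```

    Running kubectl get deployment with the deployment name gives the same information in its READY and AVAILABLE columns.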

    Common causes of an unhealthy deployment:

    • Incorrect image specified in pod spec
    • A node being terminated
    • Pod stuck in a crash loop
    • Pending pods


    Quick Fix

    If the deployment was recently updated, you can quickly roll back the change using kubectl:

        kubectl rollout undo deployment/<deployment name>
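
    A rollback is easier to trust when you inspect the revision history first and wait for the result afterwards. The following is a minimal sketch of that sequence; rollback_deployment is a hypothetical wrapper, while the three kubectl rollout subcommands are standard.

```shell
# Roll back a deployment to its previous revision and wait for the outcome.
rollback_deployment() {
  # $1: deployment name (hypothetical example: "web")
  # List recorded revisions so you can see what the undo will restore.
  kubectl rollout history "deployment/$1" || return
  # Revert to the previous revision.
  kubectl rollout undo "deployment/$1" || return
  # Block until the rollback completes or fails.
  kubectl rollout status "deployment/$1"
}
```

    For example, rollback_deployment web runs the three commands against a deployment named web and stops at the first one that fails.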

    If some of the deployment's pods are unavailable because they are running on a terminated node, you can try manually deleting the pods to force the deployment to recreate them on another node. 
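
    To find the affected pods, you can filter by the node they are bound to. This sketch assumes you already know the terminated node's name; both function names are hypothetical helpers.

```shell
# List pods still bound to a given node, across all namespaces.
pods_on_node() {
  # $1: node name as shown by `kubectl get nodes`
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Delete one pod; the deployment then creates a replacement on a schedulable node.
delete_stuck_pod() {
  # $1: pod name, $2: namespace
  kubectl delete pod "$1" --namespace "$2"
}
```

    If a pod stays in the Terminating state because the node is no longer reachable, kubectl delete also accepts --grace-period=0 --force, which removes the pod object without waiting for confirmation from the node. Treat that as a last resort.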

    If a pod is stuck in a crash loop, it may simply be running out of resources. Revisit the spec and see whether increasing the CPU or memory request and limit values allows the pod to run longer. You can then fully troubleshoot the pod by checking its logs. If resource usage is not the problem, ensure that the image and command are correct and that the container does not exit quickly. Deployments are meant for long-running containers, which should not exit unless there is a fatal error.
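
    The checks above might look like the following sketch. The function names are hypothetical, and the resource values are purely illustrative; pick values based on what your workload actually needs.

```shell
# Gather evidence about a crash-looping pod.
inspect_crashing_pod() {
  # $1: pod name
  # The describe output includes events and the last container state,
  # where a reason such as OOMKilled points to a memory limit problem.
  kubectl describe pod "$1" || return
  # --previous prints logs from the container instance that last exited.
  kubectl logs "$1" --previous
}

# Raise requests and limits on the containers of a deployment.
raise_resources() {
  # $1: deployment name; the values below are examples, not recommendations.
  kubectl set resources deployment "$1" \
    --requests=cpu=250m,memory=256Mi \
    --limits=cpu=500m,memory=512Mi
}
```

    Changing the resources updates the pod template, so the deployment rolls out new pods with the new values.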

    If one or more pods in the deployment are pending, there may not be enough resources to schedule them on any node. Check the pod spec and adjust the CPU and memory requests, or the number of replicas, so that the entire deployment fits in your cluster.
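
    One way to confirm that scheduling is the problem, and then shrink the deployment, is sketched below. The function names are hypothetical helpers around standard kubectl commands.

```shell
# List pods that have not been scheduled or started yet.
pending_pods() {
  kubectl get pods --field-selector status.phase=Pending
}

# Show the events for one pod; a FailedScheduling event usually states
# which resource (cpu or memory) could not be satisfied.
explain_pending_pod() {
  # $1: pod name
  kubectl describe pod "$1"
}

# Reduce the replica count so the deployment fits in the cluster.
shrink_deployment() {
  # $1: deployment name, $2: new replica count
  kubectl scale "deployment/$1" --replicas="$2"
}
```

    For example, shrink_deployment web 2 scales a deployment named web down to two replicas.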


    Thorough Fix

    For deployments where new pods should always be running before old ones are terminated during an update, set the update strategy in the spec to RollingUpdate and configure a readinessProbe so that the health of each pod can be determined before it is counted as available. This helps ensure that if new pods get stuck in a crash loop or fail to start, the old pods remain available to handle traffic.
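
    A sketch of such a configuration, applied as a patch, follows. Several details are assumptions rather than part of the guidance above: the container name app, the /healthz path, and port 8080 are hypothetical and must match your own workload; maxUnavailable: 0 with maxSurge: 1 is one way to keep every old pod until a replacement is ready; and the --patch-file flag assumes a reasonably recent kubectl.

```shell
# Write a strategic merge patch and apply it to a deployment.
apply_safe_rollout_patch() {
  # $1: deployment name
  cat > rollout-patch.yaml <<'EOF'
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # allow one extra pod above the desired count
      maxUnavailable: 0    # keep every old pod until a new one is ready
  template:
    spec:
      containers:
      - name: app          # hypothetical; must match the existing container name
        readinessProbe:
          httpGet:
            path: /healthz # hypothetical health endpoint
            port: 8080     # hypothetical container port
          initialDelaySeconds: 5
          periodSeconds: 10
EOF
  kubectl patch deployment "$1" --patch-file rollout-patch.yaml
}
```

    With these settings a failing new pod never becomes ready, so the rollout stalls while the old pods keep serving traffic.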

    When manually removing a node from a cluster, you can avoid problems with pods scheduled on that node by first tainting the node so that no new pods are scheduled on it. Then you can delete the pods on that node, and they will not be recreated on it. Once none of your pods are running on the node, it is safe to remove it from the cluster.
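
    The sequence above can be sketched as follows. Two things here are assumptions: the taint key and value maintenance=true are arbitrary, and kubectl drain is used in place of deleting pods one by one (it evicts the pods and also marks the node unschedulable). The function name is a hypothetical helper.

```shell
# Taint a node, evict its pods, and list whatever is still running there.
prepare_node_for_removal() {
  # $1: node name
  # NoSchedule keeps new pods off the node unless they tolerate the taint.
  kubectl taint nodes "$1" maintenance=true:NoSchedule || return
  # Evict the pods; DaemonSet-managed pods are skipped because they are
  # tied to the node and cannot move elsewhere.
  kubectl drain "$1" --ignore-daemonsets || return
  # Show what remains so you can confirm none of your pods are left.
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

    Once the final listing shows none of your pods, the node can be removed, for example with kubectl delete node followed by the node name.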