A Kubernetes node is a physical server or VM that runs pods. A node's health is determined by condition checks as well as by how much of its CPU, memory, and pod capacity is in use.

    Node Conditions

    Conditions describe the health of several key node metrics and attributes. They also determine if a node is allowed to have pods scheduled onto it. A full description of node conditions can be found in the Kubernetes documentation, but they are also summarized below for convenience. Note that not all versions of Kubernetes expose every node condition.

    • OutOfDisk: If unhealthy, then there is not enough disk space for new pods
    • Ready: If unhealthy, then the node will not accept new pods
    • MemoryPressure: If unhealthy, then node memory is low
    • PIDPressure: If unhealthy, then there are too many processes on the node
    • DiskPressure: If unhealthy, then disk capacity is low
    • NetworkUnavailable: If unhealthy, then the network for the node is misconfigured
    • ConfigOK: If unhealthy, then the kubelet is misconfigured
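
    The list above can be turned into a small health check. The following is a minimal sketch, assuming node objects shaped like the output of kubectl get nodes -o json. The sample node is hypothetical, and treating ConfigOK as healthy when its status is "True" is an assumption based on its name. A status of "Unknown" is reported as unhealthy.

```python
# Condition types where status "True" means healthy. For every other
# condition type, "False" is the healthy value. Including ConfigOK here
# is an assumption based on the condition's name.
HEALTHY_WHEN_TRUE = {"Ready", "ConfigOK"}


def unhealthy_conditions(node):
    """Return the condition types that are not in their healthy state."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        healthy_value = "True" if cond["type"] in HEALTHY_WHEN_TRUE else "False"
        # Anything other than the healthy value (including "Unknown")
        # is treated as unhealthy.
        if cond["status"] != healthy_value:
            problems.append(cond["type"])
    return problems


# Hypothetical node object, trimmed to the fields used above.
sample_node = {
    "metadata": {"name": "node-1"},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "PIDPressure", "status": "False"},
            {"type": "Ready", "status": "Unknown"},
        ]
    },
}

print(unhealthy_conditions(sample_node))
```

    For the sample node, the function reports DiskPressure and Ready as unhealthy.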


    Node Resources

    Each node has a finite capacity of CPU and memory that can be allocated towards running pods. To quickly see resource usage on a per-node basis in your cluster, run kubectl describe nodes or, if your cluster has Heapster installed, kubectl top nodes.

    CPU is measured in CPU units, where one unit is equivalent to one vCPU, vCore, or core depending on your cloud provider. Many pods do not require an entire CPU and request CPU in millicpu instead. There are 1000 millicpu, written 1000m, in one CPU unit.

    Memory is measured in bytes but often displayed in a more readable format such as 128Mi (128 mebibytes).
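
    To compare requests against capacity, it helps to convert these quantities to plain numbers first. The sketch below is illustrative only: the function names are hypothetical, and it handles just the common suffixes (m for CPU; Ki, Mi, Gi, Ti and k, M, G, T for memory), not every quantity format Kubernetes accepts.

```python
from decimal import Decimal

# Power-of-two and power-of-ten suffixes; binary ones are checked first
# so that "Mi" is not mistaken for "M".
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicpu(quantity):
    """Convert a CPU quantity such as "250m" or "1.5" to millicpu."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity):
    """Convert a memory quantity such as "128Mi" or "1G" to bytes."""
    for suffix, factor in {**BINARY_SUFFIXES, **DECIMAL_SUFFIXES}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: already in bytes


print(cpu_to_millicpu("250m"), cpu_to_millicpu("1.5"))
print(memory_to_bytes("128Mi"))
```

    Decimal is used instead of float so that values such as "0.1" convert without rounding surprises.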

    To determine if a node is near its capacity, the sum of all of the configured request and limit values for both CPU and memory for pods on a node is compared to the capacity of the node. 

    The requested amount of a resource determines whether the pod can be scheduled on a node, and the total requested by all pods on a node can never exceed the capacity of that node. When the requested amount of memory or CPU on a node is near its capacity, no more pods can be scheduled on that node. Typically, a workload will be spread out over several nodes in the cluster, and it is expected that most nodes have roughly the same amount of CPU and memory requested.

    The limit amount of a resource determines how much of that resource a pod could use if it needed to. It is common for the sum of limits to exceed the capacity of the node when pod resource requests and limits have not been fine-tuned to the application they are running. An over-committed node is one whose total limits are much higher than its capacity. It is considered unhealthy because node performance can degrade if those pods actually start using resources up to their limits. The easiest way to avoid an over-committed node is to configure pod limits to be equal to the requested amounts.
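
    The comparison of requests and limits against capacity is simple arithmetic. Here is a minimal sketch for one resource on one node, assuming the values have already been converted to a common unit (millicpu in this hypothetical example). The function name and the sample numbers are made up for illustration.

```python
def commitment(pods, capacity):
    """Return (requests / capacity, limits / capacity) for one resource.

    pods: list of dicts with "request" and "limit" in the same unit
          as `capacity` (for example millicpu or bytes).
    """
    requested = sum(p["request"] for p in pods)
    limited = sum(p["limit"] for p in pods)
    return requested / capacity, limited / capacity


# Hypothetical pods on a node with 4 CPUs (4000 millicpu).
pods_on_node = [
    {"request": 500, "limit": 2000},
    {"request": 1000, "limit": 2000},
    {"request": 250, "limit": 1000},
]

req_ratio, limit_ratio = commitment(pods_on_node, capacity=4000)
print(f"requests: {req_ratio:.0%} of capacity")
print(f"limits: {limit_ratio:.0%} of capacity")
```

    In this example requests total 1750m, well under capacity, while limits total 5000m, which is 125% of capacity, so the node is over-committed on CPU. Setting each limit equal to its request would bring both ratios to the same value.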


    Pod Limit

    The kubelet running on each node has a pod limit (110 by default) that caps the number of pods that can be run on that node. Once this limit is reached, no more pods can be scheduled onto the node. When figuring out how many pods can be run on each node, remember to look at pods from all namespaces, because system DNS and networking pods count towards the limit.

    There are two ways to avoid hitting the pod-per-node limit:

    • Add more nodes: by adding more nodes to the cluster, pods can be scheduled on the new nodes
    • Modify the kubelet command: the kubelet command on the node can be run with the --max-pods argument to specify the maximum number of pods on that node
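
    To see how close each node is to its pod limit, count pods per node across every namespace. The sketch below assumes pod objects shaped like the items in kubectl get pods --all-namespaces -o json. The function names, the 90% warning threshold, and the sample data are hypothetical.

```python
from collections import Counter


def pods_per_node(pods):
    """Count pods assigned to each node, regardless of namespace."""
    return Counter(
        p["spec"]["nodeName"] for p in pods if p.get("spec", {}).get("nodeName")
    )


def nodes_near_pod_limit(pods, max_pods=110, threshold=0.9):
    """Return nodes whose pod count is at or above threshold * max_pods."""
    counts = pods_per_node(pods)
    return sorted(n for n, c in counts.items() if c >= max_pods * threshold)


# Hypothetical pods; the last one is pending and has no node yet.
sample_pods = [
    {"metadata": {"namespace": "kube-system"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"namespace": "kube-system"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"namespace": "default"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"namespace": "default"}, "spec": {"nodeName": "node-b"}},
    {"metadata": {"namespace": "default"}, "spec": {}},
]

# A tiny limit of 4 is used here only to keep the example short.
print(nodes_near_pod_limit(sample_pods, max_pods=4, threshold=0.75))
```

    Pass the value of --max-pods configured on your nodes as max_pods if it differs from the default.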


    Evicted Pods

    The kubelet can prevent total resource starvation by proactively evicting pods when a resource is almost exhausted. Resources that can be monitored for pod eviction include memory, disk space, and disk inodes. Both soft and hard eviction thresholds can be configured: a soft threshold has a grace period that must elapse before the kubelet evicts pods to reclaim resources, while a hard threshold triggers eviction immediately.

    Blue Matador monitors your Kubernetes cluster for any eviction events on a node and warns you when evictions happen. Evictions can be fine-tuned by changing the kubelet command-line parameters, as well as by ensuring that pods define their resource requests and limits appropriately.

    The kubelet may evict more pods than needed to reclaim resources, and may evict pods whose removal does not actually relieve the resource starvation. This is because the kubelet uses a pod's Quality of Service (QoS) level to determine which pods should be evicted first. Pods whose QoS level is Guaranteed will be evicted after other pods. You can make critical pods Guaranteed by setting CPU and memory requests and limits on every container in the pod, and ensuring that each request is the same value as its limit.
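
    The QoS level follows from how requests and limits are set on a pod's containers. The following is a simplified sketch of that classification, not the kubelet's own code. It takes a list of container specs and compares quantities as strings, so "500m" and "0.5" would be treated as different values. The function name is hypothetical.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    resources = [c.get("resources", {}) for c in containers]

    # BestEffort: no container sets any request or limit.
    if all(not r.get("requests") and not r.get("limits") for r in resources):
        return "BestEffort"

    # Guaranteed: every container has CPU and memory limits, and each
    # request equals its limit. An omitted request defaults to the limit.
    for r in resources:
        limits = r.get("limits", {})
        requests = {**limits, **r.get("requests", {})}
        for name in ("cpu", "memory"):
            if name not in limits or requests[name] != limits[name]:
                return "Burstable"
    return "Guaranteed"


guaranteed = [{"resources": {
    "requests": {"cpu": "500m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "128Mi"},
}}]
burstable = [{"resources": {
    "requests": {"cpu": "250m"},
    "limits": {"cpu": "500m", "memory": "128Mi"},
}}]
best_effort = [{}]

print(qos_class(guaranteed), qos_class(burstable), qos_class(best_effort))
```

    On a live cluster the assigned class can be read from the pod's status.qosClass field instead of being recomputed.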

    Node Added

    Blue Matador creates an event every time a node is added to your cluster. This is useful for correlating other events that would be affected by a new node, such as DaemonSet Unhealthy, and is created only as an anomaly since it is usually not actionable.


    Node Removed

    When Blue Matador detects that one of our agents running on your Kubernetes node is no longer sending data, it is assumed that the node has been removed from the cluster. This event can help you track down issues with your agent installation, and can also be used to correlate issues around resource utilization and unhealthy Deployments. In an auto-scaled cluster, this event will be triggered every time the cluster scales in.


    Node Rebooted

    This event simply indicates that a node has been rebooted. Node reboots are usually user-initiated for kernel upgrades, node software updates, or hardware repairs. If the reboot takes less time than the --pod-eviction-timeout on the controller-manager, then the pods on that node will remain on it when the reboot is finished. Otherwise, the pods will be scheduled onto other nodes after the timeout. If you want to limit the impact that rebooting a node may have on your cluster, you can use kubectl drain on the node to remove its pods beforehand and kubectl uncordon to allow scheduling again afterwards.


    Eviction Threshold

    Kubelet is in charge of managing node resource usage. If a node becomes low on certain resources, then the kubelet will start evicting pods to try to reclaim these resources. In some cases, the resource that is low does not actually get reclaimed when pods are evicted, so mass evictions are possible. When this event occurs, try to figure out which resources are low using kubectl describe node <node>. It may help to get on the node directly and reclaim disk space or inodes if those are the affected resources.

    The following eviction thresholds are monitored by kubelet:

    • memory available
    • disk space on nodefs or imagefs
    • inodes on nodefs or imagefs
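
    Thresholds for these signals are passed to the kubelet as comma-separated signal<quantity pairs, for example through the --eviction-hard flag. The sketch below parses such a string and checks whether a signal has crossed its threshold. The function names are hypothetical, the values are examples rather than recommendations, and only percentages, plain numbers, and the Ki/Mi/Gi suffixes are handled.

```python
SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_thresholds(flag_value):
    """Parse "signal<quantity,signal<quantity" into a dict."""
    thresholds = {}
    for item in flag_value.split(","):
        signal, quantity = item.strip().split("<", 1)
        thresholds[signal] = quantity
    return thresholds


def crossed(quantity, available, capacity):
    """Return True if `available` has dropped below the threshold."""
    if quantity.endswith("%"):
        limit = capacity * float(quantity[:-1]) / 100
    else:
        for suffix, factor in SUFFIXES.items():
            if quantity.endswith(suffix):
                limit = float(quantity[: -len(suffix)]) * factor
                break
        else:
            limit = float(quantity)  # plain number of bytes or inodes
    return available < limit


thresholds = parse_thresholds(
    "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%"
)

# Hypothetical node: 200 MiB of memory free, 8 GB free on a 100 GB disk.
print(crossed(thresholds["memory.available"], 200 * 2**20, 16 * 2**30))
print(crossed(thresholds["nodefs.available"], 8e9, 100e9))
```

    In this example the memory signal is still above its threshold, while the nodefs signal has crossed it and would trigger eviction.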

    The best way to minimize the impact of eviction is to properly configure your pod resource requests so they are evicted in a way that your application can tolerate. Kubelet will evict user pods according to their QoS. BestEffort pods, and Burstable pods whose resource usage exceeds the requested amounts, will be evicted first, while Guaranteed pods, and Burstable pods whose resource usage is below the requested amounts, will be evicted last. Properly size your critical pods so that their usage is just slightly below the requested amount, making them less likely to be evicted.
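
    The ordering described above can be approximated by comparing each pod's usage with its request. This is a simplified sketch, not the kubelet's actual ranking: it looks at a single resource, ignores pod priority, and uses hypothetical pod data with values in MiB. A BestEffort pod has a request of 0, so any usage counts as exceeding it.

```python
def eviction_order(pods):
    """Return pod names from first-evicted to last-evicted."""

    def key(pod):
        over = pod["usage"] - pod["request"]
        # Pods over their request sort first (False < True), and among
        # those, the pod that is furthest over its request comes first.
        return (over <= 0, -over)

    return [p["name"] for p in sorted(pods, key=key)]


# Hypothetical memory figures in MiB.
pods = [
    {"name": "db", "request": 1024, "usage": 900},     # Guaranteed
    {"name": "web", "request": 200, "usage": 350},     # Burstable, over
    {"name": "batch", "request": 0, "usage": 300},     # BestEffort
    {"name": "cache", "request": 512, "usage": 400},   # Burstable, under
]

print(eviction_order(pods))
```

    Here batch and web are evicted before cache and db, because only the first two use more than they requested.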


    System OOM

    When a Kubernetes node completely runs out of memory, it will shut down running pods in an attempt to keep the kubelet alive. This often means a disruption in service, because those pods need to be scheduled onto other nodes in your cluster immediately. Unlike normal eviction, System OOM may indicate that the kubelet itself is using a lot of memory, and this should be investigated. To minimize the impact of a System OOM, you can properly configure your pod resource requests as described in Eviction Threshold.


    Resolv Conf

    Kubelet automatically checks the /etc/resolv.conf file on your nodes for errors and reports them to the event API. This is done because your pods' resolv.conf files are created from the node's resolv.conf. If the resolv.conf file on a pod has errors, DNS lookups may not work as expected. The message in the event will usually indicate the issue, such as a line going over the 255 character limit or a search line containing too many domains. Pods will still start normally, and may run fine if the DNS lookups being performed are not affected by the lines in question. The fix is to modify the node's resolv.conf file.
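
    Before editing the file, a quick script can point out which lines are likely to be the problem. This is a minimal sketch, not the kubelet's own check. The function name is hypothetical, the 255 character limit comes from the description above, and the default of 6 search domains is an assumption based on the traditional resolver limit, so adjust both parameters to match the message in your event.

```python
def check_resolv_conf(text, max_line_chars=255, max_search_domains=6):
    """Return a list of human-readable problems found in resolv.conf text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_line_chars:
            problems.append(
                f"line {number}: longer than {max_line_chars} characters"
            )
        fields = line.split()
        # A search line is the keyword followed by the list of domains.
        if fields and fields[0] == "search":
            domains = len(fields) - 1
            if domains > max_search_domains:
                problems.append(
                    f"line {number}: {domains} search domains "
                    f"(max {max_search_domains})"
                )
    return problems


# Hypothetical file contents with seven search domains.
sample = (
    "nameserver 10.0.0.10\n"
    "search a.example b.example c.example d.example "
    "e.example f.example g.example\n"
)

for problem in check_resolv_conf(sample):
    print(problem)
```

    To check a real node, read /etc/resolv.conf on that node and pass its contents to check_resolv_conf.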