A Kubernetes Job manages the execution of one or more pods until completion. A job can limit the runtime of its pods, tracks their status, and can retry when a pod fails. Jobs themselves can be managed by a CronJob, which schedules jobs to run using a cron expression. Blue Matador monitors all jobs running in your cluster and notifies you if a job completely fails.

Failed Jobs


Jobs can be configured to fail under certain conditions. Configuring this when you create a job is recommended so that you can track whether the pods being run complete successfully. There are multiple ways to configure a job so that it fails:

  • activeDeadlineSeconds: Setting this parameter in the job spec ensures that the job fails if the pods it creates do not complete within the specified time limit, in seconds.
  • backoffLimit: Specifies how many times the job's pods can fail and be retried before the job is considered failed.
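  • restartPolicy: Combined with backoffLimit, a restartPolicy of Never ensures that each pod failure is counted against the job and eventually causes the job to fail. If this is set to OnFailure, the failed container is restarted inside the same pod instead. In some older Kubernetes versions these restarts were not counted toward backoffLimit, which effectively created a job that never fails and remains active until it succeeds. Current versions do count them, but Never still makes failures easier to investigate because the failed pods are kept.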
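
The settings above can be sketched as a single Job manifest. The following Python snippet is only an illustration: the job name, image, command, and the values 600 and 3 are assumptions chosen for the example, not recommendations. It builds the manifest as a plain dictionary and prints it as JSON.

```python
import json


def build_job_manifest(name, image, command, deadline_seconds=600, backoff_limit=3):
    """Return a batch/v1 Job manifest as a plain dict.

    All argument values are supplied by the caller; the defaults are
    illustrative assumptions only.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Fail the job if it is still running after this many seconds.
            "activeDeadlineSeconds": deadline_seconds,
            # Number of retries allowed before the job is marked failed.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    # Never: a failed pod is not restarted in place, so each
                    # failure counts against backoffLimit.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical job whose container always exits with an error.
manifest = build_job_manifest("example-job", "busybox", ["sh", "-c", "exit 1"])
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Because kubectl accepts JSON as well as YAML, the printed output could be saved to a file such as job.json and applied with kubectl apply -f job.json.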

When Blue Matador detects that a recently completed job failed, it creates an anomaly. If the same job fails consistently, or if multiple jobs controlled by the same CronJob fail consistently, it creates a warning so that the issue can be investigated further.

 

Troubleshooting


When troubleshooting failed jobs, look at both the job and the pod resources it creates. Kubernetes creates events if a job times out, or if a pod fails to start or exits unexpectedly and causes the job to fail. Depending on your job configuration, it can take a long time for a job to give up completely and mark itself as failed, so when correlating the issue, check for other events around the time the job first attempted to run.
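
As a starting point, the job's own status usually says why it failed. The sketch below assumes you have the job as JSON, for example from kubectl get job <name> -o json, and looks for a condition of type Failed. The helper name job_failure and the sample data are hypothetical; the reasons BackoffLimitExceeded and DeadlineExceeded are the ones Kubernetes typically reports for the backoffLimit and activeDeadlineSeconds cases.

```python
import json


def job_failure(job):
    """Return (reason, message) if the Job has a Failed condition, else None."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond.get("reason", ""), cond.get("message", "")
    return None


# Hypothetical sample shaped like the output of `kubectl get job -o json`.
sample = json.loads("""
{
  "metadata": {"name": "example-job"},
  "status": {
    "failed": 4,
    "conditions": [
      {
        "type": "Failed",
        "status": "True",
        "reason": "BackoffLimitExceeded",
        "message": "Job has reached the specified backoff limit"
      }
    ]
  }
}
""")

failure = job_failure(sample)
if failure is None:
    print("job has not failed")
else:
    reason, message = failure
    print(f"job failed: {reason}: {message}")
```

For the related events and the state of individual pods, kubectl describe job <name> and kubectl describe pod <name> show the same information in readable form.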

 

Resources