Kubernetes Job | Blue Matador

Failed Jobs

Jobs can be configured to fail under certain conditions. Doing so is recommended when creating jobs so that you can track whether the pods being run complete successfully. There are multiple ways to configure a job so that it fails:

activeDeadlineSeconds: Setting this parameter in the job spec ensures that the job fails if the pods it creates do not complete within the specified time limit.

backoffLimit: Specifies how many times a pod can fail and be retried before the job is considered failed.

restartPolicy: Combined with backoffLimit, a restartPolicy of Never ensures that a pod failure eventually propagates and causes the job to fail. If it is set to OnFailure, the failed container is restarted in place within the same pod instead.
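As a rough illustration, the three settings above could be combined in a single manifest like the sketch below. The job name, image, command, and the specific values (600 seconds, 3 retries) are hypothetical placeholders chosen for the example, not recommendations.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # hypothetical name
spec:
  activeDeadlineSeconds: 600   # fail the job if it has not finished within 10 minutes
  backoffLimit: 3              # fail the job after 3 retries
  template:
    spec:
      restartPolicy: Never     # a failed pod is not restarted; the job creates a new pod instead
      containers:
        - name: worker         # hypothetical container
          image: busybox:1.36
          command: ["sh", "-c", "echo working && exit 1"]  # always exits non-zero to demonstrate failure
```

Note that activeDeadlineSeconds and backoffLimit belong to the job spec, while restartPolicy belongs to the pod template spec nested inside it.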

When Blue Matador detects that a job recently failed, an anomaly will be created. If the same job, or multiple jobs controlled by the same CronJob, fail consistently, a warning will be created so that the issue can be investigated further.

When troubleshooting failed jobs, take a look at both the job and pod resources being created. Kubernetes will create events if a job is timing out, or if a pod fails to start or exits unexpectedly and causes the job to fail. Depending on your job configuration, it could take a long time for a job to completely give up and mark itself as failed. Check for other events around the time the job first attempted to run when correlating the issue.

Backoff Limit

When a job fails repeatedly, it will eventually reach the configured backoffLimit. Once this limit is reached, the job will no longer be retried. The actual cause of failure for the job may be available in the logs of its pods, or in other events related to the job failure. Setting the restartPolicy to OnFailure changes how retries happen: the failed container is restarted in place rather than the job creating a replacement pod. In current Kubernetes versions these in-place restarts still count toward the backoffLimit, so a job that should keep retrying for a long time also needs a correspondingly high backoffLimit.

For jobs that are controlled by a CronJob, carefully consider how the backoffLimit should be configured, since it determines how many times a failed job is retried. If the job is idempotent and is being scheduled regularly anyway, then retrying may actually be worse than just waiting for the next scheduled run from the CronJob.
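One way to express the "wait for the next schedule" approach is sketched below, assuming a cluster recent enough to serve CronJob under batch/v1. The name, schedule, image, and command are hypothetical placeholders; the relevant part is backoffLimit: 0 in the job template, which disables retries for each scheduled run.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob        # hypothetical name
spec:
  schedule: "*/15 * * * *"     # hypothetical schedule: every 15 minutes
  concurrencyPolicy: Forbid    # skip a run if the previous one is still active
  jobTemplate:
    spec:
      backoffLimit: 0          # do not retry; the next scheduled run acts as the retry
      activeDeadlineSeconds: 600   # hypothetical limit, kept shorter than the schedule interval
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker     # hypothetical container
              image: busybox:1.36
              command: ["sh", "-c", "echo run scheduled task"]
```

Whether zero retries is appropriate depends on how costly a missed run is compared with a retry that overlaps the next scheduled run.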

Resources

Jobs - Run to Completion (Kubernetes Documentation)