A Kubernetes Job manages the execution of one or more pods until completion. A job can limit the runtime of its pods, keep track of their status, and retry them if they fail. Jobs themselves can be managed by a CronJob, which schedules jobs to run using a cron expression. Blue Matador monitors all jobs running in your cluster and notifies you if a job completely fails.

    Failed Jobs

    Jobs can be configured to fail under certain conditions. This is recommended when creating jobs so that you can track whether the pods being run complete successfully. There are multiple ways to configure a job so that it fails:

    • activeDeadlineSeconds: Setting this parameter in the job spec ensures that the job fails if the pods it creates do not complete within the specified time limit.
    • backoffLimit: The backoffLimit specifies how many times the job's pods can fail and be retried before the job is considered failed.
    • restartPolicy: Combined with backoffLimit, a restartPolicy of Never means that each failed pod is replaced by a new pod until the limit is reached, at which point the failure propagates and the job fails. If this is set to OnFailure, the failed container is restarted inside the same pod instead.
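
    The three settings above can be combined in a single Job manifest. The following is a minimal sketch, not a recommended configuration: the job name, image, command, and the specific limits are hypothetical placeholders chosen for illustration.

```yaml
# Hypothetical Job manifest; name, image, command, and limits are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-report          # hypothetical name
spec:
  activeDeadlineSeconds: 600    # fail the job if it has not finished within 10 minutes
  backoffLimit: 3               # allow up to 3 retries before the job is marked failed
  template:
    spec:
      restartPolicy: Never      # a failed pod is replaced by a new pod, not restarted in place
      containers:
        - name: report
          image: busybox:1.36   # placeholder image
          command: ["sh", "-c", "echo generating report"]
```

    With this configuration, the job fails either when its retries are exhausted or when the deadline passes, whichever happens first.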

    When Blue Matador detects that a job recently failed, an anomaly will be created. If the same job fails consistently, or if multiple jobs controlled by the same CronJob fail consistently, a warning will be created so that the issue can be investigated further.

    When troubleshooting failed jobs, take a look at both the job and pod resources being created. Kubernetes will create events if a job is timing out, or if a pod fails to start or exits unexpectedly and causes the job to fail. Depending on your job configuration, it could take a long time for a job to completely give up and mark itself as failed. Check for other events around the time the job first attempted to run when correlating the issue.


    Backoff Limit

    When a job's pods fail repeatedly, the job will eventually reach its configured backoffLimit. Once this limit is reached, the job is marked as failed and will no longer be retried. The actual cause of the failure may be available in the logs of the job's pods, or in other events related to the job failure. Note that setting the restartPolicy to OnFailure does not make a job retry indefinitely: in current Kubernetes releases, container restarts inside a running pod also count toward the backoffLimit, and the pod is terminated once the limit is reached. If you want a job to retry more times, raise the backoffLimit (the default is 6).
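
    To confirm that a job stopped because of its backoffLimit, you can inspect the job's status, for example with kubectl get job <job-name> -o yaml. The excerpt below is an illustrative sketch of what the status section typically looks like in that case; the failure count is a made-up value, and the exact fields can vary between Kubernetes versions.

```yaml
# Illustrative excerpt of a failed job's status; values are examples only.
status:
  failed: 4                       # example count, e.g. the first attempt plus 3 retries
  conditions:
    - type: Failed
      status: "True"
      reason: BackoffLimitExceeded
      message: Job has reached the specified backoff limit
```

    A job that failed because it ran past activeDeadlineSeconds reports the reason DeadlineExceeded instead, which helps distinguish a timeout from exhausted retries.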

    For jobs that are controlled by a CronJob, carefully consider how the backoffLimit should be configured to determine how many times a failed job should be retried. If the job is idempotent and is being scheduled regularly anyway, then retrying may actually be worse than just waiting for the next scheduled run from the CronJob.
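
    As a sketch of that trade-off, the hypothetical CronJob below runs every 15 minutes and sets backoffLimit to 0, so a failed run is not retried and the next scheduled run takes over. The name, schedule, image, and command are placeholders, and the batch/v1 API version assumes a reasonably recent cluster (older clusters use batch/v1beta1).

```yaml
# Hypothetical CronJob manifest; name, schedule, image, and command are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-sync              # hypothetical name
spec:
  schedule: "*/15 * * * *"        # run every 15 minutes
  jobTemplate:
    spec:
      backoffLimit: 0             # do not retry; wait for the next scheduled run
      activeDeadlineSeconds: 600  # stop a stuck run before the next one is due
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sync
              image: busybox:1.36 # placeholder image
              command: ["sh", "-c", "echo syncing"]
```

    A job that is not safe to skip, or that runs infrequently, would usually want a higher backoffLimit than this.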