| author | Maciej Szulik <maszulik@redhat.com> | 2017-04-26 12:09:54 +0200 |
|---|---|---|
| committer | Maciej Szulik <maszulik@redhat.com> | 2017-04-26 12:09:54 +0200 |
| commit | bd5d3d9bf3ea9733eadecd3932839123209b2ad3 (patch) | |
| tree | 89d3bfdd96f5e822b4673af4d212af1d90ddd3d3 | |
| parent | 4be538df0a6f5e5588272f93fa9a9e2b02a45f44 (diff) | |
Add backoff policy and failed pod limit
| -rw-r--r-- | contributors/design-proposals/job.md | 35 |
1 files changed, 35 insertions, 0 deletions
diff --git a/contributors/design-proposals/job.md b/contributors/design-proposals/job.md
index 31fb0e3f..ac11a7df 100644
--- a/contributors/design-proposals/job.md
+++ b/contributors/design-proposals/job.md
@@ -18,6 +18,7 @@ Several existing issues and PRs were already created regarding that particular s
 1. Be able to get the job status.
 1. Be able to specify the number of instances performing a job at any one time.
 1. Be able to specify the number of successfully finished instances required to finish a job.
+1. Be able to specify backoff policy, when job is continuously failing.
 
 
 ## Motivation
@@ -26,6 +27,31 @@ Jobs are needed for executing multi-pod computation to completion; a good exampl
 here would be the ability to implement any type of batch oriented tasks.
 
 
+## Backoff policy and failed pod limit
+
+By design, Jobs do not have any notion of failure, other than Pod's `restartPolicy`
+which is mistakenly taken as Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]). There are
+situation where one wants to fail a Job after some amount of retries over certain
+period of time, due to a logical error in configuration etc. To do so we are going
+following fields will be introduced, which will control the exponential backoff
+when retrying Job: number of retries and time to retry. The two fields will allow
+creating a fine grain control over the backoff policy, limiting the number of retries
+over specified period of time. In the case when only one of them is specified
+an exponential backoff with duration of 10 seconds and factor of 2 will be applied
+in such a way that either time or number is reached. After reaching the limit
+a Job will be marked as failed.
+
+Additionally, to help debug the issue with a job, and limit the impact of having
+too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243))
+we are going to introduce a field which will allow specifying the maximum number
+of failed pods to keep around. This number will also take effect if none of the
+limits, described above, are set.
+
+All of the above fields will be optional and will apply no matter which `restartPolicy`
+is set on a `PodTemplate`.
+
+
 ## Implementation
 
 Job controller is similar to replication controller in that they manage pods.
@@ -83,6 +109,15 @@ type JobSpec struct {
     // before the system tries to terminate it; value must be positive integer
     ActiveDeadlineSeconds *int
 
+    // Optional number of retries, before marking this job failed.
+    BackoffLimit *int
+
+    // Optional time (in seconds), how log a job should be retried before marking it failed.
+    BackoffDeadlineSeconds *int
+
+    // Optional number of failed pods to retain.
+    FailedPodsLimit *int
+
     // Selector is a label query over pods running a job.
     Selector LabelSelector
 
