Add backoff policy and failed pod limit

author: Maciej Szulik <maszulik@redhat.com> 2017-04-26 12:09:54 +0200
committer: Maciej Szulik <maszulik@redhat.com> 2017-04-26 12:09:54 +0200
commit: bd5d3d9bf3ea9733eadecd3932839123209b2ad3 (patch)
tree: 89d3bfdd96f5e822b4673af4d212af1d90ddd3d3
parent: 4be538df0a6f5e5588272f93fa9a9e2b02a45f44 (diff)
1 files changed, 35 insertions, 0 deletions
diff --git a/contributors/design-proposals/job.md b/contributors/design-proposals/job.md
index 31fb0e3f..ac11a7df 100644
--- a/contributors/design-proposals/job.md
+++ b/contributors/design-proposals/job.md
@@ -18,6 +18,7 @@ Several existing issues and PRs were already created regarding that particular s
 1. Be able to get the job status.
 1. Be able to specify the number of instances performing a job at any one time.
 1. Be able to specify the number of successfully finished instances required to finish a job.
+1. Be able to specify backoff policy, when job is continuously failing.
 
 
 ## Motivation
@@ -26,6 +27,31 @@ Jobs are needed for executing multi-pod computation to completion; a good exampl
 here would be the ability to implement any type of batch oriented tasks.
 
 
+## Backoff policy and failed pod limit
+
+By design, Jobs do not have any notion of failure, other than Pod's `restartPolicy`
+which is mistakenly taken as Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]).  There are
+situation where one wants to fail a Job after some amount of retries over certain
+period of time, due to a logical error in configuration etc.  To do so we are going
+following fields will be introduced, which will control the exponential backoff
+when retrying Job: number of retries and time to retry.  The two fields will allow
+creating a fine grain control over the backoff policy, limiting the number of retries
+over specified period of time.  In the case when only one of them is specified
+an exponential backoff with duration of 10 seconds and factor of 2 will be applied
+in such a way that either time or number is reached.  After reaching the limit
+a Job will be marked as failed.
+
+Additionally, to help debug the issue with a job, and limit the impact of having
+too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243))
+we are going to introduce a field which will allow specifying the maximum number
+of failed pods to keep around.  This number will also take effect if none of the
+limits, described above, are set.
+
+All of the above fields will be optional and will apply no matter which `restartPolicy`
+is set on a `PodTemplate`.
+
+
 ## Implementation
 
 Job controller is similar to replication controller in that they manage pods.
@@ -83,6 +109,15 @@ type JobSpec struct {
     // before the system tries to terminate it; value must be positive integer
     ActiveDeadlineSeconds *int
 
+    // Optional number of retries, before marking this job failed.
+    BackoffLimit *int
+
+    // Optional time (in seconds), how log a job should be retried before marking it failed.
+    BackoffDeadlineSeconds *int
+
+    // Optional number of failed pods to retain.
+    FailedPodsLimit *int
+
     // Selector is a label query over pods running a job.
     Selector LabelSelector
author	Maciej Szulik <maszulik@redhat.com>	2017-04-26 12:09:54 +0200
committer	Maciej Szulik <maszulik@redhat.com>	2017-04-26 12:09:54 +0200
commit	bd5d3d9bf3ea9733eadecd3932839123209b2ad3 (patch)
tree	89d3bfdd96f5e822b4673af4d212af1d90ddd3d3
parent	4be538df0a6f5e5588272f93fa9a9e2b02a45f44 (diff)