authorJanet Kuo <chiachenk@google.com>2018-08-16 10:30:49 -0700
committerJanet Kuo <chiachenk@google.com>2018-08-22 10:55:19 -0700
commit80e8c2bda7bb36aaa6fa3f68dcbdff36ce14cc32 (patch)
tree4f41009157d6db2d2e2bd7d616b90e3fdd287460 /keps
parenta119d94f47f86b1dcce5ead9f3efaf8de7f91701 (diff)
KEP for TTL-after-finished controller
Diffstat (limited to 'keps')
-rw-r--r--  keps/NEXT_KEP_NUMBER                       2
-rw-r--r--  keps/sig-apps/0026-ttl-after-finish.md   296
2 files changed, 297 insertions, 1 deletions
diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER
index 6f4247a6..f64f5d8d 100644
--- a/keps/NEXT_KEP_NUMBER
+++ b/keps/NEXT_KEP_NUMBER
@@ -1 +1 @@
-26
+27
diff --git a/keps/sig-apps/0026-ttl-after-finish.md b/keps/sig-apps/0026-ttl-after-finish.md
new file mode 100644
index 00000000..85d385ce
--- /dev/null
+++ b/keps/sig-apps/0026-ttl-after-finish.md
@@ -0,0 +1,296 @@
+---
+kep-number: 26
+title: TTL After Finished
+authors:
+ - "@janetkuo"
+owning-sig: sig-apps
+participating-sigs:
+ - sig-api-machinery
+reviewers:
+ - "@enisoc"
+ - "@tnozicka"
+approvers:
+ - "@kow3ns"
+editor: TBD
+creation-date: 2018-08-16
+last-updated: 2018-08-16
+status: provisional
+see-also:
+ - n/a
+replaces:
+ - n/a
+superseded-by:
+ - n/a
+---
+
+# TTL After Finished Controller
+
+## Table of Contents
+
+ * [TTL After Finished Controller](#ttl-after-finished-controller)
+    * [Table of Contents](#table-of-contents)
+    * [Summary](#summary)
+    * [Motivation](#motivation)
+       * [Goals](#goals)
+    * [Proposal](#proposal)
+       * [Concrete Use Cases](#concrete-use-cases)
+       * [Detailed Design](#detailed-design)
+          * [Feature Gate](#feature-gate)
+          * [API Object](#api-object)
+             * [Validation](#validation)
+       * [User Stories](#user-stories)
+       * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
+          * [TTL Controller](#ttl-controller)
+          * [Finished Jobs](#finished-jobs)
+          * [Finished Pods](#finished-pods)
+          * [Owner References](#owner-references)
+       * [Risks and Mitigations](#risks-and-mitigations)
+    * [Graduation Criteria](#graduation-criteria)
+    * [Implementation History](#implementation-history)
+
+## Summary
+
+We propose a TTL mechanism to limit the lifetime of finished resource objects,
+including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods
+after they finish. The TTL timer starts when the Job/Pod finishes, and the
+finished Job/Pod will be cleaned up after the TTL expires.
+
+## Motivation
+
+In Kubernetes, finishable resources, such as Jobs and Pods, are often created
+frequently and are short-lived. If a Job or Pod isn't controlled by a
+higher-level resource (e.g. a CronJob for Jobs or a Job for Pods), or owned by
+some other resource, it's difficult for users to clean them up automatically,
+and those Jobs and Pods can easily accumulate and overload a Kubernetes
+cluster. Even if we could avoid the overload issue with a cluster-wide
+(global) resource quota, users wouldn't be able to create new resources
+without cleaning up old ones first. See [#64470][].
+
+The design of this proposal can later be generalized to other finishable,
+frequently-created, short-lived resources, such as completed Pods or finished
+custom resources.
+
+[#64470]: https://github.com/kubernetes/kubernetes/issues/64470
+
+### Goals
+
+Make it easy for users to specify a time-based clean-up mechanism for
+finished resource objects.
+* It should be configurable both at resource creation time and after the resource is created.
+
+## Proposal
+
+[K8s Proposal: TTL controller for finished Jobs and Pods][]
+
+[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit
+
+### Concrete Use Cases
+
+* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo
+ workflows, etc.), see [#718][].
+
+* [Prow][] needs to clean up old completed Pods & finished Jobs. This is currently implemented by Prow's sinker.
+
+* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark executor Pods.
+
+* The Jenkins Kubernetes plugin creates slave pods that execute builds. It needs a better way to clean up old completed Pods.
+
+[Kubeflow]: https://github.com/kubeflow
+[#718]: https://github.com/kubeflow/tf-operator/issues/718
+[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow
+[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html
+
+### Detailed Design
+
+#### Feature Gate
+
+This will be launched as an alpha feature first, with feature gate
+`TTLAfterFinished`.
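+
+For illustration, here is a minimal sketch of how the gate might be defined
+and registered, following the existing feature-gate pattern in
+`pkg/features/kube_features.go`; the exact wiring is illustrative, not final.
+Cluster operators would then opt in with `--feature-gates=TTLAfterFinished=true`
+on the relevant components.
+
+```go
+package features
+
+import (
+	utilfeature "k8s.io/apiserver/pkg/util/feature"
+)
+
+const (
+	// TTLAfterFinished gates the TTL-after-finished controller and the new
+	// ttlSecondsAfterFinished API fields. Alpha, so disabled by default.
+	TTLAfterFinished utilfeature.Feature = "TTLAfterFinished"
+)
+
+func init() {
+	// Register the gate as alpha and off by default; code paths guard on
+	// utilfeature.DefaultFeatureGate.Enabled(TTLAfterFinished).
+	if err := utilfeature.DefaultFeatureGate.Add(map[utilfeature.Feature]utilfeature.FeatureSpec{
+		TTLAfterFinished: {Default: false, PreRelease: utilfeature.Alpha},
+	}); err != nil {
+		panic(err)
+	}
+}
+```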
+
+#### API Object
+
+We will add the following API field to `JobSpec` (`Job`'s `.spec`).
+
+```go
+type JobSpec struct {
+ // ttlSecondsAfterFinished limits the lifetime of a Job that has finished
+ // execution (either Complete or Failed). If this field is set, once the Job
+ // finishes, it will be deleted after ttlSecondsAfterFinished expires. When
+ // the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will
+ // be honored. If this field is unset, ttlSecondsAfterFinished will not
+ // expire. If this field is set to zero, ttlSecondsAfterFinished expires
+ // immediately after the Job finishes.
+ // This field is alpha-level and is only honored by servers that enable the
+ // TTLAfterFinished feature.
+ // +optional
+ TTLSecondsAfterFinished *int32
+}
+```
+
+This allows Jobs to be cleaned up after they finish and provides time for
+asynchronous clients to observe Jobs' final states before they are deleted.
+
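+For illustration, a client could set the field like this, assuming it lands
+in `k8s.io/api/batch/v1` as proposed; this is a sketch under that assumption,
+not the final API.
+
+```go
+package main
+
+import (
+	"fmt"
+
+	batchv1 "k8s.io/api/batch/v1"
+	corev1 "k8s.io/api/core/v1"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+func main() {
+	ttl := int32(3600) // delete the Job one hour after it finishes
+
+	job := &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{Name: "pi"},
+		Spec: batchv1.JobSpec{
+			TTLSecondsAfterFinished: &ttl, // proposed field from this KEP
+			Template: corev1.PodTemplateSpec{
+				Spec: corev1.PodSpec{
+					RestartPolicy: corev1.RestartPolicyNever,
+					Containers: []corev1.Container{{
+						Name:    "pi",
+						Image:   "perl",
+						Command: []string{"perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"},
+					}},
+				},
+			},
+		},
+	}
+	fmt.Printf("Job %q with TTL %ds\n", job.Name, *job.Spec.TTLSecondsAfterFinished)
+}
+```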
+
+Similarly, we will add the following API field to `PodSpec` (`Pod`'s `.spec`).
+
+```go
+type PodSpec struct {
+ // ttlSecondsAfterFinished limits the lifetime of a Pod that has finished
+ // execution (either Succeeded or Failed). If this field is set, once the Pod
+ // finishes, it will be deleted after ttlSecondsAfterFinished expires. When
+ // the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will
+ // be honored. If this field is unset, ttlSecondsAfterFinished will not
+ // expire. If this field is set to zero, ttlSecondsAfterFinished expires
+ // immediately after the Pod finishes.
+ // This field is alpha-level and is only honored by servers that enable the
+ // TTLAfterFinished feature.
+ // +optional
+ TTLSecondsAfterFinished *int32
+}
+```
+
+##### Validation
+
+Because the Job controller depends on its Pods continuing to exist in order
+to work correctly, Job validation should reject a `ttlSecondsAfterFinished`
+set in the Job's pod template, to prevent users from breaking their Jobs.
+Users should set TTL seconds on a Job, not on the Pods owned by a Job.
+
+It is common for higher-level resources to call generic PodSpec validation;
+therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed to
+be set on a PodSpec with a `restartPolicy` that is either `OnFailure` or `Never`
+(i.e. not `Always`). A sketch of both rules follows.
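+
+A hedged sketch of both validation rules; the function names are hypothetical,
+and the code assumes the proposed `TTLSecondsAfterFinished` fields have been
+added to the published Go types.
+
+```go
+package validation
+
+import (
+	batchv1 "k8s.io/api/batch/v1"
+	corev1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/util/validation/field"
+)
+
+// validatePodTTL rejects ttlSecondsAfterFinished on restartPolicy: Always,
+// since such a Pod never finishes. Assumes the proposed field exists on PodSpec.
+func validatePodTTL(spec *corev1.PodSpec, fldPath *field.Path) field.ErrorList {
+	allErrs := field.ErrorList{}
+	if spec.TTLSecondsAfterFinished != nil && spec.RestartPolicy == corev1.RestartPolicyAlways {
+		allErrs = append(allErrs, field.Invalid(
+			fldPath.Child("ttlSecondsAfterFinished"), *spec.TTLSecondsAfterFinished,
+			"may only be set when restartPolicy is OnFailure or Never"))
+	}
+	return allErrs
+}
+
+// validateJobTemplateTTL forbids the TTL field on a Job's pod template, since
+// the Job controller needs the Job's Pods to keep existing.
+func validateJobTemplateTTL(job *batchv1.Job, fldPath *field.Path) field.ErrorList {
+	allErrs := field.ErrorList{}
+	if job.Spec.Template.Spec.TTLSecondsAfterFinished != nil {
+		allErrs = append(allErrs, field.Forbidden(
+			fldPath.Child("template", "spec", "ttlSecondsAfterFinished"),
+			"may not be set on a Job's pod template; set it on the Job's spec instead"))
+	}
+	return allErrs
+}
+```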
+
+### User Stories
+
+A user keeps creating Jobs in a small Kubernetes cluster with 4 nodes. The
+Jobs accumulate over time, and a year later the cluster ends up with more
+than 100k old Jobs. This causes etcd hiccups and long, high-latency etcd
+requests, and eventually makes the cluster unavailable.
+
+The problem could have been avoided easily with the TTL controller for Jobs.
+
+The steps are as easy as:
+
+1. When creating Jobs, the user sets Jobs' `.spec.ttlSecondsAfterFinished` to
+ 3600 (i.e. 1 hour).
+1. The user deploys Jobs as usual.
+1. After a Job finishes, the result is observed asynchronously within an hour
+ and stored elsewhere.
+1. The TTL controller cleans up Jobs 1 hour after they complete.
+
+### Implementation Details/Notes/Constraints
+
+#### TTL Controller
+
+We will add a TTL controller for finished Jobs and finished Pods. We
+considered adding it to the Job controller, but decided not to, for the
+following reasons:
+
+1. The Job controller should focus on managing Pods based on the Job's spec
+   and pod template, not on cleaning up Jobs.
+1. We also need the TTL controller to clean up finished Pods, and we are
+   considering generalizing the TTL controller to custom resources later.
+
+The TTL controller utilizes the informer framework: it watches all Jobs and
+Pods, and reads Jobs and Pods from a local cache, as sketched below.
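+
+A rough skeleton of that wiring with client-go's shared informers; the
+rate-limited workqueue and handler names are illustrative, not the final
+implementation.
+
+```go
+package main
+
+import (
+	"time"
+
+	"k8s.io/client-go/informers"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/cache"
+	"k8s.io/client-go/tools/clientcmd"
+	"k8s.io/client-go/util/workqueue"
+)
+
+func main() {
+	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+	if err != nil {
+		panic(err)
+	}
+	client := kubernetes.NewForConfigOrDie(config)
+
+	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
+	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
+
+	enqueue := func(obj interface{}) {
+		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
+			queue.Add(key)
+		}
+	}
+	handler := cache.ResourceEventHandlerFuncs{
+		AddFunc:    enqueue,
+		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
+	}
+	// Watch all Jobs and Pods; reads are then served from the local caches.
+	factory.Batch().V1().Jobs().Informer().AddEventHandler(handler)
+	factory.Core().V1().Pods().Informer().AddEventHandler(handler)
+
+	stop := make(chan struct{})
+	defer close(stop)
+	factory.Start(stop)
+	factory.WaitForCacheSync(stop)
+
+	// Worker goroutines would pop keys from the queue and run the checks
+	// described in the next two sections.
+	select {}
+}
+```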
+
+#### Finished Jobs
+
+When a Job is created or updated:
+
+1. Check its `.status.conditions` to see if it has finished (`Complete` or
+ `Failed`). If it hasn't finished, do nothing.
+1. Otherwise, if the Job has finished, check if Job's
+ `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
+ not set.
+1. Otherwise, if the TTL field is set, check whether the TTL has expired,
+   i.e. whether `.spec.ttlSecondsAfterFinished` + the time when the Job
+   finished (`.status.conditions.lastTransitionTime`) is earlier than now
+   (see the sketch after this list).
+1. If the TTL hasn't expired, re-enqueue the Job after the remaining time
+   until it expires, computed as: (`.spec.ttlSecondsAfterFinished` +
+   `.status.conditions.lastTransitionTime` - now).
+1. If the TTL has expired, `GET` the Job from the API server to do final
+   sanity checks before deleting it.
+1. Check if the freshly fetched Job's TTL has expired. The field may have
+   been updated before the TTL controller observed the new value in its
+   local cache.
+   * If it hasn't expired, it is not safe to delete the Job. Re-enqueue the
+     Job after the remaining time until it expires.
+1. Delete the Job if it passes the sanity checks.
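+
+To make the expiry arithmetic concrete, here is a sketch of the check,
+assuming the finish time is read from the `lastTransitionTime` of the Job's
+`Complete` or `Failed` condition; the helper names are hypothetical.
+
+```go
+package ttl
+
+import (
+	"time"
+
+	batchv1 "k8s.io/api/batch/v1"
+	corev1 "k8s.io/api/core/v1"
+)
+
+// jobFinishTime returns when the Job finished, read from the transition time
+// of its Complete or Failed condition, and whether it has finished at all.
+func jobFinishTime(job *batchv1.Job) (time.Time, bool) {
+	for _, c := range job.Status.Conditions {
+		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
+			return c.LastTransitionTime.Time, true
+		}
+	}
+	return time.Time{}, false
+}
+
+// timeUntilExpiry returns the remaining time before the Job's TTL expires
+// (zero or negative means already expired), and whether a TTL applies at all.
+func timeUntilExpiry(job *batchv1.Job, now time.Time) (time.Duration, bool) {
+	finishedAt, finished := jobFinishTime(job)
+	if !finished || job.Spec.TTLSecondsAfterFinished == nil {
+		return 0, false // not finished, or no TTL set: nothing to do
+	}
+	expireAt := finishedAt.Add(time.Duration(*job.Spec.TTLSecondsAfterFinished) * time.Second)
+	return expireAt.Sub(now), true
+}
+```
+
+A positive remainder is how long to delay re-enqueueing (e.g. via the
+workqueue's `AddAfter`); a zero or negative remainder triggers the final
+`GET` and deletion.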
+
+#### Finished Pods
+
+When a Pod is created or updated:
+
+1. Check its `.status.phase` to see if it has finished (`Succeeded` or `Failed`).
+ If it hasn't finished, do nothing.
+1. Otherwise, if the Pod has finished, check if Pod's
+ `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
+ not set.
+1. Otherwise, if the TTL field is set, check whether the TTL has expired,
+   i.e. whether `.spec.ttlSecondsAfterFinished` + the time when the Pod
+   finished (the maximum of all of its containers' termination times,
+   `.status.containerStatuses[*].state.terminated.finishedAt`) is earlier
+   than now (see the sketch after this list).
+1. If the TTL hasn't expired, re-enqueue the Pod after the remaining time
+   until it expires, computed as: (`.spec.ttlSecondsAfterFinished` + the time
+   when the Pod finished - now).
+1. If the TTL has expired, `GET` the Pod from the API server to do final
+   sanity checks before deleting it.
+1. Check if the freshly fetched Pod's TTL has expired. The field may have
+   been updated before the TTL controller observed the new value in its
+   local cache.
+   * If it hasn't expired, it is not safe to delete the Pod. Re-enqueue the
+     Pod after the remaining time until it expires.
+1. Delete the Pod if it passes the sanity checks.
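+
+As with Jobs, a sketch of deriving a finished Pod's finish time as the latest
+`terminated.finishedAt` across its containers; the helper name is
+hypothetical.
+
+```go
+package ttl
+
+import (
+	"time"
+
+	corev1 "k8s.io/api/core/v1"
+)
+
+// podFinishTime returns when a finished Pod completed, taken as the maximum
+// terminated.finishedAt over its containers, and whether it has finished.
+func podFinishTime(pod *corev1.Pod) (time.Time, bool) {
+	if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
+		return time.Time{}, false
+	}
+	var finishedAt time.Time
+	for _, cs := range pod.Status.ContainerStatuses {
+		if t := cs.State.Terminated; t != nil && t.FinishedAt.Time.After(finishedAt) {
+			finishedAt = t.FinishedAt.Time
+		}
+	}
+	return finishedAt, !finishedAt.IsZero()
+}
+```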
+
+#### Owner References
+
+We considered making the TTL controller leave a Job/Pod around even after its
+TTL expires, if the Job/Pod has any owner specified in its
+`.metadata.ownerReferences`.
+
+We decided not to block deletion on owners, because the purpose of
+`.metadata.ownerReferences` is cascading deletion, not keeping an owner's
+dependents alive. If a Job is owned by a CronJob, the Job can be cleaned up
+based on the CronJob's history limit (i.e. the number of dependent Jobs to
+keep), or the CronJob can choose not to set a history limit and instead set
+the TTL in its Job template, so that Jobs are cleaned up when their TTLs
+expire rather than when the history limit is exceeded.
+
+Therefore, a Job/Pod can be deleted after its TTL expires, even if it still has
+owners.
+
+Similarly, the TTL won't block deletion by the generic garbage collector.
+This means that when a Job's or Pod's owners are gone, the generic garbage
+collector will delete it, even if it hasn't finished or its TTL hasn't
+expired.
+
+### Risks and Mitigations
+
+Risks:
+* Time skew may cause the TTL controller to clean up resource objects at the
+  wrong time.
+
+Mitigations:
+* Kubernetes requires running NTP on all nodes ([#6159][]) to avoid time
+  skew. We will also document this risk.
+
+[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058
+
+## Graduation Criteria
+
+We want to implement this feature for Pods/Jobs first to gather feedback, and
+then decide whether to generalize it to custom resources. This feature can be
+promoted to beta after we finalize the decision on whether to generalize it,
+and once it satisfies users' needs for cleaning up finished resource objects
+without regressions.
+
+This will be promoted to GA once it has been in beta for a sufficient amount
+of time with no changes.
+
+## Implementation History
+
+TBD