author    k8s-merge-robot <k8s.production.user@gmail.com> 2016-05-20 17:57:18 -0700
committer k8s-merge-robot <k8s.production.user@gmail.com> 2016-05-20 17:57:18 -0700
commit    f6a2f4fce88f2943d8aed271d30a89b49e044263 (patch)
tree      eb3e280bb64fb23af6e4946d637eae0d70280cf9
parent    b4536e9389ec2bd0472f1a27921297ecf2c139b5 (diff)
parent    59d8ebe7666b29ebe5fdb3aab52097da6a65ca36 (diff)
Merge pull request #25237 from vishh/disk-based-eviction-proposal
Automatic merge from submit-queue

Proposal for disk based evictions

cc @dchen1107 @derekwaynecarr
-rw-r--r--  kubelet-eviction.md | 183
1 file changed, 173 insertions(+), 10 deletions(-)
diff --git a/kubelet-eviction.md b/kubelet-eviction.md
index c62b26aa..87920906 100644
--- a/kubelet-eviction.md
+++ b/kubelet-eviction.md
@@ -29,9 +29,9 @@ Documentation for other releases can be found at
# Kubelet - Eviction Policy
-**Author**: Derek Carr (@derekwaynecarr)
+**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)
-**Status**: Proposed
+**Status**: Proposed (memory evictions WIP)
This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.
@@ -58,8 +58,8 @@ moved and scheduled elsewhere when/if its backing controller creates a new pod.
This proposal defines a pod eviction policy for reclaiming compute resources.
-In the first iteration, it focuses on memory; later iterations are expected to cover
-other resources like disk. The proposal focuses on a simple default eviction strategy
+As of now, memory- and disk-based evictions are supported.
+The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.
## Eviction Signals
@@ -69,6 +69,16 @@ The `kubelet` will support the ability to trigger eviction decisions on the foll
| Eviction Signal | Description |
|------------------|---------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
+| nodefs.available | nodefs.available := node.stats.fs.available |
+| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
+
+`kubelet` supports only two filesystem partitions:
+
+1. The `nodefs` filesystem that `kubelet` uses for volumes, daemon logs, etc.
+1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
+
+`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
+`kubelet` ignores all other filesystems; no other partition configuration is currently supported. For example, it is *not OK* to store volumes and logs on a dedicated `imagefs`.
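+
+To make the signal definitions concrete, here is a minimal Go sketch of how the three signals could be derived from a cAdvisor-style stats snapshot; the types and field names are illustrative, not the kubelet's actual API:
+
+```go
+package eviction
+
+// nodeStats is an illustrative snapshot of the statistics the signals are
+// defined against; the comments mirror the table above.
+type nodeStats struct {
+    memoryCapacity   int64 // node.status.capacity[memory]
+    memoryWorkingSet int64 // node.stats.memory.workingSet
+    nodefsAvailable  int64 // node.stats.fs.available
+    imagefsAvailable int64 // node.stats.runtime.imagefs.available
+    hasImagefs       bool  // imagefs is optional
+}
+
+// signals derives the eviction signals from a stats snapshot.
+func signals(s nodeStats) map[string]int64 {
+    m := map[string]int64{
+        "memory.available": s.memoryCapacity - s.memoryWorkingSet,
+        "nodefs.available": s.nodefsAvailable,
+    }
+    if s.hasImagefs {
+        m["imagefs.available"] = s.imagefsAvailable
+    }
+    return m
+}
+```
+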
## Eviction Thresholds
@@ -151,6 +161,7 @@ The following node conditions are defined that correspond to the specified evict
| Node Condition | Eviction Signal | Description |
|----------------|------------------|------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
+| DiskPressure | nodefs.available or imagefs.available | Available disk space on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
@@ -174,7 +185,9 @@ The `kubelet` would ensure that it has not observed an eviction threshold being
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
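+
+A minimal Go sketch of this hysteresis, with illustrative names (not the kubelet's actual implementation):
+
+```go
+package eviction
+
+import "time"
+
+// conditionTracker flips a pressure condition to true as soon as a threshold
+// is met, but only flips it back to false after the threshold has not been
+// met for transitionPeriod (--eviction-pressure-transition-period).
+type conditionTracker struct {
+    pressure         bool
+    lastObserved     time.Time // last time the threshold was observed as met
+    transitionPeriod time.Duration
+}
+
+// update ingests one observation and returns the resulting condition status.
+func (c *conditionTracker) update(thresholdMet bool, now time.Time) bool {
+    if thresholdMet {
+        c.lastObserved = now
+        c.pressure = true
+    } else if c.pressure && now.Sub(c.lastObserved) >= c.transitionPeriod {
+        c.pressure = false
+    }
+    return c.pressure
+}
+```
+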
-## Eviction scenario
+## Eviction scenarios
+
+### Memory
Let's assume the operator started the `kubelet` with the following:
@@ -194,6 +207,31 @@ signal. If that signal is observed as being satisfied for longer than the
specified period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
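+
+The cache behavior described above can be sketched in Go; the `graceTracker` name and shape are illustrative, not the kubelet's actual implementation:
+
+```go
+package eviction
+
+import "time"
+
+// graceTracker remembers when each soft eviction signal was first observed,
+// and forgets a signal as soon as it recovers.
+type graceTracker struct {
+    firstObserved map[string]time.Time
+}
+
+// shouldEvict returns true once a signal has been continuously observed as
+// met for longer than its configured grace period.
+func (g *graceTracker) shouldEvict(signal string, met bool, grace time.Duration, now time.Time) bool {
+    if g.firstObserved == nil {
+        g.firstObserved = map[string]time.Time{}
+    }
+    if !met {
+        delete(g.firstObserved, signal) // signal recovered; clear the cache
+        return false
+    }
+    start, ok := g.firstObserved[signal]
+    if !ok {
+        g.firstObserved[signal] = now // first observation; start the clock
+        return false
+    }
+    return now.Sub(start) >= grace
+}
+```
+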
+### Disk
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-hard="nodefs.available<1Gi,imagefs.available<10Gi"
+--eviction-soft="nodefs.available<1.5Gi,imagefs.available<20Gi"
+--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
+```
+
+The `kubelet` will run a sync loop that looks at the available disk
+on the node's supported partitions as reported from `cAdvisor`.
+If available disk space on the node's primary filesystem is observed to drop below 1Gi,
+the `kubelet` will immediately initiate eviction.
+If available disk space on the node's image filesystem is observed to drop below 10Gi,
+the `kubelet` will immediately initiate eviction.
+
+If available disk space on the node's primary filesystem is observed as falling below `1.5Gi`,
+or if available disk space on the node's image filesystem is observed as falling below `20Gi`,
+it will record when that signal was observed internally in a cache. If at the next
+sync, that criterion was no longer satisfied, the cache is cleared for that
+signal. If that signal is observed as being satisfied for longer than the
+specified period, the `kubelet` will initiate eviction to attempt to
+reclaim the resource that has met its eviction threshold.
+
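+Combining hard and soft thresholds, a single observation could be evaluated as in this sketch (reusing the illustrative `graceTracker` from the memory scenario above):
+
+```go
+package eviction
+
+import "time"
+
+// threshold pairs an eviction signal with a hard limit (evict immediately)
+// and a soft limit guarded by a grace period; all names are illustrative.
+type threshold struct {
+    signal      string
+    hard        int64         // e.g. nodefs.available < 1Gi
+    soft        int64         // e.g. nodefs.available < 1.5Gi
+    gracePeriod time.Duration // e.g. 1m
+}
+
+// evaluate reports whether eviction should start for one observed value.
+func evaluate(t threshold, available int64, g *graceTracker, now time.Time) bool {
+    if available < t.hard {
+        return true // hard threshold: no grace period, evict immediately
+    }
+    return g.shouldEvict(t.signal, available < t.soft, t.gracePeriod, now)
+}
+```
+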
## Eviction of Pods
If an eviction threshold has been met, the `kubelet` will initiate the
@@ -241,11 +279,111 @@ only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other guaranteed pod(s).
+## Disk based evictions
+
+### With Imagefs
+
+If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete logs
+1. Evict Pods if required.
+
+If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete unused images
+1. Evict Pods if required.
+
+### Without Imagefs
+
+If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete logs
+1. Delete unused images
+1. Evict Pods if required.
+
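+The three orderings above can be summarized in a short Go sketch; the step names are placeholders, not kubelet APIs:
+
+```go
+package eviction
+
+// reclaimOrder returns the ordered reclaim steps for the filesystem under
+// pressure ("nodefs" or "imagefs").
+func reclaimOrder(fs string, hasImagefs bool) []string {
+    switch {
+    case fs == "imagefs":
+        return []string{"delete unused images", "evict pods"}
+    case hasImagefs: // nodefs pressure alongside a dedicated imagefs
+        return []string{"delete logs", "evict pods"}
+    default: // no imagefs: nodefs also holds images and writable layers
+        return []string{"delete logs", "delete unused images", "evict pods"}
+    }
+}
+```
+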
+Let's explore the different options for freeing up disk space.
+
+### Delete logs of dead pods/containers
+
+As of today, logs are tied to a container's lifetime. `kubelet` keeps dead containers around
+to provide access to their logs.
+In the future, if we store logs of dead containers outside of the container itself, then
+`kubelet` can delete these logs to free up disk space.
+Once the lifetimes of containers and logs are decoupled, `kubelet` can support more user-friendly policies
+around log evictions, such as deleting logs of the oldest containers first.
+Since logs from the first and the most recent incarnations of a container are the most important for most applications,
+`kubelet` can try to preserve these logs and aggressively delete logs from other container incarnations.
+
+Until logs are decoupled from a container's lifetime, `kubelet` can delete dead containers to free up disk space.
+
+### Delete unused images
+
+`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
+Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
+`kubelet` employs an LRU policy when it comes to deleting images.
+
+The existing policy will be replaced with a much simpler one.
+Images will be deleted based on eviction thresholds. If `kubelet` can delete logs and keep disk space availability
+above eviction thresholds, then it will not delete any images.
+If `kubelet` decides to delete unused images, it will delete *all* unused images.
+
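+A minimal sketch of this all-or-nothing policy, with byte counts standing in for a hypothetical per-image accounting:
+
+```go
+package eviction
+
+// reclaimImages returns the new availability after applying the simplified
+// image deletion policy: if deleting logs already restored availability
+// above the eviction threshold, no images are touched; otherwise *all*
+// unused images are deleted, not just enough to reach a low watermark.
+func reclaimImages(available, threshold int64, unusedImageBytes []int64) int64 {
+    if available > threshold {
+        return available // logs were enough; keep every image
+    }
+    for _, b := range unusedImageBytes {
+        available += b // delete all unused images in one pass
+    }
+    return available
+}
+```
+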
+### Evict pods
+
+There is no ability to specify disk limits for pods/containers today.
+Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
+`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
+`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
+Within each QoS bucket, `kubelet` will sort pods according to their disk usage, as follows:
+
+#### Without Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
+(local volumes + logs and writable layers of all their containers).
+
+#### With Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their usage of `nodefs`
+(local volumes + logs of all their containers).
+
+If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
+
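+Putting the two cases together, the per-bucket ranking might look like this sketch; `podDiskUsage` is an illustrative stand-in for the kubelet's real per-pod accounting:
+
+```go
+package eviction
+
+import "sort"
+
+// podDiskUsage summarizes where a pod's disk usage is charged.
+type podDiskUsage struct {
+    name          string
+    volumes, logs int64 // charged to nodefs
+    writableLayer int64 // charged to imagefs when one exists
+}
+
+// rank sorts the pods in one QoS bucket so that the pod freeing the most
+// space on the filesystem under pressure comes first.
+func rank(pods []podDiskUsage, fsUnderPressure string, hasImagefs bool) {
+    usage := func(p podDiskUsage) int64 {
+        switch {
+        case !hasImagefs: // nodefs holds everything
+            return p.volumes + p.logs + p.writableLayer
+        case fsUnderPressure == "imagefs":
+            return p.writableLayer
+        default: // nodefs pressure with a dedicated imagefs
+            return p.volumes + p.logs
+        }
+    }
+    sort.Slice(pods, func(i, j int) bool { return usage(pods[i]) > usage(pods[j]) })
+}
+```
+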
+## Minimum eviction thresholds
+
+In certain scenarios, evicting pods may reclaim only a small amount of a resource, which can result in
+`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming a resource
+like `disk` is time consuming.
+
+To mitigate these issues, `kubelet` will have a per-resource `minimum-threshold`. Whenever `kubelet` observes
+resource pressure, it will attempt to reclaim at least the `minimum-threshold` amount of that resource.
+
+The following flag configures the `minimum-thresholds` for each evictable resource:
+
+`--minimum-eviction-thresholds="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`
+
+The default `minimum-eviction-threshold` is `0` for all resources.
+
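+A sketch of how a minimum threshold changes the reclaim loop; `reclaimOnce` is a hypothetical callback standing in for the individual reclaim steps:
+
+```go
+package eviction
+
+// reclaimAtLeast keeps invoking reclaim steps until at least minThreshold
+// of the resource has been recovered, so that `kubelet` does not oscillate
+// around an eviction threshold on repeated small evictions.
+func reclaimAtLeast(minThreshold int64, reclaimOnce func() (freed int64, more bool)) int64 {
+    var total int64
+    for total < minThreshold {
+        freed, more := reclaimOnce()
+        total += freed
+        if !more {
+            break // nothing left to reclaim; stop even below the minimum
+        }
+    }
+    return total
+}
+```
+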
+## Deprecation of existing features
+
+`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
+some of the existing features/flags around disk space reclamation will be deprecated in favor of the eviction policy described here.
+
+| Existing Flag | New Flag | Rationale |
+|---------------|----------|-----------|
+| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
+| `--image-gc-low-threshold` | `--minimum-eviction-thresholds` | eviction thresholds achieve the same behavior |
+| `--maximum-dead-containers` | | deprecated once old logs are stored outside of the container's context |
+| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of the container's context |
+| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of the container's context |
+| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
+| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
+
## Kubelet Admission Control
### Feasibility checks during kubelet admission
-The `kubelet` will reject `BestEffort` pods if any of its associated
+#### Memory
+
+The `kubelet` will reject `BestEffort` pods if any of the memory
eviction thresholds have been exceeded independent of the configured
grace period.
@@ -265,13 +403,38 @@ The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource and the `kubelet` should
return to a steady state before accepting new workloads.
+#### Disk
+
+The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-soft="nodefs.available<1500Mi"
+--eviction-soft-grace-period="nodefs.available=30s"
+```
+
+If the `kubelet` sees that it has less than `1500Mi` of disk available
+on the node, but the `kubelet` has not yet initiated eviction since the
+grace period criterion has not yet been met, the `kubelet` will still immediately
+fail any incoming pods.
+
+The rationale for failing **all** pods instead of just best effort pods is that disk is currently
+a best effort resource for all QoS classes.
+
+`kubelet` will apply the same policy even if there is a dedicated `imagefs`.
+
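+The combined admission rules reduce to a single predicate, sketched here with an illustrative `qosClass` enum:
+
+```go
+package eviction
+
+// qosClass is an illustrative stand-in for pod QoS classes.
+type qosClass int
+
+const (
+    bestEffort qosClass = iota
+    burstable
+    guaranteed
+)
+
+// admit sketches the feasibility check: under memory pressure only
+// BestEffort pods are rejected; under disk pressure every pod is rejected,
+// because disk is a best effort resource for all QoS classes.
+func admit(qos qosClass, memoryPressure, diskPressure bool) bool {
+    if diskPressure {
+        return false // reject all pods, regardless of QoS and grace periods
+    }
+    if memoryPressure && qos == bestEffort {
+        return false // reject best effort pods only
+    }
+    return true
+}
+```
+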
## Scheduler
The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
-best effort pods on the node. In this case, the `MemoryPressure` condition if true
-should dissuade the scheduler from placing new best effort pods on the node since
-they will be rejected by the `kubelet` in admission.
+best effort pods on the node.
+
+In this case, if the `MemoryPressure` condition is true, the scheduler should avoid placing
+new best effort pods on the node, since they will be rejected by the `kubelet` in admission.
+
+On the other hand, if the `DiskPressure` condition is true, the scheduler should avoid placing
+**any** new pods on the node, since they will be rejected by the `kubelet` in admission.
## Best Practices
@@ -288,7 +451,7 @@ candidate set of pods provided to the eviction strategy.
In general, it should be strongly recommended that `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
-for eviction.
+for eviction. Instead, a `DaemonSet` should ideally include `Guaranteed` pods only.