| author | David Ashpole <dashpole@google.com> | 2017-11-30 11:45:58 -0800 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2017-11-30 11:45:58 -0800 |
| commit | 51317535a6679efa60437cc54823fba03a16cd2d (patch) | |
| tree | e00853f315aa9d7936f847c442fd6585a951c6b9 | |
| parent | c2fc426271cee8a8cd7db0927cf0ad7c1db4c34d (diff) | |
Add proposal for improving memcg notifications.
| -rw-r--r-- | contributors/design-proposals/node/kubelet-eviction.md | 48 |
1 file changed, 42 insertions, 6 deletions
diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md
index c777e7c7..2250c5c6 100644
--- a/contributors/design-proposals/node/kubelet-eviction.md
+++ b/contributors/design-proposals/node/kubelet-eviction.md
@@ -191,6 +191,48 @@
 signal. If that signal is observed as being satisfied for longer than the
 specified period, the `kubelet` will initiate eviction to attempt to reclaim
 the resource that has met its eviction threshold.
 
+### Memory CGroup Notifications
+
+When the `kubelet` is started with `--experimental-kernel-memcg-notification=true`,
+it uses cgroup event notifications on the `memory.usage_in_bytes` file to trigger the eviction manager.
+Combined with on-demand metrics, this allows the `kubelet` to trigger the eviction manager,
+collect metrics, and respond with evictions much more quickly than the sync loop alone.
+
+However, a current issue is that the cgroup notification fires based on `memory.usage_in_bytes`,
+while the eviction manager determines memory pressure from the working set,
+which is `memory.usage_in_bytes - memory.total_inactive_file`.
+For example (treating 1Gi as 1000Mi for simplicity):
+```
+capacity = 1Gi
+--eviction-hard=memory.available<250Mi
+assume memory.total_inactive_file = 10Mi
+```
+When the cgroup event is triggered, `memory.usage_in_bytes = 750Mi`.
+The eviction manager observes
+`working_set = memory.usage_in_bytes - memory.total_inactive_file = 740Mi`,
+so the signal is `memory.available = capacity - working_set = 260Mi`.
+Therefore, it sees no memory pressure. This mismatch occurs whenever
+`memory.total_inactive_file` is non-zero.
+
+### Proposed solutions
+
+1. Set the cgroup event at `threshold * fraction`.
+For example, if `--eviction-hard=memory.available<200Mi`, set the cgroup event at `100Mi`.
+This way, when the eviction manager is triggered, it will likely observe memory pressure.
+This is not guaranteed to always work, but it should prevent OOMs in most cases.
+
+2. Use usage instead of the working set to determine memory pressure.
+The eviction manager and the cgroup notification would then use the same metric,
+so the response is ideal: the eviction manager is triggered exactly when memory pressure occurs.
+However, the eviction manager may often evict unnecessarily when there are large quantities of
+memory that the kernel has not yet reclaimed.
+
+3. Shorten the sync loop interval after the threshold is crossed.
+For example, the eviction manager could collect observations every second instead of every
+10 seconds once the threshold is crossed. Even though the cgroup event and the eviction
+manager are not completely in sync, the notification helps the eviction manager respond faster
+than it otherwise would. After a short period, it would resume the standard sync loop interval.
+
 ### Disk
 
 Let's assume the operator started the `kubelet` with the following:
@@ -457,9 +499,3 @@
 In general, it should be strongly recommended that `DaemonSet` not create
 `BestEffort` pods to avoid being identified as a candidate pod for eviction.
 Instead `DaemonSet` should ideally include Guaranteed pods only.
-## Known issues
-
-### kubelet may evict more pods than needed
-
-The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding
-the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.
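
For reference, the cgroup notification mechanism the added section relies on is the cgroup v1 memory threshold event API: a process creates an eventfd, writes the eventfd, the file descriptor of `memory.usage_in_bytes`, and a byte threshold to the cgroup's `cgroup.event_control` file, and the kernel then signals the eventfd when usage crosses that threshold. The Go sketch below shows that registration in isolation; it is illustrative only, the cgroup path and threshold value are assumptions, and it is not the kubelet's actual implementation.

```go
// Minimal sketch of registering a cgroup v1 memory threshold notification,
// the mechanism behind the memcg notifications discussed in the diff above.
// The cgroup path and the threshold value are illustrative assumptions.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const cgroupDir = "/sys/fs/cgroup/memory" // assumed root memory cgroup (cgroup v1)
	const threshold = 750 * 1024 * 1024       // e.g. capacity minus the eviction threshold

	// Open memory.usage_in_bytes; the kernel compares its value against the threshold.
	usage, err := os.Open(cgroupDir + "/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}
	defer usage.Close()

	// Create an eventfd that the kernel will signal when usage crosses the threshold.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(efd)

	// Register the notification: "<eventfd> <fd of memory.usage_in_bytes> <threshold>".
	config := fmt.Sprintf("%d %d %d", efd, usage.Fd(), threshold)
	if err := os.WriteFile(cgroupDir+"/cgroup.event_control", []byte(config), 0700); err != nil {
		panic(err)
	}

	// Block until the threshold is crossed, then hand off to whatever reacts to pressure.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		panic(err)
	}
	fmt.Println("memory.usage_in_bytes crossed the threshold; trigger the eviction manager")
}
```

A real consumer would read the eventfd in a loop and feed each wake-up into the component that evaluates memory pressure, which in the proposal is the eviction manager's synchronize step.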

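To make the worked example and proposed solution 1 concrete, here is a small, self-contained Go sketch of the arithmetic only, using the same numbers as the example in the diff (1Gi treated as 1000Mi). The helper names and the 0.5 fraction are assumptions for illustration, not values taken from the proposal.

```go
// Illustrative arithmetic for the working-set vs. usage mismatch and for
// proposed solution 1 (fire the cgroup event at threshold*fraction).
package main

import "fmt"

const mi = 1024 * 1024

// memoryAvailable mirrors the eviction signal: capacity minus the working set,
// where working set = usage - total_inactive_file.
func memoryAvailable(capacity, usage, totalInactiveFile uint64) uint64 {
	workingSet := usage - totalInactiveFile
	return capacity - workingSet
}

// notificationPoint is the usage value at which the cgroup event would fire
// under solution 1: threshold*fraction of "available" below capacity.
func notificationPoint(capacity, threshold uint64, fraction float64) uint64 {
	return capacity - uint64(float64(threshold)*fraction)
}

func main() {
	capacity := uint64(1000 * mi)   // 1Gi, treated as 1000Mi as in the example
	threshold := uint64(250 * mi)   // --eviction-hard=memory.available<250Mi
	inactiveFile := uint64(10 * mi) // assumed memory.total_inactive_file

	// Naive notification point: fire when usage == capacity - threshold (750Mi).
	usageAtEvent := capacity - threshold
	fmt.Printf("memory.available at naive event point: %dMi\n",
		memoryAvailable(capacity, usageAtEvent, inactiveFile)/mi) // 260Mi, so no pressure

	// Solution 1: fire earlier, at threshold*fraction (fraction assumed to be 0.5).
	fmt.Printf("usage at adjusted event point: %dMi\n",
		notificationPoint(capacity, threshold, 0.5)/mi) // 875Mi
}
```

Running this prints 260Mi of available memory at the naive notification point, which is above the 250Mi eviction threshold, and an adjusted notification point of 875Mi of usage under solution 1, at which point memory pressure is much more likely to be observed.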