From 51317535a6679efa60437cc54823fba03a16cd2d Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Thu, 30 Nov 2017 11:45:58 -0800 Subject: Add proposal for improving memcg notifications. --- .../design-proposals/node/kubelet-eviction.md | 48 +++++++++++++++++++--- 1 file changed, 42 insertions(+), 6 deletions(-) diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md index c777e7c7..2250c5c6 100644 --- a/contributors/design-proposals/node/kubelet-eviction.md +++ b/contributors/design-proposals/node/kubelet-eviction.md @@ -191,6 +191,48 @@ signal. If that signal is observed as being satisfied for longer than the specified period, the `kubelet` will initiate eviction to attempt to reclaim the resource that has met its eviction threshold. +### Memory CGroup Notifications + +When the `kubelet` is started with `--experimental-kernel-memcg-notification=true`, +it will use cgroup events on the memory.usage_in_bytes file in order to trigger the eviction manager. +With the addition of on-demand metrics, this permits the `kubelet` to trigger the eviction manager, +collect metrics, and respond with evictions much quicker than using the sync loop alone. + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +However, a current issue with this is that the cgroup notifications trigger based on memory.usage_in_bytes, +but the eviction manager determines memory pressure based on the working set, which is (memory.usage_in_bytes - memory.total_inactive_file). +For example: +``` +capacity = 1Gi +--eviction-hard=memory.available<250Mi +assume memory.total_inactive_file=10Mi +``` +When the cgroup event is triggered, `memory.usage_in_bytes = 750Mi`. +The eviction manager observes +`working_set = memory.usage_in_bytes - memory.total_inactive_file = 740Mi` +Signal: `memory.available = capacity - working_set = 260Mi` +Therefore, no memory pressure. 
This will occur as long as memory.total_inactive_file is non-zero.
+
+### Proposed solutions:
+1. Set the cgroup event at `threshold*fraction`.
+For example, if `--eviction-hard=memory.available<200Mi`, set the cgroup event at `100Mi`.
+This way, when the eviction manager is triggered, it will likely observe memory pressure.
+This is not guaranteed to always work, but should prevent OOMs in most cases.
+
+2. Use Usage instead of Working Set to determine memory pressure
+This would mean that the eviction manager and cgroup notifications use the same metric,
+and thus the response is ideal: the eviction manager is triggered exactly when memory pressure occurs.
+However, the eviction manager may often evict unnecessarily if there are large quantities of memory
+the kernel has not yet reclaimed.
+
+3. Increase the sync loop frequency after the threshold is crossed
+For example, the eviction manager could start collecting observations every second instead of every
+10 seconds after the threshold is crossed. This means that even though the cgroup event and eviction
+manager are not completely in-sync, the threshold can help the eviction manager to respond faster than
+it otherwise would. After a short period, it would resume the standard interval of sync loop calls.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 ### Disk
 
 Let's assume the operator started the `kubelet` with the following:
@@ -457,9 +499,3 @@ In general, it should be strongly recommended that `DaemonSet` not create
 `BestEffort` pods to avoid being identified as a candidate pod for eviction.
 Instead `DaemonSet` should ideally include Guaranteed pods only.
 
-## Known issues
-
-### kubelet may evict more pods than needed
-
-The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding
-the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.
-- cgit v1.2.3 From 41212d7f4c43a03487c6d88aace260763c4297d7 Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Thu, 30 Nov 2017 11:55:00 -0800 Subject: format --- contributors/design-proposals/node/kubelet-eviction.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md index 2250c5c6..5124b628 100644 --- a/contributors/design-proposals/node/kubelet-eviction.md +++ b/contributors/design-proposals/node/kubelet-eviction.md @@ -198,7 +198,6 @@ it will use cgroup events on the memory.usage_in_bytes file in order to trigger With the addition of on-demand metrics, this permits the `kubelet` to trigger the eviction manager, collect metrics, and respond with evictions much quicker than using the sync loop alone. -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ However, a current issue with this is that the cgroup notifications trigger based on memory.usage_in_bytes, but the eviction manager determines memory pressure based on the working set, which is (memory.usage_in_bytes - memory.total_inactive_file). For example: @@ -231,8 +230,6 @@ For example, the eviction manager could start collecting observations every seco manager are not completely in-sync, the threshold can help the eviction manager to respond faster than it otherwise would. After a short period, it would resume the standard interval of sync loop calls. 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
 ### Disk
 
 Let's assume the operator started the `kubelet` with the following:
-- cgit v1.2.3

From 7948e33ae28232ba6b46a6a9af1ffd1bfc6a3e8b Mon Sep 17 00:00:00 2001
From: David Ashpole
Date: Wed, 20 Dec 2017 14:37:56 -0800
Subject: cross out option 2

---
 contributors/design-proposals/node/kubelet-eviction.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md
index 5124b628..311236de 100644
--- a/contributors/design-proposals/node/kubelet-eviction.md
+++ b/contributors/design-proposals/node/kubelet-eviction.md
@@ -214,15 +214,15 @@ Therefore, no memory pressure. This will occur as long as memory.total_inactive
 
 ### Proposed solutions:
 1. Set the cgroup event at `threshold*fraction`.
-For example, if `--eviction-hard=memory.available<200Mi`, set the cgroup event at `100Mi`.
+For example, if `--eviction-hard=memory.available<200Mi`, use `fraction=1/2` set the cgroup event at `100Mi`.
 This way, when the eviction manager is triggered, it will likely observe memory pressure.
-This is not guaranteed to always work, but should prevent OOMs in most cases.
+This is not guaranteed to always work, but should prevent OOMs in most cases, and is simple to implement.
 
-2. Use Usage instead of Working Set to determine memory pressure
+~~2. Use Usage instead of Working Set to determine memory pressure
 This would mean that the eviction manager and cgroup notifications use the same metric,
 and thus the response is ideal: the eviction manager is triggered exactly when memory pressure occurs.
 However, the eviction manager may often evict unnecessarily if there are large quantities of memory
-the kernel has not yet reclaimed.
+the kernel has not yet reclaimed.~~
 
 3. Increase the sync loop frequency after the threshold is crossed
 For example, the eviction manager could start collecting observations every second instead of every
-- cgit v1.2.3

From b5aa22691c5aacc901a022c885ef59c2f695ee62 Mon Sep 17 00:00:00 2001
From: David Ashpole
Date: Wed, 20 Dec 2017 14:40:31 -0800
Subject: explain fraction

---
 contributors/design-proposals/node/kubelet-eviction.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md
index 311236de..69a9cc17 100644
--- a/contributors/design-proposals/node/kubelet-eviction.md
+++ b/contributors/design-proposals/node/kubelet-eviction.md
@@ -214,7 +214,7 @@ Therefore, no memory pressure. This will occur as long as memory.total_inactive
 
 ### Proposed solutions:
 1. Set the cgroup event at `threshold*fraction`.
-For example, if `--eviction-hard=memory.available<200Mi`, use `fraction=1/2` set the cgroup event at `100Mi`.
+For example, if `--eviction-hard=memory.available<200Mi`, use `fraction=1/2` and set the cgroup event at `100Mi` (`200Mi*1/2`).
 This way, when the eviction manager is triggered, it will likely observe memory pressure.
 This is not guaranteed to always work, but should prevent OOMs in most cases, and is simple to implement.
 
-- cgit v1.2.3

From 031dc7414e056413d9d3f20ced3ea84590a027db Mon Sep 17 00:00:00 2001
From: David Ashpole
Date: Tue, 2 Jan 2018 16:12:30 -0800
Subject: Periodically change memcg threshold.
--- contributors/design-proposals/node/kubelet-eviction.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md index 69a9cc17..4fff62d5 100644 --- a/contributors/design-proposals/node/kubelet-eviction.md +++ b/contributors/design-proposals/node/kubelet-eviction.md @@ -230,6 +230,11 @@ For example, the eviction manager could start collecting observations every seco manager are not completely in-sync, the threshold can help the eviction manager to respond faster than it otherwise would. After a short period, it would resume the standard interval of sync loop calls. +4. Periodically adjust the memory cgroup threshold based on total_inactive_file +For example, the eviction manager would set the threshold for usage_in_bytes to mem_capacity - eviction_hard + +total_inactive_file. This would mean that the threshold is crossed when usage_in_bytes - total_inactive_file += mem_capacity - eviction_hard. As long as total_inactive_file changes slowly, this would be fairly accurate. 
+ ### Disk Let's assume the operator started the `kubelet` with the following: -- cgit v1.2.3 From f8494e93bc9cb518d1013d483249a1af11850997 Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Tue, 27 Feb 2018 13:52:53 -0800 Subject: choose solution --- .../design-proposals/node/kubelet-eviction.md | 40 +++------------------- 1 file changed, 4 insertions(+), 36 deletions(-) diff --git a/contributors/design-proposals/node/kubelet-eviction.md b/contributors/design-proposals/node/kubelet-eviction.md index 4fff62d5..6dd6adc5 100644 --- a/contributors/design-proposals/node/kubelet-eviction.md +++ b/contributors/design-proposals/node/kubelet-eviction.md @@ -198,42 +198,10 @@ it will use cgroup events on the memory.usage_in_bytes file in order to trigger With the addition of on-demand metrics, this permits the `kubelet` to trigger the eviction manager, collect metrics, and respond with evictions much quicker than using the sync loop alone. -However, a current issue with this is that the cgroup notifications trigger based on memory.usage_in_bytes, -but the eviction manager determines memory pressure based on the working set, which is (memory.usage_in_bytes - memory.total_inactive_file). -For example: -``` -capacity = 1Gi ---eviction-hard=memory.available<250Mi -assume memory.total_inactive_file=10Mi -``` -When the cgroup event is triggered, `memory.usage_in_bytes = 750Mi`. -The eviction manager observes -`working_set = memory.usage_in_bytes - memory.total_inactive_file = 740Mi` -Signal: `memory.available = capacity - working_set = 260Mi` -Therefore, no memory pressure. This will occur as long as memory.total_inactive_file is non-zero. - -### Proposed solutions: -1. Set the cgroup event at `threshold*fraction`. -For example, if `--eviction-hard=memory.available<200Mi`, use `fraction=1/2` and set the cgroup event at `100Mi` (`200Mi*1/2`). -This way, when the eviction manager is triggered, it will likely observe memory pressure. 
-This is not guaranteed to always work, but should prevent OOMs in most cases, and is simple to implement.
-
-~~2. Use Usage instead of Working Set to determine memory pressure
-This would mean that the eviction manager and cgroup notifications use the same metric,
-and thus the response is ideal: the eviction manager is triggered exactly when memory pressure occurs.
-However, the eviction manager may often evict unnecessarily if there are large quantities of memory
-the kernel has not yet reclaimed.~~
-
-3. Increase the sync loop frequency after the threshold is crossed
-For example, the eviction manager could start collecting observations every second instead of every
-10 seconds after the threshold is crossed. This means that even though the cgroup event and eviction
-manager are not completely in-sync, the threshold can help the eviction manager to respond faster than
-it otherwise would. After a short period, it would resume the standard interval of sync loop calls.
-
-4. Periodically adjust the memory cgroup threshold based on total_inactive_file
-For example, the eviction manager would set the threshold for usage_in_bytes to mem_capacity - eviction_hard +
-total_inactive_file. This would mean that the threshold is crossed when usage_in_bytes - total_inactive_file
-= mem_capacity - eviction_hard. As long as total_inactive_file changes slowly, this would be fairly accurate.
+To do this, we periodically adjust the memory cgroup threshold based on total_inactive_file. The eviction manager
+measures total_inactive_file, and sets the threshold for usage_in_bytes to mem_capacity - eviction_hard +
+total_inactive_file. This means that the threshold is crossed when usage_in_bytes - total_inactive_file
+= mem_capacity - eviction_hard.
 
 ### Disk
 
-- cgit v1.2.3
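The arithmetic behind the chosen solution can be checked with a short sketch (editor's illustration in Python; the helper names are hypothetical, and capacity is taken as 1000Mi to match the round numbers in the proposal's example):

```python
MI = 1024 * 1024

def usage_threshold(capacity, eviction_hard, inactive_file):
    # Chosen approach: arm the memcg threshold on memory.usage_in_bytes at
    # mem_capacity - eviction_hard + total_inactive_file, re-computed
    # periodically as total_inactive_file changes.
    return capacity - eviction_hard + inactive_file

def memory_available(capacity, usage, inactive_file):
    # The eviction manager's signal: capacity - working_set, where
    # working_set = memory.usage_in_bytes - memory.total_inactive_file.
    return capacity - (usage - inactive_file)

capacity, eviction_hard, inactive_file = 1000 * MI, 250 * MI, 10 * MI

# A naive threshold at capacity - eviction_hard fires while the signal still
# shows headroom whenever total_inactive_file > 0.
naive = capacity - eviction_hard
print(memory_available(capacity, naive, inactive_file) // MI)      # 260 > 250: no pressure observed

# The adjusted threshold fires exactly when memory.available == eviction_hard.
adjusted = usage_threshold(capacity, eviction_hard, inactive_file)
print(memory_available(capacity, adjusted, inactive_file) // MI)   # 250
```

As the proposal notes, the adjusted threshold is only accurate while total_inactive_file changes slowly between re-computations.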