| author | Stephen Augustus <foo@agst.us> | 2018-12-01 02:40:42 -0500 |
|---|---|---|
| committer | Stephen Augustus <foo@agst.us> | 2018-12-01 02:40:42 -0500 |
| commit | 1004e56177eb12d85b6e0f6cf1ccd00431f7336b (patch) | |
| tree | e2a87f95b32e046ed32a2eea6cde661704e61fbd /keps/sig-node | |
| parent | 973b19523840d207ae206175ac2093d3b564668c (diff) | |
Add KEP tombstones
Signed-off-by: Stephen Augustus <foo@agst.us>
Diffstat (limited to 'keps/sig-node')
| -rw-r--r-- | keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md | 229 |
| -rw-r--r-- | keps/sig-node/0009-node-heartbeat.md | 396 |
| -rw-r--r-- | keps/sig-node/0014-runtime-class.md | 403 |
| -rw-r--r-- | keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md | 811 |
| -rw-r--r-- | keps/sig-node/compute-device-assignment.md | 154 |
5 files changed, 20 insertions, 1973 deletions
diff --git a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md index 4a2090a1..cfd1f5fa 100644 --- a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md +++ b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md @@ -1,225 +1,4 @@ ---- -kep-number: 8 -title: Promote sysctl annotations to fields -authors: - - "@ingvagabund" -owning-sig: sig-node -participating-sigs: - - sig-auth -reviewers: - - "@sjenning" - - "@derekwaynecarr" -approvers: - - "@sjenning" - - "@derekwaynecarr" -editor: -creation-date: 2018-04-30 -last-updated: 2018-05-02 -status: provisional -see-also: -replaces: -superseded-by: ---- - -# Promote sysctl annotations to fields - -## Table of Contents - -* [Promote sysctl annotations to fields](#promote-sysctl-annotations-to-fields) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Promote annotations to fields](#promote-annotations-to-fields) - * [Promote --experimental-allowed-unsafe-sysctls kubelet flag to kubelet config api option](#promote---experimental-allowed-unsafe-sysctls-kubelet-flag-to-kubelet-config-api-option) - * [Gate the feature](#gate-the-feature) - * [Proposal](#proposal) - * [User Stories](#user-stories) - * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - -## Summary - -Setting the `sysctl` parameters through annotations provided a successful story -for defining better constraints of running applications. -The `sysctl` feature has been tested by a number of people without any serious -complaints. Promoting the annotations to fields (i.e. to beta) is another step in moving the -`sysctl` feature closer to the stable API.
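The legacy annotation format this KEP migrates away from is a comma-separated list of `name=value` pairs (shown in the manifests below). A rough, hypothetical sketch of how such a value could be mapped onto the proposed field schema; `parseSysctlAnnotation` and the simplified `Sysctl` struct are illustrative only, not the actual Kubernetes implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Sysctl mirrors the shape of the proposed pod field: a name/value pair.
type Sysctl struct {
	Name  string
	Value string
}

// parseSysctlAnnotation splits a legacy annotation value into pairs.
// Entries are comma-separated; values may themselves contain spaces
// (e.g. "kernel.msgmax=1 2 3") but, by assumption, not commas.
func parseSysctlAnnotation(val string) ([]Sysctl, error) {
	var out []Sysctl
	for _, kv := range strings.Split(val, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid sysctl entry %q", kv)
		}
		out = append(out, Sysctl{Name: parts[0], Value: parts[1]})
	}
	return out, nil
}

func main() {
	s, _ := parseSysctlAnnotation("net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3")
	for _, sc := range s {
		fmt.Printf("%s=%s\n", sc.Name, sc.Value)
	}
}
```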
- -Currently, the `sysctl` provides `security.alpha.kubernetes.io/sysctls` and `security.alpha.kubernetes.io/unsafe-sysctls` annotations that can be used -in the following way: - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: sysctl-example - annotations: - security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1 - security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3 - spec: - ... - ``` - - The goal is to transition into native fields on pods: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: sysctl-example - spec: - securityContext: - sysctls: - - name: kernel.shm_rmid_forced - value: 1 - - name: net.ipv4.route.min_pmtu - value: 1000 - unsafe: true - - name: kernel.msgmax - value: "1 2 3" - unsafe: true - ... - ``` - -The `sysctl` design document with more details and rationals is available at [design-proposals/node/sysctl.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#pod-api-changes) - -## Motivation - -As mentioned in [contributors/devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version): - -> Previously, annotations were used for experimental alpha features, but are no longer recommended for several reasons: -> -> They expose the cluster to "time-bomb" data added as unstructured annotations against an earlier API server (https://issue.k8s.io/30819) -> They cannot be migrated to first-class fields in the same API version (see the issues with representing a single value in multiple places in backward compatibility gotchas) -> -> The preferred approach adds an alpha field to the existing object, and ensures it is disabled by default: -> -> ... - -The annotations as a means to set `sysctl` are no longer necessary. 
-The original intent of annotations was to provide additional description of Kubernetes -objects through metadata. -It's time to separate the ability to annotate from the ability to change sysctls settings -so a cluster operator can elevate the distinction between experimental and supported usage -of the feature. - -### Promote annotations to fields - -* Introduce native `sysctl` fields in pods through `spec.securityContext.sysctl` field as: - - ```yaml - sysctl: - - name: SYSCTL_PATH_NAME - value: SYSCTL_PATH_VALUE - unsafe: true # optional field - ``` - -* Introduce native `sysctl` fields in [PSP](https://kubernetes.io/docs/concepts/policy/pod-security-policy/) as: - - ```yaml - apiVersion: v1 - kind: PodSecurityPolicy - metadata: - name: psp-example - spec: - sysctls: - - kernel.shmmax - - kernel.shmall - - net.* - ``` - - More examples at [design-proposals/node/sysctl.md#allowing-only-certain-sysctls](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#allowing-only-certain-sysctls) - -### Promote `--experimental-allowed-unsafe-sysctls` kubelet flag to kubelet config api option - -As there is no longer a need to consider the `sysctl` feature experimental, -the list of unsafe sysctls can be configured accordingly through: - -```go -// KubeletConfiguration contains the configuration for the Kubelet -type KubeletConfiguration struct { - ... - // Whitelist of unsafe sysctls or unsafe sysctl patterns (ending in *). - // Default: nil - // +optional - AllowedUnsafeSysctls []string `json:"allowedUnsafeSysctls,omitempty"` -} -``` - -Upstream issue: https://github.com/kubernetes/kubernetes/issues/61669 - -### Gate the feature - -As the `sysctl` feature stabilizes, it's time to gate the feature [1] and enable it by default. 
- -* Expected feature gate key: `Sysctls` -* Expected default value: `true` - -With the `Sysctls` feature enabled, both sysctl fields in `Pod` and `PodSecurityPolicy` -and the whitelist of unsafe sysctls are acknowledged. -If disabled, the fields and the whitelist are just ignored. - -[1] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ - -## Proposal - -This is where we get down to the nitty-gritty of what the proposal actually is. - -### User Stories - -* As a cluster admin, I want to have the `sysctl` feature versioned so I can ensure backward compatibility - and proper transformation between versioned and internal representations and back. -* As a cluster admin, I want to be confident the `sysctl` feature is stable enough and well supported so - applications are properly isolated. -* As a cluster admin, I want to be able to apply the `sysctl` constraints on the cluster level so - I can define the default constraints for all pods. - -### Implementation Details/Notes/Constraints - -Extending `SecurityContext` struct with `Sysctls` field: - -```go -// PodSecurityContext holds pod-level security attributes and common container settings. -// Some fields are also present in container.securityContext. Field values of -// container.securityContext take precedence over field values of PodSecurityContext. -type PodSecurityContext struct { - ... - // Sysctls is a white list of allowed sysctls in a pod spec. - Sysctls []Sysctl `json:"sysctls,omitempty"` -} -``` - -Extending `PodSecurityPolicySpec` struct with `Sysctls` field: - -```go -// PodSecurityPolicySpec defines the policy enforced on sysctls. -type PodSecurityPolicySpec struct { - ... - // Sysctls is a white list of allowed sysctls in a pod spec.
- Sysctls []Sysctl `json:"sysctls,omitempty"` -} -``` - -Following steps in [devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version) -during implementation. - -Validation checks implemented as part of [#27180](https://github.com/kubernetes/kubernetes/pull/27180). - -### Risks and Mitigations - -We need to assure backward compatibility, i.e. object specifications with `sysctl` annotations -must still work after the graduation. - -## Graduation Criteria - -* API changes allowing to configure the pod-scoped `sysctl` via `spec.securityContext` field. -* API changes allowing to configure the cluster-scoped `sysctl` via `PodSecurityPolicy` object -* Promote `--experimental-allowed-unsafe-sysctls` kubelet flag to kubelet config api option -* feature gate enabled by default -* e2e tests - -## Implementation History - -The `sysctl` feature is tracked as part of [features#34](https://github.com/kubernetes/features/issues/34). -This is one of the goals to promote the annotations to fields. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
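The whitelists in this KEP (the PodSecurityPolicy `sysctls` list and the kubelet's `AllowedUnsafeSysctls`) accept either plain sysctl names or patterns ending in `*`, such as `net.*`. A minimal matching sketch under that assumed prefix semantics; `sysctlAllowed` is a hypothetical helper, not the real validation code:

```go
package main

import (
	"fmt"
	"strings"
)

// sysctlAllowed reports whether a sysctl name matches any whitelist
// entry. An entry ending in "*" is treated as a prefix pattern
// (e.g. "net.*" matches every sysctl under net.); any other entry
// must match the name exactly.
func sysctlAllowed(name string, whitelist []string) bool {
	for _, w := range whitelist {
		if strings.HasSuffix(w, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(w, "*")) {
				return true
			}
		} else if name == w {
			return true
		}
	}
	return false
}

func main() {
	wl := []string{"kernel.shmmax", "net.*"}
	fmt.Println(sysctlAllowed("net.ipv4.route.min_pmtu", wl)) // true
	fmt.Println(sysctlAllowed("kernel.msgmax", wl))           // false
}
```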
\ No newline at end of file diff --git a/keps/sig-node/0009-node-heartbeat.md b/keps/sig-node/0009-node-heartbeat.md index f80b9609..cfd1f5fa 100644 --- a/keps/sig-node/0009-node-heartbeat.md +++ b/keps/sig-node/0009-node-heartbeat.md @@ -1,392 +1,4 @@ ---- -kep-number: 8 -title: Efficient Node Heartbeat -authors: - - "@wojtek-t" - - "with input from @bgrant0607, @dchen1107, @yujuhong, @lavalamp" -owning-sig: sig-node -participating-sigs: - - sig-scalability - - sig-apimachinery - - sig-scheduling -reviewers: - - "@deads2k" - - "@lavalamp" -approvers: - - "@dchen1107" - - "@derekwaynecarr" -editor: TBD -creation-date: 2018-04-27 -last-updated: 2018-04-27 -status: implementable -see-also: - - https://github.com/kubernetes/kubernetes/issues/14733 - - https://github.com/kubernetes/kubernetes/pull/14735 -replaces: - - n/a -superseded-by: - - n/a ---- - -# Efficient Node Heartbeats - -## Table of Contents - -Table of Contents -================= - -* [Efficient Node Heartbeats](#efficient-node-heartbeats) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Proposal](#proposal) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Alternatives](#alternatives) - * [Dedicated “heartbeat” object instead of “leader election” one](#dedicated-heartbeat-object-instead-of-leader-election-one) - * [Events instead of dedicated heartbeat object](#events-instead-of-dedicated-heartbeat-object) - * [Reuse the Component Registration mechanisms](#reuse-the-component-registration-mechanisms) - * [Split Node object into two parts at etcd level](#split-node-object-into-two-parts-at-etcd-level) - * [Delta compression in etcd](#delta-compression-in-etcd) - * [Replace etcd with other database](#replace-etcd-with-other-database) - -## Summary - -Node heartbeats are necessary for correct 
functioning of a Kubernetes cluster. -This proposal makes them significantly cheaper from both the scalability and -performance perspectives. - -## Motivation - -While running different scalability tests we observed that in big enough clusters -(more than 2000 nodes) with a non-trivial number of images used by pods on all -nodes (10-15), we were hitting etcd limits for its database size. That effectively -means that etcd enters "alert mode" and stops accepting all write requests. - -The underlying root cause is a combination of: - -- etcd keeping both current state and transaction log with copy-on-write -- node heartbeats being potentially very large objects (note that images - are only one potential problem, the second are volumes and customers - want to mount 100+ volumes to a single node) - they may easily exceed 15kB; - even though the patch sent over the network is small, in etcd we store the - whole Node object -- Kubelet sending heartbeats every 10s - -This proposal presents a proper solution for that problem. - - -Note that currently (by default): - -- Lack of NodeStatus update for `<node-monitor-grace-period>` (default: 40s) - results in NodeController marking node as NotReady (pods are no longer - scheduled on that node) -- Lack of NodeStatus updates for `<pod-eviction-timeout>` (default: 5m) - results in NodeController starting pod evictions from that node - -We would like to preserve that behavior. - - -### Goals - -- Reduce size of etcd by making node heartbeats cheaper - -### Non-Goals - -The following are nice-to-haves, but not primary goals: - -- Reduce resource usage (cpu/memory) of control plane (e.g. due to processing - less and/or smaller objects) -- Reduce watch-related load on Node objects - -## Proposal - -We propose introducing a new `Lease` built-in API in the newly created API group -`coordination.k8s.io`. To make it easily reusable for other purposes it will -be namespaced.
Its schema will be as follows: - -``` -type Lease struct { - metav1.TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata - // +optional - ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"` - - // Specification of the Lease. - // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status - // +optional - Spec LeaseSpec `json:"spec,omitempty"` -} - -type LeaseSpec struct { - HolderIdentity string `json:"holderIdentity"` - LeaseDurationSeconds int32 `json:"leaseDurationSeconds"` - AcquireTime metav1.MicroTime `json:"acquireTime"` - RenewTime metav1.MicroTime `json:"renewTime"` - LeaseTransitions int32 `json:"leaseTransitions"` -} -``` - -The Spec is effectively a copy of the already existing (and thus proven) [LeaderElectionRecord][]. -The only difference is using `MicroTime` instead of `Time` for better precision. -That would hopefully allow us to get directly to Beta. - -We will use that object to represent node heartbeat - for each Node there will -be a corresponding `Lease` object with Name equal to Node name in a newly -created dedicated namespace (we considered using `kube-system` namespace but -decided that it's already too overloaded). -That namespace should be created automatically (similarly to "default" and -"kube-system", probably by NodeController) and never be deleted (so that nodes -don't require permission for it). - -We considered using CRD instead of built-in API. However, even though CRDs are -`the new way` for creating new APIs, they don't yet have versioning support -and are significantly less performant (due to lack of protobuf support yet). -We also don't know whether we could seamlessly transition storage from a CRD -to a built-in API if we ran into performance or any other problems. -As a result, we decided to proceed with a built-in API. - - -With this new API in place, we will change Kubelet so that: - -1.
Kubelet is periodically computing NodeStatus every 10s (as it is now), but that will - be independent from reporting status -1. Kubelet is reporting NodeStatus if: - - there was a meaningful change in it (initially we can probably assume that every - change is meaningful, including e.g. images on the node) - - or it didn’t report it over the last `node-status-update-period` seconds -1. Kubelet creates and periodically updates its own Lease object and frequency - of those updates is independent from NodeStatus update frequency. - -In the meantime, we will change `NodeController` to treat both updates of the NodeStatus -object as well as updates of the new `Lease` object corresponding to a given -node as a healthiness signal from a given Kubelet. This will make it work for both old -and new Kubelets. - -We should also: - -1. audit all other existing core controllers to verify if they also don’t require - similar changes in their logic ([ttl controller][] being one of the examples) -1. change controller manager to auto-register that `Lease` API -1. ensure that `Lease` resource is deleted when corresponding node is - deleted (probably via owner references) -1. [out-of-scope] migrate all LeaderElection code to use that API - -Once all the code changes are done, we will: - -1. start updating `Lease` object every 10s by default, at the same time - reducing frequency of NodeStatus updates initially to 40s by default. - We will reduce it further later. - Note that it doesn't reduce frequency by which Kubelet sends "meaningful" - changes - it only impacts the frequency of "lastHeartbeatTime" changes. - <br> TODO: That still results in higher average QPS. It should be acceptable but - needs to be verified. -1. announce that we are going to reduce frequency of NodeStatus updates further - and give people 1-2 releases to switch their code to use `Lease` - object (if they relied on frequent NodeStatus changes) -1.
further reduce NodeStatus updates frequency to no less often than once per - minute. - We can’t stop periodically updating NodeStatus as it would be an API-breaking change, - but it’s fine to reduce its frequency (though we should continue writing it at - least once per eviction period). - - -To be considered: - -1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes - (instead of 1 minute). That would help with performance/scalability even more. - Caveats: - - NodeProblemDetector is currently updating (some) node conditions every 1 minute - (unconditionally, because lastHeartbeatTime always changes). To make reduction - of NodeStatus updates frequency really useful, we should also change NPD to - work in a similar mode (check periodically if condition changes, but report only - when something changed or no status was reported for a given time) and decrease - its reporting frequency too. - - In general, we recommend keeping frequencies of NodeStatus reporting in both - Kubelet and NodeProblemDetector in sync (once all changes are done) and - that should be reflected in [NPD documentation][]. - - Note that reducing frequency to 1 minute already gives us almost 6x improvement. - It seems more than enough for any foreseeable future assuming we won’t - significantly increase the size of the Node object. - Note that if we keep adding node conditions owned by other components, the - number of writes of Node object will go up. But that issue is separate from - this proposal. - -Other notes: - -1. Additional advantage of using Lease for that purpose would be the - ability to exclude it from audit profile and thus reduce the audit logs footprint.
- -[LeaderElectionRecord]: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L37 -[ttl controller]: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/ttl/ttl_controller.go#L155 -[NPD documentation]: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/ -[kubernetes/kubernetes#63667]: https://github.com/kubernetes/kubernetes/issues/63677 - -### Risks and Mitigations - -Increasing default frequency of NodeStatus updates may potentially break clients -relying on frequent Node object updates. However, in non-managed solutions, customers -will still be able to restore previous behavior by setting appropriate flag values. -Thus, changing defaults to what we recommend is the path to go with. - -## Graduation Criteria - -The API can be immediately promoted to Beta, as the API is effectively a copy of -already existing LeaderElectionRecord. It will be promoted to GA once it's gone -a sufficient amount of time as Beta with no changes. - -The changes in components logic (Kubelet, NodeController) should be done behind -a feature gate. We suggest making that enabled by default once the feature is -implemented. - -## Implementation History - -- RRRR-MM-DD: KEP Summary, Motivation and Proposal merged - -## Alternatives - -We considered a number of alternatives, most important mentioned below. - -### Dedicated “heartbeat” object instead of “leader election” one - -Instead of introducing and using “lease” object, we considered -introducing a dedicated “heartbeat” object for that purpose. Apart from that, -all the details about the solution remain pretty much the same. - -Pros: - -- Conceptually easier to understand what the object is for - -Cons: - -- Introduces a new, narrow-purpose API. Lease is already used by other - components, implemented using annotations on Endpoints and ConfigMaps. 
- -### Events instead of dedicated heartbeat object - -Instead of introducing a dedicated object, we considered using “Event” object -for that purpose. At the high-level the solution looks very similar. -The differences from the initial proposal are: - -- we use existing “Event” api instead of introducing a new API -- we create a dedicated namespace; events that should be treated as healthiness - signal by NodeController will be written by Kubelets (unconditionally) to that - namespace -- NodeController will be watching only Events from that namespace to avoid - processing all events in the system (the volume of all events will be huge) -- dedicated namespace also helps with security - we can give access to write to - that namespace only to Kubelets - -Pros: - -- No need to introduce new API - - We can use that approach much earlier due to that. -- We already need to optimize event throughput - separate etcd instance we have - for them may help with tuning -- Low-risk roll-forward/roll-back: no new objects is involved (node controller - starts watching events, kubelet just reduces the frequency of heartbeats) - -Cons: - -- Events are conceptually “best-effort” in the system: - - they may be silently dropped in case of problems in the system (the event recorder - library doesn’t retry on errors, e.g. to not make things worse when control-plane - is starved) - - currently, components reporting events don’t even know if it succeeded or not (the - library is built in a way that you throw the event into it and are not notified if - that was successfully submitted or not). - Kubelet sending any other update has full control on how/if retry errors. - - lack of fairness mechanisms means that even when some events are being successfully - send, there is no guarantee that any event from a given Kubelet will be submitted - over a given time period - So this would require a different mechanism of reporting those “heartbeat” events. 
-- Once we have a “request priority” concept, I think events should have the lowest one. - Even though no particular heartbeat is important, a guarantee that some heartbeats will - be successfully sent is crucial (not delivering any of them will result in unnecessary - evictions or not-scheduling to a given node). So heartbeats should be of the highest - priority. -- No core component in the system is currently watching events - - it would make system’s operation harder to explain -- Users watch Node objects for heartbeats (even though we didn’t recommend it). - Introducing a new object for the purpose of heartbeat will allow those users to - migrate, while using events for that purpose breaks that ability. (Watching events - may put us in a tough situation also for performance reasons.) -- Deleting all events (e.g. event etcd failure + playbook response) should continue to - not cause a catastrophic failure and the design will need to account for this. - -### Reuse the Component Registration mechanisms - -Kubelet is one of the control-plane components (shared controller). Some time ago, Component -Registration proposal converged into three parts: - -- Introducing an API for registering non-pod endpoints, including readiness information: #18610 -- Changing endpoints controller to also watch those endpoints -- Identifying some of those endpoints as “components” - -We could reuse that mechanism to represent Kubelets as non-pod endpoint API. - -Pros: - -- Utilizes desired API - -Cons: - -- Requires introducing that new API -- Stabilizing the API would take some time -- Implementing that API requires multiple changes in different components - -### Split Node object into two parts at etcd level - -We may stick to existing Node API and solve the problem at the storage layer.
At the -high level, this means splitting the Node object into two parts in etcd (frequently -modified one and the rest). - -Pros: - -- No need to introduce new API -- No need to change any components other than kube-apiserver - -Cons: - -- Very complicated to support watch -- Not very generic (e.g. splitting Spec and Status doesn’t help, it needs to be just - heartbeat part) -- [minor] Doesn’t reduce amount of data that should be processed in the system (writes, - reads, watches, …) - -### Delta compression in etcd - -An alternative for the above can be solving this completely at the etcd layer. To -achieve that, instead of storing full updates in etcd transaction log, we will just -store “deltas” and snapshot the whole object only every X seconds/minutes. - -Pros: - -- Doesn’t require any changes to any Kubernetes components - -Cons: - -- Computing delta is tricky (etcd doesn’t understand Kubernetes data model, and - delta between two protobuf-encoded objects is not necessary small) -- May require a major rewrite of etcd code and not even be accepted by its maintainers -- More expensive computationally to get an object in a given resource version (which - is what e.g. watch is doing) - -### Replace etcd with other database - -Instead of using etcd, we may also consider using some other open-source solution. - -Pros: - -- Doesn’t require new API - -Cons: - -- We don’t even know if there exists solution that solves our problems and can be used. -- Migration will take us years. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-node/0014-runtime-class.md b/keps/sig-node/0014-runtime-class.md index 1d1cac28..cfd1f5fa 100644 --- a/keps/sig-node/0014-runtime-class.md +++ b/keps/sig-node/0014-runtime-class.md @@ -1,399 +1,4 @@ ---- -kep-number: 14 -title: Runtime Class -authors: - - "@tallclair" -owning-sig: sig-node -participating-sigs: - - sig-architecture -reviewers: - - dchen1107 - - derekwaynecarr - - yujuhong -approvers: - - dchen1107 - - derekwaynecarr -creation-date: 2018-06-19 -status: implementable ---- - -# Runtime Class - -## Table of Contents - -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non\-Goals](#non-goals) - * [User Stories](#user-stories) -* [Proposal](#proposal) - * [API](#api) - * [Runtime Handler](#runtime-handler) - * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts) - * [Implementation Details](#implementation-details) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Appendix](#appendix) - * [Examples of runtime variation](#examples-of-runtime-variation) - -## Summary - -`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the -control plane. RuntimeClasses are assigned to pods through a `runtimeClass` field on the -`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node. - -## Motivation - -There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the -primary motivator for this right now, with both Kata containers and gVisor looking to integrate with -Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also -require support in the future. RuntimeClass provides a way to select between different runtimes -configured in the cluster and surface their properties (both to the cluster & the user). 
- -In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to -the control plane level, including: accounting for runtime overhead, scheduling to nodes that -support the runtime, and surfacing which optional features are supported by different -runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a -cluster-scoped resource tied to the runtime that can help solve these problems in a future update. - -[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit - -### Goals - -- Provide a mechanism for surfacing container runtime properties to the control plane -- Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired - runtime - -### Non-Goals - -- RuntimeClass is NOT RuntimeComponentConfig. -- RuntimeClass is NOT a general policy mechanism. -- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general - RuntimeClass should not be a cross product of runtime properties and node properties. - -The following goals are out-of-scope for the initial implementation, but may be explored in a future -iteration: - -- Surfacing support for optional features by runtimes, and surfacing errors caused by - incompatible features & runtimes earlier. -- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the - cluster admin or provider), and are asserted to be an accurate representation of the runtime. -- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster - (different runtime configurations on different nodes) through scheduling primitives like - `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and - automatic runtime-aware scheduling is out-of-scope. 
-- Define standardized or conformant runtime classes - although I would like to declare some - predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP. -- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice, - the details of how pod resource overhead will be implemented is out of scope for this KEP. -- Provide a mechanism to dynamically register or provision additional runtimes. -- Requiring specific RuntimeClasses according to policy. This should be addressed by other - cluster-level policy mechanisms, such as PodSecurityPolicy. -- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and - letting the system match an appropriate RuntimeClass, rather than explicitly assigning a - RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a - future iteration. - -[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit - -### User Stories - -- As a cluster operator, I want to provide multiple runtime options to support a wide variety of - workloads. Examples include native linux containers, "sandboxed" containers, and windows - containers. -- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For - example, rolling out an update with backwards incompatible changes or previously unsupported - features. -- As an application developer, I want to select the runtime that best fits my workload. -- As an application developer, I don't want to study the nitty-gritty details of different runtime - implementations, but rather choose from pre-configured classes. -- As an application developer, I want my application to be portable across clusters that use similar - but different variants of a "class" of runtimes. 
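The selection flow in these user stories can be pictured end to end: a pod names a RuntimeClass, and the kubelet resolves that name to a configured CRI runtime handler. The mapping data and `resolveHandler` below are illustrative assumptions, not the actual kubelet implementation:

```go
package main

import "fmt"

// runtimeClasses models the cluster-scoped RuntimeClass name -> CRI
// runtime handler mapping described in this KEP (sample data only).
var runtimeClasses = map[string]string{
	"gvisor":          "gvisor",
	"kata-containers": "kata-containers",
	"sandboxed":       "gvisor", // default sandbox choice for this cluster
}

// resolveHandler sketches how a pod's runtimeClassName could map to the
// handler passed along in RunPodSandboxRequest. An unset name keeps the
// default behavior (empty handler string); an unknown name is an error
// surfaced before the pod runs.
func resolveHandler(runtimeClassName string) (string, error) {
	if runtimeClassName == "" {
		return "", nil // empty handler = CRI implementation default
	}
	h, ok := runtimeClasses[runtimeClassName]
	if !ok {
		return "", fmt.Errorf("RuntimeClass %q not found", runtimeClassName)
	}
	return h, nil
}

func main() {
	h, _ := resolveHandler("sandboxed")
	fmt.Println(h) // gvisor
}
```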
- -## Proposal - -The initial design includes: - -- `RuntimeClass` API resource definition -- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with -- Kubelet implementation for fetching & interpreting the RuntimeClass -- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler). - -### API - -`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group. - -> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired. -> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_ - -_(This is a simplified declaration, syntactic details will be covered in the API PR review)_ - -```go -type RuntimeClass struct { - metav1.TypeMeta - // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class. - // Namespace should be left blank. - metav1.ObjectMeta - - Spec RuntimeClassSpec -} - -type RuntimeClassSpec struct { - // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container - // creation. The possible values are specific to a given configuration & CRI implementation. - // The empty string is equivalent to the default behavior. - // +optional - RuntimeHandler string -} -``` - -The runtime is selected by the pod by specifying the RuntimeClass in the PodSpec. Once the pod is -scheduled, the RuntimeClass cannot be changed. - -```go -type PodSpec struct { - ... - // RuntimeClassName refers to a RuntimeClass object with the same name, - // which should be used to run this pod. - // +optional - RuntimeClassName string - ... -} -``` - -The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards -compatible with current Kubernetes. This means that the legacy runtime does not specify any -RuntimeHandler or perform any feature validation (all features are "supported"). - -```go -const ( - // RuntimeClassNameLegacy is a reserved RuntimeClass name. 
The legacy - // RuntimeClass does not specify a runtime handler or perform any - // feature validation. - RuntimeClassNameLegacy = "legacy" -) -``` - -An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is -not defaulted to `legacy` (to leave room for configurable defaults in a future update). - -#### Examples - -Suppose we operate a cluster that lets users choose between native runc containers, and gvisor and -kata-container sandboxes. We might create the following runtime classes: - -```yaml -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: native # equivalent to 'legacy' for now -spec: - runtimeHandler: runc ---- -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: gvisor -spec: - runtimeHandler: gvisor ----- -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: kata-containers -spec: - runtimeHandler: kata-containers ----- -# provides the default sandbox runtime when users don't care about which they're getting. -kind: RuntimeClass -apiVersion: node.k8s.io/v1alpha1 -metadata: - name: sandboxed -spec: - runtimeHandler: gvisor -``` - -Then when a user creates a workload, they can choose the desired runtime class to use (or not, if -they want the default). - -```yaml -apiVersion: extensions/v1beta1 -kind: Deployment -metadata: - name: sandboxed-nginx -spec: - replicas: 2 - selector: - matchLabels: - app: sandboxed-nginx - template: - metadata: - labels: - app: sandboxed-nginx - spec: - runtimeClassName: sandboxed # <---- Reference the desired RuntimeClass - containers: - - name: nginx - image: nginx - ports: - - containerPort: 80 - protocol: TCP -``` - -#### Runtime Handler - -The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`: - -```proto -message RunPodSandboxRequest { - // Configuration for creating a PodSandbox. - PodSandboxConfig config = 1; - // Named runtime configuration to use for this PodSandbox. 
- string RuntimeHandler = 2;
-}
-```
-
-The RuntimeHandler is provided as a mechanism for CRI implementations to select between different
-predetermined configurations. The initial use case is replacing the experimental pod annotations
-currently used for selecting a sandboxed runtime by various CRI implementations:
-
-| CRI Runtime | Pod Annotation |
-| ------------|-------------------------------------------------------------|
-| CRIO | io.kubernetes.cri-o.TrustedSandbox: "false" |
-| containerd | io.kubernetes.cri.untrusted-workload: "true" |
-| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" |
-| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" |
-
-These implementations could stick with this binary scheme ("trusted" and "untrusted"), but the preferred
-approach is a non-binary one wherein arbitrary handlers can be configured with a name that can be
-matched against the specified RuntimeHandler. For example, containerd might have a configuration
-corresponding to a "kata-runtime" handler:
-
-```
-[plugins.cri.containerd.kata-runtime]
-  runtime_type = "io.containerd.runtime.v1.linux"
-  runtime_engine = "/opt/kata/bin/kata-runtime"
-  runtime_root = ""
-```
-
-This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection
-(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox
-types (e.g. `kata-containers` or `gvisor` RuntimeClasses).
-
-### Versioning, Updates, and Rollouts
-
-Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha
-implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**,
-thereby requiring changes to be pushed as a newly named RuntimeClass instance.
This means that pods -must be updated to reference the new RuntimeClass, and comes with the advantage of native support -for rolling updates through the same mechanisms as any other application update. The -`RuntimeClassName` pod field is also immutable post scheduling. - -This conservative approach is preferred since it's much easier to relax constraints in a backwards -compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass -to beta. - -### Implementation Details - -The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is -added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once -resolved, the RuntimeHandler field is passed to the CRI as part of the -[`RunPodSandboxRequest`][runpodsandbox]. At that point, the interpretation of the RuntimeHandler is -left to the CRI implementation, but it should be cached if needed for subsequent calls. - -If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will -be rejected in admission (controller to be detailed in a following update). If the RuntimeClass -cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the Kubelet will fail -the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If -the `RuntimeHandler` is not recognized by the CRI implementation, then `RunPodSandbox` will return -an error. - -[runpodsandbox]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344 - -### Risks and Mitigations - -**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default -dumping ground for every new feature exposed by the node. For each feature, careful consideration -should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. 
The
-[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features.
-
-**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for
-PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying
-runtime implementation should be extremely limited (generally only around updates & rollouts). To
-enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to
-restrict a user to a specific RuntimeClass, you must use another policy mechanism such as
-PodSecurityPolicy.
-
-**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity
-of runtime configuration from most users (aside from the cluster admin or provisioner). However, we
-are still side-stepping the issue of precisely defining specific types of runtimes like
-"Sandboxed". It is still up for debate whether precisely defining such runtime categories
-is even possible. RuntimeClass allows us to decouple this specification from the implementation, but
-it is still something I hope we can address in a future iteration through the concept of pre-defined
-or "conformant" RuntimeClasses.
-
-**Non-portability.** We are already in a world of non-portability for many features (see [examples
-of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can help
-address this issue by formally declaring supported features, or matching the runtime that supports a
-given workload automatically. Another issue is that pods need to refer to a RuntimeClass by name,
-which may not be defined in every cluster. This is something that can be addressed through
-pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible
-RuntimeClasses.
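The Kubelet-side resolution flow described under Implementation Details can be sketched as plain logic. This is a simplified, hypothetical model — a map stands in for the Kubelet's informer-backed RuntimeClass cache, and the names (`runtimeClassCache`, `resolveHandler`) are illustrative, not the real Kubelet API:

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeClassCache stands in for the Kubelet's informer-backed local
// cache of RuntimeClass objects, mapping class name -> RuntimeHandler.
type runtimeClassCache map[string]string

// resolveHandler maps a pod's RuntimeClassName to the CRI RuntimeHandler.
// An empty name is treated like the reserved "legacy" class: no handler
// is specified and the CRI default behavior is used.
func resolveHandler(cache runtimeClassCache, runtimeClassName string) (string, error) {
	if runtimeClassName == "" || runtimeClassName == "legacy" {
		return "", nil // empty handler == default runtime behavior
	}
	handler, ok := cache[runtimeClassName]
	if !ok {
		// Unresolvable class: the pod is rejected in admission, or
		// failed by the Kubelet before RunPodSandbox is called.
		return "", errors.New("RuntimeClass not found: " + runtimeClassName)
	}
	return handler, nil
}

func main() {
	cache := runtimeClassCache{"sandboxed": "gvisor", "kata-containers": "kata-containers"}
	h, _ := resolveHandler(cache, "sandboxed")
	fmt.Println(h) // handler passed along in RunPodSandboxRequest
	_, err := resolveHandler(cache, "missing")
	fmt.Println(err != nil)
}
```

The interesting property is the failure path: an unknown class produces an error before any sandbox is created, which is what keeps the scheduler from thrashing on replica recreation.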
-
-## Graduation Criteria
-
-Alpha:
-
-- Everything described in the current proposal:
-  - Introduce the RuntimeClass API resource
-  - Add a RuntimeClassName field to the PodSpec
-  - Add a RuntimeHandler field to the CRI `RunPodSandboxRequest`
-  - Lookup the RuntimeClass for pods & plumb through the RuntimeHandler in the Kubelet (feature
-    gated)
-- RuntimeClass support in at least one CRI runtime & dockershim
-  - Runtime Handlers can be statically configured by the runtime, and referenced via RuntimeClass
-  - An error is reported when the handler is unknown or unsupported
-- Testing
-  - [CRI validation tests][cri-validation]
-  - Kubernetes E2E tests (only validating single runtime handler cases)
-
-[cri-validation]: https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/validation.md
-
-Beta:
-
-- Most runtimes support RuntimeClass, and the current [untrusted annotations](#runtime-handler) are
-  deprecated.
-- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass
-- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary.
-- The cluster admin can choose which RuntimeClass is the default in a cluster.
-- Additional requirements TBD
-
-## Implementation History
-
-- 2018-06-11: SIG-Node decision to move forward with proposal
-- 2018-06-19: Initial KEP published.
-
-## Appendix
-
-### Examples of runtime variation
-
-- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods,
-  but those are mutually exclusive, and support of either is not required by the runtime. The
-  default configuration is also not well defined.
-- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is
-  defined by the runtime, and support is not guaranteed.
-- Windows containers - isolation features are very OS-specific, and most of the current features are
-  limited to linux.
As we build out Windows container support, we'll need to add windows-specific
-  features as well.
-- Host namespaces (Network, PID, IPC) may not be supported by virtualization-based runtimes
-  (e.g. Kata-containers & gVisor).
-- Per-pod and Per-container resource overhead varies by runtime.
-- Device support (e.g. GPUs) varies wildly by runtime & nodes.
-- Supported volume types vary by node - it remains TBD whether this information belongs in
-  RuntimeClass.
-- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may
-  have differing defaults, or support a subset of capabilities.
-- `Privileged` mode is not well defined, and thus may have differing implementations.
-- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed
-  workloads)
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md index a6c5aaba..cfd1f5fa 100644 --- a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md @@ -1,807 +1,4 @@ ---- -kep-number: 0 -title: Quotas for Ephemeral Storage -authors: - - "@RobertKrawitz" -owning-sig: sig-xxx -participating-sigs: - - sig-node -reviewers: - - TBD -approvers: - - "@dchen1107" - - "@derekwaynecarr" -editor: TBD -creation-date: yyyy-mm-dd -last-updated: yyyy-mm-dd -status: provisional -see-also: -replaces: -superseded-by: ---- - -# Quotas for Ephemeral Storage - -## Table of Contents -<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again --> -**Table of Contents** - -- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Project Quotas](#project-quotas) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Future Work](#future-work) - - [Proposal](#proposal) - - [Control over Use of Quotas](#control-over-use-of-quotas) - - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) - - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) - - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) - - [Operation Notes](#operation-notes) - - [Selecting a Project ID](#selecting-a-project-id) - - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) - - [Return a Project ID To the System](#return-a-project-id-to-the-system) - - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - - [Notes on Implementation](#notes-on-implementation) - - [Notes on Code 
Changes](#notes-on-code-changes) - - [Testing Strategy](#testing-strategy) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks [optional]](#drawbacks-optional) - - [Alternatives [optional]](#alternatives-optional) - - [Alternative quota-based implementation](#alternative-quota-based-implementation) - - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) - - [Infrastructure Needed [optional]](#infrastructure-needed-optional) - - [References](#references) - - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) - - [CVE](#cve) - - [Other Security Issues Without CVE](#other-security-issues-without-cve) - - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) - -<!-- markdown-toc end --> - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -This proposal applies to the use of quotas for ephemeral-storage -metrics gathering. Use of quotas for ephemeral-storage limit -enforcement is a [non-goal](#non-goals), but as the architecture and -code will be very similar, there are comments interspersed related to -enforcement. _These comments will be italicized_. - -Local storage capacity isolation, aka ephemeral-storage, was -introduced into Kubernetes via -<https://github.com/kubernetes/features/issues/361>. It provides -support for capacity isolation of shared storage between pods, such -that a pod can be limited in its consumption of shared resources and -can be evicted if its consumption of shared storage exceeds that -limit. The limits and requests for shared ephemeral-storage are -similar to those for memory and CPU consumption. - -The current mechanism relies on periodically walking each ephemeral -volume (emptydir, logdir, or container writable layer) and summing the -space consumption. 
This method is slow, can be fooled, and has high -latency (i. e. a pod could consume a lot of storage prior to the -kubelet being aware of its overage and terminating it). - -The mechanism proposed here utilizes filesystem project quotas to -provide monitoring of resource consumption _and optionally enforcement -of limits._ Project quotas, initially in XFS and more recently ported -to ext4fs, offer a kernel-based means of monitoring _and restricting_ -filesystem consumption that can be applied to one or more directories. - -A prototype is in progress; see <https://github.com/kubernetes/kubernetes/pull/66928>. - -### Project Quotas - -Project quotas are a form of filesystem quota that apply to arbitrary -groups of files, as opposed to file user or group ownership. They -were first implemented in XFS, as described here: -<http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html>. - -Project quotas for ext4fs were [proposed in late -2014](https://lwn.net/Articles/623835/) and added to the Linux kernel -in early 2016, with -commit -[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e). -They were designed to be compatible with XFS project quotas. - -Each inode contains a 32-bit project ID, to which optionally quotas -(hard and soft limits for blocks and inodes) may be applied. The -total blocks and inodes for all files with the given project ID are -maintained by the kernel. Project quotas can be managed from -userspace by means of the `xfs_quota(8)` command in foreign filesystem -(`-f`) mode; the traditional Linux quota tools do not manipulate -project quotas. Programmatically, they are managed by the `quotactl(2)` -system call, using in part the standard quota commands and in part the -XFS quota commands; the man page implies incorrectly that the XFS -quota commands apply only to XFS filesystems. 
- -The project ID applied to a directory is inherited by files created -under it. Files cannot be (hard) linked across directories with -different project IDs. A file's project ID cannot be changed by a -non-privileged user, but a privileged user may use the `xfs_io(8)` -command to change the project ID of a file. - -Filesystems using project quotas may be mounted with quotas either -enforced or not; the non-enforcing mode tracks usage without enforcing -it. A non-enforcing project quota may be implemented on a filesystem -mounted with enforcing quotas by setting a quota too large to be hit. -The maximum size that can be set varies with the filesystem; on a -64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for -ext4fs. - -Conventionally, project quota mappings are stored in `/etc/projects` and -`/etc/projid`; these files exist for user convenience and do not have -any direct importance to the kernel. `/etc/projects` contains a mapping -from project ID to directory/file; this can be a one to many mapping -(the same project ID can apply to multiple directories or files, but -any given directory/file can be assigned only one project ID). -`/etc/projid` contains a mapping from named projects to project IDs. - -This proposal utilizes hard project quotas for both monitoring _and -enforcement_. Soft quotas are of no utility; they allow for temporary -overage that, after a programmable period of time, is converted to the -hard quota limit. - - -## Motivation - -The mechanism presently used to monitor storage consumption involves -use of `du` and `find` to periodically gather information about -storage and inode consumption of volumes. This mechanism suffers from -a number of drawbacks: - -* It is slow. If a volume contains a large number of files, walking - the directory can take a significant amount of time. 
There has been
-  at least one known report of nodes becoming not ready due to volume
-  metrics: <https://github.com/kubernetes/kubernetes/issues/62917>
-* It is possible to conceal a file from the walker by creating it and
-  removing it while holding an open file descriptor on it. POSIX
-  behavior is to not remove the file until the last open file
-  descriptor pointing to it is removed. This has legitimate uses; it
-  ensures that a temporary file is deleted when the processes using it
-  exit, and it minimizes the attack surface by not having a file that
-  can be found by an attacker. The following pod does this; it will
-  never be caught by the present mechanism:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
-  name: "diskhog"
-spec:
-  containers:
-  - name: "perl"
-    resources:
-      limits:
-        ephemeral-storage: "2048Ki"
-    image: "perl"
-    command:
-    - perl
-    - -e
-    - >
-      my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
-    volumeMounts:
-    - name: a
-      mountPath: /data/a
-  volumes:
-  - name: a
-    emptyDir: {}
-```
-* It is reactive rather than proactive. It does not prevent a pod
-  from overshooting its limit; at best it catches it after the fact.
-  On a fast storage medium, such as NVMe, a pod may write 50 GB or
-  more of data before the housekeeping performed once per minute
-  catches up to it. If the primary volume is the root partition, this
-  will completely fill the partition, possibly causing serious
-  problems elsewhere on the system. This proposal does not address
-  this issue; _a future enforcing project would_.
-
-In many environments, these issues may not matter, but shared
-multi-tenant environments need these issues addressed.
-
-### Goals
-
-These goals apply only to local ephemeral storage, as described in
-<https://github.com/kubernetes/features/issues/361>.
-
-* Primary: improve performance of monitoring by using project quotas
-  in a non-enforcing way to collect information about storage
-  utilization of ephemeral volumes.
-* Primary: detect storage used by pods that is concealed by deleted
-  files being held open.
-* Primary: this will not interfere with the more common user and group
-  quotas.
-
-### Non-Goals
-
-* Application to storage other than local ephemeral storage.
-* Application to container copy on write layers. That will be managed
-  by the container runtime. For a future project, we should work with
-  the runtimes to use quotas for their monitoring.
-* Elimination of eviction as a means of enforcing ephemeral-storage
-  limits. Pods that hit their ephemeral-storage limit will still be
-  evicted by the kubelet even if their storage has been capped by
-  enforcing quotas.
-* Enforcing node allocatable (limit over the sum of all pods' disk
-  usage, including e. g. images).
-* Enforcing limits on total pod storage consumption by any means, such
-  that the pod would be hard restricted to the desired storage limit.
-
-### Future Work
-
-* _Enforce limits on per-volume storage consumption by using
-  enforced project quotas._
-
-## Proposal
-
-This proposal applies project quotas to emptydir volumes on qualifying
-filesystems (ext4fs and xfs with project quotas enabled). Project
-quotas are applied by selecting an unused project ID (a 32-bit
-unsigned integer), setting a limit on space and/or inode consumption,
-and attaching the ID to one or more files. By default (and as
-utilized herein), if a project ID is attached to a directory, it is
-inherited by any files created under that directory.
- -_If we elect to use the quota as enforcing, we impose a quota -consistent with the desired limit._ If we elect to use it as -non-enforcing, we impose a large quota that in practice cannot be -exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). - -### Control over Use of Quotas - -At present, two feature gates control operation of quotas: - -* `LocalStorageCapacityIsolation` must be enabled for any use of - quotas. - -* `LocalStorageCapacityIsolationFSMonitoring` must be enabled in addition. If this is - enabled, quotas are used for monitoring, but not enforcement. At - present, this defaults to False, but the intention is that this will - default to True by initial release. - -* _`LocalStorageCapacityIsolationFSEnforcement` must be enabled, in addition to - `LocalStorageCapacityIsolationFSMonitoring`, to use quotas for enforcement._ - -### Operation Flow -- Applying a Quota - -* Caller (emptydir volume manager or container runtime) creates an - emptydir volume, with an empty directory at a location of its - choice. -* Caller requests that a quota be applied to a directory. -* Determine whether a quota can be imposed on the directory, by asking - each quota provider (one per filesystem type) whether it can apply a - quota to the directory. If no provider claims the directory, an - error status is returned to the caller. -* Select an unused project ID ([see below](#selecting-a-project-id)). -* Set the desired limit on the project ID, in a filesystem-dependent - manner ([see below](#notes-on-implementation)). -* Apply the project ID to the directory in question, in a - filesystem-dependent manner. - -An error at any point results in no quota being applied and no change -to the state of the system. The caller in general should not assume a -priori that the attempt will be successful. It could choose to reject -a request if a quota cannot be applied, but at this time it will -simply ignore the error and proceed as today. 
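The apply-quota flow above reduces to a small piece of control logic. Here is an illustrative Go sketch — the `quotaProvider` interface and `fakeXFS` type are hypothetical stand-ins for the per-filesystem providers, not the prototype's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// quotaProvider stands in for a per-filesystem provider that can claim
// a directory and apply a project quota to it. Names are illustrative.
type quotaProvider interface {
	CanHandle(dir string) bool
	ApplyQuota(dir string, projectID uint32, limitBytes int64) error
}

// applyQuota asks each registered provider whether it claims the
// directory; if none does, the caller gets an error, makes no change to
// the system, and proceeds without a quota (as today).
func applyQuota(providers []quotaProvider, dir string, projectID uint32, limitBytes int64) error {
	for _, p := range providers {
		if p.CanHandle(dir) {
			return p.ApplyQuota(dir, projectID, limitBytes)
		}
	}
	return errors.New("no quota provider for " + dir)
}

// fakeXFS is a stand-in provider used only for this demonstration.
type fakeXFS struct{ applied map[string]uint32 }

func (f *fakeXFS) CanHandle(dir string) bool { return true }
func (f *fakeXFS) ApplyQuota(dir string, id uint32, limit int64) error {
	f.applied[dir] = id
	return nil
}

func main() {
	fs := &fakeXFS{applied: map[string]uint32{}}
	// A non-enforcing quota uses a limit too large to hit (2^62 here,
	// purely for illustration; the text cites 2^63-1 for XFS and
	// 2^58-1 for ext4fs as the actual maxima).
	dir := "/var/lib/kubelet/pods/x/volumes/emptydir"
	err := applyQuota([]quotaProvider{fs}, dir, 1048577, 1<<62)
	fmt.Println(err == nil, fs.applied[dir])
}
```

The error path is the important part: failure at any step leaves the system unchanged, matching the flow's "no quota applied and no change to the state of the system" guarantee.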
- -### Operation Flow -- Retrieving Storage Consumption - -* Caller (kubelet metrics code, cadvisor, container runtime) asks the - quota code to compute the amount of storage used under the - directory. -* Determine whether a quota applies to the directory, in a - filesystem-dependent manner ([see below](#notes-on-implementation)). -* If so, determine how much storage or how many inodes are utilized, - in a filesystem dependent manner. - -If the quota code is unable to retrieve the consumption, it returns an -error status and it is up to the caller to utilize a fallback -mechanism (such as the directory walk performed today). - -### Operation Flow -- Removing a Quota. - -* Caller requests that the quota be removed from a directory. -* Determine whether a project quota applies to the directory. -* Remove the limit from the project ID associated with the directory. -* Remove the association between the directory and the project ID. -* Return the project ID to the system to allow its use elsewhere ([see - below](#return-a-project-id-to-the-system)). -* Caller may delete the directory and its contents (normally it will). - -### Operation Notes - -#### Selecting a Project ID - -Project IDs are a shared space within a filesystem. If the same -project ID is assigned to multiple directories, the space consumption -reported by the quota will be the sum of that of all of the -directories. Hence, it is important to ensure that each directory is -assigned a unique project ID (unless it is desired to pool the storage -use of multiple directories). - -The canonical mechanism to record persistently that a project ID is -reserved is to store it in the `/etc/projid` (`projid[5]`) and/or -`/etc/projects` (`projects(5)`) files. However, it is possible to utilize -project IDs without recording them in those files; they exist for -administrative convenience but neither the kernel nor the filesystem -is aware of them. 
Other ways can be used to determine whether a
-project ID is in active use on a given filesystem:
-
-* The quota values (in blocks and/or inodes) assigned to the project
-  ID are non-zero.
-* The storage consumption (in blocks and/or inodes) reported under the
-  project ID is non-zero.
-
-The algorithm to be used is as follows:
-
-* Lock this instance of the quota code against re-entrancy.
-* open and `flock()` the `/etc/projects` and `/etc/projid` files, so that
-  other uses of this code are excluded.
-* Start from a high number (the prototype uses 1048577).
-* Iterate from there, performing the following tests:
-  * Is the ID reserved by this instance of the quota code?
-  * Is the ID present in `/etc/projects`?
-  * Is the ID present in `/etc/projid`?
-  * Are the quota values and/or consumption reported by the kernel
-    non-zero? This test is restricted to 128 iterations to ensure
-    that a bug here or elsewhere does not result in an infinite loop
-    looking for a quota ID.
-* If an ID has been found:
-  * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so
-    that any other uses of project quotas do not reuse it.
-  * Write temporary copies of `/etc/projects` and `/etc/projid` that are
-    `flock()`ed
-  * If successful, rename the temporary files appropriately (if
-    rename of one succeeds but the other fails, we have a problem
-    that we cannot recover from, and the files may be inconsistent).
-* Unlock `/etc/projid` and `/etc/projects`.
-* Unlock this instance of the quota code.
-
-A minor variation of this is used if we want to reuse an existing
-quota ID.
-
-#### Determine Whether a Project ID Applies To a Directory
-
-It is possible to determine whether a directory has a project ID
-applied to it by requesting (via the `quotactl(2)` system call) the
-project ID associated with the directory. While the specifics are
-filesystem-dependent, the basic method is the same for at least XFS
-and ext4fs.
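The ID-selection loop described above can be sketched as pure logic. In this illustrative Go fragment, injected predicates stand in for the `/etc/projects`/`/etc/projid` lookups and the kernel quota check (the function and parameter names are hypothetical, not the prototype's):

```go
package main

import (
	"errors"
	"fmt"
)

// findUnusedProjectID scans upward from start, skipping IDs that either
// predicate reports as in use. The kernel check is capped at 128 hits so
// that a bug here or elsewhere cannot cause an unbounded search.
func findUnusedProjectID(start uint32, inFiles func(uint32) bool, kernelBusy func(uint32) bool) (uint32, error) {
	kernelHits := 0
	for id := start; id != 0; id++ { // id wraps to 0 on 32-bit overflow
		if inFiles(id) { // reserved in /etc/projects or /etc/projid
			continue
		}
		if kernelBusy(id) { // non-zero quota values or consumption
			kernelHits++
			if kernelHits >= 128 {
				return 0, errors.New("too many busy project IDs; giving up")
			}
			continue
		}
		return id, nil
	}
	return 0, errors.New("project ID space exhausted")
}

func main() {
	reserved := map[uint32]bool{1048577: true, 1048578: true}
	id, err := findUnusedProjectID(1048577,
		func(id uint32) bool { return reserved[id] },
		func(id uint32) bool { return false },
	)
	fmt.Println(id, err == nil)
}
```

In the real algorithm this loop runs only while both files are `flock()`ed, so the returned ID stays unique across concurrent users of the quota code.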
-
-It is not possible to determine in a constant number of operations the
-directory or directories to which a project ID is applied. It is
-possible to determine whether a given project ID has been applied to an
-existing directory or files (although those will not be known); the
-reported consumption will be non-zero.
-
-The code records internally the project ID applied to a directory, but
-it cannot always rely on this. In particular, if the kubelet has
-exited and has been restarted (and hence the quota applying to the
-directory should be removed), the map from directory to project ID is
-lost. If it cannot find a map entry, it falls back on the approach
-discussed above.
-
-#### Return a Project ID To the System
-
-The algorithm used to return a project ID to the system is very
-similar to the algorithm used to select a project ID, except that no
-new project ID needs to be selected. It performs the same sequence of
-locking `/etc/projects` and `/etc/projid`, editing copies of the files,
-and restoring them.
-
-If the project ID is applied to multiple directories and the code can
-determine that, it will not remove the project ID from `/etc/projid`
-until the last reference is removed. While it is not anticipated in
-this KEP that this mode of operation will be used, at least initially,
-this can be detected even on kubelet restart by looking at the
-reference count in `/etc/projects`.
-
-
-### Implementation Details/Notes/Constraints [optional]
-
-#### Notes on Implementation
-
-The primary new interface defined is the quota interface in
-`pkg/volume/util/quota/quota.go`. This defines five operations:
-
-* Does the specified directory support quotas?
-
-* Assign a quota to a directory. If a non-empty pod UID is provided,
-  the quota assigned is that of any other directories under this pod
-  UID; if an empty pod UID is provided, a unique quota is assigned.
-
-* Retrieve the consumption of the specified directory.
If the quota - code cannot handle it efficiently, it returns an error and the - caller falls back on existing mechanism. - -* Retrieve the inode consumption of the specified directory; same - description as above. - -* Remove quota from a directory. If a non-empty pod UID is passed, it - is checked against that recorded in-memory (if any). The quota is - removed from the specified directory. This can be used even if - AssignQuota has not been used; it inspects the directory and removes - the quota from it. This permits stale quotas from an interrupted - kubelet to be cleaned up. - -Two implementations are provided: `quota_linux.go` (for Linux) and -`quota_unsupported.go` (for other operating systems). The latter -returns an error for all requests. - -As the quota mechanism is intended to support multiple filesystems, -and different filesystems require different low level code for -manipulating quotas, a provider is supplied that finds an appropriate -quota applier implementation for the filesystem in question. The low -level quota applier provides similar operations to the top level quota -code, with two exceptions: - -* No operation exists to determine whether a quota can be applied - (that is handled by the provider). - -* An additional operation is provided to determine whether a given - quota ID is in use within the filesystem (outside of `/etc/projects` - and `/etc/projid`). - -The two quota providers in the initial implementation are in -`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While -some quota operations do require different system calls, a lot of the -code is common, and factored into -`pkg/volume/util/quota/common/quota_linux_common_impl.go`. - -#### Notes on Code Changes - -The prototype for this project is mostly self-contained within -`pkg/volume/util/quota` and a few changes to -`pkg/volume/empty_dir/empty_dir.go`. 
However, a few changes were
-required elsewhere:
-
-* The operation executor needs to pass the desired size limit to the
-  volume plugin where appropriate so that the volume plugin can impose
-  a quota. The limit is passed as 0 (do not use quotas), a positive
-  number (impose an enforcing quota if possible, measured in bytes),
-  or -1 (impose a non-enforcing quota, if possible) on the volume.
-
-  This requires changes to
-  `pkg/volume/util/operationexecutor/operation_executor.go` (to add
-  `DesiredSizeLimit` to `VolumeToMount`),
-  `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and
-  `pkg/kubelet/eviction/helpers.go` (the latter in order to determine
-  whether the volume is a local ephemeral one).
-
-* The volume manager (in `pkg/volume/volume.go`) changes the
-  `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new
-  `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to
-  allow passing the desired size and pod UID (in the event we choose
-  to implement quotas shared between multiple volumes; [see
-  below](#alternative-quota-based-implementation)). This required
-  small changes to all volume plugins and their tests, but will in the
-  future allow adding additional data without having to change code
-  other than that which uses the new information.
-
-#### Testing Strategy
-
-The quota code is by and large not very amenable to unit tests. While
-there are simple unit tests for parsing the mounts file, and there
-could be tests for parsing the projects and projid files, the real
-work (and risk) involves interactions with the kernel and with
-multiple instances of this code (e.g. in the kubelet and the runtime
-manager, particularly under stress). It also requires setup in the
-form of a prepared filesystem. It would be better served by
-appropriate end-to-end tests. 
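The tri-state size limit passed from the operation executor to the volume plugin can be sketched as follows. This is an illustrative helper only, not the actual Kubernetes code; the names `quotaMode` and `interpretSizeLimit` are assumptions introduced here:

```go
package main

import "fmt"

// quotaMode mirrors the convention described above: the operation executor
// passes 0 (do not use quotas), a positive byte count (enforcing quota), or
// -1 (non-enforcing quota) down to the volume plugin.
type quotaMode int

const (
	quotaNone quotaMode = iota
	quotaEnforcing
	quotaNonEnforcing
)

// interpretSizeLimit decodes the desired size limit into a quota decision
// and, for the enforcing case, the limit in bytes.
func interpretSizeLimit(limit int64) (quotaMode, int64) {
	switch {
	case limit == 0:
		return quotaNone, 0
	case limit > 0:
		return quotaEnforcing, limit
	default: // negative; by convention -1
		return quotaNonEnforcing, 0
	}
}

func main() {
	mode, bytes := interpretSizeLimit(1 << 20)
	fmt.Println(mode == quotaEnforcing, bytes) // true 1048576
	mode, _ = interpretSizeLimit(-1)
	fmt.Println(mode == quotaNonEnforcing) // true
}
```

Keeping the convention in a single decoding point like this would make the 0 / positive / -1 encoding easy to unit-test, even though (as noted above) the quota system calls themselves are better exercised by end-to-end tests.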
-
-### Risks and Mitigations
-
-* The SIG raised the possibility of a container being unable to exit
-  should we enforce quotas, and the quota interferes with writing the
-  log. This can be mitigated either by not applying a quota to the
-  log directory and using the du mechanism, or by applying a separate
-  non-enforcing quota to the log directory.
-
-  As log directories are write-only by the container, and consumption
-  can be limited by other means (as the log is filtered by the
-  runtime), I do not consider the ability to write uncapped to the log
-  to be a serious exposure.
-
-  Note in addition that even without quotas it is possible for writes
-  to fail due to lack of filesystem space, which is effectively (and
-  in some cases operationally) indistinguishable from exceeding quota,
-  so even at present code must be able to handle those situations.
-
-* Filesystem quotas may impact performance to an unknown degree.
-  Information on that is hard to come by in general, and one of the
-  reasons for using quotas is indeed to improve performance. If this
-  is a problem in the field, merely turning off quotas (or selectively
-  disabling project quotas) on the filesystem in question will avoid
-  the problem. Against the possibility that this cannot be done
-  (because project quotas are needed for other purposes), we should
-  provide a way to disable use of quotas altogether via a feature
-  gate.
-
-  A report <https://blog.pythonanywhere.com/110/> notes that an
-  unclean shutdown on Linux kernel versions between 3.11 and 3.17 can
-  result in a prolonged downtime while quota information is restored.
-  Unfortunately, [the link referenced
-  here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no
-  longer available.
-
-* Bugs in the quota code could result in a variety of regression
-  behavior. For example, if a quota is incorrectly applied it could
-  result in an inability to write any data at all to the volume. This
-  could be mitigated by use of non-enforcing quotas. 
XFS in particular - offers the `pqnoenforce` mount option that makes all quotas - non-enforcing. - - -## Graduation Criteria - -How will we know that this has succeeded? Gathering user feedback is -crucial for building high quality experiences and SIGs have the -important responsibility of setting milestones for stability and -completeness. Hopefully the content previously contained in [umbrella -issues][] will be tracked in the `Graduation Criteria` section. - -[umbrella issues]: N/A - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in -`Implementation History`. Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG - acceptance -- the `Proposal` section being merged signaling agreement on a - proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was - available -- the version of Kubernetes where the KEP graduated to general - availability -- when the KEP was retired or superseded - -## Drawbacks [optional] - -* Use of quotas, particularly the less commonly used project quotas, - requires additional action on the part of the administrator. In - particular: - * ext4fs filesystems must be created with additional options that - are not enabled by default: -``` -mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ -``` - * An additional option (`prjquota`) must be applied in `/etc/fstab` - * If the root filesystem is to be quota-enabled, it must be set in - the grub options. -* Use of project quotas for this purpose will preclude future use - within containers. 
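As noted in the drawbacks above, project quotas only work if the filesystem was mounted with the right option (`prjquota`, or its XFS synonym `pquota`). A minimal sketch of how code might validate the mount-options field of an fstab-style entry; this helper is hypothetical and not part of the KEP's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// hasProjectQuotaOption reports whether a comma-separated mount-options
// field (as found in /etc/fstab or /proc/mounts) enables project quotas.
// `prjquota` is the common spelling; XFS also accepts `pquota`.
func hasProjectQuotaOption(options string) bool {
	for _, opt := range strings.Split(options, ",") {
		if opt == "prjquota" || opt == "pquota" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasProjectQuotaOption("rw,relatime,prjquota")) // true
	fmt.Println(hasProjectQuotaOption("rw,relatime"))          // false
}
```

A check like this could let the quota code fail fast (and fall back to the du mechanism) on filesystems where the administrator has not enabled project quotas.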
- -## Alternatives [optional] - -I have considered two classes of alternatives: - -* Alternatives based on quotas, with different implementation - -* Alternatives based on loop filesystems without use of quotas - -### Alternative quota-based implementation - -Within the basic framework of using quotas to monitor and potentially -enforce storage utilization, there are a number of possible options: - -* Utilize per-volume non-enforcing quotas to monitor storage (the - first stage of this proposal). - - This mostly preserves the current behavior, but with more efficient - determination of storage utilization and the possibility of building - further on it. The one change from current behavior is the ability - to detect space used by deleted files. - -* Utilize per-volume enforcing quotas to monitor and enforce storage - (the second stage of this proposal). - - This allows partial enforcement of storage limits. As local storage - capacity isolation works at the level of the pod, and we have no - control of user utilization of ephemeral volumes, we would have to - give each volume a quota of the full limit. For example, if a pod - had a limit of 1 MB but had four ephemeral volumes mounted, it would - be possible for storage utilization to reach (at least temporarily) - 4MB before being capped. - -* Utilize per-pod enforcing user or group quotas to enforce storage - consumption, and per-volume non-enforcing quotas for monitoring. - - This would offer the best of both worlds: a fully capped storage - limit combined with efficient reporting. However, it would require - each pod to run under a distinct UID or GID. This may prevent pods - from using setuid or setgid or their variants, and would interfere - with any other use of group or user quotas within Kubernetes. - -* Utilize per-pod enforcing quotas to monitor and enforce storage. 
- - This allows for full enforcement of storage limits, at the expense - of being able to efficiently monitor per-volume storage - consumption. As there have already been reports of monitoring - causing trouble, I do not advise this option. - - A variant of this would report (1/N) storage for each covered - volume, so with a pod with a 4MiB quota and 1MiB total consumption, - spread across 4 ephemeral volumes, each volume would report a - consumption of 256 KiB. Another variant would change the API to - report statistics for all ephemeral volumes combined. I do not - advise this option. - -### Alternative loop filesystem-based implementation - -Another way of isolating storage is to utilize filesystems of -pre-determined size, using the loop filesystem facility within Linux. -It is possible to create a file and run `mkfs(8)` on it, and then to -mount that filesystem on the desired directory. This both limits the -storage available within that directory and enables quick retrieval of -it via `statfs(2)`. - -Cleanup of such a filesystem involves unmounting it and removing the -backing file. - -The backing file can be created as a sparse file, and the `discard` -option can be used to return unused space to the system, allowing for -thin provisioning. - -I conducted preliminary investigations into this. While at first it -appeared promising, it turned out to have multiple critical flaws: - -* If the filesystem is mounted without the `discard` option, it can - grow to the full size of the backing file, negating any possibility - of thin provisioning. If the file is created dense in the first - place, there is never any possibility of thin provisioning without - use of `discard`. - - If the backing file is created densely, it additionally may require - significant time to create if the ephemeral limit is large. 
-
-* If the filesystem is mounted `nosync`, and is sparse, it is possible
-  for writes to succeed and then fail later with I/O errors when
-  synced to the backing storage. This will lead to data corruption
-  that cannot be detected at the time of write.
-
-  This can easily be reproduced by e.g. creating a 64MB filesystem
-  and within it creating a 128MB sparse file and building a filesystem
-  on it. When that filesystem is in turn mounted, writes to it will
-  succeed, but I/O errors will be seen in the log and the file will be
-  incomplete:
-
-```
-# mkdir /var/tmp/d1 /var/tmp/d2
-# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
-# mkfs.ext4 /var/tmp/fs1
-# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
-# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
-# mkfs.ext4 /var/tmp/d1/fs2
-# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
-# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
-  ...will normally succeed...
-# sync
-  ...fails with I/O error!...
-```
-
-* If the filesystem is mounted `sync`, all writes to it are
-  immediately committed to the backing store, and the `dd` operation
-  above fails as soon as it fills up `/var/tmp/d1`. However,
-  performance is drastically slowed, particularly with small writes;
-  with 1K writes, I observed performance degradation in some cases
-  exceeding three orders of magnitude.
-
-  I performed a test comparing writing 64 MB to a base (partitioned)
-  filesystem, to a loop filesystem without `sync`, and a loop
-  filesystem with `sync`. Total I/O was sufficient to run for at least
-  5 seconds in each case. All filesystems involved were XFS. Loop
-  filesystems were 128 MB and dense. Times are in seconds. The
-  erratic behavior (e.g. the 65536 case) was observed repeatedly,
-  although the exact amount of time and which I/O sizes were affected
-  varied. The underlying device was an HP EX920 1TB NVMe SSD. 
-
-| I/O Size | Partition | Loop w/o sync | Loop w/sync |
-| ---: | ---: | ---: | ---: |
-| 1024 | 0.104 | 0.120 | 140.390 |
-| 4096 | 0.045 | 0.077 | 21.850 |
-| 16384 | 0.045 | 0.067 | 5.550 |
-| 65536 | 0.044 | 0.061 | 20.440 |
-| 262144 | 0.043 | 0.087 | 0.545 |
-| 1048576 | 0.043 | 0.055 | 7.490 |
-| 4194304 | 0.043 | 0.053 | 0.587 |
-
-  The only potentially viable combination in my view would be a dense
-  loop filesystem without sync, but that would render any thin
-  provisioning impossible.
-
-## Infrastructure Needed [optional]
-
-* Decision: who is responsible for quota management of all volume
-  types (and especially ephemeral volumes of all types). At present,
-  emptydir volumes are managed by the kubelet and logdirs and writable
-  layers by either the kubelet or the runtime, depending upon the
-  choice of runtime. Beyond the specific proposal that the runtime
-  should manage quotas for volumes it creates, there are broader
-  issues that I request assistance from the SIG in addressing.
-
-* Location of the quota code. If the quotas for different volume
-  types are to be managed by different components, each such component
-  needs access to the quota code. The code is substantial and should
-  not be copied; it would more appropriately be vendored.
-
-## References
-
-### Bugs Opened Against Filesystem Quotas
-
-The following is a list of known security issues referencing
-filesystem quotas on Linux, and other bugs referencing filesystem
-quotas in Linux since 2012. These bugs are not necessarily in the
-quota system.
-
-#### CVE
-
-* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel
-  before 3.3.6, when huge pages are enabled, allows local users to
-  cause a denial of service (system crash) or possibly gain privileges
-  by interacting with a hugetlbfs filesystem, as demonstrated by a
-  umount operation that triggers improper handling of quota data.
-
-  The issue is actually related to huge pages, not quotas
-  specifically. 
The demonstration of the vulnerability resulted in
-incorrect handling of quota data.
-
-* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c)
-  in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl
-  function the first time without a host name, which might allow
-  remote attackers to bypass TCP Wrappers rules in hosts.deny (related
-  to rpc.rquotad).
-
-  This issue is related to remote quota handling, which is not the use
-  case for the proposal at hand.
-
-#### Other Security Issues Without CVE
-
-* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and
-  Create Large Files](https://securitytracker.com/id/1002610)
-
-  A setuid root binary inheriting file descriptors from an
-  unprivileged user process may write to the file without respecting
-  quota limits. If this issue is still present, it would allow a
-  setuid process to exceed any enforcing limits, but does not affect
-  the quota accounting (use of quotas for monitoring).
-
-### Other Linux Quota-Related Bugs Since 2012
-
-* [ext4: report delalloc reserve as non-free in statfs mangled by
-  project quota](https://lore.kernel.org/patchwork/patch/884530/)
-
-  This bug, fixed in Feb. 2018, properly accounts for reserved but not
-  committed space in project quotas. At this point I have not
-  determined the impact of this issue.
-
-* [XFS quota doesn't work after rebooting because of
-  crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730)
-
-  This bug resulted in XFS quotas not working after a crash or forced
-  reboot. Under this proposal, Kubernetes would fall back to du for
-  monitoring should a bug of this nature manifest itself again.
-
-* [quota can show incorrect filesystem
-  name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527)
-
-  This issue, which will not be fixed, results in the quota command
-  possibly printing an incorrect filesystem name when used on remote
-  filesystems. 
It is a display issue with the quota command, not a - quota bug at all, and does not result in incorrect quota information - being reported. As this proposal does not utilize the quota command - or rely on filesystem name, or currently use quotas on remote - filesystems, it should not be affected by this bug. - -In addition, the e2fsprogs have had numerous fixes over the years. +KEPs have moved to https://git.k8s.io/enhancements/. +<!-- +This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first. +-->
\ No newline at end of file diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md index 1ce72617..cfd1f5fa 100644 --- a/keps/sig-node/compute-device-assignment.md +++ b/keps/sig-node/compute-device-assignment.md @@ -1,150 +1,4 @@ ---- -kep-number: 18 -title: Kubelet endpoint for device assignment observation details -authors: - - "@dashpole" - - "@vikaschoudhary16" -owning-sig: sig-node -reviewers: - - "@thockin" - - "@derekwaynecarr" - - "@dchen1107" - - "@vishh" -approvers: - - "@sig-node-leads" -editors: - - "@dashpole" - - "@vikaschoudhary16" -creation-date: "2018-07-19" -last-updated: "2018-07-19" -status: provisional ---- -# Kubelet endpoint for device assignment observation details - -Table of Contents -================= -* [Abstract](#abstract) -* [Background](#background) -* [Objectives](#objectives) -* [User Journeys](#user-journeys) - * [Device Monitoring Agents](#device-monitoring-agents) -* [Changes](#changes) -* [Potential Future Improvements](#potential-future-improvements) -* [Alternatives Considered](#alternatives-considered) - -## Abstract -In this document we will discuss the motivation and code changes required for introducing a kubelet endpoint to expose device to container bindings. - -## Background -[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) requires external agents to be able to determine the set of devices in-use by containers and attach pod and container metadata for these devices. 
-
-## Objectives
-
-* To remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
-* To enable future use-cases requiring device-specific knowledge to be out-of-tree
-
-## User Journeys
-
-### Device Monitoring Agents
-
-* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
-* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics:
-
-
-
-
-## Changes
-
-Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. 
The GRPC Service returns a single ListPodResourcesResponse, which is shown in proto below:
-```protobuf
-// PodResources is a service provided by the kubelet that provides information about the
-// node resources consumed by pods and containers on the node
-service PodResources {
-    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
-}
-
-// ListPodResourcesRequest is the request made to the PodResources service
-message ListPodResourcesRequest {}
-
-// ListPodResourcesResponse is the response returned by List function
-message ListPodResourcesResponse {
-    repeated PodResources pod_resources = 1;
-}
-
-// PodResources contains information about the node resources assigned to a pod
-message PodResources {
-    string name = 1;
-    string namespace = 2;
-    repeated ContainerResources containers = 3;
-}
-
-// ContainerResources contains information about the resources assigned to a container
-message ContainerResources {
-    string name = 1;
-    repeated ContainerDevices devices = 2;
-}
-
-// ContainerDevices contains information about the devices assigned to a container
-message ContainerDevices {
-    string resource_name = 1;
-    repeated string device_ids = 2;
-}
-```
-
-### Potential Future Improvements
-
-* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
-* Add identifiers for other resources used by pods to the `PodResources` message.
-  * For example, persistent volume location on disk
-
-## Alternatives Considered
-
-### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers. 
-
-* Pros:
-  * Reuse an existing API for describing containers rather than inventing a new one
-* Cons:
-  * It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
-  * It does not contain any additional information that will be useful to monitoring agents other than devices, and contains lots of irrelevant information for this use-case.
-* Notes:
-  * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.
-
-### Add a field to Pod Status.
-* Pros:
-  * Allows for observation of container to device bindings local to the node through the `/pods` endpoint
-* Cons:
-  * Only consumed locally, which doesn't justify an API change
-  * Device Bindings are immutable after allocation, and are _debatably_ observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.
-
-### Use the Kubelet Device Manager Checkpoint file
-* Allows for observability of device to container bindings through what exists in the checkpoint file
-  * Requires adding additional metadata to the checkpoint file as required by the monitoring agent
-* Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
-* Future modifications to the checkpoint file are more difficult.
-
-### Add a field to the Pod Spec:
-* A new object `ComputeDevice` will be defined and a new variable `ComputeDevices` will be added in the `Container` (Spec) object which will represent a list of `ComputeDevice` objects. 
-
-```golang
-// ComputeDevice describes the devices assigned to this container for a given ResourceName
-type ComputeDevice struct {
-    // DeviceIDs is the list of devices assigned to this container
-    DeviceIDs []string
-    // ResourceName is the name of the compute resource
-    ResourceName string
-}
-
-// Container represents a single container that is expected to be run on the host.
-type Container struct {
-    ...
-    // ComputeDevices contains the devices assigned to this container
-    // This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
-    // +optional
-    ComputeDevices []ComputeDevice
-    ...
-}
-```
-* During Kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise behaviour will remain the same as it is today.
-* Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
-  * Note: Writing to the API Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
-* Allows devices to potentially be assigned by a custom scheduler.
-* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.
-
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file
