authorStephen Augustus <foo@agst.us>2018-12-01 02:40:42 -0500
committerStephen Augustus <foo@agst.us>2018-12-01 02:40:42 -0500
commit1004e56177eb12d85b6e0f6cf1ccd00431f7336b (patch)
treee2a87f95b32e046ed32a2eea6cde661704e61fbd /keps/sig-node
parent973b19523840d207ae206175ac2093d3b564668c (diff)
Add KEP tombstones
Signed-off-by: Stephen Augustus <foo@agst.us>
Diffstat (limited to 'keps/sig-node')
-rw-r--r-- keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md | 229
-rw-r--r-- keps/sig-node/0009-node-heartbeat.md | 396
-rw-r--r-- keps/sig-node/0014-runtime-class.md | 403
-rw-r--r-- keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md | 811
-rw-r--r-- keps/sig-node/compute-device-assignment.md | 154
5 files changed, 20 insertions, 1973 deletions
diff --git a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md
index 4a2090a1..cfd1f5fa 100644
--- a/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md
+++ b/keps/sig-node/0008-20180430-promote-sysctl-annotations-to-fields.md
@@ -1,225 +1,4 @@
----
-kep-number: 8
-title: Promote sysctl annotations to fields
-authors:
- - "@ingvagabund"
-owning-sig: sig-node
-participating-sigs:
- - sig-auth
-reviewers:
- - "@sjenning"
- - "@derekwaynecarr"
-approvers:
- "@sjenning"
- - "@derekwaynecarr"
-editor:
-creation-date: 2018-04-30
-last-updated: 2018-05-02
-status: provisional
-see-also:
-replaces:
-superseded-by:
----
-
-# Promote sysctl annotations to fields
-
-## Table of Contents
-
-* [Promote sysctl annotations to fields](#promote-sysctl-annotations-to-fields)
- * [Table of Contents](#table-of-contents)
- * [Summary](#summary)
- * [Motivation](#motivation)
- * [Promote annotations to fields](#promote-annotations-to-fields)
- * [Promote --experimental-allowed-unsafe-sysctls kubelet flag to kubelet config api option](#promote---experimental-allowed-unsafe-sysctls-kubelet-flag-to-kubelet-config-api-option)
- * [Gate the feature](#gate-the-feature)
- * [Proposal](#proposal)
- * [User Stories](#user-stories)
- * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- * [Risks and Mitigations](#risks-and-mitigations)
- * [Graduation Criteria](#graduation-criteria)
- * [Implementation History](#implementation-history)
-
-## Summary
-
-Setting the `sysctl` parameters through annotations provided a successful story
-for defining better constraints of running applications.
-The `sysctl` feature has been tested by a number of people without any serious
-complaints. Promoting the annotations to fields (i.e. to beta) is another step in making the
-`sysctl` feature closer towards the stable API.
-
-Currently, the `sysctl` provides `security.alpha.kubernetes.io/sysctls` and `security.alpha.kubernetes.io/unsafe-sysctls` annotations that can be used
-in the following way:
- ```yaml
- apiVersion: v1
- kind: Pod
- metadata:
- name: sysctl-example
- annotations:
- security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
- security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
- spec:
- ...
- ```
-
- The goal is to transition into native fields on pods:
-
- ```yaml
- apiVersion: v1
- kind: Pod
- metadata:
- name: sysctl-example
- spec:
- securityContext:
- sysctls:
- - name: kernel.shm_rmid_forced
- value: 1
- - name: net.ipv4.route.min_pmtu
- value: 1000
- unsafe: true
- - name: kernel.msgmax
- value: "1 2 3"
- unsafe: true
- ...
- ```
-
-The `sysctl` design document with more details and rationale is available at [design-proposals/node/sysctl.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#pod-api-changes).
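For illustration, the comma-separated annotation format shown above maps cleanly onto the proposed structured form. The following is a hedged sketch in Go with a hypothetical local `Sysctl` type, not code from the Kubernetes tree:

```go
package main

import (
	"fmt"
	"strings"
)

// Sysctl mirrors the proposed pod-level field (illustrative local type).
type Sysctl struct {
	Name   string
	Value  string
	Unsafe bool
}

// parseSysctlAnnotation splits the comma-separated "name=value" pairs used by
// the security.alpha.kubernetes.io/* annotations into structured entries.
func parseSysctlAnnotation(annotation string, unsafe bool) ([]Sysctl, error) {
	var out []Sysctl
	for _, kv := range strings.Split(annotation, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed sysctl entry %q", kv)
		}
		out = append(out, Sysctl{Name: parts[0], Value: parts[1], Unsafe: unsafe})
	}
	return out, nil
}

func main() {
	s, _ := parseSysctlAnnotation("net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3", true)
	fmt.Println(s[0].Name, s[0].Value) // net.ipv4.route.min_pmtu 1000
}
```

Note that values may contain spaces (e.g. `"1 2 3"`) but not commas, which is why a plain comma split suffices in this sketch.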
-
-## Motivation
-
-As mentioned in [contributors/devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version):
-
-> Previously, annotations were used for experimental alpha features, but are no longer recommended for several reasons:
->
-> They expose the cluster to "time-bomb" data added as unstructured annotations against an earlier API server (https://issue.k8s.io/30819)
-> They cannot be migrated to first-class fields in the same API version (see the issues with representing a single value in multiple places in backward compatibility gotchas)
->
-> The preferred approach adds an alpha field to the existing object, and ensures it is disabled by default:
->
-> ...
-
-Annotations are no longer necessary as a means to set `sysctl` parameters.
-The original intent of annotations was to provide additional description of Kubernetes
-objects through metadata.
-It's time to separate the ability to annotate from the ability to change sysctl settings,
-so a cluster operator can draw a clear distinction between experimental and supported
-usage of the feature.
-
-### Promote annotations to fields
-
-* Introduce native `sysctl` fields in pods through the `spec.securityContext.sysctls` field as:
-
- ```yaml
- sysctls:
- - name: SYSCTL_PATH_NAME
-   value: SYSCTL_PATH_VALUE
-   unsafe: true # optional field
- ```
-
-* Introduce native `sysctl` fields in [PSP](https://kubernetes.io/docs/concepts/policy/pod-security-policy/) as:
-
- ```yaml
- apiVersion: policy/v1beta1
- kind: PodSecurityPolicy
- metadata:
- name: psp-example
- spec:
- sysctls:
- - kernel.shmmax
- - kernel.shmall
- - net.*
- ```
-
- More examples at [design-proposals/node/sysctl.md#allowing-only-certain-sysctls](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/sysctl.md#allowing-only-certain-sysctls)
-
-### Promote `--experimental-allowed-unsafe-sysctls` kubelet flag to kubelet config api option
-
-As there is no longer a need to consider the `sysctl` feature experimental,
-the list of unsafe sysctls can be configured accordingly through:
-
-```go
-// KubeletConfiguration contains the configuration for the Kubelet
-type KubeletConfiguration struct {
- ...
- // Whitelist of unsafe sysctls or unsafe sysctl patterns (ending in *).
- // Default: nil
- // +optional
- AllowedUnsafeSysctls []string `json:"allowedUnsafeSysctls,omitempty"`
-}
-```
-
-Upstream issue: https://github.com/kubernetes/kubernetes/issues/61669
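The comment above says whitelist entries are either plain sysctl names or patterns ending in `*`. A minimal sketch of how such a check could work, assuming prefix-match semantics for patterns (an illustration, not the actual kubelet implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// sysctlAllowed reports whether name is covered by the whitelist; entries
// ending in "*" are treated as prefix patterns, everything else as exact names.
func sysctlAllowed(name string, whitelist []string) bool {
	for _, w := range whitelist {
		if strings.HasSuffix(w, "*") {
			if strings.HasPrefix(name, strings.TrimSuffix(w, "*")) {
				return true
			}
		} else if name == w {
			return true
		}
	}
	return false
}

func main() {
	wl := []string{"kernel.shmmax", "net.*"}
	fmt.Println(sysctlAllowed("net.ipv4.route.min_pmtu", wl)) // true
	fmt.Println(sysctlAllowed("kernel.msgmax", wl))           // false
}
```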
-
-### Gate the feature
-
-As the `sysctl` feature stabilizes, it's time to gate the feature [1] and enable it by default.
-
-* Expected feature gate key: `Sysctls`
-* Expected default value: `true`
-
-With the `Sysctls` feature gate enabled, both the sysctl fields in `Pod` and `PodSecurityPolicy`
-and the whitelist of unsafe sysctls are acknowledged.
-If disabled, the fields and the whitelist are ignored.
-
-[1] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-
-## Proposal
-
-This is where we get down to the nitty gritty of what the proposal actually is.
-
-### User Stories
-
-* As a cluster admin, I want the `sysctl` feature versioned so I can ensure backward compatibility
-  and proper transformation between the versioned and internal representations.
-* As a cluster admin, I want to be confident the `sysctl` feature is stable enough and well supported so
-  applications are properly isolated.
-* As a cluster admin, I want to be able to apply `sysctl` constraints at the cluster level so
-  I can define the default constraints for all pods.
-
-### Implementation Details/Notes/Constraints
-
-Extending `SecurityContext` struct with `Sysctls` field:
-
-```go
-// PodSecurityContext holds pod-level security attributes and common container settings.
-// Some fields are also present in container.securityContext. Field values of
-// container.securityContext take precedence over field values of PodSecurityContext.
-type PodSecurityContext struct {
- ...
- // Sysctls is a list of sysctls to be applied to the pod.
- Sysctls []Sysctl `json:"sysctls,omitempty"`
-}
-```
-
-Extending `PodSecurityPolicySpec` struct with `Sysctls` field:
-
-```go
-// PodSecurityPolicySpec defines the policy enforced on sysctls.
-type PodSecurityPolicySpec struct {
- ...
- // Sysctls is a whitelist of allowed sysctls in a pod spec.
- Sysctls []Sysctl `json:"sysctls,omitempty"`
-}
-```
-
-Following steps in [devel/api_changes.md#alpha-field-in-existing-api-version](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-field-in-existing-api-version)
-during implementation.
-
-Validation checks implemented as part of [#27180](https://github.com/kubernetes/kubernetes/pull/27180).
-
-### Risks and Mitigations
-
-We need to ensure backward compatibility, i.e. object specifications with `sysctl` annotations
-must still work after the graduation.
-
-## Graduation Criteria
-
-* API changes allowing configuration of the pod-scoped `sysctl`s via the `spec.securityContext` field
-* API changes allowing configuration of the cluster-scoped `sysctl`s via the `PodSecurityPolicy` object
-* Promotion of the `--experimental-allowed-unsafe-sysctls` kubelet flag to a kubelet config API option
-* Feature gate enabled by default
-* e2e tests
-
-## Implementation History
-
-The `sysctl` feature is tracked as part of [features#34](https://github.com/kubernetes/features/issues/34).
-This is one of the goals to promote the annotations to fields.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file
diff --git a/keps/sig-node/0009-node-heartbeat.md b/keps/sig-node/0009-node-heartbeat.md
index f80b9609..cfd1f5fa 100644
--- a/keps/sig-node/0009-node-heartbeat.md
+++ b/keps/sig-node/0009-node-heartbeat.md
@@ -1,392 +1,4 @@
----
-kep-number: 9
-title: Efficient Node Heartbeat
-authors:
- - "@wojtek-t"
- - "with input from @bgrant0607, @dchen1107, @yujuhong, @lavalamp"
-owning-sig: sig-node
-participating-sigs:
- - sig-scalability
- - sig-apimachinery
- - sig-scheduling
-reviewers:
- - "@deads2k"
- - "@lavalamp"
-approvers:
- - "@dchen1107"
- - "@derekwaynecarr"
-editor: TBD
-creation-date: 2018-04-27
-last-updated: 2018-04-27
-status: implementable
-see-also:
- - https://github.com/kubernetes/kubernetes/issues/14733
- - https://github.com/kubernetes/kubernetes/pull/14735
-replaces:
- - n/a
-superseded-by:
- - n/a
----
-
-# Efficient Node Heartbeats
-
-## Table of Contents
-
-* [Efficient Node Heartbeats](#efficient-node-heartbeats)
- * [Table of Contents](#table-of-contents)
- * [Summary](#summary)
- * [Motivation](#motivation)
- * [Goals](#goals)
- * [Non-Goals](#non-goals)
- * [Proposal](#proposal)
- * [Risks and Mitigations](#risks-and-mitigations)
- * [Graduation Criteria](#graduation-criteria)
- * [Implementation History](#implementation-history)
- * [Alternatives](#alternatives)
- * [Dedicated “heartbeat” object instead of “leader election” one](#dedicated-heartbeat-object-instead-of-leader-election-one)
- * [Events instead of dedicated heartbeat object](#events-instead-of-dedicated-heartbeat-object)
- * [Reuse the Component Registration mechanisms](#reuse-the-component-registration-mechanisms)
- * [Split Node object into two parts at etcd level](#split-node-object-into-two-parts-at-etcd-level)
- * [Delta compression in etcd](#delta-compression-in-etcd)
- * [Replace etcd with other database](#replace-etcd-with-other-database)
-
-## Summary
-
-Node heartbeats are necessary for the correct functioning of a Kubernetes cluster.
-This proposal makes them significantly cheaper from both a scalability and a
-performance perspective.
-
-## Motivation
-
-While running different scalability tests we observed that in big enough clusters
-(more than 2000 nodes) with a non-trivial number of images used by pods on all
-nodes (10-15), we were hitting etcd limits on its database size. That effectively
-means that etcd enters "alert mode" and stops accepting write requests.
-
-The underlying root cause is a combination of:
-
-- etcd keeping both the current state and a transaction log with copy-on-write
-- node heartbeats being potentially very large objects (note that images
-  are only one potential problem; the second is volumes, and customers
-  want to mount 100+ volumes to a single node) - they may easily exceed 15kB;
-  even though the patch sent over the network is small, etcd stores the
-  whole Node object
-- Kubelet sending heartbeats every 10s
-
-This proposal presents a proper solution for that problem.
-
-
-Note that currently (by default):
-
-- Lack of NodeStatus update for `<node-monitor-grace-period>` (default: 40s)
- results in NodeController marking node as NotReady (pods are no longer
- scheduled on that node)
-- Lack of NodeStatus updates for `<pod-eviction-timeout>` (default: 5m)
- results in NodeController starting pod evictions from that node
-
-We would like to preserve that behavior.
-
-
-### Goals
-
-- Reduce size of etcd by making node heartbeats cheaper
-
-### Non-Goals
-
-The following are nice-to-haves, but not primary goals:
-
-- Reduce resource usage (cpu/memory) of control plane (e.g. due to processing
- less and/or smaller objects)
-- Reduce watch-related load on Node objects
-
-## Proposal
-
-We propose introducing a new `Lease` built-in API in the newly created API group
-`coordination.k8s.io`. To make it easily reusable for other purposes, it will
-be namespaced. Its schema is as follows:
-
-```go
-type Lease struct {
- metav1.TypeMeta `json:",inline"`
- // Standard object's metadata.
- // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
- // +optional
- ObjectMeta metav1.ObjectMeta `json:"metadata,omitempty"`
-
- // Specification of the Lease.
- // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status
- // +optional
- Spec LeaseSpec `json:"spec,omitempty"`
-}
-
-type LeaseSpec struct {
- HolderIdentity string `json:"holderIdentity"`
- LeaseDurationSeconds int32 `json:"leaseDurationSeconds"`
- AcquireTime metav1.MicroTime `json:"acquireTime"`
- RenewTime metav1.MicroTime `json:"renewTime"`
- LeaseTransitions int32 `json:"leaseTransitions"`
-}
-```
-
-The Spec is effectively that of the already existing (and thus proven) [LeaderElectionRecord][].
-The only difference is using `MicroTime` instead of `Time` for better precision.
-That should hopefully allow us to go directly to Beta.
-
-We will use that object to represent the node heartbeat - for each Node there will
-be a corresponding `Lease` object with its Name equal to the Node name, in a newly
-created dedicated namespace (we considered using the `kube-system` namespace but
-decided that it's already too overloaded).
-That namespace should be created automatically (similarly to "default" and
-"kube-system", probably by NodeController) and never be deleted (so that nodes
-don't require permission for it).
-
-We considered using a CRD instead of a built-in API. However, even though CRDs are
-`the new way` for creating new APIs, they don't yet have versioning support
-and are significantly less performant (due to the current lack of protobuf support).
-We also don't know whether we could seamlessly transition storage from a CRD
-to a built-in API if we ran into performance or any other problems.
-As a result, we decided to proceed with a built-in API.
-
-
-With this new API in place, we will change Kubelet so that:
-
-1. Kubelet computes NodeStatus periodically every 10s (as it does now), but this is
-   independent of reporting the status
-1. Kubelet reports NodeStatus if:
-   - there was a meaningful change in it (initially we can probably assume that every
-     change is meaningful, including e.g. images on the node)
-   - or it didn’t report it over the last `node-status-update-period` seconds
-1. Kubelet creates and periodically updates its own Lease object, and the frequency
-   of those updates is independent of the NodeStatus update frequency.
-
-In the meantime, we will change `NodeController` to treat both updates of the NodeStatus
-object and updates of the new `Lease` object corresponding to a given node as a
-health signal from that Kubelet. This will make it work for both old
-and new Kubelets.
-
-We should also:
-
-1. audit all other existing core controllers to verify whether they require
-   similar changes in their logic ([ttl controller][] being one example)
-1. change the controller manager to auto-register the `Lease` API
-1. ensure that the `Lease` object is deleted when the corresponding node is
-   deleted (probably via owner references)
-1. [out-of-scope] migrate all LeaderElection code to use that API
-
-Once all the code changes are done, we will:
-
-1. start updating `Lease` object every 10s by default, at the same time
- reducing frequency of NodeStatus updates initially to 40s by default.
- We will reduce it further later.
- Note that it doesn't reduce frequency by which Kubelet sends "meaningful"
- changes - it only impacts the frequency of "lastHeartbeatTime" changes.
- <br> TODO: That still results in higher average QPS. It should be acceptable but
- needs to be verified.
-1. announce that we are going to reduce frequency of NodeStatus updates further
- and give people 1-2 releases to switch their code to use `Lease`
- object (if they relied on frequent NodeStatus changes)
-1. further reduce the NodeStatus update frequency to not less often than once per
-   minute.
-   We can’t stop updating NodeStatus periodically, as that would be an API-breaking change,
-   but it’s fine to reduce its frequency (though we should continue writing it at
-   least once per eviction period).
-
-
-To be considered:
-
-1. We may consider reducing frequency of NodeStatus updates to once every 5 minutes
- (instead of 1 minute). That would help with performance/scalability even more.
- Caveats:
- - NodeProblemDetector is currently updating (some) node conditions every 1 minute
- (unconditionally, because lastHeartbeatTime always changes). To make reduction
- of NodeStatus updates frequency really useful, we should also change NPD to
- work in a similar mode (check periodically if condition changes, but report only
- when something changed or no status was reported for a given time) and decrease
- its reporting frequency too.
-  - In general, we recommend keeping the frequencies of NodeStatus reporting in both
-    Kubelet and NodeProblemDetector in sync (once all changes are done), and
-    that should be reflected in the [NPD documentation][].
-  - Note that reducing the frequency to 1 minute already gives us an almost 6x improvement.
-    That seems more than enough for any foreseeable future, assuming we won’t
-    significantly increase the size of the Node object.
-    Note that if we keep adding node conditions owned by other components, the
-    number of writes to the Node object will go up. But that issue is separate from
-    this proposal.
-
-Other notes:
-
-1. An additional advantage of using `Lease` for that purpose is the
-   ability to exclude it from the audit profile and thus reduce the audit log footprint.
-
-[LeaderElectionRecord]: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L37
-[ttl controller]: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/ttl/ttl_controller.go#L155
-[NPD documentation]: https://kubernetes.io/docs/tasks/debug-application-cluster/monitor-node-health/
-[kubernetes/kubernetes#63667]: https://github.com/kubernetes/kubernetes/issues/63677
-
-### Risks and Mitigations
-
-Reducing the default frequency of NodeStatus updates may potentially break clients
-relying on frequent Node object updates. However, in non-managed solutions, customers
-will still be able to restore the previous behavior by setting appropriate flag values.
-Thus, changing the defaults to what we recommend is the path to go with.
-
-## Graduation Criteria
-
-The API can be immediately promoted to Beta, as it is effectively a copy of
-the already existing LeaderElectionRecord. It will be promoted to GA once it has spent
-a sufficient amount of time in Beta with no changes.
-
-The changes in components logic (Kubelet, NodeController) should be done behind
-a feature gate. We suggest making that enabled by default once the feature is
-implemented.
-
-## Implementation History
-
-- YYYY-MM-DD: KEP Summary, Motivation and Proposal merged
-
-## Alternatives
-
-We considered a number of alternatives, most important mentioned below.
-
-### Dedicated “heartbeat” object instead of “leader election” one
-
-Instead of introducing and using “lease” object, we considered
-introducing a dedicated “heartbeat” object for that purpose. Apart from that,
-all the details about the solution remain pretty much the same.
-
-Pros:
-
-- Conceptually easier to understand what the object is for
-
-Cons:
-
-- Introduces a new, narrow-purpose API. Lease is already used by other
- components, implemented using annotations on Endpoints and ConfigMaps.
-
-### Events instead of dedicated heartbeat object
-
-Instead of introducing a dedicated object, we considered using “Event” object
-for that purpose. At the high-level the solution looks very similar.
-The differences from the initial proposal are:
-
-- we use existing “Event” api instead of introducing a new API
-- we create a dedicated namespace; events that should be treated as healthiness
- signal by NodeController will be written by Kubelets (unconditionally) to that
- namespace
-- NodeController will be watching only Events from that namespace to avoid
- processing all events in the system (the volume of all events will be huge)
-- dedicated namespace also helps with security - we can give access to write to
- that namespace only to Kubelets
-
-Pros:
-
-- No need to introduce new API
- - We can use that approach much earlier due to that.
-- We already need to optimize event throughput - separate etcd instance we have
- for them may help with tuning
-- Low-risk roll-forward/roll-back: no new objects are involved (the node controller
-  starts watching events, and the kubelet just reduces the frequency of heartbeats)
-
-Cons:
-
-- Events are conceptually “best-effort” in the system:
-  - they may be silently dropped in case of problems in the system (the event recorder
-    library doesn’t retry on errors, e.g. to not make things worse when the control plane
-    is starved)
-  - currently, components reporting events don’t even know whether reporting succeeded (the
-    library is built in a way that you throw an event into it and are not notified whether
-    it was successfully submitted).
-    A Kubelet sending any other update has full control over how and whether to retry errors.
-  - lack of fairness mechanisms means that even when some events are being successfully
-    sent, there is no guarantee that any event from a given Kubelet will be submitted
-    over a given time period.
-  So this would require a different mechanism for reporting those “heartbeat” events.
-- Once we have a “request priority” concept, events should arguably have the lowest
-  priority, while node heartbeats are among the most important things in the system and
-  should have the highest: even though no single heartbeat is important, the guarantee
-  that some heartbeats are successfully sent is crucial (not delivering any of them
-  results in unnecessary evictions or in not scheduling pods to a given node).
-- No core component in the system is currently watching events
- - it would make system’s operation harder to explain
-- Users watch Node objects for heartbeats (even though we didn’t recommend it).
- Introducing a new object for the purpose of heartbeat will allow those users to
- migrate, while using events for that purpose breaks that ability. (Watching events
- may put us in tough situation also from performance reasons.)
-- Deleting all events (e.g. event etcd failure + playbook response) should continue to
- not cause a catastrophic failure and the design will need to account for this.
-
-### Reuse the Component Registration mechanisms
-
-Kubelet is one of the control-plane components (a shared controller). Some time ago, the
-Component Registration proposal converged into three parts:
-
-- Introducing an API for registering non-pod endpoints, including readiness information: #18610
-- Changing endpoints controller to also watch those endpoints
-- Identifying some of those endpoints as “components”
-
-We could reuse that mechanism to represent Kubelets as non-pod endpoint API.
-
-Pros:
-
-- Utilizes desired API
-
-Cons:
-
-- Requires introducing that new API
-- Stabilizing the API would take some time
-- Implementing that API requires multiple changes in different components
-
-### Split Node object into two parts at etcd level
-
-We may stick to the existing Node API and solve the problem at the storage layer. At a
-high level, this means splitting the Node object into two parts in etcd (a frequently
-modified one and the rest).
-
-Pros:
-
-- No need to introduce new API
-- No need to change any components other than kube-apiserver
-
-Cons:
-
-- Very complicated to support watch
-- Not very generic (e.g. splitting Spec and Status doesn’t help, it needs to be just
- heartbeat part)
-- [minor] Doesn’t reduce amount of data that should be processed in the system (writes,
- reads, watches, …)
-
-### Delta compression in etcd
-
-An alternative to the above is to solve this entirely at the etcd layer. To
-achieve that, instead of storing full updates in the etcd transaction log, we would just
-store “deltas” and snapshot the whole object only every X seconds/minutes.
-
-Pros:
-
-- Doesn’t require any changes to any Kubernetes components
-
-Cons:
-
-- Computing the delta is tricky (etcd doesn’t understand the Kubernetes data model, and
-  the delta between two protobuf-encoded objects is not necessarily small)
-- May require a major rewrite of etcd code and not even be accepted by its maintainers
-- More expensive computationally to get an object in a given resource version (which
- is what e.g. watch is doing)
-
-### Replace etcd with other database
-
-Instead of using etcd, we may also consider using some other open-source solution.
-
-Pros:
-
-- Doesn’t require new API
-
-Cons:
-
-- We don’t even know whether there exists a solution that solves our problems and can be used.
-- Migration would take us years.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+-->
\ No newline at end of file
diff --git a/keps/sig-node/0014-runtime-class.md b/keps/sig-node/0014-runtime-class.md
index 1d1cac28..cfd1f5fa 100644
--- a/keps/sig-node/0014-runtime-class.md
+++ b/keps/sig-node/0014-runtime-class.md
@@ -1,399 +1,4 @@
----
-kep-number: 14
-title: Runtime Class
-authors:
- - "@tallclair"
-owning-sig: sig-node
-participating-sigs:
- - sig-architecture
-reviewers:
- - dchen1107
- - derekwaynecarr
- - yujuhong
-approvers:
- - dchen1107
- - derekwaynecarr
-creation-date: 2018-06-19
-status: implementable
----
-
-# Runtime Class
-
-## Table of Contents
-
-* [Summary](#summary)
-* [Motivation](#motivation)
- * [Goals](#goals)
- * [Non\-Goals](#non-goals)
- * [User Stories](#user-stories)
-* [Proposal](#proposal)
- * [API](#api)
- * [Runtime Handler](#runtime-handler)
- * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts)
- * [Implementation Details](#implementation-details)
- * [Risks and Mitigations](#risks-and-mitigations)
-* [Graduation Criteria](#graduation-criteria)
-* [Implementation History](#implementation-history)
-* [Appendix](#appendix)
- * [Examples of runtime variation](#examples-of-runtime-variation)
-
-## Summary
-
-`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the
-control plane. RuntimeClasses are assigned to pods through a `runtimeClass` field on the
-`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node.
-
-## Motivation
-
-There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the
-primary motivator for this right now, with both Kata containers and gVisor looking to integrate with
-Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also
-require support in the future. RuntimeClass provides a way to select between different runtimes
-configured in the cluster and surface their properties (both to the cluster & the user).
-
-In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to
-the control plane level, including: accounting for runtime overhead, scheduling to nodes that
-support the runtime, and surfacing which optional features are supported by different
-runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a
-cluster-scoped resource tied to the runtime that can help solve these problems in a future update.
-
-[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit
-
-### Goals
-
-- Provide a mechanism for surfacing container runtime properties to the control plane
-- Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired
- runtime
-
-### Non-Goals
-
-- RuntimeClass is NOT RuntimeComponentConfig.
-- RuntimeClass is NOT a general policy mechanism.
-- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general
- RuntimeClass should not be a cross product of runtime properties and node properties.
-
-The following goals are out-of-scope for the initial implementation, but may be explored in a future
-iteration:
-
-- Surfacing support for optional features by runtimes, and surfacing errors caused by
- incompatible features & runtimes earlier.
-- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the
- cluster admin or provider), and are asserted to be an accurate representation of the runtime.
-- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster
- (different runtime configurations on different nodes) through scheduling primitives like
- `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and
- automatic runtime-aware scheduling is out-of-scope.
-- Define standardized or conformant runtime classes - although I would like to declare some
- predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP.
-- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice,
- the details of how pod resource overhead will be implemented is out of scope for this KEP.
-- Provide a mechanism to dynamically register or provision additional runtimes.
-- Requiring specific RuntimeClasses according to policy. This should be addressed by other
- cluster-level policy mechanisms, such as PodSecurityPolicy.
-- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and
- letting the system match an appropriate RuntimeClass, rather than explicitly assigning a
- RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a
- future iteration.
-
-[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit
-
-### User Stories
-
-- As a cluster operator, I want to provide multiple runtime options to support a wide variety of
- workloads. Examples include native Linux containers, "sandboxed" containers, and Windows
- containers.
-- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For
- example, rolling out an update with backwards incompatible changes or previously unsupported
- features.
-- As an application developer, I want to select the runtime that best fits my workload.
-- As an application developer, I don't want to study the nitty-gritty details of different runtime
- implementations, but rather choose from pre-configured classes.
-- As an application developer, I want my application to be portable across clusters that use similar
- but different variants of a "class" of runtimes.
-
-## Proposal
-
-The initial design includes:
-
-- `RuntimeClass` API resource definition
-- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with
-- Kubelet implementation for fetching & interpreting the RuntimeClass
-- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler).
-
-### API
-
-`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group.
-
-> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired.
-> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_
-
-_(This is a simplified declaration, syntactic details will be covered in the API PR review)_
-
-```go
-type RuntimeClass struct {
- metav1.TypeMeta
- // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class.
- // Namespace should be left blank.
- metav1.ObjectMeta
-
- Spec RuntimeClassSpec
-}
-
-type RuntimeClassSpec struct {
- // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
- // creation. The possible values are specific to a given configuration & CRI implementation.
- // The empty string is equivalent to the default behavior.
- // +optional
- RuntimeHandler string
-}
-```
-
-The runtime is selected by the pod by specifying the RuntimeClass in the PodSpec. Once the pod is
-scheduled, the RuntimeClass cannot be changed.
-
-```go
-type PodSpec struct {
- ...
- // RuntimeClassName refers to a RuntimeClass object with the same name,
- // which should be used to run this pod.
- // +optional
- RuntimeClassName string
- ...
-}
-```
-
-The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards
-compatible with current Kubernetes. This means that the legacy runtime does not specify any
-RuntimeHandler or perform any feature validation (all features are "supported").
-
-```go
-const (
- // RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy
- // RuntimeClass does not specify a runtime handler or perform any
- // feature validation.
- RuntimeClassNameLegacy = "legacy"
-)
-```
-
-An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is
-not defaulted to `legacy` (to leave room for configurable defaults in a future update).
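
As an illustrative sketch (not the actual kubelet code; the map and function shape are hypothetical), the naming rules above amount to:

```go
package main

import "fmt"

// runtimeClasses stands in for the cluster's RuntimeClass objects,
// mapping class name to its RuntimeHandler (hypothetical data).
var runtimeClasses = map[string]string{
	"gvisor":    "gvisor",
	"sandboxed": "gvisor",
}

// resolveHandler maps a pod's RuntimeClassName to the CRI RuntimeHandler.
// Both "" (unspecified) and the reserved "legacy" name resolve to the
// empty handler, i.e. the runtime's default behavior; any other name
// must match an existing RuntimeClass.
func resolveHandler(runtimeClassName string) (string, error) {
	if runtimeClassName == "" || runtimeClassName == "legacy" {
		return "", nil
	}
	handler, ok := runtimeClasses[runtimeClassName]
	if !ok {
		return "", fmt.Errorf("RuntimeClass %q not found", runtimeClassName)
	}
	return handler, nil
}

func main() {
	for _, name := range []string{"", "legacy", "sandboxed"} {
		handler, err := resolveHandler(name)
		fmt.Printf("%q -> %q (err: %v)\n", name, handler, err)
	}
}
```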
-
-#### Examples
-
-Suppose we operate a cluster that lets users choose between native runc containers, and gvisor and
-kata-container sandboxes. We might create the following runtime classes:
-
-```yaml
-kind: RuntimeClass
-apiVersion: node.k8s.io/v1alpha1
-metadata:
- name: native # equivalent to 'legacy' for now
-spec:
- runtimeHandler: runc
----
-kind: RuntimeClass
-apiVersion: node.k8s.io/v1alpha1
-metadata:
- name: gvisor
-spec:
- runtimeHandler: gvisor
----
-kind: RuntimeClass
-apiVersion: node.k8s.io/v1alpha1
-metadata:
- name: kata-containers
-spec:
- runtimeHandler: kata-containers
----
-# provides the default sandbox runtime when users don't care about which they're getting.
-kind: RuntimeClass
-apiVersion: node.k8s.io/v1alpha1
-metadata:
- name: sandboxed
-spec:
- runtimeHandler: gvisor
-```
-
-Then when a user creates a workload, they can choose the desired runtime class to use (or not, if
-they want the default).
-
-```yaml
-apiVersion: extensions/v1beta1
-kind: Deployment
-metadata:
- name: sandboxed-nginx
-spec:
- replicas: 2
- selector:
- matchLabels:
- app: sandboxed-nginx
- template:
- metadata:
- labels:
- app: sandboxed-nginx
- spec:
- runtimeClassName: sandboxed # <---- Reference the desired RuntimeClass
- containers:
- - name: nginx
- image: nginx
- ports:
- - containerPort: 80
- protocol: TCP
-```
-
-#### Runtime Handler
-
-The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`:
-
-```proto
-message RunPodSandboxRequest {
- // Configuration for creating a PodSandbox.
- PodSandboxConfig config = 1;
- // Named runtime configuration to use for this PodSandbox.
- string RuntimeHandler = 2;
-}
-```
-
-The RuntimeHandler is provided as a mechanism for CRI implementations to select between different
-predetermined configurations. The initial use case is replacing the experimental pod annotations
-currently used for selecting a sandboxed runtime by various CRI implementations:
-
-| CRI Runtime | Pod Annotation |
-| ------------|-------------------------------------------------------------|
-| CRIO | io.kubernetes.cri-o.TrustedSandbox: "false" |
-| containerd | io.kubernetes.cri.untrusted-workload: "true" |
-| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" |
-| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" |
-
-These implementations could stick with this binary scheme ("trusted" and "untrusted"), but the preferred
-approach is a non-binary one wherein arbitrary handlers can be configured with a name that can be
-matched against the specified RuntimeHandler. For example, containerd might have a configuration
-corresponding to a "kata-runtime" handler:
-
-```
-[plugins.cri.containerd.kata-runtime]
- runtime_type = "io.containerd.runtime.v1.linux"
- runtime_engine = "/opt/kata/bin/kata-runtime"
- runtime_root = ""
-```
-
-This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection
-(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox
-types (e.g. `kata-containers` or `gvisor` RuntimeClasses).
-
-### Versioning, Updates, and Rollouts
-
-Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha
-implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**,
-thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods
-must be updated to reference the new RuntimeClass, and comes with the advantage of native support
-for rolling updates through the same mechanisms as any other application update. The
-`RuntimeClassName` pod field is also immutable post scheduling.
-
-This conservative approach is preferred since it's much easier to relax constraints in a backwards
-compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass
-to beta.
-
-### Implementation Details
-
-The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is
-added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once
-resolved, the RuntimeHandler field is passed to the CRI as part of the
-[`RunPodSandboxRequest`][runpodsandbox]. At that point, the interpretation of the RuntimeHandler is
-left to the CRI implementation, but it should be cached if needed for subsequent calls.
-
-If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will
-be rejected in admission (controller to be detailed in a following update). If the RuntimeClass
-cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the Kubelet will fail
-the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If
-the `RuntimeHandler` is not recognized by the CRI implementation, then `RunPodSandbox` will return
-an error.
-
-[runpodsandbox]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344
-
-### Risks and Mitigations
-
-**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default
-dumping ground for every new feature exposed by the node. For each feature, careful consideration
-should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The
-[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features.
-
-**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for
-PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying
-runtime implementation should be extremely limited (generally only around updates & rollouts). To
-enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to
-restrict a user to a specific RuntimeClass, you must use another policy mechanism such as
-PodSecurityPolicy.
-
-**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity
-of runtime configuration from most users (aside from the cluster admin or provisioner). However, we
-are still side-stepping the issue of precisely defining specific types of runtimes, such as
-"sandboxed", and it remains up for debate whether precisely defining such runtime categories is
-even possible. RuntimeClass allows us to decouple this specification from the implementation, but
-it is still something I hope we can address in a future iteration through the concept of pre-defined
-or "conformant" RuntimeClasses.
-
-**Non-portability.** We are already in a world of non-portability for many features (see [examples
-of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can help
-address this issue by formally declaring supported features, or by automatically matching a runtime
-that supports a given workload. Another issue is that pods need to refer to a RuntimeClass by name,
-which may not be defined in every cluster. This is something that can be addressed through
-pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible
-RuntimeClasses.
-
-## Graduation Criteria
-
-Alpha:
-
-- Everything described in the current proposal:
- - Introduce the RuntimeClass API resource
- - Add a RuntimeClassName field to the PodSpec
- - Add a RuntimeHandler field to the CRI `RunPodSandboxRequest`
- - Lookup the RuntimeClass for pods & plumb through the RuntimeHandler in the Kubelet (feature
- gated)
-- RuntimeClass support in at least one CRI runtime & dockershim
- - Runtime Handlers can be statically configured by the runtime, and referenced via RuntimeClass
-  - An error is reported when the handler is unknown or unsupported
-- Testing
- - [CRI validation tests][cri-validation]
- - Kubernetes E2E tests (only validating single runtime handler cases)
-
-[cri-validation]: https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/validation.md
-
-Beta:
-
-- Most runtimes support RuntimeClass, and the current [untrusted annotations](#runtime-handler) are
- deprecated.
-- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass
-- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary.
-- The cluster admin can choose which RuntimeClass is the default in a cluster.
-- Additional requirements TBD
-
-## Implementation History
-
-- 2018-06-11: SIG-Node decision to move forward with proposal
-- 2018-06-19: Initial KEP published.
-
-## Appendix
-
-### Examples of runtime variation
-
-- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods,
- but those are mutually exclusive, and support of either is not required by the runtime. The
- default configuration is also not well defined.
-- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is
- defined by the runtime, and support is not guaranteed.
-- Windows containers - isolation features are very OS-specific, and most of the current features are
- limited to linux. As we build out Windows container support, we'll need to add windows-specific
- features as well.
-- Host namespaces (Network,PID,IPC) may not be supported by virtualization-based runtimes
- (e.g. Kata-containers & gVisor).
-- Per-pod and Per-container resource overhead varies by runtime.
-- Device support (e.g. GPUs) varies wildly by runtime & nodes.
-- Supported volume types varies by node - it remains TBD whether this information belongs in
- RuntimeClass.
-- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may
- have differing defaults, or support a subset of capabilities.
-- `Privileged` mode is not well defined, and thus may have differing implementations.
-- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed
- workloads)
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+--> \ No newline at end of file
diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md
index a6c5aaba..cfd1f5fa 100644
--- a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md
+++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md
@@ -1,807 +1,4 @@
----
-kep-number: 0
-title: Quotas for Ephemeral Storage
-authors:
- - "@RobertKrawitz"
-owning-sig: sig-xxx
-participating-sigs:
- - sig-node
-reviewers:
- - TBD
-approvers:
- - "@dchen1107"
- - "@derekwaynecarr"
-editor: TBD
-creation-date: yyyy-mm-dd
-last-updated: yyyy-mm-dd
-status: provisional
-see-also:
-replaces:
-superseded-by:
----
-
-# Quotas for Ephemeral Storage
-
-## Table of Contents
-<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again -->
-**Table of Contents**
-
-- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage)
- - [Table of Contents](#table-of-contents)
- - [Summary](#summary)
- - [Project Quotas](#project-quotas)
- - [Motivation](#motivation)
- - [Goals](#goals)
- - [Non-Goals](#non-goals)
- - [Future Work](#future-work)
- - [Proposal](#proposal)
- - [Control over Use of Quotas](#control-over-use-of-quotas)
- - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota)
- - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption)
- - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota)
- - [Operation Notes](#operation-notes)
- - [Selecting a Project ID](#selecting-a-project-id)
- - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory)
- - [Return a Project ID To the System](#return-a-project-id-to-the-system)
- - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
- - [Notes on Implementation](#notes-on-implementation)
- - [Notes on Code Changes](#notes-on-code-changes)
- - [Testing Strategy](#testing-strategy)
- - [Risks and Mitigations](#risks-and-mitigations)
- - [Graduation Criteria](#graduation-criteria)
- - [Implementation History](#implementation-history)
- - [Drawbacks [optional]](#drawbacks-optional)
- - [Alternatives [optional]](#alternatives-optional)
- - [Alternative quota-based implementation](#alternative-quota-based-implementation)
- - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation)
- - [Infrastructure Needed [optional]](#infrastructure-needed-optional)
- - [References](#references)
- - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas)
- - [CVE](#cve)
- - [Other Security Issues Without CVE](#other-security-issues-without-cve)
- - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012)
-
-<!-- markdown-toc end -->
-
-[Tools for generating]: https://github.com/ekalinin/github-markdown-toc
-
-## Summary
-
-This proposal applies to the use of quotas for ephemeral-storage
-metrics gathering. Use of quotas for ephemeral-storage limit
-enforcement is a [non-goal](#non-goals), but as the architecture and
-code will be very similar, there are comments interspersed related to
-enforcement. _These comments will be italicized_.
-
-Local storage capacity isolation, aka ephemeral-storage, was
-introduced into Kubernetes via
-<https://github.com/kubernetes/features/issues/361>. It provides
-support for capacity isolation of shared storage between pods, such
-that a pod can be limited in its consumption of shared resources and
-can be evicted if its consumption of shared storage exceeds that
-limit. The limits and requests for shared ephemeral-storage are
-similar to those for memory and CPU consumption.
-
-The current mechanism relies on periodically walking each ephemeral
-volume (emptydir, logdir, or container writable layer) and summing the
-space consumption. This method is slow, can be fooled, and has high
-latency (i.e., a pod could consume a lot of storage prior to the
-kubelet being aware of its overage and terminating it).
-
-The mechanism proposed here utilizes filesystem project quotas to
-provide monitoring of resource consumption _and optionally enforcement
-of limits._ Project quotas, initially in XFS and more recently ported
-to ext4fs, offer a kernel-based means of monitoring _and restricting_
-filesystem consumption that can be applied to one or more directories.
-
-A prototype is in progress; see <https://github.com/kubernetes/kubernetes/pull/66928>.
-
-### Project Quotas
-
-Project quotas are a form of filesystem quota that apply to arbitrary
-groups of files, as opposed to file user or group ownership. They
-were first implemented in XFS, as described here:
-<http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html>.
-
-Project quotas for ext4fs were [proposed in late
-2014](https://lwn.net/Articles/623835/) and added to the Linux kernel
-in early 2016, with
-commit
-[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e).
-They were designed to be compatible with XFS project quotas.
-
-Each inode contains a 32-bit project ID, to which optionally quotas
-(hard and soft limits for blocks and inodes) may be applied. The
-total blocks and inodes for all files with the given project ID are
-maintained by the kernel. Project quotas can be managed from
-userspace by means of the `xfs_quota(8)` command in foreign filesystem
-(`-f`) mode; the traditional Linux quota tools do not manipulate
-project quotas. Programmatically, they are managed by the `quotactl(2)`
-system call, using in part the standard quota commands and in part the
-XFS quota commands; the man page implies incorrectly that the XFS
-quota commands apply only to XFS filesystems.
-
-The project ID applied to a directory is inherited by files created
-under it. Files cannot be (hard) linked across directories with
-different project IDs. A file's project ID cannot be changed by a
-non-privileged user, but a privileged user may use the `xfs_io(8)`
-command to change the project ID of a file.
-
-Filesystems using project quotas may be mounted with quotas either
-enforced or not; the non-enforcing mode tracks usage without enforcing
-it. A non-enforcing project quota may be implemented on a filesystem
-mounted with enforcing quotas by setting a quota too large to be hit.
-The maximum size that can be set varies with the filesystem; on a
-64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for
-ext4fs.
-
-Conventionally, project quota mappings are stored in `/etc/projects` and
-`/etc/projid`; these files exist for user convenience and do not have
-any direct importance to the kernel. `/etc/projects` contains a mapping
-from project ID to directory/file; this can be a one to many mapping
-(the same project ID can apply to multiple directories or files, but
-any given directory/file can be assigned only one project ID).
-`/etc/projid` contains a mapping from named projects to project IDs.
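
For concreteness, a hypothetical pair of entries (project name, ID, and path invented for illustration) would look like:

```
# /etc/projid -- maps a project name to its numeric ID (name:id)
volume1048577:1048577

# /etc/projects -- maps a project ID to a directory (id:path)
1048577:/var/lib/kubelet/pods/d7810e6c/volumes/kubernetes.io~empty-dir/vol
```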
-
-This proposal utilizes hard project quotas for both monitoring _and
-enforcement_. Soft quotas are of no utility; they allow for temporary
-overage that, after a programmable period of time, is converted to the
-hard quota limit.
-
-
-## Motivation
-
-The mechanism presently used to monitor storage consumption involves
-use of `du` and `find` to periodically gather information about
-storage and inode consumption of volumes. This mechanism suffers from
-a number of drawbacks:
-
-* It is slow. If a volume contains a large number of files, walking
- the directory can take a significant amount of time. There has been
- at least one known report of nodes becoming not ready due to volume
- metrics: <https://github.com/kubernetes/kubernetes/issues/62917>
-* It is possible to conceal a file from the walker by creating it and
- removing it while holding an open file descriptor on it. POSIX
- behavior is to not remove the file until the last open file
-  descriptor referring to it is closed. This has legitimate uses; it
- ensures that a temporary file is deleted when the processes using it
- exit, and it minimizes the attack surface by not having a file that
- can be found by an attacker. The following pod does this; it will
- never be caught by the present mechanism:
-
-```yaml
-apiVersion: v1
-kind: Pod
-metadata:
- name: "diskhog"
-spec:
- containers:
- - name: "perl"
- resources:
- limits:
- ephemeral-storage: "2048Ki"
- image: "perl"
- command:
- - perl
- - -e
- - >
- my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
- volumeMounts:
- - name: a
- mountPath: /data/a
- volumes:
- - name: a
- emptyDir: {}
-```
-* It is reactive rather than proactive. It does not prevent a pod
- from overshooting its limit; at best it catches it after the fact.
- On a fast storage medium, such as NVMe, a pod may write 50 GB or
- more of data before the housekeeping performed once per minute
- catches up to it. If the primary volume is the root partition, this
- will completely fill the partition, possibly causing serious
- problems elsewhere on the system. This proposal does not address
- this issue; _a future enforcing project would_.
-
-In many environments, these issues may not matter, but shared
-multi-tenant environments need these issues addressed.
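
The concealment technique used by the pod above can be reproduced in a few lines. This self-contained sketch (not KEP code) shows a du-style walk reporting zero bytes while the deleted-but-still-open file pins storage that a project quota would account for:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// measure creates a file in a temp directory, unlinks it, writes 1 MiB
// through the still-open descriptor, and returns what a directory walk
// sees versus what the descriptor actually holds.
func measure() (walkedBytes, openBytes int64, err error) {
	dir, err := os.MkdirTemp("", "walker-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	f, err := os.Create(filepath.Join(dir, "hidden"))
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	// Unlink first, then keep writing through the open descriptor.
	if err := os.Remove(f.Name()); err != nil {
		return 0, 0, err
	}
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		return 0, 0, err
	}

	// A du-style walk of the directory finds nothing...
	walkErr := filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			walkedBytes += info.Size()
		}
		return nil
	})
	if walkErr != nil {
		return 0, 0, walkErr
	}

	// ...but the open descriptor still holds 1 MiB of storage.
	st, err := f.Stat()
	if err != nil {
		return 0, 0, err
	}
	return walkedBytes, st.Size(), nil
}

func main() {
	walked, open, err := measure()
	if err != nil {
		panic(err)
	}
	fmt.Printf("walk sees %d bytes, descriptor holds %d bytes\n", walked, open)
}
```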
-
-### Goals
-
-These goals apply only to local ephemeral storage, as described in
-<https://github.com/kubernetes/features/issues/361>.
-
-* Primary: improve performance of monitoring by using project quotas
- in a non-enforcing way to collect information about storage
- utilization of ephemeral volumes.
-* Primary: detect storage used by pods that is concealed by deleted
- files being held open.
-* Primary: ensure that use of project quotas does not interfere with
-  the more common user and group quotas.
-
-### Non-Goals
-
-* Application to storage other than local ephemeral storage.
-* Application to container copy on write layers. That will be managed
- by the container runtime. For a future project, we should work with
- the runtimes to use quotas for their monitoring.
-* Elimination of eviction as a means of enforcing ephemeral-storage
- limits. Pods that hit their ephemeral-storage limit will still be
- evicted by the kubelet even if their storage has been capped by
- enforcing quotas.
-* Enforcing node allocatable (a limit over the sum of all pods' disk
-  usage, including e.g. images).
-* Enforcing limits on total pod storage consumption by any means, such
- that the pod would be hard restricted to the desired storage limit.
-
-### Future Work
-
-* _Enforce limits on per-volume storage consumption by using
- enforced project quotas._
-
-## Proposal
-
-This proposal applies project quotas to emptydir volumes on qualifying
-filesystems (ext4fs and xfs with project quotas enabled). Project
-quotas are applied by selecting an unused project ID (a 32-bit
-unsigned integer), setting a limit on space and/or inode consumption,
-and attaching the ID to one or more files. By default (and as
-utilized herein), if a project ID is attached to a directory, it is
-inherited by any files created under that directory.
-
-_If we elect to use the quota as enforcing, we impose a quota
-consistent with the desired limit._ If we elect to use it as
-non-enforcing, we impose a large quota that in practice cannot be
-exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs).
-
-### Control over Use of Quotas
-
-At present, two feature gates control operation of quotas:
-
-* `LocalStorageCapacityIsolation` must be enabled for any use of
- quotas.
-
-* `LocalStorageCapacityIsolationFSMonitoring` must be enabled in addition. If this is
- enabled, quotas are used for monitoring, but not enforcement. At
- present, this defaults to False, but the intention is that this will
- default to True by initial release.
-
-* _`LocalStorageCapacityIsolationFSEnforcement` must be enabled, in addition to
- `LocalStorageCapacityIsolationFSMonitoring`, to use quotas for enforcement._
-
-### Operation Flow -- Applying a Quota
-
-* Caller (emptydir volume manager or container runtime) creates an
- emptydir volume, with an empty directory at a location of its
- choice.
-* Caller requests that a quota be applied to a directory.
-* Determine whether a quota can be imposed on the directory, by asking
- each quota provider (one per filesystem type) whether it can apply a
- quota to the directory. If no provider claims the directory, an
- error status is returned to the caller.
-* Select an unused project ID ([see below](#selecting-a-project-id)).
-* Set the desired limit on the project ID, in a filesystem-dependent
- manner ([see below](#notes-on-implementation)).
-* Apply the project ID to the directory in question, in a
- filesystem-dependent manner.
-
-An error at any point results in no quota being applied and no change
-to the state of the system. The caller in general should not assume a
-priori that the attempt will be successful. It could choose to reject
-a request if a quota cannot be applied, but at this time it will
-simply ignore the error and proceed as today.
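
The flow above can be sketched as follows; the provider interface and function names are illustrative, not the prototype's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// quotaProvider is a hypothetical per-filesystem-type provider
// (one each for xfs and ext4fs in this proposal).
type quotaProvider interface {
	// CanApply reports whether this provider can impose a project
	// quota on the given directory.
	CanApply(dir string) bool
	// Apply sets limitBytes on projectID and attaches the ID to dir.
	Apply(dir string, projectID uint32, limitBytes uint64) error
}

// assignQuota mirrors the operation flow: ask each provider to claim the
// directory, pick an unused project ID, set the limit, attach the ID.
// nextProjectID stands in for the ID-selection algorithm described later.
func assignQuota(dir string, limitBytes uint64, providers []quotaProvider, nextProjectID func() uint32) error {
	for _, p := range providers {
		if p.CanApply(dir) {
			return p.Apply(dir, nextProjectID(), limitBytes)
		}
	}
	// No provider claims the directory: report an error, leaving the
	// system unchanged; the caller falls back to today's behavior.
	return errors.New("no quota provider claims " + dir)
}

// fakeProvider claims directories under a fixed root (for illustration).
type fakeProvider struct{ root string }

func (f fakeProvider) CanApply(dir string) bool { return strings.HasPrefix(dir, f.root) }

func (f fakeProvider) Apply(dir string, id uint32, limit uint64) error {
	fmt.Printf("project %d: limit %d bytes on %s\n", id, limit, dir)
	return nil
}

func main() {
	id := uint32(1048576)
	next := func() uint32 { id++; return id }
	providers := []quotaProvider{fakeProvider{root: "/var/lib/kubelet"}}
	fmt.Println(assignQuota("/var/lib/kubelet/pods/p1/volumes/a", 2<<20, providers, next))
	fmt.Println(assignQuota("/mnt/elsewhere", 2<<20, providers, next))
}
```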
-
-### Operation Flow -- Retrieving Storage Consumption
-
-* Caller (kubelet metrics code, cadvisor, container runtime) asks the
- quota code to compute the amount of storage used under the
- directory.
-* Determine whether a quota applies to the directory, in a
- filesystem-dependent manner ([see below](#notes-on-implementation)).
-* If so, determine how much storage or how many inodes are utilized,
- in a filesystem dependent manner.
-
-If the quota code is unable to retrieve the consumption, it returns an
-error status and it is up to the caller to utilize a fallback
-mechanism (such as the directory walk performed today).
-
-### Operation Flow -- Removing a Quota.
-
-* Caller requests that the quota be removed from a directory.
-* Determine whether a project quota applies to the directory.
-* Remove the limit from the project ID associated with the directory.
-* Remove the association between the directory and the project ID.
-* Return the project ID to the system to allow its use elsewhere ([see
- below](#return-a-project-id-to-the-system)).
-* Caller may delete the directory and its contents (normally it will).
-
-### Operation Notes
-
-#### Selecting a Project ID
-
-Project IDs are a shared space within a filesystem. If the same
-project ID is assigned to multiple directories, the space consumption
-reported by the quota will be the sum of that of all of the
-directories. Hence, it is important to ensure that each directory is
-assigned a unique project ID (unless it is desired to pool the storage
-use of multiple directories).
-
-The canonical mechanism to record persistently that a project ID is
-reserved is to store it in the `/etc/projid` (`projid[5]`) and/or
-`/etc/projects` (`projects(5)`) files. However, it is possible to utilize
-project IDs without recording them in those files; they exist for
-administrative convenience but neither the kernel nor the filesystem
-is aware of them. Other ways can be used to determine whether a
-project ID is in active use on a given filesystem:
-
-* The quota values (in blocks and/or inodes) assigned to the project
- ID are non-zero.
-* The storage consumption (in blocks and/or inodes) reported under the
- project ID are non-zero.
-
-The algorithm to be used is as follows:
-
-* Lock this instance of the quota code against re-entrancy.
-* Open and `flock()` the `/etc/projects` and `/etc/projid` files, so that
- other uses of this code are excluded.
-* Start from a high number (the prototype uses 1048577).
-* Iterate from there, performing the following tests:
- * Is the ID reserved by this instance of the quota code?
- * Is the ID present in `/etc/projects`?
- * Is the ID present in `/etc/projid`?
- * Are the quota values and/or consumption reported by the kernel
- non-zero? This test is restricted to 128 iterations to ensure
- that a bug here or elsewhere does not result in an infinite loop
- looking for a quota ID.
-* If an ID has been found:
- * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so
- that any other uses of project quotas do not reuse it.
- * Write temporary copies of `/etc/projects` and `/etc/projid` that are
- `flock()`ed
- * If successful, rename the temporary files appropriately (if
- rename of one succeeds but the other fails, we have a problem
- that we cannot recover from, and the files may be inconsistent).
-* Unlock `/etc/projid` and `/etc/projects`.
-* Unlock this instance of the quota code.
-
-A minor variation of this is used if we want to reuse an existing
-quota ID.
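
The search can be sketched as follows; the reserved set and the probe callback stand in for the `/etc` files and the `quotactl(2)` queries, and the locking and file-rewriting steps are elided:

```go
package main

import (
	"errors"
	"fmt"
)

// pickProjectID scans upward from firstID for an ID that is neither
// reserved locally (in-memory plus /etc/projects and /etc/projid, here
// collapsed into one set) nor reported in use by the kernel. At most
// 128 kernel probes may report "in use" before the search gives up, so
// a bug elsewhere cannot turn this into an unbounded loop.
func pickProjectID(firstID uint32, reserved map[uint32]bool, inUseOnFS func(uint32) bool) (uint32, error) {
	inUseProbes := 0
	for id := firstID; id != 0; id++ { // wraparound to 0 terminates the loop
		if reserved[id] {
			continue
		}
		if inUseOnFS(id) {
			inUseProbes++
			if inUseProbes >= 128 {
				return 0, errors.New("gave up after 128 in-use project IDs")
			}
			continue
		}
		reserved[id] = true // would also be recorded in /etc/projects and /etc/projid
		return id, nil
	}
	return 0, errors.New("no free project ID")
}

func main() {
	reserved := map[uint32]bool{1048577: true}              // already listed in /etc/projid, say
	inUse := func(id uint32) bool { return id == 1048578 }  // kernel reports non-zero usage
	fmt.Println(pickProjectID(1048577, reserved, inUse))
}
```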
-
-#### Determine Whether a Project ID Applies To a Directory
-
-It is possible to determine whether a directory has a project ID
-applied to it by requesting (via the `quotactl(2)` system call) the
-project ID associated with the directory. While the specifics are
-filesystem-dependent, the basic method is the same for at least XFS
-and ext4fs.
-
-It is not possible to determine in constant time the directory or
-directories to which a project ID is applied. It is possible to
-determine whether a given project ID has been applied to some existing
-directory or files (although which ones cannot be determined cheaply);
-the reported consumption will be non-zero.
-
-The code records internally the project ID applied to a directory, but
-it cannot always rely on this. In particular, if the kubelet has
-exited and has been restarted (and hence the quota applying to the
-directory should be removed), the map from directory to project ID is
-lost. If it cannot find a map entry, it falls back on the approach
-discussed above.
-
-#### Return a Project ID To the System
-
-The algorithm used to return a project ID to the system is very
-similar to the algorithm used to select one, except that the entry is
-removed rather than added. It performs the same sequence of locking
-`/etc/projects` and `/etc/projid`, editing copies of the files, and
-restoring them.
-
-If the project ID is applied to multiple directories and the code can
-determine that, it will not remove the project ID from `/etc/projid`
-until the last reference is removed. While it is not anticipated in
-this KEP that this mode of operation will be used, at least initially,
-this can be detected even on kubelet restart by looking at the
-reference count in `/etc/projects`.
-
-
-### Implementation Details/Notes/Constraints [optional]
-
-#### Notes on Implementation
-
-The primary new interface defined is the quota interface in
-`pkg/volume/util/quota/quota.go`. This defines five operations:
-
-* Does the specified directory support quotas?
-
-* Assign a quota to a directory. If a non-empty pod UID is provided,
- the quota assigned is that of any other directories under this pod
- UID; if an empty pod UID is provided, a unique quota is assigned.
-
-* Retrieve the consumption of the specified directory. If the quota
- code cannot handle it efficiently, it returns an error and the
- caller falls back on existing mechanism.
-
-* Retrieve the inode consumption of the specified directory; same
- description as above.
-
-* Remove quota from a directory. If a non-empty pod UID is passed, it
- is checked against that recorded in-memory (if any). The quota is
- removed from the specified directory. This can be used even if
- AssignQuota has not been used; it inspects the directory and removes
- the quota from it. This permits stale quotas from an interrupted
- kubelet to be cleaned up.
-
-Two implementations are provided: `quota_linux.go` (for Linux) and
-`quota_unsupported.go` (for other operating systems). The latter
-returns an error for all requests.
-
-As the quota mechanism is intended to support multiple filesystems,
-and different filesystems require different low level code for
-manipulating quotas, a provider is supplied that finds an appropriate
-quota applier implementation for the filesystem in question. The low
-level quota applier provides similar operations to the top level quota
-code, with two exceptions:
-
-* No operation exists to determine whether a quota can be applied
- (that is handled by the provider).
-
-* An additional operation is provided to determine whether a given
- quota ID is in use within the filesystem (outside of `/etc/projects`
- and `/etc/projid`).
-
-The two quota providers in the initial implementation are in
-`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While
-some quota operations do require different system calls, a lot of the
-code is common, and factored into
-`pkg/volume/util/quota/common/quota_linux_common_impl.go`.
-
-#### Notes on Code Changes
-
-The prototype for this project is mostly self-contained within
-`pkg/volume/util/quota` and a few changes to
-`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were
-required elsewhere:
-
-* The operation executor needs to pass the desired size limit to the
- volume plugin where appropriate so that the volume plugin can impose
- a quota. The limit is passed as 0 (do not use quotas), a positive
- number (impose an enforcing quota if possible, measured in bytes),
- or -1 (impose a non-enforcing quota, if possible) on the volume.
-
- This requires changes to
- `pkg/volume/util/operationexecutor/operation_executor.go` (to add
- `DesiredSizeLimit` to `VolumeToMount`),
- `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and
- `pkg/kubelet/eviction/helpers.go` (the latter in order to determine
- whether the volume is a local ephemeral one).
-
-* The volume manager (in `pkg/volume/volume.go`) changes the
- `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new
- `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to
- allow passing the desired size and pod UID (in the event we choose
- to implement quotas shared between multiple volumes; [see
- below](#alternative-quota-based-implementation)). This required
- small changes to all volume plugins and their tests, but will in the
- future allow adding additional data without having to change code
- other than that which uses the new information.
-
-#### Testing Strategy
-
-The quota code is by and large not very amenable to unit tests. While
-there are simple unit tests for parsing the mounts file, and there
-could be tests for parsing the projects and projid files, the real
-work (and risk) involves interactions with the kernel and with
-multiple instances of this code (e.g. in the kubelet and the runtime
-manager, particularly under stress). It also requires setup in the
-form of a prepared filesystem. It would be better served by
-appropriate end-to-end tests.
-
-### Risks and Mitigations
-
-* The SIG raised the possibility of a container being unable to exit
- should we enforce quotas, and the quota interferes with writing the
- log. This can be mitigated by either not applying a quota to the
- log directory and using the du mechanism, or by applying a separate
- non-enforcing quota to the log directory.
-
- As log directories are write-only by the container, and consumption
- can be limited by other means (as the log is filtered by the
- runtime), I do not consider the ability to write uncapped to the log
- to be a serious exposure.
-
- Note in addition that even without quotas it is possible for writes
- to fail due to lack of filesystem space, which is effectively (and
- in some cases operationally) indistinguishable from exceeding quota,
- so even at present code must be able to handle those situations.
-
-* Filesystem quotas may impact performance to an unknown degree.
- Information on that is hard to come by in general, and one of the
- reasons for using quotas is indeed to improve performance. If this
- is a problem in the field, merely turning off quotas (or selectively
- disabling project quotas) on the filesystem in question will avoid
- the problem. Against the possibility that this cannot be done
- (because project quotas are needed for other purposes), we should
- provide a way to disable use of quotas altogether via a feature
- gate.
-
- A report <https://blog.pythonanywhere.com/110/> notes that an
- unclean shutdown on Linux kernel versions between 3.11 and 3.17 can
- result in a prolonged downtime while quota information is restored.
- Unfortunately, [the link referenced
- here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no
- longer available.
-
-* Bugs in the quota code could result in a variety of regression
- behavior. For example, if a quota is incorrectly applied it could
- result in the inability to write any data to the volume. This could
- be mitigated by use of non-enforcing quotas. XFS in particular
- offers the `pqnoenforce` mount option that makes all quotas
- non-enforcing.
-
-
-## Graduation Criteria
-
-How will we know that this has succeeded? Gathering user feedback is
-crucial for building high quality experiences and SIGs have the
-important responsibility of setting milestones for stability and
-completeness. Hopefully the content previously contained in [umbrella
-issues][] will be tracked in the `Graduation Criteria` section.
-
-[umbrella issues]: N/A
-
-## Implementation History
-
-Major milestones in the life cycle of a KEP should be tracked in
-`Implementation History`. Major milestones might include
-
-- the `Summary` and `Motivation` sections being merged signaling SIG
- acceptance
-- the `Proposal` section being merged signaling agreement on a
- proposed design
-- the date implementation started
-- the first Kubernetes release where an initial version of the KEP was
- available
-- the version of Kubernetes where the KEP graduated to general
- availability
-- when the KEP was retired or superseded
-
-## Drawbacks [optional]
-
-* Use of quotas, particularly the less commonly used project quotas,
- requires additional action on the part of the administrator. In
- particular:
- * ext4fs filesystems must be created with additional options that
- are not enabled by default:
-```
-mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_
-```
- * An additional option (`prjquota`) must be applied in `/etc/fstab`
- * If the root filesystem is to be quota-enabled, it must be set in
- the grub options.
-* Use of project quotas for this purpose will preclude future use
- within containers.
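
Concretely, the mount-time piece of the administrator setup above might look like the following; the device, mount point, and grub syntax are illustrative and vary by distribution and e2fsprogs version:

```
# /etc/fstab -- mount with project quotas enabled:
/dev/sdb1  /var/lib/kubelet  ext4  defaults,prjquota  0 2

# For a quota-enabled root filesystem, the option must instead be passed
# on the kernel command line via the grub configuration, e.g.:
#   rootflags=prjquota
```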
-
-## Alternatives [optional]
-
-I have considered two classes of alternatives:
-
-* Alternatives based on quotas, with different implementation
-
-* Alternatives based on loop filesystems without use of quotas
-
-### Alternative quota-based implementation
-
-Within the basic framework of using quotas to monitor and potentially
-enforce storage utilization, there are a number of possible options:
-
-* Utilize per-volume non-enforcing quotas to monitor storage (the
- first stage of this proposal).
-
- This mostly preserves the current behavior, but with more efficient
- determination of storage utilization and the possibility of building
- further on it. The one change from current behavior is the ability
- to detect space used by deleted files.
-
-* Utilize per-volume enforcing quotas to monitor and enforce storage
- (the second stage of this proposal).
-
- This allows partial enforcement of storage limits. As local storage
- capacity isolation works at the level of the pod, and we have no
- control of user utilization of ephemeral volumes, we would have to
- give each volume a quota of the full limit. For example, if a pod
- had a limit of 1 MB but had four ephemeral volumes mounted, it would
- be possible for storage utilization to reach (at least temporarily)
- 4MB before being capped.
-
-* Utilize per-pod enforcing user or group quotas to enforce storage
- consumption, and per-volume non-enforcing quotas for monitoring.
-
- This would offer the best of both worlds: a fully capped storage
- limit combined with efficient reporting. However, it would require
- each pod to run under a distinct UID or GID. This may prevent pods
- from using setuid or setgid or their variants, and would interfere
- with any other use of group or user quotas within Kubernetes.
-
-* Utilize per-pod enforcing quotas to monitor and enforce storage.
-
- This allows for full enforcement of storage limits, at the expense
- of being able to efficiently monitor per-volume storage
- consumption. As there have already been reports of monitoring
- causing trouble, I do not advise this option.
-
- A variant of this would report (1/N) storage for each covered
- volume, so with a pod with a 4MiB quota and 1MiB total consumption,
- spread across 4 ephemeral volumes, each volume would report a
- consumption of 256 KiB. Another variant would change the API to
- report statistics for all ephemeral volumes combined. I do not
- advise this option.
-
-### Alternative loop filesystem-based implementation
-
-Another way of isolating storage is to utilize filesystems of
-pre-determined size, using the loop filesystem facility within Linux.
-It is possible to create a file and run `mkfs(8)` on it, and then to
-mount that filesystem on the desired directory. This both limits the
-storage available within that directory and enables quick retrieval of
-it via `statfs(2)`.
-
-Cleanup of such a filesystem involves unmounting it and removing the
-backing file.
-
-The backing file can be created as a sparse file, and the `discard`
-option can be used to return unused space to the system, allowing for
-thin provisioning.
-
-I conducted preliminary investigations into this. While at first it
-appeared promising, it turned out to have multiple critical flaws:
-
-* If the filesystem is mounted without the `discard` option, it can
- grow to the full size of the backing file, negating any possibility
- of thin provisioning. If the file is created dense in the first
- place, there is never any possibility of thin provisioning without
- use of `discard`.
-
- If the backing file is created densely, it additionally may require
- significant time to create if the ephemeral limit is large.
-
-* If the filesystem is mounted `nosync`, and is sparse, it is possible
- for writes to succeed and then fail later with I/O errors when
- synced to the backing storage. This will lead to data corruption
- that cannot be detected at the time of write.
-
- This can easily be reproduced by e.g. creating a 64MB filesystem
- and within it creating a 128MB sparse file and building a filesystem
- on it. When that filesystem is in turn mounted, writes to it will
- succeed, but I/O errors will be seen in the log and the file will be
- incomplete:
-
-```
-# mkdir /var/tmp/d1 /var/tmp/d2
-# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
-# mkfs.ext4 /var/tmp/fs1
-# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
-# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
-# mkfs.ext4 /var/tmp/d1/fs2
-# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
-# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
- ...will normally succeed...
-# sync
- ...fails with I/O error!...
-```
-
-* If the filesystem is mounted `sync`, all writes to it are
- immediately committed to the backing store, and the `dd` operation
- above fails as soon as it fills up `/var/tmp/d1`. However,
- performance is drastically slowed, particularly with small writes;
- with 1K writes, I observed performance degradation in some cases
- exceeding three orders of magnitude.
-
- I performed a test comparing writing 64 MB to a base (partitioned)
- filesystem, to a loop filesystem without `sync`, and a loop
- filesystem with `sync`. Total I/O was sufficient to run for at least
- 5 seconds in each case. All filesystems involved were XFS. Loop
- filesystems were 128 MB and dense. Times are in seconds. The
- erratic behavior (e.g. the 65536 case) was observed repeatedly,
- although the exact amount of time and which I/O sizes were affected
- varied. The underlying device was an HP EX920 1TB
- NVMe SSD.
-
-| I/O Size | Partition | Loop w/sync | Loop w/o sync |
-| ---: | ---: | ---: | ---: |
-| 1024 | 0.104 | 0.120 | 140.390 |
-| 4096 | 0.045 | 0.077 | 21.850 |
-| 16384 | 0.045 | 0.067 | 5.550 |
-| 65536 | 0.044 | 0.061 | 20.440 |
-| 262144 | 0.043 | 0.087 | 0.545 |
-| 1048576 | 0.043 | 0.055 | 7.490 |
-| 4194304 | 0.043 | 0.053 | 0.587 |
-
- The only potentially viable combination in my view would be a dense
- loop filesystem without sync, but that would render any thin
- provisioning impossible.
-
-## Infrastructure Needed [optional]
-
-* Decision: who is responsible for quota management of all volume
- types (and especially ephemeral volumes of all types). At present,
- emptydir volumes are managed by the kubelet and logdirs and writable
- layers by either the kubelet or the runtime, depending upon the
- choice of runtime. Beyond the specific proposal that the runtime
- should manage quotas for volumes it creates, there are broader
- issues that I request assistance from the SIG in addressing.
-
-* Location of the quota code. If the quotas for different volume
- types are to be managed by different components, each such component
- needs access to the quota code. The code is substantial and should
- not be copied; it would more appropriately be vendored.
-
-## References
-
-### Bugs Opened Against Filesystem Quotas
-
-The following is a list of known security issues referencing
-filesystem quotas on Linux, and other bugs referencing filesystem
-quotas in Linux since 2012. These bugs are not necessarily in the
-quota system.
-
-#### CVE
-
-* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel
- before 3.3.6, when huge pages are enabled, allows local users to
- cause a denial of service (system crash) or possibly gain privileges
- by interacting with a hugetlbfs filesystem, as demonstrated by a
- umount operation that triggers improper handling of quota data.
-
- The issue is actually related to huge pages, not quotas
- specifically. The demonstration of the vulnerability resulted in
- incorrect handling of quota data.
-
-* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c)
- in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl
- function the first time without a host name, which might allow
- remote attackers to bypass TCP Wrappers rules in hosts.deny (related
- to rpc.rquotad; remote attackers might be able to bypass TCP
- Wrappers rules).
-
- This issue is related to remote quota handling, which is not the use
- case for the proposal at hand.
-
-#### Other Security Issues Without CVE
-
-* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and
- Create Large Files](https://securitytracker.com/id/1002610)
-
- A setuid root binary inheriting file descriptors from an
- unprivileged user process may write to the file without respecting
- quota limits. If this issue is still present, it would allow a
- setuid process to exceed any enforcing limits, but does not affect
- the quota accounting (use of quotas for monitoring).
-
-### Other Linux Quota-Related Bugs Since 2012
-
-* [ext4: report delalloc reserve as non-free in statfs mangled by
- project quota](https://lore.kernel.org/patchwork/patch/884530/)
-
- This bug, fixed in Feb. 2018, properly accounts for reserved but not
- committed space in project quotas. At this point I have not
- determined the impact of this issue.
-
-* [XFS quota doesn't work after rebooting because of
- crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730)
-
- This bug resulted in XFS quotas not working after a crash or forced
- reboot. Under this proposal, Kubernetes would fall back to du for
- monitoring should a bug of this nature manifest itself again.
-
-* [quota can show incorrect filesystem
- name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527)
-
- This issue, which will not be fixed, results in the quota command
- possibly printing an incorrect filesystem name when used on remote
- filesystems. It is a display issue with the quota command, not a
- quota bug at all, and does not result in incorrect quota information
- being reported. As this proposal does not utilize the quota command
- or rely on filesystem name, or currently use quotas on remote
- filesystems, it should not be affected by this bug.
-
-In addition, the e2fsprogs have had numerous fixes over the years.
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+--> \ No newline at end of file
diff --git a/keps/sig-node/compute-device-assignment.md b/keps/sig-node/compute-device-assignment.md
index 1ce72617..cfd1f5fa 100644
--- a/keps/sig-node/compute-device-assignment.md
+++ b/keps/sig-node/compute-device-assignment.md
@@ -1,150 +1,4 @@
----
-kep-number: 18
-title: Kubelet endpoint for device assignment observation details
-authors:
- - "@dashpole"
- - "@vikaschoudhary16"
-owning-sig: sig-node
-reviewers:
- - "@thockin"
- - "@derekwaynecarr"
- - "@dchen1107"
- - "@vishh"
-approvers:
- - "@sig-node-leads"
-editors:
- - "@dashpole"
- - "@vikaschoudhary16"
-creation-date: "2018-07-19"
-last-updated: "2018-07-19"
-status: provisional
----
-# Kubelet endpoint for device assignment observation details
-
-Table of Contents
-=================
-* [Abstract](#abstract)
-* [Background](#background)
-* [Objectives](#objectives)
-* [User Journeys](#user-journeys)
- * [Device Monitoring Agents](#device-monitoring-agents)
-* [Changes](#changes)
-* [Potential Future Improvements](#potential-future-improvements)
-* [Alternatives Considered](#alternatives-considered)
-
-## Abstract
-In this document we will discuss the motivation and code changes required for introducing a kubelet endpoint to expose device to container bindings.
-
-## Background
-[Device Monitoring](https://docs.google.com/document/d/1NYnqw-HDQ6Y3L_mk85Q3wkxDtGNWTxpsedsgw4NgWpg/edit?usp=sharing) requires external agents to be able to determine the set of devices in-use by containers and attach pod and container metadata for these devices.
-
-## Objectives
-
-* To remove current device-specific knowledge from the kubelet, such as [accelerator metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L229)
-* To enable future use-cases requiring device-specific knowledge to be out-of-tree
-
-## User Journeys
-
-### Device Monitoring Agents
-
-* As a _Cluster Administrator_, I provide a set of devices from various vendors in my cluster. Each vendor independently maintains their own agent, so I run monitoring agents only for devices I provide. Each agent adheres to the [node monitoring guidelines](https://docs.google.com/document/d/1_CdNWIjPBqVDMvu82aJICQsSCbh2BR-y9a8uXjQm4TI/edit?usp=sharing), so I can use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even though they are maintained by different vendors.
-* As a _Device Vendor_, I manufacture devices and I have deep domain expertise in how to run and monitor them. Because I maintain my own Device Plugin implementation, as well as Device Monitoring Agent, I can provide consumers of my devices an easy way to consume and monitor my devices without requiring open-source contributions. The Device Monitoring Agent doesn't have any dependencies on the Device Plugin, so I can decouple monitoring from device lifecycle management. My Device Monitoring Agent works by periodically querying the `/devices/<ResourceName>` endpoint to discover which devices are being used, and to get the container/pod metadata associated with the metrics:
-
-![device monitoring architecture](https://user-images.githubusercontent.com/3262098/43926483-44331496-9bdf-11e8-82a0-14b47583b103.png)
-
-
-## Changes
-
-Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager. The GRPC Service returns a single ListPodResourcesResponse, which is shown in proto below:
-```protobuf
-// PodResources is a service provided by the kubelet that provides information about the
-// node resources consumed by pods and containers on the node
-service PodResources {
- rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
-}
-
-// ListPodResourcesRequest is the request made to the PodResources service
-message ListPodResourcesRequest {}
-
-// ListPodResourcesResponse is the response returned by List function
-message ListPodResourcesResponse {
- repeated PodResources pod_resources = 1;
-}
-
-// PodResources contains information about the node resources assigned to a pod
-message PodResources {
- string name = 1;
- string namespace = 2;
- repeated ContainerResources containers = 3;
-}
-
-// ContainerResources contains information about the resources assigned to a container
-message ContainerResources {
- string name = 1;
- repeated ContainerDevices devices = 2;
-}
-
-// ContainerDevices contains information about the devices assigned to a container
-message ContainerDevices {
- string resource_name = 1;
- repeated string device_ids = 2;
-}
-```
-
-### Potential Future Improvements
-
-* Add `ListAndWatch()` function to the GRPC endpoint so monitoring agents don't need to poll.
-* Add identifiers for other resources used by pods to the `PodResources` message.
- * For example, persistent volume location on disk
-
-## Alternatives Considered
-
-### Add v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns a list of [CreateContainerRequest](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L734)s used to create containers.
-* Pros:
- * Reuse an existing API for describing containers rather than inventing a new one
-* Cons:
- * It ties the endpoint to the CreateContainerRequest, and may prevent us from adding other information we want in the future
- * It does not contain any additional information that will be useful to monitoring agents other than device, and contains lots of irrelevant information for this use-case.
-* Notes:
- * Does not include any reference to resource names. Monitoring agents must identify devices by the device or environment variables passed to the pod or container.
-
-### Add a field to Pod Status.
-* Pros:
- * Allows for observation of container to device bindings local to the node through the `/pods` endpoint
-* Cons:
- * Only consumed locally, which doesn't justify an API change
- * Device Bindings are immutable after allocation, and are _debatably_ observable (they can be "observed" from the local checkpoint file). Device bindings are generally a poor fit for status.
-
-### Use the Kubelet Device Manager Checkpoint file
-* Allows for observability of device to container bindings through what exists in the checkpoint file
- * Requires adding additional metadata to the checkpoint file as required by the monitoring agent
-* Requires implementing versioning for the checkpoint file, and handling version skew between readers and the kubelet
-* Future modifications to the checkpoint file are more difficult.
-
-### Add a field to the Pod Spec:
-* A new object `ComputeDevice` will be defined and a new variable `ComputeDevices` will be added in the `Container` (Spec) object which will represent a list of `ComputeDevice` objects.
-```golang
-// ComputeDevice describes the devices assigned to this container for a given ResourceName
-type ComputeDevice struct {
- // DeviceIDs is the list of devices assigned to this container
- DeviceIDs []string
- // ResourceName is the name of the compute resource
- ResourceName string
-}
-
-// Container represents a single container that is expected to be run on the host.
-type Container struct {
- ...
- // ComputeDevices contains the devices assigned to this container
- // This field is alpha-level and is only honored by servers that enable the ComputeDevices feature.
- // +optional
- ComputeDevices []ComputeDevice
- ...
-}
-```
-* During Kubelet pod admission, if `ComputeDevices` is found non-empty, the specified devices will be allocated; otherwise, behaviour will remain the same as it is today.
-* Before starting the pod, the kubelet writes the assigned `ComputeDevices` back to the pod spec.
- * Note: Writing to the Api Server and waiting to observe the updated pod spec in the kubelet's pod watch may add significant latency to pod startup.
-* Allows devices to potentially be assigned by a custom scheduler.
-* Serves as a permanent record of device assignments for the kubelet, and eliminates the need for the kubelet to maintain this state locally.
-
+KEPs have moved to https://git.k8s.io/enhancements/.
+<!--
+This file is a placeholder to preserve links. Please remove after 6 months or the release of Kubernetes 1.15, whichever comes first.
+--> \ No newline at end of file