From 8dcee5c894adef043bbdfb3df9e93c53807ec026 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Thu, 6 Sep 2018 10:24:27 -0400 Subject: First draft for quotas for ephemeral storage --- .../0028-20180906-quotas-for-ephemeral-storage.md | 229 +++++++++++++++++++++ 1 file changed, 229 insertions(+) create mode 100644 keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md new file mode 100644 index 00000000..b0d12610 --- /dev/null +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -0,0 +1,229 @@ +--- +kep-number: 0 +title: My First KEP +authors: + - "@janedoe" +owning-sig: sig-xxx +participating-sigs: + - sig-aaa + - sig-bbb +reviewers: + - TBD + - "@alicedoe" +approvers: + - TBD + - "@oscardoe" +editor: TBD +creation-date: yyyy-mm-dd +last-updated: yyyy-mm-dd +status: provisional +see-also: + - KEP-1 + - KEP-2 +replaces: + - KEP-3 +superseded-by: + - KEP-100 +--- + +# Quotas for Ephemeral Storaeg + +## Table of Contents + +A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template. +[Tools for generating][] a table of contents from markdown are available. + +* [Table of Contents](#table-of-contents) +* [Summary](#summary) +* [Motivation](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) +* [Proposal](#proposal) + * [User Stories [optional]](#user-stories-optional) + * [Story 1](#story-1) + * [Story 2](#story-2) + * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + * [Risks and Mitigations](#risks-and-mitigations) +* [Graduation Criteria](#graduation-criteria) +* [Implementation History](#implementation-history) +* [Drawbacks [optional]](#drawbacks-optional) +* [Alternatives [optional]](#alternatives-optional) + +[Tools for generating]: https://github.com/ekalinin/github-markdown-toc + +## Summary + +Local storage capacity isolation, aka ephemeral-storage, was introduced into Kubernetes via https://github.com/kubernetes/features/issues/361. It provides support for capacity isolation of shared storage between pods, such that a pod can be limited in its consumption of shared resources and can be evicted if its consumption of shared storage exceeds that limit. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption. +The current mechanism relies on periodically walking each ephemeral volume (emptydir, logdir, or container writable layer) and summing the space consumption. This method is slow, can be fooled, and has high latency (i. e. a pod could consume a lot of storage prior to the kubelet being aware of its overage and terminating it). +The mechanism proposed here utilizes filesystem project quotas to provide monitoring of resource consumption and optionally enforcement of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of restricting and monitoring filesystem consumption that can be applied to one or more directories. + +## Motivation + +The mechanism presently used to monitor storage consumption involves use of `du` and `find` to periodically gather information about storage and inode consumption of volumes. This mechanism suffers from a number of drawbacks: + +* It is slow. 
If a volume contains a large number of files, walking the directory can take a significant amount of time. There has been at least one known report of nodes becoming not ready due to volume metrics: https://github.com/kubernetes/kubernetes/issues/62917 +* It is possible to conceal a file from the walker by creating it and removing it while holding an open file descriptor on it. POSIX behavior is to not remove the file until the last open file descriptor pointing to it is removed. This has legitimate uses; it ensures that a temporary file is deleted when the processes using it exit, and it minimizes the attack surface by not having a file that can be found by an attacker. The following pod does this; it will never be caught by the present mechanism: +```yaml +apiVersion: v1 +kind: Pod +max: +metadata: + name: "diskhog" +spec: + containers: + - name: "perl" + resources: + limits: + ephemeral-storage: "2048Ki" + image: "perl" + command: + - perl + - -e + - > + my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999 + volumeMounts: + - name: a + mountPath: /data/a + volumes: + - name: a + emptyDir: {} +``` +* It is reactive rather than proactive. It does not prevent a pod from overshooting its limit; at best it catches it after the fact. On a fast storage medium, such as NVMe, a pod may write 50 GB or more of data before the housekeeping performed once per minute catches up to it. If the primary volume is the root partition, this will completely fill the partition, possibly causing serious problems elsewhere on the system. + +In many environments, these issues may not matter, but shared multi-tenant environments need these issues addressed. + +### Goals + +* Primary: improve performance of monitoring by using project quotas in a non-enforcing way to collect information about storage utilization. +* Primary: detect storage used by pods that is concealed by deleted files being held open. +* Primary: this will not interfere with the more common user and group quotas. +* Stretch: enforce limits on per-volume storage consumption by using enforced project quotas. Each volume would be given an enforced quota of the total ephemeral storage limit of the pod. + +### Non-Goals + +* Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit. + +## Proposal + +This proposal applies project quotas to emptydir volumes on qualifying filesystems (ext4fs and xfs with project quotas enabled). Project quotas are applied by selecting an unused project ID (a 32-bit unsigned integer), setting a limit on space and/or inode consumption, and attaching the ID to one or more files. By default (and as utilized herein), if a project ID is attached to a directory, it is inherited by any files created under that directory. +If we elect to use the quota as enforcing, we impose a quota consistent with the desired limit. If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^61-1 bytes for XFS, 2^58-1 bytes for ext4fs). + +### Operation Flow -- Applying a Quota + +* Caller (emptydir volume manager or container runtime) creates an emptydir volume, with an empty directory at a location of its choice. +* Caller requests that a quota be applied to a directory. 
+* Determine whether a quota can be imposed on the directory, by asking each quota provider (one per filesystem type) whether it can apply a quota to the directory. If no provider claims the directory, an error status is returned to the caller. +* Select an unused project ID (see below). +* Set the desired limit on the project ID, in a filesystem-dependent manner. +* Apply the project ID to the directory in question, in a filesystem-dependent manner. + +An error at any point results in no quota being applied and no change to the state of the system. The caller in general should not assume a priori that the attempt will be successful. It could choose to reject a request if a quota cannot be applied, but at this time it will simply ignore the error and proceed as today. + +### Operation Flow -- Retrieving Storage Consumption + +* Caller (kubelet metrics code, cadvisor, container runtime) asks the quota code to compute the amount of storage used under the directory. +* Determine whether a quota applies to the directory, in a filesystem-dependent manner (see below). +* If so, determine how much storage or how many inodes are utilized, in a filesystem dependent manner. + +If the quota code is unable to retrieve the consumption, it returns an error status and it is up to the caller to utilize a fallback mechanism (such as the directory walk performed today). + +### Operation Flow -- Removing a Quota. + +* Caller requests that the quota be removed from a directory. +* Determine whether a project quota applies to the directory. +* Remove the limit from the project ID associated with the directory. +* Remove the association between the directory and the project ID. +* Return the project ID to the system to allow its use elsewhere (see below). +* Caller may delete the directory and its contents (normally it will). + +### Operation Notes + +#### Selecting a Project ID + +Project IDs are a shared space within a filesystem. If the same project ID is assigned to multiple directories, the space consumption reported by the quota will be the sum of that of all of the directories. Hence, it is important to ensure that each directory is assigned a unique project ID (unless it is desired to pool the storage use of multiple directories). + +The canonical mechanism to record persistently that a project ID is reserved is to store it in the /etc/projid (projid(5)) and/or /etc/projects (projects(5)) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a project ID is in active use on a given filesystem: + +* The quota values (in blocks and/or inodes) assigned to the project ID are non-zero. +* The storage consumption (in blocks and/or inodes) reported under the project ID are non-zero. + +The algorithm to be used is as follows: + +* Lock this instance of the quota code against re-entrancy. +* open and flock() the /etc/project and /etc/projid files, so that other uses of this code are excluded. +* Start from a high number (the prototype uses 1048577). +* Iterate from there, performing the following tests: + * Is the ID reserved by this instance of the quota code? + * Is the ID present in /etc/projects? + * Is the ID present in /etc/projid? + * Are the quota values and/or consumption reported by the kernel non-zero? 
This test is restricted to 128 iterations to ensure that a bug here or elsewhere does not result in an infinite loop looking for a quota ID. +* If an ID has been found: + * Add it to an in-memory copy of /etc/projects and /etc/projid so that any other uses of project quotas do not reuse it. + * Write temporary copies of /etc/projects and /etc/projid that are flock()ed + * If successful, rename the temporary files appropriately (if rename of one succeeds but the other fails, we have a problem that we cannot recover from, and the files may be inconsistent). +* Unlock /etc/projid and /etc/projects. +* Unlock this instance of the quota code. + +A minor variation of this is used if we want to reuse an existing quota ID. + +#### Determine Whether a Project ID Applies To a Directory + +It is possible to determine whether a directory has a project ID applied to it by requesting (via the quotactl(2) system call) the project ID associated with the directory. Whie the specifics are filesystem-dependent, the basic method is the same for at least XFS and ext4fs. + +It is not possible to directly determine the directory or directories to which a project ID is applied. It is possible to determine whether a project ID has been applied to an existing directory or files; the reported consumption will be non-zero. + +The code records internally the project ID applied to a directory, but it cannot always rely on this. In particular, if the kubelet has exited and has been restarted, the map from directory to project ID is lost. If it cannot find a map entry, it falls back on the approach discussed above. + +#### Return a Project ID To the System + +The algorithm used to return a project ID to the system is very similar to the algorithm used to select a project ID, except of course for selecting a project ID. It performs the same sequence of locking /etc/project and /etc/projid, editing a copy of the file, and restoring it. + +If the project ID is applied to multiple directories and the code can determine that, it will not remove the project ID from /etc/projid until the last reference is removed. While it is not anticipated that this mode of operation will be used, at least initially, this can be detected even on kubelet restart by looking at the reference count in /etc/projects. + + +### Implementation Details/Notes/Constraints [optional] + +What are the caveats to the implementation? +What are some important details that didn't come across above. +Go in to as much detail as necessary here. +This might be a good place to talk about core concepts and how they releate. + +### Risks and Mitigations + +What are the risks of this proposal and how do we mitigate. +Think broadly. +For example, consider both security and how this will impact the larger kubernetes ecosystem. + +## Graduation Criteria + +How will we know that this has succeeded? +Gathering user feedback is crucial for building high quality experiences and SIGs have the important responsibility of setting milestones for stability and completeness. +Hopefully the content previously contained in [umbrella issues][] will be tracked in the `Graduation Criteria` section. + +[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752 + +## Implementation History + +Major milestones in the life cycle of a KEP should be tracked in `Implementation History`. 
+Major milestones might include + +- the `Summary` and `Motivation` sections being merged signaling SIG acceptance +- the `Proposal` section being merged signaling agreement on a proposed design +- the date implementation started +- the first Kubernetes release where an initial version of the KEP was available +- the version of Kubernetes where the KEP graduated to general availability +- when the KEP was retired or superseded + +## Drawbacks [optional] + +Why should this KEP _not_ be implemented. + +## Alternatives [optional] + +Similar to the `Drawbacks` section the `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a KEP. + +## Infrastructure Needed [optional] + +Use this section if you need things from the project/SIG. +Examples include a new subproject, repos requested, github details. +Listing these here allows a SIG to get the process for these resources started right away. \ No newline at end of file -- cgit v1.2.3 From 4a754840c5578c44ac9784e002ade78323790085 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Mon, 10 Sep 2018 11:36:40 -0400 Subject: Updates --- .../0028-20180906-quotas-for-ephemeral-storage.md | 592 ++++++++++++++++++--- 1 file changed, 505 insertions(+), 87 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index b0d12610..8a095070 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -26,43 +26,87 @@ superseded-by: - KEP-100 --- -# Quotas for Ephemeral Storaeg +# Quotas for Ephemeral Storage ## Table of Contents -A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template. -[Tools for generating][] a table of contents from markdown are available. - -* [Table of Contents](#table-of-contents) -* [Summary](#summary) -* [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) -* [Proposal](#proposal) - * [User Stories [optional]](#user-stories-optional) - * [Story 1](#story-1) - * [Story 2](#story-2) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Risks and Mitigations](#risks-and-mitigations) -* [Graduation Criteria](#graduation-criteria) -* [Implementation History](#implementation-history) -* [Drawbacks [optional]](#drawbacks-optional) -* [Alternatives [optional]](#alternatives-optional) +A table of contents is helpful for quickly jumping to sections of a +KEP and for highlighting any additional information provided beyond +the standard KEP template. [Tools for generating][https://github.com/ekalinin/github-markdown-toc] a table of +contents from markdown are available. 
+ +* [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) + * [Table of Contents](#table-of-contents) + * [Summary](#summary) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) + * [Proposal](#proposal) + * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) + * [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) + * [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) + * [Operation Notes](#operation-notes) + * [Selecting a Project ID](#selecting-a-project-id) + * [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) + * [Return a Project ID To the System](#return-a-project-id-to-the-system) + * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + * [Notes on Implementation](#notes-on-implementation) + * [Notes on Code Changes](#notes-on-code-changes) + * [Testing Strategy](#testing-strategy) + * [Risks and Mitigations](#risks-and-mitigations) + * [Graduation Criteria](#graduation-criteria) + * [Implementation History](#implementation-history) + * [Drawbacks [optional]](#drawbacks-optional) + * [Alternatives [optional]](#alternatives-optional) + * [Alternative quota-based implementation](#alternative-quota-based-implementation) + * [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) + * [Infrastructure Needed [optional]](#infrastructure-needed-optional) [Tools for generating]: https://github.com/ekalinin/github-markdown-toc ## Summary -Local storage capacity isolation, aka ephemeral-storage, was introduced into Kubernetes via https://github.com/kubernetes/features/issues/361. It provides support for capacity isolation of shared storage between pods, such that a pod can be limited in its consumption of shared resources and can be evicted if its consumption of shared storage exceeds that limit. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption. -The current mechanism relies on periodically walking each ephemeral volume (emptydir, logdir, or container writable layer) and summing the space consumption. This method is slow, can be fooled, and has high latency (i. e. a pod could consume a lot of storage prior to the kubelet being aware of its overage and terminating it). -The mechanism proposed here utilizes filesystem project quotas to provide monitoring of resource consumption and optionally enforcement of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of restricting and monitoring filesystem consumption that can be applied to one or more directories. +Local storage capacity isolation, aka ephemeral-storage, was +introduced into Kubernetes via +. It provides +support for capacity isolation of shared storage between pods, such +that a pod can be limited in its consumption of shared resources and +can be evicted if its consumption of shared storage exceeds that +limit. The limits and requests for shared ephemeral-storage are +similar to those for memory and CPU consumption. + +The current mechanism relies on periodically walking each ephemeral +volume (emptydir, logdir, or container writable layer) and summing the +space consumption. This method is slow, can be fooled, and has high +latency (i. e. 
a pod could consume a lot of storage prior to the +kubelet being aware of its overage and terminating it). + +The mechanism proposed here utilizes filesystem project quotas to +provide monitoring of resource consumption and optionally enforcement +of limits. Project quotas, initially in XFS and more recently ported +to ext4fs, offer a kernel-based means of restricting and monitoring +filesystem consumption that can be applied to one or more directories. ## Motivation -The mechanism presently used to monitor storage consumption involves use of `du` and `find` to periodically gather information about storage and inode consumption of volumes. This mechanism suffers from a number of drawbacks: +The mechanism presently used to monitor storage consumption involves +use of `du` and `find` to periodically gather information about +storage and inode consumption of volumes. This mechanism suffers from +a number of drawbacks: + +* It is slow. If a volume contains a large number of files, walking + the directory can take a significant amount of time. There has been + at least one known report of nodes becoming not ready due to volume + metrics: +* It is possible to conceal a file from the walker by creating it and + removing it while holding an open file descriptor on it. POSIX + behavior is to not remove the file until the last open file + descriptor pointing to it is removed. This has legitimate uses; it + ensures that a temporary file is deleted when the processes using it + exit, and it minimizes the attack surface by not having a file that + can be found by an attacker. The following pod does this; it will + never be caught by the present mechanism: -* It is slow. If a volume contains a large number of files, walking the directory can take a significant amount of time. There has been at least one known report of nodes becoming not ready due to volume metrics: https://github.com/kubernetes/kubernetes/issues/62917 -* It is possible to conceal a file from the walker by creating it and removing it while holding an open file descriptor on it. POSIX behavior is to not remove the file until the last open file descriptor pointing to it is removed. This has legitimate uses; it ensures that a temporary file is deleted when the processes using it exit, and it minimizes the attack surface by not having a file that can be found by an attacker. The following pod does this; it will never be caught by the present mechanism: ```yaml apiVersion: v1 kind: Pod @@ -88,44 +132,85 @@ spec: - name: a emptyDir: {} ``` -* It is reactive rather than proactive. It does not prevent a pod from overshooting its limit; at best it catches it after the fact. On a fast storage medium, such as NVMe, a pod may write 50 GB or more of data before the housekeeping performed once per minute catches up to it. If the primary volume is the root partition, this will completely fill the partition, possibly causing serious problems elsewhere on the system. +* It is reactive rather than proactive. It does not prevent a pod + from overshooting its limit; at best it catches it after the fact. + On a fast storage medium, such as NVMe, a pod may write 50 GB or + more of data before the housekeeping performed once per minute + catches up to it. If the primary volume is the root partition, this + will completely fill the partition, possibly causing serious + problems elsewhere on the system. -In many environments, these issues may not matter, but shared multi-tenant environments need these issues addressed. 
+In many environments, these issues may not matter, but shared +multi-tenant environments need these issues addressed. ### Goals -* Primary: improve performance of monitoring by using project quotas in a non-enforcing way to collect information about storage utilization. -* Primary: detect storage used by pods that is concealed by deleted files being held open. -* Primary: this will not interfere with the more common user and group quotas. -* Stretch: enforce limits on per-volume storage consumption by using enforced project quotas. Each volume would be given an enforced quota of the total ephemeral storage limit of the pod. +* Primary: improve performance of monitoring by using project quotas + in a non-enforcing way to collect information about storage + utilization. +* Primary: detect storage used by pods that is concealed by deleted + files being held open. +* Primary: this will not interfere with the more common user and group + quotas. +* Stretch: enforce limits on per-volume storage consumption by using + enforced project quotas. Each volume would be given an enforced + quota of the total ephemeral storage limit of the pod. ### Non-Goals -* Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit. +* Enforcing limits on total pod storage consumption by any means, such + that the pod would be hard restricted to the desired storage limit. ## Proposal -This proposal applies project quotas to emptydir volumes on qualifying filesystems (ext4fs and xfs with project quotas enabled). Project quotas are applied by selecting an unused project ID (a 32-bit unsigned integer), setting a limit on space and/or inode consumption, and attaching the ID to one or more files. By default (and as utilized herein), if a project ID is attached to a directory, it is inherited by any files created under that directory. -If we elect to use the quota as enforcing, we impose a quota consistent with the desired limit. If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^61-1 bytes for XFS, 2^58-1 bytes for ext4fs). +This proposal applies project quotas to emptydir volumes on qualifying +filesystems (ext4fs and xfs with project quotas enabled). Project +quotas are applied by selecting an unused project ID (a 32-bit +unsigned integer), setting a limit on space and/or inode consumption, +and attaching the ID to one or more files. By default (and as +utilized herein), if a project ID is attached to a directory, it is +inherited by any files created under that directory. + +If we elect to use the quota as enforcing, we impose a quota +consistent with the desired limit. If we elect to use it as +non-enforcing, we impose a large quota that in practice cannot be +exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). ### Operation Flow -- Applying a Quota -* Caller (emptydir volume manager or container runtime) creates an emptydir volume, with an empty directory at a location of its choice. +* Caller (emptydir volume manager or container runtime) creates an + emptydir volume, with an empty directory at a location of its + choice. * Caller requests that a quota be applied to a directory. -* Determine whether a quota can be imposed on the directory, by asking each quota provider (one per filesystem type) whether it can apply a quota to the directory. If no provider claims the directory, an error status is returned to the caller. -* Select an unused project ID (see below). 
-* Set the desired limit on the project ID, in a filesystem-dependent manner. -* Apply the project ID to the directory in question, in a filesystem-dependent manner. - -An error at any point results in no quota being applied and no change to the state of the system. The caller in general should not assume a priori that the attempt will be successful. It could choose to reject a request if a quota cannot be applied, but at this time it will simply ignore the error and proceed as today. +* Determine whether a quota can be imposed on the directory, by asking + each quota provider (one per filesystem type) whether it can apply a + quota to the directory. If no provider claims the directory, an + error status is returned to the caller. +* Select an unused project ID (see [below](#selecting-a-project-id)). +* Set the desired limit on the project ID, in a filesystem-dependent + manner (see [below](#notes-on-implementation)). +* Apply the project ID to the directory in question, in a + filesystem-dependent manner. + +An error at any point results in no quota being applied and no change +to the state of the system. The caller in general should not assume a +priori that the attempt will be successful. It could choose to reject +a request if a quota cannot be applied, but at this time it will +simply ignore the error and proceed as today. ### Operation Flow -- Retrieving Storage Consumption -* Caller (kubelet metrics code, cadvisor, container runtime) asks the quota code to compute the amount of storage used under the directory. -* Determine whether a quota applies to the directory, in a filesystem-dependent manner (see below). -* If so, determine how much storage or how many inodes are utilized, in a filesystem dependent manner. +* Caller (kubelet metrics code, cadvisor, container runtime) asks the + quota code to compute the amount of storage used under the + directory. +* Determine whether a quota applies to the directory, in a + filesystem-dependent manner (see [below](#notes-on-implementation)). +* If so, determine how much storage or how many inodes are utilized, + in a filesystem dependent manner. -If the quota code is unable to retrieve the consumption, it returns an error status and it is up to the caller to utilize a fallback mechanism (such as the directory walk performed today). +If the quota code is unable to retrieve the consumption, it returns an +error status and it is up to the caller to utilize a fallback +mechanism (such as the directory walk performed today). ### Operation Flow -- Removing a Quota. @@ -133,97 +218,430 @@ If the quota code is unable to retrieve the consumption, it returns an error sta * Determine whether a project quota applies to the directory. * Remove the limit from the project ID associated with the directory. * Remove the association between the directory and the project ID. -* Return the project ID to the system to allow its use elsewhere (see below). +* Return the project ID to the system to allow its use elsewhere (see + [below](#return-a-project-id-to-the-system). * Caller may delete the directory and its contents (normally it will). ### Operation Notes #### Selecting a Project ID -Project IDs are a shared space within a filesystem. If the same project ID is assigned to multiple directories, the space consumption reported by the quota will be the sum of that of all of the directories. Hence, it is important to ensure that each directory is assigned a unique project ID (unless it is desired to pool the storage use of multiple directories). 
- -The canonical mechanism to record persistently that a project ID is reserved is to store it in the /etc/projid (projid(5)) and/or /etc/projects (projects(5)) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a project ID is in active use on a given filesystem: - -* The quota values (in blocks and/or inodes) assigned to the project ID are non-zero. -* The storage consumption (in blocks and/or inodes) reported under the project ID are non-zero. +Project IDs are a shared space within a filesystem. If the same +project ID is assigned to multiple directories, the space consumption +reported by the quota will be the sum of that of all of the +directories. Hence, it is important to ensure that each directory is +assigned a unique project ID (unless it is desired to pool the storage +use of multiple directories). + +The canonical mechanism to record persistently that a project ID is +reserved is to store it in the /etc/projid (projid(5)) and/or +/etc/projects (projects(5)) files. However, it is possible to utilize +project IDs without recording them in those files; they exist for +administrative convenience but neither the kernel nor the filesystem +is aware of them. Other ways can be used to determine whether a +project ID is in active use on a given filesystem: + +* The quota values (in blocks and/or inodes) assigned to the project + ID are non-zero. +* The storage consumption (in blocks and/or inodes) reported under the + project ID are non-zero. The algorithm to be used is as follows: * Lock this instance of the quota code against re-entrancy. -* open and flock() the /etc/project and /etc/projid files, so that other uses of this code are excluded. +* open and flock() the /etc/project and /etc/projid files, so that + other uses of this code are excluded. * Start from a high number (the prototype uses 1048577). * Iterate from there, performing the following tests: * Is the ID reserved by this instance of the quota code? * Is the ID present in /etc/projects? * Is the ID present in /etc/projid? - * Are the quota values and/or consumption reported by the kernel non-zero? This test is restricted to 128 iterations to ensure that a bug here or elsewhere does not result in an infinite loop looking for a quota ID. + * Are the quota values and/or consumption reported by the kernel + non-zero? This test is restricted to 128 iterations to ensure + that a bug here or elsewhere does not result in an infinite loop + looking for a quota ID. * If an ID has been found: - * Add it to an in-memory copy of /etc/projects and /etc/projid so that any other uses of project quotas do not reuse it. - * Write temporary copies of /etc/projects and /etc/projid that are flock()ed - * If successful, rename the temporary files appropriately (if rename of one succeeds but the other fails, we have a problem that we cannot recover from, and the files may be inconsistent). + * Add it to an in-memory copy of /etc/projects and /etc/projid so + that any other uses of project quotas do not reuse it. + * Write temporary copies of /etc/projects and /etc/projid that are + flock()ed + * If successful, rename the temporary files appropriately (if + rename of one succeeds but the other fails, we have a problem + that we cannot recover from, and the files may be inconsistent). * Unlock /etc/projid and /etc/projects. 
* Unlock this instance of the quota code. -A minor variation of this is used if we want to reuse an existing quota ID. +A minor variation of this is used if we want to reuse an existing +quota ID. #### Determine Whether a Project ID Applies To a Directory -It is possible to determine whether a directory has a project ID applied to it by requesting (via the quotactl(2) system call) the project ID associated with the directory. Whie the specifics are filesystem-dependent, the basic method is the same for at least XFS and ext4fs. - -It is not possible to directly determine the directory or directories to which a project ID is applied. It is possible to determine whether a project ID has been applied to an existing directory or files; the reported consumption will be non-zero. - -The code records internally the project ID applied to a directory, but it cannot always rely on this. In particular, if the kubelet has exited and has been restarted, the map from directory to project ID is lost. If it cannot find a map entry, it falls back on the approach discussed above. +It is possible to determine whether a directory has a project ID +applied to it by requesting (via the quotactl(2) system call) the +project ID associated with the directory. Whie the specifics are +filesystem-dependent, the basic method is the same for at least XFS +and ext4fs. + +It is not possible to determine in constant operations the directory +or directories to which a project ID is applied. It is possible to +determine whether a given project ID has been applied to an existing +directory or files (although those will not be known); the reported +consumption will be non-zero. + +The code records internally the project ID applied to a directory, but +it cannot always rely on this. In particular, if the kubelet has +exited and has been restarted (and hence the quota applying to the +directory should be removed), the map from directory to project ID is +lost. If it cannot find a map entry, it falls back on the approach +discussed above. #### Return a Project ID To the System -The algorithm used to return a project ID to the system is very similar to the algorithm used to select a project ID, except of course for selecting a project ID. It performs the same sequence of locking /etc/project and /etc/projid, editing a copy of the file, and restoring it. +The algorithm used to return a project ID to the system is very +similar to the algorithm used to select a project ID, except of course +for selecting a project ID. It performs the same sequence of locking +/etc/project and /etc/projid, editing a copy of the file, and +restoring it. -If the project ID is applied to multiple directories and the code can determine that, it will not remove the project ID from /etc/projid until the last reference is removed. While it is not anticipated that this mode of operation will be used, at least initially, this can be detected even on kubelet restart by looking at the reference count in /etc/projects. +If the project ID is applied to multiple directories and the code can +determine that, it will not remove the project ID from /etc/projid +until the last reference is removed. While it is not anticipated in +this KEP that this mode of operation will be used, at least initially, +this can be detected even on kubelet restart by looking at the +reference count in /etc/projects. ### Implementation Details/Notes/Constraints [optional] -What are the caveats to the implementation? -What are some important details that didn't come across above. 
-Go in to as much detail as necessary here. -This might be a good place to talk about core concepts and how they releate. +#### Notes on Implementation + +The primary new interface defined is the quota interface in +`pkg/volume/util/quota/quota.go`. This defines five operations: + +* Does the specified directory support quotas + +* Assign a quota to a directory. If a non-empty pod UID is provided, + the quota assigned is that of any other directories under this pod + UID; if an empty pod UID is provided, a unique quota is assigned. + +* Retrieve the consumption of the specified directory. If the quota + code cannot handle it efficiently, it returns an error and the + caller falls back on existing mechanism. + +* Retrieve the inode consumption of the specified directory; same + description as above. + +* Remove quota from a directory. If a non-empty pod UID is passed, it + is checked against that recorded in-memory (if any). The quota is + removed from the specified directory. This can be used even if + AssignQuota has not been used; it inspects the directory and removes + the quota from it. This permits stale quotas from an interrupted + kubelet to be cleaned up. + +Two implementations are provided: `quota_linux.go` (for Linux) and +`quota_unsupported.go` (for other operating systems). The latter +returns an error for all requests. + +As the quota mechanism is intended to support multiple filesystems, +and different filesystems require different low level code for +manipulating quotas, a provider is supplied that finds an appropriate +quota applier implementation for the filesystem in question. The low +level quota applier provides similar operations to the top level quota +code, with two exceptions: + +* No operation exists to determine whether a quota can be applied + (that is handled by the provider). + +* An additional operation is provided to determine whether a given + quota ID is in use within the filesystem (outside of /etc/projects + and /etc/projid). + +The two quota providers in the initial implementation are in +`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While +some quota operations do require different system calls, a lot of the +code is common, and factored into +`pkg/volume/util/quota/common/quota_linux_common_impl.go`. + +#### Notes on Code Changes + +The prototype for this project is mostly self-contained within +`pkg/volume/util/quota` and a few changes to +`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were +required elsewhere: + +* The operation executor needs to pass the desired size limit to the + volume plugin where appropriate so that the volume plugin can impose + a quota. The limit is passed as 0 (do not use quotas), positive + number (impose an enforcing quota if possible, measured in bytes), + or -1 (impose a non-enforcing quota, if possible) on the volume. + + This requires changes to + `pkg/volume/util/operationexecutor/operation_executor.go` (to add + `DesiredSizeLimit` to `VolumeToMount`), + `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and + `pkg/kubelet/eviction/helpers.go` (the latter in order to determine + whether the volume is a local ephemeral one). + +* The volume manager (in `pkg/volume/volume.go`) changes the + `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new + `MounterArgs` type rather than an `FsGroup` (`*int64`). 
This is to + allow passing the desired size and pod UID (in the event we choose + to implement quotas shared between multiple volumes; see + [below](#alternative-quota-based-implementation)). This required + small changes to all volume plugins and their tests, but will in the + future allow adding additional data without having to change code + other than that which uses the new information. + +#### Testing Strategy + +The quota code is by an large not very amendable to unit tests. While +there are simple unit tests for parsing the mounts file, and there +could be tests for parsing the projects and projid files, the real +work (and risk) involves interactions with the kernel and with +multiple instances of this code (e. g. in the kubelet and the runtime +manager, particularly under stress). It also requires setup in the +form of a prepared filesystem. It would be better served by +appropriate end to end tests. ### Risks and Mitigations -What are the risks of this proposal and how do we mitigate. -Think broadly. -For example, consider both security and how this will impact the larger kubernetes ecosystem. +* The SIG raised the possibility of a container being unable to exit + should we enforce quotas, and the quota interferes with writing the + log. This can be mitigated by either not applying a quota to the + log directory and using the du mechanism, or by applying a separate + non-enforcing quota to the log directory. + + As log directories are write-only by the container, and consumption + can be limited by other means (as the log is filtered by the + runtime), I do not consider the ability to write uncapped to the log + to be a serious exposure. + + Note in addition that even without quotas it is possible for writes + to fail due to lack of filesystem space, which is effectively (and + in some cases operationally) indistinguishable from exceeding quota, + so even at present code must be able to handle those situations. + +* Filesystem quotas may impact performance to an unknown degree. + Information on that is hard to come by in general, and one of the + reasons for using quotas is indeed to improve performance. If this + is a problem in the field, merely turning off quotas (or selectively + disabling project quotas) on the filesystem in question will avoid + the problem. Against the possibility that that cannot be done + (because project quotas are needed for other purposes), we should + provide a way to disable use of quotas altogether via a feature + gate. + + A report notes that an + unclean shutdown on Linux kernel versions between 3.11 and 3.17 can + result in a prolonged downtime while quota information is restored. + Unfortunately, [the link referenced + here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no + longer available. + +* Bugs in the quota code could result in a variety of regression + behavior. For example, if a quota is incorrectly applied it could + result in ability to write no data at all to the volume. This could + be mitigated by use of non-enforcing quotas. XFS in particular + offers the pqnoenforce mount option that makes all quotas + non-enforcing. + + We should offer two feature gates, one to enable quotas at all (on + by default) and one to enable enforcing quotas (initially off, but + with intention of enabling in the near future). + ## Graduation Criteria -How will we know that this has succeeded? 
-Gathering user feedback is crucial for building high quality experiences and SIGs have the important responsibility of setting milestones for stability and completeness. -Hopefully the content previously contained in [umbrella issues][] will be tracked in the `Graduation Criteria` section. +How will we know that this has succeeded? Gathering user feedback is +crucial for building high quality experiences and SIGs have the +important responsibility of setting milestones for stability and +completeness. Hopefully the content previously contained in [umbrella +issues][] will be tracked in the `Graduation Criteria` section. -[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752 +[umbrella issues]: N/A ## Implementation History -Major milestones in the life cycle of a KEP should be tracked in `Implementation History`. -Major milestones might include +Major milestones in the life cycle of a KEP should be tracked in +`Implementation History`. Major milestones might include -- the `Summary` and `Motivation` sections being merged signaling SIG acceptance -- the `Proposal` section being merged signaling agreement on a proposed design +- the `Summary` and `Motivation` sections being merged signaling SIG + acceptance +- the `Proposal` section being merged signaling agreement on a + proposed design - the date implementation started -- the first Kubernetes release where an initial version of the KEP was available -- the version of Kubernetes where the KEP graduated to general availability +- the first Kubernetes release where an initial version of the KEP was + available +- the version of Kubernetes where the KEP graduated to general + availability - when the KEP was retired or superseded ## Drawbacks [optional] -Why should this KEP _not_ be implemented. +* Use of quotas, particularly the less commonly used project quotas, + requires additional action on the part of the administrator. In + particular: + * ext4fs filesystems must be created with additional options that + are not enabled by default: +``` +mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ +``` + * An additional option (`prjquota`) must be applied in /etc/fstab + * If the root filesystem is to be quota-enabled, it must be set in + the grub options. +* Use of project quotas for this purpose will preclude future use + within containers. ## Alternatives [optional] -Similar to the `Drawbacks` section the `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a KEP. +I have considered two classes of alternatives: + +* Alternatives based on quotas, with different implementation + +* Alternatives based on loop filesystems without use of quotas + +### Alternative quota-based implementation + +Within the basic framework of using quotas to monitor and potentially +enforce storage utilization, there are a number of possible options: + +* Utilize per-volume non-enforcing quotas to monitor storage (the + first stage of this proposal). + + This mostly preserves the current behavior, but with more efficient + determination of storage utilization and the possibility of building + further on it. The one change from current behavior is the ability + to detect space used by deleted files. + +* Utilize per-volume enforcing quotas to monitor and enforce storage + (the second stage of this proposal). + + This allows partial enforcement of storage limits. 
As local storage + capacity isolation works at the level of the pod, and we have no + control of user utilization of ephemeral volumes, we would have to + give each volume a quota of the full limit. For example, if a pod + had a limit of 1 MB but had four ephemeral volumes mounted, it would + be possible for storage utilization to reach (at least temporarily) + 4MB before being capped. + +* Utilize per-pod enforcing user or group quotas to enforce storage + consumption, and per-volume non-enforcing quotas for monitoring. + + This would offer the best of both worlds: a fully capped storage + limit combined with efficient reporting. However, it would require + each pod to run under a distinct UID or GID. This may prevent pods + from using setuid or setgid or their variants, and would interfere + with any other use of group or user quotas within Kubernetes. + +* Utilize per-pod enforcing quotas to monitor and enforce storage. + + This allows for full enforcement of storage limits, at the expense + of being able to efficiently monitor per-volume storage + consumption. As there have already been reports of monitoring + causing trouble, I do not advise this option. + + A variant of this would report (1/N) storage for each covered + volume, so with a pod with a 4MiB quota and 1MiB total consumption, + spread across 4 ephemeral volumes, each volume would report a + consumption of 256 KiB. Another variant would change the API to + report statistics for all ephemeral volumes combined. I do not + advise this option. + +### Alternative loop filesystem-based implementation + +Another way of isolating storage is to utilize filesystems of +pre-determined size, using the loop filesystem facility within Linux. +It is possible to create a file and run mkfs(8) on it, and then to +mount that filesystem on the desired directory. This both limits the +storage available within that directory and enables quick retrieval of +it via statfs(2). + +Cleanup of such a filesystem involves unmounting it and removing the +backing file. + +The backing file can be created as a sparse file, and the `discard` +option can be used to return unused space to the system, allowing for +thin provisioning. + +I conducted preliminary investigations into this. While at first it +appeared promising, it turned out to have multiple critical flaws: + +* If the filesystem is mounted without `discard`, it can grow to the + full size of the backing file, negating any possibility of thin + provisioning. If the file is created dense in the first place, + there is never any possibility of thin provisioning without use of + `discard`. + + If the backing file is created densely, it additionally may require + significant time to create if the ephemeral limit is large. + +* If the filesystem is mounted `nosync`, and is sparse, it is possible + for writes to succeed and then fail later with I/O errors when + synced to the backing storage. This will lead to data corruption + that cannot be detected at the time of write. + + This can easily be reproduced by e. g. creating a 64MB filesystem + and within it creating a 128MB sparse file and building a filesystem + on it. 
When that filesystem is in turn mounted, writes to it will + succeed, but I/O errors will be seen in the log and the file will be + incomplete: + +``` +# mkdir /var/tmp/d1 /var/tmp/d2 +# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383 +# mkfs.ext4 /var/tmp/fs1 +# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1 +# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767 +# mkfs.ext4 /var/tmp/d1/fs2 +# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2 +# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576 + _...will normally succeed..._ +# sync + _...fails with I/O error!..._ +``` + +* If the filesystem is mounted `sync`, all writes to it are + immediately committed to the backing store, and the _dd_ operation + above fails as soon as it fills up _/var/tmp/d1_. However, + performance is drastically slowed, particularly with small writes; + with 1K writes, I observed performance degradation in some cases + exceeding three orders of magnitude. + + I performed a test comparing writing 64 MB to a base (partitioned) + filesystem, to a loop filesystem without _sync_, and a loop + filesystem with _sync. Total I/O was sufficient to run for at least + 5 seconds in each case. All filesystems involved were XFS. Loop + filesystems were 128 MB and dense. Times are in seconds. The + erratic behavior (e. g. the 65536 case) was involved was observed + repeatedly, although the exact amount of time and which I/O sizes + were affected varied. The underlying device was an HP EX920 1TB + NVMe SSD. + +| I/O Size | Partition | Loop w/sync | Loop w/o sync | +| ---: | ---: | ---: | ---: | +| 1024 | 0.104 | 0.120 | 140.390 | +| 4096 | 0.045 | 0.077 | 21.850 | +| 16384 | 0.045 | 0.067 | 5.550 | +| 65536 | 0.044 | 0.061 | 20.440 | +| 262144 | 0.043 | 0.087 | 0.545 | +| 1048576 | 0.043 | 0.055 | 7.490 | +| 4194304 | 0.043 | 0.053 | 0.587 | + + The only potentially viable combination in my view would be a dense + loop filesystem without sync, but that would render any thin + provisioning impossible. ## Infrastructure Needed [optional] -Use this section if you need things from the project/SIG. -Examples include a new subproject, repos requested, github details. -Listing these here allows a SIG to get the process for these resources started right away. \ No newline at end of file +* Decision: who is responsible for quota management of all volume + types (and especially ephemeral volumes of all types). At present, + emptydir volumes are managed by the kubelet and logdirs and writable + layers by either the kubelet or the runtime, depending upon the + choice of runtime. Beyond the specific proposal that the runtime + should manage quotas for volumes it creates, there are broader + issues that I request assistance from the SIG in addressing. + +* Location of the quota code. If the quotas for different volume + types are to be managed by different components, each such component + needs access to the quota code. The code is substantial and should + not be copied; it would more appropriately be vendored. -- cgit v1.2.3 From ac338c9414772ee9a07f9eb4492c1a0a3486d12a Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Tue, 11 Sep 2018 12:01:34 -0400 Subject: Updates from first round comments. 
--- .../0028-20180906-quotas-for-ephemeral-storage.md | 125 ++++++++++++++++----- 1 file changed, 97 insertions(+), 28 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index 8a095070..fb15703b 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -35,32 +35,34 @@ KEP and for highlighting any additional information provided beyond the standard KEP template. [Tools for generating][https://github.com/ekalinin/github-markdown-toc] a table of contents from markdown are available. -* [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Proposal](#proposal) - * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) - * [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) - * [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) - * [Operation Notes](#operation-notes) - * [Selecting a Project ID](#selecting-a-project-id) - * [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) - * [Return a Project ID To the System](#return-a-project-id-to-the-system) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Notes on Implementation](#notes-on-implementation) - * [Notes on Code Changes](#notes-on-code-changes) - * [Testing Strategy](#testing-strategy) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Drawbacks [optional]](#drawbacks-optional) - * [Alternatives [optional]](#alternatives-optional) - * [Alternative quota-based implementation](#alternative-quota-based-implementation) - * [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) - * [Infrastructure Needed [optional]](#infrastructure-needed-optional) + * [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) + * [Table of Contents](#table-of-contents) + * [Summary](#summary) + * [Project Quotas](#project-quotas) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) + * [Proposal](#proposal) + * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) + * [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) + * [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) + * [Operation Notes](#operation-notes) + * [Selecting a Project ID](#selecting-a-project-id) + * [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) + * [Return a Project ID To the System](#return-a-project-id-to-the-system) + * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + * [Notes on Implementation](#notes-on-implementation) + * [Notes on Code Changes](#notes-on-code-changes) + * [Testing Strategy](#testing-strategy) + * [Risks and Mitigations](#risks-and-mitigations) + * [Graduation Criteria](#graduation-criteria) + * [Implementation History](#implementation-history) + * [Drawbacks [optional]](#drawbacks-optional) + * [Alternatives 
[optional]](#alternatives-optional) + * [Alternative quota-based implementation](#alternative-quota-based-implementation) + * [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) + * [Infrastructure Needed [optional]](#infrastructure-needed-optional) + [Tools for generating]: https://github.com/ekalinin/github-markdown-toc @@ -87,6 +89,58 @@ of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of restricting and monitoring filesystem consumption that can be applied to one or more directories. +### Project Quotas + +Project quotas are a form of filesystem quota that apply to arbitrary +groups of files, as opposed to file user or group ownership. They +were first implemented in XFS, as described here: +. + +Project quotas for ext4fs were [proposed in late +2014](https://lwn.net/Articles/623835/) and added to the Linux kernel +in early 2016, with +commit +[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e). +They were designed to be compatible with XFS project quotas. + +Each inode contains a 32-bit project ID, to which optionally quotas +(hard and soft limits for blocks and inodes) may be applied. The +total blocks and inodes for all files with the given project ID are +maintained by the kernel. Project quotas can be managed from +userspace by means of the xfs_quota(8) command in foreign filesystem +(`-f`) mode; the traditional Linux quota tools do not manipulate +project quotas. Programmatically, they are managed by the quotactl(2) +system call, using in part the standard quota commands and in part the +XFS quota commands; the man page implies incorrectly that the XFS +quota commands apply only to XFS filesystems. + +The project ID applied to a directory is inherited by files created +under it. Files cannot be (hard) linked across directories with +different project IDs. A file's project ID cannot be changed by a +non-privileged user, but a privileged user may use the xfs_io(8) +command to change the project ID of a file. + +Filesystems using project quotas may be mounted with quotas either +enforced or not; the non-enforcing mode tracks usage without enforcing +it. A non-enforcing project quota may be implemented on a filesystem +mounted with enforcing quotas by setting a quota too large to be hit. +The maximum size that can be set varies with the filesystem; on a +64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for +ext4fs. + +Conventionally, project quota mappings are stored in /etc/projects and +/etc/projid; these files exist for user convenience and do not have +any direct importance to the kernel. /etc/projects contains a mapping +from project ID to directory/file; this can be a one to many mapping +(the same project ID can apply to multiple directories or files, but +any given directory/file can be assigned only one project ID). +/etc/projid contains a mapping from named projects to project IDs. + +This proposal utilizes hard project quotas. Soft quotas are of no +utility; they allow for temporary overage that, after a programmable +period of time, is converted to the hard quota limit. + + ## Motivation The mechanism presently used to monitor storage consumption involves @@ -145,19 +199,34 @@ multi-tenant environments need these issues addressed. ### Goals +These goals apply only to local ephemeral storage, as described in +. 
+ * Primary: improve performance of monitoring by using project quotas in a non-enforcing way to collect information about storage - utilization. + utilization of ephemeral volumes. * Primary: detect storage used by pods that is concealed by deleted files being held open. * Primary: this will not interfere with the more common user and group quotas. * Stretch: enforce limits on per-volume storage consumption by using enforced project quotas. Each volume would be given an enforced - quota of the total ephemeral storage limit of the pod. + quota of the total ephemeral storage limit of the pod. _This will + only be done if a mechanism is devised to allow quota enforcement on + container writable layers; enforcement on emptydir volumes without + such on writable layers does not restrict the user._ If we cannot + do this, enforcing quotas will either be disabled or enabled by an + optional feature gate that is disabled by default. ### Non-Goals +* Application to storage other than local ephemeral storage. +* Elimination of eviction as a means of enforcing ephemeral-storage + limits. Pods that hit their ephemeral-storage limit will still be + evicted by the kubelet even if their storage has been capped by + enforcing quotas. +* Enforcing node allocatable (limit over the sum of all pod's disk + usage, including e. g. images). * Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit. -- cgit v1.2.3 From 3a19ee6516d18d18bfdbcab418781513bcfff326 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Thu, 13 Sep 2018 11:33:15 -0400 Subject: Link to PR --- keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index fb15703b..e4364b1d 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -30,11 +30,6 @@ superseded-by: ## Table of Contents -A table of contents is helpful for quickly jumping to sections of a -KEP and for highlighting any additional information provided beyond -the standard KEP template. [Tools for generating][https://github.com/ekalinin/github-markdown-toc] a table of -contents from markdown are available. - * [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) * [Table of Contents](#table-of-contents) * [Summary](#summary) @@ -89,6 +84,8 @@ of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of restricting and monitoring filesystem consumption that can be applied to one or more directories. +A prototype is in progress; see . 
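
As an illustration of the monitoring path sketched in this summary, the
following is a minimal, hypothetical example of reading the
kernel-maintained usage for a project ID via `quotactl(2)` (described in
the next section) rather than walking the volume. The constants mirror
`<linux/quota.h>`; the device path, project ID, and function names are
placeholders, not the proposed kubelet code, and the generic quota
command shown is the ext4-style form rather than the XFS-specific one.

```go
// Illustrative only: read the space consumption the kernel tracks for a
// project ID, instead of walking the directory with du/find.
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

const (
	qGetQuota = 0x800007 // Q_GETQUOTA from <linux/quota.h>
	prjQuota  = 2        // PRJQUOTA quota type
)

// ifDqblk mirrors struct if_dqblk from <linux/quota.h>.
type ifDqblk struct {
	BHardLimit uint64 // block hard limit, in 1 KiB units
	BSoftLimit uint64
	CurSpace   uint64 // current space consumption, in bytes
	IHardLimit uint64
	ISoftLimit uint64
	CurInodes  uint64
	BTime      uint64
	ITime      uint64
	Valid      uint32
}

// projectUsage returns the bytes charged to projectID on the filesystem
// backed by the given block device.
func projectUsage(device string, projectID uint32) (uint64, error) {
	dev, err := syscall.BytePtrFromString(device)
	if err != nil {
		return 0, err
	}
	var dq ifDqblk
	cmd := uintptr(qGetQuota)<<8 | uintptr(prjQuota) // QCMD(Q_GETQUOTA, PRJQUOTA)
	_, _, errno := syscall.Syscall6(syscall.SYS_QUOTACTL, cmd,
		uintptr(unsafe.Pointer(dev)), uintptr(projectID),
		uintptr(unsafe.Pointer(&dq)), 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return dq.CurSpace, nil
}

func main() {
	// Placeholder device and project ID for illustration only.
	used, err := projectUsage("/dev/sda1", 1048577)
	if err != nil {
		fmt.Println("quota read failed; caller would fall back to a directory walk:", err)
		return
	}
	fmt.Printf("project usage: %d bytes\n", used)
}
```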
+ ### Project Quotas Project quotas are a form of filesystem quota that apply to arbitrary -- cgit v1.2.3 From 3980916f6868c6e950f536d372d3fbd631ee45b1 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Mon, 17 Sep 2018 12:52:14 -0400 Subject: Update for use of feature gates --- .../0028-20180906-quotas-for-ephemeral-storage.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index e4364b1d..f8f9e698 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -38,6 +38,7 @@ superseded-by: * [Goals](#goals) * [Non-Goals](#non-goals) * [Proposal](#proposal) + * [Control over Use of Quotas](#control-over-use-of-quotas) * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) * [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) * [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) @@ -242,6 +243,23 @@ consistent with the desired limit. If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). +### Control over Use of Quotas + +At present, three feature gates control operation of quotas: + +* `LocalStorageCapacityIsolation` must be enabled for any use of + quotas. + +* `FSQuotaForLSCIMonitoring` must be enabled in addition. If this is + enabled, quotas are used for monitoring, but not enforcement. At + present, this defaults to False, but the intention is that this will + default to True by initial release. + +* `FSQuotaForLSCIEnforcement` must be enabled, in addition to + `FSQuotaForLSCIMonitoring`, to use quotas for enforcement. This + defaults to False and is expected to remain in that state for + initial release. + ### Operation Flow -- Applying a Quota * Caller (emptydir volume manager or container runtime) creates an -- cgit v1.2.3 From 45bafb954238a224ed4669dc92d5072b5aa7669a Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Wed, 19 Sep 2018 11:58:08 -0400 Subject: Formatting updates --- .../0028-20180906-quotas-for-ephemeral-storage.md | 80 +++++++++++----------- 1 file changed, 40 insertions(+), 40 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index f8f9e698..5eee069c 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -105,7 +105,7 @@ Each inode contains a 32-bit project ID, to which optionally quotas (hard and soft limits for blocks and inodes) may be applied. The total blocks and inodes for all files with the given project ID are maintained by the kernel. Project quotas can be managed from -userspace by means of the xfs_quota(8) command in foreign filesystem +userspace by means of the `xfs_quota(8)` command in foreign filesystem (`-f`) mode; the traditional Linux quota tools do not manipulate project quotas. Programmatically, they are managed by the quotactl(2) system call, using in part the standard quota commands and in part the @@ -126,13 +126,13 @@ The maximum size that can be set varies with the filesystem; on a 64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for ext4fs. 
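
As a small sketch of the preceding paragraph, a non-enforcing quota on a
filesystem mounted with enforcing quotas is simply a limit too large to
be hit. The helper below uses hypothetical names and is not the
proposed implementation; the maxima are the per-filesystem values
quoted above.

```go
// Hypothetical sketch: choose the block limit to program for a volume.
package quota

const (
	maxQuotaXFS  int64 = 1<<63 - 1 // 2^63-1 bytes
	maxQuotaExt4 int64 = 1<<58 - 1 // 2^58-1 bytes
)

// limitFor returns the byte limit to set for a volume.  limitBytes is
// the pod's ephemeral-storage limit; when not enforcing, the
// effectively unlimited per-filesystem maximum is used instead.
func limitFor(fsType string, limitBytes int64, enforcing bool) int64 {
	if enforcing && limitBytes > 0 {
		return limitBytes
	}
	if fsType == "xfs" {
		return maxQuotaXFS
	}
	return maxQuotaExt4
}
```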
-Conventionally, project quota mappings are stored in /etc/projects and -/etc/projid; these files exist for user convenience and do not have -any direct importance to the kernel. /etc/projects contains a mapping +Conventionally, project quota mappings are stored in `/etc/projects` and +`/etc/projid`; these files exist for user convenience and do not have +any direct importance to the kernel. `/etc/projects` contains a mapping from project ID to directory/file; this can be a one to many mapping (the same project ID can apply to multiple directories or files, but any given directory/file can be assigned only one project ID). -/etc/projid contains a mapping from named projects to project IDs. +`/etc/projid` contains a mapping from named projects to project IDs. This proposal utilizes hard project quotas. Soft quotas are of no utility; they allow for temporary overage that, after a programmable @@ -270,9 +270,9 @@ At present, three feature gates control operation of quotas: each quota provider (one per filesystem type) whether it can apply a quota to the directory. If no provider claims the directory, an error status is returned to the caller. -* Select an unused project ID (see [below](#selecting-a-project-id)). +* Select an unused project ID ([see below](#selecting-a-project-id)). * Set the desired limit on the project ID, in a filesystem-dependent - manner (see [below](#notes-on-implementation)). + manner ([see below](#notes-on-implementation)). * Apply the project ID to the directory in question, in a filesystem-dependent manner. @@ -288,7 +288,7 @@ simply ignore the error and proceed as today. quota code to compute the amount of storage used under the directory. * Determine whether a quota applies to the directory, in a - filesystem-dependent manner (see [below](#notes-on-implementation)). + filesystem-dependent manner ([see below](#notes-on-implementation)). * If so, determine how much storage or how many inodes are utilized, in a filesystem dependent manner. @@ -302,8 +302,8 @@ mechanism (such as the directory walk performed today). * Determine whether a project quota applies to the directory. * Remove the limit from the project ID associated with the directory. * Remove the association between the directory and the project ID. -* Return the project ID to the system to allow its use elsewhere (see - [below](#return-a-project-id-to-the-system). +* Return the project ID to the system to allow its use elsewhere ([see + below](#return-a-project-id-to-the-system)). * Caller may delete the directory and its contents (normally it will). ### Operation Notes @@ -318,8 +318,8 @@ assigned a unique project ID (unless it is desired to pool the storage use of multiple directories). The canonical mechanism to record persistently that a project ID is -reserved is to store it in the /etc/projid (projid(5)) and/or -/etc/projects (projects(5)) files. However, it is possible to utilize +reserved is to store it in the `/etc/projid` (projid[5]) and/or +`/etc/projects` (projects(5)) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a @@ -333,26 +333,26 @@ project ID is in active use on a given filesystem: The algorithm to be used is as follows: * Lock this instance of the quota code against re-entrancy. 
-* open and flock() the /etc/project and /etc/projid files, so that +* open and `flock()` the `/etc/project` and `/etc/projid` files, so that other uses of this code are excluded. * Start from a high number (the prototype uses 1048577). * Iterate from there, performing the following tests: * Is the ID reserved by this instance of the quota code? - * Is the ID present in /etc/projects? - * Is the ID present in /etc/projid? + * Is the ID present in `/etc/projects`? + * Is the ID present in `/etc/projid`? * Are the quota values and/or consumption reported by the kernel non-zero? This test is restricted to 128 iterations to ensure that a bug here or elsewhere does not result in an infinite loop looking for a quota ID. * If an ID has been found: - * Add it to an in-memory copy of /etc/projects and /etc/projid so + * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so that any other uses of project quotas do not reuse it. - * Write temporary copies of /etc/projects and /etc/projid that are - flock()ed + * Write temporary copies of `/etc/projects` and `/etc/projid` that are + `flock()`ed * If successful, rename the temporary files appropriately (if rename of one succeeds but the other fails, we have a problem that we cannot recover from, and the files may be inconsistent). -* Unlock /etc/projid and /etc/projects. +* Unlock `/etc/projid` and `/etc/projects`. * Unlock this instance of the quota code. A minor variation of this is used if we want to reuse an existing @@ -361,7 +361,7 @@ quota ID. #### Determine Whether a Project ID Applies To a Directory It is possible to determine whether a directory has a project ID -applied to it by requesting (via the quotactl(2) system call) the +applied to it by requesting (via the `quotactl(2)` system call) the project ID associated with the directory. Whie the specifics are filesystem-dependent, the basic method is the same for at least XFS and ext4fs. @@ -384,15 +384,15 @@ discussed above. The algorithm used to return a project ID to the system is very similar to the algorithm used to select a project ID, except of course for selecting a project ID. It performs the same sequence of locking -/etc/project and /etc/projid, editing a copy of the file, and +`/etc/project` and `/etc/projid`, editing a copy of the file, and restoring it. If the project ID is applied to multiple directories and the code can -determine that, it will not remove the project ID from /etc/projid +determine that, it will not remove the project ID from `/etc/projid` until the last reference is removed. While it is not anticipated in this KEP that this mode of operation will be used, at least initially, this can be detected even on kubelet restart by looking at the -reference count in /etc/projects. +reference count in `/etc/projects`. ### Implementation Details/Notes/Constraints [optional] @@ -402,7 +402,7 @@ reference count in /etc/projects. The primary new interface defined is the quota interface in `pkg/volume/util/quota/quota.go`. This defines five operations: -* Does the specified directory support quotas +* Does the specified directory support quotas? * Assign a quota to a directory. If a non-empty pod UID is provided, the quota assigned is that of any other directories under this pod @@ -437,8 +437,8 @@ code, with two exceptions: (that is handled by the provider). * An additional operation is provided to determine whether a given - quota ID is in use within the filesystem (outside of /etc/projects - and /etc/projid). 
+ quota ID is in use within the filesystem (outside of `/etc/projects` + and `/etc/projid`). The two quota providers in the initial implementation are in `pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While @@ -470,8 +470,8 @@ required elsewhere: `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to allow passing the desired size and pod UID (in the event we choose - to implement quotas shared between multiple volumes; see - [below](#alternative-quota-based-implementation)). This required + to implement quotas shared between multiple volumes; [see + below](#alternative-quota-based-implementation)). This required small changes to all volume plugins and their tests, but will in the future allow adding additional data without having to change code other than that which uses the new information. @@ -570,7 +570,7 @@ Major milestones in the life cycle of a KEP should be tracked in ``` mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ ``` - * An additional option (`prjquota`) must be applied in /etc/fstab + * An additional option (`prjquota`) must be applied in `/etc/fstab` * If the root filesystem is to be quota-enabled, it must be set in the grub options. * Use of project quotas for this purpose will preclude future use @@ -635,10 +635,10 @@ enforce storage utilization, there are a number of possible options: Another way of isolating storage is to utilize filesystems of pre-determined size, using the loop filesystem facility within Linux. -It is possible to create a file and run mkfs(8) on it, and then to +It is possible to create a file and run `mkfs(8)` on it, and then to mount that filesystem on the desired directory. This both limits the storage available within that directory and enables quick retrieval of -it via statfs(2). +it via `statfs(2)`. Cleanup of such a filesystem involves unmounting it and removing the backing file. @@ -650,11 +650,11 @@ thin provisioning. I conducted preliminary investigations into this. While at first it appeared promising, it turned out to have multiple critical flaws: -* If the filesystem is mounted without `discard`, it can grow to the - full size of the backing file, negating any possibility of thin - provisioning. If the file is created dense in the first place, - there is never any possibility of thin provisioning without use of - `discard`. +* If the filesystem is mounted without the `discard` option, it can + grow to the full size of the backing file, negating any possibility + of thin provisioning. If the file is created dense in the first + place, there is never any possibility of thin provisioning without + use of `discard`. If the backing file is created densely, it additionally may require significant time to create if the ephemeral limit is large. @@ -679,20 +679,20 @@ appeared promising, it turned out to have multiple critical flaws: # mkfs.ext4 /var/tmp/d1/fs2 # mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2 # dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576 - _...will normally succeed..._ + ...will normally succeed... # sync - _...fails with I/O error!..._ + ...fails with I/O error!... ``` * If the filesystem is mounted `sync`, all writes to it are immediately committed to the backing store, and the _dd_ operation - above fails as soon as it fills up _/var/tmp/d1_. However, + above fails as soon as it fills up `/var/tmp/d1`. 
However, performance is drastically slowed, particularly with small writes; with 1K writes, I observed performance degradation in some cases exceeding three orders of magnitude. I performed a test comparing writing 64 MB to a base (partitioned) - filesystem, to a loop filesystem without _sync_, and a loop + filesystem, to a loop filesystem without `sync`, and a loop filesystem with _sync. Total I/O was sufficient to run for at least 5 seconds in each case. All filesystems involved were XFS. Loop filesystems were 128 MB and dense. Times are in seconds. The -- cgit v1.2.3 From d0879693b34e66dbccacd4ecf94e8b0f34952688 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Tue, 2 Oct 2018 12:02:32 -0400 Subject: Further updates per SIG Node comments --- .../0028-20180906-quotas-for-ephemeral-storage.md | 142 ++++++++++++++++----- 1 file changed, 110 insertions(+), 32 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index 5eee069c..fd1633be 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -37,6 +37,7 @@ superseded-by: * [Motivation](#motivation) * [Goals](#goals) * [Non-Goals](#non-goals) + * [Future Work](#future-work) * [Proposal](#proposal) * [Control over Use of Quotas](#control-over-use-of-quotas) * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) @@ -58,12 +59,22 @@ superseded-by: * [Alternative quota-based implementation](#alternative-quota-based-implementation) * [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) * [Infrastructure Needed [optional]](#infrastructure-needed-optional) - + * [References](#references) + * [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) + * [CVE](#cve) + * [Other Security Issues Without CVE](#other-security-issues-without-cve) + * [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) [Tools for generating]: https://github.com/ekalinin/github-markdown-toc ## Summary +This proposal applies to the use of quotas for ephemeral-storage +metrics gathering. Use of quotas for ephemeral-storage limit +enforcement is a [non-goal](#non-goals), but as the architecture and +code will be very similar, there are comments interspersed related to +enforcement. _These comments will be italicized_. + Local storage capacity isolation, aka ephemeral-storage, was introduced into Kubernetes via . It provides @@ -80,9 +91,9 @@ latency (i. e. a pod could consume a lot of storage prior to the kubelet being aware of its overage and terminating it). The mechanism proposed here utilizes filesystem project quotas to -provide monitoring of resource consumption and optionally enforcement -of limits. Project quotas, initially in XFS and more recently ported -to ext4fs, offer a kernel-based means of restricting and monitoring +provide monitoring of resource consumption _and optionally enforcement +of limits._ Project quotas, initially in XFS and more recently ported +to ext4fs, offer a kernel-based means of monitoring _and restricting_ filesystem consumption that can be applied to one or more directories. A prototype is in progress; see . @@ -107,7 +118,7 @@ total blocks and inodes for all files with the given project ID are maintained by the kernel. 
Project quotas can be managed from userspace by means of the `xfs_quota(8)` command in foreign filesystem (`-f`) mode; the traditional Linux quota tools do not manipulate -project quotas. Programmatically, they are managed by the quotactl(2) +project quotas. Programmatically, they are managed by the `quotactl(2)` system call, using in part the standard quota commands and in part the XFS quota commands; the man page implies incorrectly that the XFS quota commands apply only to XFS filesystems. @@ -115,7 +126,7 @@ quota commands apply only to XFS filesystems. The project ID applied to a directory is inherited by files created under it. Files cannot be (hard) linked across directories with different project IDs. A file's project ID cannot be changed by a -non-privileged user, but a privileged user may use the xfs_io(8) +non-privileged user, but a privileged user may use the `xfs_io(8)` command to change the project ID of a file. Filesystems using project quotas may be mounted with quotas either @@ -134,9 +145,10 @@ from project ID to directory/file; this can be a one to many mapping any given directory/file can be assigned only one project ID). `/etc/projid` contains a mapping from named projects to project IDs. -This proposal utilizes hard project quotas. Soft quotas are of no -utility; they allow for temporary overage that, after a programmable -period of time, is converted to the hard quota limit. +This proposal utilizes hard project quotas for both monitoring _and +enforcement_. Soft quotas are of no utility; they allow for temporary +overage that, after a programmable period of time, is converted to the +hard quota limit. ## Motivation @@ -190,7 +202,8 @@ spec: more of data before the housekeeping performed once per minute catches up to it. If the primary volume is the root partition, this will completely fill the partition, possibly causing serious - problems elsewhere on the system. + problems elsewhere on the system. This proposal does not address + this issue; _a future enforcing project would_. In many environments, these issues may not matter, but shared multi-tenant environments need these issues addressed. @@ -207,14 +220,6 @@ These goals apply only to local ephemeral storage, as described in files being held open. * Primary: this will not interfere with the more common user and group quotas. -* Stretch: enforce limits on per-volume storage consumption by using - enforced project quotas. Each volume would be given an enforced - quota of the total ephemeral storage limit of the pod. _This will - only be done if a mechanism is devised to allow quota enforcement on - container writable layers; enforcement on emptydir volumes without - such on writable layers does not restrict the user._ If we cannot - do this, enforcing quotas will either be disabled or enabled by an - optional feature gate that is disabled by default. ### Non-Goals @@ -227,6 +232,11 @@ These goals apply only to local ephemeral storage, as described in usage, including e. g. images). * Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit. + +### Future Work + +* _Enforce limits on per-volume storage consumption by using + enforced project quotas._ ## Proposal @@ -238,8 +248,8 @@ and attaching the ID to one or more files. By default (and as utilized herein), if a project ID is attached to a directory, it is inherited by any files created under that directory. 
-If we elect to use the quota as enforcing, we impose a quota -consistent with the desired limit. If we elect to use it as +_If we elect to use the quota as enforcing, we impose a quota +consistent with the desired limit._ If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). @@ -258,7 +268,8 @@ At present, three feature gates control operation of quotas: * `FSQuotaForLSCIEnforcement` must be enabled, in addition to `FSQuotaForLSCIMonitoring`, to use quotas for enforcement. This defaults to False and is expected to remain in that state for - initial release. + initial release. _A future project to use quotas for enforcing may + change this default to True._ ### Operation Flow -- Applying a Quota @@ -318,8 +329,8 @@ assigned a unique project ID (unless it is desired to pool the storage use of multiple directories). The canonical mechanism to record persistently that a project ID is -reserved is to store it in the `/etc/projid` (projid[5]) and/or -`/etc/projects` (projects(5)) files. However, it is possible to utilize +reserved is to store it in the `/etc/projid` (`projid[5]`) and/or +`/etc/projects` (`projects(5)`) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a @@ -455,8 +466,8 @@ required elsewhere: * The operation executor needs to pass the desired size limit to the volume plugin where appropriate so that the volume plugin can impose - a quota. The limit is passed as 0 (do not use quotas), positive - number (impose an enforcing quota if possible, measured in bytes), + a quota. The limit is passed as 0 (do not use quotas), _positive + number (impose an enforcing quota if possible, measured in bytes),_ or -1 (impose a non-enforcing quota, if possible) on the volume. This requires changes to @@ -526,13 +537,9 @@ appropriate end to end tests. behavior. For example, if a quota is incorrectly applied it could result in ability to write no data at all to the volume. This could be mitigated by use of non-enforcing quotas. XFS in particular - offers the pqnoenforce mount option that makes all quotas + offers the `pqnoenforce` mount option that makes all quotas non-enforcing. - We should offer two feature gates, one to enable quotas at all (on - by default) and one to enable enforcing quotas (initially off, but - with intention of enabling in the near future). - ## Graduation Criteria @@ -685,7 +692,7 @@ appeared promising, it turned out to have multiple critical flaws: ``` * If the filesystem is mounted `sync`, all writes to it are - immediately committed to the backing store, and the _dd_ operation + immediately committed to the backing store, and the `dd` operation above fails as soon as it fills up `/var/tmp/d1`. However, performance is drastically slowed, particularly with small writes; with 1K writes, I observed performance degradation in some cases @@ -693,7 +700,7 @@ appeared promising, it turned out to have multiple critical flaws: I performed a test comparing writing 64 MB to a base (partitioned) filesystem, to a loop filesystem without `sync`, and a loop - filesystem with _sync. Total I/O was sufficient to run for at least + filesystem with `sync`. Total I/O was sufficient to run for at least 5 seconds in each case. All filesystems involved were XFS. Loop filesystems were 128 MB and dense. 
Times are in seconds. The erratic behavior (e. g. the 65536 case) was involved was observed @@ -729,3 +736,74 @@ appeared promising, it turned out to have multiple critical flaws: types are to be managed by different components, each such component needs access to the quota code. The code is substantial and should not be copied; it would more appropriately be vendored. + +## References + +### Bugs Opened Against Filesystem Quotas + +The following is a list of known security issues referencing +filesystem quotas on Linux, and other bugs referencing filesystem +quotas in Linux since 2012. These bugs are not necessarily in the +quota system. + +#### CVE + +* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel + before 3.3.6, when huge pages are enabled, allows local users to + cause a denial of service (system crash) or possibly gain privileges + by interacting with a hugetlbfs filesystem, as demonstrated by a + umount operation that triggers improper handling of quota data. + + The issue is actually related to huge pages, not quotas + specifically. The demonstration of the vulnerability resulted in + incorrect handling of quota data. + +* *CVE-2012-3417* The good\_client function in rquotad (rquota\_svc.c) + in Linux DiskQuota (aka quota) before 3.17 invokes the hosts\_ctl + function the first time without a host name, which might allow + remote attackers to bypass TCP Wrappers rules in hosts.deny (related + to rpc.rquotad; remote attackers might be able to bypass TCP + Wrappers rules). + + This issue is related to remote quota handling, which is not the use + case for the proposal at hand. + +#### Other Security Issues Without CVE + +* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and + Create Large Files](https://securitytracker.com/id/1002610) + + A setuid root binary inheriting file descriptors from an + unprivileged user process may write to the file without respecting + quota limits. If this issue is still present, it would allow a + setuid process to exceed any enforcing limits, but does not affect + the quota accounting (use of quotas for monitoring). + +### Other Linux Quota-Related Bugs Since 2012 + +* [ext4: report delalloc reserve as non-free in statfs mangled by + project quota](https://lore.kernel.org/patchwork/patch/884530/) + + This bug, fixed in Feb. 2018, properly accounts for reserved but not + committed space in project quotas. At this point I have not + determined the impact of this issue. + +* [XFS quota doesn't work after rebooting because of + crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730) + + This bug resulted in XFS quotas not working after a crash or forced + reboot. Under this proposal, Kubernetes would fall back to du for + monitoring should a bug of this nature manifest itself again. + +* [quota can show incorrect filesystem + name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527) + + This issue, which will not be fixed, results in the quota command + possibly printing an incorrect filesystem name when used on remote + filesystems. It is a display issue with the quota command, not a + quota bug at all, and does not result in incorrect quota information + being reported. As this proposal does not utilize the quota command + or rely on filesystem name, or currently use quotas on remote + filesystems, it should not be affected by this bug. + +In addition, the e2fsprogs have had numerous fixes over the years. 
-- cgit v1.2.3 From f6407579fd12fdea540c40f2281a74ad7cfbd25c Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Tue, 2 Oct 2018 14:30:33 -0400 Subject: Generate TOC via emacs markdown-toc --- .../0028-20180906-quotas-for-ephemeral-storage.md | 74 ++++++++++++---------- 1 file changed, 39 insertions(+), 35 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index fd1633be..74e6c03a 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -29,41 +29,45 @@ superseded-by: # Quotas for Ephemeral Storage ## Table of Contents - - * [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) - * [Table of Contents](#table-of-contents) - * [Summary](#summary) - * [Project Quotas](#project-quotas) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Future Work](#future-work) - * [Proposal](#proposal) - * [Control over Use of Quotas](#control-over-use-of-quotas) - * [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) - * [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) - * [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) - * [Operation Notes](#operation-notes) - * [Selecting a Project ID](#selecting-a-project-id) - * [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) - * [Return a Project ID To the System](#return-a-project-id-to-the-system) - * [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - * [Notes on Implementation](#notes-on-implementation) - * [Notes on Code Changes](#notes-on-code-changes) - * [Testing Strategy](#testing-strategy) - * [Risks and Mitigations](#risks-and-mitigations) - * [Graduation Criteria](#graduation-criteria) - * [Implementation History](#implementation-history) - * [Drawbacks [optional]](#drawbacks-optional) - * [Alternatives [optional]](#alternatives-optional) - * [Alternative quota-based implementation](#alternative-quota-based-implementation) - * [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) - * [Infrastructure Needed [optional]](#infrastructure-needed-optional) - * [References](#references) - * [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) - * [CVE](#cve) - * [Other Security Issues Without CVE](#other-security-issues-without-cve) - * [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) + +**Table of Contents** + +- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) + - [Table of Contents](#table-of-contents) + - [Summary](#summary) + - [Project Quotas](#project-quotas) + - [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Future Work](#future-work) + - [Proposal](#proposal) + - [Control over Use of Quotas](#control-over-use-of-quotas) + - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) + - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) + - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) + - [Operation Notes](#operation-notes) + - [Selecting a Project ID](#selecting-a-project-id) + - [Determine Whether a Project ID Applies To a 
Directory](#determine-whether-a-project-id-applies-to-a-directory) + - [Return a Project ID To the System](#return-a-project-id-to-the-system) + - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + - [Notes on Implementation](#notes-on-implementation) + - [Notes on Code Changes](#notes-on-code-changes) + - [Testing Strategy](#testing-strategy) + - [Risks and Mitigations](#risks-and-mitigations) + - [Graduation Criteria](#graduation-criteria) + - [Implementation History](#implementation-history) + - [Drawbacks [optional]](#drawbacks-optional) + - [Alternatives [optional]](#alternatives-optional) + - [Alternative quota-based implementation](#alternative-quota-based-implementation) + - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) + - [Infrastructure Needed [optional]](#infrastructure-needed-optional) + - [References](#references) + - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) + - [CVE](#cve) + - [Other Security Issues Without CVE](#other-security-issues-without-cve) + - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) + + [Tools for generating]: https://github.com/ekalinin/github-markdown-toc -- cgit v1.2.3 From e78892b01cca653eb74444eff986ad6b31370700 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Tue, 2 Oct 2018 18:41:14 -0400 Subject: Remove FSQuotaForLSCIEnforcement --- keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md index 74e6c03a..2564455f 100644 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md @@ -259,7 +259,7 @@ exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). ### Control over Use of Quotas -At present, three feature gates control operation of quotas: +At present, two feature gates control operation of quotas: * `LocalStorageCapacityIsolation` must be enabled for any use of quotas. @@ -269,11 +269,8 @@ At present, three feature gates control operation of quotas: present, this defaults to False, but the intention is that this will default to True by initial release. -* `FSQuotaForLSCIEnforcement` must be enabled, in addition to - `FSQuotaForLSCIMonitoring`, to use quotas for enforcement. This - defaults to False and is expected to remain in that state for - initial release. _A future project to use quotas for enforcing may - change this default to True._ +* _`FSQuotaForLSCIEnforcement` must be enabled, in addition to + `FSQuotaForLSCIMonitoring`, to use quotas for enforcement._ ### Operation Flow -- Applying a Quota @@ -762,8 +759,8 @@ quota system. specifically. The demonstration of the vulnerability resulted in incorrect handling of quota data. 
-* *CVE-2012-3417* The good\_client function in rquotad (rquota\_svc.c) - in Linux DiskQuota (aka quota) before 3.17 invokes the hosts\_ctl +* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c) + in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl function the first time without a host name, which might allow remote attackers to bypass TCP Wrappers rules in hosts.deny (related to rpc.rquotad; remote attackers might be able to bypass TCP -- cgit v1.2.3 From 8a7783631da738b8aac096521b628bf2858c75a1 Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Thu, 11 Oct 2018 17:50:47 -0400 Subject: Bump KEP number --- keps/NEXT_KEP_NUMBER | 2 +- .../0028-20180906-quotas-for-ephemeral-storage.md | 810 --------------------- .../0030-20180906-quotas-for-ephemeral-storage.md | 810 +++++++++++++++++++++ 3 files changed, 811 insertions(+), 811 deletions(-) delete mode 100644 keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md create mode 100644 keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER index 64bb6b74..e85087af 100644 --- a/keps/NEXT_KEP_NUMBER +++ b/keps/NEXT_KEP_NUMBER @@ -1 +1 @@ -30 +31 diff --git a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md deleted file mode 100644 index 2564455f..00000000 --- a/keps/sig-node/0028-20180906-quotas-for-ephemeral-storage.md +++ /dev/null @@ -1,810 +0,0 @@ ---- -kep-number: 0 -title: My First KEP -authors: - - "@janedoe" -owning-sig: sig-xxx -participating-sigs: - - sig-aaa - - sig-bbb -reviewers: - - TBD - - "@alicedoe" -approvers: - - TBD - - "@oscardoe" -editor: TBD -creation-date: yyyy-mm-dd -last-updated: yyyy-mm-dd -status: provisional -see-also: - - KEP-1 - - KEP-2 -replaces: - - KEP-3 -superseded-by: - - KEP-100 ---- - -# Quotas for Ephemeral Storage - -## Table of Contents - -**Table of Contents** - -- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Project Quotas](#project-quotas) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Future Work](#future-work) - - [Proposal](#proposal) - - [Control over Use of Quotas](#control-over-use-of-quotas) - - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) - - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) - - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) - - [Operation Notes](#operation-notes) - - [Selecting a Project ID](#selecting-a-project-id) - - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) - - [Return a Project ID To the System](#return-a-project-id-to-the-system) - - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) - - [Notes on Implementation](#notes-on-implementation) - - [Notes on Code Changes](#notes-on-code-changes) - - [Testing Strategy](#testing-strategy) - - [Risks and Mitigations](#risks-and-mitigations) - - [Graduation Criteria](#graduation-criteria) - - [Implementation History](#implementation-history) - - [Drawbacks [optional]](#drawbacks-optional) - - [Alternatives [optional]](#alternatives-optional) - - [Alternative quota-based implementation](#alternative-quota-based-implementation) - - [Alternative loop filesystem-based 
implementation](#alternative-loop-filesystem-based-implementation) - - [Infrastructure Needed [optional]](#infrastructure-needed-optional) - - [References](#references) - - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) - - [CVE](#cve) - - [Other Security Issues Without CVE](#other-security-issues-without-cve) - - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) - - - -[Tools for generating]: https://github.com/ekalinin/github-markdown-toc - -## Summary - -This proposal applies to the use of quotas for ephemeral-storage -metrics gathering. Use of quotas for ephemeral-storage limit -enforcement is a [non-goal](#non-goals), but as the architecture and -code will be very similar, there are comments interspersed related to -enforcement. _These comments will be italicized_. - -Local storage capacity isolation, aka ephemeral-storage, was -introduced into Kubernetes via -. It provides -support for capacity isolation of shared storage between pods, such -that a pod can be limited in its consumption of shared resources and -can be evicted if its consumption of shared storage exceeds that -limit. The limits and requests for shared ephemeral-storage are -similar to those for memory and CPU consumption. - -The current mechanism relies on periodically walking each ephemeral -volume (emptydir, logdir, or container writable layer) and summing the -space consumption. This method is slow, can be fooled, and has high -latency (i. e. a pod could consume a lot of storage prior to the -kubelet being aware of its overage and terminating it). - -The mechanism proposed here utilizes filesystem project quotas to -provide monitoring of resource consumption _and optionally enforcement -of limits._ Project quotas, initially in XFS and more recently ported -to ext4fs, offer a kernel-based means of monitoring _and restricting_ -filesystem consumption that can be applied to one or more directories. - -A prototype is in progress; see . - -### Project Quotas - -Project quotas are a form of filesystem quota that apply to arbitrary -groups of files, as opposed to file user or group ownership. They -were first implemented in XFS, as described here: -. - -Project quotas for ext4fs were [proposed in late -2014](https://lwn.net/Articles/623835/) and added to the Linux kernel -in early 2016, with -commit -[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e). -They were designed to be compatible with XFS project quotas. - -Each inode contains a 32-bit project ID, to which optionally quotas -(hard and soft limits for blocks and inodes) may be applied. The -total blocks and inodes for all files with the given project ID are -maintained by the kernel. Project quotas can be managed from -userspace by means of the `xfs_quota(8)` command in foreign filesystem -(`-f`) mode; the traditional Linux quota tools do not manipulate -project quotas. Programmatically, they are managed by the `quotactl(2)` -system call, using in part the standard quota commands and in part the -XFS quota commands; the man page implies incorrectly that the XFS -quota commands apply only to XFS filesystems. - -The project ID applied to a directory is inherited by files created -under it. Files cannot be (hard) linked across directories with -different project IDs. 
A file's project ID cannot be changed by a -non-privileged user, but a privileged user may use the `xfs_io(8)` -command to change the project ID of a file. - -Filesystems using project quotas may be mounted with quotas either -enforced or not; the non-enforcing mode tracks usage without enforcing -it. A non-enforcing project quota may be implemented on a filesystem -mounted with enforcing quotas by setting a quota too large to be hit. -The maximum size that can be set varies with the filesystem; on a -64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for -ext4fs. - -Conventionally, project quota mappings are stored in `/etc/projects` and -`/etc/projid`; these files exist for user convenience and do not have -any direct importance to the kernel. `/etc/projects` contains a mapping -from project ID to directory/file; this can be a one to many mapping -(the same project ID can apply to multiple directories or files, but -any given directory/file can be assigned only one project ID). -`/etc/projid` contains a mapping from named projects to project IDs. - -This proposal utilizes hard project quotas for both monitoring _and -enforcement_. Soft quotas are of no utility; they allow for temporary -overage that, after a programmable period of time, is converted to the -hard quota limit. - - -## Motivation - -The mechanism presently used to monitor storage consumption involves -use of `du` and `find` to periodically gather information about -storage and inode consumption of volumes. This mechanism suffers from -a number of drawbacks: - -* It is slow. If a volume contains a large number of files, walking - the directory can take a significant amount of time. There has been - at least one known report of nodes becoming not ready due to volume - metrics: -* It is possible to conceal a file from the walker by creating it and - removing it while holding an open file descriptor on it. POSIX - behavior is to not remove the file until the last open file - descriptor pointing to it is removed. This has legitimate uses; it - ensures that a temporary file is deleted when the processes using it - exit, and it minimizes the attack surface by not having a file that - can be found by an attacker. The following pod does this; it will - never be caught by the present mechanism: - -```yaml -apiVersion: v1 -kind: Pod -max: -metadata: - name: "diskhog" -spec: - containers: - - name: "perl" - resources: - limits: - ephemeral-storage: "2048Ki" - image: "perl" - command: - - perl - - -e - - > - my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999 - volumeMounts: - - name: a - mountPath: /data/a - volumes: - - name: a - emptyDir: {} -``` -* It is reactive rather than proactive. It does not prevent a pod - from overshooting its limit; at best it catches it after the fact. - On a fast storage medium, such as NVMe, a pod may write 50 GB or - more of data before the housekeeping performed once per minute - catches up to it. If the primary volume is the root partition, this - will completely fill the partition, possibly causing serious - problems elsewhere on the system. This proposal does not address - this issue; _a future enforcing project would_. - -In many environments, these issues may not matter, but shared -multi-tenant environments need these issues addressed. 
- -### Goals - -These goals apply only to local ephemeral storage, as described in -. - -* Primary: improve performance of monitoring by using project quotas - in a non-enforcing way to collect information about storage - utilization of ephemeral volumes. -* Primary: detect storage used by pods that is concealed by deleted - files being held open. -* Primary: this will not interfere with the more common user and group - quotas. - -### Non-Goals - -* Application to storage other than local ephemeral storage. -* Elimination of eviction as a means of enforcing ephemeral-storage - limits. Pods that hit their ephemeral-storage limit will still be - evicted by the kubelet even if their storage has been capped by - enforcing quotas. -* Enforcing node allocatable (limit over the sum of all pod's disk - usage, including e. g. images). -* Enforcing limits on total pod storage consumption by any means, such - that the pod would be hard restricted to the desired storage limit. - -### Future Work - -* _Enforce limits on per-volume storage consumption by using - enforced project quotas._ - -## Proposal - -This proposal applies project quotas to emptydir volumes on qualifying -filesystems (ext4fs and xfs with project quotas enabled). Project -quotas are applied by selecting an unused project ID (a 32-bit -unsigned integer), setting a limit on space and/or inode consumption, -and attaching the ID to one or more files. By default (and as -utilized herein), if a project ID is attached to a directory, it is -inherited by any files created under that directory. - -_If we elect to use the quota as enforcing, we impose a quota -consistent with the desired limit._ If we elect to use it as -non-enforcing, we impose a large quota that in practice cannot be -exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). - -### Control over Use of Quotas - -At present, two feature gates control operation of quotas: - -* `LocalStorageCapacityIsolation` must be enabled for any use of - quotas. - -* `FSQuotaForLSCIMonitoring` must be enabled in addition. If this is - enabled, quotas are used for monitoring, but not enforcement. At - present, this defaults to False, but the intention is that this will - default to True by initial release. - -* _`FSQuotaForLSCIEnforcement` must be enabled, in addition to - `FSQuotaForLSCIMonitoring`, to use quotas for enforcement._ - -### Operation Flow -- Applying a Quota - -* Caller (emptydir volume manager or container runtime) creates an - emptydir volume, with an empty directory at a location of its - choice. -* Caller requests that a quota be applied to a directory. -* Determine whether a quota can be imposed on the directory, by asking - each quota provider (one per filesystem type) whether it can apply a - quota to the directory. If no provider claims the directory, an - error status is returned to the caller. -* Select an unused project ID ([see below](#selecting-a-project-id)). -* Set the desired limit on the project ID, in a filesystem-dependent - manner ([see below](#notes-on-implementation)). -* Apply the project ID to the directory in question, in a - filesystem-dependent manner. - -An error at any point results in no quota being applied and no change -to the state of the system. The caller in general should not assume a -priori that the attempt will be successful. It could choose to reject -a request if a quota cannot be applied, but at this time it will -simply ignore the error and proceed as today. 
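
The flow above might look roughly like the following sketch. The
`Provider` interface and function names are hypothetical stand-ins for
the per-filesystem quota appliers described later in this document, not
the actual code.

```go
// Hypothetical sketch of the "apply a quota" flow listed above.
package quota

import "errors"

// Provider represents one quota provider per supported filesystem type
// (ext4fs, xfs).
type Provider interface {
	CanApply(dir string) bool
	SetLimit(projectID uint32, limitBytes int64) error
	AttachProject(dir string, projectID uint32) error
}

// ErrNotSupported tells the caller that no provider claimed the
// directory; the caller may ignore it and proceed as today.
var ErrNotSupported = errors.New("no quota provider claims this directory")

func ApplyQuota(dir string, limitBytes int64, providers []Provider,
	pickUnusedProjectID func() (uint32, error)) error {
	for _, p := range providers {
		if !p.CanApply(dir) {
			continue
		}
		id, err := pickUnusedProjectID() // see "Selecting a Project ID"
		if err != nil {
			return err
		}
		if err := p.SetLimit(id, limitBytes); err != nil {
			return err
		}
		return p.AttachProject(dir, id)
	}
	return ErrNotSupported
}
```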
- -### Operation Flow -- Retrieving Storage Consumption - -* Caller (kubelet metrics code, cadvisor, container runtime) asks the - quota code to compute the amount of storage used under the - directory. -* Determine whether a quota applies to the directory, in a - filesystem-dependent manner ([see below](#notes-on-implementation)). -* If so, determine how much storage or how many inodes are utilized, - in a filesystem dependent manner. - -If the quota code is unable to retrieve the consumption, it returns an -error status and it is up to the caller to utilize a fallback -mechanism (such as the directory walk performed today). - -### Operation Flow -- Removing a Quota. - -* Caller requests that the quota be removed from a directory. -* Determine whether a project quota applies to the directory. -* Remove the limit from the project ID associated with the directory. -* Remove the association between the directory and the project ID. -* Return the project ID to the system to allow its use elsewhere ([see - below](#return-a-project-id-to-the-system)). -* Caller may delete the directory and its contents (normally it will). - -### Operation Notes - -#### Selecting a Project ID - -Project IDs are a shared space within a filesystem. If the same -project ID is assigned to multiple directories, the space consumption -reported by the quota will be the sum of that of all of the -directories. Hence, it is important to ensure that each directory is -assigned a unique project ID (unless it is desired to pool the storage -use of multiple directories). - -The canonical mechanism to record persistently that a project ID is -reserved is to store it in the `/etc/projid` (`projid[5]`) and/or -`/etc/projects` (`projects(5)`) files. However, it is possible to utilize -project IDs without recording them in those files; they exist for -administrative convenience but neither the kernel nor the filesystem -is aware of them. Other ways can be used to determine whether a -project ID is in active use on a given filesystem: - -* The quota values (in blocks and/or inodes) assigned to the project - ID are non-zero. -* The storage consumption (in blocks and/or inodes) reported under the - project ID are non-zero. - -The algorithm to be used is as follows: - -* Lock this instance of the quota code against re-entrancy. -* open and `flock()` the `/etc/project` and `/etc/projid` files, so that - other uses of this code are excluded. -* Start from a high number (the prototype uses 1048577). -* Iterate from there, performing the following tests: - * Is the ID reserved by this instance of the quota code? - * Is the ID present in `/etc/projects`? - * Is the ID present in `/etc/projid`? - * Are the quota values and/or consumption reported by the kernel - non-zero? This test is restricted to 128 iterations to ensure - that a bug here or elsewhere does not result in an infinite loop - looking for a quota ID. -* If an ID has been found: - * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so - that any other uses of project quotas do not reuse it. - * Write temporary copies of `/etc/projects` and `/etc/projid` that are - `flock()`ed - * If successful, rename the temporary files appropriately (if - rename of one succeeds but the other fails, we have a problem - that we cannot recover from, and the files may be inconsistent). -* Unlock `/etc/projid` and `/etc/projects`. -* Unlock this instance of the quota code. - -A minor variation of this is used if we want to reuse an existing -quota ID. 
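
The locking portion of this algorithm might be sketched as follows. The
function names are hypothetical, and the individual in-use checks
(in-memory reservations, file contents, kernel-reported usage, and the
128-iteration bound) are elided.

```go
// Hypothetical sketch: take flock() on /etc/projects and /etc/projid,
// then scan upward from a high starting ID for an unused project ID.
package quota

import (
	"os"
	"syscall"
)

const firstProjectID = 1048577 // starting point used by the prototype

func lockFile(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// selectProjectID returns an unused project ID while holding both locks.
func selectProjectID(inUse func(id uint32) bool) (uint32, error) {
	projects, err := lockFile("/etc/projects")
	if err != nil {
		return 0, err
	}
	defer projects.Close() // closing the descriptor releases the flock

	projid, err := lockFile("/etc/projid")
	if err != nil {
		return 0, err
	}
	defer projid.Close()

	for id := uint32(firstProjectID); ; id++ {
		if !inUse(id) {
			// Caller records id in flock()ed temporary copies of
			// /etc/projects and /etc/projid before releasing the locks.
			return id, nil
		}
	}
}
```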
- -#### Determine Whether a Project ID Applies To a Directory - -It is possible to determine whether a directory has a project ID -applied to it by requesting (via the `quotactl(2)` system call) the -project ID associated with the directory. Whie the specifics are -filesystem-dependent, the basic method is the same for at least XFS -and ext4fs. - -It is not possible to determine in constant operations the directory -or directories to which a project ID is applied. It is possible to -determine whether a given project ID has been applied to an existing -directory or files (although those will not be known); the reported -consumption will be non-zero. - -The code records internally the project ID applied to a directory, but -it cannot always rely on this. In particular, if the kubelet has -exited and has been restarted (and hence the quota applying to the -directory should be removed), the map from directory to project ID is -lost. If it cannot find a map entry, it falls back on the approach -discussed above. - -#### Return a Project ID To the System - -The algorithm used to return a project ID to the system is very -similar to the algorithm used to select a project ID, except of course -for selecting a project ID. It performs the same sequence of locking -`/etc/project` and `/etc/projid`, editing a copy of the file, and -restoring it. - -If the project ID is applied to multiple directories and the code can -determine that, it will not remove the project ID from `/etc/projid` -until the last reference is removed. While it is not anticipated in -this KEP that this mode of operation will be used, at least initially, -this can be detected even on kubelet restart by looking at the -reference count in `/etc/projects`. - - -### Implementation Details/Notes/Constraints [optional] - -#### Notes on Implementation - -The primary new interface defined is the quota interface in -`pkg/volume/util/quota/quota.go`. This defines five operations: - -* Does the specified directory support quotas? - -* Assign a quota to a directory. If a non-empty pod UID is provided, - the quota assigned is that of any other directories under this pod - UID; if an empty pod UID is provided, a unique quota is assigned. - -* Retrieve the consumption of the specified directory. If the quota - code cannot handle it efficiently, it returns an error and the - caller falls back on existing mechanism. - -* Retrieve the inode consumption of the specified directory; same - description as above. - -* Remove quota from a directory. If a non-empty pod UID is passed, it - is checked against that recorded in-memory (if any). The quota is - removed from the specified directory. This can be used even if - AssignQuota has not been used; it inspects the directory and removes - the quota from it. This permits stale quotas from an interrupted - kubelet to be cleaned up. - -Two implementations are provided: `quota_linux.go` (for Linux) and -`quota_unsupported.go` (for other operating systems). The latter -returns an error for all requests. - -As the quota mechanism is intended to support multiple filesystems, -and different filesystems require different low level code for -manipulating quotas, a provider is supplied that finds an appropriate -quota applier implementation for the filesystem in question. The low -level quota applier provides similar operations to the top level quota -code, with two exceptions: - -* No operation exists to determine whether a quota can be applied - (that is handled by the provider). 
- -* An additional operation is provided to determine whether a given - quota ID is in use within the filesystem (outside of `/etc/projects` - and `/etc/projid`). - -The two quota providers in the initial implementation are in -`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While -some quota operations do require different system calls, a lot of the -code is common, and factored into -`pkg/volume/util/quota/common/quota_linux_common_impl.go`. - -#### Notes on Code Changes - -The prototype for this project is mostly self-contained within -`pkg/volume/util/quota` and a few changes to -`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were -required elsewhere: - -* The operation executor needs to pass the desired size limit to the - volume plugin where appropriate so that the volume plugin can impose - a quota. The limit is passed as 0 (do not use quotas), _positive - number (impose an enforcing quota if possible, measured in bytes),_ - or -1 (impose a non-enforcing quota, if possible) on the volume. - - This requires changes to - `pkg/volume/util/operationexecutor/operation_executor.go` (to add - `DesiredSizeLimit` to `VolumeToMount`), - `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and - `pkg/kubelet/eviction/helpers.go` (the latter in order to determine - whether the volume is a local ephemeral one). - -* The volume manager (in `pkg/volume/volume.go`) changes the - `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new - `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to - allow passing the desired size and pod UID (in the event we choose - to implement quotas shared between multiple volumes; [see - below](#alternative-quota-based-implementation)). This required - small changes to all volume plugins and their tests, but will in the - future allow adding additional data without having to change code - other than that which uses the new information. - -#### Testing Strategy - -The quota code is by an large not very amendable to unit tests. While -there are simple unit tests for parsing the mounts file, and there -could be tests for parsing the projects and projid files, the real -work (and risk) involves interactions with the kernel and with -multiple instances of this code (e. g. in the kubelet and the runtime -manager, particularly under stress). It also requires setup in the -form of a prepared filesystem. It would be better served by -appropriate end to end tests. - -### Risks and Mitigations - -* The SIG raised the possibility of a container being unable to exit - should we enforce quotas, and the quota interferes with writing the - log. This can be mitigated by either not applying a quota to the - log directory and using the du mechanism, or by applying a separate - non-enforcing quota to the log directory. - - As log directories are write-only by the container, and consumption - can be limited by other means (as the log is filtered by the - runtime), I do not consider the ability to write uncapped to the log - to be a serious exposure. - - Note in addition that even without quotas it is possible for writes - to fail due to lack of filesystem space, which is effectively (and - in some cases operationally) indistinguishable from exceeding quota, - so even at present code must be able to handle those situations. - -* Filesystem quotas may impact performance to an unknown degree. - Information on that is hard to come by in general, and one of the - reasons for using quotas is indeed to improve performance. 
If this - is a problem in the field, merely turning off quotas (or selectively - disabling project quotas) on the filesystem in question will avoid - the problem. Against the possibility that that cannot be done - (because project quotas are needed for other purposes), we should - provide a way to disable use of quotas altogether via a feature - gate. - - A report notes that an - unclean shutdown on Linux kernel versions between 3.11 and 3.17 can - result in a prolonged downtime while quota information is restored. - Unfortunately, [the link referenced - here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no - longer available. - -* Bugs in the quota code could result in a variety of regression - behavior. For example, if a quota is incorrectly applied it could - result in ability to write no data at all to the volume. This could - be mitigated by use of non-enforcing quotas. XFS in particular - offers the `pqnoenforce` mount option that makes all quotas - non-enforcing. - - -## Graduation Criteria - -How will we know that this has succeeded? Gathering user feedback is -crucial for building high quality experiences and SIGs have the -important responsibility of setting milestones for stability and -completeness. Hopefully the content previously contained in [umbrella -issues][] will be tracked in the `Graduation Criteria` section. - -[umbrella issues]: N/A - -## Implementation History - -Major milestones in the life cycle of a KEP should be tracked in -`Implementation History`. Major milestones might include - -- the `Summary` and `Motivation` sections being merged signaling SIG - acceptance -- the `Proposal` section being merged signaling agreement on a - proposed design -- the date implementation started -- the first Kubernetes release where an initial version of the KEP was - available -- the version of Kubernetes where the KEP graduated to general - availability -- when the KEP was retired or superseded - -## Drawbacks [optional] - -* Use of quotas, particularly the less commonly used project quotas, - requires additional action on the part of the administrator. In - particular: - * ext4fs filesystems must be created with additional options that - are not enabled by default: -``` -mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ -``` - * An additional option (`prjquota`) must be applied in `/etc/fstab` - * If the root filesystem is to be quota-enabled, it must be set in - the grub options. -* Use of project quotas for this purpose will preclude future use - within containers. - -## Alternatives [optional] - -I have considered two classes of alternatives: - -* Alternatives based on quotas, with different implementation - -* Alternatives based on loop filesystems without use of quotas - -### Alternative quota-based implementation - -Within the basic framework of using quotas to monitor and potentially -enforce storage utilization, there are a number of possible options: - -* Utilize per-volume non-enforcing quotas to monitor storage (the - first stage of this proposal). - - This mostly preserves the current behavior, but with more efficient - determination of storage utilization and the possibility of building - further on it. The one change from current behavior is the ability - to detect space used by deleted files. - -* Utilize per-volume enforcing quotas to monitor and enforce storage - (the second stage of this proposal). - - This allows partial enforcement of storage limits. 
As local storage - capacity isolation works at the level of the pod, and we have no - control of user utilization of ephemeral volumes, we would have to - give each volume a quota of the full limit. For example, if a pod - had a limit of 1 MB but had four ephemeral volumes mounted, it would - be possible for storage utilization to reach (at least temporarily) - 4MB before being capped. - -* Utilize per-pod enforcing user or group quotas to enforce storage - consumption, and per-volume non-enforcing quotas for monitoring. - - This would offer the best of both worlds: a fully capped storage - limit combined with efficient reporting. However, it would require - each pod to run under a distinct UID or GID. This may prevent pods - from using setuid or setgid or their variants, and would interfere - with any other use of group or user quotas within Kubernetes. - -* Utilize per-pod enforcing quotas to monitor and enforce storage. - - This allows for full enforcement of storage limits, at the expense - of being able to efficiently monitor per-volume storage - consumption. As there have already been reports of monitoring - causing trouble, I do not advise this option. - - A variant of this would report (1/N) storage for each covered - volume, so with a pod with a 4MiB quota and 1MiB total consumption, - spread across 4 ephemeral volumes, each volume would report a - consumption of 256 KiB. Another variant would change the API to - report statistics for all ephemeral volumes combined. I do not - advise this option. - -### Alternative loop filesystem-based implementation - -Another way of isolating storage is to utilize filesystems of -pre-determined size, using the loop filesystem facility within Linux. -It is possible to create a file and run `mkfs(8)` on it, and then to -mount that filesystem on the desired directory. This both limits the -storage available within that directory and enables quick retrieval of -it via `statfs(2)`. - -Cleanup of such a filesystem involves unmounting it and removing the -backing file. - -The backing file can be created as a sparse file, and the `discard` -option can be used to return unused space to the system, allowing for -thin provisioning. - -I conducted preliminary investigations into this. While at first it -appeared promising, it turned out to have multiple critical flaws: - -* If the filesystem is mounted without the `discard` option, it can - grow to the full size of the backing file, negating any possibility - of thin provisioning. If the file is created dense in the first - place, there is never any possibility of thin provisioning without - use of `discard`. - - If the backing file is created densely, it additionally may require - significant time to create if the ephemeral limit is large. - -* If the filesystem is mounted `nosync`, and is sparse, it is possible - for writes to succeed and then fail later with I/O errors when - synced to the backing storage. This will lead to data corruption - that cannot be detected at the time of write. - - This can easily be reproduced by e. g. creating a 64MB filesystem - and within it creating a 128MB sparse file and building a filesystem - on it. 
When that filesystem is in turn mounted, writes to it will - succeed, but I/O errors will be seen in the log and the file will be - incomplete: - -``` -# mkdir /var/tmp/d1 /var/tmp/d2 -# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383 -# mkfs.ext4 /var/tmp/fs1 -# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1 -# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767 -# mkfs.ext4 /var/tmp/d1/fs2 -# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2 -# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576 - ...will normally succeed... -# sync - ...fails with I/O error!... -``` - -* If the filesystem is mounted `sync`, all writes to it are - immediately committed to the backing store, and the `dd` operation - above fails as soon as it fills up `/var/tmp/d1`. However, - performance is drastically slowed, particularly with small writes; - with 1K writes, I observed performance degradation in some cases - exceeding three orders of magnitude. - - I performed a test comparing writing 64 MB to a base (partitioned) - filesystem, to a loop filesystem without `sync`, and a loop - filesystem with `sync`. Total I/O was sufficient to run for at least - 5 seconds in each case. All filesystems involved were XFS. Loop - filesystems were 128 MB and dense. Times are in seconds. The - erratic behavior (e. g. the 65536 case) was involved was observed - repeatedly, although the exact amount of time and which I/O sizes - were affected varied. The underlying device was an HP EX920 1TB - NVMe SSD. - -| I/O Size | Partition | Loop w/sync | Loop w/o sync | -| ---: | ---: | ---: | ---: | -| 1024 | 0.104 | 0.120 | 140.390 | -| 4096 | 0.045 | 0.077 | 21.850 | -| 16384 | 0.045 | 0.067 | 5.550 | -| 65536 | 0.044 | 0.061 | 20.440 | -| 262144 | 0.043 | 0.087 | 0.545 | -| 1048576 | 0.043 | 0.055 | 7.490 | -| 4194304 | 0.043 | 0.053 | 0.587 | - - The only potentially viable combination in my view would be a dense - loop filesystem without sync, but that would render any thin - provisioning impossible. - -## Infrastructure Needed [optional] - -* Decision: who is responsible for quota management of all volume - types (and especially ephemeral volumes of all types). At present, - emptydir volumes are managed by the kubelet and logdirs and writable - layers by either the kubelet or the runtime, depending upon the - choice of runtime. Beyond the specific proposal that the runtime - should manage quotas for volumes it creates, there are broader - issues that I request assistance from the SIG in addressing. - -* Location of the quota code. If the quotas for different volume - types are to be managed by different components, each such component - needs access to the quota code. The code is substantial and should - not be copied; it would more appropriately be vendored. - -## References - -### Bugs Opened Against Filesystem Quotas - -The following is a list of known security issues referencing -filesystem quotas on Linux, and other bugs referencing filesystem -quotas in Linux since 2012. These bugs are not necessarily in the -quota system. - -#### CVE - -* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel - before 3.3.6, when huge pages are enabled, allows local users to - cause a denial of service (system crash) or possibly gain privileges - by interacting with a hugetlbfs filesystem, as demonstrated by a - umount operation that triggers improper handling of quota data. - - The issue is actually related to huge pages, not quotas - specifically. 
The demonstration of the vulnerability resulted in - incorrect handling of quota data. - -* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c) - in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl - function the first time without a host name, which might allow - remote attackers to bypass TCP Wrappers rules in hosts.deny (related - to rpc.rquotad; remote attackers might be able to bypass TCP - Wrappers rules). - - This issue is related to remote quota handling, which is not the use - case for the proposal at hand. - -#### Other Security Issues Without CVE - -* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and - Create Large Files](https://securitytracker.com/id/1002610) - - A setuid root binary inheriting file descriptors from an - unprivileged user process may write to the file without respecting - quota limits. If this issue is still present, it would allow a - setuid process to exceed any enforcing limits, but does not affect - the quota accounting (use of quotas for monitoring). - -### Other Linux Quota-Related Bugs Since 2012 - -* [ext4: report delalloc reserve as non-free in statfs mangled by - project quota](https://lore.kernel.org/patchwork/patch/884530/) - - This bug, fixed in Feb. 2018, properly accounts for reserved but not - committed space in project quotas. At this point I have not - determined the impact of this issue. - -* [XFS quota doesn't work after rebooting because of - crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730) - - This bug resulted in XFS quotas not working after a crash or forced - reboot. Under this proposal, Kubernetes would fall back to du for - monitoring should a bug of this nature manifest itself again. - -* [quota can show incorrect filesystem - name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527) - - This issue, which will not be fixed, results in the quota command - possibly printing an incorrect filesystem name when used on remote - filesystems. It is a display issue with the quota command, not a - quota bug at all, and does not result in incorrect quota information - being reported. As this proposal does not utilize the quota command - or rely on filesystem name, or currently use quotas on remote - filesystems, it should not be affected by this bug. - -In addition, the e2fsprogs have had numerous fixes over the years. 
diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md new file mode 100644 index 00000000..2564455f --- /dev/null +++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md @@ -0,0 +1,810 @@ +--- +kep-number: 0 +title: My First KEP +authors: + - "@janedoe" +owning-sig: sig-xxx +participating-sigs: + - sig-aaa + - sig-bbb +reviewers: + - TBD + - "@alicedoe" +approvers: + - TBD + - "@oscardoe" +editor: TBD +creation-date: yyyy-mm-dd +last-updated: yyyy-mm-dd +status: provisional +see-also: + - KEP-1 + - KEP-2 +replaces: + - KEP-3 +superseded-by: + - KEP-100 +--- + +# Quotas for Ephemeral Storage + +## Table of Contents + +**Table of Contents** + +- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage) + - [Table of Contents](#table-of-contents) + - [Summary](#summary) + - [Project Quotas](#project-quotas) + - [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Future Work](#future-work) + - [Proposal](#proposal) + - [Control over Use of Quotas](#control-over-use-of-quotas) + - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota) + - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption) + - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota) + - [Operation Notes](#operation-notes) + - [Selecting a Project ID](#selecting-a-project-id) + - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory) + - [Return a Project ID To the System](#return-a-project-id-to-the-system) + - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional) + - [Notes on Implementation](#notes-on-implementation) + - [Notes on Code Changes](#notes-on-code-changes) + - [Testing Strategy](#testing-strategy) + - [Risks and Mitigations](#risks-and-mitigations) + - [Graduation Criteria](#graduation-criteria) + - [Implementation History](#implementation-history) + - [Drawbacks [optional]](#drawbacks-optional) + - [Alternatives [optional]](#alternatives-optional) + - [Alternative quota-based implementation](#alternative-quota-based-implementation) + - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation) + - [Infrastructure Needed [optional]](#infrastructure-needed-optional) + - [References](#references) + - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas) + - [CVE](#cve) + - [Other Security Issues Without CVE](#other-security-issues-without-cve) + - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012) + + + +[Tools for generating]: https://github.com/ekalinin/github-markdown-toc + +## Summary + +This proposal applies to the use of quotas for ephemeral-storage +metrics gathering. Use of quotas for ephemeral-storage limit +enforcement is a [non-goal](#non-goals), but as the architecture and +code will be very similar, there are comments interspersed related to +enforcement. _These comments will be italicized_. + +Local storage capacity isolation, aka ephemeral-storage, was +introduced into Kubernetes via +. It provides +support for capacity isolation of shared storage between pods, such +that a pod can be limited in its consumption of shared resources and +can be evicted if its consumption of shared storage exceeds that +limit. 
The limits and requests for shared ephemeral-storage are +similar to those for memory and CPU consumption. + +The current mechanism relies on periodically walking each ephemeral +volume (emptydir, logdir, or container writable layer) and summing the +space consumption. This method is slow, can be fooled, and has high +latency (i. e. a pod could consume a lot of storage prior to the +kubelet being aware of its overage and terminating it). + +The mechanism proposed here utilizes filesystem project quotas to +provide monitoring of resource consumption _and optionally enforcement +of limits._ Project quotas, initially in XFS and more recently ported +to ext4fs, offer a kernel-based means of monitoring _and restricting_ +filesystem consumption that can be applied to one or more directories. + +A prototype is in progress; see . + +### Project Quotas + +Project quotas are a form of filesystem quota that apply to arbitrary +groups of files, as opposed to file user or group ownership. They +were first implemented in XFS, as described here: +. + +Project quotas for ext4fs were [proposed in late +2014](https://lwn.net/Articles/623835/) and added to the Linux kernel +in early 2016, with +commit +[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e). +They were designed to be compatible with XFS project quotas. + +Each inode contains a 32-bit project ID, to which optionally quotas +(hard and soft limits for blocks and inodes) may be applied. The +total blocks and inodes for all files with the given project ID are +maintained by the kernel. Project quotas can be managed from +userspace by means of the `xfs_quota(8)` command in foreign filesystem +(`-f`) mode; the traditional Linux quota tools do not manipulate +project quotas. Programmatically, they are managed by the `quotactl(2)` +system call, using in part the standard quota commands and in part the +XFS quota commands; the man page implies incorrectly that the XFS +quota commands apply only to XFS filesystems. + +The project ID applied to a directory is inherited by files created +under it. Files cannot be (hard) linked across directories with +different project IDs. A file's project ID cannot be changed by a +non-privileged user, but a privileged user may use the `xfs_io(8)` +command to change the project ID of a file. + +Filesystems using project quotas may be mounted with quotas either +enforced or not; the non-enforcing mode tracks usage without enforcing +it. A non-enforcing project quota may be implemented on a filesystem +mounted with enforcing quotas by setting a quota too large to be hit. +The maximum size that can be set varies with the filesystem; on a +64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for +ext4fs. + +Conventionally, project quota mappings are stored in `/etc/projects` and +`/etc/projid`; these files exist for user convenience and do not have +any direct importance to the kernel. `/etc/projects` contains a mapping +from project ID to directory/file; this can be a one to many mapping +(the same project ID can apply to multiple directories or files, but +any given directory/file can be assigned only one project ID). +`/etc/projid` contains a mapping from named projects to project IDs. + +This proposal utilizes hard project quotas for both monitoring _and +enforcement_. 
Soft quotas are of no utility; they allow for temporary
+overage that, after a programmable period of time, is converted to the
+hard quota limit.
+
+
+## Motivation
+
+The mechanism presently used to monitor storage consumption involves
+use of `du` and `find` to periodically gather information about
+storage and inode consumption of volumes.  This mechanism suffers from
+a number of drawbacks:
+
+* It is slow.  If a volume contains a large number of files, walking
+  the directory can take a significant amount of time.  There has been
+  at least one known report of nodes becoming not ready due to volume
+  metrics:
+* It is possible to conceal a file from the walker by creating it and
+  removing it while holding an open file descriptor on it.  POSIX
+  behavior is to not remove the file until the last open file
+  descriptor pointing to it is removed.  This has legitimate uses; it
+  ensures that a temporary file is deleted when the processes using it
+  exit, and it minimizes the attack surface by not having a file that
+  can be found by an attacker.  The following pod does this; it will
+  never be caught by the present mechanism:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: "diskhog"
+spec:
+  containers:
+  - name: "perl"
+    resources:
+      limits:
+        ephemeral-storage: "2048Ki"
+    image: "perl"
+    command:
+    - perl
+    - -e
+    - >
+      my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
+    volumeMounts:
+    - name: a
+      mountPath: /data/a
+  volumes:
+  - name: a
+    emptyDir: {}
+```
+* It is reactive rather than proactive.  It does not prevent a pod
+  from overshooting its limit; at best it catches it after the fact.
+  On a fast storage medium, such as NVMe, a pod may write 50 GB or
+  more of data before the housekeeping performed once per minute
+  catches up to it.  If the primary volume is the root partition, this
+  will completely fill the partition, possibly causing serious
+  problems elsewhere on the system.  This proposal does not address
+  this issue; _a future enforcing implementation would_.
+
+In many environments, these issues may not matter, but shared
+multi-tenant environments need these issues addressed.
+
+### Goals
+
+These goals apply only to local ephemeral storage, as described in
+.
+
+* Primary: improve performance of monitoring by using project quotas
+  in a non-enforcing way to collect information about storage
+  utilization of ephemeral volumes.
+* Primary: detect storage used by pods that is concealed by deleted
+  files being held open.
+* Primary: this will not interfere with the more common user and group
+  quotas.
+
+### Non-Goals
+
+* Application to storage other than local ephemeral storage.
+* Elimination of eviction as a means of enforcing ephemeral-storage
+  limits.  Pods that hit their ephemeral-storage limit will still be
+  evicted by the kubelet even if their storage has been capped by
+  enforcing quotas.
+* Enforcing node allocatable (limit over the sum of all pod's disk
+  usage, including e. g. images).
+* Enforcing limits on total pod storage consumption by any means, such
+  that the pod would be hard restricted to the desired storage limit.
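As background for the goals above, the sketch below (not the kubelet's actual code) shows the essence of the `du`-style walk that project quotas are intended to replace.  It must visit every file under the volume, and by construction it never sees space held only by deleted-but-open files such as the one created by the `diskhog` pod in the Motivation; the names in the sketch are illustrative.

```go
// Minimal sketch of du-style accounting.  Illustrative only: it shows why a
// directory walk is slow on large trees and blind to unlinked-but-open files.
package quotademo

import (
	"os"
	"path/filepath"
	"syscall"
)

// walkUsage returns the bytes consumed under root as a directory walk sees
// them.  Space held only by files that have been unlinked while a process
// still holds them open is never counted, because those files no longer have
// a name anywhere under root.
func walkUsage(root string) (int64, error) {
	var used int64
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if st, ok := info.Sys().(*syscall.Stat_t); ok {
			used += st.Blocks * 512 // st_blocks is in 512-byte units
		}
		return nil
	})
	return used, err
}
```

A project quota, by contrast, returns the equivalent information from a single `quotactl(2)` query and includes space held by deleted files that are still open.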
+ +### Future Work + +* _Enforce limits on per-volume storage consumption by using + enforced project quotas._ + +## Proposal + +This proposal applies project quotas to emptydir volumes on qualifying +filesystems (ext4fs and xfs with project quotas enabled). Project +quotas are applied by selecting an unused project ID (a 32-bit +unsigned integer), setting a limit on space and/or inode consumption, +and attaching the ID to one or more files. By default (and as +utilized herein), if a project ID is attached to a directory, it is +inherited by any files created under that directory. + +_If we elect to use the quota as enforcing, we impose a quota +consistent with the desired limit._ If we elect to use it as +non-enforcing, we impose a large quota that in practice cannot be +exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs). + +### Control over Use of Quotas + +At present, two feature gates control operation of quotas: + +* `LocalStorageCapacityIsolation` must be enabled for any use of + quotas. + +* `FSQuotaForLSCIMonitoring` must be enabled in addition. If this is + enabled, quotas are used for monitoring, but not enforcement. At + present, this defaults to False, but the intention is that this will + default to True by initial release. + +* _`FSQuotaForLSCIEnforcement` must be enabled, in addition to + `FSQuotaForLSCIMonitoring`, to use quotas for enforcement._ + +### Operation Flow -- Applying a Quota + +* Caller (emptydir volume manager or container runtime) creates an + emptydir volume, with an empty directory at a location of its + choice. +* Caller requests that a quota be applied to a directory. +* Determine whether a quota can be imposed on the directory, by asking + each quota provider (one per filesystem type) whether it can apply a + quota to the directory. If no provider claims the directory, an + error status is returned to the caller. +* Select an unused project ID ([see below](#selecting-a-project-id)). +* Set the desired limit on the project ID, in a filesystem-dependent + manner ([see below](#notes-on-implementation)). +* Apply the project ID to the directory in question, in a + filesystem-dependent manner. + +An error at any point results in no quota being applied and no change +to the state of the system. The caller in general should not assume a +priori that the attempt will be successful. It could choose to reject +a request if a quota cannot be applied, but at this time it will +simply ignore the error and proceed as today. + +### Operation Flow -- Retrieving Storage Consumption + +* Caller (kubelet metrics code, cadvisor, container runtime) asks the + quota code to compute the amount of storage used under the + directory. +* Determine whether a quota applies to the directory, in a + filesystem-dependent manner ([see below](#notes-on-implementation)). +* If so, determine how much storage or how many inodes are utilized, + in a filesystem dependent manner. + +If the quota code is unable to retrieve the consumption, it returns an +error status and it is up to the caller to utilize a fallback +mechanism (such as the directory walk performed today). + +### Operation Flow -- Removing a Quota. + +* Caller requests that the quota be removed from a directory. +* Determine whether a project quota applies to the directory. +* Remove the limit from the project ID associated with the directory. +* Remove the association between the directory and the project ID. 
+* Return the project ID to the system to allow its use elsewhere ([see
+  below](#return-a-project-id-to-the-system)).
+* Caller may delete the directory and its contents (normally it will).
+
+### Operation Notes
+
+#### Selecting a Project ID
+
+Project IDs are a shared space within a filesystem.  If the same
+project ID is assigned to multiple directories, the space consumption
+reported by the quota will be the sum of that of all of the
+directories.  Hence, it is important to ensure that each directory is
+assigned a unique project ID (unless it is desired to pool the storage
+use of multiple directories).
+
+The canonical mechanism to record persistently that a project ID is
+reserved is to store it in the `/etc/projid` (`projid(5)`) and/or
+`/etc/projects` (`projects(5)`) files.  However, it is possible to utilize
+project IDs without recording them in those files; they exist for
+administrative convenience but neither the kernel nor the filesystem
+is aware of them.  Other ways can be used to determine whether a
+project ID is in active use on a given filesystem:
+
+* The quota values (in blocks and/or inodes) assigned to the project
+  ID are non-zero.
+* The storage consumption (in blocks and/or inodes) reported under the
+  project ID is non-zero.
+
+The algorithm to be used is as follows:
+
+* Lock this instance of the quota code against re-entrancy.
+* Open and `flock()` the `/etc/projects` and `/etc/projid` files, so that
+  other uses of this code are excluded.
+* Start from a high number (the prototype uses 1048577).
+* Iterate from there, performing the following tests:
+  * Is the ID reserved by this instance of the quota code?
+  * Is the ID present in `/etc/projects`?
+  * Is the ID present in `/etc/projid`?
+  * Are the quota values and/or consumption reported by the kernel
+    non-zero?  This test is restricted to 128 iterations to ensure
+    that a bug here or elsewhere does not result in an infinite loop
+    looking for a quota ID.
+* If an ID has been found:
+  * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so
+    that any other uses of project quotas do not reuse it.
+  * Write temporary copies of `/etc/projects` and `/etc/projid` that are
+    `flock()`ed.
+  * If successful, rename the temporary files appropriately (if
+    rename of one succeeds but the other fails, we have a problem
+    that we cannot recover from, and the files may be inconsistent).
+* Unlock `/etc/projid` and `/etc/projects`.
+* Unlock this instance of the quota code.
+
+A minor variation of this is used if we want to reuse an existing
+quota ID.
+
+#### Determine Whether a Project ID Applies To a Directory
+
+It is possible to determine whether a directory has a project ID
+applied to it by requesting (via the `quotactl(2)` system call) the
+project ID associated with the directory.  While the specifics are
+filesystem-dependent, the basic method is the same for at least XFS
+and ext4fs.
+
+It is not possible to determine, in a constant number of operations,
+the directory or directories to which a project ID is applied.  It is
+possible to determine whether a given project ID has been applied to
+an existing directory or files (although the specific directories will
+not be known); the reported consumption will be non-zero.
+
+The code records internally the project ID applied to a directory, but
+it cannot always rely on this.  In particular, if the kubelet has
+exited and has been restarted (and hence the quota applying to the
+directory should be removed), the map from directory to project ID is
+lost.
If it cannot find a map entry, it falls back on the approach +discussed above. + +#### Return a Project ID To the System + +The algorithm used to return a project ID to the system is very +similar to the algorithm used to select a project ID, except of course +for selecting a project ID. It performs the same sequence of locking +`/etc/project` and `/etc/projid`, editing a copy of the file, and +restoring it. + +If the project ID is applied to multiple directories and the code can +determine that, it will not remove the project ID from `/etc/projid` +until the last reference is removed. While it is not anticipated in +this KEP that this mode of operation will be used, at least initially, +this can be detected even on kubelet restart by looking at the +reference count in `/etc/projects`. + + +### Implementation Details/Notes/Constraints [optional] + +#### Notes on Implementation + +The primary new interface defined is the quota interface in +`pkg/volume/util/quota/quota.go`. This defines five operations: + +* Does the specified directory support quotas? + +* Assign a quota to a directory. If a non-empty pod UID is provided, + the quota assigned is that of any other directories under this pod + UID; if an empty pod UID is provided, a unique quota is assigned. + +* Retrieve the consumption of the specified directory. If the quota + code cannot handle it efficiently, it returns an error and the + caller falls back on existing mechanism. + +* Retrieve the inode consumption of the specified directory; same + description as above. + +* Remove quota from a directory. If a non-empty pod UID is passed, it + is checked against that recorded in-memory (if any). The quota is + removed from the specified directory. This can be used even if + AssignQuota has not been used; it inspects the directory and removes + the quota from it. This permits stale quotas from an interrupted + kubelet to be cleaned up. + +Two implementations are provided: `quota_linux.go` (for Linux) and +`quota_unsupported.go` (for other operating systems). The latter +returns an error for all requests. + +As the quota mechanism is intended to support multiple filesystems, +and different filesystems require different low level code for +manipulating quotas, a provider is supplied that finds an appropriate +quota applier implementation for the filesystem in question. The low +level quota applier provides similar operations to the top level quota +code, with two exceptions: + +* No operation exists to determine whether a quota can be applied + (that is handled by the provider). + +* An additional operation is provided to determine whether a given + quota ID is in use within the filesystem (outside of `/etc/projects` + and `/etc/projid`). + +The two quota providers in the initial implementation are in +`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While +some quota operations do require different system calls, a lot of the +code is common, and factored into +`pkg/volume/util/quota/common/quota_linux_common_impl.go`. + +#### Notes on Code Changes + +The prototype for this project is mostly self-contained within +`pkg/volume/util/quota` and a few changes to +`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were +required elsewhere: + +* The operation executor needs to pass the desired size limit to the + volume plugin where appropriate so that the volume plugin can impose + a quota. 
The limit is passed as 0 (do not use quotas), _positive
+  number (impose an enforcing quota if possible, measured in bytes),_
+  or -1 (impose a non-enforcing quota, if possible) on the volume.
+
+  This requires changes to
+  `pkg/volume/util/operationexecutor/operation_executor.go` (to add
+  `DesiredSizeLimit` to `VolumeToMount`),
+  `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and
+  `pkg/kubelet/eviction/helpers.go` (the latter in order to determine
+  whether the volume is a local ephemeral one).
+
+* The volume manager (in `pkg/volume/volume.go`) changes the
+  `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new
+  `MounterArgs` type rather than an `FsGroup` (`*int64`).  This is to
+  allow passing the desired size and pod UID (in the event we choose
+  to implement quotas shared between multiple volumes; [see
+  below](#alternative-quota-based-implementation)).  This required
+  small changes to all volume plugins and their tests, but will in the
+  future allow adding additional data without having to change code
+  other than that which uses the new information.
+
+#### Testing Strategy
+
+The quota code is by and large not very amenable to unit tests.  While
+there are simple unit tests for parsing the mounts file, and there
+could be tests for parsing the projects and projid files, the real
+work (and risk) involves interactions with the kernel and with
+multiple instances of this code (e. g. in the kubelet and the runtime
+manager, particularly under stress).  It also requires setup in the
+form of a prepared filesystem.  It would be better served by
+appropriate end-to-end tests.
+
+### Risks and Mitigations
+
+* The SIG raised the possibility of a container being unable to exit
+  if we enforce quotas and the quota interferes with writing the
+  log.  This can be mitigated by either not applying a quota to the
+  log directory and using the `du` mechanism, or by applying a separate
+  non-enforcing quota to the log directory.
+
+  As log directories are write-only by the container, and consumption
+  can be limited by other means (as the log is filtered by the
+  runtime), I do not consider the ability to write uncapped to the log
+  to be a serious exposure.
+
+  Note in addition that even without quotas it is possible for writes
+  to fail due to lack of filesystem space, which is effectively (and
+  in some cases operationally) indistinguishable from exceeding quota,
+  so even at present code must be able to handle those situations.
+
+* Filesystem quotas may impact performance to an unknown degree.
+  Information on that is hard to come by in general, and one of the
+  reasons for using quotas is indeed to improve performance.  If this
+  is a problem in the field, merely turning off quotas (or selectively
+  disabling project quotas) on the filesystem in question will avoid
+  the problem.  Against the possibility that that cannot be done
+  (because project quotas are needed for other purposes), we should
+  provide a way to disable use of quotas altogether via a feature
+  gate.
+
+  A report notes that an
+  unclean shutdown on Linux kernel versions between 3.11 and 3.17 can
+  result in a prolonged downtime while quota information is restored.
+  Unfortunately, [the link referenced
+  here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no
+  longer available.
+
+* Bugs in the quota code could result in a variety of regressions.
+  For example, if a quota is incorrectly applied it could
+  result in the inability to write any data at all to the volume.
This could + be mitigated by use of non-enforcing quotas. XFS in particular + offers the `pqnoenforce` mount option that makes all quotas + non-enforcing. + + +## Graduation Criteria + +How will we know that this has succeeded? Gathering user feedback is +crucial for building high quality experiences and SIGs have the +important responsibility of setting milestones for stability and +completeness. Hopefully the content previously contained in [umbrella +issues][] will be tracked in the `Graduation Criteria` section. + +[umbrella issues]: N/A + +## Implementation History + +Major milestones in the life cycle of a KEP should be tracked in +`Implementation History`. Major milestones might include + +- the `Summary` and `Motivation` sections being merged signaling SIG + acceptance +- the `Proposal` section being merged signaling agreement on a + proposed design +- the date implementation started +- the first Kubernetes release where an initial version of the KEP was + available +- the version of Kubernetes where the KEP graduated to general + availability +- when the KEP was retired or superseded + +## Drawbacks [optional] + +* Use of quotas, particularly the less commonly used project quotas, + requires additional action on the part of the administrator. In + particular: + * ext4fs filesystems must be created with additional options that + are not enabled by default: +``` +mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_ +``` + * An additional option (`prjquota`) must be applied in `/etc/fstab` + * If the root filesystem is to be quota-enabled, it must be set in + the grub options. +* Use of project quotas for this purpose will preclude future use + within containers. + +## Alternatives [optional] + +I have considered two classes of alternatives: + +* Alternatives based on quotas, with different implementation + +* Alternatives based on loop filesystems without use of quotas + +### Alternative quota-based implementation + +Within the basic framework of using quotas to monitor and potentially +enforce storage utilization, there are a number of possible options: + +* Utilize per-volume non-enforcing quotas to monitor storage (the + first stage of this proposal). + + This mostly preserves the current behavior, but with more efficient + determination of storage utilization and the possibility of building + further on it. The one change from current behavior is the ability + to detect space used by deleted files. + +* Utilize per-volume enforcing quotas to monitor and enforce storage + (the second stage of this proposal). + + This allows partial enforcement of storage limits. As local storage + capacity isolation works at the level of the pod, and we have no + control of user utilization of ephemeral volumes, we would have to + give each volume a quota of the full limit. For example, if a pod + had a limit of 1 MB but had four ephemeral volumes mounted, it would + be possible for storage utilization to reach (at least temporarily) + 4MB before being capped. + +* Utilize per-pod enforcing user or group quotas to enforce storage + consumption, and per-volume non-enforcing quotas for monitoring. + + This would offer the best of both worlds: a fully capped storage + limit combined with efficient reporting. However, it would require + each pod to run under a distinct UID or GID. This may prevent pods + from using setuid or setgid or their variants, and would interfere + with any other use of group or user quotas within Kubernetes. 
+ +* Utilize per-pod enforcing quotas to monitor and enforce storage. + + This allows for full enforcement of storage limits, at the expense + of being able to efficiently monitor per-volume storage + consumption. As there have already been reports of monitoring + causing trouble, I do not advise this option. + + A variant of this would report (1/N) storage for each covered + volume, so with a pod with a 4MiB quota and 1MiB total consumption, + spread across 4 ephemeral volumes, each volume would report a + consumption of 256 KiB. Another variant would change the API to + report statistics for all ephemeral volumes combined. I do not + advise this option. + +### Alternative loop filesystem-based implementation + +Another way of isolating storage is to utilize filesystems of +pre-determined size, using the loop filesystem facility within Linux. +It is possible to create a file and run `mkfs(8)` on it, and then to +mount that filesystem on the desired directory. This both limits the +storage available within that directory and enables quick retrieval of +it via `statfs(2)`. + +Cleanup of such a filesystem involves unmounting it and removing the +backing file. + +The backing file can be created as a sparse file, and the `discard` +option can be used to return unused space to the system, allowing for +thin provisioning. + +I conducted preliminary investigations into this. While at first it +appeared promising, it turned out to have multiple critical flaws: + +* If the filesystem is mounted without the `discard` option, it can + grow to the full size of the backing file, negating any possibility + of thin provisioning. If the file is created dense in the first + place, there is never any possibility of thin provisioning without + use of `discard`. + + If the backing file is created densely, it additionally may require + significant time to create if the ephemeral limit is large. + +* If the filesystem is mounted `nosync`, and is sparse, it is possible + for writes to succeed and then fail later with I/O errors when + synced to the backing storage. This will lead to data corruption + that cannot be detected at the time of write. + + This can easily be reproduced by e. g. creating a 64MB filesystem + and within it creating a 128MB sparse file and building a filesystem + on it. When that filesystem is in turn mounted, writes to it will + succeed, but I/O errors will be seen in the log and the file will be + incomplete: + +``` +# mkdir /var/tmp/d1 /var/tmp/d2 +# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383 +# mkfs.ext4 /var/tmp/fs1 +# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1 +# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767 +# mkfs.ext4 /var/tmp/d1/fs2 +# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2 +# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576 + ...will normally succeed... +# sync + ...fails with I/O error!... +``` + +* If the filesystem is mounted `sync`, all writes to it are + immediately committed to the backing store, and the `dd` operation + above fails as soon as it fills up `/var/tmp/d1`. However, + performance is drastically slowed, particularly with small writes; + with 1K writes, I observed performance degradation in some cases + exceeding three orders of magnitude. + + I performed a test comparing writing 64 MB to a base (partitioned) + filesystem, to a loop filesystem without `sync`, and a loop + filesystem with `sync`. Total I/O was sufficient to run for at least + 5 seconds in each case. All filesystems involved were XFS. 
Loop + filesystems were 128 MB and dense. Times are in seconds. The + erratic behavior (e. g. the 65536 case) was involved was observed + repeatedly, although the exact amount of time and which I/O sizes + were affected varied. The underlying device was an HP EX920 1TB + NVMe SSD. + +| I/O Size | Partition | Loop w/sync | Loop w/o sync | +| ---: | ---: | ---: | ---: | +| 1024 | 0.104 | 0.120 | 140.390 | +| 4096 | 0.045 | 0.077 | 21.850 | +| 16384 | 0.045 | 0.067 | 5.550 | +| 65536 | 0.044 | 0.061 | 20.440 | +| 262144 | 0.043 | 0.087 | 0.545 | +| 1048576 | 0.043 | 0.055 | 7.490 | +| 4194304 | 0.043 | 0.053 | 0.587 | + + The only potentially viable combination in my view would be a dense + loop filesystem without sync, but that would render any thin + provisioning impossible. + +## Infrastructure Needed [optional] + +* Decision: who is responsible for quota management of all volume + types (and especially ephemeral volumes of all types). At present, + emptydir volumes are managed by the kubelet and logdirs and writable + layers by either the kubelet or the runtime, depending upon the + choice of runtime. Beyond the specific proposal that the runtime + should manage quotas for volumes it creates, there are broader + issues that I request assistance from the SIG in addressing. + +* Location of the quota code. If the quotas for different volume + types are to be managed by different components, each such component + needs access to the quota code. The code is substantial and should + not be copied; it would more appropriately be vendored. + +## References + +### Bugs Opened Against Filesystem Quotas + +The following is a list of known security issues referencing +filesystem quotas on Linux, and other bugs referencing filesystem +quotas in Linux since 2012. These bugs are not necessarily in the +quota system. + +#### CVE + +* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel + before 3.3.6, when huge pages are enabled, allows local users to + cause a denial of service (system crash) or possibly gain privileges + by interacting with a hugetlbfs filesystem, as demonstrated by a + umount operation that triggers improper handling of quota data. + + The issue is actually related to huge pages, not quotas + specifically. The demonstration of the vulnerability resulted in + incorrect handling of quota data. + +* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c) + in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl + function the first time without a host name, which might allow + remote attackers to bypass TCP Wrappers rules in hosts.deny (related + to rpc.rquotad; remote attackers might be able to bypass TCP + Wrappers rules). + + This issue is related to remote quota handling, which is not the use + case for the proposal at hand. + +#### Other Security Issues Without CVE + +* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and + Create Large Files](https://securitytracker.com/id/1002610) + + A setuid root binary inheriting file descriptors from an + unprivileged user process may write to the file without respecting + quota limits. If this issue is still present, it would allow a + setuid process to exceed any enforcing limits, but does not affect + the quota accounting (use of quotas for monitoring). + +### Other Linux Quota-Related Bugs Since 2012 + +* [ext4: report delalloc reserve as non-free in statfs mangled by + project quota](https://lore.kernel.org/patchwork/patch/884530/) + + This bug, fixed in Feb. 
2018, properly accounts for reserved but not + committed space in project quotas. At this point I have not + determined the impact of this issue. + +* [XFS quota doesn't work after rebooting because of + crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730) + + This bug resulted in XFS quotas not working after a crash or forced + reboot. Under this proposal, Kubernetes would fall back to du for + monitoring should a bug of this nature manifest itself again. + +* [quota can show incorrect filesystem + name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527) + + This issue, which will not be fixed, results in the quota command + possibly printing an incorrect filesystem name when used on remote + filesystems. It is a display issue with the quota command, not a + quota bug at all, and does not result in incorrect quota information + being reported. As this proposal does not utilize the quota command + or rely on filesystem name, or currently use quotas on remote + filesystems, it should not be affected by this bug. + +In addition, the e2fsprogs have had numerous fixes over the years. -- cgit v1.2.3 From da22baba6ba8c47c4954ccd6cad9a8b6aea99e4e Mon Sep 17 00:00:00 2001 From: Robert Krawitz Date: Wed, 17 Oct 2018 15:13:15 -0400 Subject: Comments from @derekwaynecarr --- .../0030-20180906-quotas-for-ephemeral-storage.md | 25 ++++++++++------------ 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md index 2564455f..bf1ee5c9 100644 --- a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md +++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md @@ -1,29 +1,23 @@ --- kep-number: 0 -title: My First KEP +title: Quotas for Ephemeral Storage authors: - - "@janedoe" + - "@RobertKrawitz" owning-sig: sig-xxx participating-sigs: - - sig-aaa - - sig-bbb + - sig-node reviewers: - TBD - - "@alicedoe" approvers: - - TBD - - "@oscardoe" + - "@dchen1107" + - "@derekwaynecarr" editor: TBD creation-date: yyyy-mm-dd last-updated: yyyy-mm-dd status: provisional see-also: - - KEP-1 - - KEP-2 replaces: - - KEP-3 superseded-by: - - KEP-100 --- # Quotas for Ephemeral Storage @@ -228,6 +222,9 @@ These goals apply only to local ephemeral storage, as described in ### Non-Goals * Application to storage other than local ephemeral storage. +* Application to container copy on write layers. That will be managed + by the container runtime. For a future project, we should work with + the runtimes to use quotas for their monitoring. * Elimination of eviction as a means of enforcing ephemeral-storage limits. Pods that hit their ephemeral-storage limit will still be evicted by the kubelet even if their storage has been capped by @@ -264,13 +261,13 @@ At present, two feature gates control operation of quotas: * `LocalStorageCapacityIsolation` must be enabled for any use of quotas. -* `FSQuotaForLSCIMonitoring` must be enabled in addition. If this is +* `LocalStorageCapacityIsolationFSMonitoring` must be enabled in addition. If this is enabled, quotas are used for monitoring, but not enforcement. At present, this defaults to False, but the intention is that this will default to True by initial release. 
-* _`FSQuotaForLSCIEnforcement` must be enabled, in addition to - `FSQuotaForLSCIMonitoring`, to use quotas for enforcement._ +* _`LocalStorageCapacityIsolationFSEnforcement` must be enabled, in addition to + `LocalStorageCapacityIsolationFSMonitoring`, to use quotas for enforcement._ ### Operation Flow -- Applying a Quota -- cgit v1.2.3