author    k8s-ci-robot <k8s-ci-robot@users.noreply.github.com>  2018-10-29 11:27:16 -0700
committer GitHub <noreply@github.com>  2018-10-29 11:27:16 -0700
commit    60d263753ca22bef2fa4172206fba29901247000 (patch)
tree      fd955380ead13b2b0a3f6f60c1f71c40df934127 /keps/sig-node
parent    a860615db7086bf162c8d70f06645acdfe68dfe8 (diff)
parent    da22baba6ba8c47c4954ccd6cad9a8b6aea99e4e (diff)
Merge pull request #2638 from RobertKrawitz/quotas-for-ephemeral-storage
KEP (Provisional): quotas for ephemeral storage
Diffstat (limited to 'keps/sig-node')
-rw-r--r--  keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md  807
1 files changed, 807 insertions, 0 deletions
diff --git a/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md
new file mode 100644
index 00000000..bf1ee5c9
--- /dev/null
+++ b/keps/sig-node/0030-20180906-quotas-for-ephemeral-storage.md
@@ -0,0 +1,807 @@
+---
+kep-number: 0
+title: Quotas for Ephemeral Storage
+authors:
+ - "@RobertKrawitz"
+owning-sig: sig-node
+participating-sigs:
+ - sig-node
+reviewers:
+ - TBD
+approvers:
+ - "@dchen1107"
+ - "@derekwaynecarr"
+editor: TBD
+creation-date: yyyy-mm-dd
+last-updated: yyyy-mm-dd
+status: provisional
+see-also:
+replaces:
+superseded-by:
+---
+
+# Quotas for Ephemeral Storage
+
+## Table of Contents
+<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-generate-toc again -->
+**Table of Contents**
+
+- [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage)
+ - [Table of Contents](#table-of-contents)
+ - [Summary](#summary)
+ - [Project Quotas](#project-quotas)
+ - [Motivation](#motivation)
+ - [Goals](#goals)
+ - [Non-Goals](#non-goals)
+ - [Future Work](#future-work)
+ - [Proposal](#proposal)
+ - [Control over Use of Quotas](#control-over-use-of-quotas)
+ - [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota)
+ - [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption)
+ - [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota)
+ - [Operation Notes](#operation-notes)
+ - [Selecting a Project ID](#selecting-a-project-id)
+ - [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory)
+ - [Return a Project ID To the System](#return-a-project-id-to-the-system)
+ - [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
+ - [Notes on Implementation](#notes-on-implementation)
+ - [Notes on Code Changes](#notes-on-code-changes)
+ - [Testing Strategy](#testing-strategy)
+ - [Risks and Mitigations](#risks-and-mitigations)
+ - [Graduation Criteria](#graduation-criteria)
+ - [Implementation History](#implementation-history)
+ - [Drawbacks [optional]](#drawbacks-optional)
+ - [Alternatives [optional]](#alternatives-optional)
+ - [Alternative quota-based implementation](#alternative-quota-based-implementation)
+ - [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation)
+ - [Infrastructure Needed [optional]](#infrastructure-needed-optional)
+ - [References](#references)
+ - [Bugs Opened Against Filesystem Quotas](#bugs-opened-against-filesystem-quotas)
+ - [CVE](#cve)
+ - [Other Security Issues Without CVE](#other-security-issues-without-cve)
+ - [Other Linux Quota-Related Bugs Since 2012](#other-linux-quota-related-bugs-since-2012)
+
+<!-- markdown-toc end -->
+
+[Tools for generating]: https://github.com/ekalinin/github-markdown-toc
+
+## Summary
+
+This proposal applies to the use of quotas for ephemeral-storage
+metrics gathering. Use of quotas for ephemeral-storage limit
+enforcement is a [non-goal](#non-goals), but as the architecture and
+code will be very similar, there are comments interspersed related to
+enforcement. _These comments will be italicized_.
+
+Local storage capacity isolation, aka ephemeral-storage, was
+introduced into Kubernetes via
+<https://github.com/kubernetes/features/issues/361>. It provides
+support for capacity isolation of shared storage between pods, such
+that a pod can be limited in its consumption of shared resources and
+can be evicted if its consumption of shared storage exceeds that
+limit. The limits and requests for shared ephemeral-storage are
+similar to those for memory and CPU consumption.
+
+The current mechanism relies on periodically walking each ephemeral
+volume (emptydir, logdir, or container writable layer) and summing the
+space consumption. This method is slow, can be fooled, and has high
+latency (i. e. a pod could consume a lot of storage prior to the
+kubelet being aware of its overage and terminating it).
+
+The mechanism proposed here utilizes filesystem project quotas to
+provide monitoring of resource consumption _and optionally enforcement
+of limits._ Project quotas, initially in XFS and more recently ported
+to ext4fs, offer a kernel-based means of monitoring _and restricting_
+filesystem consumption that can be applied to one or more directories.
+
+A prototype is in progress; see <https://github.com/kubernetes/kubernetes/pull/66928>.
+
+### Project Quotas
+
+Project quotas are a form of filesystem quota that apply to arbitrary
+groups of files, as opposed to file user or group ownership. They
+were first implemented in XFS, as described here:
+<http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html>.
+
+Project quotas for ext4fs were [proposed in late
+2014](https://lwn.net/Articles/623835/) and added to the Linux kernel
+in early 2016, with
+commit
+[391f2a16b74b95da2f05a607f53213fc8ed24b8e](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=391f2a16b74b95da2f05a607f53213fc8ed24b8e).
+They were designed to be compatible with XFS project quotas.
+
+Each inode contains a 32-bit project ID, to which quotas (hard and
+soft limits for blocks and inodes) may optionally be applied. The
+total blocks and inodes for all files with the given project ID are
+maintained by the kernel. Project quotas can be managed from
+userspace by means of the `xfs_quota(8)` command in foreign filesystem
+(`-f`) mode; the traditional Linux quota tools do not manipulate
+project quotas. Programmatically, they are managed by the `quotactl(2)`
+system call, using in part the standard quota commands and in part the
+XFS quota commands; the man page implies incorrectly that the XFS
+quota commands apply only to XFS filesystems.
+
+The project ID applied to a directory is inherited by files created
+under it. Files cannot be (hard) linked across directories with
+different project IDs. A file's project ID cannot be changed by a
+non-privileged user, but a privileged user may use the `xfs_io(8)`
+command to change the project ID of a file.
+
+Filesystems using project quotas may be mounted with quotas either
+enforced or not; the non-enforcing mode tracks usage without enforcing
+it. A non-enforcing project quota may be implemented on a filesystem
+mounted with enforcing quotas by setting a quota too large to be hit.
+The maximum size that can be set varies with the filesystem; on a
+64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for
+ext4fs.
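+
+As an illustration (the device and mount point are hypothetical), XFS
+exposes both modes through mount options, with `pqnoenforce` providing
+accounting without enforcement:
+
+```
+# enforcing project quotas
+mount -o prjquota /dev/sdb1 /var/lib/kubelet
+# accounting-only project quotas: usage is tracked, limits are not enforced
+mount -o pqnoenforce /dev/sdb1 /var/lib/kubelet
+```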
+
+Conventionally, project quota mappings are stored in `/etc/projects` and
+`/etc/projid`; these files exist for user convenience and do not have
+any direct importance to the kernel. `/etc/projects` contains a mapping
+from project ID to directory/file; this can be a one-to-many mapping
+(the same project ID can apply to multiple directories or files, but
+any given directory/file can be assigned only one project ID).
+`/etc/projid` contains a mapping from named projects to project IDs.
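+
+For illustration, entries in these files take the following form (the
+path and project name shown are hypothetical):
+
+```
+# /etc/projects: <project ID>:<path>
+1048577:/var/lib/kubelet/pods/0f2a/volumes/kubernetes.io~empty-dir/data
+
+# /etc/projid: <project name>:<project ID>
+volume1048577:1048577
+```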
+
+This proposal utilizes hard project quotas for both monitoring _and
+enforcement_. Soft quotas are of no utility; they allow for temporary
+overage that, after a programmable period of time, is converted to the
+hard quota limit.
+
+
+## Motivation
+
+The mechanism presently used to monitor storage consumption involves
+use of `du` and `find` to periodically gather information about
+storage and inode consumption of volumes. This mechanism suffers from
+a number of drawbacks:
+
+* It is slow. If a volume contains a large number of files, walking
+ the directory can take a significant amount of time. There has been
+ at least one known report of nodes becoming not ready due to volume
+ metrics: <https://github.com/kubernetes/kubernetes/issues/62917>
+* It is possible to conceal a file from the walker by creating it and
+ removing it while holding an open file descriptor on it. POSIX
+ behavior is to not remove the file until the last open file
+ descriptor referring to it is closed. This has legitimate uses; it
+ ensures that a temporary file is deleted when the processes using it
+ exit, and it minimizes the attack surface by not having a file that
+ can be found by an attacker. The following pod does this; it will
+ never be caught by the present mechanism:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: "diskhog"
+spec:
+ containers:
+ - name: "perl"
+ resources:
+ limits:
+ ephemeral-storage: "2048Ki"
+ image: "perl"
+ command:
+ - perl
+ - -e
+ - >
+ my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
+ volumeMounts:
+ - name: a
+ mountPath: /data/a
+ volumes:
+ - name: a
+ emptyDir: {}
+```
+* It is reactive rather than proactive. It does not prevent a pod
+ from overshooting its limit; at best it catches it after the fact.
+ On a fast storage medium, such as NVMe, a pod may write 50 GB or
+ more of data before the housekeeping performed once per minute
+ catches up to it. If the primary volume is the root partition, this
+ will completely fill the partition, possibly causing serious
+ problems elsewhere on the system. This proposal does not address
+ this issue; _a future enforcing project would_.
+
+In many environments, these issues may not matter, but shared
+multi-tenant environments need these issues addressed.
+
+### Goals
+
+These goals apply only to local ephemeral storage, as described in
+<https://github.com/kubernetes/features/issues/361>.
+
+* Primary: improve performance of monitoring by using project quotas
+ in a non-enforcing way to collect information about storage
+ utilization of ephemeral volumes.
+* Primary: detect storage used by pods that is concealed by deleted
+ files being held open.
+* Primary: ensure that the use of project quotas does not interfere
+ with the more common user and group quotas.
+
+### Non-Goals
+
+* Application to storage other than local ephemeral storage.
+* Application to container copy-on-write layers. That will be managed
+ by the container runtime. For a future project, we should work with
+ the runtimes to use quotas for their monitoring.
+* Elimination of eviction as a means of enforcing ephemeral-storage
+ limits. Pods that hit their ephemeral-storage limit will still be
+ evicted by the kubelet even if their storage has been capped by
+ enforcing quotas.
+* Enforcing node allocatable (a limit over the sum of all pods' disk
+ usage, including e. g. images).
+* Enforcing limits on total pod storage consumption by any means, such
+ that the pod would be hard restricted to the desired storage limit.
+
+### Future Work
+
+* _Enforce limits on per-volume storage consumption by using
+ enforced project quotas._
+
+## Proposal
+
+This proposal applies project quotas to emptydir volumes on qualifying
+filesystems (ext4fs and xfs with project quotas enabled). Project
+quotas are applied by selecting an unused project ID (a 32-bit
+unsigned integer), setting a limit on space and/or inode consumption,
+and attaching the ID to one or more files. By default (and as
+utilized herein), if a project ID is attached to a directory, it is
+inherited by any files created under that directory.
+
+_If we elect to use the quota as enforcing, we impose a quota
+consistent with the desired limit._ If we elect to use it as
+non-enforcing, we impose a large quota that in practice cannot be
+exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs).
+
+### Control over Use of Quotas
+
+At present, two feature gates control operation of quotas:
+
+* `LocalStorageCapacityIsolation` must be enabled for any use of
+ quotas.
+
+* `LocalStorageCapacityIsolationFSMonitoring` must be enabled in addition. If this is
+ enabled, quotas are used for monitoring, but not enforcement. At
+ present, this defaults to False, but the intention is that this will
+ default to True by initial release.
+
+* _`LocalStorageCapacityIsolationFSEnforcement` must be enabled, in addition to
+ `LocalStorageCapacityIsolationFSMonitoring`, to use quotas for enforcement._
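+
+For example, a node operator who wants quota-based monitoring (but not
+enforcement) would run the kubelet with something like the following,
+assuming the gate names above are retained:
+
+```
+kubelet --feature-gates=LocalStorageCapacityIsolation=true,LocalStorageCapacityIsolationFSMonitoring=true
+```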
+
+### Operation Flow -- Applying a Quota
+
+* Caller (emptydir volume manager or container runtime) creates an
+ emptydir volume, with an empty directory at a location of its
+ choice.
+* Caller requests that a quota be applied to a directory.
+* Determine whether a quota can be imposed on the directory, by asking
+ each quota provider (one per filesystem type) whether it can apply a
+ quota to the directory. If no provider claims the directory, an
+ error status is returned to the caller.
+* Select an unused project ID ([see below](#selecting-a-project-id)).
+* Set the desired limit on the project ID, in a filesystem-dependent
+ manner ([see below](#notes-on-implementation)).
+* Apply the project ID to the directory in question, in a
+ filesystem-dependent manner.
+
+An error at any point results in no quota being applied and no change
+to the state of the system. The caller in general should not assume a
+priori that the attempt will be successful. It could choose to reject
+a request if a quota cannot be applied, but at this time it will
+simply ignore the error and proceed as today.
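+
+The same steps can be illustrated manually with `xfs_quota(8)` (the
+directory, mount point, and project ID below are hypothetical; on
+ext4fs the commands additionally need foreign-filesystem (`-f`) mode):
+
+```
+# attach an unused project ID to the volume directory (inherited by new files)
+xfs_quota -x -c 'project -s -p /var/lib/kubelet/pods/0f2a/volumes/kubernetes.io~empty-dir/data 1048577' /var/lib/kubelet
+# set the hard limit; a non-enforcing quota simply uses a limit too large to hit
+xfs_quota -x -c 'limit -p bhard=2g 1048577' /var/lib/kubelet
+```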
+
+### Operation Flow -- Retrieving Storage Consumption
+
+* Caller (kubelet metrics code, cadvisor, container runtime) asks the
+ quota code to compute the amount of storage used under the
+ directory.
+* Determine whether a quota applies to the directory, in a
+ filesystem-dependent manner ([see below](#notes-on-implementation)).
+* If so, determine how much storage or how many inodes are utilized,
+ in a filesystem-dependent manner.
+
+If the quota code is unable to retrieve the consumption, it returns an
+error status and it is up to the caller to utilize a fallback
+mechanism (such as the directory walk performed today).
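+
+Manually, the kernel's per-project accounting can be read back with a
+single query rather than a directory walk (mount point and project ID
+hypothetical, as above):
+
+```
+# blocks and inodes charged to project 1048577, as accounted by the kernel
+xfs_quota -x -c 'quota -p -v -b 1048577' /var/lib/kubelet
+xfs_quota -x -c 'quota -p -v -i 1048577' /var/lib/kubelet
+```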
+
+### Operation Flow -- Removing a Quota
+
+* Caller requests that the quota be removed from a directory.
+* Determine whether a project quota applies to the directory.
+* Remove the limit from the project ID associated with the directory.
+* Remove the association between the directory and the project ID.
+* Return the project ID to the system to allow its use elsewhere ([see
+ below](#return-a-project-id-to-the-system)).
+* Caller may delete the directory and its contents (normally it will).
+
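+The manual equivalent of this cleanup (hypothetical path, mount point,
+and project ID as before) clears the limits and then the directory
+association:
+
+```
+# a limit of 0 means "no limit", effectively removing the quota
+xfs_quota -x -c 'limit -p bhard=0 ihard=0 1048577' /var/lib/kubelet
+# detach the project ID from the directory tree
+xfs_quota -x -c 'project -C -p /var/lib/kubelet/pods/0f2a/volumes/kubernetes.io~empty-dir/data 1048577' /var/lib/kubelet
+```
+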
+### Operation Notes
+
+#### Selecting a Project ID
+
+Project IDs are a shared space within a filesystem. If the same
+project ID is assigned to multiple directories, the space consumption
+reported by the quota will be the sum of that of all of the
+directories. Hence, it is important to ensure that each directory is
+assigned a unique project ID (unless it is desired to pool the storage
+use of multiple directories).
+
+The canonical mechanism to record persistently that a project ID is
+reserved is to store it in the `/etc/projid` (`projid(5)`) and/or
+`/etc/projects` (`projects(5)`) files. However, it is possible to utilize
+project IDs without recording them in those files; they exist for
+administrative convenience but neither the kernel nor the filesystem
+is aware of them. Other ways can be used to determine whether a
+project ID is in active use on a given filesystem:
+
+* The quota values (in blocks and/or inodes) assigned to the project
+ ID are non-zero.
+* The storage consumption (in blocks and/or inodes) reported under the
+ project ID is non-zero.
+
+The algorithm to be used is as follows:
+
+* Lock this instance of the quota code against re-entrancy.
+* Open and `flock()` the `/etc/projects` and `/etc/projid` files, so that
+ other uses of this code are excluded.
+* Start from a high number (the prototype uses 1048577).
+* Iterate from there, performing the following tests:
+ * Is the ID reserved by this instance of the quota code?
+ * Is the ID present in `/etc/projects`?
+ * Is the ID present in `/etc/projid`?
+ * Are the quota values and/or consumption reported by the kernel
+ non-zero? This test is restricted to 128 iterations to ensure
+ that a bug here or elsewhere does not result in an infinite loop
+ looking for a quota ID.
+* If an ID has been found:
+ * Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so
+ that any other uses of project quotas do not reuse it.
+ * Write temporary copies of `/etc/projects` and `/etc/projid` that are
+ `flock()`ed
+ * If successful, rename the temporary files appropriately (if
+ rename of one succeeds but the other fails, we have a problem
+ that we cannot recover from, and the files may be inconsistent).
+* Unlock `/etc/projid` and `/etc/projects`.
+* Unlock this instance of the quota code.
+
+A minor variation of this is used if we want to reuse an existing
+quota ID.
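+
+The per-ID checks in the loop above can be illustrated from the shell
+(mount point and candidate ID hypothetical):
+
+```
+# is the candidate ID already recorded in the project files?
+grep -w 1048577 /etc/projid /etc/projects
+# does the kernel already report limits or usage for it?
+xfs_quota -x -c 'quota -p -v -b 1048577' /var/lib/kubelet
+```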
+
+#### Determine Whether a Project ID Applies To a Directory
+
+It is possible to determine whether a directory has a project ID
+applied to it by requesting (via the `quotactl(2)` system call) the
+project ID associated with the directory. While the specifics are
+filesystem-dependent, the basic method is the same for at least XFS
+and ext4fs.
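+
+From the shell, the project ID attached to a directory can be read
+back directly (hypothetical path); `xfs_io(8)` and recent versions of
+`lsattr(1)` both expose it:
+
+```
+# print the project ID assigned to the directory
+xfs_io -c lsproj /var/lib/kubelet/pods/0f2a/volumes/kubernetes.io~empty-dir/data
+# lsattr -p shows the project ID alongside the attributes; -d lists the
+# directory itself rather than its contents
+lsattr -pd /var/lib/kubelet/pods/0f2a/volumes/kubernetes.io~empty-dir/data
+```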
+
+It is not possible to determine, in a constant number of operations,
+the directory or directories to which a project ID is applied. It is
+possible to determine whether a given project ID has been applied to
+some existing directory or files (although not which ones): the
+reported consumption will be non-zero.
+
+The code records internally the project ID applied to a directory, but
+it cannot always rely on this. In particular, if the kubelet has
+exited and has been restarted (and hence the quota applying to the
+directory should be removed), the map from directory to project ID is
+lost. If it cannot find a map entry, it falls back on the approach
+discussed above.
+
+#### Return a Project ID To the System
+
+The algorithm used to return a project ID to the system is very
+similar to the one used to select a project ID, except that the
+entry is removed rather than added. It performs the same sequence of
+locking `/etc/projects` and `/etc/projid`, editing a copy of each
+file, and restoring it.
+
+If the project ID is applied to multiple directories and the code can
+determine that, it will not remove the project ID from `/etc/projid`
+until the last reference is removed. While it is not anticipated in
+this KEP that this mode of operation will be used, at least initially,
+this can be detected even on kubelet restart by looking at the
+reference count in `/etc/projects`.
+
+
+### Implementation Details/Notes/Constraints [optional]
+
+#### Notes on Implementation
+
+The primary new interface defined is the quota interface in
+`pkg/volume/util/quota/quota.go`. This defines five operations:
+
+* Does the specified directory support quotas?
+
+* Assign a quota to a directory. If a non-empty pod UID is provided,
+ the quota assigned is that of any other directories under this pod
+ UID; if an empty pod UID is provided, a unique quota is assigned.
+
+* Retrieve the consumption of the specified directory. If the quota
+ code cannot handle it efficiently, it returns an error and the
+ caller falls back on existing mechanism.
+
+* Retrieve the inode consumption of the specified directory; same
+ description as above.
+
+* Remove quota from a directory. If a non-empty pod UID is passed, it
+ is checked against that recorded in-memory (if any). The quota is
+ removed from the specified directory. This can be used even if
+ AssignQuota has not been used; it inspects the directory and removes
+ the quota from it. This permits stale quotas from an interrupted
+ kubelet to be cleaned up.
+
+Two implementations are provided: `quota_linux.go` (for Linux) and
+`quota_unsupported.go` (for other operating systems). The latter
+returns an error for all requests.
+
+As the quota mechanism is intended to support multiple filesystems,
+and different filesystems require different low level code for
+manipulating quotas, a provider is supplied that finds an appropriate
+quota applier implementation for the filesystem in question. The low
+level quota applier provides similar operations to the top level quota
+code, with two exceptions:
+
+* No operation exists to determine whether a quota can be applied
+ (that is handled by the provider).
+
+* An additional operation is provided to determine whether a given
+ quota ID is in use within the filesystem (outside of `/etc/projects`
+ and `/etc/projid`).
+
+The two quota providers in the initial implementation are in
+`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While
+some quota operations do require different system calls, a lot of the
+code is common and is factored into
+`pkg/volume/util/quota/common/quota_linux_common_impl.go`.
+
+#### Notes on Code Changes
+
+The prototype for this project is mostly self-contained within
+`pkg/volume/util/quota` and a few changes to
+`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were
+required elsewhere:
+
+* The operation executor needs to pass the desired size limit to the
+ volume plugin where appropriate so that the volume plugin can impose
+ a quota. The limit is passed as 0 (do not use quotas), _positive
+ number (impose an enforcing quota if possible, measured in bytes),_
+ or -1 (impose a non-enforcing quota, if possible) on the volume.
+
+ This requires changes to
+ `pkg/volume/util/operationexecutor/operation_executor.go` (to add
+ `DesiredSizeLimit` to `VolumeToMount`),
+ `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and
+ `pkg/kubelet/eviction/helpers.go` (the latter in order to determine
+ whether the volume is a local ephemeral one).
+
+* The volume manager (in `pkg/volume/volume.go`) changes the
+ `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new
+ `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to
+ allow passing the desired size and pod UID (in the event we choose
+ to implement quotas shared between multiple volumes; [see
+ below](#alternative-quota-based-implementation)). This required
+ small changes to all volume plugins and their tests, but will in the
+ future allow adding additional data without having to change code
+ other than that which uses the new information.
+
+#### Testing Strategy
+
+The quota code is by and large not very amenable to unit tests. While
+there are simple unit tests for parsing the mounts file, and there
+could be tests for parsing the projects and projid files, the real
+work (and risk) involves interactions with the kernel and with
+multiple instances of this code (e. g. in the kubelet and the runtime
+manager, particularly under stress). It also requires setup in the
+form of a prepared filesystem. It would be better served by
+appropriate end-to-end tests.
+
+### Risks and Mitigations
+
+* The SIG raised the possibility of a container being unable to exit
+ if we enforce quotas and the quota interferes with writing the
+ log. This can be mitigated by either not applying a quota to the
+ log directory and using the du mechanism, or by applying a separate
+ non-enforcing quota to the log directory.
+
+ As log directories are write-only by the container, and consumption
+ can be limited by other means (as the log is filtered by the
+ runtime), I do not consider the ability to write uncapped to the log
+ to be a serious exposure.
+
+ Note in addition that even without quotas it is possible for writes
+ to fail due to lack of filesystem space, which is effectively (and
+ in some cases operationally) indistinguishable from exceeding quota,
+ so even at present code must be able to handle those situations.
+
+* Filesystem quotas may impact performance to an unknown degree.
+ Information on that is hard to come by in general, and one of the
+ reasons for using quotas is indeed to improve performance. If this
+ is a problem in the field, merely turning off quotas (or selectively
+ disabling project quotas) on the filesystem in question will avoid
+ the problem. Against the possibility that that cannot be done
+ (because project quotas are needed for other purposes), we should
+ provide a way to disable use of quotas altogether via a feature
+ gate.
+
+ A report <https://blog.pythonanywhere.com/110/> notes that an
+ unclean shutdown on Linux kernel versions between 3.11 and 3.17 can
+ result in a prolonged downtime while quota information is restored.
+ Unfortunately, [the link referenced
+ here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no
+ longer available.
+
+* Bugs in the quota code could result in a variety of regression
+ behavior. For example, if a quota is incorrectly applied it could
+ result in the inability to write any data at all to the volume. This could
+ be mitigated by use of non-enforcing quotas. XFS in particular
+ offers the `pqnoenforce` mount option that makes all quotas
+ non-enforcing.
+
+
+## Graduation Criteria
+
+How will we know that this has succeeded? Gathering user feedback is
+crucial for building high quality experiences and SIGs have the
+important responsibility of setting milestones for stability and
+completeness. Hopefully the content previously contained in [umbrella
+issues][] will be tracked in the `Graduation Criteria` section.
+
+[umbrella issues]: N/A
+
+## Implementation History
+
+Major milestones in the life cycle of a KEP should be tracked in
+`Implementation History`. Major milestones might include
+
+- the `Summary` and `Motivation` sections being merged signaling SIG
+ acceptance
+- the `Proposal` section being merged signaling agreement on a
+ proposed design
+- the date implementation started
+- the first Kubernetes release where an initial version of the KEP was
+ available
+- the version of Kubernetes where the KEP graduated to general
+ availability
+- when the KEP was retired or superseded
+
+## Drawbacks [optional]
+
+* Use of quotas, particularly the less commonly used project quotas,
+ requires additional action on the part of the administrator. In
+ particular:
+ * ext4fs filesystems must be created with additional options that
+ are not enabled by default:
+```
+mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_
+```
+ * An additional option (`prjquota`) must be applied in `/etc/fstab`;
+ see the example after this list.
+ * If the root filesystem is to be quota-enabled, it must be set in
+ the grub options.
+* Use of project quotas for this purpose will preclude future use
+ within containers.
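+
+For example, an `/etc/fstab` entry enabling project quotas on an
+ext4fs volume dedicated to the kubelet might look like this (device
+and mount point hypothetical):
+
+```
+/dev/sdb1  /var/lib/kubelet  ext4  defaults,prjquota  0  2
+```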
+
+## Alternatives [optional]
+
+I have considered two classes of alternatives:
+
+* Alternatives based on quotas, with different implementation
+
+* Alternatives based on loop filesystems without use of quotas
+
+### Alternative quota-based implementation
+
+Within the basic framework of using quotas to monitor and potentially
+enforce storage utilization, there are a number of possible options:
+
+* Utilize per-volume non-enforcing quotas to monitor storage (the
+ first stage of this proposal).
+
+ This mostly preserves the current behavior, but with more efficient
+ determination of storage utilization and the possibility of building
+ further on it. The one change from current behavior is the ability
+ to detect space used by deleted files.
+
+* Utilize per-volume enforcing quotas to monitor and enforce storage
+ (the second stage of this proposal).
+
+ This allows partial enforcement of storage limits. As local storage
+ capacity isolation works at the level of the pod, and we have no
+ control of user utilization of ephemeral volumes, we would have to
+ give each volume a quota of the full limit. For example, if a pod
+ had a limit of 1 MB but had four ephemeral volumes mounted, it would
+ be possible for storage utilization to reach (at least temporarily)
+ 4MB before being capped.
+
+* Utilize per-pod enforcing user or group quotas to enforce storage
+ consumption, and per-volume non-enforcing quotas for monitoring.
+
+ This would offer the best of both worlds: a fully capped storage
+ limit combined with efficient reporting. However, it would require
+ each pod to run under a distinct UID or GID. This may prevent pods
+ from using setuid or setgid or their variants, and would interfere
+ with any other use of group or user quotas within Kubernetes.
+
+* Utilize per-pod enforcing quotas to monitor and enforce storage.
+
+ This allows for full enforcement of storage limits, at the expense
+ of being able to efficiently monitor per-volume storage
+ consumption. As there have already been reports of monitoring
+ causing trouble, I do not advise this option.
+
+ A variant of this would report (1/N) storage for each covered
+ volume, so with a pod with a 4MiB quota and 1MiB total consumption,
+ spread across 4 ephemeral volumes, each volume would report a
+ consumption of 256 KiB. Another variant would change the API to
+ report statistics for all ephemeral volumes combined. I do not
+ advise this option.
+
+### Alternative loop filesystem-based implementation
+
+Another way of isolating storage is to utilize filesystems of
+pre-determined size, using the loop filesystem facility within Linux.
+It is possible to create a file and run `mkfs(8)` on it, and then to
+mount that filesystem on the desired directory. This both limits the
+storage available within that directory and enables quick retrieval of
+it via `statfs(2)`.
+
+Cleanup of such a filesystem involves unmounting it and removing the
+backing file.
+
+The backing file can be created as a sparse file, and the `discard`
+option can be used to return unused space to the system, allowing for
+thin provisioning.
+
+I conducted preliminary investigations into this. While at first it
+appeared promising, it turned out to have multiple critical flaws:
+
+* If the filesystem is mounted without the `discard` option, it can
+ grow to the full size of the backing file, negating any possibility
+ of thin provisioning. If the file is created dense in the first
+ place, there is never any possibility of thin provisioning without
+ use of `discard`.
+
+ If the backing file is created densely, it additionally may require
+ significant time to create if the ephemeral limit is large.
+
+* If the filesystem is mounted `nosync`, and is sparse, it is possible
+ for writes to succeed and then fail later with I/O errors when
+ synced to the backing storage. This will lead to data corruption
+ that cannot be detected at the time of write.
+
+ This can easily be reproduced by e. g. creating a 64MB filesystem
+ and within it creating a 128MB sparse file and building a filesystem
+ on it. When that filesystem is in turn mounted, writes to it will
+ succeed, but I/O errors will be seen in the log and the file will be
+ incomplete:
+
+```
+# mkdir /var/tmp/d1 /var/tmp/d2
+# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
+# mkfs.ext4 /var/tmp/fs1
+# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
+# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
+# mkfs.ext4 /var/tmp/d1/fs2
+# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
+# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
+ ...will normally succeed...
+# sync
+ ...fails with I/O error!...
+```
+
+* If the filesystem is mounted `sync`, all writes to it are
+ immediately committed to the backing store, and the `dd` operation
+ above fails as soon as it fills up `/var/tmp/d1`. However,
+ performance is drastically slowed, particularly with small writes;
+ with 1K writes, I observed performance degradation in some cases
+ exceeding three orders of magnitude.
+
+ I performed a test comparing writing 64 MB to a base (partitioned)
+ filesystem, to a loop filesystem without `sync`, and a loop
+ filesystem with `sync`. Total I/O was sufficient to run for at least
+ 5 seconds in each case. All filesystems involved were XFS. Loop
+ filesystems were 128 MB and dense. Times are in seconds. The
+ erratic behavior (e. g. the 65536 case) was observed
+ repeatedly, although the exact amount of time and which I/O sizes
+ were affected varied. The underlying device was an HP EX920 1TB
+ NVMe SSD.
+
+| I/O Size | Partition | Loop w/o sync | Loop w/sync |
+| ---: | ---: | ---: | ---: |
+| 1024 | 0.104 | 0.120 | 140.390 |
+| 4096 | 0.045 | 0.077 | 21.850 |
+| 16384 | 0.045 | 0.067 | 5.550 |
+| 65536 | 0.044 | 0.061 | 20.440 |
+| 262144 | 0.043 | 0.087 | 0.545 |
+| 1048576 | 0.043 | 0.055 | 7.490 |
+| 4194304 | 0.043 | 0.053 | 0.587 |
+
+ The only potentially viable combination in my view would be a dense
+ loop filesystem without sync, but that would render any thin
+ provisioning impossible.
+
+## Infrastructure Needed [optional]
+
+* Decision: who is responsible for quota management of all volume
+ types (and especially ephemeral volumes of all types). At present,
+ emptydir volumes are managed by the kubelet and logdirs and writable
+ layers by either the kubelet or the runtime, depending upon the
+ choice of runtime. Beyond the specific proposal that the runtime
+ should manage quotas for volumes it creates, there are broader
+ issues that I request assistance from the SIG in addressing.
+
+* Location of the quota code. If the quotas for different volume
+ types are to be managed by different components, each such component
+ needs access to the quota code. The code is substantial and should
+ not be copied; it would more appropriately be vendored.
+
+## References
+
+### Bugs Opened Against Filesystem Quotas
+
+The following is a list of known security issues referencing
+filesystem quotas on Linux, and other bugs referencing filesystem
+quotas in Linux since 2012. These bugs are not necessarily in the
+quota system.
+
+#### CVE
+
+* *CVE-2012-2133* Use-after-free vulnerability in the Linux kernel
+ before 3.3.6, when huge pages are enabled, allows local users to
+ cause a denial of service (system crash) or possibly gain privileges
+ by interacting with a hugetlbfs filesystem, as demonstrated by a
+ umount operation that triggers improper handling of quota data.
+
+ The issue is actually related to huge pages, not quotas
+ specifically. The demonstration of the vulnerability resulted in
+ incorrect handling of quota data.
+
+* *CVE-2012-3417* The good_client function in rquotad (rquota_svc.c)
+ in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl
+ function the first time without a host name, which might allow
+ remote attackers to bypass TCP Wrappers rules in hosts.deny
+ (related to rpc.rquotad).
+
+ This issue is related to remote quota handling, which is not the use
+ case for the proposal at hand.
+
+#### Other Security Issues Without CVE
+
+* [Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and
+ Create Large Files](https://securitytracker.com/id/1002610)
+
+ A setuid root binary inheriting file descriptors from an
+ unprivileged user process may write to the file without respecting
+ quota limits. If this issue is still present, it would allow a
+ setuid process to exceed any enforcing limits, but does not affect
+ the quota accounting (use of quotas for monitoring).
+
+### Other Linux Quota-Related Bugs Since 2012
+
+* [ext4: report delalloc reserve as non-free in statfs mangled by
+ project quota](https://lore.kernel.org/patchwork/patch/884530/)
+
+ This bug, fixed in Feb. 2018, properly accounts for reserved but not
+ committed space in project quotas. At this point I have not
+ determined the impact of this issue.
+
+* [XFS quota doesn't work after rebooting because of
+ crash](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1461730)
+
+ This bug resulted in XFS quotas not working after a crash or forced
+ reboot. Under this proposal, Kubernetes would fall back to du for
+ monitoring should a bug of this nature manifest itself again.
+
+* [quota can show incorrect filesystem
+ name](https://bugzilla.redhat.com/show_bug.cgi?id=1326527)
+
+ This issue, which will not be fixed, results in the quota command
+ possibly printing an incorrect filesystem name when used on remote
+ filesystems. It is a display issue with the quota command, not a
+ quota bug at all, and does not result in incorrect quota information
+ being reported. As this proposal does not utilize the quota command
+ or rely on filesystem name, or currently use quotas on remote
+ filesystems, it should not be affected by this bug.
+
+In addition, the e2fsprogs have had numerous fixes over the years.