| Field | Value | Date |
|---|---|---|
| author | Garrett Rodrigues <grod@google.com> | 2017-09-14 11:46:18 -0700 |
| committer | Garrett Rodrigues <grod@google.com> | 2017-09-14 11:46:18 -0700 |
| commit | 2a27c063aa02cefa8188d2717048cbeec95d5c9d (patch) | |
| tree | 0a972fbe8903af018279f0b91a8cb019bf6f6380 | |
| parent | 9b399e0355e5f58e43dad35ae903ead51cab8c4b (diff) | |
removing dupe hugepages and obsolete marking
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | contributors/design-proposals/apps/OBSOLETE_templates.md | 2 |
| -rw-r--r-- | contributors/design-proposals/scheduling/hugepages.md | 308 |
2 files changed, 0 insertions, 310 deletions
```
diff --git a/contributors/design-proposals/apps/OBSOLETE_templates.md b/contributors/design-proposals/apps/OBSOLETE_templates.md
index 010b31a8..50712932 100644
--- a/contributors/design-proposals/apps/OBSOLETE_templates.md
+++ b/contributors/design-proposals/apps/OBSOLETE_templates.md
@@ -1,5 +1,3 @@
-# OBSOLETE
-
 # Templates+Parameterization: Repeatedly instantiating user-customized application topologies.
 
 ## Motivation

diff --git a/contributors/design-proposals/scheduling/hugepages.md b/contributors/design-proposals/scheduling/hugepages.md
deleted file mode 100644
index 27e5c5af..00000000
--- a/contributors/design-proposals/scheduling/hugepages.md
+++ /dev/null
@@ -1,308 +0,0 @@
```

The removed `hugepages.md` read as follows:

# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress

## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume huge pages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.

## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4Ki, so
1Mi of memory is 256 pages and 1Gi of memory is 262,144 pages, and so on. CPUs
have a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
address translation, which hurts performance. Since the size of the TLB is
fixed, the only way to reduce the chance of a TLB miss is to increase the page
size.

A huge page is a memory page larger than 4Ki. On x86_64 architectures, there
are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other
architectures, but the idea is the same. To use huge pages, an application must
be written to be aware of them. Transparent Huge Pages (THP) attempt to
automate the management of huge pages without application knowledge, but they
have limitations. In particular, they are limited to 2Mi page sizes. THP can
also degrade performance on nodes with high memory utilization or
fragmentation, because THP's defragmentation efforts can lock memory pages. For
this reason, some applications are designed to use (or recommend) pre-allocated
huge pages instead of THP.

Managing memory is hard, and unfortunately there is no one-size-fits-all
solution for all applications.

## Scope

This proposal covers only pre-allocated huge pages configured on the node by
the administrator, either at boot time or by manual dynamic allocation. It does
not discuss how the cluster could dynamically attempt to allocate huge pages to
find a fit for a pod pending scheduling. Operators are expected to use a
variety of strategies to allocate huge pages, but we do not anticipate the
kubelet itself doing the allocation. Allocation of huge pages ideally happens
soon after boot time.

This proposal defers issues relating to NUMA.
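As background for the scope above, the following is a minimal sketch of how an
operator might pre-allocate huge pages on a Linux node, either at boot or
afterwards; the pool sizes shown are purely illustrative.

```
# At boot, reserve the pools via kernel boot parameters (1Gi "gigantic" pages
# typically have to be reserved this way):
#   hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=2
# After boot, the default-size pool can also be grown dynamically:
echo 512 > /proc/sys/vm/nr_hugepages

# Verify the reservation.
grep -i huge /proc/meminfo
```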
## Use Cases

The class of applications that benefit from huge pages typically have:
- A large memory working set
- A sensitivity to memory access latency

Example applications include:
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications, which can back the heap with huge pages using the
  `-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options
- packet processing systems (DPDK)

Applications can generally use huge pages by calling:
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB`, using the mapping as anonymous memory
- `mmap()` of a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB`, using the result as a shared memory segment (see
  Known Issues)

1. A pod can use huge pages with any of the previously described methods.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.

## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
components pending graduation to Beta.

## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most nodes
will be configured to primarily use the default huge page size as returned via
`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems
unless overridden by `default_hugepagesz=1g` in the kernel boot parameters.

For each supported huge page size, the node will advertise a resource of the
form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are
determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB`
directories on the host. Kubernetes will expose the `hugepages-<hugepagesize>`
resource using binary notation, converting `<hugepagesize>` into the most
compact binary notation with integer values. For example, if a node supports
`hugepages-2048kB`, a `hugepages-2Mi` resource will be shown in the node's
capacity and allocatable values. Operators may set aside pre-allocated huge
pages that are not available for user pods, similar to normal memory, via the
`--system-reserved` flag.

A variety of huge page sizes are supported across different hardware
architectures. A resource per size is preferred in order to better support
quota: one huge page of size 2Mi is orders of magnitude different from one huge
page of size 1Gi. We assume gigantic pages are an even more precious resource
than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similarly to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
 [Kube-Reserved] -
 [System-Reserved] -
 [Pre-Allocated-HugePages * HugePageSize] -
 [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory, of which 1Gi has been
reserved as 512 pre-allocated huge pages of size 2Mi. The allocatable memory
has been reduced to account for the huge pages reserved.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages-2Mi: 1Gi
  allocatable:
    memory: 9Gi
    hugepages-2Mi: 1Gi
...
```
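As a quick illustration of where these advertised values come from, the
supported pool sizes and their counts can be read from sysfs on the node and
compared with what the kubelet reports; this is a sketch only, and the node
name `node1` is taken from the example above.

```
# Huge page pools supported by this node; each directory maps to a resource
# name (hugepages-2048kB -> hugepages-2Mi, hugepages-1048576kB -> hugepages-1Gi).
ls /sys/kernel/mm/hugepages/

# Number of pre-allocated pages in the 2Mi pool.
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Capacity and allocatable values advertised by the kubelet.
kubectl describe node node1 | grep -A 5 -E 'Capacity|Allocatable'
```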
## Pod Specification

A pod must request pre-allocated huge pages using the resource
`hugepages-<hugepagesize>`, whose quantity is a positive amount of memory in
bytes. The specified amount must align with `<hugepagesize>`; otherwise, the
pod will fail validation. For example, it is valid to request
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.

The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests a `hugepages-<hugepagesize>` resource is
at minimum in the `Burstable` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental
group that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration
of this group is outside the scope of this specification.

Initially, a pod may not consume multiple huge page sizes in a single pod spec.
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future as the
community presents use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via
volumes.

In order to consume huge pages backed by the `hugetlbfs` filesystem inside a
container, it is helpful to understand the set of mount options used with
`hugetlbfs`. For more details, see "Using Huge Pages" here:
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

```
mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends extending the existing `EmptyDirVolumeSource` to
satisfy this use case. A new `medium=HugePages` option would be supported. To
write into this volume, the pod must make a request for huge pages. The
`pagesize` argument is inferred from the `hugepages-<hugepagesize>` resource
request. If multiple huge page sizes are supported in a single pod spec in the
future, the `EmptyDirVolumeSource` may be modified to provide an optional page
size. The existing `sizeLimit` option for `emptyDir` would restrict usage to
the minimum of the specified `sizeLimit` and the sum of the huge page limits of
all containers in the pod. This keeps the behavior consistent with
memory-backed `emptyDir` volumes, whose usage is ultimately constrained by the
pod cgroup sandbox memory settings. The `min_size` mount option is omitted as
it is not necessary. The `nr_inodes` mount option is omitted at this time, in
the same manner it is omitted with `medium=Memory` when using `tmpfs`.

The following is a sample pod that is limited to 1Gi of huge pages of size 2Mi.
It can consume those pages using `shmget()` or via `mmap()` within the
specified volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```
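For the `shmget()` path mentioned above, a minimal sketch of how to find the
supplemental group a pod would need on a given node; configuring the group
itself remains out of scope.

```
# Gid allowed to create SysV shared memory segments backed by huge pages.
# A pod using shmget() with SHM_HUGETLB would list this gid in
# spec.securityContext.supplementalGroups.
cat /proc/sys/vm/hugetlb_shm_group
```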
## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec; see:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits

The CRI changes are required before promoting this feature to Beta.

## Cgroup Enforcement

To use this feature, `--cgroups-per-qos` must be enabled. In addition, the
`hugetlb` cgroup must be mounted.

The `kubepods` cgroup is bounded by the `Allocatable` value.

The QoS-level cgroups are left unbounded across all huge page pool sizes.

The pod-level cgroup sandbox is configured as follows, where `hugepagesize` is
a system-supported huge page size. If no request is made for huge pages of a
particular size, the limit is set to 0 for all supported sizes on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
```

If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.

The `kubelet` will ensure the `hugetlb` cgroup has no usage charged to the
pod-level cgroup sandbox prior to deleting the pod, to ensure all resources are
reclaimed.

## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similar to `cpu` and `memory`.

## Scheduler changes

The scheduler will need to ensure that any huge page request defined in the pod
spec can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. This will be used to determine capacity and
calculate allocatable values on the node.

## Roadmap

### Version 1.8

Initial alpha support for huge pages usage by pods.

### Version 1.9

Resource Quota support. Limit Range support. Beta support for huge pages
(pending community feedback).

## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The user running the Java app must be a member of the gid set in the
  `vm.hugetlb_shm_group` sysctl.
- The `kernel.shmmax` sysctl must allow the size of the shared memory segment.
- The user's memlock ulimit must allow the size of the shared memory segment.
- `vm.hugetlb_shm_group` is not namespaced.

### NUMA

NUMA is complicated. To support NUMA, the node must support CPU pinning,
devices, and memory locality. Extending that requirement to huge pages is not
much different. It is anticipated that the `kubelet` will provide future NUMA
locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.
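For completeness, a minimal sketch of how the node settings behind the
shared-memory issues above could be inspected; the output depends entirely on
the node's configuration.

```
# Group permitted to create SHM_HUGETLB segments.
sysctl vm.hugetlb_shm_group

# Maximum SysV shared memory segment size, in bytes.
sysctl kernel.shmmax

# The current user's max locked memory (memlock) limit, in kbytes.
ulimit -l
```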
