From 5e2f69c5006d109ccd29876643efca6803a3a941 Mon Sep 17 00:00:00 2001
From: Connor Doyle
Date: Tue, 2 Oct 2018 09:36:25 -0700
Subject: Renamed numa-manager.md => topology-manager.md

---
 contributors/design-proposals/node/numa-manager.md | 294 ---------------------
 .../design-proposals/node/topology-manager.md      | 294 +++++++++++++++++++++
 2 files changed, 294 insertions(+), 294 deletions(-)
 delete mode 100644 contributors/design-proposals/node/numa-manager.md
 create mode 100644 contributors/design-proposals/node/topology-manager.md

diff --git a/contributors/design-proposals/node/numa-manager.md b/contributors/design-proposals/node/numa-manager.md
deleted file mode 100644
index c8f24e24..00000000
--- a/contributors/design-proposals/node/numa-manager.md
+++ /dev/null
@@ -1,294 +0,0 @@
-# NUMA Manager
-
-_Authors:_
-
-* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com>
-* @balajismaniam - Balaji Subramaniam <balaji.subramaniam@intel.com>
-* @lmdaly - Louise M. Daly <louise.m.daly@intel.com>
-
-**Contents:**
-
-* [Overview](#overview)
-* [Motivation](#motivation)
-  * [Goals](#goals)
-  * [Non-Goals](#non-goals)
-  * [User Stories](#user-stories)
-* [Proposal](#proposal)
-  * [User Stories](#user-stories)
-  * [Proposed Changes](#proposed-changes)
-    * [New Component: NUMA Manager](#new-component-numa-manager)
-      * [Computing Preferred Affinity](#computing-preferred-affinity)
-      * [New Interfaces](#new-interfaces)
-    * [Changes to Existing Components](#changes-to-existing-components)
-* [Graduation Criteria](#graduation-criteria)
-  * [Phase 1: Alpha (target v1.13)](#phase-1-alpha-target-v113)
-  * [Phase 2: Beta (later versions)](#phase-2-beta-later-versions)
-  * [GA (stable)](#ga-stable)
-* [Challenges](#challenges)
-* [Limitations](#limitations)
-* [Alternatives](#alternatives)
-* [References](#references)
-
-# Overview
-
-An increasing number of systems leverage a combination of CPUs and
-hardware accelerators to support latency-critical execution and
-high-throughput parallel computation. These include workloads in fields
-such as telecommunications, scientific computing, machine learning,
-financial services and data analytics. Such hybrid systems comprise a
-high performance environment.
-
-In order to extract the best performance, optimizations related to CPU
-isolation and memory and device locality are required. However, in
-Kubernetes, these optimizations are handled by a disjoint set of
-components.
-
-This proposal provides a mechanism to coordinate fine-grained hardware
-resource assignments for different components in Kubernetes.
-
-# Motivation
-
-Multiple components in the Kubelet make decisions about system
-topology-related assignments:
-
-- CPU manager
-  - The CPU manager makes decisions about the set of CPUs a container is
-allowed to run on. The only implemented policy as of v1.8 is the static
-one, which does not change assignments for the lifetime of a container.
-- Device manager
-  - The device manager makes concrete device assignments to satisfy
-container resource requirements. Generally devices are attached to one
-peripheral interconnect. If the device manager and the CPU manager are
-misaligned, all communication between the CPU and the device can incur
-an additional hop over the processor interconnect fabric.
-- Container Network Interface (CNI)
-  - NICs including SR-IOV Virtual Functions have affinity to one NUMA node,
-with measurable performance ramifications.
-
-*Related Issues:*
-
-- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
-- [Discover nodes with NUMA architecture][nfd-issue-84]
-- [Support VF interrupt binding to specified CPU][sriov-issue-10]
-- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]
-
-Note that all of these concerns pertain only to multi-socket systems. Correct
-behavior requires that the kernel receive accurate topology information from
-the underlying hardware (typically via the SLIT table). See sections 5.2.16
-and 5.2.17 of the
-[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more
-information.
-
-## Goals
-
-- Arbitrate preferred NUMA node affinity for containers based on input from
-  CPU manager and Device Manager.
-- Provide an internal interface and pattern to integrate additional
-  topology-aware Kubelet components.
-
-## Non-Goals
-
-- _Inter-device connectivity:_ Decide device assignments based on direct
-  device interconnects. This issue can be separated from NUMA node
-  locality. Inter-device topology can be considered entirely within the
-  scope of the Device Manager, after which it can emit possible
-  NUMA affinities. The policy to reach that decision can start simple
-  and iterate to include support for arbitrary inter-device graphs.
-- _HugePages:_ This proposal assumes that pre-allocated HugePages are
-  spread among the available NUMA nodes in the system. We further assume
-  the operating system provides best-effort local page allocation for
-  containers (as long as sufficient HugePages are free on the local NUMA
-  node).
-- _CNI:_ Changing the Container Networking Interface is out of scope for
-  this proposal. However, this design should be extensible enough to
-  accommodate network interface locality if the CNI adds support in the
-  future. This limitation is potentially mitigated by the possibility to
-  use the device plugin API as a stopgap solution for specialized
-  networking requirements.
-
-## User Stories
-
-*Story 1: Fast virtualized network functions*
-
-A user asks for a "fast network" and automatically gets all the various
-pieces coordinated (hugepages, cpusets, network device) co-located on a
-NUMA node.
-
-*Story 2: Accelerated neural network training*
-
-A user asks for an accelerator device and some number of exclusive CPUs
-in order to get the best training performance, due to NUMA-alignment of
-the assigned CPUs and devices.
-
-# Proposal
-
-*Main idea: Two Phase NUMA coherence protocol*
-
-NUMA affinity is tracked at the container level, similar to devices and
-CPU affinity. At pod admission time, a new component called the NUMA Manager
-collects possible NUMA configurations from the Device Manager and the
-CPU Manager. The NUMA manager acts as an oracle for NUMA node affinity for
-those same components when they make concrete resource allocations. We
-expect the consulted components to use the inferred QoS class of each
-pod in order to prioritize the importance of fulfilling optimal NUMA
-affinity.
-
-## Proposed Changes
-
-### New Component: NUMA Manager
-
-This proposal is focused on a new component in the Kubelet called the
-NUMA Manager. The NUMA Manager implements the pod admit handler
-interface and participates in Kubelet pod admission. When the `Admit()`
-function is called, the NUMA manager collects NUMA hints from other
-Kubelet components.
-
-If the NUMA hints are not compatible, the NUMA manager could choose to
-reject the pod. The details of what to do in this situation need more
-discussion. For example, the NUMA manager could enforce strict NUMA
-alignment for Guaranteed QoS pods. Alternatively, the NUMA manager could
-simply provide best-effort NUMA alignment for all pods. The NUMA manager could
-use `softAdmitHandler` to keep the pod in `Pending` state.
-
-The NUMA Manager component will be disabled behind a feature gate until
-graduation from alpha to beta.
-
-#### Computing Preferred Affinity
-
-A NUMA hint is a list of possible NUMA node masks. After collecting hints
-from all providers, the NUMA Manager must choose some mask that is
-present in all lists. Here is a sketch:
-
-1. Apply a partial order on each list: number of bits set in the
-   mask, ascending. This biases the result to be more precise if
-   possible.
-1. Iterate over the permutations of preference lists and compute
-   bitwise-and over the masks in each permutation.
-1. Store the first non-empty result and break out early.
-1. If no non-empty result exists, return an error.
-
-The behavior when a match does not exist should be configurable. The Kubelet
-could support a config option to require strict NUMA assignment when set to
-`true`. A `false` value would mean best-effort NUMA alignment.
-
-#### New Interfaces
-
-```go
-package numamanager
-
-// Manager helps to coordinate NUMA-related resource assignments
-// within the Kubelet.
-type Manager interface {
-    lifecycle.PodAdmitHandler
-    Store
-    AddHintProvider(HintProvider)
-    RemovePod(podName string)
-}
-
-// NUMAMask is a bitmask-like type denoting a subset of available NUMA nodes.
-type NUMAMask struct{} // TBD
-
-// Store manages state related to the NUMA manager.
-type Store interface {
-    // GetAffinity returns the preferred NUMA affinity for the supplied
-    // pod and container.
-    GetAffinity(podName string, containerName string) NUMAMask
-}
-
-// HintProvider is implemented by Kubelet components that make
-// NUMA-related resource assignments. The NUMA manager consults each
-// hint provider at pod admission time.
-type HintProvider interface {
-    // GetNUMAHints returns candidate masks if this provider has a preference;
-    // otherwise it returns `_, false` to indicate "don't care".
-    GetNUMAHints(pod v1.Pod, containerName string) ([]NUMAMask, bool)
-}
-```
-
-_NUMA Manager and related interfaces (sketch)._
-
-![numa-manager-components](https://user-images.githubusercontent.com/379372/35370509-13dd9488-0143-11e8-998b-6b5115982842.png)
-
-_NUMA Manager components._
-
-![numa-manager-instantiation](https://user-images.githubusercontent.com/379372/35370513-17f90f70-0143-11e8-88e3-f199e9717946.png)
-
-_NUMA Manager instantiation and inclusion in pod admit lifecycle._
-
-### Changes to Existing Components
-
-1. Kubelet consults NUMA Manager for pod admission (discussed above).
-1. Add two implementations of NUMA Manager interface and a feature gate.
-    1. As much NUMA Manager functionality as possible is stubbed when the
-       feature gate is disabled.
-    1. Add a functional NUMA manager that queries hint providers in order
-       to compute a preferred NUMA node mask for each container.
-1. Add `GetNUMAHints()` method to CPU Manager.
-    1. CPU Manager static policy calls `GetAffinity()` method of NUMA
-       manager when deciding CPU affinity.
-1. Add `GetNUMAHints()` method to Device Manager.
-    1. Add NUMA Node ID to Device structure in the device plugin
-       interface. Plugins should be able to determine the NUMA node
-       easily when enumerating supported devices. For example, Linux
-       exposes the node ID in sysfs for PCI devices:
-       `/sys/devices/pci*/*/numa_node`. NOTE: this is `-1` on many
-       public cloud instances and single-node machines.
-    1. Device Manager calls `GetAffinity()` method of NUMA manager when
-       deciding device allocation.
-
-![numa-manager-wiring](https://user-images.githubusercontent.com/379372/35370514-1e10fb84-0143-11e8-84d3-99c9ca3af111.png)
-
-_NUMA Manager hint provider registration._
-
-![numa-manager-hints](https://user-images.githubusercontent.com/379372/35370517-234a5d34-0143-11e8-845a-80e5c66c7b72.png)
-
-_NUMA Manager fetches affinity from hint providers._
-
-# Graduation Criteria
-
-## Phase 1: Alpha (target v1.13)
-
-* Feature gate is disabled by default.
-* Alpha-level documentation.
-* Unit test coverage.
-* CPU Manager allocation policy takes NUMA hints into account.
-* Device plugin interface includes NUMA node ID.
-* Device Manager allocation policy takes NUMA hints into account.
-
-## Phase 2: Beta (later versions)
-
-* Feature gate is enabled by default.
-* Alpha-level documentation.
-* Node e2e tests.
-* Support hugepages alignment.
-* User feedback.
-
-## GA (stable)
-
-* *TBD*
-
-# Challenges
-
-* Testing the NUMA Manager in a continuous integration environment
-  depends on cloud infrastructure to expose multi-node NUMA topologies
-  to guest virtual machines.
-* Implementing the `GetNUMAHints()` interface may prove challenging.
-
-# Limitations
-
-* *TBD*
-
-# Alternatives
-
-* [AutoNUMA][numa-challenges]: This kernel feature affects memory
-  allocation and thread scheduling, but does not address device locality.
-
-# References
-
-* *TBD*
-
-[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
-[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
-[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
-[proposal-affinity]: https://github.com/kubernetes/community/pull/171
-[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078
diff --git a/contributors/design-proposals/node/topology-manager.md b/contributors/design-proposals/node/topology-manager.md
new file mode 100644
index 00000000..c8f24e24
--- /dev/null
+++ b/contributors/design-proposals/node/topology-manager.md
@@ -0,0 +1,294 @@
+# NUMA Manager
+
+_Authors:_
+
+* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com>
+* @balajismaniam - Balaji Subramaniam <balaji.subramaniam@intel.com>
+* @lmdaly - Louise M. Daly <louise.m.daly@intel.com>
+
+**Contents:**
+
+* [Overview](#overview)
+* [Motivation](#motivation)
+  * [Goals](#goals)
+  * [Non-Goals](#non-goals)
+  * [User Stories](#user-stories)
+* [Proposal](#proposal)
+  * [User Stories](#user-stories)
+  * [Proposed Changes](#proposed-changes)
+    * [New Component: NUMA Manager](#new-component-numa-manager)
+      * [Computing Preferred Affinity](#computing-preferred-affinity)
+      * [New Interfaces](#new-interfaces)
+    * [Changes to Existing Components](#changes-to-existing-components)
+* [Graduation Criteria](#graduation-criteria)
+  * [Phase 1: Alpha (target v1.13)](#phase-1-alpha-target-v113)
+  * [Phase 2: Beta (later versions)](#phase-2-beta-later-versions)
+  * [GA (stable)](#ga-stable)
+* [Challenges](#challenges)
+* [Limitations](#limitations)
+* [Alternatives](#alternatives)
+* [References](#references)
+
+# Overview
+
+An increasing number of systems leverage a combination of CPUs and
+hardware accelerators to support latency-critical execution and
+high-throughput parallel computation. These include workloads in fields
+such as telecommunications, scientific computing, machine learning,
+financial services and data analytics. Such hybrid systems comprise a
+high performance environment.
+
+In order to extract the best performance, optimizations related to CPU
+isolation and memory and device locality are required. However, in
+Kubernetes, these optimizations are handled by a disjoint set of
+components.
+
+This proposal provides a mechanism to coordinate fine-grained hardware
+resource assignments for different components in Kubernetes.
+
+# Motivation
+
+Multiple components in the Kubelet make decisions about system
+topology-related assignments:
+
+- CPU manager
+  - The CPU manager makes decisions about the set of CPUs a container is
+allowed to run on. The only implemented policy as of v1.8 is the static
+one, which does not change assignments for the lifetime of a container.
+- Device manager
+  - The device manager makes concrete device assignments to satisfy
+container resource requirements. Generally devices are attached to one
+peripheral interconnect. If the device manager and the CPU manager are
+misaligned, all communication between the CPU and the device can incur
+an additional hop over the processor interconnect fabric.
+- Container Network Interface (CNI)
+  - NICs including SR-IOV Virtual Functions have affinity to one NUMA node,
+with measurable performance ramifications.
+
+*Related Issues:*
+
+- [Hardware topology awareness at node level (including NUMA)][k8s-issue-49964]
+- [Discover nodes with NUMA architecture][nfd-issue-84]
+- [Support VF interrupt binding to specified CPU][sriov-issue-10]
+- [Proposal: CPU Affinity and NUMA Topology Awareness][proposal-affinity]
+
+Note that all of these concerns pertain only to multi-socket systems. Correct
+behavior requires that the kernel receive accurate topology information from
+the underlying hardware (typically via the SLIT table). See sections 5.2.16
+and 5.2.17 of the
+[ACPI Specification](http://www.acpi.info/DOWNLOADS/ACPIspec50.pdf) for more
+information.
+
+## Goals
+
+- Arbitrate preferred NUMA node affinity for containers based on input from
+  CPU manager and Device Manager.
+- Provide an internal interface and pattern to integrate additional
+  topology-aware Kubelet components.
+
+## Non-Goals
+
+- _Inter-device connectivity:_ Decide device assignments based on direct
+  device interconnects. This issue can be separated from NUMA node
+  locality. Inter-device topology can be considered entirely within the
+  scope of the Device Manager, after which it can emit possible
+  NUMA affinities. The policy to reach that decision can start simple
+  and iterate to include support for arbitrary inter-device graphs.
+- _HugePages:_ This proposal assumes that pre-allocated HugePages are
+  spread among the available NUMA nodes in the system. We further assume
+  the operating system provides best-effort local page allocation for
+  containers (as long as sufficient HugePages are free on the local NUMA
+  node).
+- _CNI:_ Changing the Container Networking Interface is out of scope for
+  this proposal. However, this design should be extensible enough to
+  accommodate network interface locality if the CNI adds support in the
+  future. This limitation is potentially mitigated by the possibility to
+  use the device plugin API as a stopgap solution for specialized
+  networking requirements.
+
+## User Stories
+
+*Story 1: Fast virtualized network functions*
+
+A user asks for a "fast network" and automatically gets all the various
+pieces coordinated (hugepages, cpusets, network device) co-located on a
+NUMA node.
+
+*Story 2: Accelerated neural network training*
+
+A user asks for an accelerator device and some number of exclusive CPUs
+in order to get the best training performance, due to NUMA-alignment of
+the assigned CPUs and devices.
+
+# Proposal
+
+*Main idea: Two Phase NUMA coherence protocol*
+
+NUMA affinity is tracked at the container level, similar to devices and
+CPU affinity. At pod admission time, a new component called the NUMA Manager
+collects possible NUMA configurations from the Device Manager and the
+CPU Manager. The NUMA manager acts as an oracle for NUMA node affinity for
+those same components when they make concrete resource allocations. We
+expect the consulted components to use the inferred QoS class of each
+pod in order to prioritize the importance of fulfilling optimal NUMA
+affinity.
+
+## Proposed Changes
+
+### New Component: NUMA Manager
+
+This proposal is focused on a new component in the Kubelet called the
+NUMA Manager. The NUMA Manager implements the pod admit handler
+interface and participates in Kubelet pod admission. When the `Admit()`
+function is called, the NUMA manager collects NUMA hints from other
+Kubelet components.
+
+If the NUMA hints are not compatible, the NUMA manager could choose to
+reject the pod. The details of what to do in this situation need more
+discussion. For example, the NUMA manager could enforce strict NUMA
+alignment for Guaranteed QoS pods. Alternatively, the NUMA manager could
+simply provide best-effort NUMA alignment for all pods. The NUMA manager could
+use `softAdmitHandler` to keep the pod in `Pending` state.
+
+The NUMA Manager component will be disabled behind a feature gate until
+graduation from alpha to beta.
+
+#### Computing Preferred Affinity
+
+A NUMA hint is a list of possible NUMA node masks. After collecting hints
+from all providers, the NUMA Manager must choose some mask that is
+present in all lists. Here is a sketch:
+
+1. Apply a partial order on each list: number of bits set in the
+   mask, ascending. This biases the result to be more precise if
+   possible.
+1. Iterate over the permutations of preference lists and compute
+   bitwise-and over the masks in each permutation.
+1. Store the first non-empty result and break out early.
+1. If no non-empty result exists, return an error.
+
+The behavior when a match does not exist should be configurable. The Kubelet
+could support a config option to require strict NUMA assignment when set to
+`true`. A `false` value would mean best-effort NUMA alignment.
+
+#### New Interfaces
+
+```go
+package numamanager
+
+// Manager helps to coordinate NUMA-related resource assignments
+// within the Kubelet.
+type Manager interface {
+    lifecycle.PodAdmitHandler
+    Store
+    AddHintProvider(HintProvider)
+    RemovePod(podName string)
+}
+
+// NUMAMask is a bitmask-like type denoting a subset of available NUMA nodes.
+type NUMAMask struct{} // TBD
+
+// Store manages state related to the NUMA manager.
+type Store interface {
+    // GetAffinity returns the preferred NUMA affinity for the supplied
+    // pod and container.
+    GetAffinity(podName string, containerName string) NUMAMask
+}
+
+// HintProvider is implemented by Kubelet components that make
+// NUMA-related resource assignments. The NUMA manager consults each
+// hint provider at pod admission time.
+type HintProvider interface {
+    // GetNUMAHints returns candidate masks if this provider has a preference;
+    // otherwise it returns `_, false` to indicate "don't care".
+    GetNUMAHints(pod v1.Pod, containerName string) ([]NUMAMask, bool)
+}
+```
+
+_NUMA Manager and related interfaces (sketch)._
+
+![numa-manager-components](https://user-images.githubusercontent.com/379372/35370509-13dd9488-0143-11e8-998b-6b5115982842.png)
+
+_NUMA Manager components._
+
+![numa-manager-instantiation](https://user-images.githubusercontent.com/379372/35370513-17f90f70-0143-11e8-88e3-f199e9717946.png)
+
+_NUMA Manager instantiation and inclusion in pod admit lifecycle._
+
+### Changes to Existing Components
+
+1. Kubelet consults NUMA Manager for pod admission (discussed above).
+1. Add two implementations of NUMA Manager interface and a feature gate.
+    1. As much NUMA Manager functionality as possible is stubbed when the
+       feature gate is disabled.
+    1. Add a functional NUMA manager that queries hint providers in order
+       to compute a preferred NUMA node mask for each container.
+1. Add `GetNUMAHints()` method to CPU Manager.
+    1. CPU Manager static policy calls `GetAffinity()` method of NUMA
+       manager when deciding CPU affinity.
+1. Add `GetNUMAHints()` method to Device Manager.
+    1. Add NUMA Node ID to Device structure in the device plugin
+       interface. Plugins should be able to determine the NUMA node
+       easily when enumerating supported devices. For example, Linux
+       exposes the node ID in sysfs for PCI devices:
+       `/sys/devices/pci*/*/numa_node`. NOTE: this is `-1` on many
+       public cloud instances and single-node machines.
+    1. Device Manager calls `GetAffinity()` method of NUMA manager when
+       deciding device allocation.
+
+![numa-manager-wiring](https://user-images.githubusercontent.com/379372/35370514-1e10fb84-0143-11e8-84d3-99c9ca3af111.png)
+
+_NUMA Manager hint provider registration._
+
+![numa-manager-hints](https://user-images.githubusercontent.com/379372/35370517-234a5d34-0143-11e8-845a-80e5c66c7b72.png)
+
+_NUMA Manager fetches affinity from hint providers._
+
+# Graduation Criteria
+
+## Phase 1: Alpha (target v1.13)
+
+* Feature gate is disabled by default.
+* Alpha-level documentation.
+* Unit test coverage.
+* CPU Manager allocation policy takes NUMA hints into account.
+* Device plugin interface includes NUMA node ID.
+* Device Manager allocation policy takes NUMA hints into account.
+
+## Phase 2: Beta (later versions)
+
+* Feature gate is enabled by default.
+* Alpha-level documentation.
+* Node e2e tests.
+* Support hugepages alignment.
+* User feedback.
+
+## GA (stable)
+
+* *TBD*
+
+# Challenges
+
+* Testing the NUMA Manager in a continuous integration environment
+  depends on cloud infrastructure to expose multi-node NUMA topologies
+  to guest virtual machines.
+* Implementing the `GetNUMAHints()` interface may prove challenging.
+
+# Limitations
+
+* *TBD*
+
+# Alternatives
+
+* [AutoNUMA][numa-challenges]: This kernel feature affects memory
+  allocation and thread scheduling, but does not address device locality.
+
+# References
+
+* *TBD*
+
+[k8s-issue-49964]: https://github.com/kubernetes/kubernetes/issues/49964
+[nfd-issue-84]: https://github.com/kubernetes-incubator/node-feature-discovery/issues/84
+[sriov-issue-10]: https://github.com/hustcat/sriov-cni/issues/10
+[proposal-affinity]: https://github.com/kubernetes/community/pull/171
+[numa-challenges]: https://queue.acm.org/detail.cfm?id=2852078
--
cgit v1.2.3
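Note on "Computing Preferred Affinity": the four steps listed in that section of the proposal could be sketched in Go roughly as below. This is only an illustration under stated assumptions, not code from the commit. The proposal leaves `NUMAMask` as TBD, so the sketch assumes a `uint64` bitmask where bit i stands for NUMA node i. The names `mergeHints` and `ErrNoAffinity` are hypothetical. "Permutations" is read here as picking one mask from each provider's list, and providers that answer "don't care" are assumed to be left out of the input.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
	"sort"
)

// NUMAMask is an assumed representation (the proposal leaves it TBD):
// bit i set means NUMA node i is acceptable.
type NUMAMask uint64

// ErrNoAffinity is a hypothetical error for the "no non-empty result" case.
var ErrNoAffinity = errors.New("no common NUMA affinity")

// mergeHints takes one list of candidate masks per hint provider and
// returns the first non-empty bitwise-AND found across combinations.
func mergeHints(hints [][]NUMAMask) (NUMAMask, error) {
	sorted := make([][]NUMAMask, len(hints))
	for i, h := range hints {
		if len(h) == 0 {
			// A provider with a preference but no feasible mask.
			return 0, ErrNoAffinity
		}
		// Step 1: order each list by number of set bits, ascending,
		// so narrower (more precise) masks are tried first.
		c := append([]NUMAMask(nil), h...)
		sort.SliceStable(c, func(a, b int) bool {
			return bits.OnesCount64(uint64(c[a])) < bits.OnesCount64(uint64(c[b]))
		})
		sorted[i] = c
	}

	// Step 2: walk every combination (one mask per provider) using an
	// odometer-style index vector; the last provider varies fastest.
	idx := make([]int, len(sorted))
	for {
		acc := ^NUMAMask(0) // with no providers this stays "unconstrained"
		for i, j := range idx {
			acc &= sorted[i][j]
		}
		// Step 3: keep the first non-empty intersection and stop early.
		if acc != 0 {
			return acc, nil
		}
		k := len(idx) - 1
		for ; k >= 0; k-- {
			idx[k]++
			if idx[k] < len(sorted[k]) {
				break
			}
			idx[k] = 0
		}
		// Step 4: every combination was empty.
		if k < 0 {
			return 0, ErrNoAffinity
		}
	}
}

func main() {
	cpuHints := []NUMAMask{0x3, 0x1, 0x2} // either node, node 0, node 1
	devHints := []NUMAMask{0x2}           // device sits on node 1
	mask, err := mergeHints([][]NUMAMask{cpuHints, devHints})
	fmt.Printf("mask=%#x err=%v\n", mask, err)
}
```

With the inputs in `main`, the CPU list is tried in the order node 0, node 1, either node, so the result is mask `0x2` (node 1 only). Whether an error here leads to pod rejection or to best-effort placement is the configurable behavior the proposal leaves open.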