Device plugin proposal patch by Jiaying

author: Jiaying Zhang <jiayingz@google.com> 2017-07-19 17:53:03 -0700
committer: Renaud Gaubert <renaud.gaubert@gmail.com> 2017-07-20 16:25:37 -0700
commit: 4cbc77e491e4cdba320dc3e17c5dd1d9376c1328 (patch)
tree: 4adfbca32336af1689e5068271355fbeaa2fc8f6
parent: 9d7a245a238a5e72346aa490a619d9cedc3e98bf (diff)
2 files changed, 303 insertions, 128 deletions
diff --git a/contributors/design-proposals/device-plugin-2.png b/contributors/design-proposals/device-plugin-2.png
new file mode 100644
index 00000000..60892429
--- /dev/null
+++ b/contributors/design-proposals/device-plugin-2.png
diff --git a/contributors/design-proposals/device-plugin.md b/contributors/design-proposals/device-plugin.md
index 339070d8..2f73c932 100644
--- a/contributors/design-proposals/device-plugin.md
+++ b/contributors/design-proposals/device-plugin.md
@@ -1,99 +1,77 @@
-# Device Manager Proposal
-
-  1. [Abstract](#abstract)
-  2. [Motivation](#motivation)
-  3. [Use Cases](#use-cases)
-  4. [Objectives](#objectives)
-  5. [Non Objectives](#non-objectives)
-  6. [Stories](#stories)
-      * [Vendor story](#vendor-story)
-      * [User story](#user-story)
-  8. [Device Plugin](#device-plugin)
-      * [Protocol Overview](#protocol-overview)
-      * [Protobuf specification](#protobuf-specification)
-      * [Installation](#installation)
-      * [API Changes](#api-changes)
-      * [Versioning](#versioning)
+Device Manager Proposal
+===============
+
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Motivation](#motivation)
+- [Use Cases](#use-cases)
+- [Objectives](#objectives)
+- [Non Objectives](#non-objectives)
+- [Proposed Implementation 1](#proposed-implementation-1)
+  - [Vendor story](#vendor-story)
+  - [End User story](#end-user-story)
+  - [Device Plugin](#device-plugin)
+    - [Introduction](#introduction)
+    - [Registration](#registration)
+    - [Unix Socket](#unix-socket)
+    - [Protocol Overview](#protocol-overview)
+    - [Protobuf specification](#protobuf-specification)
+- [Proposed Implementation 2](#proposed-implementation-2)
+  - [Device Plugin Lifecycle](#device-plugin-lifecycle)
+  - [Protobuf API](#protobuf-api)
+  - [Failure recovery](#failure-recovery)
+  - [Roadmap](#roadmap)
+  - [Open Questions](#open-questions-1)
+- [Installation](#installation)
+- [Versioning](#versioning)
+  - [References](#references)
+
+<!-- END MUNGE: GENERATED_TOC -->
 
 _Authors:_
 
 * @RenaudWasTaken - Renaud Gaubert &lt;rgaubert@NVIDIA.com&gt;
 
-## Abstract
+# Motivation
+
+Kubernetes currently supports discovery of CPU and Memory primarily to a
+minimal extent. Very few devices are handled natively by Kubelet.
+
+It is not a sustainable solution to expect every vendor to add their vendor
+specific code inside Kubernetes to make their devices usable.
+Instead, we want a solution for vendors to be able to advertise their resources
+to Kubelet and monitor them without writing custom Kubernetes code.
+We also want to provide a consistent and portable solution for users to
+consume hardware devices across k8s clusters.
 
 This document describes a vendor independant solution to:
   * Discovering and representing external devices
-  * Making these devices available to the container and cleaning them up
-    afterwards
-  * Health Check of these devices
+  * Making these devices available to the containers using these devices and
+    cleaning them up afterwards
+  * Monitoring these devices
 
 Because devices are vendor dependant and have their own sets of problems
-and mechanisms, the solution we describe is a plugin mechanism managed by
-Kubelet.
-
-At their core, device plugins are simple gRPC servers that may run in a
-container deployed through the pod mechanism.
-
-These servers implement the gRPC interface defined later in this design
-document and once the device plugin makes itself know to kubelet, kubelet
-will interact with the device through three simple functions:
-  1. A `Discover` function for the kubelet to Discover the devices and
-     their properties.
-  2. An `Allocate` and `Deallocate` function which are called respectively
-     before container creation and after container deletion with the
-     devices to allocate and deallocate.
-  3. A `Monitor` function to notify Kubelet whenever a device becomes
-     unhealthy.
+and mechanisms, the solution we describe is a plugin mechanism that may run
+in a container deployed through the DaemonSets mechanism.
+The targeted devices include GPUs, High-performance NICs, FPGAs, InfiniBand,
+Storage devices, and other similar computing resources that require vendor
+specific initialization and setup.
 
 The goal is for a user to be able to enable vendor devices (e.g: GPUs) through
-the simple following steps:
+the following simple steps:
   * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
   * When launching `kubectl describe nodes`, the devices appear in the node spec
   * In the long term users will be able to select them through Resource Class
 
-We expect the plugins to be deployed across the clusters through DaemonSets.
-The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....
-
-
-## Motivation
-
-Kubernetes currently supports discovery of CPU and Memory primarily to a
-minimal extent. Very few devices are handled natively by Kubelet.
-
-It is not a sustainable solution to expect every vendor to add their vendor
-specific code inside Kubernetes. This approach does not scale and is not
-portable.
-
-We want a solution for those vendors to be able to advertise their resources
-to kubelet and monitor them.
-We also want a way for the user to specify which resource their jobs will use
-and what constraints are associated to these resources.
-
-In order to solve this problem it is obvious that we need a plugin system in
-order to have vendors advertise and monitor their resources on behalf
-of Kubelet.
-
-Additionally, we introduce the concept of Device to be able to select
-resources with constraints in a pod spec.
-
-_GPU Integration Example:_
-  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
-  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
-
-_Kubernetes Meeting Notes On This:_
-  * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
-  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
-  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
-
-## Use Cases
+# Use Cases
 
-  * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
-    in my pod.
-  * I should be able to use that device without writing custom Kubernetes code.
-  * I want a consistent and portable solution to consume hardware devices
-    across k8s clusters.
+ * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
+   in my pod.
+ * I should be able to use that device without writing custom Kubernetes code.
+ * I want a consistent and portable solution to consume hardware devices
+   across k8s clusters.
 
-## Objectives
+# Objectives
 
 1. Add support for vendor specific Devices in kubelet:
     * Through a pluggable mechanism.
@@ -103,16 +81,18 @@ _Kubernetes Meeting Notes On This:_
 2. Define a deployment mechanism for this new API.
 3. Define a versioning mechanism for this new API.
 
-## Non Objectives
-1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
+# Non Objectives
+
+1. Advanced scheduling and resource selection (solved through
+   [#782](https://github.com/Kubernetes/community/pull/782)).
    We will only try to give basic selection primitives to the devices
 2. Metrics: this should be the job of cadvisor and should probably either be
    addressed there (cadvisor) or if people feel there is a case to be made
    for it being addressed in the Device Plugin, in a follow up proposal.
 
-## Stories
+# Proposed Implementation 1
 
-### Vendor story
+## Vendor story
 
 Kubernetes provides to vendors a mechanism called device plugins to:
   * advertise devices.
@@ -144,7 +124,7 @@ onwn gRPC server.
 Only then will kubelet start interacting with the vendor's device plugin
 through the gRPC apis.
 
-### End User story
+## End User story
 
 When setting up the cluster the admin knows what kind of devices are present
 on the different machines and therefore can select what devices they want to
@@ -182,6 +162,7 @@ He might in the future be in charge of selecting the device.
 ## Device Plugin
 
 ### Introduction
+
 The device plugin is structured in 5 parts:
 1. Registration: The device plugin advertises it's presence to Kubelet
 2. Discovery: Kubelet calls the device plugin to list it's devices
@@ -189,7 +170,7 @@ The device plugin is structured in 5 parts:
    devices advertised by the device plugin, Kubelet calls the device plugin's
    `Allocate` and `Deallocate` functions.
 4. Cleanup: Kubelet terminates the communication through a "Stop"
-4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
+5. Heartbeat: The device plugin polls Kubelet to know if it's still alive
    and if it has to re-issue a Register request
 
 ### Registration
@@ -247,7 +228,7 @@ The device plugin is also expected to periodically call the `Heartbeat` function
 exposed by Kubelet and issue a `Registration` request when it either can't reach
 Kubelet or Kubelet answers with a `KO` response.
 
-![Process](./device-plugin.png)
+![Process](device-plugin.png)
 
 
 ### Protobuf specification
@@ -343,7 +324,233 @@ message DeviceHealth {
 }
 ```
 
-## Installation
+# Proposed Implementation 2
+
+The main strategy of this proposed implemenation is that we want to start with
+something simple that can show benefits on our immediate use cases, yet the
+API design should be extendable to support future requirements.
+Here are the main motivations for this alternative proposed implementation:
+
+* Discovery phase: can we eliminate this gRPC procedure? It seems more
+  natural for device plugin to send Kubelet the discovered device information
+  right after device initialization and the registration gRPC procedure.
+* The current implementation uses gRPC to communicate between Kubelet and
+  device plugin. Both Kubelet and device plugin need to start a gRPC server
+  for two-way communication. This seems a bit complicated. Can we have
+  device plugin send enough information to Kubelet so that we only need
+  Kubelet to start gRPC server and device plugin is kept as gRPC client?
+  The main concern with one-way gRPC communication is that we can NOT
+  support device specific operations, like reset device, during
+  allocation/deallocation. Depending on how long we expect device specific
+  operations to take, we can support this feature later by
+  having device plugin also provide a gRPC service or Device Plugin
+  can instruct Kubelet to perform device specific operation hooks.
+* Do we need checkpointing in the initial prototype implementation?
+  Even in alpha release, we may still want to be able to recover from
+  various failure scenario. Otherwise, it would affect user experience.
+  Currently, it seems the only information we need to record somewhere
+  is what device is allocated to what pod/container. There have been
+  discussions on different ways and places to record this information.
+  The approach taken by the current implementation pushes this information
+  to ApiServer by extending NodeStatus interface between Kubelet and ApiServer.
+  The major concern on this approach is that it introduced an API extension
+  apart from the current model (Currently Node information recorded at ApiServer
+  only contains resource capacity information. Resource allocation information
+  is kept at Node). The second approach is for Kubelet to checkpoint this
+  information. This seems to align with the current Kubernetes model that
+  Kubelet is the component to implement allocation/deallocation functionalities.
+  The information we want to checkpoint, i.e., what device is allocated to what
+  pod/container, also seems generic enough to be implemented at Kubelet.
+  It may also allow other use cases outside device plugin, e.g., cpu pin.
+  The third approach is to implement this in device plugin. This way,
+  device plugin can also record any state information useful to its own
+  failure recovery in checkpoints. One concern on this approach is that it
+  may add more burdens on vendors to implement their device plugin images.
+  Surprises might happen in production if things were not implemented correctly.
+  It also seems apart from the current model as today Kubelet is the place
+  where allocation/deallocation happens for other types of resources.
+* Heartbeat: do we need this to make sure connections can be re-established
+  between kubelet and device plugin? Can we reuse keepalive feature from gRPC?
+  Or if Kubelet checkpoints device allocation state information, device plugin
+  may only need to detect Kubelet failure when it needs to update device
+  information. Or can device plugin send periodic device state updates
+  (this may be needed anyway if we want to collect device usage stats)
+  and use that to detect Kubelet failure or device plugin failure?
+
+## Device Plugin Lifecycle
+
+![Process](device-plugin-2.png)
+
+1. User or cluster admin push vendor-specific device plugin DaemonSets.
+   The DaemonSets YAML config includes mountPaths to the host directories
+   where device driver, user-space libraries, and tools will be installed.
+2. After device plugin container is brought up, it detects the specific
+   types of HW devices. If such devices exist, it initializes these devices
+   and sets up the environments to access these devices (e.g., install
+   device drivers, user-space libraries, and tools).
+3. After initialization, device plugin queries HW device states through the
+   installed device monitoring tools or other device interfaces. Then device
+   plugin connects to the Kubelet device plugin gRPC server and sends it the
+   obtained list of HW device information. In the initial prototype, the
+   device resource exported by a device plugin can be implemented as an
+   [extended OIR](https://github.com/kubernetes/kubernetes/pull/48922)
+   with special prefix “extensions.kubernetes.io/”, plus device resource_name
+   that uniquely identifies a device plugin on a node.
+   Kubelet can use existing API to add this resource to API server so that the
+   device resource is available for scheduling.
+4. Device plugin runs in a loop to continuously query HW device states. If it
+   detects any changes, it sends the Kubelet device plugin gRPC server the new
+   list of HW device information. Kubelet can use this information to update its
+   device capacity states and if necessary, re-allocate new device to a user
+   container with unhealthy allocated devices.
+5. When Kubelet receives an allocation request for a HW device advertised
+   by a device plugin (i.e., resource with “extensions.kubernetes.io/” prefix
+   plus device resource_name), it updates its internal allocation state,
+   issues certain calls to CRI to bind mount the host directories where
+   user-space libraries and tools are installed to the device-specific
+   default directories in user Pod or set up certain environment variables,
+   and checkpoints the container-to-device allocation information to persistent
+   storage.
+6. When user container accessing the device finishes, Kubelet updates its
+   internal state to deallocate the device, and updates its checkpoint state
+   in persistent storage.
+7. When device plugin DaemonSets is removed, clean up device state (e.g., uninstall
+   device driver, remove user-space libraries and tools). This step can be
+   specified as a preStop
+   [container lifecycle step](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/).
+   Note one implication from this approach is that device plugin upgrade process
+   will be disruptive. It will need more thinkings if we want to
+   support non-disruptive upgrade process.
+
+## Protobuf API
+
+```go
+
+service PluginResource {
+	rpc Register(RegisterRequest) returns (RegisterResponse) {}
+	rpc ReportDeviceInfo(ReportRequest) returns (ReportResponse) {}
+}
+
+message RegisterRequest {
+	// Version of the API the Device Plugin was built against
+	string version = 1;
+	// E.g., "nvidia-gpu". Used to construct OIR:
+	// “extensions.kubernetes.io/resourcename”.
+	string resourcename = 2;
+}
+
+message RegisterResponse {
+	bool success = 1;
+	// Kubelet fills this field with details if it encounters any errors
+	// during the registration process, e.g., for version mismatch, what
+	// is the required version and minimum supported version by kubelet.
+	string error = 2;
+}
+
+message ReportRequest {
+	repeated DeviceInfo devices = 1;
+}
+
+message DeviceInfo {
+      // E.g., "GPU-fef8089b-4820-abfc-e83e-94318197576e".
+      // Needs to be unique per device plugin.
+	string Id = 1;
+      // E.g., UNKNOWN, HEALTHY, UNHEALTHY.
+	enum State = 2;
+      // E.g., {"/rootfs/nvidia":"/usr/local/nvidia"}
+      // Maps from host directory where device library or tools
+      // are installed to user pod directory where the library or 
+      // tools are expected to be accessed. Kubelet will use this 
+      // information to bind mount host directory to the user pod
+      // directory during allocation.
+       map<string, string> mountpaths = 3;
+      // E.g., {"LD_PRELOAD":"xxx.so"}. Kubelet will export these
+      // env variables in user pod during allocation.
+       map<string, string> envariables = 4;
+       // E.g., {"Family":"Pascal"} {"ECC":"True"}
+       // These fields can be used as node labels for selection
+       map<string, string> labels = 4;
+}
+
+message ReportResponse {
+	bool success = 1;
+	// Kubelet fills this field if it encounters any errors
+	// during the report process, e.g., device plugin hasn’t
+	// registered yet (could happen when Kubelet restarts).
+	string error = 2;
+}
+```
+
+## Failure recovery
+
+* Device failure: Device plugin should be able to detect device failure and
+  report that to Kubelet. Kubelet should then remove the failed device from
+  available list. If there is any user container using the failed device,
+  Kubelet may terminate the user container and reschedule it on a good
+  available device. When a failed device recovers, device plugin will send
+  Kubelet the updated device state and Kubelet can add the device to the
+  available device list.
+* Kubelet crash: When Kubelet restarts after a crash, it should be able to
+  recover allocation states from the checkpoints recorded on persistent storage.
+  The checkpoint records should include allocated device id to pod mapping
+  information as well as non-allocated device information, so Kubelet can
+  re-establish precise allocation state. Device plugin should be able to
+  detect Kubelet failure when it needs to update device informaiton,
+  and re-registers with the new Kubelet.
+* Device plugin crash: A device plugin is deployed through DaemonSets.
+  If a device plugin process fails, Kubelet will detect that and automatically
+  restart it. After restart, device plugin will reconnect to Kubelet and
+  report the current device states. Kubelet can compare the reported device
+  information with its internal device states, and makes adjustments if
+  necessary. One thing we need to pay special attention is that device plugin
+  may fail at any time, e.g., during initialization. When the new device plugin
+  process starts, it needs to be able to recover from incomplete states.
+
+## Roadmap
+
+* Phase 1: device plugin is supported in alpha mode in 1.8 kubernetes release.
+  Make sure it provides the following functionalities: initialize, discover,
+  allocate/deallocate, cleanup, basic health check, and can recover from device,
+  Kubelet or device plugin failures . Make sure the interface is kept simple and
+  extensible, and the document is clear. With the initial implemented API,
+  make sure we can use the interface to implement device plugin images for at
+  least two types of devices: Nvidia GPU and Solarflare NIC. Note the support
+  for Nvidia GPU will help gpu support to enter beta by providing a general
+  and extendable api. Test coverage: e2e tests with the developed Nvidia GPU
+  image and Solarflare image to make sure these devices can be correctly
+  initialized, allocated, deallocated, and cleaned up. Also should test we
+  can recover from device failure, Kubelet restarts, and device plugin failure.
+* Phase 2: device plugin is supported in beta mode in 1.9 kubernetes release.
+  At this phase, the primary design and API should be stabilized. We need to
+  implement authentication mechanism to ensure only trusted device plugin
+  images can be registered. We can support device specific
+  allocation/deallocation requests by having device plugin also provide a gRPC
+  service or Device Plugin can instruct Kubelet to perform device specific
+  operation hooks during allocation/deallocation procedures.
+  Hopefully at this time, we may make good progress on supporting more flexible
+  resource allocation policies in Kubernetes, and with that, we can switch
+  device plugin from using OIR to using ResourceClass to allow more efficient
+  HW specific resource allocations, e.g., topology aware resource allocations,
+  NUMA aware resource allocations etc.
+* Phase 3: device plugin is supported in GA mode in 1.10 kubernetes release.
+  Device plugin should have clear error handling and problem report that
+  allows easy debuggability and monitoring on its exported devices.
+  We should have clear documentation on how to develop a device plugin and
+  how interact with device plugin. The framework needs to be stable and
+  demonstrate good user experiences through the support on multiple types
+  of devices, such as GPU, Infiniband, high-performance NIC, and etc.
+
+## Open Questions
+
+* The proposal assumes we can omit device specific allocation/deallocation
+operations in the alpha release and support this feature in later releases.
+If people are concerned that such omission would impact the usability of
+alpha release, we will need to come up with a solution that would either
+require two-way gRPC communication between Kubelet and Device Plugin or
+Device Plugin can instruct Kubelet to perform device specific operation hooks
+during allocation/deallocation procedures.
+
+# Installation
 
 The installation process should be straightforward to the user, transparent
 and similar to other regular Kubernetes actions.
@@ -366,6 +573,7 @@ as `kubeadm` they would use the examples that we would provide at:
 `https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
 
 YAML example:
+
 ```yaml
 apiVersion: extensions/v1beta1
 kind: DaemonSet
@@ -388,48 +596,6 @@ spec:
                    path: /var/run/Kubernetes
 ```
 
-## API Changes
-### Device
-
-When discovering the devices, Kubelet will be in charge of advertising those
-resources to the API server.
-
-We will advertise each device returned by the Device Plugin in a new structure
-called `Device`.
-It is defined as follows:
-
-```golang
-type Device struct {
-	Kind       string
-	Vendor     string
-	Name       string
-	Health     DeviceHealthStatus
-	Properties map[string]string
-}
-```
-
-Because the current API (Capacity) can not be extended to support Device,
-we will need to create two new attributes in the NodeStatus structure:
-  * `DevCapacity`: Describing the device capacity of the node
-  * `DevAvailable`: Describing the available devices
-
-```golang
-type NodeStatus struct {
-	DevCapacity []Device
-	DevAvailable []Device
-}
-```
-
-We also introduce the `Allocated` field in the pod's status so that user
-can know what devices were assigned to the pod. It could also be useful in
-the case of monitoring
-
-```golang
-type ContainerStatus struct {
-	Devices []Device
-}
-```
-
 # Versioning
 
 Currently there is only one part (CRI) of Kubernetes which is based on
@@ -469,3 +635,12 @@ Negotiation would take place in the registration:
    contacts the Device Plugin
 4. If the Device Plugin supports the version sent by Kubelet it can and should
    answer the different calls made by Kubelet
+
+## References
+
+  * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
+  * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
+  * [Kubernetes Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
+  * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
+  * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+
author	Jiaying Zhang <jiayingz@google.com>	2017-07-19 17:53:03 -0700
committer	Renaud Gaubert <renaud.gaubert@gmail.com>	2017-07-20 16:25:37 -0700
commit	4cbc77e491e4cdba320dc3e17c5dd1d9376c1328 (patch)
tree	4adfbca32336af1689e5068271355fbeaa2fc8f6
parent	9d7a245a238a5e72346aa490a619d9cedc3e98bf (diff)