summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorRenaud Gaubert <rgaubert@nvidia.com>2017-06-07 14:38:16 -0700
committerRenaud Gaubert <rgaubert@nvidia.com>2017-07-15 21:50:23 -0700
commit9d7a245a238a5e72346aa490a619d9cedc3e98bf (patch)
tree0bb14285988c1acbe025befbc0c94c3d788c5ea9
parentc4265903506939f9f281b138c372acdbd1c2babf (diff)
Added Device plugin proposal
-rw-r--r--contributors/design-proposals/device-plugin.md471
-rw-r--r--contributors/design-proposals/device-plugin.pngbin0 -> 61264 bytes
2 files changed, 471 insertions, 0 deletions
diff --git a/contributors/design-proposals/device-plugin.md b/contributors/design-proposals/device-plugin.md
new file mode 100644
index 00000000..339070d8
--- /dev/null
+++ b/contributors/design-proposals/device-plugin.md
@@ -0,0 +1,471 @@
+# Device Manager Proposal
+
+ 1. [Abstract](#abstract)
+ 2. [Motivation](#motivation)
+ 3. [Use Cases](#use-cases)
+ 4. [Objectives](#objectives)
+ 5. [Non Objectives](#non-objectives)
+ 6. [Stories](#stories)
+ * [Vendor story](#vendor-story)
+ * [User story](#user-story)
+ 8. [Device Plugin](#device-plugin)
+ * [Protocol Overview](#protocol-overview)
+ * [Protobuf specification](#protobuf-specification)
+ * [Installation](#installation)
+ * [API Changes](#api-changes)
+ * [Versioning](#versioning)
+
+_Authors:_
+
+* @RenaudWasTaken - Renaud Gaubert &lt;rgaubert@NVIDIA.com&gt;
+
+## Abstract
+
+This document describes a vendor independant solution to:
+ * Discovering and representing external devices
+ * Making these devices available to the container and cleaning them up
+ afterwards
+ * Health Check of these devices
+
+Because devices are vendor dependant and have their own sets of problems
+and mechanisms, the solution we describe is a plugin mechanism managed by
+Kubelet.
+
+At their core, device plugins are simple gRPC servers that may run in a
+container deployed through the pod mechanism.
+
+These servers implement the gRPC interface defined later in this design
+document and once the device plugin makes itself know to kubelet, kubelet
+will interact with the device through three simple functions:
+ 1. A `Discover` function for the kubelet to Discover the devices and
+ their properties.
+ 2. An `Allocate` and `Deallocate` function which are called respectively
+ before container creation and after container deletion with the
+ devices to allocate and deallocate.
+ 3. A `Monitor` function to notify Kubelet whenever a device becomes
+ unhealthy.
+
+The goal is for a user to be able to enable vendor devices (e.g: GPUs) through
+the simple following steps:
+ * `kubectl create -f http://vendor.com/device-plugin-daemonset.yaml`
+ * When launching `kubectl describe nodes`, the devices appear in the node spec
+ * In the long term users will be able to select them through Resource Class
+
+We expect the plugins to be deployed across the clusters through DaemonSets.
+The targeted devices are GPUs, NICs, FPGAs, InfiniBand, Storage devices, ....
+
+
+## Motivation
+
+Kubernetes currently supports discovery of CPU and Memory primarily to a
+minimal extent. Very few devices are handled natively by Kubelet.
+
+It is not a sustainable solution to expect every vendor to add their vendor
+specific code inside Kubernetes. This approach does not scale and is not
+portable.
+
+We want a solution for those vendors to be able to advertise their resources
+to kubelet and monitor them.
+We also want a way for the user to specify which resource their jobs will use
+and what constraints are associated to these resources.
+
+In order to solve this problem it is obvious that we need a plugin system in
+order to have vendors advertise and monitor their resources on behalf
+of Kubelet.
+
+Additionally, we introduce the concept of Device to be able to select
+resources with constraints in a pod spec.
+
+_GPU Integration Example:_
+ * [Enable "kick the tires" support for NVIDIA GPUs in COS](https://github.com/Kubernetes/Kubernetes/pull/45136)
+ * [Extend experimental support to multiple NVIDIA GPUs](https://github.com/Kubernetes/Kubernetes/pull/42116)
+
+_Kubernetes Meeting Notes On This:_
+ * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
+ * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
+ * [Extensible support for hardware devices in Kubernetes (join Kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
+
+## Use Cases
+
+ * I want to use a particular device type (GPU, InfiniBand, FPGA, etc.)
+ in my pod.
+ * I should be able to use that device without writing custom Kubernetes code.
+ * I want a consistent and portable solution to consume hardware devices
+ across k8s clusters.
+
+## Objectives
+
+1. Add support for vendor specific Devices in kubelet:
+ * Through a pluggable mechanism.
+ * Which allows discovery and monitoring of devices.
+ * Which allows hooking the runtime to make devices available in containers
+ and cleaning them up.
+2. Define a deployment mechanism for this new API.
+3. Define a versioning mechanism for this new API.
+
+## Non Objectives
+1. Advanced scheduling and resource selection (solved through [#782](https://github.com/Kubernetes/community/pull/782)).
+ We will only try to give basic selection primitives to the devices
+2. Metrics: this should be the job of cadvisor and should probably either be
+ addressed there (cadvisor) or if people feel there is a case to be made
+ for it being addressed in the Device Plugin, in a follow up proposal.
+
+## Stories
+
+### Vendor story
+
+Kubernetes provides to vendors a mechanism called device plugins to:
+ * advertise devices.
+ * monitor devices (currently perform health checks).
+ * hook into the runtime to instruct Kubelet what are the steps to
+ take in order to make the device available (or cleanup the device).
+
+A device plugin at it's core is a simple gRPC server usually running in
+a container and deployed across clusters through a daemonSet.
+
+```gRPC
+service DevicePlugin {
+ rpc Discover(Empty) returns (stream Device) {}
+ rpc Monitor(Empty) returns (stream DeviceHealth) {}
+
+ rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
+ rpc Deallocate(DeallocateRequest) returns (Empty) {}
+}
+
+```
+
+The gRPC server that the device plugin must implement is expected to
+be advertised on a unix socket in a mounted hostPath (e.g:
+`/var/run/Kubernetes/vendor.sock`).
+
+Finally, to notify Kubelet of the existence of the device plugin,
+the vendor's device plugin will have to make a request to Kubelet's
+onwn gRPC server.
+Only then will kubelet start interacting with the vendor's device plugin
+through the gRPC apis.
+
+### End User story
+
+When setting up the cluster the admin knows what kind of devices are present
+on the different machines and therefore can select what devices they want to
+enable.
+
+The cluster admins knows his cluster has NVIDIA GPUs therefore he deploys
+the NVIDIA device plugin through:
+`kubectl create -f NVIDIA.io/device-plugin.yml`
+
+The device plugin lands on all the nodes of the cluster and if it detects that
+there are no GPUs it terminates. However, when there are GPUs it reports them
+to Kubelet.
+For device plugins reporting non-GPU Devices these are advertised as
+OIRs and selected through the same method.
+
+1. A user submits a pod spec requesting X GPUs (or devices)
+2. The scheduler filters the nodes which do not match the resource requests
+3. The pod lands on the node and Kubelet decides which device
+ should be assigned to the pod
+4. Kubelet calls `Allocate` on the matching Device Plugins
+5. The user deletes the pod or the pod terminates
+6. Kubelet calls `Deallocate` on the matching Device Plugins
+
+When receiving a pod which requests Devices kubelet is in charge of:
+ * deciding which device to assign to the pod's containers (this will
+ change in the future)
+ * advertising the changes to the node's `Available` list
+ * advertising the changes to the pods's `Allocated` list
+ * Calling the `Allocate` function with the list of devices
+
+The scheduler is still be in charge of filtering the nodes which cannot
+satisfy the resource requests.
+He might in the future be in charge of selecting the device.
+
+## Device Plugin
+
+### Introduction
+The device plugin is structured in 5 parts:
+1. Registration: The device plugin advertises it's presence to Kubelet
+2. Discovery: Kubelet calls the device plugin to list it's devices
+3. Allocate / Deallocate: When creating/deleting containers requesting the
+ devices advertised by the device plugin, Kubelet calls the device plugin's
+ `Allocate` and `Deallocate` functions.
+4. Cleanup: Kubelet terminates the communication through a "Stop"
+4. Heartbeat: The device plugin polls Kubelet to know if it's still alive
+ and if it has to re-issue a Register request
+
+### Registration
+
+When starting the device plugin is expected to make a (client) gRPC call
+to the `Register` function that Kubelet exposes.
+
+The communication between Kubelet is expected to happen only through Unix
+sockets and follow this simple pattern:
+1. The device plugins starts it's gRPC server
+2. The device plugins sends a `RegisterRequest` to Kubelet (through a
+ gRPC request)
+4. Kubelet starts it's Discovery phase and calls `Discover` and `Monitor`
+5. Kubelet answers to the `RegisterRequest` with a `RegisterResponse`
+ containing any error Kubelet might have encountered
+
+### Unix Socket
+
+Device Plugins are expected to communicate with Kubelet through gRPC
+on an Unix socket.
+When starting the gRPC server, they are expected to create a unix socket
+at the following host path: `/var/run/Kubernetes`.
+
+For non bare metal device plugin this means they will have to mount the folder
+as a volume in their pod spec ([see Installation](##installation)).
+
+Device plugins can expect to find the socket to register themselves on
+the host at the following path:
+`/var/run/Kubernetes/kubelet.sock`.
+
+### Protocol Overview
+
+When first registering themselves against Kubelet, the device plugin
+will send:
+ * The name of their unix socket
+ * [The API version against which they were built](#versioning).
+ * Their `Vendor` ID or name of the device plugin
+
+Kubelet answers with the minimum version it supports and whether or
+not there was an error. The errors may include (but not limited to):
+ * API version not supported
+ * A device plugin was already registered for this vendor
+ * A device plugin already registered this device
+ * Vendor is not consistent across discovered devices
+
+Kubelet will then interact with the plugin through the following functions:
+ * `Discover`: List Devices
+ * `Monitor`: Returns a stream that is written to when a
+ Device becomes unhealty
+ * `Allocate`: Called when creating a container with a list of devices
+ can request changes to the Container config
+ * `Deallocate`: Called when deleting a container can be used for cleanup
+
+The device plugin is also expected to periodically call the `Heartbeat` function
+exposed by Kubelet and issue a `Registration` request when it either can't reach
+Kubelet or Kubelet answers with a `KO` response.
+
+![Process](./device-plugin.png)
+
+
+### Protobuf specification
+
+```go
+service PluginRegistration {
+ rpc Register(RegisterRequest) returns (RegisterResponse) {}
+ rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse) {}
+}
+
+service DevicePlugin {
+ rpc Discover(Empty) returns (stream Device) {}
+ rpc Monitor(Empty) returns (stream DeviceHealth) {}
+
+ rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
+ rpc Deallocate(DeallocateRequest) returns (Empty) {}
+}
+
+message RegisterRequest {
+ // Version of the API the Device Plugin was built against
+ string version = 1;
+ // Name of the unix socket the device plugin is listening on
+ string unixsocket = 2;
+ // Name of the devices the device plugin wants to register
+ // A device plugin can only register one kind of devices
+ string vendor = 3;
+}
+
+message RegisterResponse {
+ // Minimum version the Kubelet API supports.
+ string version = 1;
+ // Kubelet fills this field if it encounters any errors
+ // during the registration process or discover process
+ Error error = 2;
+}
+
+message HeartbeatRequest {
+ string vendor = 1;
+}
+
+message HeartbeatResponse {
+ // Kubelet answers with a string telling the device
+ // plugin to either re-register itself or not
+ string response = 1;
+ // Kubelet fills this field if it encountered any errors
+ Error error = 2;
+}
+
+message AllocateRequest {
+ repeated Device devices = 1;
+}
+
+message AllocateResponse {
+ // List of environment variable to set in the container.
+ repeated KeyValue envs = 1;
+ // Mounts for the container.
+ repeated Mount mounts = 2;
+}
+
+message DeallocateRequest {
+ repeated Device devices = 1;
+}
+
+message Error {
+ bool error = 1;
+ string reason = 2;
+}
+
+// E.g:
+// struct Device {
+// Kind: "NVIDIA-gpu"
+// Name: "GPU-fef8089b-4820-abfc-e83e-94318197576e"
+// Properties: {
+// "Family": "Pascal",
+// "Memory": "4G",
+// "ECC" : "True",
+// }
+//}
+//
+message Device {
+ string Kind = 1;
+ string Name = 2;
+ string Health = 3;
+ string Vendor = 4;
+ map<string, string> properties = 5; // Could be [1, 1.2, 1G]
+}
+
+message DeviceHealth {
+ string Name = 1;
+ string Kind = 2;
+ string Vendor = 4;
+ string Health = 3;
+}
+```
+
+## Installation
+
+The installation process should be straightforward to the user, transparent
+and similar to other regular Kubernetes actions.
+The device plugin should also run in containers so that Kubernetes can
+deploy them and restart the plugins when they fail.
+However, we should not prevent the user from deploying a bare metal device
+plugin.
+
+Deploying the device plugins through DemonSets makes sense as the cluster
+admin would be able to specify which machines it wants the device plugins to
+run on, the process is similar to any Kubernetes action and does not require
+to change any parts of Kubernetes.
+
+Additionally, for integrated solutions such as `kubeadm` we can add support
+to auto-deploy community vetted Device Plugins.
+Thus not fragmenting once more the Kubernetes ecosystem.
+
+For users installing Kubernetes without using an integrated solution such
+as `kubeadm` they would use the examples that we would provide at:
+`https://github.com/Kubernetes/Kubernetes/tree/master/examples/device-plugin.yaml`
+
+YAML example:
+```yaml
+apiVersion: extensions/v1beta1
+kind: DaemonSet
+metadata:
+spec:
+ template:
+ metadata:
+ labels:
+ - name: device-plugin
+ spec:
+ containers:
+ name: device-plugin-ctr
+ image: NVIDIA/device-plugin:1.0
+ volumeMounts:
+ - mountPath: /device-plugin
+ - name: device-plugin
+ volumes:
+ - name: device-plugin
+ hostPath:
+ path: /var/run/Kubernetes
+```
+
+## API Changes
+### Device
+
+When discovering the devices, Kubelet will be in charge of advertising those
+resources to the API server.
+
+We will advertise each device returned by the Device Plugin in a new structure
+called `Device`.
+It is defined as follows:
+
+```golang
+type Device struct {
+ Kind string
+ Vendor string
+ Name string
+ Health DeviceHealthStatus
+ Properties map[string]string
+}
+```
+
+Because the current API (Capacity) can not be extended to support Device,
+we will need to create two new attributes in the NodeStatus structure:
+ * `DevCapacity`: Describing the device capacity of the node
+ * `DevAvailable`: Describing the available devices
+
+```golang
+type NodeStatus struct {
+ DevCapacity []Device
+ DevAvailable []Device
+}
+```
+
+We also introduce the `Allocated` field in the pod's status so that user
+can know what devices were assigned to the pod. It could also be useful in
+the case of monitoring
+
+```golang
+type ContainerStatus struct {
+ Devices []Device
+}
+```
+
+# Versioning
+
+Currently there is only one part (CRI) of Kubernetes which is based on
+a protobuf model.
+
+The model used by CRI as of now involves the client (kubelet) checking
+if the server (runtime) version is compatible and then continuing to
+communicate with the server.
+Currently for CRI, compatible means matching the exact version.
+This means that every time the CRI spec changes the CRI clients needs to
+be updated.
+
+CRI also uses gRPC-go, which requires the same package name between client
+and server.
+If they are not same, then no API calls can succeed because the generated grpc
+code registers a service using the `package_name.service_name` convention,
+e.g., The StopPodSandbox method is known as `/v1alpha1.RuntimeService/StopPodSandbox.`
+
+To work around this restriction, CRI adopted the strategy to freeze the
+package name at `pkg/kubelet/apis/cri/v1alpha1/runtime`.
+
+Considering the restrictions it seems reasonable to follow the same pattern for
+the device plugin proposal to prevent API breaking:
+ * Follow protobuf guidelines on versionning:
+ * Do not change ordering
+ * Do not remove fields or change types
+ * Add optional fields
+ * Freeze the package name to `apis/device-plugin/v1alpha1`
+ * Have kubelet and the Device Plugin negotiate versions if we do break the API
+
+Negotiation would take place in the registration:
+1. When registering itself with Kubelet, the Device plugin sends the version
+ against which it was built.
+2. Kubelet returns the minimum version it supports and if the version sent
+ is supported.
+3. If Kubelet supports the version sent by the Device Plugin, it
+ contacts the Device Plugin
+4. If the Device Plugin supports the version sent by Kubelet it can and should
+ answer the different calls made by Kubelet
diff --git a/contributors/design-proposals/device-plugin.png b/contributors/design-proposals/device-plugin.png
new file mode 100644
index 00000000..160e1d10
--- /dev/null
+++ b/contributors/design-proposals/device-plugin.png
Binary files differ