author    k8s-ci-robot <k8s-ci-robot@users.noreply.github.com>  2018-07-31 07:11:38 -0700
committer GitHub <noreply@github.com>  2018-07-31 07:11:38 -0700
commit    9dd9ae00581a19bc9352be5ba6596ce66426d061 (patch)
tree      5dbbfb48d1be88e96eb6ebb448b8d7e1fdb76528
parent    9fbb34548e9dc2b1ac30db43a4dfc8cc1e143b29 (diff)
parent    83e09c413e9f14b2b4dc113424e2f0ff9eb5e65c (diff)
Merge pull request #2290 from tallclair/runtime-class
RuntimeClass KEP
-rw-r--r--  keps/sig-node/0014-runtime-class.md | 323
1 file changed, 323 insertions(+), 0 deletions(-)
diff --git a/keps/sig-node/0014-runtime-class.md b/keps/sig-node/0014-runtime-class.md
new file mode 100644
index 00000000..1370875f
--- /dev/null
+++ b/keps/sig-node/0014-runtime-class.md
@@ -0,0 +1,323 @@
+---
+kep-number: 14 FIXME(13)
+title: Runtime Class
+authors:
+ - "@tallclair"
+owning-sig: sig-node
+participating-sigs:
+ - sig-architecture
+reviewers:
+ - TBD
+approvers:
+ - TBD
+editor: TBD
+creation-date: 2018-06-19
+status: provisional
+---
+
+# Runtime Class
+
+## Table of Contents
+
+* [Summary](#summary)
+* [Motivation](#motivation)
+ * [Goals](#goals)
+ * [Non\-Goals](#non-goals)
+ * [User Stories](#user-stories)
+* [Proposal](#proposal)
+ * [API](#api)
+ * [Runtime Handler](#runtime-handler)
+ * [Versioning, Updates, and Rollouts](#versioning-updates-and-rollouts)
+ * [Implementation Details](#implementation-details)
+ * [Risks and Mitigations](#risks-and-mitigations)
+* [Graduation Criteria](#graduation-criteria)
+* [Implementation History](#implementation-history)
+* [Appendix](#appendix)
+ * [Examples of runtime variation](#examples-of-runtime-variation)
+
+## Summary
+
+`RuntimeClass` is a new cluster-scoped resource that surfaces container runtime properties to the
+control plane. RuntimeClasses are assigned to pods through a `runtimeClassName` field on the
+`PodSpec`. This provides a new mechanism for supporting multiple runtimes in a cluster or on a
+single node.
+
+## Motivation
+
+There is growing interest in using different runtimes within a cluster. [Sandboxes][] are the
+primary motivator right now, with both Kata Containers and gVisor looking to integrate with
+Kubernetes. Other runtime models, such as Windows containers or even remote runtimes, will also
+require support in the future. RuntimeClass provides a way to select between different runtimes
+configured in the cluster and to surface their properties (both to the cluster & the user).
+
+In addition to selecting the runtime to use, supporting multiple runtimes raises other problems at
+the control plane level, including: accounting for runtime overhead, scheduling to nodes that
+support the runtime, and surfacing which optional features are supported by different
+runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a
+cluster-scoped resource tied to the runtime that can help solve them in a future update.
+
+[Sandboxes]: https://docs.google.com/document/d/1QQ5u1RBDLXWvC8K3pscTtTRThsOeBSts_imYEoRyw8A/edit
+
+### Goals
+
+- Provide a mechanism for surfacing container runtime properties to the control plane
+- Support multiple runtimes per cluster, and provide a mechanism for users to select the desired
+ runtime
+
+### Non-Goals
+
+- RuntimeClass is NOT RuntimeComponentConfig.
+- RuntimeClass is NOT a general policy mechanism.
+- RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general
+ RuntimeClass should not be a cross product of runtime properties and node properties.
+
+The following goals are out-of-scope for the initial implementation, but may be explored in a future
+iteration:
+
+- Surfacing support for optional features by runtimes, and surfacing errors caused by
+ incompatible features & runtimes earlier.
+- Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the
+ cluster admin or provider), and are asserted to be an accurate representation of the runtime.
+- Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster
+ (different runtime configurations on different nodes) through scheduling primitives like
+ `NodeAffinity` and `Taints+Tolerations`, but the user is responsible for setting these up and
+ automatic runtime-aware scheduling is out-of-scope.
+- Define standardized or conformant runtime classes - although I would like to declare some
+ predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP.
+- [Pod Overhead][] - Although RuntimeClass is likely to be the configuration mechanism of choice,
+  the details of how pod resource overhead will be implemented are out of scope for this KEP.
+- Provide a mechanism to dynamically register or provision additional runtimes.
+- Requiring specific RuntimeClasses according to policy. This should be addressed by other
+ cluster-level policy mechanisms, such as PodSecurityPolicy.
+- "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and
+ letting the system match an appropriate RuntimeClass, rather than explicitly assigning a
+ RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a
+ future iteration.
+
+[Pod Overhead]: https://docs.google.com/document/d/1EJKT4gyl58-kzt2bnwkv08MIUZ6lkDpXcxkHqCvvAp4/edit
+
+### User Stories
+
+- As a cluster operator, I want to provide multiple runtime options to support a wide variety of
+  workloads. Examples include native Linux containers, "sandboxed" containers, and Windows
+  containers.
+- As a cluster operator, I want to provide stable rolling upgrades of runtimes. For
+ example, rolling out an update with backwards incompatible changes or previously unsupported
+ features.
+- As an application developer, I want to select the runtime that best fits my workload.
+- As an application developer, I don't want to study the nitty-gritty details of different runtime
+ implementations, but rather choose from pre-configured classes.
+- As an application developer, I want my application to be portable across clusters that use similar
+ but different variants of a "class" of runtimes.
+
+## Proposal
+
+The initial design includes:
+
+- `RuntimeClass` API resource definition
+- `RuntimeClass` pod field for specifying the RuntimeClass the pod should be run with
+- Kubelet implementation for fetching & interpreting the RuntimeClass
+- CRI API & implementation for passing along the [RuntimeHandler](#runtime-handler).
+
+### API
+
+`RuntimeClass` is a new cluster-scoped resource in the `node.k8s.io` API group.
+
+> _The `node.k8s.io` API group would eventually hold the Node resource when `core` is retired.
+> Alternatives considered: `runtime.k8s.io`, `cluster.k8s.io`_
+
+_(This is a simplified declaration, syntactic details will be covered in the API PR review)_
+
+```go
+type RuntimeClass struct {
+ metav1.TypeMeta
+ // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class.
+ // Namespace should be left blank.
+ metav1.ObjectMeta
+
+ Spec RuntimeClassSpec
+}
+
+type RuntimeClassSpec struct {
+ // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
+ // creation. The possible values are specific to a given configuration & CRI implementation.
+ // The empty string is equivalent to the default behavior.
+ // +optional
+ RuntimeHandler string
+}
+```
+
+A pod selects its runtime by specifying the RuntimeClass in its PodSpec. Once the pod is
+scheduled, the RuntimeClass cannot be changed.
+
+```go
+type PodSpec struct {
+ ...
+ // RuntimeClassName refers to a RuntimeClass object with the same name,
+ // which should be used to run this pod.
+ // +optional
+ RuntimeClassName string
+ ...
+}
+```
+
+The `legacy` RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards
+compatible with current Kubernetes. This means that the legacy runtime does not specify any
+RuntimeHandler or perform any feature validation (all features are "supported").
+
+```go
+const (
+ // RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy
+ // RuntimeClass does not specify a runtime handler or perform any
+ // feature validation.
+ RuntimeClassNameLegacy = "legacy"
+)
+```
+
+An unspecified RuntimeClassName `""` is equivalent to the `legacy` RuntimeClass, though the field is
+not defaulted to `legacy` (to leave room for configurable defaults in a future update).
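
The resolution rule above can be sketched in Go. This is a minimal, hypothetical illustration: `resolveHandler` and the `classes` map are stand-ins for the real API machinery and are not part of the proposal.

```go
package main

import "fmt"

// runtimeClassNameLegacy mirrors the reserved name from the proposal.
const runtimeClassNameLegacy = "legacy"

// resolveHandler resolves a pod's RuntimeClassName to a RuntimeHandler.
// classes stands in for the cluster's RuntimeClass objects (name -> handler).
func resolveHandler(classes map[string]string, runtimeClassName string) (string, error) {
	// An unspecified name and the reserved "legacy" name both resolve to the
	// empty handler, i.e. the CRI implementation's default behavior.
	if runtimeClassName == "" || runtimeClassName == runtimeClassNameLegacy {
		return "", nil
	}
	handler, ok := classes[runtimeClassName]
	if !ok {
		return "", fmt.Errorf("RuntimeClass %q not found", runtimeClassName)
	}
	return handler, nil
}

func main() {
	classes := map[string]string{"sandboxed": "kata-runtime"}
	handler, err := resolveHandler(classes, "sandboxed")
	fmt.Println(handler, err)
}
```

Note that `""` and `"legacy"` take the same code path even though only the latter is a real RuntimeClass, matching the "equivalent but not defaulted" semantics above.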
+
+#### Runtime Handler
+
+The `RuntimeHandler` is passed to the CRI as part of the `RunPodSandboxRequest`:
+
+```proto
+message RunPodSandboxRequest {
+ // Configuration for creating a PodSandbox.
+ PodSandboxConfig config = 1;
+ // Named runtime configuration to use for this PodSandbox.
+ string RuntimeHandler = 2;
+}
+```
+
+The RuntimeHandler is provided as a mechanism for CRI implementations to select between different
+predetermined configurations. The initial use case is replacing the experimental pod annotations
+currently used for selecting a sandboxed runtime by various CRI implementations:
+
+| CRI Runtime | Pod Annotation |
+| ------------|-------------------------------------------------------------|
+| CRIO | io.kubernetes.cri-o.TrustedSandbox: "false" |
+| containerd | io.kubernetes.cri.untrusted-workload: "true" |
+| frakti | runtime.frakti.alpha.kubernetes.io/OSContainer: "true"<br>runtime.frakti.alpha.kubernetes.io/Unikernel: "true" |
+| windows | experimental.windows.kubernetes.io/isolation-type: "hyperv" |
+
+These implementations could stick with this binary scheme ("trusted" and "untrusted"), but the
+preferred approach is a non-binary one wherein arbitrary handlers can be configured with a name
+that can be matched against the specified RuntimeHandler. For example, containerd might have a
+configuration corresponding to a "kata-runtime" handler:
+
+```toml
+[plugins.cri.containerd.kata-runtime]
+ runtime_type = "io.containerd.runtime.v1.linux"
+ runtime_engine = "/opt/kata/bin/kata-runtime"
+ runtime_root = ""
+```
+
+This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection
+(e.g. `sandboxed` or `untrusted` RuntimeClasses), but can also support multiple parallel sandbox
+types (e.g. `kata-containers` or `gvisor` RuntimeClasses).
+
+### Versioning, Updates, and Rollouts
+
+Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha
+implementation, we will kick the can down the road by making the `RuntimeClassSpec` **immutable**,
+thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods
+must be updated to reference the new RuntimeClass, and comes with the advantage of native support
+for rolling updates through the same mechanisms as any other application update. The
+`RuntimeClassName` pod field is also immutable post scheduling.
+
+This conservative approach is preferred since it's much easier to relax constraints in a backwards
+compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass
+to beta.
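
A minimal sketch of what the alpha immutability rule implies, using a hypothetical `validateRuntimeClassUpdate` helper (the real check would live in API validation code, not in a standalone function like this):

```go
package main

import (
	"errors"
	"fmt"
)

// RuntimeClassSpec is a simplified copy of the spec type from this proposal.
type RuntimeClassSpec struct {
	RuntimeHandler string
}

// validateRuntimeClassUpdate enforces the alpha rule: any change to the spec
// is rejected, so changes must be pushed as a newly named RuntimeClass.
func validateRuntimeClassUpdate(oldSpec, newSpec RuntimeClassSpec) error {
	if oldSpec != newSpec {
		return errors.New("RuntimeClassSpec is immutable; create a newly named RuntimeClass instead")
	}
	return nil
}

func main() {
	spec := RuntimeClassSpec{RuntimeHandler: "kata-runtime"}
	fmt.Println(validateRuntimeClassUpdate(spec, spec))
	fmt.Println(validateRuntimeClassUpdate(spec, RuntimeClassSpec{RuntimeHandler: "gvisor"}))
}
```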
+
+### Implementation Details
+
+The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is
+added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once
+resolved, the RuntimeHandler field is passed to the CRI as part of the
+[`RunPodSandboxRequest`][]. At that point, the interpretation of the RuntimeHandler is left to the
+CRI implementation, but it should be cached if needed for subsequent calls.
+
+If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will
+be rejected in admission (controller to be detailed in a following update). If the RuntimeClass
+cannot be resolved by the Kubelet when `RunPodSandbox` should be called, then the Kubelet will fail
+the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If
+the `RuntimeHandler` is not recognized by the CRI implementation, then `RunPodSandbox` will return
+an error.
+
+[RunPodSandboxRequest]: https://github.com/kubernetes/kubernetes/blob/b05a61e299777c2030fbcf27a396aff21b35f01b/pkg/kubelet/apis/cri/runtime/v1alpha2/api.proto#L344
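
The Kubelet-side failure modes described above can be sketched as follows. All names here (`runSandbox`, `criHandlers`, the pared-down request struct) are hypothetical stand-ins for the Kubelet and CRI plumbing, not the actual implementation:

```go
package main

import "fmt"

// runPodSandboxRequest is a pared-down stand-in for the CRI message; only the
// field added by this proposal is shown.
type runPodSandboxRequest struct {
	RuntimeHandler string
}

// criHandlers stands in for the handler names the CRI implementation was
// configured with (e.g. containerd's per-handler config entries).
var criHandlers = map[string]bool{"": true, "kata-runtime": true}

// runSandbox models the two failure modes: an unresolvable RuntimeClass fails
// the pod on the Kubelet side, and an unrecognized handler is rejected by the
// CRI implementation when RunPodSandbox is called.
func runSandbox(cache map[string]string, runtimeClassName string) (*runPodSandboxRequest, error) {
	handler, ok := cache[runtimeClassName]
	if !ok {
		return nil, fmt.Errorf("failing pod: RuntimeClass %q could not be resolved", runtimeClassName)
	}
	if !criHandlers[handler] {
		return nil, fmt.Errorf("RunPodSandbox: unrecognized RuntimeHandler %q", handler)
	}
	return &runPodSandboxRequest{RuntimeHandler: handler}, nil
}

func main() {
	cache := map[string]string{"legacy": "", "sandboxed": "kata-runtime"}
	for _, name := range []string{"sandboxed", "unknown-class"} {
		req, err := runSandbox(cache, name)
		fmt.Println(req, err)
	}
}
```

In the real flow the admission check catches the first error before scheduling; the Kubelet check only fires in races where the RuntimeClass disappears afterwards.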
+
+### Risks and Mitigations
+
+**Scope creep.** RuntimeClass has a fairly broad charter, but it should not become a default
+dumping ground for every new feature exposed by the node. For each feature, careful consideration
+should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The
+[non-goals](#non-goals) should be kept in mind when considering RuntimeClass features.
+
+**Becoming a general policy mechanism.** RuntimeClass should not be used as a replacement for
+PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying
+runtime implementation should be extremely limited (generally only around updates & rollouts). To
+enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to
+restrict a user to a specific RuntimeClass, you must use another policy mechanism such as
+PodSecurityPolicy.
+
+**Pushing complexity to the user.** RuntimeClass is a new resource in order to hide the complexity
+of runtime configuration from most users (aside from the cluster admin or provisioner). However, we
+are still side-stepping the issue of precisely defining specific types of runtimes like
+"sandboxed", and it remains up for debate whether precisely defining such runtime categories is
+even possible. RuntimeClass allows us to decouple this specification from the implementation, but
+it is still something I hope we can address in a future iteration through the concept of pre-defined
+or "conformant" RuntimeClasses.
+
+**Non-portability.** We are already in a world of non-portability for many features (see [examples
+of runtime variation](#examples-of-runtime-variation)). Future improvements to RuntimeClass can
+help address this issue by formally declaring supported features, or by automatically matching the
+runtime that supports a given workload. Another issue is that pods need to refer to a RuntimeClass
+by name, which may not be defined in every cluster. This is something that can be addressed through
+pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible
+RuntimeClasses.
+
+## Graduation Criteria
+
+Alpha:
+
+- Everything described in the current proposal
+- [CRI validation test][cri-validation]
+
+[cri-validation]: https://github.com/kubernetes-incubator/cri-tools/blob/master/docs/validation.md
+
+Beta:
+
+- Major runtimes support RuntimeClass
+- RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass
+- The update & upgrade story is revisited, and a longer-term approach is implemented as necessary.
+- The cluster admin can choose which RuntimeClass is the default in a cluster.
+- Additional requirements TBD
+
+## Implementation History
+
+- 2018-06-11: SIG-Node decision to move forward with proposal
+- 2018-06-19: Initial KEP published.
+
+## Appendix
+
+### Examples of runtime variation
+
+- Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods,
+ but those are mutually exclusive, and support of either is not required by the runtime. The
+ default configuration is also not well defined.
+- Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is
+ defined by the runtime, and support is not guaranteed.
+- Windows containers - isolation features are very OS-specific, and most of the current features are
+  limited to Linux. As we build out Windows container support, we'll need to add Windows-specific
+  features as well.
+- Host namespaces (network, PID, IPC) may not be supported by virtualization-based runtimes
+  (e.g. Kata Containers & gVisor).
+- Per-pod and Per-container resource overhead varies by runtime.
+- Device support (e.g. GPUs) varies wildly by runtime & nodes.
+- Supported volume types vary by node - it remains TBD whether this information belongs in
+  RuntimeClass.
+- The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may
+ have differing defaults, or support a subset of capabilities.
+- `Privileged` mode is not well defined, and thus may have differing implementations.
+- Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed
+  workloads).