| author | Lee Verberne <verb@google.com> | 2017-09-12 18:22:33 +0200 |
|---|---|---|
| committer | Lee Verberne <verb@google.com> | 2018-01-23 22:22:22 +0100 |
| commit | 333729bd66910b3c0b45aa74a8a54b5c9075226b | |
| tree | cbb93352243211e4d334aca9618a9a5b2bc94262 | |
| parent | 19d1097b58bc1c051b6ff1760b6d293548f8fb24 | |
Update Docker Shared PID NS for per-pod config

| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | contributors/design-proposals/node/container-runtime-interface-v1.md | 6 |
| -rw-r--r-- | contributors/design-proposals/node/pod-pid-namespace.md | 345 |

2 files changed, 300 insertions, 51 deletions
diff --git a/contributors/design-proposals/node/container-runtime-interface-v1.md b/contributors/design-proposals/node/container-runtime-interface-v1.md index 9e89abf5..f2de2640 100644 --- a/contributors/design-proposals/node/container-runtime-interface-v1.md +++ b/contributors/design-proposals/node/container-runtime-interface-v1.md @@ -86,8 +86,10 @@ container setup that are not currently trackable as Pod constraints, e.g., filesystem setup, container image pulling, etc.* A container in a PodSandbox maps to an application in the Pod Spec. For Linux -containers, they are expected to share at least network, PID and IPC namespaces, -with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615). +containers, they are expected to share at least network, IPC and sometimes PID +namespaces. PID sharing is defined in [Shared PID +Namespace](pod-pid-namespace.md). Other namespaces are discussed in +[#1615](https://issues.k8s.io/1615). Below is an example of the proposed interfaces. diff --git a/contributors/design-proposals/node/pod-pid-namespace.md b/contributors/design-proposals/node/pod-pid-namespace.md index 6ac16b3b..e73bdafe 100644 --- a/contributors/design-proposals/node/pod-pid-namespace.md +++ b/contributors/design-proposals/node/pod-pid-namespace.md @@ -1,72 +1,319 @@ # Shared PID Namespace -Pods share namespaces where possible, but a requirement for sharing the PID -namespace has not been defined due to lack of support in Docker. Docker began -supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt, -cri-o, hyper) have already implemented a shared PID namespace. - -This proposal defines a shared PID namespace as a requirement of the Container -Runtime Interface and links its rollout in Docker to that of the CRI. 
+* Status: Pending +* Version: Alpha +* Implementation Owner: [@verb](https://github.com/verb) ## Motivation -Sharing a PID namespace between containers in a pod is discussed in -[#1615](https://issues.k8s.io/1615), and enables: +Pods share namespaces where possible, but support for sharing the PID namespace +had not been defined due to lack of support in Docker. This created an implicit +API on which certain container images now rely. This document proposes adding +support for sharing a process namespace between containers in a pod while +maintaining backwards compatibility with the existing implicit API. - 1. signaling between containers, which is useful for side cars (e.g. for - signaling a daemon process after rotating logs). - 2. easier troubleshooting of pods. - 3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the - infra container. +## Proposal -## Goals and Non-Goals +### Goals and Non-Goals Goals include: - - Changing default behavior in the Docker runtime as implemented by the CRI - - Making Docker behavior compatible with the other Kubernetes runtimes + +* Backwards compatibility with container images expecting `pid == 1` semantics +* Per-pod configuration of PID namespace sharing +* Ability to change default sharing behavior in `v2.Pod` Non-goals include: - - Creating an init solution that works for all runtimes - - Supporting isolated PID namespace indefinitely -## Modification to the Docker Runtime +* Creating a general purpose container init solution +* Multiple shared PID namespaces per pod +* Per-container configuration of PID namespace sharing + +### Summary + +We will add support for configuring pod-shared process namespaces by adding a +new boolean field `ShareProcessNamespace` to the pod spec. The default to false +means that each container will have a separate process namespace. When set to +true, all containers in the pod will share a single process namespace. 
+ +The Container Runtime Interface (CRI) will be updated to support three namespace +modes: Container, Pod & Node. The Runtime Manager will translate the pod spec +into one of these modes as follows: + +Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode +--------------------------- | ------------- | ------------ +false | false | Container +false | true | Node +true | false | Pod +true | true | *Error* + +If a runtime does not implement a particular PID mode, it must return an error. +For reference, Docker will support all three modes when using version >= 1.13.1. + +The shared PID functionality will be hidden behind a new feature gate in both +the API server and the kubelet, and the existing `--docker-disable-shared-pid` +flag will be removed from the kubelet, subject to [deprecation +policy](https://kubernetes.io/docs/reference/deprecation-policy/). + +## User Experience + +### Use Cases + +Sharing a PID namespace between containers in a pod is discussed in +[#1615](https://issues.k8s.io/1615) and enables: + +1. signaling between containers, which is useful for side cars (e.g. for + signaling a daemon process after rotating logs). +1. easier troubleshooting of pods. +1. addressing [Docker's zombie + problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/) + by reaping orphaned zombies in the infra container. + +### Behavioral Changes + +Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's +a significant departure from the traditional behavior of Docker. This may break +container images and development patterns that have come to rely on process +isolation. Notably: + +1. **The main container process no longer has PID 1**. It cannot be signalled + using `kill 1`, and attempting to do so will instead signal the + infrastructure container and potentially restart the pod. 
Containers + shipping an init system like systemd may [require additional + flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669). +1. **Processes are visible to other containers in the pod**. This includes all + information visible in `/proc`, such as passwords as arguments or + environment variables, and process signalling. This can be somewhat + mitigated by running processes as separate, non-root users. +1. **Container filesystems are visible to other containers in the pod through + the <code>/proc/$pid/root</code> magic symlink**. This makes debugging + easier, but it also means that secrets are protected only by standard + filesystem permissions. + +## Implementation + +### Kubernetes API Changes + +`v1.PodSpec` gains a new field named `ShareProcessNamespace`: + +``` +// PodSpec is a description of a pod. +type PodSpec struct { + ... + // Use the host's pid namespace. + // Note that HostPID and ShareProcessNamespace cannot both be set. + // Optional: Default to false. + // +k8s:conversion-gen=false + // +optional + HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"` + // Share a single process namespace between all of the containers in a pod. + // Note that HostPID and ShareProcessNamespace cannot both be set. + // Optional: Default to false. + // +k8s:conversion-gen=false + // +optional + ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"` + ... +``` + +The field name deviates from that of HostPID in an attempt to [better signal the +consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536) +of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will +cause a validation error. 
+ +### Container Runtime Interface Changes + +Namespace options in the CRI are currently specified for both `PodSandbox` and +`Container` creation requests via booleans in `NamespaceOption`: + +``` +message NamespaceOption { + // If set, use the host's network namespace. + bool host_network = 1; + // If set, use the host's PID namespace. + bool host_pid = 2; + // If set, use the host's IPC namespace. + bool host_ipc = 3; +} +``` + +We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the +existing namespace options: + +``` +enum NamespaceMode { + POD = 0; + CONTAINER = 1; + NODE = 2; +} + +// NamespaceOption provides options for Linux namespaces. +message NamespaceOption { + // Network namespace for this container/sandbox. + // Runtimes must support: POD, NODE + NamespaceMode network = 1; + // PID namespace for this container/sandbox. + // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. + // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods. + // Runtimes must support: POD, CONTAINER, NODE + NamespaceMode pid = 2; + // IPC namespace for this container/sandbox. + // Runtimes must support: POD, NODE + NamespaceMode ipc = 3; +} +``` + +Note that this breaks backwards compatibility in the CRI, which is still in +alpha. + +The protocol default for a namespace is `POD` because that's the default for +network and IPC, and we will consider making it the default for PID in `v2.Pod`. +The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so +that the default behavior of `v1.Pod` does not change. + +This CRI design allows different namespace configuration for each of the +containers in the pod and the sandbox, but currently we have no plans to support +this in the Kubernetes API. The kubelet will translate namespace booleans from +v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all +regular and init containers in a pod. 
+ +#### Targeting a Specific Container's Namespace + +Though we don't intend to support this in general pod configuration, there is a +use case for mixed process namespaces within a single pod. [Troubleshooting +Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral +Debug Container in an existing, running pod. In order for this to be useful we +want to share, within the pod, a process namespace between the new container +performing the debugging and its existing target container. + +This is done with the additional `NamespaceMode` `TARGET` and field `target_id`: + +``` +enum NamespaceMode { + POD = 0; + CONTAINER = 1; + NODE = 2; + TARGET = 3; +} + +// NamespaceOption provides options for Linux namespaces. +message NamespaceOption { + // Network namespace for this container/sandbox. + // Runtimes must support: POD, NODE + NamespaceMode network = 1; + // PID namespace for this container/sandbox. + // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. + // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods. + // Runtimes must support: POD, CONTAINER, NODE, TARGET + NamespaceMode pid = 2; + // IPC namespace for this container/sandbox. + // Runtimes must support: POD, NODE + NamespaceMode ipc = 3; + // Target Container ID for NamespaceMode of TARGET. This container must be in the + // same pod as the target container. + string target_id = 4; +} +``` + +When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new +container in the namespace used by the container ID in `target_id`. If the +target container has `NamespaceOption.pid` set to `POD`, then the new container +should also use the pod namespace. If the target container has an isolated +process namespace, then the new container will join only that container's +namespace. Examples are provided for dockershim below. + +There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. 
It +exists for the kubelet to run automation or debugging from a container image in +the namespace of an existing pod and container. Additionally, we choose to +explicitly not support sharing namespaces between different pods. The kubelet +must not generate such a reference, and the runtime should not accept it. That +is, for pod{Container `A`, Container `B`, Sandbox `S}` and any other unrelated +Container `C`: + +valid `target_id` | invalid `target_id` +----------------- | ------------------- +containerID(A) | sandboxID(S) +containerID(B) | containerID(C) + +### dockershim Changes + +The Docker runtime implements the pod sandbox as a container running the pause +container image. When configured for `POD` namespace sharing, the PID namespace +of the sandbox will become the single PID namespace for the pod. This means a +namespace of `POD` and `CONTAINER` are equivalent for the sandbox. The mapping +of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod` +settings provided as reference): + +ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode +--------------------- | ------- | ---------------- | ------------------ +false | false | CONTAINER | *unset* +true | false | POD | *unset* +false | true | NODE | "host" +\- | \- | TARGET | *Error* + +For _containers_, `HostConfig.PidMode` will be set as follows: + +ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode +--------------------- | ------- | ------------------ | ------------------ +false | false | CONTAINER | *unset* +true | false | POD | "container:[sandbox-container-id]" +false | true | NODE | "host" +false | false | TARGET | "container:[target-container-id]" +true | false | TARGET | "container:[sandbox-container-id]" +false | true | TARGET | "host" + +If the Docker runtime version does not support sharing pid namespaces, a +`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return +an error. 
+ +### Deprecation of existing kubelet flag + +SIG Node did not anticipate the strong objections to migrating from isolated to +shared process namespaces for Docker. The previous (now abandoned) migration +plan introduced a kubelet flag to toggle the shared namespace behavior, but +objections did not materialize until the flag had moved from experimental to GA. -We will modify the Docker implementation of the CRI to use a shared PID -namespace when running with a version of Docker >= 1.12. The legacy -`dockertools` implementation will not be changed. +The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use +of shared process namespaces for the Docker runtime. We will immediately mark it +as deprecated, but according to the [deprecation +policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must +support it for 6 months. -Linking this change to the CRI means that Kubernetes users who care to test such -changes can test the combined changes at once. Users who do not care to test -such changes will be insulated by Kubernetes not recommending Docker >= 1.12 -until after switching to the CRI. +We must provide a transition path for users setting this kubelet flag to false. +Setting this flag asserts a desire to override the default Kubernetes behavior +for all pods. Until the flag is removed, the kubelet will honor this assertion +by ignoring the value of `ShareProcessNamespace` and logging a warning to the +event log. -Other changes that must be made to support this change: +## Alternatives Considered -1. Add a test to verify all containers restart if the infra container - responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source - of the PID namespace dies all containers sharing that namespace are killed - as well.) -2. Modify the Infra container used by the Docker runtime to reap orphaned - zombies ([#36853](https://pr.k8s.io/36853)). 
+### Explicit Container/Sandbox ID Targeting -## Rollout Plan +Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string +that explicitly targets a container or sandbox ID: -SIG Node is planning to switch to the CRI as a default in 1.6, at which point -users with Docker >= 1.12 will receive a shared PID namespace by default. -Cluster administrators will be able to disable this behavior by providing a flag -to the kubelet which will cause the dockershim to revert to previous behavior. +``` +// NamespaceOption provides options for Linux namespaces. +message NamespaceOption { + ... + // ID of Sandbox or Container to use for PID namespace, or "host" + string pid = 2; + ... +} +``` -The ability to disable shared PID namespaces is intended as a way to roll back -to prior behavior in the event of unforeseen problems. It won't be possible to -configure the behavior per-pod. We believe this is acceptable because: +This removes the need for a separate `TARGET` mode, but a mode enumeration +better captures the intent of the option. -* We have not identified a concrete use case requiring isolated PID namespaces. -* Making PID namespace configurable requires changing the CRI, which we would - like to avoid since there are no use cases. +### Defaulting to PID Namespace Sharing -In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling -use case for isolated PID namespaces is discovered, we will remove the ability -to disable the shared PID namespace in the subsequent release. +Other Kubernetes runtimes already share a single PID namespace between +containers in a pod. We could easily change the Docker runtime to always share a +PID namespace when supported by the installed Docker version, but this would +cause problems for container images that assume they will always be PID 1. 
+### Migration to Shared-only Namespaces -[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ +Rather than adding support to the API for configuring namespaces we could allow +changing the default behavior with pod annotations with the intention of +removing support for isolated PID namespaces in v2.Pod. Many members of the +community want to use the isolated namespaces as security boundary between +containers in a pod, however. |
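
Reviewer sketch (not part of the commit): the two mapping tables in the new `pod-pid-namespace.md` can be read as a pair of small pure functions. The Go below is a minimal illustration under stated assumptions, not the kubelet or dockershim implementation. The names `pidModeForPod` and `containerPidMode` and the Go constants are hypothetical. Only the mode names, the error case for setting both fields, and the `HostConfig.PidMode` strings come from the proposal text. An empty string stands in for the "*unset*" entries in the dockershim table.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode mirrors the enumeration proposed for the CRI in this diff.
// The Go names are illustrative; the real type would be generated from the proto.
type NamespaceMode int

const (
	POD NamespaceMode = iota
	CONTAINER
	NODE
	TARGET
)

// pidModeForPod (hypothetical helper) applies the Summary table:
// shareProcessNamespace and hostPID select Container, Node, Pod, or an error.
// A nil pointer is treated as false, matching the "Default to false" comment.
func pidModeForPod(shareProcessNamespace *bool, hostPID bool) (NamespaceMode, error) {
	share := shareProcessNamespace != nil && *shareProcessNamespace
	switch {
	case share && hostPID:
		return CONTAINER, errors.New("hostPID and shareProcessNamespace cannot both be set")
	case hostPID:
		return NODE, nil
	case share:
		return POD, nil
	default:
		return CONTAINER, nil
	}
}

// containerPidMode (hypothetical helper) applies the dockershim table for
// containers. targetMode is the PID mode of the target container and is only
// consulted when mode is TARGET. An empty result means PidMode is left unset.
func containerPidMode(mode, targetMode NamespaceMode, sandboxID, targetID string) (string, error) {
	switch mode {
	case CONTAINER:
		return "", nil
	case POD:
		return "container:" + sandboxID, nil
	case NODE:
		return "host", nil
	case TARGET:
		switch targetMode {
		case CONTAINER:
			return "container:" + targetID, nil
		case POD:
			return "container:" + sandboxID, nil
		case NODE:
			return "host", nil
		}
		return "", fmt.Errorf("unsupported target PID mode %d", targetMode)
	}
	return "", fmt.Errorf("unsupported PID mode %d", mode)
}

func main() {
	share := true
	mode, err := pidModeForPod(&share, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// "sandbox-id" and "target-id" are placeholder container IDs.
	pidMode, err := containerPidMode(mode, mode, "sandbox-id", "target-id")
	fmt.Println(pidMode, err)
}
```

Keeping both translations as pure functions makes each table row a one-line test case, which may help when reviewing the later dockershim change against this proposal.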
