Update Docker Shared PID NS for per-pod config

author: Lee Verberne <verb@google.com> 2017-09-12 18:22:33 +0200
committer: Lee Verberne <verb@google.com> 2018-01-23 22:22:22 +0100
commit: 333729bd66910b3c0b45aa74a8a54b5c9075226b
tree: cbb93352243211e4d334aca9618a9a5b2bc94262
parent: 19d1097b58bc1c051b6ff1760b6d293548f8fb24
2 files changed, 300 insertions, 51 deletions
diff --git a/contributors/design-proposals/node/container-runtime-interface-v1.md b/contributors/design-proposals/node/container-runtime-interface-v1.md
index 9e89abf5..f2de2640 100644
--- a/contributors/design-proposals/node/container-runtime-interface-v1.md
+++ b/contributors/design-proposals/node/container-runtime-interface-v1.md
@@ -86,8 +86,10 @@ container setup that are not currently trackable as Pod constraints, e.g.,
 filesystem setup, container image pulling, etc.*
 
 A container in a PodSandbox maps to an application in the Pod Spec. For Linux
-containers, they are expected to share at least network, PID and IPC namespaces,
-with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
+containers, they are expected to share at least network, IPC and sometimes PID
+namespaces. PID sharing is defined in [Shared PID
+Namespace](pod-pid-namespace.md). Other namespaces are discussed in
+[#1615](https://issues.k8s.io/1615).
 
 
 Below is an example of the proposed interfaces.
diff --git a/contributors/design-proposals/node/pod-pid-namespace.md b/contributors/design-proposals/node/pod-pid-namespace.md
index 6ac16b3b..e73bdafe 100644
--- a/contributors/design-proposals/node/pod-pid-namespace.md
+++ b/contributors/design-proposals/node/pod-pid-namespace.md
@@ -1,72 +1,319 @@
 # Shared PID Namespace
 
-Pods share namespaces where possible, but a requirement for sharing the PID
-namespace has not been defined due to lack of support in Docker. Docker began
-supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt,
-cri-o, hyper) have already implemented a shared PID namespace.
-
-This proposal defines a shared PID namespace as a requirement of the Container
-Runtime Interface and links its rollout in Docker to that of the CRI.
+*   Status: Pending
+*   Version: Alpha
+*   Implementation Owner: [@verb](https://github.com/verb)
 
 ## Motivation
 
-Sharing a PID namespace between containers in a pod is discussed in
-[#1615](https://issues.k8s.io/1615), and enables:
+Pods share namespaces where possible, but support for sharing the PID namespace
+had not been defined due to lack of support in Docker. This created an implicit
+API on which certain container images now rely. This document proposes adding
+support for sharing a process namespace between containers in a pod while
+maintaining backwards compatibility with the existing implicit API.
 
-  1. signaling between containers, which is useful for side cars (e.g. for
-     signaling a daemon process after rotating logs).
-  2. easier troubleshooting of pods.
-  3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
-     infra container.
+## Proposal
 
-## Goals and Non-Goals
+### Goals and Non-Goals
 
 Goals include:
-  - Changing default behavior in the Docker runtime as implemented by the CRI
-  - Making Docker behavior compatible with the other Kubernetes runtimes
+
+*   Backwards compatibility with container images expecting `pid == 1` semantics
+*   Per-pod configuration of PID namespace sharing
+*   Ability to change default sharing behavior in `v2.Pod`
 
 Non-goals include:
-  - Creating an init solution that works for all runtimes
-  - Supporting isolated PID namespace indefinitely
 
-## Modification to the Docker Runtime
+*   Creating a general purpose container init solution
+*   Multiple shared PID namespaces per pod
+*   Per-container configuration of PID namespace sharing
+
+### Summary
+
+We will add support for configuring pod-shared process namespaces by adding a
+new boolean field `ShareProcessNamespace` to the pod spec. The default of false
+means that each container will have a separate process namespace. When set to
+true, all containers in the pod will share a single process namespace.
+
+The Container Runtime Interface (CRI) will be updated to support three namespace
+modes: Container, Pod & Node. The Runtime Manager will translate the pod spec
+into one of these modes as follows:
+
+Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode
+--------------------------- | ------------- | ------------
+false                       | false         | Container
+false                       | true          | Node
+true                        | false         | Pod
+true                        | true          | *Error*
+
+If a runtime does not implement a particular PID mode, it must return an error.
+For reference, Docker will support all three modes when using version >= 1.13.1.
+
+The shared PID functionality will be hidden behind a new feature gate in both
+the API server and the kubelet, and the existing `--docker-disable-shared-pid`
+flag will be removed from the kubelet, subject to [deprecation
+policy](https://kubernetes.io/docs/reference/deprecation-policy/).
+
+## User Experience
+
+### Use Cases
+
+Sharing a PID namespace between containers in a pod is discussed in
+[#1615](https://issues.k8s.io/1615) and enables:
+
+1.  signaling between containers, which is useful for side cars (e.g. for
+    signaling a daemon process after rotating logs).
+1.  easier troubleshooting of pods.
+1.  addressing [Docker's zombie
+    problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)
+    by reaping orphaned zombies in the infra container.
+
+### Behavioral Changes
+
+Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's
+a significant departure from the traditional behavior of Docker. This may break
+container images and development patterns that have come to rely on process
+isolation. Notably:
+
+1.  **The main container process no longer has PID 1**. It cannot be signalled
+    using `kill 1`, and attempting to do so will instead signal the
+    infrastructure container and potentially restart the pod. Containers
+    shipping an init system like systemd may [require additional
+    flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669).
+1.  **Processes are visible to other containers in the pod**. This includes all
+    information visible in `/proc`, such as passwords passed as arguments or
+    environment variables, and process signalling. This can be somewhat
+    mitigated by running processes as separate, non-root users.
+1.  **Container filesystems are visible to other containers in the pod through
+    the <code>/proc/$pid/root</code> magic symlink**. This makes debugging
+    easier, but it also means that secrets are protected only by standard
+    filesystem permissions.
+
+## Implementation
+
+### Kubernetes API Changes
+
+`v1.PodSpec` gains a new field named `ShareProcessNamespace`:
+
+```
+// PodSpec is a description of a pod.
+type PodSpec struct {
+    ...
+       // Use the host's pid namespace.
+       // Note that HostPID and ShareProcessNamespace cannot both be set.
+       // Optional: Default to false.
+       // +k8s:conversion-gen=false
+       // +optional
+       HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
+       // Share a single process namespace between all of the containers in a pod.
+       // Note that HostPID and ShareProcessNamespace cannot both be set.
+       // Optional: Default to false.
+       // +k8s:conversion-gen=false
+       // +optional
+       ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"`
+       ...
+```
+
+The field name deviates from that of HostPID in an attempt to [better signal the
+consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536)
+of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will
+cause a validation error.
+
+### Container Runtime Interface Changes
+
+Namespace options in the CRI are currently specified for both `PodSandbox` and
+`Container` creation requests via booleans in `NamespaceOption`:
+
+```
+message NamespaceOption {
+    // If set, use the host's network namespace.
+    bool host_network = 1;
+    // If set, use the host's PID namespace.
+    bool host_pid = 2;
+    // If set, use the host's IPC namespace.
+    bool host_ipc = 3;
+}
+```
+
+We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the
+existing namespace options:
+
+```
+enum NamespaceMode {
+    POD       = 0;
+    CONTAINER = 1;
+    NODE      = 2;
+}
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+    // Network namespace for this container/sandbox.
+    // Runtimes must support: POD, NODE
+    NamespaceMode network = 1;
+    // PID namespace for this container/sandbox.
+    // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
+    // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
+    // Runtimes must support: POD, CONTAINER, NODE
+    NamespaceMode pid = 2;
+    // IPC namespace for this container/sandbox.
+    // Runtimes must support: POD, NODE
+    NamespaceMode ipc = 3;
+}
+```
+
+Note that this breaks backwards compatibility in the CRI, which is still in
+alpha.
+
+The protocol default for a namespace is `POD` because that's the default for
+network and IPC, and we will consider making it the default for PID in `v2.Pod`.
+The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so
+that the default behavior of `v1.Pod` does not change.
+
+This CRI design allows different namespace configuration for each of the
+containers in the pod and the sandbox, but currently we have no plans to support
+this in the Kubernetes API. The kubelet will translate namespace booleans from
+v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all
+regular and init containers in a pod.
+
+#### Targeting a Specific Container's Namespace
+
+Though we don't intend to support this in general pod configuration, there is a
+use case for mixed process namespaces within a single pod. [Troubleshooting
+Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral
+Debug Container in an existing, running pod. In order for this to be useful we
+want to share, within the pod, a process namespace between the new container
+performing the debugging and its existing target container.
+
+This is done with the additional `NamespaceMode` `TARGET` and field `target_id`:
+
+```
+enum NamespaceMode {
+    POD       = 0;
+    CONTAINER = 1;
+    NODE      = 2;
+    TARGET    = 3;
+}
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+    // Network namespace for this container/sandbox.
+    // Runtimes must support: POD, NODE
+    NamespaceMode network = 1;
+    // PID namespace for this container/sandbox.
+    // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
+    // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
+    // Runtimes must support: POD, CONTAINER, NODE, TARGET
+    NamespaceMode pid = 2;
+    // IPC namespace for this container/sandbox.
+    // Runtimes must support: POD, NODE
+    NamespaceMode ipc = 3;
+    // Target Container ID for NamespaceMode of TARGET. This container must be in the
+    // same pod as the target container.
+    string target_id = 4;
+}
+```
+
+When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new
+container in the namespace used by the container ID in `target_id`. If the
+target container has `NamespaceOption.pid` set to `POD`, then the new container
+should also use the pod namespace. If the target container has an isolated
+process namespace, then the new container will join only that container's
+namespace. Examples are provided for dockershim below.
+
+There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It
+exists for the kubelet to run automation or debugging from a container image in
+the namespace of an existing pod and container. Additionally, we choose to
+explicitly not support sharing namespaces between different pods. The kubelet
+must not generate such a reference, and the runtime should not accept it. That
+is, for pod{Container `A`, Container `B`, Sandbox `S`} and any other unrelated
+Container `C`:
+
+valid `target_id` | invalid `target_id`
+----------------- | -------------------
+containerID(A)    | sandboxID(S)
+containerID(B)    | containerID(C)
+
+### dockershim Changes
+
+The Docker runtime implements the pod sandbox as a container running the pause
+container image. When configured for `POD` namespace sharing, the PID namespace
+of the sandbox will become the single PID namespace for the pod. This means the
+`POD` and `CONTAINER` modes are equivalent for the sandbox. The mapping
+of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod`
+settings provided as reference):
+
+ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode
+--------------------- | ------- | ---------------- | ------------------
+false                 | false   | CONTAINER        | *unset*
+true                  | false   | POD              | *unset*
+false                 | true    | NODE             | "host"
+\-                    | \-      | TARGET           | *Error*
+
+For _containers_, `HostConfig.PidMode` will be set as follows:
+
+ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode
+--------------------- | ------- | ------------------ | ------------------
+false                 | false   | CONTAINER          | *unset*
+true                  | false   | POD                | "container:[sandbox-container-id]"
+false                 | true    | NODE               | "host"
+false                 | false   | TARGET             | "container:[target-container-id]"
+true                  | false   | TARGET             | "container:[sandbox-container-id]"
+false                 | true    | TARGET             | "host"
+
+If the Docker runtime version does not support sharing pid namespaces, a
+`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return
+an error.
+
+### Deprecation of existing kubelet flag
+
+SIG Node did not anticipate the strong objections to migrating from isolated to
+shared process namespaces for Docker. The previous (now abandoned) migration
+plan introduced a kubelet flag to toggle the shared namespace behavior, but
+objections did not materialize until the flag had moved from experimental to GA.
 
-We will modify the Docker implementation of the CRI to use a shared PID
-namespace when running with a version of Docker >= 1.12. The legacy
-`dockertools` implementation will not be changed.
+The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use
+of shared process namespaces for the Docker runtime. We will immediately mark it
+as deprecated, but according to the [deprecation
+policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must
+support it for 6 months.
 
-Linking this change to the CRI means that Kubernetes users who care to test such
-changes can test the combined changes at once. Users who do not care to test
-such changes will be insulated by Kubernetes not recommending Docker >= 1.12
-until after switching to the CRI.
+We must provide a transition path for users setting this kubelet flag to false.
+Setting this flag asserts a desire to override the default Kubernetes behavior
+for all pods. Until the flag is removed, the kubelet will honor this assertion
+by ignoring the value of `ShareProcessNamespace` and logging a warning to the
+event log.
 
-Other changes that must be made to support this change:
+## Alternatives Considered
 
-1. Add a test to verify all containers restart if the infra container
-   responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source
-   of the PID namespace dies all containers sharing that namespace are killed
-   as well.)
-2. Modify the Infra container used by the Docker runtime to reap orphaned
-   zombies ([#36853](https://pr.k8s.io/36853)).
+### Explicit Container/Sandbox ID Targeting
 
-## Rollout Plan
+Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string
+that explicitly targets a container or sandbox ID:
 
-SIG Node is planning to switch to the CRI as a default in 1.6, at which point
-users with Docker >= 1.12 will receive a shared PID namespace by default.
-Cluster administrators will be able to disable this behavior by providing a flag
-to the kubelet which will cause the dockershim to revert to previous behavior.
+```
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+    ...
+    // ID of Sandbox or Container to use for PID namespace, or "host"
+    string pid = 2;
+    ...
+}
+```
 
-The ability to disable shared PID namespaces is intended as a way to roll back
-to prior behavior in the event of unforeseen problems. It won't be possible to
-configure the behavior per-pod. We believe this is acceptable because:
+This removes the need for a separate `TARGET` mode, but a mode enumeration
+better captures the intent of the option.
 
-* We have not identified a concrete use case requiring isolated PID namespaces.
-* Making PID namespace configurable requires changing the CRI, which we would
-  like to avoid since there are no use cases.
+### Defaulting to PID Namespace Sharing
 
-In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling
-use case for isolated PID namespaces is discovered, we will remove the ability
-to disable the shared PID namespace in the subsequent release.
+Other Kubernetes runtimes already share a single PID namespace between
+containers in a pod. We could easily change the Docker runtime to always share a
+PID namespace when supported by the installed Docker version, but this would
+cause problems for container images that assume they will always be PID 1.
 
+### Migration to Shared-only Namespaces
 
-[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
+Rather than adding support to the API for configuring namespaces, we could allow
+changing the default behavior with pod annotations, with the intention of
+removing support for isolated PID namespaces in v2.Pod. However, many members of
+the community want to use isolated namespaces as a security boundary between
+containers in a pod.
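
Illustration (not part of the patch): the Summary table in pod-pid-namespace.md maps `shareProcessNamespace` and `hostPID` to a CRI PID mode. The following is a minimal Go sketch of that translation, under stated assumptions. The names `pidModeFor` and `ErrPIDConflict` are hypothetical and do not come from the kubelet; the patch only says the combination is an error, and the API section says it fails validation. A nil pointer is treated as false, which matches the documented default.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode mirrors the CRI enum proposed in the patch (POD = 0).
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
)

// String returns the enum name used in the proposal.
func (m NamespaceMode) String() string {
	return [...]string{"POD", "CONTAINER", "NODE"}[m]
}

// ErrPIDConflict is a hypothetical error for the "true / true" row of the table.
var ErrPIDConflict = errors.New("hostPID and shareProcessNamespace cannot both be set")

// pidModeFor is a hypothetical helper, not kubelet code. A nil
// shareProcessNamespace is treated as false.
func pidModeFor(shareProcessNamespace *bool, hostPID bool) (NamespaceMode, error) {
	share := shareProcessNamespace != nil && *shareProcessNamespace
	switch {
	case share && hostPID:
		return Container, ErrPIDConflict
	case hostPID:
		return Node, nil
	case share:
		return Pod, nil
	default:
		// Returned explicitly: the CRI zero value is POD, but v1 pods
		// default to CONTAINER.
		return Container, nil
	}
}

func main() {
	yes, no := true, false
	for _, share := range []*bool{nil, &no, &yes} {
		for _, hostPID := range []bool{false, true} {
			shareVal := share != nil && *share
			mode, err := pidModeFor(share, hostPID)
			if err != nil {
				fmt.Printf("share=%v hostPID=%v -> error: %v\n", shareVal, hostPID, err)
				continue
			}
			fmt.Printf("share=%v hostPID=%v -> %s\n", shareVal, hostPID, mode)
		}
	}
}
```

The default branch returns `Container` explicitly rather than relying on the zero value, because the proposed enum makes `POD` the protocol default while `v1.Pod` must keep isolated PID namespaces.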
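
Illustration (not part of the patch): the second dockershim table maps a container's PID mode to Docker's `HostConfig.PidMode`. The sketch below shows one way that mapping could be written in Go. `containerPidMode` and its parameters are hypothetical names; the empty string stands for "unset". The patch does not say what happens when a `TARGET` container points at a container that itself uses `TARGET`, so the sketch returns an error in that case as an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode mirrors the extended CRI enum from the patch.
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
	Target
)

// containerPidMode is a hypothetical helper returning the value for Docker's
// HostConfig.PidMode ("" means unset). targetID and targetMode are consulted
// only when mode is Target.
func containerPidMode(mode NamespaceMode, sandboxID, targetID string, targetMode NamespaceMode) (string, error) {
	switch mode {
	case Container:
		return "", nil
	case Pod:
		return "container:" + sandboxID, nil
	case Node:
		return "host", nil
	case Target:
		switch targetMode {
		case Container:
			// The target has an isolated namespace: join that container.
			return "container:" + targetID, nil
		case Pod, Node:
			// The target shares the pod or host namespace: use the same one.
			return containerPidMode(targetMode, sandboxID, "", Container)
		}
		// Assumption: chained TARGET references are rejected.
		return "", errors.New("target container has an unsupported PID mode")
	}
	return "", fmt.Errorf("unknown PID namespace mode %d", mode)
}

func main() {
	// Hypothetical IDs, for illustration only.
	pidMode, err := containerPidMode(Target, "sandbox123", "app456", Container)
	fmt.Println(pidMode, err)
}
```

Resolving `TARGET` through the target's own mode keeps the three `TARGET` rows of the table consistent with the three non-`TARGET` rows instead of duplicating them.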