author: Lee Verberne <verb@google.com> 2017-09-12 18:22:33 +0200
committer: Lee Verberne <verb@google.com> 2018-01-23 22:22:22 +0100
commit: 333729bd66910b3c0b45aa74a8a54b5c9075226b
tree: cbb93352243211e4d334aca9618a9a5b2bc94262
parent: 19d1097b58bc1c051b6ff1760b6d293548f8fb24
Update Docker Shared PID NS for per-pod config
-rw-r--r-- contributors/design-proposals/node/container-runtime-interface-v1.md | 6
-rw-r--r-- contributors/design-proposals/node/pod-pid-namespace.md | 345
2 files changed, 300 insertions, 51 deletions
diff --git a/contributors/design-proposals/node/container-runtime-interface-v1.md b/contributors/design-proposals/node/container-runtime-interface-v1.md
index 9e89abf5..f2de2640 100644
--- a/contributors/design-proposals/node/container-runtime-interface-v1.md
+++ b/contributors/design-proposals/node/container-runtime-interface-v1.md
@@ -86,8 +86,10 @@ container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*
A container in a PodSandbox maps to an application in the Pod Spec. For Linux
-containers, they are expected to share at least network, PID and IPC namespaces,
-with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
+containers, they are expected to share at least network, IPC and sometimes PID
+namespaces. PID sharing is defined in [Shared PID
+Namespace](pod-pid-namespace.md). Other namespaces are discussed in
+[#1615](https://issues.k8s.io/1615).
Below is an example of the proposed interfaces.
diff --git a/contributors/design-proposals/node/pod-pid-namespace.md b/contributors/design-proposals/node/pod-pid-namespace.md
index 6ac16b3b..e73bdafe 100644
--- a/contributors/design-proposals/node/pod-pid-namespace.md
+++ b/contributors/design-proposals/node/pod-pid-namespace.md
@@ -1,72 +1,319 @@
# Shared PID Namespace
-Pods share namespaces where possible, but a requirement for sharing the PID
-namespace has not been defined due to lack of support in Docker. Docker began
-supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt,
-cri-o, hyper) have already implemented a shared PID namespace.
-
-This proposal defines a shared PID namespace as a requirement of the Container
-Runtime Interface and links its rollout in Docker to that of the CRI.
+* Status: Pending
+* Version: Alpha
+* Implementation Owner: [@verb](https://github.com/verb)
## Motivation
-Sharing a PID namespace between containers in a pod is discussed in
-[#1615](https://issues.k8s.io/1615), and enables:
+Pods share namespaces where possible, but support for sharing the PID namespace
+had not been defined due to lack of support in Docker. This created an implicit
+API on which certain container images now rely. This document proposes adding
+support for sharing a process namespace between containers in a pod while
+maintaining backwards compatibility with the existing implicit API.
- 1. signaling between containers, which is useful for side cars (e.g. for
- signaling a daemon process after rotating logs).
- 2. easier troubleshooting of pods.
- 3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
- infra container.
+## Proposal
-## Goals and Non-Goals
+### Goals and Non-Goals
Goals include:
- - Changing default behavior in the Docker runtime as implemented by the CRI
- - Making Docker behavior compatible with the other Kubernetes runtimes
+
+* Backwards compatibility with container images expecting `pid == 1` semantics
+* Per-pod configuration of PID namespace sharing
+* Ability to change default sharing behavior in `v2.Pod`
Non-goals include:
- - Creating an init solution that works for all runtimes
- - Supporting isolated PID namespace indefinitely
-## Modification to the Docker Runtime
+* Creating a general purpose container init solution
+* Multiple shared PID namespaces per pod
+* Per-container configuration of PID namespace sharing
+
+### Summary
+
+We will add support for configuring pod-shared process namespaces by adding a
+new boolean field `ShareProcessNamespace` to the pod spec. The default of false
+means that each container will have a separate process namespace. When set to
+true, all containers in the pod will share a single process namespace.
+
+The Container Runtime Interface (CRI) will be updated to support three namespace
+modes: Container, Pod, and Node. The Runtime Manager will translate the pod spec
+into one of these modes as follows:
+
+Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode
+--------------------------- | ------------- | ------------
+false | false | Container
+false | true | Node
+true | false | Pod
+true | true | *Error*
+
+If a runtime does not implement a particular PID mode, it must return an error.
+For reference, Docker will support all three modes when using version >= 1.13.1.
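The translation table above can be read as a small decision function. The following is an illustrative sketch only, not code from this commit or from the kubelet: the names `PIDMode`, `pidModeFor`, and `errConflictingPIDOptions` are hypothetical, and the real runtime manager may structure this differently.

```go
package main

import (
	"errors"
	"fmt"
)

// PIDMode is a hypothetical stand-in for the CRI PID namespace mode.
type PIDMode int

const (
	PIDModePod PIDMode = iota
	PIDModeContainer
	PIDModeNode
)

// String returns the mode name as used in the proposal's table.
func (m PIDMode) String() string {
	switch m {
	case PIDModePod:
		return "Pod"
	case PIDModeContainer:
		return "Container"
	case PIDModeNode:
		return "Node"
	}
	return "Unknown"
}

// errConflictingPIDOptions models the *Error* row of the table.
var errConflictingPIDOptions = errors.New("shareProcessNamespace and hostPID cannot both be set")

// pidModeFor maps the two pod-level booleans to a PID mode,
// following the four rows of the table one by one.
func pidModeFor(shareProcessNamespace, hostPID bool) (PIDMode, error) {
	switch {
	case shareProcessNamespace && hostPID:
		return PIDModeContainer, errConflictingPIDOptions
	case hostPID:
		return PIDModeNode, nil
	case shareProcessNamespace:
		return PIDModePod, nil
	default:
		return PIDModeContainer, nil
	}
}

func main() {
	cases := []struct{ share, host bool }{
		{false, false}, {false, true}, {true, false}, {true, true},
	}
	for _, c := range cases {
		mode, err := pidModeFor(c.share, c.host)
		if err != nil {
			fmt.Printf("share=%t hostPID=%t -> error: %v\n", c.share, c.host, err)
			continue
		}
		fmt.Printf("share=%t hostPID=%t -> %s\n", c.share, c.host, mode)
	}
}
```

Checking the conflicting combination first keeps the remaining cases independent of each other, which mirrors how the table treats `true`/`true` as the only invalid row.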
+
+The shared PID functionality will be hidden behind a new feature gate in both
+the API server and the kubelet, and the existing `--docker-disable-shared-pid`
+flag will be removed from the kubelet, subject to [deprecation
+policy](https://kubernetes.io/docs/reference/deprecation-policy/).
+
+## User Experience
+
+### Use Cases
+
+Sharing a PID namespace between containers in a pod is discussed in
+[#1615](https://issues.k8s.io/1615) and enables:
+
+1. signaling between containers, which is useful for sidecars (e.g., for
+ signaling a daemon process after rotating logs).
+1. easier troubleshooting of pods.
+1. addressing [Docker's zombie
+ problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)
+ by reaping orphaned zombies in the infra container.
+
+### Behavioral Changes
+
+Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's
+a significant departure from the traditional behavior of Docker. This may break
+container images and development patterns that have come to rely on process
+isolation. Notably:
+
+1. **The main container process no longer has PID 1**. It cannot be signalled
+ using `kill 1`, and attempting to do so will instead signal the
+ infrastructure container and potentially restart the pod. Containers
+ shipping an init system like systemd may [require additional
+ flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669).
+1. **Processes are visible to other containers in the pod**. This includes all
+ information visible in `/proc`, such as passwords as arguments or
+ environment variables, and process signalling. This can be somewhat
+ mitigated by running processes as separate, non-root users.
+1. **Container filesystems are visible to other containers in the pod through
+ the <code>/proc/$pid/root</code> magic symlink**. This makes debugging
+ easier, but it also means that secrets are protected only by standard
+ filesystem permissions.
+
+## Implementation
+
+### Kubernetes API Changes
+
+`v1.PodSpec` gains a new field named `ShareProcessNamespace`:
+
+```
+// PodSpec is a description of a pod.
+type PodSpec struct {
+ ...
+ // Use the host's pid namespace.
+ // Note that HostPID and ShareProcessNamespace cannot both be set.
+ // Optional: Default to false.
+ // +k8s:conversion-gen=false
+ // +optional
+ HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
+ // Share a single process namespace between all of the containers in a pod.
+ // Note that HostPID and ShareProcessNamespace cannot both be set.
+ // Optional: Default to false.
+ // +k8s:conversion-gen=false
+ // +optional
+ ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"`
+ ...
+```
+
+The field name deviates from that of HostPID in an attempt to [better signal the
+consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536)
+of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will
+cause a validation error.
+
+### Container Runtime Interface Changes
+
+Namespace options in the CRI are currently specified for both `PodSandbox` and
+`Container` creation requests via booleans in `NamespaceOption`:
+
+```
+message NamespaceOption {
+ // If set, use the host's network namespace.
+ bool host_network = 1;
+ // If set, use the host's PID namespace.
+ bool host_pid = 2;
+ // If set, use the host's IPC namespace.
+ bool host_ipc = 3;
+}
+```
+
+We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the
+existing namespace options:
+
+```
+enum NamespaceMode {
+ POD = 0;
+ CONTAINER = 1;
+ NODE = 2;
+}
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+ // Network namespace for this container/sandbox.
+ // Runtimes must support: POD, NODE
+ NamespaceMode network = 1;
+ // PID namespace for this container/sandbox.
+ // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
+ // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
+ // Runtimes must support: POD, CONTAINER, NODE
+ NamespaceMode pid = 2;
+ // IPC namespace for this container/sandbox.
+ // Runtimes must support: POD, NODE
+ NamespaceMode ipc = 3;
+}
+```
+
+Note that this breaks backwards compatibility in the CRI, which is still in
+alpha.
+
+The protocol default for a namespace is `POD` because that's the default for
+network and IPC, and we will consider making it the default for PID in `v2.Pod`.
+The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so
+that the default behavior of `v1.Pod` does not change.
+
+This CRI design allows different namespace configuration for each of the
+containers in the pod and the sandbox, but currently we have no plans to support
+this in the Kubernetes API. The kubelet will translate namespace booleans from
+v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all
+regular and init containers in a pod.
+
+#### Targeting a Specific Container's Namespace
+
+Though we don't intend to support this in general pod configuration, there is a
+use case for mixed process namespaces within a single pod. [Troubleshooting
+Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral
+Debug Container in an existing, running pod. In order for this to be useful we
+want to share, within the pod, a process namespace between the new container
+performing the debugging and its existing target container.
+
+This is done with the additional `NamespaceMode` `TARGET` and field `target_id`:
+
+```
+enum NamespaceMode {
+ POD = 0;
+ CONTAINER = 1;
+ NODE = 2;
+ TARGET = 3;
+}
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+ // Network namespace for this container/sandbox.
+ // Runtimes must support: POD, NODE
+ NamespaceMode network = 1;
+ // PID namespace for this container/sandbox.
+ // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
+ // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
+ // Runtimes must support: POD, CONTAINER, NODE, TARGET
+ NamespaceMode pid = 2;
+ // IPC namespace for this container/sandbox.
+ // Runtimes must support: POD, NODE
+ NamespaceMode ipc = 3;
+ // Target Container ID for NamespaceMode of TARGET. This container must be in the
+ // same pod as the target container.
+ string target_id = 4;
+}
+```
+
+When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new
+container in the namespace used by the container ID in `target_id`. If the
+target container has `NamespaceOption.pid` set to `POD`, then the new container
+should also use the pod namespace. If the target container has an isolated
+process namespace, then the new container will join only that container's
+namespace. Examples are provided for dockershim below.
+
+There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It
+exists for the kubelet to run automation or debugging from a container image in
+the namespace of an existing pod and container. Additionally, we choose to
+explicitly not support sharing namespaces between different pods. The kubelet
+must not generate such a reference, and the runtime should not accept it. That
+is, for a pod {Container `A`, Container `B`, Sandbox `S`} and any other unrelated
+Container `C`:
+
+valid `target_id` | invalid `target_id`
+----------------- | -------------------
+containerID(A) | sandboxID(S)
+containerID(B) | containerID(C)
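The valid and invalid `target_id` cases above could be checked along these lines. This is a hedged sketch, not part of the proposal: the `pod` type, `validateTargetID`, and both error values are hypothetical names introduced for illustration, and a real runtime would look containers up in its own state rather than in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, minimal view of a pod as the runtime sees it:
// one sandbox ID plus the IDs of the containers running in that sandbox.
type pod struct {
	sandboxID    string
	containerIDs map[string]bool
}

var (
	errTargetIsSandbox  = errors.New("target_id must name a container, not the sandbox")
	errTargetOutsidePod = errors.New("target_id must name a container in the same pod")
)

// validateTargetID accepts only IDs of containers that belong to p.
// The sandbox ID and IDs of unrelated containers are rejected.
func validateTargetID(p pod, targetID string) error {
	if targetID == p.sandboxID {
		return errTargetIsSandbox
	}
	if !p.containerIDs[targetID] {
		return errTargetOutsidePod
	}
	return nil
}

func main() {
	p := pod{
		sandboxID:    "S",
		containerIDs: map[string]bool{"A": true, "B": true},
	}
	for _, id := range []string{"A", "B", "S", "C"} {
		if err := validateTargetID(p, id); err != nil {
			fmt.Printf("target_id %q rejected: %v\n", id, err)
			continue
		}
		fmt.Printf("target_id %q accepted\n", id)
	}
}
```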
+
+### dockershim Changes
+
+The Docker runtime implements the pod sandbox as a container running the pause
+container image. When configured for `POD` namespace sharing, the PID namespace
+of the sandbox will become the single PID namespace for the pod. This means the
+`POD` and `CONTAINER` modes are equivalent for the sandbox. The mapping
+of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod`
+settings provided as reference):
+
+ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode
+--------------------- | ------- | ---------------- | ------------------
+false | false | CONTAINER | *unset*
+true | false | POD | *unset*
+false | true | NODE | "host"
+\- | \- | TARGET | *Error*
+
+For _containers_, `HostConfig.PidMode` will be set as follows:
+
+ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode
+--------------------- | ------- | ------------------ | ------------------
+false | false | CONTAINER | *unset*
+true | false | POD | "container:[sandbox-container-id]"
+false | true | NODE | "host"
+false | false | TARGET | "container:[target-container-id]"
+true | false | TARGET | "container:[sandbox-container-id]"
+false | true | TARGET | "host"
+
+If the Docker runtime version does not support sharing PID namespaces, a
+`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return
+an error.
+
+### Deprecation of existing kubelet flag
+
+SIG Node did not anticipate the strong objections to migrating from isolated to
+shared process namespaces for Docker. The previous (now abandoned) migration
+plan introduced a kubelet flag to toggle the shared namespace behavior, but
+objections did not materialize until the flag had moved from experimental to GA.
-We will modify the Docker implementation of the CRI to use a shared PID
-namespace when running with a version of Docker >= 1.12. The legacy
-`dockertools` implementation will not be changed.
+The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use
+of shared process namespaces for the Docker runtime. We will immediately mark it
+as deprecated, but according to the [deprecation
+policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must
+support it for 6 months.
-Linking this change to the CRI means that Kubernetes users who care to test such
-changes can test the combined changes at once. Users who do not care to test
-such changes will be insulated by Kubernetes not recommending Docker >= 1.12
-until after switching to the CRI.
+We must provide a transition path for users setting this kubelet flag to false.
+Setting this flag asserts a desire to override the default Kubernetes behavior
+for all pods. Until the flag is removed, the kubelet will honor this assertion
+by ignoring the value of `ShareProcessNamespace` and logging a warning to the
+event log.
-Other changes that must be made to support this change:
+## Alternatives Considered
-1. Add a test to verify all containers restart if the infra container
- responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source
- of the PID namespace dies all containers sharing that namespace are killed
- as well.)
-2. Modify the Infra container used by the Docker runtime to reap orphaned
- zombies ([#36853](https://pr.k8s.io/36853)).
+### Explicit Container/Sandbox ID Targeting
-## Rollout Plan
+Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string
+that explicitly targets a container or sandbox ID:
-SIG Node is planning to switch to the CRI as a default in 1.6, at which point
-users with Docker >= 1.12 will receive a shared PID namespace by default.
-Cluster administrators will be able to disable this behavior by providing a flag
-to the kubelet which will cause the dockershim to revert to previous behavior.
+```
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+ ...
+ // ID of Sandbox or Container to use for PID namespace, or "host"
+ string pid = 2;
+ ...
+}
+```
-The ability to disable shared PID namespaces is intended as a way to roll back
-to prior behavior in the event of unforeseen problems. It won't be possible to
-configure the behavior per-pod. We believe this is acceptable because:
+This removes the need for a separate `TARGET` mode, but a mode enumeration
+better captures the intent of the option.
-* We have not identified a concrete use case requiring isolated PID namespaces.
-* Making PID namespace configurable requires changing the CRI, which we would
- like to avoid since there are no use cases.
+### Defaulting to PID Namespace Sharing
-In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling
-use case for isolated PID namespaces is discovered, we will remove the ability
-to disable the shared PID namespace in the subsequent release.
+Other Kubernetes runtimes already share a single PID namespace between
+containers in a pod. We could easily change the Docker runtime to always share a
+PID namespace when supported by the installed Docker version, but this would
+cause problems for container images that assume they will always be PID 1.
+### Migration to Shared-only Namespaces
-[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
+Rather than adding API support for configuring namespaces, we could allow
+changing the default behavior through pod annotations, with the intention of
+removing support for isolated PID namespaces in v2.Pod. However, many members of
+the community want to use isolated namespaces as a security boundary between
+containers in a pod.