Move Ephemeral Containers into pod.Spec

After discussing with API reviewers and relevant SIG leads, we've agreed that the configuration for Ephemeral Containers should live in the pod spec.
author: Lee Verberne <verb@google.com> 2018-08-15 14:29:55 +0200
committer: Lee Verberne <verb@google.com> 2018-08-21 17:55:42 +0200
commit: 502b75727129e0aba965f3955ff3db6526a5593d (patch)
tree: 7bca70607cf8dda4d233c1df3726ee1f018e458a
parent: 660f409cdd9979782455984f9df2c14b76cf1985 (diff)
1 files changed, 277 insertions, 269 deletions
diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md
index cb86c35b..3fcc0223 100644
--- a/contributors/design-proposals/node/troubleshoot-running-pods.md
+++ b/contributors/design-proposals/node/troubleshoot-running-pods.md
@@ -16,9 +16,9 @@ Many developers of native Kubernetes applications wish to treat Kubernetes as an
 execution platform for custom binaries produced by a build system. These users
 can forgo the scripted OS install of traditional Dockerfiles and instead `COPY`
 the output of their build system into a container image built `FROM scratch` or
-a [distroless container
-image](https://github.com/GoogleCloudPlatform/distroless). This confers several
-advantages:
+a
+[distroless container image](https://github.com/GoogleCloudPlatform/distroless).
+This confers several advantages:
 
 1.  **Minimal images** lower operational burden and reduce attack vectors.
 1.  **Immutable images** improve correctness and reliability.
@@ -61,10 +61,9 @@ command, `kubectl debug`, which parallels an existing command, `kubectl exec`.
 Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will
 be similar but run a _container_ in a _pod_.
 
-A container created by `kubectl debug` is a _Debug Container_. Just like a
-process run by `kubectl exec`, a Debug Container is not part of the pod spec.
-Unlike `kubectl exec`, a Debug Container _does_ have status that is reported in
-`v1.PodStatus` and displayed by `kubectl describe pod`.
+A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl
+exec`, Debug Containers have status that is reported in `PodStatus` and
+displayed by `kubectl describe pod`.
 
 For example, the following command would attach to a newly created container in
 a pod:
@@ -100,70 +99,90 @@ subsequently be used to reattach and is reported by `kubectl describe`.
 
 ### Kubernetes API Changes
 
-There has been much discussion about how this fits best into the Kubernetes API.
-The consensus is for an imperative "debug this pod" action whereby the kubelet
-creates a new, temporary container in a pod on command. SIG Node would like to
-avoid new dependencies in the kubelet, so this will be implemented in the Core
-API. Three possible implementations follow, and additional implementations that
-were evaluated and dismissed are at the end of this document.
+This will be implemented in the Core API to avoid new dependencies in the
+kubelet. The user-level concept of a _Debug Container_ is implemented with the
+API-level concept of an _Ephemeral Container_. The API doesn't require an
+Ephemeral Container to be used as a Debug Container. It's intended as a general
+purpose construct for running a short-lived process in a pod.
 
-All of the proposed solutions implement the user-level concept of a _Debug
-Container_ using the API-level concept of an _Ephemeral Container_. The API
-doesn't prescribe how an Ephemeral Container is used. It could conceivably see
-use other than Debug Containers, but we don't currently have other use cases.
+#### Pod Changes
 
-#### Chosen Solution: Subresource to Update PodStatus
-
-An Ephemeral Container is not part of the pod specification as it's not part of
-the declared state of the pod, but we describe it using the same primitives as
-in `PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target:
+Ephemeral Containers are represented in `PodSpec` and `PodStatus`:
 
 ```
-// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
-type EphemeralContainer struct {
-        metav1.TypeMeta `json:",inline"`
-
-        // Spec describes the Ephemeral Container to be created.
-        Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
-
-        // Most recently observed status of the container.
-        // This data may not be up to date.
-        // Populated by the system.
-        // Read-only.
-        // +optional
-        Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
+type PodSpec struct {
+  ...
+  // List of user-initiated ephemeral containers to run in this pod.
+  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
+  // +optional
+  EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,rep,name=ephemeralContainers"`
+}
 
-        // If set, the name of the container from PodSpec that this ephemeral container targets.
-        // If not set then the ephemeral container is run in whatever namespaces are shared
-        // for the pod.
-        TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
+type PodStatus struct {
+  ...
+  // Status for any Ephemeral Containers that are running in this pod.
+  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
+  // +optional
+  EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
 }
 ```
 
-Ephemeral Containers for a pod are listed in the pod's status:
+`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and
+`InitContainerStatuses`, but `EphemeralContainers` introduces a new type:
 
 ```
-type PodStatus struct {
-        ...
-        // List of user-initiated ephemeral containers that have been run in this pod.
-        // +optional
-        EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
-
+// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions
+// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag.
+type EphemeralContainer struct {
+  // Spec describes the Ephemeral Container to be created.
+  Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"`
+
+  // If set, the name of the container from PodSpec that this ephemeral container targets.
+  // The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container.
+  // If not set then the ephemeral container is run in whatever namespaces are shared
+  // for the pod.
+  // +optional
+  TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"`
 }
 ```
 
-To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
-the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in
-the API. Users cannot normally modify the pod status, so we'll create a new
-subresource `/ephemeralcontainers` that allows an update of solely
-`EphemeralContainers` and enforces append-only semantics.
+Much of the utility of Ephemeral Containers comes from the ability to run a
+container within the PID namespace of another container. `TargetContainerName`
+allows targeting a container that doesn't share its PID namespace with the rest
+of the pod. We must modify the CRI to enable this functionality (see below).
+
+##### Alternative Considered: Omitting TargetContainerName
+
+It would be simpler for the API, kubelet and kubectl if `EphemeralContainers`
+was a `[]Container`, but as isolated PID namespaces will be the default for some
+time, being able to target a container will provide a better user experience.
 
-**Note that Ephemeral Containers are not regular containers and should not be
-used to build services.** They lack guarantees for resources or execution, they
-will never be automatically restarted, and many of the fields of `v1.Container`
-will not be allowed for Debug Containers. In particular, the following fields
-are explicitly disallowed by API validation: `resources`, `ports`,
-`livenessProbe`, `readinessProbe`, and `lifecycle`.
+#### Updates
+
+Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
+of fields which may be updated, and we could extend this to include
+`EphemeralContainers`. The ability to add new containers is a large change for
+Pod, however, and we'd like to begin conservatively by enforcing the following
+best practices:
+
+1.  Ephemeral Containers lack guarantees for resources or execution, and they
+    will never be automatically restarted. To avoid pods that depend on
+    Ephemeral Containers, we allow their addition only in pod updates and
+    disallow them during pod create.
+1.  Some fields of `v1.Container` imply a fundamental role in a pod. We will
+    disallow the following fields in Ephemeral Containers: `resources`, `ports`,
+    `livenessProbe`, `readinessProbe`, and `lifecycle`.
+1.  Cluster administrators may want to restrict access to Ephemeral Containers
+    independent of other pod updates.
+
+To enforce these restrictions and new permissions, we will introduce a new Pod
+subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
+via this subresource. `EphemeralContainerStatuses` is updated with everything
+else in `Pod.Status` via `/status`.
+
+To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
+the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and
+`PUT`s the pod to `/ephemeralcontainers`.
 
 The subresources `attach`, `exec`, `log`, and `portforward` are available for
 Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
@@ -174,111 +193,34 @@ Once the pod is updated, the kubelet worker watching this pod will launch the
 Ephemeral Container and update its status. The client is expected to watch for
 the creation of the container status and then attach to the console of a debug
 container using the existing attach endpoint,
-`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that output of the new
+`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
 container occurring between its creation and attach will not be replayed, but it
 can be viewed using `kubectl log`.
 
-#### Alternative 1: "exec++"
-
-A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
-"executing" container images. The current `/exec` endpoint must implement `GET`
-to support streaming for all clients. We don't want to encode a (potentially
-large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
-with the specific fields required for creating a Debug Container:
-
-```
-// PodExecOptions is the query options to a Pod's remote exec call
-type PodExecOptions struct {
-        ...
-        // EphemeralContainerName is the name of an ephemeral container in which the
-        // command ought to be run. Either both EphemeralContainerName and
-        // EphemeralContainerImage fields must be set, or neither.
-        EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
-
-        // EphemeralContainerImage is the image of an ephemeral container in which the command
-        // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
-        // fields must be set, or neither.
-        EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
-}
-```
+##### Alternative Considered: Standard Pod Updates
 
-After creating the Ephemeral Container, the kubelet would upgrade the connection
-to streaming and perform an attach to the container's console. If disconnected,
-the Ephemeral Container could be reattached using the pod's `/attach` endpoint
-with `EphemeralContainerName`.
+It would simplify initial implementation if we updated the pod spec via the
+normal means, and switched to a new update subresource if required at a future
+date. It's easier to begin with a too-restrictive policy than a too-permissive
+one on which users come to rely, and we expect to be able to remove the
+`/ephemeralcontainers` subresource prior to exiting alpha should it prove
+unnecessary.
 
-Ephemeral Containers could not be removed via the API and instead the process
-must terminate. While not ideal, this parallels existing behavior of `kubectl
-exec`. To kill an Ephemeral Container one would `attach` and exit the process
-interactively or create a new Ephemeral Container to send a signal with
-`kill(1)` to the original process.
-
-#### Alternative 2: Ephemeral Container Controller
+### Container Runtime Interface (CRI) changes
 
-Using subresources is an imperative style API where the client instructs the
-kubelet to perform an action, but in general Kubernetes prefers declarative APIs
-where the client declares a state for Kubernetes to enact.
-
-We could implement this in a declarative manner by creating a new
-`EphemeralContainer` type:
-
-```
-type EphemeralContainer struct {
-        metav1.TypeMeta
-        metav1.ObjectMeta
-
-        Spec v1.Container
-        Status v1.ContainerStatus
-}
-```
-
-A new controller in the kubelet would watch for EphemeralContainers and
-create/delete debug containers. `EphemeralContainer.Status` would be updated by
-the kubelet at the same time it updates `ContainerStatus` for regular and init
-containers. Clients would create a new `EphemeralContainer` object, wait for it
-to be started and then attach using the pod's attach subresource and the name of
-the `EphemeralContainer`.
-
-Debugging is inherently imperative, however, and not the a desired state to
-describe. Once a Debug Container is started it should not be automatically
-restarted, for example. A declarative API adds new states for the kubelet to
-enforce, and SIG Node strongly prefers to minimize kubelet complexity.
-
-### Ephemeral Container Status
-
-The kubelet should be able to construct `PodStatus` without relying on prior
-state, so we will store the Ephemeral Container's `Spec` and
-`TargetContainerName` as runtime metadata. The kubelet persists container
-metadata as CRI
-[labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606)
-and
-[annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613).
-The entire `v1.Container` used in the request will be serialized and stored as a
-runtime annotation. The value of `TargetContainerName` will be stored as a
-runtime label. Persisting this data in the runtime means it survives kubelet
-restarts.
-
-At least for the Docker runtime, this is [an intended use of docker
-labels](https://docs.docker.com/engine/userguide/labels-custom-metadata/#value-guidelines).
-Docker does not document the maximum length of labels in its API. Empirically,
-it supports up to the 64K constraint of the docker client's `bufio.Scanner`
-size. We will conservatively limit the size of the spec to 32K and add a 32K
-minimum label length test to runtime qualification.
-
-`EphemeralContainer.Status` is populated by the kubelet in the same way as
-regular container statuses. The kubelet then updates the pod's status in the API
-server using the pod's `/status` endpoint, which imposes no restrictions on
-updates to `ephemeralContainers`.
+The CRI requires no changes for basic functionality, but it will need to be
+updated to support container namespace targeting, as described in the
+[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).
 
 ### Creating Debug Containers
 
-1.  `kubectl` constructs and `EphemeralContainer` based on command line
-    arguments and appends it to `Pod.Status.EphemeralContainers`. It `PUT`s the
-    modified pod to the pod's `/ephemeralcontainers`.
+To create a debug container, kubectl will take the following steps:
+
+1.  `kubectl` constructs an `EphemeralContainer` based on command line arguments
+    and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod
+    to the pod's `/ephemeralcontainers`.
 1.  The apiserver discards changes other than additions to
-    `Pod.Status.EphemeralContainers` and validates the pod update.
-    1.  Update discards `EphemeralContainer.Status` for new Ephemeral
-        Containers.
+    `Pod.Spec.EphemeralContainers` and validates the pod update.
     1.  Pod validation fails if container spec contains fields disallowed for
         Ephemeral Containers or the same name as a container in the spec or
         `EphemeralContainers`.
@@ -286,8 +228,8 @@ updates to `ephemeralContainers`.
 1.  The kubelet's pod watcher notices the update and triggers a `syncPod()`.
     During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
     for any new Ephemeral Container.
-    1.  `StartEphemeralContainer()` uses the existing `startContainer()` method,
-        which gains support for targeting the namespaces of a container by name.
+    1.  `StartEphemeralContainer()` uses the existing `startContainer()` to
+        start the Ephemeral Container.
     1.  After initial creation, future invocations of `syncPod()` will publish
         its ContainerStatus but otherwise ignore the Ephemeral Container. It
         will exist for the life of the pod sandbox or it exits. In no event will
@@ -312,31 +254,16 @@ terminal. This is supported by Docker.
 
 ### Killing Debug Containers
 
-Debug containers will not be killed automatically unless the pod (specifically,
-the pod sandbox) is destroyed. Debug Containers will stop when their command
-exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug
-Containers will not receive an EOF if their connection is interrupted.
-
-### Container Lifecycle Changes
+Debug containers will not be killed automatically unless the pod is destroyed.
+Debug Containers will stop when their command exits, such as exiting a shell.
+Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if
+their connection is interrupted.
 
-Implementing debug requires no changes to the Container Runtime Interface as
-it's the same operation as creating a regular container. The following changes
-are necessary in the kubelet:
-
-1.  `SyncPod()` must not kill any Debug Container even though it is not part of
-    the pod spec.
-1.  As an exception to the above, `SyncPod()` will kill Debug Containers when
-    the pod sandbox changes since a lone Debug Container in an abandoned sandbox
-    is not useful. Debug Containers are not started automatically in the new
-    sandbox.
-1.  `convertStatusToAPIStatus()` must sort Debug Containers status into
-    `EphemeralContainer.Status` similar to as it does for
-    `InitContainerStatuses`
-1.  Debug Containers must be excluded from calculation of pod phase and
-    condition
-
-`KillPod()` already operates on all running containers returned by the runtime
-and requires no changes
+A future improvement to Ephemeral Containers could allow killing Debug
+Containers when they're removed from `EphemeralContainers`, but it's not clear
+that we want to allow this. Removing an Ephemeral Container spec makes it
+unavailable for future authorization decisions (e.g. whether to authorize exec
+in a pod that had a privileged Ephemeral Container).
 
 ### Security Considerations
 
@@ -344,9 +271,8 @@ Debug Containers have no additional privileges above what is available to any
 `v1.Container`. It's the equivalent of configuring an shell container in a pod
 spec except that it is created on demand.
 
-Admission plugins must be updated to guard `/ephemeralcontainers`. In
-particular, they should enforce the same container image policy on the
-`EphemeralContainer.Spec` parameter as is enforced for regular containers.
+Admission plugins must be updated to guard `/ephemeralcontainers`. They should
+apply the same container image and security policy as for regular containers.
 
 ### Additional Consideration
 
@@ -356,70 +282,33 @@ particular, they should enforce the same container image policy on the
     troubleshooting causes a pod to exceed its resource limit it may be evicted.
 1.  There's an output stream race inherent to creating then attaching a
     container which causes output generated between the start and attach to go
-    to the log rather than the client. This is not specific to Debug Containers
-    and exists because Kubernetes has no mechanism to attach a container prior
-    to starting it. This larger issue will not be addressed by Debug Containers,
-    but Debug Containers would benefit from future improvements or work arounds.
-1.  Debug Containers should not be used to build services, which we've attempted
-    to reflect in the API.
-1.  If a pod is configured with isolated PID namespaces, the Debug Container
-    will join the PID namespace of the target container. Debug Containers will
-    not be available with runtimes that do not implement PID namespace sharing.
+    to the log rather than the client. This is not specific to Ephemeral
+    Containers and exists because Kubernetes has no mechanism to attach a
+    container prior to starting it. This larger issue will not be addressed by
+    Ephemeral Containers, but Ephemeral Containers would benefit from future
+    improvements or workarounds.
+1.  Ephemeral Containers should not be used to build services, which we've
+    attempted to reflect in the API.
 
 ## Implementation Plan
 
-### Alpha Release
-
-#### Goals and Non-Goals for Alpha Release
+### 1.12: Initial Alpha Release
 
-We're targeting an alpha release in Kubernetes 1.11 that includes the following
+We're targeting an alpha release in Kubernetes 1.12 that includes the following
 basic functionality:
 
-*   Support in the kubelet for creating debug containers in a running pod
-*   A `kubectl alpha debug` command to initiate a debug container
-*   `kubectl describe pod` will list status of debug containers running in a pod
-
-Functionality will be hidden behind an alpha feature flag and disabled by
-default.
-
-#### Kubernetes API Changes
-
-The following changes must be implemented in the API:
-
-1.  `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as
-    described above.
-1.  The new subresource will be added to the pods API.
-1.  The API server must check for Ephemeral Containers when validating `attach`.
+1.  Approval for basic core API changes to Pod
+1.  Basic support in the kubelet for creating Ephemeral Containers
 
-#### kubelet Implementation
+Functionality out of scope for 1.12:
 
-Debug Containers are implemented in the kubelet's generic runtime manager.
-Performing this operation with a legacy (non-CRI) runtime will result in a not
-implemented error. Implementation in the kubelet will be split into the
-following steps:
+*   Killing running Ephemeral Containers by removing them from the Pod Spec.
+*   Updating `pod.Spec.EphemeralContainers` when containers are garbage
+    collected.
+*   `kubectl` commands for creating Ephemeral Containers
 
-1.  New container metadata `ContainerType`, `ContainerSpec` &
-    `TargetContainerName` is stored using CRI labels and annotations.
-    `kubecontainer.ContainerStatus` will be extended with a `ContainerType`
-    field (possible values: `REGULAR`, `INIT` & `EPHEMERAL`) so a container can
-    be identified as a debug container.
-1.  `kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the
-    existing `startContainer()`.
-1.  `syncPod()` will call `StartEphemeralContainer()` to start the Debug
-    Container. The existing `generateAPIPodStatus()` will be updated to also
-    populate `EphemeralContainers.Status`.
-
-#### kubectl changes
-
-In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a
-`kubectl alpha` command to contain alpha features. We will add `kubectl alpha
-debug` to invoke Debug Containers. `kubectl` does not use feature gates, so
-`kubectl alpha debug` will be visible by default in `kubectl` 1.11 and return an
-error when used on a cluster with the feature disabled.
-
-`kubectl describe pod` will report the contents of `EphemeralContainers` when
-not empty as it means the feature is enabled. The field will be hidden when
-empty.
+Functionality will be hidden behind an alpha feature flag and disabled by
+default.
 
 ## Appendices
 
@@ -550,10 +439,10 @@ container image distribution mechanisms to fetch images when the debug command
 is run.
 
 **Respect admission restrictions.** Requests from kubectl are proxied through
-the apiserver and so are available to existing [admission
-controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins
-already exist to intercept `exec` and `attach` calls, but extending this to
-support `debug` has not yet been scoped.
+the apiserver and so are available to existing
+[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/).
+Plugins already exist to intercept `exec` and `attach` calls, but extending this
+to support `debug` has not yet been scoped.
 
 **Allow introspection of pod state using existing tools**. The list of
 `EphemeralContainerStatuses` is never truncated. If a debug container has run in
@@ -587,26 +476,146 @@ active debug container.
 
 ### Appendix 3: Alternatives Considered
 
-#### Mutable Pod Spec
+#### Container Spec in PodStatus
+
+Originally there was a desire to keep the pod spec immutable, so we explored
+modifying only the pod status. An `EphemeralContainer` would contain a Spec, a
+Status and a Target:
+
+```
+// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
+type EphemeralContainer struct {
+        metav1.TypeMeta `json:",inline"`
+
+        // Spec describes the Ephemeral Container to be created.
+        Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
+
+        // Most recently observed status of the container.
+        // This data may not be up to date.
+        // Populated by the system.
+        // Read-only.
+        // +optional
+        Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
+
+        // If set, the name of the container from PodSpec that this ephemeral container targets.
+        // If not set then the ephemeral container is run in whatever namespaces are shared
+        // for the pod.
+        TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
+}
+```
+
+Ephemeral Containers for a pod would be listed in the pod's status:
+
+```
+type PodStatus struct {
+        ...
+        // List of user-initiated ephemeral containers that have been run in this pod.
+        // +optional
+        EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
+
+}
+```
+
+To create a new Ephemeral Container, one would append a new `EphemeralContainer`
+with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
+in the API. Users cannot normally modify the pod status, so we'd create a new
+subresource `/ephemeralcontainers` that allows an update of solely
+`EphemeralContainers` and enforces append-only semantics.
+
+Since we have a requirement to describe the Ephemeral Container with a
+`v1.Container`, this led to a "spec in status" that seemed to violate API best
+practices. It was confusing, and it required added complexity in the kubelet to
+persist and publish user intent, which is rightfully the job of the apiserver.
+
+#### Extend the Existing Exec API ("exec++")
+
+A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
+"executing" container images. The current `/exec` endpoint must implement `GET`
+to support streaming for all clients. We don't want to encode a (potentially
+large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
+with the specific fields required for creating a Debug Container:
+
+```
+// PodExecOptions is the query options to a Pod's remote exec call
+type PodExecOptions struct {
+        ...
+        // EphemeralContainerName is the name of an ephemeral container in which the
+        // command ought to be run. Either both EphemeralContainerName and
+        // EphemeralContainerImage fields must be set, or neither.
+        EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
+
+        // EphemeralContainerImage is the image of an ephemeral container in which the command
+        // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
+        // fields must be set, or neither.
+        EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
+}
+```
+
+After creating the Ephemeral Container, the kubelet would upgrade the connection
+to streaming and perform an attach to the container's console. If disconnected,
+the Ephemeral Container could be reattached using the pod's `/attach` endpoint
+with `EphemeralContainerName`.
+
+Ephemeral Containers could not be removed via the API and instead the process
+must terminate. While not ideal, this parallels existing behavior of `kubectl
+exec`. To kill an Ephemeral Container one would `attach` and exit the process
+interactively or create a new Ephemeral Container to send a signal with
+`kill(1)` to the original process.
+
+Since the user cannot specify the `v1.Container`, this approach sacrifices a
+great deal of flexibility. This solution still requires the kubelet to publish a
+`Container` spec in the `PodStatus` that can be examined for future admission
+decisions and so retains many of the downsides of the Container Spec in
+PodStatus approach.
+
+#### Ephemeral Container Controller
+
+Kubernetes prefers declarative APIs where the client declares a state for
+Kubernetes to enact. We could implement this in a declarative manner by creating
+a new `EphemeralContainer` type:
+
+```
+type EphemeralContainer struct {
+        metav1.TypeMeta
+        metav1.ObjectMeta
+
+        Spec v1.Container
+        Status v1.ContainerStatus
+}
+```
+
+A new controller in the kubelet would watch for EphemeralContainers and
+create/delete debug containers. `EphemeralContainer.Status` would be updated by
+the kubelet at the same time it updates `ContainerStatus` for regular and init
+containers. Clients would create a new `EphemeralContainer` object, wait for it
+to be started and then attach using the pod's attach subresource and the name of
+the `EphemeralContainer`.
+
+A new controller is a significant amount of complexity to add to the kubelet,
+especially considering that the kubelet is already watching for changes to pods.
+The kubelet would have to be modified to create containers in a pod from
+multiple config sources. SIG Node strongly prefers to minimize kubelet
+complexity.
+
+#### Mutable Pod Spec Containers
 
-Rather than adding an operation to have Kubernetes attach a pod we could instead
-make the pod spec mutable so the client can generate an update adding a
-container. `SyncPod()` has no issues adding the container to the pod at that
-point, but an immutable pod spec has been a basic assumption in Kubernetes thus
-far and changing it carries risk. It's preferable to keep the pod spec immutable
-as a best practice.
+Rather than adding to the pod API, we could instead make the pod spec mutable so
+the client can generate an update adding a container. `SyncPod()` has no issues
+adding the container to the pod at that point, but an immutable pod spec has
+been a basic assumption and best practice in Kubernetes. Changing this
+assumption complicates the requirements of the kubelet state machine. Since the
+kubelet was not written with this in mind, we should expect such a change would
+create bugs we cannot predict.
 
-#### Ephemeral container
+#### Image Exec
 
-An earlier version of this proposal suggested running an ephemeral container in
-the pod namespaces. The container would not be added to the pod spec and would
-exist only as long as the process it ran. This has the advantage of behaving
-similarly to the current kubectl exec, but it is opaque and likely violates
-design assumptions. We could add constructs to track and report on both
-traditional exec process and exec containers, but this would probably be more
-work than adding to the pod spec. Both are generally useful, and neither
-precludes the other in the future, so we chose mutating the pod spec for
-expedience.
+An earlier version of this proposal suggested simply adding an `Image`
+parameter to the exec API. This would run an ephemeral container in the pod
+namespaces without adding it to the pod spec or status. This container would
+exist only as long as the process it ran. This parallels the current
+`kubectl exec`, including its lack of transparency. We could add constructs to
+track and report on both traditional exec processes and exec containers. In the
+end this approach failed to meet our transparency requirements.
 
 #### Attaching Container Type Volume
 
@@ -627,9 +636,8 @@ this simplifies the solution by working within the existing constraints of
 If Kubernetes supported the concept of an "inactive" container, we could
 configure it as part of a pod and activate it at debug time. In order to avoid
 coupling the debug tool versions with those of the running containers, we would
-need to ensure the debug image was pulled at debug time. The container could
-then be run with a TTY and attached using kubectl. We would need to figure out a
-solution that allows access the filesystem of other containers.
+want to ensure the debug image was pulled at debug time. The container could
+then be run with a TTY and attached using kubectl.
 
 The downside of this approach is that it requires prior configuration. In
 addition to requiring prior consideration, it would increase boilerplate config.
@@ -639,14 +647,14 @@ than a feature of the platform.
 #### Implicit Empty Volume
 
 Kubernetes could implicitly create an EmptyDir volume for every pod which would
-then be available as target for either the kubelet or a sidecar to extract a
+then be available as a target for either the kubelet or a sidecar to extract a
 package of binaries.
 
 Users would have to be responsible for hosting a package build and distribution
 infrastructure or rely on a public one. The complexity of this solution makes it
 undesirable.
 
-#### Standalone Pod in Shared Namespace
+#### Standalone Pod in Shared Namespace ("Debug Pod")
 
 Rather than inserting a new container into a pod namespace, Kubernetes could
 instead support creating a new pod with container namespaces shared with
@@ -656,21 +664,21 @@ useful, the containers in this "Debug Pod" should be run inside the namespaces
 (network, pid, etc) of the target pod but remain in a separate resource group
 (e.g. cgroup for container-based runtimes).
 
-This would be a rather fundamental change to pod, which is currently treated as
-an atomic unit. The Container Runtime Interface has no provisions for sharing
+This would be a rather large change for the pod, which is currently treated as
+an atomic unit. The Container Runtime Interface has no provisions for sharing
 outside of a pod sandbox and would need a refactor. This could be a complicated
 change for non-container runtimes (e.g. hypervisor runtimes) which have more
 rigid boundaries between pods.
 
-Effectively, Debug Pod must be implemented by the runtimes while Debug
-Containers are implemented by the kubelet. Minimizing change to the Kubernetes
-API is not worth the increased complexity for the kubelet and runtimes.
+This pushes the complexity of the solution from the kubelet to the runtimes.
+Minimizing change to the Kubernetes API is not worth the increased complexity
+for the kubelet and runtimes.
 
 It could also be possible to implement a Debug Pod as a privileged pod that runs
 in the host namespace and interacts with the runtime directly to run a new
 container in the appropriate namespace. This solution would be runtime-specific
-and effectively pushes the complexity of debugging to the user. Additionally,
-requiring node-level access to debug a pod does not meet our requirements.
+and pushes the complexity of debugging to the user. Additionally, requiring
+node-level access to debug a pod does not meet our requirements.
 
 #### Exec from Node