From 8136e3da8778e720f6e4b9c6cbcc7535e4928ec9 Mon Sep 17 00:00:00 2001 From: Lee Verberne Date: Mon, 25 Sep 2017 11:08:37 +0200 Subject: Move Pod Troubleshooting proposal into node subdir --- .../node/troubleshoot-running-pods.md | 739 +++++++++++++++++++++ .../design-proposals/troubleshoot-running-pods.md | 739 --------------------- 2 files changed, 739 insertions(+), 739 deletions(-) create mode 100644 contributors/design-proposals/node/troubleshoot-running-pods.md delete mode 100644 contributors/design-proposals/troubleshoot-running-pods.md diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md new file mode 100644 index 00000000..de299d27 --- /dev/null +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -0,0 +1,739 @@ +# Troubleshoot Running Pods + +[bit.ly/k8s-pod-troubleshooting](bit.ly/k8s-pod-troubleshooting) + +This proposal seeks to add first class support for troubleshooting by creating a +mechanism to execute a shell or other troubleshooting tools inside a running pod +without requiring that the associated container images include such tools. + +## Motivation + +### Development + +Many developers of native Kubernetes applications wish to treat Kubernetes as an +execution platform for custom binaries produced by a build system. These users +can forgo the scripted OS install of traditional Dockerfiles and instead `COPY` +the output of their build system into a container image built `FROM scratch` or +a [distroless container +image](https://github.com/GoogleCloudPlatform/distroless). This confers several +advantages: + +1. **Minimal images** lower operational burden and reduce attack vectors. +1. **Immutable images** improve correctness and reliability. +1. **Smaller image size** reduces resource usage and speeds deployments. + +The disadvantage of using containers built `FROM scratch` is the lack of system +binaries provided by an Operating System image makes it difficult to +troubleshoot running containers. Kubernetes should enable one to troubleshoot +pods regardless of the contents of the container images. + +### Operations and Support + +As Kubernetes gains in popularity, it's becoming the case that a person +troubleshooting an application is not necessarily the person who built it. +Operations staff and Support organizations want the ability to attach a "known +good" or automated debugging environment to a pod. + +## Requirements + +A solution to troubleshoot arbitrary container images MUST: + +* troubleshoot arbitrary running containers with minimal prior configuration +* allow access to namespaces and the file systems of individual containers +* fetch troubleshooting utilities at debug time rather than at the time of pod + creation +* be compatible with admission controllers and audit logging +* allow discovery of debugging status +* support arbitrary runtimes via the CRI (possibly with reduced feature set) +* require no administrative access to the node +* have an excellent user experience (i.e. should be a feature of the platform + rather than config-time trickery) +* have no *inherent* side effects to the running container image + +## Feature Summary + +Any new debugging functionality will require training users. We can ease the +transition by building on an existing usage pattern. We will create a new +command, `kubectl debug`, which parallels an existing command, `kubectl exec`. +Whereas `kubectl exec` runs a *process* in a *container*, `kubectl debug` will +be similar but run a *container* in a *pod*. + +A container created by `kubectl debug` is a *Debug Container*. Just like a +process run by `kubectl exec`, a Debug Container is not part of the pod spec and +has no resource stored in the API. Unlike `kubectl exec`, a Debug Container +*does* have status that is reported in `v1.PodStatus` and displayed by `kubectl +describe pod`. + +For example, the following command would attach to a newly created container in +a pod: + +``` +kubectl debug -c debug-shell --image=debian target-pod -- bash +``` + +It would be reasonable for Kubernetes to provide a default container name and +image, making the minimal possible debug command: + +``` +kubectl debug target-pod +``` + +This creates an interactive shell in a pod which can examine and signal other +processes in the pod. It has access to the same network and IPC as processes in +the pod. It can access the filesystem of other processes by `/proc/$PID/root`. +As is already the case with regular containers, Debug Containers can enter +arbitrary namespaces of another container via `nsenter` when run with +`CAP_SYS_ADMIN`. + +*Please see the User Stories section for additional examples and Alternatives +Considered for the considerable list of other solutions we considered.* + +## Implementation Details + +The implementation of `kubectl debug` closely mirrors the implementation of +`kubectl exec`, with most of the complexity implemented in the `kubelet`. How +functionality like this best fits into Kubernetes API has been contentious. In +order to make progress, we will start with the smallest possible API change, +extending `/exec` to support Debug Containers, and iterate. + +From the perspective of the user, there's a new command, `kubectl debug`, that +creates a Debug Container and attaches to its console. We believe a new command +will be less confusing for users than overloading `kubectl exec` with a new +concept. Users give Debug Containers a name (e.g. "debug" or "shell") which can +subsequently be used to reattach and is reported by `kubectl describe`. + +### Kubernetes API Changes + +#### Chosen Solution: "exec++" + +We will extend `v1.Pod`'s `/exec` subresource to support "executing" container +images. The current `/exec` endpoint must implement `GET` to support streaming +for all clients. We don't want to encode a (potentially large) `v1.Container` as +an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific +fields required for creating a Debug Container: + +``` +// PodExecOptions is the query options to a Pod's remote exec call +type PodExecOptions struct { + ... + // DebugName is the name of the Debug Container. Its presence will cause + // exec to create a Debug Container rather than performing a runtime exec. + DebugName string `json:"debugName,omitempty" ...` + // Image is an optional container image name that will be used to for the Debug + // Container in the specified Pod with Command as ENTRYPOINT. If omitted a + // default image will be used. + Image string `json:"image,omitempty" ...` +} +``` + +After creating the Debug Container, the kubelet will upgrade the connection to +streaming and perform an attach to the container's console. If disconnected, the +Debug Container can be reattached using the pod's `/attach` endpoint with +`DebugName`. + +Debug Containers cannot be removed via the API and instead the process must +terminate. While not ideal, this parallels existing behavior of `kubectl exec`. +To kill a Debug Container one would `attach` and exit the process interactively +or create a new Debug Container to send a signal with `kill(1)` to the original +process. + +#### Alternative 1: Debug Subresource + +Rather than extending an existing subresource, we could create a new, +non-streaming `debug` subresource. We would create a new API Object: + +``` +// DebugContainer describes a container to attach to a running pod for troubleshooting. +type DebugContainer struct { + metav1.TypeMeta + metav1.ObjectMeta + + // Name is the name of the Debug Container. Its presence will cause + // exec to create a Debug Container rather than performing a runtime exec. + Name string `json:"name,omitempty" ...` + + // Image is an optional container image name that will be used to for the Debug + // Container in the specified Pod with Command as ENTRYPOINT. If omitted a + // default image will be used. + Image string `json:"image,omitempty" ...` +} +``` + +The pod would gain a new `/debug` subresource that allows the following: + +1. A `POST` of a `PodDebugContainer` to + `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create Debug + Container named `$NAME` running in pod `$POD_NAME`. +1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop + the Debug Container `$NAME` in pod `$POD_NAME`. + +Once created, a client would attach to the console of a debug container using +the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. + +However, this pattern does not resemble any other current usage of the API, so +we prefer to start with exec++ and reevaluate if we discover a compelling +reason. + +#### Alternative 2: Declarative Configuration + +Using subresources is an imperative style API where the client instructs the +kubelet to perform an action, but in general Kubernetes prefers declarative APIs +where the client declares a state for Kubernetes to enact. + +We could implement this in a declarative manner by creating a new +`EphemeralContainer` type: + +``` +type EphemeralContainer struct { + metav1.TypeMeta + metav1.ObjectMeta + + Spec EphemeralContainerSpec + Status v1.ContainerStatus +} +``` + +`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields +relevant to Debug Containers: + +``` +type EphemeralContainerSpec struct { + // Target is the pod in which to run the EphemeralContainer + // Required. + Target v1.ObjectReference + + Name string + Image String + Command []string + Args []string + ImagePullPolicy PullPolicy + SecurityContext *SecurityContext +} +``` + +A new controller in the kubelet would watch for EphemeralContainers and +create/delete debug containers. `EphemeralContainer.Status` would be updated by +the kubelet at the same time it updates `ContainerStatus` for regular and init +containers. Clients would create a new `EphemeralContainer` object, wait for it +to be started and then attach using the pod's attach subresource and the name of +the `EphemeralContainer`. + +Debugging is inherently imperative, however, rather than a state for Kubernetes +to enforce. Once a Debug Container is started it should not be automatically +restarted, for example. This solution imposes additionally complexity and +dependencies on the kubelet, but it's not yet clear if the complexity is +justified. + +### Debug Container Status + +The status of a Debug Container is reported in a new field in `v1.PodStatus`: + +``` +type PodStatus struct { + ... + DebugStatuses []DebugStatus +} + +type DebugStatus struct { + Name string + Command []string + Args []string + // Set only for Debug Containers + DebugContainerStatus v1.ContainerStatus +} +``` + +Initially this will be populated only for Debug Containers, but there's interest +in tracking status for traditional exec in a similar manner. Ideally we can +report both types of user intervention into a container with a single new type. + +Note that `Command` and `Args` must be tracked in the status object because +there is no spec for Debug Containers or exec. These must either be made +available by the runtime or tracked by the kubelet. For Debug Containers this +could be stored as runtime labels, but the kubelet currently has no method of +storing state across restarts for exec. Solving this problem for exec is out of +scope for Debug Containers, but we will look for a solution as we implement this +feature. + +`DebugStatuses` is populated by the kubelet in the same way as regular and init +container statuses. This is sent to the API server and displayed by `kubectl +describe pod`. + +### Creating Debug Containers + +1. `kubectl` invokes the debug API as described in the preceding section. +1. The API server checks for name collisions with existing containers, performs + admission control and proxies the connection to the kubelet's + `/exec/$NS/$POD_NAME/$CONTAINER_NAME` endpoint. +1. The kubelet instructs the Runtime Manager to create a Debug Container. +1. The runtime manager uses the existing `startContainer()` method to create a + container in an existing pod. `startContainer()` has one modification for + Debug Containers: it creates a new runtime label (e.g. a docker label) that + identifies this container as a Debug Container. +1. After creating the container, the kubelet schedules an asynchronous update + of `PodStatus`. The update publishes the debug container status to the API + server at which point the Debug Container becomes visible via `kubectl + describe pod`. +1. The kubelet will upgrade the connection to streaming and attach to the + container's console. + +Rather than performing the implicit attach the kubelet could return success to +the client and require the client to perform an explicit attach, but the +implicit attach maintains consistent semantics across `/exec` rather than +varying behavior based on parameters. + +The apiserver detects container name collisions with both containers in the pod +spec and other running Debug Containers by checking `DebugStatuses`. In a race +to create two Debug Containers with the same name, the API server will pass both +requests and the kubelet must return an error to all but one request. + +There are no limits on the number of Debug Containers that can be created in a +pod, but exceeding a pod's resource allocation may cause the pod to be evicted. + +### Restarting and Reattaching Debug Containers + +Debug Containers will never be restarted automatically. It is possible to +replace a Debug Container that has exited by re-using a Debug Container name. It +is an error to attempt to replace a Debug Container that is still running, which +is detected by both the API server and the kubelet. + +One can reattach to a Debug Container using `kubectl attach`. When supported by +a runtime, multiple clients can attach to a single debug container and share the +terminal. This is supported by Docker. + +### Killing Debug Containers + +Debug containers will not be killed automatically until the pod (specifically, +the pod sandbox) is destroyed. Debug Containers will stop when their command +exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug +Containers will not receive an EOF if their connection is interrupted. + +### Container Lifecycle Changes + +Implementing debug requires no changes to the Container Runtime Interface as +it's the same operation as creating a regular container. The following changes +are necessary in the kubelet: + +1. `SyncPod()` must not kill any Debug Container even though it is not part of + the pod spec. +1. As an exception to the above, `SyncPod()` will kill Debug Containers when + the pod sandbox changes since a lone Debug Container in an abandoned sandbox + is not useful. Debug Containers are not automatically started in the new + sandbox. +1. `convertStatusToAPIStatus()` must sort Debug Containers status into + `DebugStatuses` similar to as it does for `InitContainerStatuses` +1. The kubelet must preserve `ContainerStatus` on debug containers for + reporting. +1. Debug Containers must be excluded from calculation of pod phase and + condition + +It's worth noting some things that do not change: + +1. `KillPod()` already operates on all running containers returned by the + runtime. +1. Containers created prior to this feature being enabled will have a + `containerType` of `""`. Since this does not match `"DEBUG"` the special + handling of Debug Containers is backwards compatible. + +### Security Considerations + +Debug Containers have no additional privileges above what is available to any +`v1.Container`. It's the equivalent of configuring an shell container in a pod +spec but created on demand. + +Admission plugins that guard `/exec` must be updated for the new parameters. In +particular, they should enforce the same container image policy on the `Image` +parameter as is enforced for regular containers. During the alpha phase we will +additionally support a container image whitelist as a kubelet flag to allow +cluster administrators to easily constraint debug container images. + +### Additional Consideration + +1. Debug Containers are intended for interactive use and always have TTY and + Stdin enabled. +1. There are no guaranteed resources for ad-hoc troubleshooting. If + troubleshooting causes a pod to exceed its resource limit it may be evicted. +1. There's an output stream race inherent to creating then attaching a + container which causes output generated between the start and attach to go + to the log rather than the client. This is not specific to Debug Containers + and exists because Kubernetes has no mechanism to attach a container prior + to starting it. This larger issue will not be addressed by Debug Containers, + but Debug Containers would benefit from future improvements or work arounds. +1. We do not want to describe Debug Containers using `v1.Container`. This is to + reinforce that Debug Containers are not general purpose containers by + limiting their configurability. Debug Containers should not be used to build + services. +1. Debug Containers are of limited usefulness without a shared PID namespace. + If a pod is configured with isolated PID namespaces, the Debug Container + will join the PID namespace of the target container. Debug Containers will + not be available with runtimes that do not implement PID namespace sharing + in some form. + +## Implementation Plan + +### Alpha Release + +#### Goals and Non-Goals for Alpha Release + +We're targeting an alpha release in Kubernetes 1.9 that includes the following +basic functionality: + +* Support in the kubelet for creating debug containers in a running pod +* A `kubectl debug` command to initiate a debug container +* `kubectl describe pod` will list status of debug containers running in a pod + +Functionality will be hidden behind an alpha feature flag and disabled by +default. The following are explicitly out of scope for the 1.9 alpha release: + +* Exited Debug Containers will be garbage collected as regular containers and + may disappear from the list of Debug Container Statuses. +* Security Context for the Debug Container is not configurable. It will always + be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`. +* Image pull policy for the Debug Container is not configurable. It will + always be run with `PullAlways`. + +#### kubelet Implementation + +Debug Containers are implemented in the kubelet's generic runtime manager. +Performing this operation with a legacy (non-CRI) runtime will result in a not +implemented error. Implementation in the kubelet will be split into the +following steps: + +##### Step 1: Container Type + +The first step is to add a feature gate to ensure all changes are off by +default. This will be added in the `pkg/features` `DefaultFeatureGate`. + +The runtime manager stores metadata about containers in the runtime via labels +(e.g. docker labels). These labels are used to populate the fields of +`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug +Containers differently in a few situations, we must add a new piece of metadata +to distinguish Debug Containers from regular containers. + +`startContainer()` will be updated to write a new label +`io.kubernetes.container.type` to the runtime. Existing containers will be +started with a type of `REGULAR` or `INIT`. When added in a subsequent step, +Debug Containers will start with with the type `DEBUG`. + +##### Step 2: Creation and Handling of Debug Containers + +This step adds methods for creating debug containers, but doesn't yet modify the +kubelet API. Since the runtime manager discards runtime (e.g. docker) labels +after populating `kubecontainer.ContainerStatus`, the label value will be stored +in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`. + +The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container` +and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented. +Currently only the Generic Runtime Manager (i.e. the CRI) implements the +`DebugContainerRunner` interface. + +The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to +create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug +Containers unless the sandbox is restarted. + +##### Step 3: kubelet API changes + +The kubelet exposes the new functionality in its existing `/exec/` endpoint. +`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls +`RunDebugContainer()`, and performs the attach. + +##### Step 4: Reporting DebugStatus + +The last major change to the kubelet is to populate v1.`PodStatus.DebugStatuses` +based on the `kubecontainer.ContainerStatus` for the Debug Container. + +#### Kubernetes API Changes + +There are two changes to be made to the Kubernetes, which will be made +independently: + +1. `v1.PodExecOptions` must be extended with new fields. +1. `v1.PodStatus` gains a new field to hold Debug Container statuses. + +In all cases, new fields will be prepended with `Alpha` for the duration of this +feature's alpha status. + +#### kubectl changes + +In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a +`kubectl alpha` command to contain alpha features. We will add `kubectl alpha +debug` to invoke Debug Containers. `kubectl` does not use feature gates, so +`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an +error when used on a cluster with the feature disabled. + +`kubectl describe pod` will report the contents of `DebugStatuses` when not +empty as it means the feature is enabled. The field will be hidden when empty. + +## Appendices + +We've researched many options over the life of this proposal. These Appendices +are included as optional reference material. It's not necessary to read this +material in order to understand the proposal in its current form. + +### Appendix 1: User Stories + +These user stories are intended to give examples how this proposal addresses the +above requirements. + +#### Operations + +Jonas runs a service "neato" that consists of a statically compiled Go binary +running in a minimal container image. One of the its pods is suddenly having +trouble connecting to an internal service. Being in operations, Jonas wants to +be able to inspect the running pod without restarting it, but he doesn't +necessarily need to enter the container itself. He wants to: + +1. Inspect the filesystem of target container +1. Execute debugging utilities not included in the container image +1. Initiate network requests from the pod network namespace + +This is achieved by running a new "debug" container in the pod namespaces. His +troubleshooting session might resemble: + +``` +% kubectl debug -it -m debian neato-5thn0 -- bash +root@debug-image:~# ps x + PID TTY STAT TIME COMMAND + 1 ? Ss 0:00 /pause + 13 ? Ss 0:00 bash + 26 ? Ss+ 0:00 /neato + 107 ? R+ 0:00 ps x +root@debug-image:~# cat /proc/26/root/etc/resolv.conf +search default.svc.cluster.local svc.cluster.local cluster.local +nameserver 10.155.240.10 +options ndots:5 +root@debug-image:~# dig @10.155.240.10 neato.svc.cluster.local. + +; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> @10.155.240.10 neato.svc.cluster.local. +; (1 server found) +;; global options: +cmd +;; connection timed out; no servers could be reached +``` + +Thus Jonas discovers that the cluster's DNS service isn't responding. + +#### Debugging + +Thurston is debugging a tricky issue that's difficult to reproduce. He can't +reproduce the issue with the debug build, so he attaches a debug container to +one of the pods exhibiting the problem: + +``` +% kubectl debug -it --image=gcr.io/neato/debugger neato-5x9k3 -- sh +Defaulting container name to debug. +/ # ps x +PID USER TIME COMMAND + 1 root 0:00 /pause + 13 root 0:00 /neato + 26 root 0:00 sh + 32 root 0:00 ps x +/ # gdb -p 13 +... +``` + +He discovers that he needs access to the actual container, which he can achieve +by installing busybox into the target container: + +``` +root@debug-image:~# cp /bin/busybox /proc/13/root +root@debug-image:~# nsenter -t 13 -m -u -p -n -r /busybox sh + + +BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash) +Enter 'help' for a list of built-in commands. + +/ # ls -l /neato +-rwxr-xr-x 2 0 0 746888 May 4 2016 /neato +``` + +Note that running the commands referenced above require `CAP_SYS_ADMIN` and +`CAP_SYS_PTRACE`. + +#### Automation + +Ginger is a security engineer tasked with running security audits across all of +her company's running containers. Even though his company has no standard base +image, she's able to audit all containers using: + +``` +% for pod in $(kubectl get -o name pod); do + kubectl debug -m gcr.io/neato/security-audit -p $pod /security-audit.sh + done +``` + +#### Technical Support + +Roy's team provides support for his company's multi-tenant cluster. He can +access the Kubernetes API (as a viewer) on behalf of the users he's supporting, +but he does not have administrative access to nodes or a say in how the +application image is constructed. When someone asks for help, Roy's first step +is to run his team's autodiagnose script: + +``` +% kubectl debug --image=gcr.io/google_containers/autodiagnose nginx-pod-1234 +``` + +### Appendix 2: Requirements Analysis + +Many people have proposed alternate solutions to this problem. This section +discusses how the proposed solution meets all of the stated requirements and is +intended to contrast the alternatives listed below. + +**Troubleshoot arbitrary running containers with minimal prior configuration.** +This solution requires no prior configuration. + +**Access to namespaces and the file systems of individual containers.** This +solution runs a container in the shared pod namespaces (e.g. network) and will +attach to the PID namespace of a target container when not shared with the +entire pod. It relies on the behavior of `/proc//root` to provide access to +filesystems of individual containers. + +**Fetch troubleshooting utilities at debug time**. This solution uses normal +container image distribution mechanisms to fetch images when the debug command +is run. + +**Respect admission restrictions.** Requests from kubectl are proxied through +the apiserver and so are available to existing [admission +controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins +already exist to intercept `exec` and `attach` calls, but extending this to +support `debug` has not yet been scoped. + +**Allow introspection of pod state using existing tools**. The list of +`DebugContainerStatuses` is never truncated. If a debug container has run in +this pod it will appear here. + +**Support arbitrary runtimes via the CRI**. This proposal is implemented +entirely in the kubelet runtime manager and requires no changes in the +individual runtimes. + +**Have an excellent user experience**. This solution is conceptually +straightforward and surfaced in a single `kubectl` command that "runs a thing in +a pod". Debug tools are distributed by container image, which is already well +understood by users. There is no automatic copying of files or hidden paths. + +By using container images, users are empowered to create custom debug images. +Available images can be restricted by admission policy. Some examples of +possible debug images: + +* A script that automatically gathers a debugging snapshot and uploads it to a + cloud storage bucket before killing the pod. +* An image with a shell modified to log every statement to an audit API. + +**Require no direct access to the node.** This solution uses the standard +streaming API. + +**Have no inherent side effects to the running container image.** The target pod +is not modified by default, but resources used by the debug container will be +billed to the pod's cgroup, which means it could be evicted. A future +improvement could be to decrease the likelihood of eviction when there's an +active debug container. + +### Appendix 3: Alternatives Considered + +#### Mutable Pod Spec + +Rather than adding an operation to have Kubernetes attach a pod we could instead +make the pod spec mutable so the client can generate an update adding a +container. `SyncPod()` has no issues adding the container to the pod at that +point, but an immutable pod spec has been a basic assumption in Kubernetes thus +far and changing it carries risk. It's preferable to keep the pod spec immutable +as a best practice. + +#### Ephemeral container + +An earlier version of this proposal suggested running an ephemeral container in +the pod namespaces. The container would not be added to the pod spec and would +exist only as long as the process it ran. This has the advantage of behaving +similarly to the current kubectl exec, but it is opaque and likely violates +design assumptions. We could add constructs to track and report on both +traditional exec process and exec containers, but this would probably be more +work than adding to the pod spec. Both are generally useful, and neither +precludes the other in the future, so we chose mutating the pod spec for +expedience. + +#### Attaching Container Type Volume + +Combining container volumes ([#831](https://issues.k8s.io/831)) with the ability +to add volumes to the pod spec would get us most of the way there. One could +mount a volume of debug utilities at debug time. Docker does not allow adding a +volume to a running container, however, so this would require a container +restart. A restart doesn't meet our requirements for troubleshooting. + +Rather than attaching the container at debug time, kubernetes could always +attach a volume at a random path at run time, just in case it's needed. Though +this simplifies the solution by working within the existing constraints of +`kubectl exec`, it has a sufficient list of minor limitations (detailed in +[#10834](https://issues.k8s.io/10834)) to result in a poor user experience. + +#### Inactive container + +If Kubernetes supported the concept of an "inactive" container, we could +configure it as part of a pod and activate it at debug time. In order to avoid +coupling the debug tool versions with those of the running containers, we would +need to ensure the debug image was pulled at debug time. The container could +then be run with a TTY and attached using kubectl. We would need to figure out a +solution that allows access the filesystem of other containers. + +The downside of this approach is that it requires prior configuration. In +addition to requiring prior consideration, it would increase boilerplate config. +A requirement for prior configuration makes it feel like a workaround rather +than a feature of the platform. + +#### Implicit Empty Volume + +Kubernetes could implicitly create an EmptyDir volume for every pod which would +then be available as target for either the kubelet or a sidecar to extract a +package of binaries. + +Users would have to be responsible for hosting a package build and distribution +infrastructure or rely on a public one. The complexity of this solution makes it +undesirable. + +#### Standalone Pod in Shared Namespace + +Rather than inserting a new container into a pod namespace, Kubernetes could +instead support creating a new pod with container namespaces shared with +another, target pod. This would be a simpler change to the Kubernetes API, which +would only need a new field in the pod spec to specify the target pod. To be +useful, the containers in this "Debug Pod" should be run inside the namespaces +(network, pid, etc) of the target pod but remain in a separate resource group +(e.g. cgroup for container-based runtimes). + +This would be a rather fundamental change to pod, which is currently treated as +an atomic unit. The Container Runtime Interface has no provisions for sharing +outside of a pod sandbox and would need a refactor. This could be a complicated +change for non-container runtimes (e.g. hypervisor runtimes) which have more +rigid boundaries between pods. + +Effectively, Debug Pod must be implemented by the runtimes while Debug +Containers are implemented by the kubelet. Minimizing change to the Kubernetes +API is not worth the increased complexity for the kubelet and runtimes. + +It could also be possible to implement a Debug Pod as a privileged pod that runs +in the host namespace and interacts with the runtime directly to run a new +container in the appropriate namespace. This solution would be runtime-specific +and effectively pushes the complexity of debugging to the user. Additionally, +requiring node-level access to debug a pod does not meet our requirements. + +#### Exec from Node + +The kubelet could support executing a troubleshooting binary from the node in +the namespaces of the container. Once executed this binary would lose access to +other binaries from the node, making it of limited utility and a confusing user +experience. + +This couples the debug tools with the lifecycle of the node, which is worse than +coupling it with container images. + +## Reference + +* [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140) +* [CRI Tracking Issue](https://issues.k8s.io/28789) +* [CRI: expose optional runtime features](https://issues.k8s.io/32803) +* [Resource QoS in + Kubernetes](https://github.com/kubernetes/kubernetes/blob/master/docs/design/resource-qos.md) +* Related Features + * [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across + containers in a pod + * [#26751](https://issues.k8s.io/26751) - Pod-Level cgroup + * [#10782](https://issues.k8s.io/10782) - Vertical pod autoscaling diff --git a/contributors/design-proposals/troubleshoot-running-pods.md b/contributors/design-proposals/troubleshoot-running-pods.md deleted file mode 100644 index de299d27..00000000 --- a/contributors/design-proposals/troubleshoot-running-pods.md +++ /dev/null @@ -1,739 +0,0 @@ -# Troubleshoot Running Pods - -[bit.ly/k8s-pod-troubleshooting](bit.ly/k8s-pod-troubleshooting) - -This proposal seeks to add first class support for troubleshooting by creating a -mechanism to execute a shell or other troubleshooting tools inside a running pod -without requiring that the associated container images include such tools. - -## Motivation - -### Development - -Many developers of native Kubernetes applications wish to treat Kubernetes as an -execution platform for custom binaries produced by a build system. These users -can forgo the scripted OS install of traditional Dockerfiles and instead `COPY` -the output of their build system into a container image built `FROM scratch` or -a [distroless container -image](https://github.com/GoogleCloudPlatform/distroless). This confers several -advantages: - -1. **Minimal images** lower operational burden and reduce attack vectors. -1. **Immutable images** improve correctness and reliability. -1. **Smaller image size** reduces resource usage and speeds deployments. - -The disadvantage of using containers built `FROM scratch` is the lack of system -binaries provided by an Operating System image makes it difficult to -troubleshoot running containers. Kubernetes should enable one to troubleshoot -pods regardless of the contents of the container images. - -### Operations and Support - -As Kubernetes gains in popularity, it's becoming the case that a person -troubleshooting an application is not necessarily the person who built it. -Operations staff and Support organizations want the ability to attach a "known -good" or automated debugging environment to a pod. - -## Requirements - -A solution to troubleshoot arbitrary container images MUST: - -* troubleshoot arbitrary running containers with minimal prior configuration -* allow access to namespaces and the file systems of individual containers -* fetch troubleshooting utilities at debug time rather than at the time of pod - creation -* be compatible with admission controllers and audit logging -* allow discovery of debugging status -* support arbitrary runtimes via the CRI (possibly with reduced feature set) -* require no administrative access to the node -* have an excellent user experience (i.e. should be a feature of the platform - rather than config-time trickery) -* have no *inherent* side effects to the running container image - -## Feature Summary - -Any new debugging functionality will require training users. We can ease the -transition by building on an existing usage pattern. We will create a new -command, `kubectl debug`, which parallels an existing command, `kubectl exec`. -Whereas `kubectl exec` runs a *process* in a *container*, `kubectl debug` will -be similar but run a *container* in a *pod*. - -A container created by `kubectl debug` is a *Debug Container*. Just like a -process run by `kubectl exec`, a Debug Container is not part of the pod spec and -has no resource stored in the API. Unlike `kubectl exec`, a Debug Container -*does* have status that is reported in `v1.PodStatus` and displayed by `kubectl -describe pod`. - -For example, the following command would attach to a newly created container in -a pod: - -``` -kubectl debug -c debug-shell --image=debian target-pod -- bash -``` - -It would be reasonable for Kubernetes to provide a default container name and -image, making the minimal possible debug command: - -``` -kubectl debug target-pod -``` - -This creates an interactive shell in a pod which can examine and signal other -processes in the pod. It has access to the same network and IPC as processes in -the pod. It can access the filesystem of other processes by `/proc/$PID/root`. -As is already the case with regular containers, Debug Containers can enter -arbitrary namespaces of another container via `nsenter` when run with -`CAP_SYS_ADMIN`. - -*Please see the User Stories section for additional examples and Alternatives -Considered for the considerable list of other solutions we considered.* - -## Implementation Details - -The implementation of `kubectl debug` closely mirrors the implementation of -`kubectl exec`, with most of the complexity implemented in the `kubelet`. How -functionality like this best fits into Kubernetes API has been contentious. In -order to make progress, we will start with the smallest possible API change, -extending `/exec` to support Debug Containers, and iterate. - -From the perspective of the user, there's a new command, `kubectl debug`, that -creates a Debug Container and attaches to its console. We believe a new command -will be less confusing for users than overloading `kubectl exec` with a new -concept. Users give Debug Containers a name (e.g. "debug" or "shell") which can -subsequently be used to reattach and is reported by `kubectl describe`. - -### Kubernetes API Changes - -#### Chosen Solution: "exec++" - -We will extend `v1.Pod`'s `/exec` subresource to support "executing" container -images. The current `/exec` endpoint must implement `GET` to support streaming -for all clients. We don't want to encode a (potentially large) `v1.Container` as -an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific -fields required for creating a Debug Container: - -``` -// PodExecOptions is the query options to a Pod's remote exec call -type PodExecOptions struct { - ... - // DebugName is the name of the Debug Container. Its presence will cause - // exec to create a Debug Container rather than performing a runtime exec. - DebugName string `json:"debugName,omitempty" ...` - // Image is an optional container image name that will be used to for the Debug - // Container in the specified Pod with Command as ENTRYPOINT. If omitted a - // default image will be used. - Image string `json:"image,omitempty" ...` -} -``` - -After creating the Debug Container, the kubelet will upgrade the connection to -streaming and perform an attach to the container's console. If disconnected, the -Debug Container can be reattached using the pod's `/attach` endpoint with -`DebugName`. - -Debug Containers cannot be removed via the API and instead the process must -terminate. While not ideal, this parallels existing behavior of `kubectl exec`. -To kill a Debug Container one would `attach` and exit the process interactively -or create a new Debug Container to send a signal with `kill(1)` to the original -process. - -#### Alternative 1: Debug Subresource - -Rather than extending an existing subresource, we could create a new, -non-streaming `debug` subresource. We would create a new API Object: - -``` -// DebugContainer describes a container to attach to a running pod for troubleshooting. -type DebugContainer struct { - metav1.TypeMeta - metav1.ObjectMeta - - // Name is the name of the Debug Container. Its presence will cause - // exec to create a Debug Container rather than performing a runtime exec. - Name string `json:"name,omitempty" ...` - - // Image is an optional container image name that will be used to for the Debug - // Container in the specified Pod with Command as ENTRYPOINT. If omitted a - // default image will be used. - Image string `json:"image,omitempty" ...` -} -``` - -The pod would gain a new `/debug` subresource that allows the following: - -1. A `POST` of a `PodDebugContainer` to - `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create Debug - Container named `$NAME` running in pod `$POD_NAME`. -1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop - the Debug Container `$NAME` in pod `$POD_NAME`. - -Once created, a client would attach to the console of a debug container using -the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. - -However, this pattern does not resemble any other current usage of the API, so -we prefer to start with exec++ and reevaluate if we discover a compelling -reason. - -#### Alternative 2: Declarative Configuration - -Using subresources is an imperative style API where the client instructs the -kubelet to perform an action, but in general Kubernetes prefers declarative APIs -where the client declares a state for Kubernetes to enact. - -We could implement this in a declarative manner by creating a new -`EphemeralContainer` type: - -``` -type EphemeralContainer struct { - metav1.TypeMeta - metav1.ObjectMeta - - Spec EphemeralContainerSpec - Status v1.ContainerStatus -} -``` - -`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields -relevant to Debug Containers: - -``` -type EphemeralContainerSpec struct { - // Target is the pod in which to run the EphemeralContainer - // Required. - Target v1.ObjectReference - - Name string - Image String - Command []string - Args []string - ImagePullPolicy PullPolicy - SecurityContext *SecurityContext -} -``` - -A new controller in the kubelet would watch for EphemeralContainers and -create/delete debug containers. `EphemeralContainer.Status` would be updated by -the kubelet at the same time it updates `ContainerStatus` for regular and init -containers. Clients would create a new `EphemeralContainer` object, wait for it -to be started and then attach using the pod's attach subresource and the name of -the `EphemeralContainer`. - -Debugging is inherently imperative, however, rather than a state for Kubernetes -to enforce. Once a Debug Container is started it should not be automatically -restarted, for example. This solution imposes additionally complexity and -dependencies on the kubelet, but it's not yet clear if the complexity is -justified. - -### Debug Container Status - -The status of a Debug Container is reported in a new field in `v1.PodStatus`: - -``` -type PodStatus struct { - ... - DebugStatuses []DebugStatus -} - -type DebugStatus struct { - Name string - Command []string - Args []string - // Set only for Debug Containers - DebugContainerStatus v1.ContainerStatus -} -``` - -Initially this will be populated only for Debug Containers, but there's interest -in tracking status for traditional exec in a similar manner. Ideally we can -report both types of user intervention into a container with a single new type. - -Note that `Command` and `Args` must be tracked in the status object because -there is no spec for Debug Containers or exec. These must either be made -available by the runtime or tracked by the kubelet. For Debug Containers this -could be stored as runtime labels, but the kubelet currently has no method of -storing state across restarts for exec. Solving this problem for exec is out of -scope for Debug Containers, but we will look for a solution as we implement this -feature. - -`DebugStatuses` is populated by the kubelet in the same way as regular and init -container statuses. This is sent to the API server and displayed by `kubectl -describe pod`. - -### Creating Debug Containers - -1. `kubectl` invokes the debug API as described in the preceding section. -1. The API server checks for name collisions with existing containers, performs - admission control and proxies the connection to the kubelet's - `/exec/$NS/$POD_NAME/$CONTAINER_NAME` endpoint. -1. The kubelet instructs the Runtime Manager to create a Debug Container. -1. The runtime manager uses the existing `startContainer()` method to create a - container in an existing pod. `startContainer()` has one modification for - Debug Containers: it creates a new runtime label (e.g. a docker label) that - identifies this container as a Debug Container. -1. After creating the container, the kubelet schedules an asynchronous update - of `PodStatus`. The update publishes the debug container status to the API - server at which point the Debug Container becomes visible via `kubectl - describe pod`. -1. The kubelet will upgrade the connection to streaming and attach to the - container's console. - -Rather than performing the implicit attach the kubelet could return success to -the client and require the client to perform an explicit attach, but the -implicit attach maintains consistent semantics across `/exec` rather than -varying behavior based on parameters. - -The apiserver detects container name collisions with both containers in the pod -spec and other running Debug Containers by checking `DebugStatuses`. In a race -to create two Debug Containers with the same name, the API server will pass both -requests and the kubelet must return an error to all but one request. - -There are no limits on the number of Debug Containers that can be created in a -pod, but exceeding a pod's resource allocation may cause the pod to be evicted. - -### Restarting and Reattaching Debug Containers - -Debug Containers will never be restarted automatically. It is possible to -replace a Debug Container that has exited by re-using a Debug Container name. It -is an error to attempt to replace a Debug Container that is still running, which -is detected by both the API server and the kubelet. - -One can reattach to a Debug Container using `kubectl attach`. When supported by -a runtime, multiple clients can attach to a single debug container and share the -terminal. This is supported by Docker. - -### Killing Debug Containers - -Debug containers will not be killed automatically until the pod (specifically, -the pod sandbox) is destroyed. Debug Containers will stop when their command -exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug -Containers will not receive an EOF if their connection is interrupted. - -### Container Lifecycle Changes - -Implementing debug requires no changes to the Container Runtime Interface as -it's the same operation as creating a regular container. The following changes -are necessary in the kubelet: - -1. `SyncPod()` must not kill any Debug Container even though it is not part of - the pod spec. -1. As an exception to the above, `SyncPod()` will kill Debug Containers when - the pod sandbox changes since a lone Debug Container in an abandoned sandbox - is not useful. Debug Containers are not automatically started in the new - sandbox. -1. `convertStatusToAPIStatus()` must sort Debug Containers status into - `DebugStatuses` similar to as it does for `InitContainerStatuses` -1. The kubelet must preserve `ContainerStatus` on debug containers for - reporting. -1. Debug Containers must be excluded from calculation of pod phase and - condition - -It's worth noting some things that do not change: - -1. `KillPod()` already operates on all running containers returned by the - runtime. -1. Containers created prior to this feature being enabled will have a - `containerType` of `""`. Since this does not match `"DEBUG"` the special - handling of Debug Containers is backwards compatible. - -### Security Considerations - -Debug Containers have no additional privileges above what is available to any -`v1.Container`. It's the equivalent of configuring an shell container in a pod -spec but created on demand. - -Admission plugins that guard `/exec` must be updated for the new parameters. In -particular, they should enforce the same container image policy on the `Image` -parameter as is enforced for regular containers. During the alpha phase we will -additionally support a container image whitelist as a kubelet flag to allow -cluster administrators to easily constraint debug container images. - -### Additional Consideration - -1. Debug Containers are intended for interactive use and always have TTY and - Stdin enabled. -1. There are no guaranteed resources for ad-hoc troubleshooting. If - troubleshooting causes a pod to exceed its resource limit it may be evicted. -1. There's an output stream race inherent to creating then attaching a - container which causes output generated between the start and attach to go - to the log rather than the client. This is not specific to Debug Containers - and exists because Kubernetes has no mechanism to attach a container prior - to starting it. This larger issue will not be addressed by Debug Containers, - but Debug Containers would benefit from future improvements or work arounds. -1. We do not want to describe Debug Containers using `v1.Container`. This is to - reinforce that Debug Containers are not general purpose containers by - limiting their configurability. Debug Containers should not be used to build - services. -1. Debug Containers are of limited usefulness without a shared PID namespace. - If a pod is configured with isolated PID namespaces, the Debug Container - will join the PID namespace of the target container. Debug Containers will - not be available with runtimes that do not implement PID namespace sharing - in some form. - -## Implementation Plan - -### Alpha Release - -#### Goals and Non-Goals for Alpha Release - -We're targeting an alpha release in Kubernetes 1.9 that includes the following -basic functionality: - -* Support in the kubelet for creating debug containers in a running pod -* A `kubectl debug` command to initiate a debug container -* `kubectl describe pod` will list status of debug containers running in a pod - -Functionality will be hidden behind an alpha feature flag and disabled by -default. The following are explicitly out of scope for the 1.9 alpha release: - -* Exited Debug Containers will be garbage collected as regular containers and - may disappear from the list of Debug Container Statuses. -* Security Context for the Debug Container is not configurable. It will always - be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`. -* Image pull policy for the Debug Container is not configurable. It will - always be run with `PullAlways`. - -#### kubelet Implementation - -Debug Containers are implemented in the kubelet's generic runtime manager. -Performing this operation with a legacy (non-CRI) runtime will result in a not -implemented error. Implementation in the kubelet will be split into the -following steps: - -##### Step 1: Container Type - -The first step is to add a feature gate to ensure all changes are off by -default. This will be added in the `pkg/features` `DefaultFeatureGate`. - -The runtime manager stores metadata about containers in the runtime via labels -(e.g. docker labels). These labels are used to populate the fields of -`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug -Containers differently in a few situations, we must add a new piece of metadata -to distinguish Debug Containers from regular containers. - -`startContainer()` will be updated to write a new label -`io.kubernetes.container.type` to the runtime. Existing containers will be -started with a type of `REGULAR` or `INIT`. When added in a subsequent step, -Debug Containers will start with with the type `DEBUG`. - -##### Step 2: Creation and Handling of Debug Containers - -This step adds methods for creating debug containers, but doesn't yet modify the -kubelet API. Since the runtime manager discards runtime (e.g. docker) labels -after populating `kubecontainer.ContainerStatus`, the label value will be stored -in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`. - -The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container` -and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented. -Currently only the Generic Runtime Manager (i.e. the CRI) implements the -`DebugContainerRunner` interface. - -The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to -create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug -Containers unless the sandbox is restarted. - -##### Step 3: kubelet API changes - -The kubelet exposes the new functionality in its existing `/exec/` endpoint. -`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls -`RunDebugContainer()`, and performs the attach. - -##### Step 4: Reporting DebugStatus - -The last major change to the kubelet is to populate v1.`PodStatus.DebugStatuses` -based on the `kubecontainer.ContainerStatus` for the Debug Container. - -#### Kubernetes API Changes - -There are two changes to be made to the Kubernetes, which will be made -independently: - -1. `v1.PodExecOptions` must be extended with new fields. -1. `v1.PodStatus` gains a new field to hold Debug Container statuses. - -In all cases, new fields will be prepended with `Alpha` for the duration of this -feature's alpha status. - -#### kubectl changes - -In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a -`kubectl alpha` command to contain alpha features. We will add `kubectl alpha -debug` to invoke Debug Containers. `kubectl` does not use feature gates, so -`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an -error when used on a cluster with the feature disabled. - -`kubectl describe pod` will report the contents of `DebugStatuses` when not -empty as it means the feature is enabled. The field will be hidden when empty. - -## Appendices - -We've researched many options over the life of this proposal. These Appendices -are included as optional reference material. It's not necessary to read this -material in order to understand the proposal in its current form. - -### Appendix 1: User Stories - -These user stories are intended to give examples how this proposal addresses the -above requirements. - -#### Operations - -Jonas runs a service "neato" that consists of a statically compiled Go binary -running in a minimal container image. One of the its pods is suddenly having -trouble connecting to an internal service. Being in operations, Jonas wants to -be able to inspect the running pod without restarting it, but he doesn't -necessarily need to enter the container itself. He wants to: - -1. Inspect the filesystem of target container -1. Execute debugging utilities not included in the container image -1. Initiate network requests from the pod network namespace - -This is achieved by running a new "debug" container in the pod namespaces. His -troubleshooting session might resemble: - -``` -% kubectl debug -it -m debian neato-5thn0 -- bash -root@debug-image:~# ps x - PID TTY STAT TIME COMMAND - 1 ? Ss 0:00 /pause - 13 ? Ss 0:00 bash - 26 ? Ss+ 0:00 /neato - 107 ? R+ 0:00 ps x -root@debug-image:~# cat /proc/26/root/etc/resolv.conf -search default.svc.cluster.local svc.cluster.local cluster.local -nameserver 10.155.240.10 -options ndots:5 -root@debug-image:~# dig @10.155.240.10 neato.svc.cluster.local. - -; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> @10.155.240.10 neato.svc.cluster.local. -; (1 server found) -;; global options: +cmd -;; connection timed out; no servers could be reached -``` - -Thus Jonas discovers that the cluster's DNS service isn't responding. - -#### Debugging - -Thurston is debugging a tricky issue that's difficult to reproduce. He can't -reproduce the issue with the debug build, so he attaches a debug container to -one of the pods exhibiting the problem: - -``` -% kubectl debug -it --image=gcr.io/neato/debugger neato-5x9k3 -- sh -Defaulting container name to debug. -/ # ps x -PID USER TIME COMMAND - 1 root 0:00 /pause - 13 root 0:00 /neato - 26 root 0:00 sh - 32 root 0:00 ps x -/ # gdb -p 13 -... -``` - -He discovers that he needs access to the actual container, which he can achieve -by installing busybox into the target container: - -``` -root@debug-image:~# cp /bin/busybox /proc/13/root -root@debug-image:~# nsenter -t 13 -m -u -p -n -r /busybox sh - - -BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash) -Enter 'help' for a list of built-in commands. - -/ # ls -l /neato --rwxr-xr-x 2 0 0 746888 May 4 2016 /neato -``` - -Note that running the commands referenced above require `CAP_SYS_ADMIN` and -`CAP_SYS_PTRACE`. - -#### Automation - -Ginger is a security engineer tasked with running security audits across all of -her company's running containers. Even though his company has no standard base -image, she's able to audit all containers using: - -``` -% for pod in $(kubectl get -o name pod); do - kubectl debug -m gcr.io/neato/security-audit -p $pod /security-audit.sh - done -``` - -#### Technical Support - -Roy's team provides support for his company's multi-tenant cluster. He can -access the Kubernetes API (as a viewer) on behalf of the users he's supporting, -but he does not have administrative access to nodes or a say in how the -application image is constructed. When someone asks for help, Roy's first step -is to run his team's autodiagnose script: - -``` -% kubectl debug --image=gcr.io/google_containers/autodiagnose nginx-pod-1234 -``` - -### Appendix 2: Requirements Analysis - -Many people have proposed alternate solutions to this problem. This section -discusses how the proposed solution meets all of the stated requirements and is -intended to contrast the alternatives listed below. - -**Troubleshoot arbitrary running containers with minimal prior configuration.** -This solution requires no prior configuration. - -**Access to namespaces and the file systems of individual containers.** This -solution runs a container in the shared pod namespaces (e.g. network) and will -attach to the PID namespace of a target container when not shared with the -entire pod. It relies on the behavior of `/proc//root` to provide access to -filesystems of individual containers. - -**Fetch troubleshooting utilities at debug time**. This solution uses normal -container image distribution mechanisms to fetch images when the debug command -is run. - -**Respect admission restrictions.** Requests from kubectl are proxied through -the apiserver and so are available to existing [admission -controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins -already exist to intercept `exec` and `attach` calls, but extending this to -support `debug` has not yet been scoped. - -**Allow introspection of pod state using existing tools**. The list of -`DebugContainerStatuses` is never truncated. If a debug container has run in -this pod it will appear here. - -**Support arbitrary runtimes via the CRI**. This proposal is implemented -entirely in the kubelet runtime manager and requires no changes in the -individual runtimes. - -**Have an excellent user experience**. This solution is conceptually -straightforward and surfaced in a single `kubectl` command that "runs a thing in -a pod". Debug tools are distributed by container image, which is already well -understood by users. There is no automatic copying of files or hidden paths. - -By using container images, users are empowered to create custom debug images. -Available images can be restricted by admission policy. Some examples of -possible debug images: - -* A script that automatically gathers a debugging snapshot and uploads it to a - cloud storage bucket before killing the pod. -* An image with a shell modified to log every statement to an audit API. - -**Require no direct access to the node.** This solution uses the standard -streaming API. - -**Have no inherent side effects to the running container image.** The target pod -is not modified by default, but resources used by the debug container will be -billed to the pod's cgroup, which means it could be evicted. A future -improvement could be to decrease the likelihood of eviction when there's an -active debug container. - -### Appendix 3: Alternatives Considered - -#### Mutable Pod Spec - -Rather than adding an operation to have Kubernetes attach a pod we could instead -make the pod spec mutable so the client can generate an update adding a -container. `SyncPod()` has no issues adding the container to the pod at that -point, but an immutable pod spec has been a basic assumption in Kubernetes thus -far and changing it carries risk. It's preferable to keep the pod spec immutable -as a best practice. - -#### Ephemeral container - -An earlier version of this proposal suggested running an ephemeral container in -the pod namespaces. The container would not be added to the pod spec and would -exist only as long as the process it ran. This has the advantage of behaving -similarly to the current kubectl exec, but it is opaque and likely violates -design assumptions. We could add constructs to track and report on both -traditional exec process and exec containers, but this would probably be more -work than adding to the pod spec. Both are generally useful, and neither -precludes the other in the future, so we chose mutating the pod spec for -expedience. - -#### Attaching Container Type Volume - -Combining container volumes ([#831](https://issues.k8s.io/831)) with the ability -to add volumes to the pod spec would get us most of the way there. One could -mount a volume of debug utilities at debug time. Docker does not allow adding a -volume to a running container, however, so this would require a container -restart. A restart doesn't meet our requirements for troubleshooting. - -Rather than attaching the container at debug time, kubernetes could always -attach a volume at a random path at run time, just in case it's needed. Though -this simplifies the solution by working within the existing constraints of -`kubectl exec`, it has a sufficient list of minor limitations (detailed in -[#10834](https://issues.k8s.io/10834)) to result in a poor user experience. - -#### Inactive container - -If Kubernetes supported the concept of an "inactive" container, we could -configure it as part of a pod and activate it at debug time. In order to avoid -coupling the debug tool versions with those of the running containers, we would -need to ensure the debug image was pulled at debug time. The container could -then be run with a TTY and attached using kubectl. We would need to figure out a -solution that allows access the filesystem of other containers. - -The downside of this approach is that it requires prior configuration. In -addition to requiring prior consideration, it would increase boilerplate config. -A requirement for prior configuration makes it feel like a workaround rather -than a feature of the platform. - -#### Implicit Empty Volume - -Kubernetes could implicitly create an EmptyDir volume for every pod which would -then be available as target for either the kubelet or a sidecar to extract a -package of binaries. - -Users would have to be responsible for hosting a package build and distribution -infrastructure or rely on a public one. The complexity of this solution makes it -undesirable. - -#### Standalone Pod in Shared Namespace - -Rather than inserting a new container into a pod namespace, Kubernetes could -instead support creating a new pod with container namespaces shared with -another, target pod. This would be a simpler change to the Kubernetes API, which -would only need a new field in the pod spec to specify the target pod. To be -useful, the containers in this "Debug Pod" should be run inside the namespaces -(network, pid, etc) of the target pod but remain in a separate resource group -(e.g. cgroup for container-based runtimes). - -This would be a rather fundamental change to pod, which is currently treated as -an atomic unit. The Container Runtime Interface has no provisions for sharing -outside of a pod sandbox and would need a refactor. This could be a complicated -change for non-container runtimes (e.g. hypervisor runtimes) which have more -rigid boundaries between pods. - -Effectively, Debug Pod must be implemented by the runtimes while Debug -Containers are implemented by the kubelet. Minimizing change to the Kubernetes -API is not worth the increased complexity for the kubelet and runtimes. - -It could also be possible to implement a Debug Pod as a privileged pod that runs -in the host namespace and interacts with the runtime directly to run a new -container in the appropriate namespace. This solution would be runtime-specific -and effectively pushes the complexity of debugging to the user. Additionally, -requiring node-level access to debug a pod does not meet our requirements. - -#### Exec from Node - -The kubelet could support executing a troubleshooting binary from the node in -the namespaces of the container. Once executed this binary would lose access to -other binaries from the node, making it of limited utility and a confusing user -experience. - -This couples the debug tools with the lifecycle of the node, which is worse than -coupling it with container images. - -## Reference - -* [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140) -* [CRI Tracking Issue](https://issues.k8s.io/28789) -* [CRI: expose optional runtime features](https://issues.k8s.io/32803) -* [Resource QoS in - Kubernetes](https://github.com/kubernetes/kubernetes/blob/master/docs/design/resource-qos.md) -* Related Features - * [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across - containers in a pod - * [#26751](https://issues.k8s.io/26751) - Pod-Level cgroup - * [#10782](https://issues.k8s.io/10782) - Vertical pod autoscaling -- cgit v1.2.3