author Jess Frazelle <acidburn@google.com> 2017-05-18 21:24:46 -0400
committer Jess Frazelle <acidburn@google.com> 2017-05-24 04:14:26 -0400
commit efba5a6608f4cfaa3172a09bcbdef39d14c4912b (patch)
tree 8cdea77e740c5a5e78318e46ffa16e4cee5bad06
parent 3bd5f9f368b87ba3de268cd3cc39bc14ec34072a (diff)
update no new privs proposal
Signed-off-by: Jess Frazelle <acidburn@google.com>
-rw-r--r-- contributors/design-proposals/no-new-privs.md 144
1 files changed, 110 insertions, 34 deletions
diff --git a/contributors/design-proposals/no-new-privs.md b/contributors/design-proposals/no-new-privs.md
index bd6cac31..f764e399 100644
--- a/contributors/design-proposals/no-new-privs.md
+++ b/contributors/design-proposals/no-new-privs.md
@@ -1,65 +1,141 @@
-#Support "no new privileges" in Kubernetes
+# No New Privileges
-##Description
+- [Description](#description)
+ * [Interactions with other Linux primitives](#interactions-with-other-linux-primitives)
+- [Current Implementations](#current-implementations)
+ * [Support in Docker](#support-in-docker)
+ * [Support in rkt](#support-in-rkt)
+ * [Support in OCI runtimes](#support-in-oci-runtimes)
+- [Existing SecurityContext objects](#existing-securitycontext-objects)
+- [Changes of SecurityContext objects](#changes-of-securitycontext-objects)
+- [Pod Security Policy changes](#pod-security-policy-changes)
-In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes.
-`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call.
+## Description
-For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
+In Linux, the `execve` system call can grant more privileges to a newly-created
+process than its parent process. Considering security issues, since Linux kernel
+v3.5, there is a new flag named `no_new_privs` added to prevent those new
+privileges from being granted to the processes.
-Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option.
+[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt)
+is inherited across `fork`, `clone` and `execve` and can not be unset. With
+`no_new_privs` set, `execve` promises not to grant the privilege to do anything
+that could not have been done without the `execve` call.
-We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make the Kubernetes cluster more safe. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) of the ticket in Kubernetes community to track this proposal.
+For more details about `no_new_privs`, please check the
+[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
+This is different from `NOSUID` in that `no_new_privs` can give permission to
+the container process to further restrict child processes with seccomp. This
+permission goes only one-way in that the container process can not grant more
+permissions, only further restrict.
-##Current implementation
+### Interactions with other Linux primitives
-###Support in Docker
+- suid binaries: will break when `no_new_privs` is enabled
+- seccomp2 as a non root user: requires `no_new_privs`
+- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs`
+- ambient capabilities: requires `no_new_privs`
+- selinux transitions: had bugs, since fixed, that are documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)
-Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox`
-For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers.
+## Current Implementations
-###Support in OCI runtimes
+### Support in Docker
-Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file.
+Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs`
+while creating containers, for example
+`docker run --security-opt=no-new-privileges busybox`.
-More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290).
+Docker's Go API provides an object named `ContainerCreateConfig` to
+configure container creation parameters. In this object, there is a string
+array `HostConfig.SecurityOpt` to specify the security options. Clients can
+use this field to pass security option arguments when creating new
+containers.
-###SecurityContext in Kubernetes
+This field did not scale well for the Docker client, so it's suggested that
+Kubernetes does not follow that design.
-Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options.
+This is not on by default in Docker.
-While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings:
-* method `#getSecurityOpts` in `docker_mager_xxxx.go` for Docker runtime
-* method `#getContainerSecurityOpts` in `docker_container.go` for CRI
+More details of the Docker implementation can be read
+[here](https://github.com/moby/moby/pull/20727) as well as the original
+discussion [here](https://github.com/moby/moby/issues/20329).
+### Support in rkt
-##Proposal to support "no new privileges"
+Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt.
-To support "no new privileges" options in Kubernetes, it is proposed to make the following changes:
+More details of the rkt implementation can be read
+[here](https://github.com/rkt/rkt/pull/2677).
-###Changes of SecurityContext objects
+### Support in OCI runtimes
-Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition:
-* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag.
-* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting.
+Since version 0.3.0 of the OCI runtime specification, a user can specify the
+`noNewPrivs` boolean flag in the configuration file.
-By default, `noNewPrivileges` is `false`.
+More details of the OCI implementation can be read
+[here](https://github.com/opencontainers/runtime-spec/pull/290).
-The change of security context API objects requires the update of corresponding Kubernetes documents, need to submit another PR to track this.
+## Existing SecurityContext objects
-###Changes of docker runtime
+Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
+for `PodSpec`. `SecurityContext` objects define the related security options
+for Kubernetes containers, e.g. selinux options.
-When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
+To support "no new privileges" options in Kubernetes, it is proposed to make
+the following changes:
-###Changes of CRI runtime
+## Changes of SecurityContext objects
-When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
+Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext`
+definition.
-###Changes of kubectl
+By default, i.e. when `allowPrivilegeEscalation=nil`, we will set `no_new_privs=true`
+with the following exceptions:
-This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled.
+- when a container is `privileged`
+- when `CAP_SYS_ADMIN` is added to a container
+- when a container is run as a non-root user, i.e. any uid other than `0`
+ (to prevent breaking suid binaries)
-If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR.
+The API will reject as invalid the combination of `privileged=true` and
+`allowPrivilegeEscalation=false`, as well as the combination of
+`capAdd=CAP_SYS_ADMIN` and `allowPrivilegeEscalation=false`.
+
+When `allowPrivilegeEscalation` is set to `false` it will enable `no_new_privs`
+for that container.
+
+`allowPrivilegeEscalation` in `SecurityContext` provides container level
+control of the `no_new_privs` flag and can override the default in either
+direction.
+
+This requires changes to the Docker, rkt, and CRI runtime integrations so that
+kubelet will add the specific `no_new_privs` option.
+
+## Pod Security Policy changes
+
+The default can be set via a new `*bool` type field named `defaultAllowPrivilegeEscalation`
+in a Pod Security Policy.
+This would allow users to set `defaultAllowPrivilegeEscalation=false`, overriding the
+default `nil` behavior of `no_new_privs=false` for containers
+whose uids are not 0.
+
+This would also keep the behavior of setting the security context as
+`allowPrivilegeEscalation=true`
+for privileged containers and those with `capAdd=CAP_SYS_ADMIN`.
+
+To recap, below is a table defining the default behavior at the pod security
+policy level and what can be set as a default with a pod security policy.
+
+| allowPrivilegeEscalation setting | uid = 0 or unset | uid != 0 | privileged/CAP_SYS_ADMIN |
+|----------------------------------|--------------------|--------------------|--------------------------|
+| nil | no_new_privs=true | no_new_privs=false | no_new_privs=false |
+| false | no_new_privs=true | no_new_privs=true | no_new_privs=false |
+| true | no_new_privs=false | no_new_privs=false | no_new_privs=false |
+
+A new `bool` field named `allowPrivilegeEscalation` will also be added to the
+Pod Security Policy to gate whether or not a user is allowed to set the
+security context to `allowPrivilegeEscalation=true`. This field will default to
+`false`.
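
The defaulting table and the API rejection rule in the proposal above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not code from this commit or from Kubernetes: the `containerInfo` type and the `noNewPrivs` and `validate` functions are hypothetical names, and an unset uid is treated like uid `0`, as in the table's "uid = 0 or unset" column.

```go
package main

import (
	"errors"
	"fmt"
)

// containerInfo is a hypothetical struct holding only the inputs the
// proposal's table depends on. It is not a Kubernetes API type.
type containerInfo struct {
	AllowPrivilegeEscalation *bool  // nil means the field was not set
	RunAsUser                *int64 // nil means unset, treated like uid 0
	Privileged               bool
	AddsSysAdmin             bool // CAP_SYS_ADMIN is among the added capabilities
}

// noNewPrivs returns the effective no_new_privs value from the table.
func noNewPrivs(c containerInfo) bool {
	// Privileged and CAP_SYS_ADMIN containers never get no_new_privs.
	if c.Privileged || c.AddsSysAdmin {
		return false
	}
	// An explicit setting wins: allowPrivilegeEscalation=false enables the flag.
	if c.AllowPrivilegeEscalation != nil {
		return !*c.AllowPrivilegeEscalation
	}
	// nil: enable only for uid 0 or unset, so suid binaries keep working
	// for non-root users.
	return c.RunAsUser == nil || *c.RunAsUser == 0
}

// validate mirrors the rule that the API rejects
// allowPrivilegeEscalation=false combined with privileged or CAP_SYS_ADMIN.
func validate(c containerInfo) error {
	if c.AllowPrivilegeEscalation == nil || *c.AllowPrivilegeEscalation {
		return nil
	}
	if c.Privileged {
		return errors.New("privileged=true conflicts with allowPrivilegeEscalation=false")
	}
	if c.AddsSysAdmin {
		return errors.New("capAdd=CAP_SYS_ADMIN conflicts with allowPrivilegeEscalation=false")
	}
	return nil
}

func main() {
	deny := false
	uid := int64(1000)

	fmt.Println(noNewPrivs(containerInfo{}))                // nil setting, uid unset
	fmt.Println(noNewPrivs(containerInfo{RunAsUser: &uid})) // nil setting, non-root
	fmt.Println(noNewPrivs(containerInfo{RunAsUser: &uid, AllowPrivilegeEscalation: &deny}))
	fmt.Println(validate(containerInfo{Privileged: true, AllowPrivilegeEscalation: &deny}))
}
```

The `false` row of the table applies when the value comes from a Pod Security Policy default; a container that sets `allowPrivilegeEscalation=false` itself together with `privileged` or `CAP_SYS_ADMIN` is rejected by `validate` before the table is consulted.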