author    Ihor Dvoretskyi <ihor@linux.com>  2018-06-05 14:52:40 +0000
committer Ihor Dvoretskyi <ihor@linux.com>  2018-06-05 14:52:40 +0000
commit    bed39ba418bde30d95bdcb969bc13dfbd779621c (patch)
tree      56b519975c95fd10cea63fffb8524fe49423dfc8 /contributors
parent    759cd201e0d76ee30c8b7aa5e750620917fd8c00 (diff)
parent    920d87ea659ca1fe238b3bdb9c94e4d834451fdb (diff)
sig-list.md updated
Signed-off-by: Ihor Dvoretskyi <ihor@linux.com>
Diffstat (limited to 'contributors')
-rw-r--r--  contributors/design-proposals/api-machinery/aggregated-api-servers.md  2
-rw-r--r--  contributors/design-proposals/apps/controller_history.md  2
-rw-r--r--  contributors/design-proposals/apps/daemonset-update.md  2
-rw-r--r--  contributors/design-proposals/apps/statefulset-update.md  2
-rw-r--r--  contributors/design-proposals/auth/proc-mount-type.md  93
-rw-r--r--  contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md  89
-rw-r--r--  contributors/design-proposals/node/cri-windows.md  2
-rw-r--r--  contributors/design-proposals/node/kubelet-cri-logging.md  17
-rw-r--r--  contributors/design-proposals/node/node-usernamespace-remapping.md  209
-rw-r--r--  contributors/design-proposals/scheduling/rescheduling.md  2
-rw-r--r--  contributors/design-proposals/scheduling/taint-node-by-condition.md  7
-rw-r--r--  contributors/design-proposals/storage/container-storage-interface.md  2
-rw-r--r--  contributors/design-proposals/storage/grow-volume-size.md  2
-rw-r--r--  contributors/design-proposals/storage/pv-to-rbd-mapping.md  2
-rw-r--r--  contributors/design-proposals/storage/svcacct-token-volume-source.md  148
-rw-r--r--  contributors/design-proposals/storage/volume-topology-scheduling.md  777
-rw-r--r--  contributors/devel/api_changes.md  13
-rw-r--r--  contributors/devel/coding-conventions.md  3
-rw-r--r--  contributors/devel/development.md  4
-rw-r--r--  contributors/devel/faster_reviews.md  4
-rw-r--r--  contributors/devel/flexvolume.md  8
-rw-r--r--  contributors/devel/go-code.md  3
-rw-r--r--  contributors/devel/owners.md  4
-rw-r--r--  contributors/devel/pull-requests.md  4
-rw-r--r--  contributors/devel/release/OWNERS  8
-rw-r--r--  contributors/devel/release/README.md  3
-rw-r--r--  contributors/devel/release/issues.md  3
-rw-r--r--  contributors/devel/release/patch-release-manager.md  3
-rw-r--r--  contributors/devel/release/patch_release.md  3
-rw-r--r--  contributors/devel/release/scalability-validation.md  3
-rw-r--r--  contributors/devel/release/testing.md  3
-rw-r--r--  contributors/devel/scalability-good-practices.md  4
-rw-r--r--  contributors/devel/scheduler.md  2
-rw-r--r--  contributors/devel/security-release-process.md  3
-rw-r--r--  contributors/guide/README.md  2
-rw-r--r--  contributors/guide/contributor-cheatsheet.md  13
-rw-r--r--  contributors/guide/github-workflow.md  16
-rw-r--r--  contributors/new-contributor-playground/OWNERS  14
-rw-r--r--  contributors/new-contributor-playground/README.md  12
-rw-r--r--  contributors/new-contributor-playground/hello-from-copenhagen.md  4
-rw-r--r--  contributors/new-contributor-playground/new-contributor-notes.md  350
-rw-r--r--  contributors/new-contributor-playground/new-contributors.md  5
42 files changed, 1622 insertions, 230 deletions
diff --git a/contributors/design-proposals/api-machinery/aggregated-api-servers.md b/contributors/design-proposals/api-machinery/aggregated-api-servers.md
index c5f8ca1a..d436c6b9 100644
--- a/contributors/design-proposals/api-machinery/aggregated-api-servers.md
+++ b/contributors/design-proposals/api-machinery/aggregated-api-servers.md
@@ -31,7 +31,7 @@ aggregated servers.
* Developers should be able to write their own API server and cluster admins
should be able to add them to their cluster, exposing new APIs at runtime. All
of this should not require any change to the core kubernetes API server.
-* These new APIs should be seamless extension of the core kubernetes APIs (ex:
+* These new APIs should be seamless extensions of the core kubernetes APIs (ex:
they should be operated upon via kubectl).
## Non Goals
diff --git a/contributors/design-proposals/apps/controller_history.md b/contributors/design-proposals/apps/controller_history.md
index af58fad2..6e313ce8 100644
--- a/contributors/design-proposals/apps/controller_history.md
+++ b/contributors/design-proposals/apps/controller_history.md
@@ -390,7 +390,7 @@ the following command.
### Rollback
-For future work, `kubeclt rollout undo` can be implemented in the general case
+For future work, `kubectl rollout undo` can be implemented in the general case
as an extension of the [above](#viewing-history ).
```bash
diff --git a/contributors/design-proposals/apps/daemonset-update.md b/contributors/design-proposals/apps/daemonset-update.md
index aea7e244..f4ce1256 100644
--- a/contributors/design-proposals/apps/daemonset-update.md
+++ b/contributors/design-proposals/apps/daemonset-update.md
@@ -42,7 +42,7 @@ Here are some potential requirements that haven't been covered by this proposal:
- Uptime is critical for each pod of a DaemonSet during an upgrade (e.g. the time
from a DaemonSet pods being killed to recreated and healthy should be < 5s)
- Each DaemonSet pod can still fit on the node after being updated
-- Some DaemonSets require the node to be drained before the DeamonSet's pod on it
+- Some DaemonSets require the node to be drained before the DaemonSet's pod on it
is updated (e.g. logging daemons)
- DaemonSet's pods are implicitly given higher priority than non-daemons
- DaemonSets can only be operated by admins (i.e. people who manage nodes)
diff --git a/contributors/design-proposals/apps/statefulset-update.md b/contributors/design-proposals/apps/statefulset-update.md
index 27d3000f..b4089011 100644
--- a/contributors/design-proposals/apps/statefulset-update.md
+++ b/contributors/design-proposals/apps/statefulset-update.md
@@ -747,7 +747,7 @@ kubectl rollout undo statefulset web
### Rolling Forward
Rolling back is usually the safest, and often the fastest, strategy to mitigate
deployment failure, but rolling forward is sometimes the only practical solution
-for stateful applications (e.g. A users has a minor configuration error but has
+for stateful applications (e.g. A user has a minor configuration error but has
already modified the storage format for the application). Users can use
sequential `kubectl apply`'s to update the StatefulSet's current
[target state](#target-state). The StatefulSet's `.Spec.GenerationPartition`
diff --git a/contributors/design-proposals/auth/proc-mount-type.md b/contributors/design-proposals/auth/proc-mount-type.md
new file mode 100644
index 00000000..073fc23e
--- /dev/null
+++ b/contributors/design-proposals/auth/proc-mount-type.md
@@ -0,0 +1,93 @@
+# ProcMount/ProcMountType Option
+
+## Background
+
+Currently, Docker and most other container runtimes mask certain paths in
+`/proc` and set others read-only. This prevents data that should not be
+exposed from leaking into a container. However, there are certain use cases
+where it is necessary to turn this off.
+
+## Motivation
+
+For end-users who would like to run unprivileged containers using user namespaces
+_nested inside_ CRI containers, we need a `ProcMount` option. That is, we
+need an option to explicitly turn off the masking and read-only mounting of
+paths so that we can
+mount `/proc` in the nested container as an unprivileged user.
+
+Please see the following filed issues for more information:
+- [opencontainers/runc#1658](https://github.com/opencontainers/runc/issues/1658#issuecomment-373122073)
+- [moby/moby#36597](https://github.com/moby/moby/issues/36597)
+- [moby/moby#36644](https://github.com/moby/moby/pull/36644)
+
+Please also see the [use case for building images securely in kubernetes](https://github.com/jessfraz/blog/blob/master/content/post/building-container-images-securely-on-kubernetes.md).
+
+The option to unmask the paths in `/proc` really only makes sense when a user
+is nesting
+unprivileged containers with user namespaces, as it exposes more information
+than is necessary to the program running in the container spawned by
+Kubernetes.
+
+The main use case for this option is to run
+[genuinetools/img](https://github.com/genuinetools/img) inside a kubernetes
+container. That program then launches sub-containers that take advantage of
+user namespaces, re-mask `/proc`, and set `/proc` read-only. Therefore
+there is no concern with having an unmasked proc open in the top-level container.
+
+It should be noted that this is different from the host `/proc`. It is still
+a newly mounted `/proc`; the container runtime simply will not mask the paths.
+
+Since the only use case for this option is to run unprivileged nested
+containers,
+this option should only be allowed or used if the user in the container is not `root`.
+This can be easily enforced with `MustRunAs`.
+Since the user inside is still unprivileged,
+doing things to `/proc` would be off limits regardless, since Linux user
+permission checks already prevent this.
+
+## Existing SecurityContext objects
+
+Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
+for `PodSpec`. `SecurityContext` objects define the related security options
+for Kubernetes containers, e.g. selinux options.
+
+To support "ProcMount" options in Kubernetes, it is proposed to make
+the following changes:
+
+## Changes of SecurityContext objects
+
+A new `string` type named `ProcMountType` will hold the viable
+options for a new `procMount` field in the `SecurityContext`
+definition.
+
+By default, `procMount` is `Default`, i.e. the same behavior as today: the
+paths are masked.
+
+This will look like the following in the spec:
+
+```go
+type ProcMountType string
+
+const (
+ // DefaultProcMount uses the container runtime default ProcType. Most
+ // container runtimes mask certain paths in /proc to avoid accidental security
+ // exposure of special devices or information.
+ DefaultProcMount ProcMountType = "Default"
+
+ // UnmaskedProcMount bypasses the default masking behavior of the container
+ // runtime and ensures the newly created /proc in the container stays intact
+ // with no modifications.
+ UnmaskedProcMount ProcMountType = "Unmasked"
+)
+
+procMount *ProcMountType
+```
+
+This requires changes to the CRI runtime integrations so that
+kubelet will add the specific `unmasked` or `whatever_it_is_named` option.
+
+## Pod Security Policy changes
+
+A new `[]ProcMountType{}` field named `allowedProcMounts` will be added to the Pod
+Security Policy as well to gate the allowed ProcMountTypes a user is allowed to
+set. This field will default to `[]ProcMountType{ DefaultProcMount }`.
diff --git a/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md b/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md
new file mode 100644
index 00000000..827df5a8
--- /dev/null
+++ b/contributors/design-proposals/network/support_traffic_shaping_for_kubelet_cni.md
@@ -0,0 +1,89 @@
+# Support traffic shaping for CNI network plugin
+
+Version: Alpha
+
+Authors: @m1093782566
+
+## Motivation and background
+
+Currently the kubenet code supports applying basic traffic shaping during pod setup. This will happen if bandwidth-related annotations have been added to the pod's metadata, for example:
+
+```json
+{
+ "kind": "Pod",
+ "metadata": {
+ "name": "iperf-slow",
+ "annotations": {
+ "kubernetes.io/ingress-bandwidth": "10M",
+ "kubernetes.io/egress-bandwidth": "10M"
+ }
+ }
+}
+```
+
+Our current implementation uses Linux `tc` to add download (ingress) and upload (egress) rate limiters, using one root `qdisc`, two `class`es (one for ingress and one for egress) and two `filter`s (one for ingress and one for egress, attached to the respective classes).
+
+Kubelet CNI code doesn't support this yet, though CNI has already added a [traffic shaping plugin](https://github.com/containernetworking/plugins/tree/master/plugins/meta/bandwidth). We can replicate the behavior we have today in kubenet for the kubelet CNI network plugin if we feel this is an important feature.
+
+## Goal
+
+Support traffic shaping for CNI network plugin in Kubernetes.
+
+## Non-goal
+
+Requiring CNI plugins to implement this sort of traffic shaping guarantee.
+
+## Proposal
+
+If kubelet starts up with `network-plugin=cni` and the user has enabled traffic shaping via the network plugin configuration, it will populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin.
+
+Traffic shaping in Kubelet CNI network plugin can work with ptp and bridge network plugins.
+
+### Pod Setup
+
+When we create a pod with bandwidth configuration in its metadata, for example,
+
+```json
+{
+ "kind": "Pod",
+ "metadata": {
+ "name": "iperf-slow",
+ "annotations": {
+ "kubernetes.io/ingress-bandwidth": "10M",
+ "kubernetes.io/egress-bandwidth": "10M"
+ }
+ }
+}
+```
+
+Kubelet would first parse the ingress and egress bandwidth values and transform them to Kbps, because both `ingressRate` and `egressRate` in the CNI bandwidth plugin are in Kbps. A user would add something like this to their CNI config list if they want to enable traffic shaping via the plugin:
+
+```json
+{
+ "type": "bandwidth",
+ "capabilities": {"trafficShaping": true}
+}
+```
+
+Kubelet would then populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin:
+
+```json
+{
+ "type": "bandwidth",
+ "runtimeConfig": {
+ "trafficShaping": {
+ "ingressRate": "X",
+ "egressRate": "Y"
+ }
+ }
+}
+```
+
+### Pod Teardown
+
+When we delete a pod, kubelet will build the runtime config for calling the CNI plugin's `DelNetwork`/`DelNetworkList` API, which will remove this pod's bandwidth configuration.
+
+## Next step
+
+* Support ingress and egress burst bandwidth in Pod.
+* Graduate annotations to Pod Spec.
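The annotation-to-Kbps conversion described above can be sketched as follows. This is an illustrative sketch under the assumption of decimal `K`/`M`/`G` suffixes on bit-per-second values (as in `"10M"`); the real kubelet parses these values as resource quantities and validates them, and `annotationToKbps` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// annotationToKbps converts a bandwidth annotation value such as "10M"
// (bits per second with a decimal K/M/G suffix) into Kbps, the unit this
// proposal says the CNI bandwidth plugin's ingressRate/egressRate expect.
func annotationToKbps(v string) (int64, error) {
	mult := int64(1)
	switch {
	case strings.HasSuffix(v, "K"):
		mult, v = 1_000, strings.TrimSuffix(v, "K")
	case strings.HasSuffix(v, "M"):
		mult, v = 1_000_000, strings.TrimSuffix(v, "M")
	case strings.HasSuffix(v, "G"):
		mult, v = 1_000_000_000, strings.TrimSuffix(v, "G")
	}
	n, err := strconv.ParseInt(strings.TrimSpace(v), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid bandwidth value: %v", err)
	}
	return n * mult / 1_000, nil // bits/s -> Kbit/s
}

func main() {
	kbps, _ := annotationToKbps("10M")
	fmt.Println(kbps) // 10000
}
```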
diff --git a/contributors/design-proposals/node/cri-windows.md b/contributors/design-proposals/node/cri-windows.md
index e1a7f1fa..6589d985 100644
--- a/contributors/design-proposals/node/cri-windows.md
+++ b/contributors/design-proposals/node/cri-windows.md
@@ -85,7 +85,7 @@ The implementation will mainly be in two parts:
In both parts, we need to implement:
* Fork code for Windows from Linux.
-* Convert from Resources.Requests and Resources.Limits to Windows configuration in CRI, and convert from Windows configration in CRI to container configuration.
+* Convert from Resources.Requests and Resources.Limits to Windows configuration in CRI, and convert from Windows configuration in CRI to container configuration.
To implement resource controls for Windows containers, refer to [this MSDN documentation](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls) and [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go).
diff --git a/contributors/design-proposals/node/kubelet-cri-logging.md b/contributors/design-proposals/node/kubelet-cri-logging.md
index a19ff3f5..12d0624d 100644
--- a/contributors/design-proposals/node/kubelet-cri-logging.md
+++ b/contributors/design-proposals/node/kubelet-cri-logging.md
@@ -142,11 +142,22 @@ extend this by maintaining a metadata file in the pod directory.
**Log format**
The runtime should decorate each log entry with a RFC 3339Nano timestamp
-prefix, the stream type (i.e., "stdout" or "stderr"), and ends with a newline.
+prefix, the stream type (i.e., "stdout" or "stderr"), the tags of the log
+entry, and the log content, which ends with a newline.
+The `tags` field can support multiple tags, delimited by `:`. Currently, only
+one tag is defined in CRI to support multi-line log entries: partial or full.
+Partial (`P`) is used when a log entry is split into multiple lines by the
+runtime, and the entry has not ended yet. Full (`F`) indicates that the log
+entry is completed -- it is either a single-line entry, or this is the last
+line of the multi-line entry.
+
+For example,
```
-2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1
-2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2
+2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1
+2016-10-06T00:17:09.669794202Z stdout P First line of log entry 2
+2016-10-06T00:17:09.669794202Z stdout P Second line of the log entry 2
+2016-10-06T00:17:10.113242941Z stderr F Last line of the log entry 2
```
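The four-field format above (timestamp, stream, tag, content) can be parsed with a single bounded split. A minimal sketch of the format described in this proposal, not kubelet's actual parser:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// logEntry holds the four fields of one CRI log line.
type logEntry struct {
	timestamp string // RFC 3339Nano
	stream    string // "stdout" or "stderr"
	tag       string // "P" (partial) or "F" (full)
	content   string // log content up to the trailing newline
}

// parseCRILog splits a log line into its fields; the content itself may
// contain spaces, so only the first three separators are significant.
func parseCRILog(line string) (logEntry, error) {
	parts := strings.SplitN(strings.TrimSuffix(line, "\n"), " ", 4)
	if len(parts) != 4 {
		return logEntry{}, errors.New("malformed log line")
	}
	return logEntry{parts[0], parts[1], parts[2], parts[3]}, nil
}

func main() {
	e, _ := parseCRILog("2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1")
	fmt.Println(e.stream, e.tag) // stdout F
}
```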
With this knowledge, kubelet can parse the logs and serve them for `kubectl
diff --git a/contributors/design-proposals/node/node-usernamespace-remapping.md b/contributors/design-proposals/node/node-usernamespace-remapping.md
new file mode 100644
index 00000000..75cb0888
--- /dev/null
+++ b/contributors/design-proposals/node/node-usernamespace-remapping.md
@@ -0,0 +1,209 @@
+# Support Node-Level User Namespaces Remapping
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+- [Goals](#goals)
+- [Non-Goals](#non-goals)
+- [User Stories](#user-stories)
+- [Proposal](#proposal)
+- [Future Work](#future-work)
+- [Risks and Mitigations](#risks-and-mitigations)
+- [Graduation Criteria](#graduation-criteria)
+- [Alternatives](#alternatives)
+
+
+_Authors:_
+
+* Mrunal Patel &lt;mpatel@redhat.com&gt;
+* Jan Pazdziora &lt;jpazdziora@redhat.com&gt;
+* Vikas Choudhary &lt;vichoudh@redhat.com&gt;
+
+## Summary
+Container security consists of many different kernel features that work together to make containers secure. User namespaces is one such feature that enables interesting possibilities for containers by allowing them to be root inside the container while not being root on the host. This gives more capabilities to the containers while protecting the host from the container being root and adds one more layer to container security.
+In this proposal we discuss:
+- use-cases/user-stories that benefit from this enhancement
+- implementation design and scope for alpha release
+- long-term roadmap to fully support this feature beyond alpha
+
+## Motivation
+From user_namespaces(7):
+> User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities. A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
+
+In order to run Pods with software which expects to run as root or with elevated privileges, while still containing the processes and protecting both the Nodes and other Pods, the Linux kernel mechanism of user namespaces can be used to make the processes in the Pods view their environment as having those privileges, while at the host (Node) level these processes appear without privileges, or with privileges only affecting processes in the same Pods.
+
+The purpose of using user namespaces in Kubernetes is to let the processes in Pods think they run as one uid set when in fact they run as different “real” uids on the Nodes.
+
+In this text, nearly everything said about uids also applies to gids.
+
+## Goals
+Enable user namespace support in a kubernetes cluster so that workloads that work today also work with user namespaces enabled at runtime. Furthermore, make workloads that require a root/privileged user inside the container safer for the node using the additional security of user namespaces. Containers will run in a user namespace different from the user namespace of the underlying host.
+
+## Non-Goals
+- Supporting pod/container-level user namespace isolation. There can be images using different users, but on the node, pods/containers running with these images will share a common user-namespace remapping configuration. In other words, all containers on a node share a common user-namespace range.
+- Remote volumes support eg. NFS
+
+## User Stories
+- As a cluster admin, I want to protect the node from the rogue container process(es) running inside pod containers with root privileges. If such a process is able to break out into the node, it could be a security issue.
+- As a cluster admin, I want to support all the images irrespective of what user/group that image is using.
+- As a cluster admin, I want to allow some pods to disable user namespaces if they require elevated privileges.
+
+## Proposal
+The proposal is to support user namespaces for pod containers. This can be done at two levels:
+- Node-level : This proposal explains this part in detail.
+- Namespace-Level/Pod-level: Plan is to target this in future due to missing support in the low level system components such as runtimes and kernel. More on this in the `Future Work` section.
+
+Node-level user-namespace support means that, if the feature is enabled, all pods on a node will share a common user namespace and a common UID (and GID) range, which is a subset of the node's total UIDs (and GIDs). This common user namespace is the runtime's default user-namespace range, which is remapped to containers' UIDs (and GIDs), starting with the first UID as the container's `root`.
+In general Linux convention, UID(or GID) mapping consists of three parts:
+1. Host (U/G)ID: First (U/G)ID of the range on the host that is being remapped to the (U/G)IDs in the container user-namespace
+2. Container (U/G)ID: First (U/G)ID of the range in the container namespace and this is mapped to the first (U/G)ID on the host(mentioned in previous point).
+3. Count/Size: Total number of consecutive mappings between the host and container user-namespaces, starting from the first ones (inclusive) mentioned above.
+
+As an example, consider `host_id 1000, container_id 0, size 10`.
+In this case, UIDs 1000 to 1009 on the host will be mapped to 0 to 9 inside the container.
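The mapping arithmetic above is a simple offset within a bounded range. A minimal sketch, assuming the three-part mapping just described; `idMapping` and `toHost` are illustrative names, not actual kubelet code:

```go
package main

import (
	"errors"
	"fmt"
)

// idMapping mirrors the three-part UID/GID mapping described above.
type idMapping struct {
	hostID      uint32 // first host (U/G)ID of the remapped range
	containerID uint32 // first (U/G)ID inside the container namespace
	size        uint32 // number of consecutive IDs mapped
}

// toHost translates a container UID/GID to its host UID/GID under one
// mapping. For {hostID: 1000, containerID: 0, size: 10}, container UID 0
// maps to host UID 1000 and container UID 9 to host UID 1009.
func (m idMapping) toHost(containerID uint32) (uint32, error) {
	if containerID < m.containerID || containerID >= m.containerID+m.size {
		return 0, errors.New("id outside mapping range")
	}
	return m.hostID + (containerID - m.containerID), nil
}

func main() {
	m := idMapping{hostID: 1000, containerID: 0, size: 10}
	host, _ := m.toHost(0)
	fmt.Println(host) // 1000
}
```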
+
+User-namespace support should be enabled only when the container runtime on the node supports user-namespace remapping and has it enabled in its configuration. To enable user namespaces, a feature-gate flag will need to be passed to Kubelet like this: `--feature-gates="NodeUserNamespace=true"`
+
+A new CRI API, `GetRuntimeConfigInfo` will be added. Kubelet will use this API:
+- To verify if user-namespace remapping is enabled at runtime. If found disabled, kubelet will fail to start
+- To determine the default user-namespace range at the runtime, starting UID of which is mapped to the UID '0' of the container.
+
+### Volume Permissions
+Kubelet will change the file permissions, i.e. `chown`, at `/var/lib/kubelet/pods` prior to any container start, so that file ownership is updated according to the remapped UID and GID.
+This proposal will work only for local volumes and not with remote volumes such as NFS.
+
+### How to disable `NodeUserNamespace` for a specific pod
+This can be done in two ways:
+- **Alpha:** Implicitly using host namespace for the pod containers
+This support is already present (currently it seems broken, and will be fixed) in Kubernetes as experimental functionality, which can be enabled using `--feature-gates="ExperimentalHostUserNamespaceDefaulting=true"`.
+If Pod-Security-Policy is configured to allow the following to be requested by a pod, host user-namespace will be enabled for the container:
+ - host namespaces (pid, ipc, net)
+ - non-namespaced capabilities (mknod, sys_time, sys_module)
+ - the pod contains a privileged container or using host path volumes.
+ - https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c#diff-a53fa76e941e0bdaee26dcbc435ad2ffR437 introduced via https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c.
+
+- **Beta:** Explicit API to request host user-namespace in pod spec
+ This is being targeted under Beta graduation plans.
+
+### CRI API Changes
+Proposed CRI API changes:
+
+```go
+// Runtime service defines the public APIs for remote container runtimes
+service RuntimeService {
+ // Version returns the runtime name, runtime version, and runtime API version.
+ rpc Version(VersionRequest) returns (VersionResponse) {}
+ …….
+ …….
+ // GetRuntimeConfigInfo returns the configuration details of the runtime.
+ rpc GetRuntimeConfigInfo(GetRuntimeConfigInfoRequest) returns (GetRuntimeConfigInfoResponse) {}
+}
+// LinuxIDMapping represents a single user namespace mapping in Linux.
+message LinuxIDMapping {
+ // container_id is the starting id for the mapping inside the container.
+ uint32 container_id = 1;
+ // host_id is the starting id for the mapping on the host.
+ uint32 host_id = 2;
+ // size is the length of the mapping.
+ uint32 size = 3;
+}
+
+message LinuxUserNamespaceConfig {
+ // is_enabled, if true, indicates that user namespaces are supported and enabled in the container runtime
+ bool is_enabled = 1;
+ // uid_mappings is an array of user id mappings.
+ repeated LinuxIDMapping uid_mappings = 2;
+ // gid_mappings is an array of group id mappings.
+ repeated LinuxIDMapping gid_mappings = 3;
+}
+message GetRuntimeConfig {
+ LinuxUserNamespaceConfig user_namespace_config = 1;
+}
+
+message GetRuntimeConfigInfoRequest {}
+
+message GetRuntimeConfigInfoResponse {
+ GetRuntimeConfig runtime_config = 1;
+}
+
+...
+
+// NamespaceOption provides options for Linux namespaces.
+message NamespaceOption {
+ // Network namespace for this container/sandbox.
+ // Note: There is currently no way to set CONTAINER scoped network in the Kubernetes API.
+ // Namespaces currently set by the kubelet: POD, NODE
+ NamespaceMode network = 1;
+ // PID namespace for this container/sandbox.
+ // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
+ // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
+ // Namespaces currently set by the kubelet: POD, CONTAINER, NODE
+ NamespaceMode pid = 2;
+ // IPC namespace for this container/sandbox.
+ // Note: There is currently no way to set CONTAINER scoped IPC in the Kubernetes API.
+ // Namespaces currently set by the kubelet: POD, NODE
+ NamespaceMode ipc = 3;
+ // User namespace for this container/sandbox.
+ // Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API.
+ // The container runtime should ignore this if user namespace is NOT enabled.
+ // POD is the default value. Kubelet will set it to NODE when trying to use host user-namespace
+ // Namespaces currently set by the kubelet: POD, NODE
+ NamespaceMode user = 4;
+}
+
+```
+
+### Runtime Support
+- Docker: Here is the [user-namespace documentation](https://docs.docker.com/engine/security/userns-remap/) and this is the [implementation PR](https://github.com/moby/moby/pull/12648)
+ - Concerns:
+The Docker API does not provide the user-namespace mapping. Therefore, to handle the `GetRuntimeConfigInfo` API, changes will be made in `dockershim` to read the system files `/etc/subuid` and `/etc/subgid` to figure out the default user-namespace mapping. The `/info` API will be used to figure out whether user-namespace remapping is enabled, and `Docker Root Dir` will be used to figure out the host UID mapped to UID `0` in the container; e.g. `Docker Root Dir: /var/lib/docker/2131616.2131616` shows that host UID `2131616` is mapped to UID `0`.
+- CRI-O: https://github.com/kubernetes-incubator/cri-o/pull/1519
+- Containerd: https://github.com/containerd/containerd/blob/129167132c5e0dbd1b031badae201a432d1bd681/container_opts_unix.go#L149
+
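Reading `/etc/subuid` as described for the dockershim case comes down to parsing `user:start:count` lines. A minimal sketch under that assumption; the `dockremap` user in the example is the conventional Docker remap user, and error handling is simplified:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// subIDRange is one entry from /etc/subuid or /etc/subgid, which use the
// colon-separated format "user:start:count".
type subIDRange struct {
	user  string
	start int64
	count int64
}

// parseSubIDLine parses a single /etc/subuid line into its three fields.
func parseSubIDLine(line string) (subIDRange, error) {
	fields := strings.Split(strings.TrimSpace(line), ":")
	if len(fields) != 3 {
		return subIDRange{}, fmt.Errorf("malformed subuid line: %q", line)
	}
	start, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return subIDRange{}, err
	}
	count, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return subIDRange{}, err
	}
	return subIDRange{user: fields[0], start: start, count: count}, nil
}

func main() {
	r, _ := parseSubIDLine("dockremap:2131616:65536")
	fmt.Println(r.start, r.count) // 2131616 65536
}
```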
+### Implementation Roadmap
+#### Phase 1: Support in Kubelet, Alpha, [Target: Kubernetes v1.11]
+- Add feature gate `NodeUserNamespace`, disabled by default
+- Add new CRI API, `GetRuntimeConfigInfo()`
+- Add logic in Kubelet to handle pod creation which includes parsing GetRuntimeConfigInfo response and changing file-permissions in /var/lib/kubelet with learned userns mapping.
+- Add changes in dockershim to implement GetRuntimeConfigInfo() for docker runtime
+- Add changes in CRI-O to implement userns support and GetRuntimeConfigInfo() support
+- Unit test cases
+- e2e tests
+
+#### Phase 2: Beta Support [Target: Kubernetes v1.12]
+- PSP integration
+- To grow ExperimentalHostUserNamespaceDefaulting from experimental feature gate to a Kubelet flag
+- API changes to allow a pod to request HostUserNamespace in the pod spec
+- e2e tests
+
+### References
+- Default host user namespace via experimental flag
+ - https://github.com/kubernetes/kubernetes/pull/31169
+- Enable userns support for containers launched by kubelet
+ - https://github.com/kubernetes/features/issues/127
+- Track Linux User Namespaces in the Pod Security Policy
+ - https://github.com/kubernetes/kubernetes/issues/59152
+- Add support for experimental-userns-remap-root-uid and experimental-userns-remap-root-gid options to match the remapping used by the container runtime.
+ - https://github.com/kubernetes/kubernetes/pull/55707
+- rkt User Namespaces Background
+ - https://coreos.com/rkt/docs/latest/devel/user-namespaces.html
+
+## Future Work
+### Namespace-Level/Pod-Level user-namespace support
+There is no runtime today which supports creating containers with a specified user namespace configuration. For example, here is the discussion related to this support in Docker: https://github.com/moby/moby/issues/28593.
+Once the user-namespace feature in the runtimes has evolved to support a container's request for a specific user-namespace mapping (UID and GID range), we can extend the current Node-level user-namespace support in Kubernetes to support Namespace-level isolation (or, if desired, even pod-level isolation) by dividing and allocating the mapping learned from the runtime among Kubernetes namespaces (or pods, if desired). From an end-user UI perspective, we don't expect any change in the UI related to user namespaces support.
+### Remote Volumes
+Remote Volumes support should be investigated and should be targeted in future once support is there at lower infra layers.
+
+
+## Risks and Mitigations
+The main risk with this change stems from the fact that processes in Pods will run with different “real” uids than they used to, while expecting the original uids to make operations on the Nodes or consistently access shared persistent storage.
+- This can be mitigated by turning the feature on gradually, per-Pod or per Kubernetes namespace.
+- For the Kubernetes' cluster Pods (that provide the Kubernetes functionality), testing of their behaviour and ability to run in user namespaced setups is crucial.
+
+## Graduation Criteria
+- PSP integration
+- API changes to allow a pod to request the host user namespace using, for example, `HostUserNamespace: True` in the pod spec
+- e2e tests
+
+## Alternatives
+User-namespace mappings could be passed explicitly through kubelet flags, similar to https://github.com/kubernetes/kubernetes/pull/55707, but we do not prefer this option because it is very prone to misconfiguration.
diff --git a/contributors/design-proposals/scheduling/rescheduling.md b/contributors/design-proposals/scheduling/rescheduling.md
index db960934..32d86a27 100644
--- a/contributors/design-proposals/scheduling/rescheduling.md
+++ b/contributors/design-proposals/scheduling/rescheduling.md
@@ -28,7 +28,7 @@ implied. However, describing the process as "moving" the pod is approximately ac
and easier to understand, so we will use this terminology in the document.
We use the term "rescheduling" to describe any action the system takes to move an
-already-running pod. The decision may be made and executed by any component; we wil
+already-running pod. The decision may be made and executed by any component; we will
introduce the concept of a "rescheduler" component later, but it is not the only
component that can do rescheduling.
diff --git a/contributors/design-proposals/scheduling/taint-node-by-condition.md b/contributors/design-proposals/scheduling/taint-node-by-condition.md
index 550e9cd9..2e352d4f 100644
--- a/contributors/design-proposals/scheduling/taint-node-by-condition.md
+++ b/contributors/design-proposals/scheduling/taint-node-by-condition.md
@@ -19,8 +19,8 @@ In addition to this, with taint-based-eviction, the Node Controller already tain
| ------------------ | ------------------ | ------------ | -------- |
|Ready |True | - | |
| |False | NoExecute | node.kubernetes.io/not-ready |
-| |Unknown | NoExecute | node.kubernetes.io/unreachable |
-|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |
+| |Unknown | NoExecute | node.kubernetes.io/unreachable |
+|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |
| |False | - | |
| |Unknown | - | |
|MemoryPressure |True | NoSchedule | node.kubernetes.io/memory-pressure |
@@ -32,6 +32,9 @@ In addition to this, with taint-based-eviction, the Node Controller already tain
|NetworkUnavailable |True | NoSchedule | node.kubernetes.io/network-unavailable |
| |False | - | |
| |Unknown | - | |
+|PIDPressure |True | NoSchedule | node.kubernetes.io/pid-pressure |
+| |False | - | |
+| |Unknown | - | |
For example, if a CNI network is not detected on the node (e.g. a network is unavailable), the Node Controller will taint the node with `node.kubernetes.io/network-unavailable=:NoSchedule`. This will then allow users to add a toleration to their `PodSpec`, ensuring that the pod can be scheduled to this node if necessary. If the kubelet has not updated the node’s status after a grace period, the Node Controller will only taint the node with `node.kubernetes.io/unreachable`; it will not taint the node with any unknown condition.
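For illustration, a minimal `PodSpec` toleration for that taint might look like the following sketch (the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cni-tolerant-pod
spec:
  tolerations:
  - key: node.kubernetes.io/network-unavailable
    operator: Exists
    effect: NoSchedule
  containers:
  ...
```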
diff --git a/contributors/design-proposals/storage/container-storage-interface.md b/contributors/design-proposals/storage/container-storage-interface.md
index 1522539a..27e10bd1 100644
--- a/contributors/design-proposals/storage/container-storage-interface.md
+++ b/contributors/design-proposals/storage/container-storage-interface.md
@@ -314,7 +314,7 @@ The attach/detach controller,running as part of the kube-controller-manager bina
When the controller decides to attach a CSI volume, it will call the in-tree CSI volume plugin’s attach method. The in-tree CSI volume plugin’s attach method will do the following:
1. Create a new `VolumeAttachment` object (defined in the “Communication Channels” section) to attach the volume.
- * The name of the of the `VolumeAttachment` object will be `pv-<SHA256(PVName+NodeName)>`.
+ * The name of the `VolumeAttachment` object will be `pv-<SHA256(PVName+NodeName)>`.
* `pv-` prefix is used to allow using other scheme(s) for inline volumes in the future, with their own prefix.
* SHA256 hash is to reduce length of `PVName` plus `NodeName` string, each of which could be max allowed name length (hexadecimal representation of SHA256 is 64 characters).
* `PVName` is `PV.name` of the attached PersistentVolume.
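The naming scheme described above can be sketched in Go; the function name is illustrative, not the actual in-tree plugin code:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// attachmentName builds the VolumeAttachment object name described above:
// a "pv-" prefix followed by the 64-character hex SHA256 of PVName
// concatenated with NodeName, so the result has a fixed length of 67
// characters regardless of how long the input names are.
func attachmentName(pvName, nodeName string) string {
	sum := sha256.Sum256([]byte(pvName + nodeName))
	return fmt.Sprintf("pv-%x", sum)
}

func main() {
	fmt.Println(attachmentName("my-pv", "node-1"))
}
```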
diff --git a/contributors/design-proposals/storage/grow-volume-size.md b/contributors/design-proposals/storage/grow-volume-size.md
index 4fb53292..a968d91c 100644
--- a/contributors/design-proposals/storage/grow-volume-size.md
+++ b/contributors/design-proposals/storage/grow-volume-size.md
@@ -198,7 +198,7 @@ we have considered following options:
Cons:
* I don't know if there is a pattern that exists in kube today for shipping shell scripts that are called out from code in Kubernetes. Flex is
- different because, none of the flex scripts are shipped with Kuberntes.
+ different because, none of the flex scripts are shipped with Kubernetes.
3. Ship resizing tools in a container.
diff --git a/contributors/design-proposals/storage/pv-to-rbd-mapping.md b/contributors/design-proposals/storage/pv-to-rbd-mapping.md
index a64a1018..8071cbbe 100644
--- a/contributors/design-proposals/storage/pv-to-rbd-mapping.md
+++ b/contributors/design-proposals/storage/pv-to-rbd-mapping.md
@@ -55,7 +55,7 @@ the RBD image.
### Pros
- Simple to implement
- Does not cause regression in RBD image names, which remain the same as earlier.
-- The metada information is not immediately visible to RBD admins
+- The metadata information is not immediately visible to RBD admins
### Cons
- NA
diff --git a/contributors/design-proposals/storage/svcacct-token-volume-source.md b/contributors/design-proposals/storage/svcacct-token-volume-source.md
new file mode 100644
index 00000000..3069e677
--- /dev/null
+++ b/contributors/design-proposals/storage/svcacct-token-volume-source.md
@@ -0,0 +1,148 @@
+# Service Account Token Volumes
+
+Authors:
+ @smarterclayton
+ @liggitt
+ @mikedanese
+
+## Summary
+
+Kubernetes is able to provide pods with unique identity tokens that can prove
+the caller is a particular pod to a Kubernetes API server. These tokens are
+injected into pods as secrets. This proposal proposes a new mechanism of
+distribution with support for [improved service account tokens][better-tokens]
+and explores how to migrate from the existing mechanism backwards compatibly.
+
+## Motivation
+
+Many workloads running on Kubernetes need to prove to external parties who they
+are in order to participate in a larger application environment. This identity
+must be attested to by the orchestration system in a way that allows a third
+party to trust that an arbitrary container on the cluster is who it says it is.
+In addition, infrastructure running on top of Kubernetes needs a simple
+mechanism to communicate with the Kubernetes APIs and to provide more complex
+tooling. Finally, a significant set of security challenges are associated with
+storing service account tokens as secrets in Kubernetes and limiting the methods
+whereby malicious parties can get access to these tokens will reduce the risk of
+platform compromise.
+
+As a platform, Kubernetes should evolve to allow identity management systems to
+provide more powerful workload identity without breaking existing use cases, and
+provide a simple out of the box workload identity that is sufficient to cover
+the requirements of bootstrapping low-level infrastructure running on
+Kubernetes. We expect other systems to cover the more advanced scenarios,
+and see this effort as the necessary glue that allows more powerful systems to succeed.
+
+With this feature, we hope to provide a backwards compatible replacement for
+service account tokens that strengthens the security and improves the
+scalability of the platform.
+
+## Proposal
+
+Kubernetes should implement a ServiceAccountToken volume projection that
+maintains a service account token requested by the node from the TokenRequest
+API.
+
+### Token Volume Projection
+
+A new volume projection will be implemented with an API that closely matches the
+TokenRequest API.
+
+```go
+type ProjectedVolumeSource struct {
+ Sources []VolumeProjection
+ DefaultMode *int32
+}
+
+type VolumeProjection struct {
+ Secret *SecretProjection
+ DownwardAPI *DownwardAPIProjection
+ ConfigMap *ConfigMapProjection
+ ServiceAccountToken *ServiceAccountTokenProjection
+}
+
+// ServiceAccountTokenProjection represents a projected service account token
+// volume. This projection can be used to insert a service account token into
+// the pods runtime filesystem for use against APIs (Kubernetes API Server or
+// otherwise).
+type ServiceAccountTokenProjection struct {
+ // Audience is the intended audience of the token. A recipient of a token
+ // must identify itself with an identifier specified in the audience of the
+ // token, and otherwise should reject the token. The audience defaults to the
+ // identifier of the apiserver.
+ Audience string
+ // ExpirationSeconds is the requested duration of validity of the service
+ // account token. As the token approaches expiration, the kubelet volume
+ // plugin will proactively rotate the service account token. The kubelet will
+ // start trying to rotate the token if the token is older than 80 percent of
+ // its time to live or if the token is older than 24 hours. Defaults to 1 hour
+ // and must be at least 10 minutes.
+ ExpirationSeconds int64
+ // Path is the relative path of the file to project the token into.
+ Path string
+}
+```
+
+A volume plugin implemented in the kubelet will project a service account token
+sourced from the TokenRequest API into volumes created from
+ProjectedVolumeSources. As the token approaches expiration, the kubelet volume
+plugin will proactively rotate the service account token. The kubelet will start
+trying to rotate the token if the token is older than 80 percent of its time to
+live or if the token is older than 24 hours.
+
+To replace the current service account token secrets, we also need to inject the
+cluster's CA certificate bundle. Initially we will deploy the data in a ConfigMap
+per namespace and reference it using a ConfigMapProjection.
+
+A projected volume source that is equivalent to the current service account
+secret:
+
+```yaml
+sources:
+- serviceAccountToken:
+ expirationSeconds: 3153600000 # 100 years
+ path: token
+- configMap:
+ name: kube-cacrt
+ items:
+ - key: ca.crt
+ path: ca.crt
+- downwardAPI:
+  items:
+  - path: namespace
+    fieldRef:
+      fieldPath: metadata.namespace
+```
+
+
+This fixes one scalability issue with the current service account token
+deployment model where secret GETs are a large portion of overall apiserver
+traffic.
+
+A projected volume source that requests a token for vault and Istio CA:
+
+```yaml
+sources:
+- serviceAccountToken:
+ path: vault-token
+ audience: vault
+- serviceAccountToken:
+ path: istio-token
+ audience: ca.istio.io
+```
+
+### Alternatives
+
+1. Instead of implementing a service account token volume projection, we could
+ implement all injection as a flex volume or CSI plugin.
+ 1. Both flex volume and CSI are alpha and are unlikely to graduate soon.
+ 1. Virtual kubelets (like Fargate or ACS) may not be able to run flex
+ volumes.
+ 1. Service account tokens are a fundamental part of our API.
+1. Remove service accounts and service account tokens completely from core, use
+ an alternate mechanism that sits outside the platform.
+ 1. Other core features need service account integration, leading to all
+ users needing to install this extension.
+ 1. Complicates installation for the majority of users.
+
+
+[better-tokens]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md
diff --git a/contributors/design-proposals/storage/volume-topology-scheduling.md b/contributors/design-proposals/storage/volume-topology-scheduling.md
index 2603e225..402ca0f9 100644
--- a/contributors/design-proposals/storage/volume-topology-scheduling.md
+++ b/contributors/design-proposals/storage/volume-topology-scheduling.md
@@ -1,24 +1,36 @@
# Volume Topology-aware Scheduling
-Authors: @msau42
+Authors: @msau42, @lichuqiang
This document presents a detailed design for making the default Kubernetes
scheduler aware of volume topology constraints, and making the
PersistentVolumeClaim (PVC) binding aware of scheduling decisions.
+## Definitions
+* Topology: Rules to describe accessibility of an object with respect to
+ location in a cluster.
+* Domain: A grouping of locations within a cluster. For example, 'node1',
+ 'rack10', 'zone5'.
+* Topology Key: A description of a general class of domains. For example,
+ 'node', 'rack', 'zone'.
+* Hierarchical domain: Domain that can be fully encompassed in a larger domain.
+ For example, the 'zone1' domain can be fully encompassed in the 'region1'
+ domain.
+* Failover domain: A domain that a workload intends to run in at a later time.
## Goals
-* Allow a Pod to request one or more topology-constrained Persistent
-Volumes (PV) that are compatible with the Pod's other scheduling
-constraints, such as resource requirements and affinity/anti-affinity
-policies.
-* Support arbitrary PV topology constraints (i.e. node,
-rack, zone, foo, bar).
-* Support topology constraints for statically created PVs and dynamically
-provisioned PVs.
+* Allow topology to be specified for both pre-provisioned and dynamic
+ provisioned PersistentVolumes so that the Kubernetes scheduler can correctly
+ place a Pod using such a volume to an appropriate node.
+* Support arbitrary PV topology domains (i.e. node, rack, zone, foo, bar)
+ without encoding each as first class objects in the Kubernetes API.
+* Allow the Kubernetes scheduler to influence where a volume is provisioned or
+ which pre-provisioned volume to bind to based on scheduling constraints on the
+ Pod requesting a volume, such as Pod resource requirements and
+ affinity/anti-affinity policies.
* No scheduling latency performance regression for Pods that do not use
-topology-constrained PVs.
-
+ PVs with topology.
+* Allow administrators to restrict allowed topologies per StorageClass.
## Non Goals
* Fitting a pod after the initial PVC binding has been completed.
@@ -36,13 +48,34 @@ operator to schedule them together. Another alternative is to merge the two
pods into one.
* For two+ pods non-simultaneously sharing a PVC, this scenario could be
handled by pod priorities and preemption.
+* Provisioning multi-domain volumes where all the domains will be able to run
+ the workload. For example, provisioning a multi-zonal volume and making sure
+ the pod can run in all zones.
+  * The scheduler cannot make decisions based on future resource requirements,
+ especially if those resources can fluctuate over time. For applications that
+ use such multi-domain storage, the best practice is to either:
+ * Configure cluster autoscaling with enough resources to accommodate
+ failing over the workload to any of the other failover domains.
+ * Manually configure and overprovision the failover domains to
+ accommodate the resource requirements of the workload.
+* Scheduler supporting volume topologies that are independent of the node's
+ topologies.
+ * The Kubernetes scheduler only handles topologies with respect to the
+ workload and the nodes it runs on. If a storage system is deployed on an
+    independent topology, it will be up to the provisioner to correctly spread the
+ volumes for a workload. This could be facilitated as a separate feature
+ by:
+ * Passing the Pod's OwnerRef to the provisioner, and the provisioner
+ spreading volumes for Pods with the same OwnerRef
+ * Adding Volume Anti-Affinity policies, and passing those to the
+ provisioner.
## Problem
Volumes can have topology constraints that restrict the set of nodes that the
volume can be accessed on. For example, a GCE PD can only be accessed from a
single zone, and a local disk can only be accessed from a single node. In the
-future, there could be other topology constraints, such as rack or region.
+future, there could be other topology domains, such as rack or region.
A pod that uses such a volume must be scheduled to a node that fits within the
volume’s topology constraints. In addition, a pod can have further constraints
@@ -70,16 +103,21 @@ binding happens without considering if multiple PVCs are related, it is very lik
for the two PVCs to be bound to local disks on different nodes, making the pod
unschedulable.
* For multizone clusters and deployments requesting multiple dynamically provisioned
-zonal PVs, each PVC Is provisioned independently, and is likely to provision each PV
-In different zones, making the pod unschedulable.
+zonal PVs, each PVC is provisioned independently, and is likely to provision each PV
+in different zones, making the pod unschedulable.
To solve the issue of initial volume binding and provisioning causing an impossible
pod placement, volume binding and provisioning should be more tightly coupled with
pod scheduling.
-## New Volume Topology Specification
-To specify a volume's topology constraints in Kubernetes, the PersistentVolume
+## Volume Topology Specification
+First, volumes need a way to express topology constraints against nodes. Today, it
+is done for zonal volumes by having explicit logic to process zone labels on the
+PersistentVolume. However, this is not easily extendable for volumes with other
+topology keys.
+
+Instead, to support a generic specification, the PersistentVolume
object will be extended with a new NodeAffinity field that specifies the
constraints. It will closely mirror the existing NodeAffinity type used by
Pods, but we will use a new type so that we will not be bound by existing and
@@ -107,18 +145,27 @@ weights, but will not be included in the initial implementation.
The advantages of this NodeAffinity field vs the existing method of using zone labels
on the PV are:
-* We don't need to expose first-class labels for every topology domain.
-* Implementation does not need to be updated every time a new topology domain
+* We don't need to expose first-class labels for every topology key.
+* Implementation does not need to be updated every time a new topology key
is added to the cluster.
* NodeSelector is able to express more complex topology with ANDs and ORs.
+* NodeAffinity aligns with how topology is represented with other Kubernetes
+ resources.
Some downsides include:
* You can have a proliferation of Node labels if you are running many different
kinds of volume plugins, each with their own topology labeling scheme.
+* The NodeSelector is more expressive than what most storage providers will
+ need. Most storage providers only need a single topology key with
+ one or more domains. Non-hierarchical domains may present implementation
+ challenges, and it will be difficult to express all the functionality
+ of a NodeSelector in a non-Kubernetes specification like CSI.
### Example PVs with NodeAffinity
#### Local Volume
+In this example, the volume can only be accessed from nodes that have the
+label key `kubernetes.io/hostname` and label value `node-1`.
```
apiVersion: v1
kind: PersistentVolume
@@ -141,6 +188,9 @@ spec:
```
#### Zonal Volume
+In this example, the volume can only be accessed from nodes that have the
+label key `failure-domain.beta.kubernetes.io/zone` and label value
+`us-central1-a`.
```
apiVersion: v1
kind: PersistentVolume
@@ -164,6 +214,9 @@ spec:
```
#### Multi-Zonal Volume
+In this example, the volume can only be accessed from nodes that have the
+label key `failure-domain.beta.kubernetes.io/zone` and label value
+`us-central1-a` OR `us-central1-b`.
```
apiVersion: v1
kind: PersistentVolume
@@ -187,19 +240,154 @@ spec:
- us-central1-b
```
-### Default Specification
-Existing admission controllers and dynamic provisioners for zonal volumes
-will be updated to specify PV NodeAffinity in addition to the existing zone
-and region labels. This will handle newly created PV objects.
+#### Multi Label Volume
+In this example, the volume needs two labels to uniquely identify the topology.
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ Name: rack-volume-1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: my-class
+ csi:
+ driver: my-rack-storage-driver
+ volumeHandle: my-vol
+ volumeAttributes:
+ foo: bar
+ nodeAffinity:
+ required:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ operator: In
+ values:
+ - us-central1-a
+ - key: foo.io/rack
+ operator: In
+ values:
+ - rack1
+```
+
+### Zonal PV Upgrade and Downgrade
+Upgrading of zonal PVs to use the new PV.NodeAffinity API can be phased in as
+follows:
+
+1. Update PV label admission controllers to specify the new PV.NodeAffinity. New
+ PVs created will automatically use the new PV.NodeAffinity. Existing PVs are
+ not updated yet, so on a downgrade, existing PVs are unaffected. New PVCs
+ should be deleted and recreated if there were problems with this feature.
+2. Once PV.NodeAffinity is GA, deprecate the VolumeZoneChecker scheduler
+ predicate. Add a zonal PV upgrade controller to convert existing PVs. At this
+ point, if there are issues with this feature, then on a downgrade, the
+ VolumeScheduling feature would also need to be disabled.
+3. After deprecation period, remove VolumeZoneChecker predicate and PV upgrade
+ controller.
+
+The zonal PV upgrade controller will convert existing PVs leveraging the
+existing zonal scheduling logic using labels to PV.NodeAffinity. It will keep
+the existing labels for backwards compatibility.
+
+For example, this zonal volume:
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: zonal-volume-1
+ labels:
+ failure-domain.beta.kubernetes.io/zone: us-central1-a
+ failure-domain.beta.kubernetes.io/region: us-central1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: my-class
+ gcePersistentDisk:
+ diskName: my-disk
+ fsType: ext4
+```
+
+will be converted to:
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: zonal-volume-1
+ labels:
+ failure-domain.beta.kubernetes.io/zone: us-central1-a
+ failure-domain.beta.kubernetes.io/region: us-central1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: my-class
+ gcePersistentDisk:
+ diskName: my-disk
+ fsType: ext4
+ nodeAffinity:
+ required:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ operator: In
+ values:
+ - us-central1-a
+ - key: failure-domain.beta.kubernetes.io/region
+ operator: In
+ values:
+ - us-central1
+```
-Existing PV objects will have to be upgraded to use the new NodeAffinity field.
-This does not have to occur instantaneously, and can be updated within the
-deprecation period.
+### Multi-Zonal PV Upgrade
+The zone label for multi-zonal volumes needs to be specially parsed.
-TODO: This can be done through one of the following methods:
-- Manual updates/scripts
-- cluster/update-storage-objects.sh?
-- A new PV update controller
+For example, this multi-zonal volume:
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: multi-zonal-volume-1
+ labels:
+ failure-domain.beta.kubernetes.io/zone: us-central1-a__us-central1-b
+ failure-domain.beta.kubernetes.io/region: us-central1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: my-class
+ gcePersistentDisk:
+ diskName: my-disk
+ fsType: ext4
+```
+
+will be converted to:
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: multi-zonal-volume-1
+ labels:
+ failure-domain.beta.kubernetes.io/zone: us-central1-a__us-central1-b
+ failure-domain.beta.kubernetes.io/region: us-central1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: my-class
+ gcePersistentDisk:
+ diskName: my-disk
+ fsType: ext4
+ nodeAffinity:
+ required:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ operator: In
+ values:
+ - us-central1-a
+ - us-central1-b
+ - key: failure-domain.beta.kubernetes.io/region
+ operator: In
+ values:
+ - us-central1
+```
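The special parsing amounts to splitting the zone label value on the `__` separator, as seen in the example above; a minimal Go sketch (the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// zonesFromLabel splits a multi-zonal zone label value such as
// "us-central1-a__us-central1-b" into its individual zones, which the
// upgrade controller can then emit as NodeAffinity values.
func zonesFromLabel(labelValue string) []string {
	return strings.Split(labelValue, "__")
}

func main() {
	fmt.Println(zonesFromLabel("us-central1-a__us-central1-b")) // prints [us-central1-a us-central1-b]
}
```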
### Bound PVC Enforcement
For PVCs that are already bound to a PV with NodeAffinity, enforcement is
@@ -225,82 +413,362 @@ Both binding decisions of:
will be considered by the scheduler, so that all of a Pod's scheduling
constraints can be evaluated at once.
-The rest of this document describes the detailed design for implementing this
-new volume binding behavior.
-
-
-## New Volume Binding Design
-The design can be broken up into a few areas:
-* User-facing API to invoke new behavior
-* Integrating PV binding with pod scheduling
-* Binding multiple PVCs as a single transaction
-* Recovery from kubelet rejection of pod
-* Making dynamic provisioning topology-aware
-
-For the alpha phase, only the user-facing API and PV binding and scheduler
-integration are necessary. The remaining areas can be handled in beta and GA
-phases.
+The detailed design for implementing this new volume binding behavior will be
+described later in the scheduler integration section.
-### User-facing API
-In alpha, this feature is controlled by a feature gate, VolumeScheduling, and
-must be configured in the kube-scheduler and kube-controller-manager.
+## Delayed Volume Binding
+Today, volume binding occurs immediately once a PersistentVolumeClaim is
+created. In order for volume binding to take into account all of a pod's other scheduling
+constraints, volume binding must be delayed until a Pod is being scheduled.
-A new StorageClass field will be added to control the volume binding behavior.
+A new StorageClass field `BindingMode` will be added to control the volume
+binding behavior.
```
type StorageClass struct {
...
- VolumeBindingMode *VolumeBindingMode
+ BindingMode *BindingMode
}
-type VolumeBindingMode string
+type BindingMode string
const (
- VolumeBindingImmediate VolumeBindingMode = "Immediate"
- VolumeBindingWaitForFirstConsumer VolumeBindingMode = "WaitForFirstConsumer"
+ BindingImmediate BindingMode = "Immediate"
+ BindingWaitForFirstConsumer BindingMode = "WaitForFirstConsumer"
)
```
-`VolumeBindingImmediate` is the default and current binding method.
+`BindingImmediate` is the default and current binding method.
-This approach allows us to introduce the new binding behavior gradually and to
-be able to maintain backwards compatibility without deprecation of previous
-behavior. However, it has a few downsides:
+This approach allows us to:
+* Introduce the new binding behavior gradually.
+* Maintain backwards compatibility without deprecation of previous
+ behavior. Any automation that waits for PVCs to be bound before scheduling Pods
+ will not break.
+* Support scenarios where volume provisioning for globally-accessible volume
+ types could take a long time, where volume provisioning is a planned
+ event well in advance of workload deployment.
+
+However, it has a few downsides:
* StorageClass will be required to get the new binding behavior, even if dynamic
-provisioning is not used (in the case of local storage).
-* We have to maintain two different paths for volume binding.
+ provisioning is not used (in the case of local storage).
+* We have to maintain two different code paths for volume binding.
* We will be depending on the storage admin to correctly configure the
-StorageClasses for the volume types that need the new binding behavior.
+ StorageClasses for the volume types that need the new binding behavior.
* User experience can be confusing because PVCs could have different binding
-behavior depending on the StorageClass configuration. We will mitigate this by
-adding a new PVC event to indicate if binding will follow the new behavior.
+ behavior depending on the StorageClass configuration. We will mitigate this by
+ adding a new PVC event to indicate if binding will follow the new behavior.
+
+
+## Dynamic Provisioning with Topology
+To make dynamic provisioning aware of pod scheduling decisions, delayed volume
+binding must also be enabled. The scheduler will pass its selected node to the
+dynamic provisioner, and the provisioner will create a volume in the topology
+domain that the selected node is part of. The domain depends on the volume
+plugin. Zonal volume plugins will create the volume in the zone where the
+selected node is. The local volume plugin will create the volume on the
+selected node.
-### Integrating binding with scheduling
-For the alpha phase, the focus is on static provisioning of PVs to support
-persistent local storage.
+### End to End Zonal Example
+This is an example of the most common use case for provisioning zonal volumes.
+For this use case, the user's specs are unchanged. Only one change
+to the StorageClass is needed to enable delayed volume binding.
+1. Admin sets up StorageClass, setting up delayed volume binding.
+```
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+ name: standard
+provisioner: kubernetes.io/gce-pd
+bindingMode: WaitForFirstConsumer
+parameters:
+ type: pd-standard
+```
+2. Admin launches provisioner. For in-tree plugins, nothing needs to be done.
+3. User creates PVC. Nothing changes in the spec, although now the PVC won't be
+ immediately bound.
+```
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: my-pvc
+spec:
+ storageClassName: standard
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 100Gi
+```
+4. User creates Pod. Nothing changes in the spec.
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: my-pod
+spec:
+ containers:
+ ...
+ volumes:
+ - name: my-vol
+ persistentVolumeClaim:
+ claimName: my-pvc
+```
+5. Scheduler picks a node that can satisfy the Pod and
+ [passes it](#pv-controller-changes) to the provisioner.
+6. Provisioner dynamically provisions a PV that can be accessed from
+ that node.
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: volume-1
+spec:
+ capacity:
+ storage: 100Gi
+ storageClassName: standard
+ gcePersistentDisk:
+ diskName: my-disk
+ fsType: ext4
+ nodeAffinity:
+ required:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ operator: In
+ values:
+ - us-central1-a
+```
+7. Pod gets scheduled to the node.
+
+
+### Restricting Topology
+For the common use case, volumes will be provisioned in whatever topology domain
+the scheduler has decided is best to run the workload. Users may impose further
+restrictions by setting label/node selectors, and pod affinity/anti-affinity
+policies on their Pods. All those policies will be taken into account when
+dynamically provisioning a volume.
+
+While less common, administrators may want to further restrict what topology
+domains are available to a StorageClass. To support these administrator
+policies, an AllowedTopology field can also be specified in the
+StorageClass to restrict the topology domains for dynamic provisioning.
+This is not expected to be a common use case, and there are some caveats,
+described below.
+
+```
+type StorageClass struct {
+ ...
+
+ // Restrict the node topologies where volumes can be dynamically provisioned.
+ // Each volume plugin defines its own supported topology specifications.
+ // Each entry in AllowedTopologies is ORed.
+ AllowedTopologies []TopologySelector
+}
+
+type TopologySelector struct {
+ // Topology must meet all of the TopologySelectorLabelRequirements
+ // These requirements are ANDed.
+ MatchLabelExpressions []TopologySelectorLabelRequirement
+}
+
+// Topology requirement expressed as Node labels.
+type TopologySelectorLabelRequirement struct {
+ // Topology label key
+ Key string
+ // Topology must match at least one of the label Values for the given label Key.
+ // Each entry in Values is ORed.
+ Values []string
+}
+```
+
+A nil value means there are no topology restrictions. A scheduler predicate
+will evaluate a non-nil value when considering dynamic provisioning for a node.
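The OR/AND semantics above can be sketched as a matcher over node labels; this is a sketch for illustration, not the actual scheduler predicate:

```go
package main

import "fmt"

// These types mirror the StorageClass API sketch above.
type TopologySelectorLabelRequirement struct {
	Key    string
	Values []string
}

type TopologySelector struct {
	MatchLabelExpressions []TopologySelectorLabelRequirement
}

// matches reports whether a node's labels satisfy AllowedTopologies:
// entries in the selector list are ORed, requirements within one entry
// are ANDed, and values within one requirement are ORed. An empty list
// means there are no topology restrictions.
func matches(selectors []TopologySelector, nodeLabels map[string]string) bool {
	if len(selectors) == 0 {
		return true
	}
	for _, sel := range selectors {
		satisfied := true
		for _, req := range sel.MatchLabelExpressions {
			value, ok := nodeLabels[req.Key]
			matched := false
			if ok {
				for _, v := range req.Values {
					if v == value {
						matched = true
						break
					}
				}
			}
			if !matched {
				satisfied = false
				break
			}
		}
		if satisfied {
			return true
		}
	}
	return false
}

func main() {
	allowed := []TopologySelector{{MatchLabelExpressions: []TopologySelectorLabelRequirement{{
		Key:    "failure-domain.beta.kubernetes.io/zone",
		Values: []string{"us-central1-a", "us-central1-b"},
	}}}}
	fmt.Println(matches(allowed, map[string]string{
		"failure-domain.beta.kubernetes.io/zone": "us-central1-a",
	})) // prints true
}
```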
+
+The AllowedTopologies will also be provided to provisioners as a new field, detailed in
+the provisioner section. Provisioners can use the allowed topology information
+in the following scenarios:
+* StorageClass is using the default immediate binding mode. This is the
+ legacy topology-unaware behavior. In this scenario, the volume could be
+ provisioned in a domain that cannot run the Pod since it doesn't take any
+ scheduler input.
+* For volumes that span multiple domains, the AllowedTopologies can restrict those
+ additional domains. However, special care must be taken to avoid specifying
+ conflicting topology constraints in the Pod. For example, the administrator could
+ restrict a multi-zonal volume to zones 'zone1' and 'zone2', but the Pod could have
+ constraints that restrict it to 'zone1' and 'zone3'. If 'zone1'
+ fails, the Pod cannot be scheduled to the intended failover zone.
+
+Note that if delayed binding is enabled and the volume spans only a single domain,
+then the AllowedTopologies can be ignored by the provisioner because the
+scheduler would have already taken it into account when it selects the node.
+
+Kubernetes will leave validation and enforcement of the AllowedTopologies content up
+to the provisioner.
+
+Support in the GCE PD and AWS EBS provisioners for the existing `zone` and `zones`
+parameters will not be deprecated due to the CSI in-tree migration requirement
+of CSI plugins supporting all the previous functionality of in-tree plugins, and
+CSI plugin versioning being independent of Kubernetes versions.
+
+Admins must already create a new StorageClass with delayed volume binding to use
+this feature, so the documentation can encourage use of the AllowedTopologies
+instead of existing zone parameters. A plugin-specific admission controller
+can also validate that both zone and AllowedTopologies are not specified,
+although the CSI plugin should still be robust to handle this configuration
+error.
+
+##### Alternatives
+A new restricted TopologySelector is used here instead of reusing
+VolumeNodeAffinity because the provisioning operation requires
+allowed topologies to be explicitly enumerated, while NodeAffinity and
+NodeSelectors allow for non-explicit expressions of topology values (i.e.,
+operators NotIn, Exists, DoesNotExist, Gt, Lt). It would be difficult for
+provisioners to evaluate all the expressions without having to enumerate all the
+Nodes in the cluster.
+
+Another alternative is to have a list of allowed PV topologies, where each
+entry uses the same representation as a single PV's topology. This expression can become
+very verbose for volume types that have multi-dimensional topologies or multiple
+selections. As an example, for a multi-zonal volume that needs to select
+two zones, if an administrator wants to restrict the selection to 4 zones, then
+all 6 combinations need to be explicitly enumerated.
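To make the verbosity concrete, the following sketch (illustrative only, not part of the proposal) enumerates the 2-zone subsets an administrator would have to list explicitly when restricting a two-zone volume to 4 allowed zones:

```go
package main

import "fmt"

// twoZoneSubsets lists every unordered pair of zones. Under the per-PV-topology
// alternative, a two-zone volume restricted to these zones would need one
// explicit entry per pair.
func twoZoneSubsets(zones []string) [][2]string {
	var out [][2]string
	for i := 0; i < len(zones); i++ {
		for j := i + 1; j < len(zones); j++ {
			out = append(out, [2]string{zones[i], zones[j]})
		}
	}
	return out
}

func main() {
	subsets := twoZoneSubsets([]string{"zone1", "zone2", "zone3", "zone4"})
	fmt.Println(len(subsets)) // 6 entries for 4 allowed zones
}
```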
+
+Another alternative is to expand ResourceQuota to support topology constraints.
+However, ResourceQuota is currently only evaluated during admission, and not
+scheduling.
+
+#### Zonal Example
+This example restricts the volumes provisioned to zones us-central1-a and
+us-central1-b.
+```
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+ name: zonal-class
+provisioner: kubernetes.io/gce-pd
+parameters:
+ type: pd-standard
+allowedTopologies:
+- matchLabelExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ values:
+ - us-central1-a
+ - us-central1-b
+```
+
+#### Multi-Zonal Example
+This example restricts the volume's primary and failover zones
+to us-central1-a, us-central1-b and us-central1-c. The regional PD
+provisioner will pick two out of the three zones to provision in.
+```
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+ name: multi-zonal-class
+provisioner: kubernetes.io/gce-pd
+parameters:
+ type: pd-standard
+ replication-type: regional-pd
+allowedTopologies:
+- matchLabelExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ values:
+ - us-central1-a
+ - us-central1-b
+ - us-central1-c
+```
+
+Topologies that are incompatible with the storage provider parameters
+will be rejected by the provisioner. For example, dynamic provisioning
+of regional PDs will fail if provisioning is restricted to fewer than
+two zones in all regions. This configuration will cause provisioning to fail:
+```
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+ name: multi-zonal-class
+provisioner: kubernetes.io/gce-pd
+parameters:
+ type: pd-standard
+ replication-type: regional-pd
+allowedTopologies:
+- matchLabelExpressions:
+ - key: failure-domain.beta.kubernetes.io/zone
+ values:
+ - us-central1-a
+```
+
+#### Multi Label Example
+This example restricts the volume's topology to nodes that
+have the following labels:
+
+* "zone: us-central1-a" and "rack: rack1" or,
+* "zone: us-central1-b" and "rack: rack1" or,
+* "zone: us-central1-b" and "rack: rack2"
+
+```
+apiVersion: storage.k8s.io/v1
+kind: StorageClass
+metadata:
+ name: something-fancy
+provisioner: rack-based-provisioner
+parameters:
+allowedTopologies:
+- matchLabelExpressions:
+ - key: zone
+ values:
+ - us-central1-a
+ - key: rack
+ values:
+ - rack1
+- matchLabelExpressions:
+ - key: zone
+ values:
+ - us-central1-b
+ - key: rack
+ values:
+ - rack1
+ - rack2
+```
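For illustration, a hypothetical Node carrying the following labels would satisfy the second term above ("zone: us-central1-b" and "rack: rack2"):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1          # hypothetical node name
  labels:
    zone: us-central1-b
    rack: rack2
```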
+
+
+## Feature Gates
+PersistentVolume.NodeAffinity and StorageClass.BindingMode fields will be
+controlled by the VolumeScheduling feature gate, which must be enabled in the
+kube-scheduler, kube-controller-manager, and all kubelets.
+
+The StorageClass.AllowedTopologies field will be controlled
+by the DynamicProvisioningScheduling feature gate, which must be enabled in the
+kube-scheduler and kube-controller-manager.
+
+
+## Integrating volume binding with pod scheduling
For the new volume binding mode, the proposed new workflow is:
-1. Admin statically creates PVs and/or StorageClasses.
+1. Admin pre-provisions PVs and/or StorageClasses.
2. User creates unbound PVC and there are no prebound PVs for it.
3. **NEW:** PVC binding and provisioning is delayed until a pod is created that
references it.
4. User creates a pod that uses the PVC.
5. Pod starts to get processed by the scheduler.
-6. **NEW:** A new predicate function, called MatchUnboundPVCs, will look at all of
-a Pod’s unbound PVCs, and try to find matching PVs for that node based on the
-PV topology. If there are no matching PVs, then it checks if dynamic
-provisioning is possible for that node.
+6. **NEW:** A new predicate function, called CheckVolumeBinding, will process
+both bound and unbound PVCs of the Pod. It will validate the VolumeNodeAffinity
+for bound PVCs. For unbound PVCs, it will try to find matching PVs for that node
+based on the PV NodeAffinity. If there are no matching PVs, then it checks if
+dynamic provisioning is possible for that node based on StorageClass
+AllowedTopologies.
7. **NEW:** The scheduler continues to evaluate priorities. A new priority
-function, called PrioritizeUnboundPVCs, will get the PV matches per PVC per
+function, called PrioritizeVolumes, will get the PV matches per PVC per
node, and compute a priority score based on various factors.
8. **NEW:** After evaluating all the existing predicates and priorities, the
-scheduler will pick a node, and call a new assume function, AssumePVCs,
+scheduler will pick a node, and call a new assume function, AssumePodVolumes,
passing in the Node. The assume function will check if any binding or
provisioning operations need to be done. If so, it will update the PV cache to
-mark the PVs with the chosen PVCs.
+mark the PVs with the chosen PVCs and queue the Pod for volume binding.
9. **NEW:** If PVC binding or provisioning is required, we do NOT AssumePod.
-Instead, a new bind function, BindPVCs, will be called asynchronously, passing
+Instead, a new bind function, BindPodVolumes, will be called asynchronously, passing
in the selected node. The bind function will prebind the PV to the PVC, or
trigger dynamic provisioning. Then, it always sends the Pod through the
scheduler again for reasons explained later.
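The new steps 6–9 above can be sketched as follows. This is an illustrative simplification, not the actual scheduler code; the types and binding decisions are stubbed out:

```go
package main

import "fmt"

type Node struct{ Name string }
type Pod struct {
	Name        string
	UnboundPVCs []string
}

// Stubbed volume-aware hooks; the real implementations consult PV
// NodeAffinity and StorageClass AllowedTopologies.
func CheckVolumeBinding(pod Pod, node Node) bool { return node.Name != "node-a" } // step 6
func PrioritizeVolumes(pod Pod, nodes []Node) Node { return nodes[0] }            // step 7
func AssumePodVolumes(pod Pod, node Node) bool { return len(pod.UnboundPVCs) > 0 } // step 8
func BindPodVolumes(pod Pod, node Node) { fmt.Println("binding volumes on", node.Name) } // step 9

// scheduleOnce returns the chosen node and whether Pod binding was deferred
// pending volume binding/provisioning.
func scheduleOnce(pod Pod, nodes []Node) (string, bool) {
	var feasible []Node
	for _, n := range nodes {
		if CheckVolumeBinding(pod, n) {
			feasible = append(feasible, n)
		}
	}
	chosen := PrioritizeVolumes(pod, feasible)
	if AssumePodVolumes(pod, chosen) {
		// The real scheduler calls this asynchronously and re-queues the Pod.
		BindPodVolumes(pod, chosen)
		return chosen.Name, true
	}
	return chosen.Name, false // continue with normal Pod binding
}

func main() {
	pod := Pod{Name: "web-0", UnboundPVCs: []string{"data"}}
	name, deferred := scheduleOnce(pod, []Node{{"node-a"}, {"node-b"}})
	fmt.Println(name, deferred)
}
```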
@@ -328,15 +796,18 @@ avoid these error conditions are to:
* Separate out volumes that the user prebinds from the volumes that are
available for the system to choose from by StorageClass.
-#### PV Controller Changes
+### PV Controller Changes
When the feature gate is enabled, the PV controller needs to skip binding
unbound PVCs with VolumeBindingWaitForFirstConsumer and no prebound PVs
to let it come through the scheduler path.
Dynamic provisioning will also be skipped if
-VolumBindingWaitForFirstConsumer is set. The scheduler will signal to
+VolumeBindingWaitForFirstConsumer is set. The scheduler will signal to
the PV controller to start dynamic provisioning by setting the
-`annStorageProvisioner` annotation in the PVC.
+`annSelectedNode` annotation in the PVC. If provisioning fails, the PV
+controller can signal back to the scheduler to retry dynamic provisioning by
+removing the `annSelectedNode` annotation. For external provisioners, the
+external provisioner needs to remove the annotation.
No other state machine changes are required. The PV controller continues to
handle the remaining scenarios without any change.
@@ -344,14 +815,39 @@ handle the remaining scenarios without any change.
The methods to find matching PVs for a claim and prebind PVs need to be
refactored for use by the new scheduler functions.
-#### Scheduler Changes
+### Dynamic Provisioning interface changes
+The dynamic provisioning interfaces will be updated to pass in:
+* selectedNode, when late binding is enabled on the StorageClass
+* allowedTopologies, when it is set in the StorageClass
+
+If selectedNode is set, the provisioner should get its appropriate topology
+labels from the Node object, and provision a volume based on those topology
+values. In the common use case for a volume supporting a single topology domain,
+if selectedNode is set, then allowedTopologies can be ignored by the provisioner.
+However, multi-domain volume provisioners may still need to look at
+allowedTopologies to restrict the remaining domains.
+
+In-tree provisioners:
+```
+Provision(selectedNode *v1.Node, allowedTopologies *storagev1.VolumeProvisioningTopology) (*v1.PersistentVolume, error)
+```
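A provisioner honoring these inputs might look like the following sketch. The types are simplified, and the zone label key and fallback behavior are assumptions for illustration, not part of the interface:

```go
package main

import (
	"errors"
	"fmt"
)

const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

type Node struct{ Labels map[string]string }

// pickZone chooses the provisioning zone: the selected node's zone label wins;
// otherwise fall back to the first allowed zone; otherwise report that the
// provisioner is free to pick any zone.
func pickZone(selectedNode *Node, allowedZones []string) (string, error) {
	if selectedNode != nil {
		if z, ok := selectedNode.Labels[zoneLabel]; ok {
			return z, nil
		}
		return "", errors.New("selected node has no zone label")
	}
	if len(allowedZones) > 0 {
		return allowedZones[0], nil
	}
	return "", errors.New("no topology constraint given")
}

func main() {
	node := &Node{Labels: map[string]string{zoneLabel: "us-central1-b"}}
	z, _ := pickZone(node, nil)
	fmt.Println(z)
}
```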
+
+External provisioners:
+* selectedNode will be represented by the PVC annotation "volume.alpha.kubernetes.io/selectedNode".
+ Value is the name of the node.
+* allowedTopologies must be obtained by looking at the StorageClass for the PVC.
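For example, a PVC annotated by the scheduler for an external provisioner might look like the following. The PVC name and values are illustrative; the annotation key is the one proposed above:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0
  annotations:
    volume.alpha.kubernetes.io/selectedNode: node-1   # set by the scheduler
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: zonal-class
  resources:
    requests:
      storage: 10Gi
```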
-##### Predicate
+#### New Permissions
+Provisioners will need to be able to get Node and StorageClass objects.
+
+### Scheduler Changes
+
+#### Predicate
A new predicate function checks that all of a Pod's unbound PVCs can be satisfied
by existing PVs or dynamically provisioned PVs that are
topologically-constrained to the Node.
```
-MatchUnboundPVCs(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
+CheckVolumeBinding(pod *v1.Pod, node *v1.Node) (canBeBound bool, err error)
```
1. If all the Pod’s PVCs are bound, return true.
2. Otherwise try to find matching PVs for all of the unbound PVCs in order of
@@ -361,69 +857,72 @@ decreasing requested capacity.
5. Temporarily cache this PV choice for the PVC per Node, for fast
processing later in the priority and bind functions.
6. Return true if all PVCs are matched.
-7. If there are still unmatched PVCs, check if dynamic provisioning is possible.
-For this alpha phase, the provisioner is not topology aware, so the predicate
-will just return true if there is a provisioner specified in the StorageClass
-(internal or external).
+7. If there are still unmatched PVCs, check if dynamic provisioning is possible
+ by evaluating StorageClass.AllowedTopologies. If so,
+ temporarily cache this decision in the PVC per Node.
8. Otherwise return false.
-##### Priority
+#### Priority
After all the predicates run, there is a reduced set of Nodes that can fit a
Pod. A new priority function will rank the remaining nodes based on the
unbound PVCs and their matching PVs.
```
-PrioritizeUnboundPVCs(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error)
+PrioritizeVolumes(pod *v1.Pod, filteredNodes HostPriorityList) (rankedNodes HostPriorityList, err error)
```
1. For each Node, get the cached PV matches for the Pod’s PVCs.
2. Compute a priority score for the Node using the following factors:
1. How close the PVC’s requested capacity and PV’s capacity are.
- 2. Matching static PVs is preferred over dynamic provisioning because we
+ 2. Matching pre-provisioned PVs is preferred over dynamic provisioning because we
assume that the administrator has specifically created these PVs for
the Pod.
TODO (beta): figure out weights and exact calculation
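Since the weights are still TODO, the following is only one illustrative way such a score could combine the two factors above; the numbers are placeholders, not the proposal's:

```go
package main

import "fmt"

// score ranks a candidate PV match for a PVC: a tighter capacity fit earns up
// to 10 points, and pre-provisioned PVs get a flat bonus over dynamic
// provisioning. Both weights are made up for illustration.
func score(requestedGi, pvGi int64, preProvisioned bool) int64 {
	s := requestedGi * 10 / pvGi // 10 when the PV is an exact fit
	if preProvisioned {
		s += 5
	}
	return s
}

func main() {
	fmt.Println(score(100, 100, true)) // exact fit, pre-provisioned
	fmt.Println(score(50, 100, false)) // loose fit, dynamic
}
```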
-##### Assume
+#### Assume
Once all the predicates and priorities have run, then the scheduler picks a
Node. Then we can bind or provision PVCs for that Node. For better scheduler
performance, we’ll assume that the binding will likely succeed, and update the
-PV cache first. Then the actual binding API update will be made
+PV and PVC caches first. Then the actual binding API update will be made
asynchronously, and the scheduler can continue processing other Pods.
-For the alpha phase, the AssumePVCs function will be directly called by the
-For the alpha phase, the AssumePVCs function will be directly called by the
+For the alpha phase, the AssumePodVolumes function will be directly called by the
scheduler. We’ll consider creating a generic scheduler interface in a
subsequent phase.
```
-AssumePVCs(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
+AssumePodVolumes(pod *v1.Pod, node *v1.Node) (pvcBindingRequired bool, err error)
```
1. If all the Pod’s PVCs are bound, return false.
-2. For static PV binding:
+2. For pre-provisioned PV binding:
1. Get the cached matching PVs for the PVCs on that Node.
2. Validate the actual PV state.
3. Mark PV.ClaimRef in the PV cache.
4. Cache the PVs that need binding in the Pod object.
3. For in-tree and external dynamic provisioning:
- 1. Cache the PVCs that need provisioning in the Pod object.
-4. Return true.
+ 1. Set the `annSelectedNode` annotation on the PVC in the PVC cache.
+ 2. Cache the PVCs that need provisioning in the Pod object.
+4. Return true.
+
+#### Bind
+If AssumePodVolumes returns pvcBindingRequired, then the Pod is queued for volume
+binding and provisioning. A separate goroutine will process this queue and
+call the BindPodVolumes function.
-##### Bind
-If AssumePVCs returns pvcBindingRequired, then the BindPVCs function is called
-as a go routine. Otherwise, we can continue with assuming and binding the Pod
+Otherwise, we can continue with assuming and binding the Pod
to the Node.
-For the alpha phase, the BindUnboundPVCs function will be directly called by the
-For the alpha phase, the BindUnboundPVCs function will be directly called by the
+For the alpha phase, the BindPodVolumes function will be directly called by the
scheduler. We’ll consider creating a generic scheduler interface in a subsequent
phase.
```
-BindUnboundPVCs(pod *v1.Pod, node *v1.Node) (err error)
+BindPodVolumes(pod *v1.Pod, node *v1.Node) (err error)
```
-1. For static PV binding:
+1. For pre-provisioned PV binding:
1. Prebind the PV by updating the `PersistentVolume.ClaimRef` field.
2. If the prebind fails, revert the cache updates.
2. For in-tree and external dynamic provisioning:
- 1. Set `annStorageProvisioner` on the PVC.
+ 1. Set `annSelectedNode` on the PVC.
3. Send Pod back through scheduling, regardless of success or failure.
1. In the case of success, we need one more pass through the scheduler in
order to evaluate other volume predicates that require the PVC to be bound, as
@@ -433,16 +932,16 @@ described below.
TODO: pv controller has a high resync frequency, do we need something similar
for the scheduler too
-##### Access Control
-Scheduler will need PV update permissions for prebinding static PVs, and PVC
-modify permissions for triggering dynamic provisioning.
+#### Access Control
+Scheduler will need PV update permissions for prebinding pre-provisioned PVs, and PVC
+update permissions for triggering dynamic provisioning.
-##### Pod preemption considerations
-The MatchUnboundPVs predicate does not need to be re-evaluated for pod
+#### Pod preemption considerations
+The CheckVolumeBinding predicate does not need to be re-evaluated for pod
preemption. Preempting a pod that uses a PV will not free up capacity on that
node because the PV lifecycle is independent of the Pod’s lifecycle.
-##### Other scheduler predicates
+#### Other scheduler predicates
Currently, there are a few existing scheduler predicates that require the PVC
to be bound. The bound assumption needs to be changed in order to work with
this new workflow.
@@ -452,7 +951,7 @@ running predicates? One possible way is to mark at the beginning of scheduling
a Pod if all PVCs were bound. Then we can check if a second scheduler pass is
needed.
-###### Max PD Volume Count Predicate
+##### Max PD Volume Count Predicate
This predicate checks the maximum number of PDs per node is not exceeded. It
needs to be integrated into the binding decision so that we don’t bind or
provision a PV if it’s going to cause the node to exceed the max PD limit. But
@@ -460,7 +959,7 @@ until it is integrated, we need to make one more pass in the scheduler after all
the PVCs are bound. The current copy of the predicate in the default scheduler
has to remain to account for the already-bound volumes.
-###### Volume Zone Predicate
+##### Volume Zone Predicate
This predicate makes sure that the zone label on a PV matches the zone label of
the node. If the volume is not bound, this predicate can be ignored, as the
binding logic will take into account zone constraints on the PV.
@@ -475,18 +974,18 @@ This predicate needs to remain in the default scheduler to handle the
already-bound volumes using the old zonal labeling. It can be removed once that
mechanism is deprecated and unsupported.
-###### Volume Node Predicate
+##### Volume Node Predicate
This is a new predicate added in 1.7 to handle the new PV node affinity. It
evaluates the node affinity against the node’s labels to determine if the pod
can be scheduled on that node. If the volume is not bound, this predicate can
be ignored, as the binding logic will take into account the PV node affinity.
-##### Caching
+#### Caching
There are two new caches needed in the scheduler.
The first cache is for handling the PV/PVC API binding updates occurring
-asynchronously with the main scheduler loop. `AssumePVCs` needs to store
-the updated API objects before `BindUnboundPVCs` makes the API update, so
+asynchronously with the main scheduler loop. `AssumePodVolumes` needs to store
+the updated API objects before `BindPodVolumes` makes the API update, so
that future binding decisions will not choose any assumed PVs. In addition,
if the API update fails, the cached updates need to be reverted and restored
with the actual API object. The cache will return either the cached-only
@@ -507,6 +1006,8 @@ all the volume predicates are fully run once all PVCs are bound.
* Caching PV matches per node decisions that the predicate had made. This is
an optimization to avoid walking through all the PVs again in priority and
assume functions.
+* Caching PVC dynamic provisioning decisions per node that the predicate had
+ made.
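The assume-then-revert behavior of the first cache can be sketched as follows — a toy model keyed by object name, not the real implementation:

```go
package main

import "fmt"

// assumeCache overlays optimistic (assumed) updates on top of the informer's
// view of the API. Reads prefer assumed state; a failed API update reverts to
// the informer state.
type assumeCache struct {
	informer map[string]string // state from the apiserver via informers
	assumed  map[string]string // optimistic updates made while binding
}

func newAssumeCache() *assumeCache {
	return &assumeCache{informer: map[string]string{}, assumed: map[string]string{}}
}

func (c *assumeCache) Get(name string) string {
	if v, ok := c.assumed[name]; ok {
		return v
	}
	return c.informer[name]
}

func (c *assumeCache) Assume(name, claimRef string) { c.assumed[name] = claimRef }
func (c *assumeCache) Restore(name string)          { delete(c.assumed, name) }

func main() {
	c := newAssumeCache()
	c.informer["pv-1"] = ""           // unbound in the API
	c.Assume("pv-1", "ns/data-web-0") // scheduler assumes the binding
	fmt.Println(c.Get("pv-1"))
	c.Restore("pv-1") // API update failed: revert to informer state
	fmt.Println(c.Get("pv-1") == "")
}
```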
#### Performance and Optimizations
Let:
@@ -524,14 +1025,7 @@ PVs for every node, so its running time is O(NV).
A few optimizations can be made to improve the performance:
-1. Optimizing for PVs that don’t use node affinity (to prevent performance
-regression):
- 1. Index the PVs by StorageClass and only search the PV list with matching
-StorageClass.
- 2. Keep temporary state in the PVC cache if we previously succeeded or
-failed to match PVs, and if none of the PVs have node affinity. Then we can
-skip PV matching on subsequent nodes, and just return the result of the first
-attempt.
+1. PVs that don’t use node affinity should not be using delayed binding.
2. Optimizing for PVs that have node affinity:
1. When a static PV is created, if node affinity is present, evaluate it
against all the nodes. For each node, keep an in-memory map of all its PVs
@@ -541,7 +1035,7 @@ match against the PVs in the node’s PV map instead of the cluster-wide PV list
For the alpha phase, the optimizations are not required. However, they should
be required for beta and GA.
-#### Packaging
+### Packaging
The new bind logic that is invoked by the scheduler can be packaged in a few
ways:
* As a library to be directly called in the default scheduler
@@ -556,7 +1050,7 @@ for more race conditions due to the caches being out of sync.
because the scheduler’s cache and PV controller’s cache have different interfaces
and private methods.
-##### Extender cons
+#### Extender cons
However, the cons of the extender approach outweigh the cons of the library
approach.
@@ -578,18 +1072,18 @@ Kubernetes.
With all this complexity, the library approach is the most feasible in a single
release time frame, and aligns better with the current Kubernetes architecture.
-#### Downsides
+### Downsides
-##### Unsupported Use Cases
+#### Unsupported Use Cases
The following use cases will not be supported for PVCs with a StorageClass with
-VolumeBindingWaitForFirstConsumer:
+VolumeBindingWaitForFirstConsumer:
* Directly setting Pod.Spec.NodeName
* DaemonSets
These two use cases will bypass the default scheduler and thus will not
trigger PV binding.
-##### Custom Schedulers
+#### Custom Schedulers
Custom schedulers, controllers and operators that handle pod scheduling and want
to support this new volume binding mode will also need to handle the volume
binding decision.
@@ -604,7 +1098,7 @@ easier for custom schedulers to include in their own implementation.
In general, many advanced scheduling features have been added into the default
scheduler, such that it is becoming more difficult to run without it.
-##### HA Master Upgrades
+#### HA Master Upgrades
HA masters add a bit of complexity to this design because the active scheduler
process and active controller-manager (PV controller) process can be on different
nodes. That means during an HA master upgrade, the scheduler and controller-manager
@@ -624,9 +1118,9 @@ all dependencies are at the required versions.
For alpha, this is not concerning, but it needs to be solved by GA.
-#### Other Alternatives Considered
+### Other Alternatives Considered
-##### One scheduler function
+#### One scheduler function
An alternative design considered was to do the predicate, priority and bind
functions all in one function at the end right before Pod binding, in order to
reduce the number of passes we have to make over all the PVs. However, this
@@ -641,7 +1135,7 @@ on a Node that the higher priority pod still cannot run on due to PVC
requirements. For that reason, the PVC binding decision needs to have its
predicate function separated out and evaluated with the rest of the predicates.
-##### Pull entire PVC binding into the scheduler
+#### Pull entire PVC binding into the scheduler
The proposed design only has the scheduler initiating the binding transaction
by prebinding the PV. An alternative is to pull the whole two-way binding
transaction into the scheduler, but there are some complex scenarios that
@@ -653,7 +1147,7 @@ scheduler’s Pod sync loop cannot handle:
Handling these scenarios in the scheduler’s Pod sync loop is not possible, so
they have to remain in the PV controller.
-##### Keep all PVC binding in the PV controller
+#### Keep all PVC binding in the PV controller
Instead of initiating PV binding in the scheduler, have the PV controller wait
until the Pod has been scheduled to a Node, and then try to bind based on the
chosen Node. A new scheduling predicate is still needed to filter and match
@@ -685,7 +1179,7 @@ can make a lot of wrong decisions after the restart.
evaluated. To solve this, all the volume predicates need to also be built into
the PV controller when matching possible PVs.
-##### Move PVC binding to kubelet
+#### Move PVC binding to kubelet
Looking into the future, with the potential for NUMA-aware scheduling, you could
have a sub-scheduler on each node to handle the pod scheduling within a node. It
could make sense to have the volume binding as part of this sub-scheduler, to make
@@ -699,7 +1193,7 @@ to just that node, but for zonal storage, it could see all the PVs in that zone.
In addition, the sub-scheduler is just a thought at this point, and there are no
concrete proposals in this area yet.
-### Binding multiple PVCs in one transaction
+## Binding multiple PVCs in one transaction
There are no plans to handle this, but a possible solution is presented here if the
need arises in the future. Since the scheduler is serialized, a partial binding
failure should be a rare occurrence and would only be caused if there is a user or
@@ -720,25 +1214,12 @@ If scheduling fails, update all bound PVCs with an annotation,
are clean. The scheduler and kubelet need to reject pods with PVCs that are
undergoing rollback.
-### Recovering from kubelet rejection of pod
+## Recovering from kubelet rejection of pod
We can use the same rollback mechanism as above to handle this case.
If kubelet rejects a pod, it will go back to scheduling. If the scheduler
cannot find a node for the pod, then it will encounter scheduling failure and
initiate the rollback.
-### Making dynamic provisioning topology aware
-TODO (beta): Design details
-
-For alpha, we are not focusing on this use case. But it should be able to
-follow the new workflow closely with some modifications.
-* The FindUnboundPVCs predicate function needs to get provisionable capacity per
-topology dimension from the provisioner somehow.
-* The PrioritizeUnboundPVCs priority function can add a new priority score factor
-based on available capacity per node.
-* The BindUnboundPVCs bind function needs to pass in the node to the provisioner.
-The internal and external provisioning APIs need to be updated to take in a node
-parameter.
-
## Testing
@@ -752,7 +1233,7 @@ parameter.
* Multiple PVCs specified in a pod
* Positive: Enough local PVs available on a single node
* Negative: Not enough local PVs available on a single node
-* Fallback to dynamic provisioning if unsuitable static PVs
+* Fallback to dynamic provisioning if unsuitable pre-provisioned PVs
### Unit tests
* All PVCs found a match on first node. Verify match is best suited based on
diff --git a/contributors/devel/api_changes.md b/contributors/devel/api_changes.md
index 2440902e..303c43a8 100644
--- a/contributors/devel/api_changes.md
+++ b/contributors/devel/api_changes.md
@@ -365,10 +365,14 @@ being required otherwise.
### Edit defaults.go
If your change includes new fields for which you will need default values, you
-need to add cases to `pkg/apis/<group>/<version>/defaults.go` (the core v1 API
-is special, its defaults.go is at `pkg/api/v1/defaults.go`. For simplicity, we
-will not mention this special case in the rest of the article). Of course, since
-you have added code, you have to add a test:
+need to add cases to `pkg/apis/<group>/<version>/defaults.go`.
+
+*Note:* In the past the core v1 API
+was special. Its `defaults.go` used to live at `pkg/api/v1/defaults.go`.
+If you see code referencing that path, you can be sure it's outdated. The core v1 API now lives at
+`pkg/apis/core/v1/defaults.go`, which follows the above convention.
+
+Of course, since you have added code, you have to add a test:
`pkg/apis/<group>/<version>/defaults_test.go`.
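A defaulting function following this convention might look like the sketch below; the `Widget` type and its field are made up for illustration:

```go
package main

import "fmt"

// Widget is a hypothetical API type. Replicas is a pointer so that an unset
// value (nil) can be distinguished from an explicit 0.
type Widget struct {
	Replicas *int32
}

// SetDefaults_Widget mirrors the SetDefaults_<Type> naming convention used in
// defaults.go files: only fill in fields the user left unset.
func SetDefaults_Widget(obj *Widget) {
	if obj.Replicas == nil {
		r := int32(1)
		obj.Replicas = &r
	}
}

func main() {
	w := &Widget{}
	SetDefaults_Widget(w)
	fmt.Println(*w.Replicas)
}
```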
Do use pointers to scalars when you need to distinguish between an unset value
@@ -601,7 +605,6 @@ Due to the fast changing nature of the project, the following content is probabl
to generate protobuf IDL and marshallers.
* You must add the new version to
[cmd/kube-apiserver/app#apiVersionPriorities](https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.2/cmd/kube-apiserver/app/aggregator.go#L172)
- to let the aggregator list it. This list will be removed before release 1.8.
* You must setup storage for the new version in
[pkg/registry/group_name/rest](https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.2/pkg/registry/authentication/rest/storage_authentication.go)
diff --git a/contributors/devel/coding-conventions.md b/contributors/devel/coding-conventions.md
deleted file mode 100644
index 23775c55..00000000
--- a/contributors/devel/coding-conventions.md
+++ /dev/null
@@ -1,3 +0,0 @@
-This document has been moved to https://git.k8s.io/community/contributors/guide/coding-conventions.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/development.md b/contributors/devel/development.md
index cd3f84b7..29ba0bd7 100644
--- a/contributors/devel/development.md
+++ b/contributors/devel/development.md
@@ -134,7 +134,9 @@ development environment, please [set one up](http://golang.org/doc/code.html).
| 1.5, 1.6 | 1.7 - 1.7.5 |
| 1.7 | 1.8.1 |
| 1.8 | 1.8.3 |
-| 1.9+ | 1.9.1 |
+| 1.9 | 1.9.1 |
+| 1.10 | 1.9.1 |
+| 1.11+ | 1.10.1 |
Ensure your GOPATH and PATH have been configured in accordance with the Go
environment instructions.
diff --git a/contributors/devel/faster_reviews.md b/contributors/devel/faster_reviews.md
deleted file mode 100644
index d0fe7e37..00000000
--- a/contributors/devel/faster_reviews.md
+++ /dev/null
@@ -1,4 +0,0 @@
-The contents of this file have been moved to https://git.k8s.io/community/contributors/guide/pull-requests.md.
- <!--
- This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
- -->
diff --git a/contributors/devel/flexvolume.md b/contributors/devel/flexvolume.md
index 627e88b1..1dfc9668 100644
--- a/contributors/devel/flexvolume.md
+++ b/contributors/devel/flexvolume.md
@@ -132,7 +132,7 @@ Note: Secrets are passed only to "mount/unmount" call-outs.
See [nginx-lvm.yaml] & [nginx-nfs.yaml] for a quick example on how to use Flexvolume in a pod.
-[lvm]: https://git.k8s.io/kubernetes/examples/volumes/flexvolume/lvm
-[nfs]: https://git.k8s.io/kubernetes/examples/volumes/flexvolume/nfs
-[nginx-lvm.yaml]: https://git.k8s.io/kubernetes/examples/volumes/flexvolume/nginx-lvm.yaml
-[nginx-nfs.yaml]: https://git.k8s.io/kubernetes/examples/volumes/flexvolume/nginx-nfs.yaml
+[lvm]: https://git.k8s.io/examples/staging/volumes/flexvolume/lvm
+[nfs]: https://git.k8s.io/examples/staging/volumes/flexvolume/nfs
+[nginx-lvm.yaml]: https://git.k8s.io/examples/staging/volumes/flexvolume/nginx-lvm.yaml
+[nginx-nfs.yaml]: https://git.k8s.io/examples/staging/volumes/flexvolume/nginx-nfs.yaml
diff --git a/contributors/devel/go-code.md b/contributors/devel/go-code.md
deleted file mode 100644
index 4454e400..00000000
--- a/contributors/devel/go-code.md
+++ /dev/null
@@ -1,3 +0,0 @@
-This document's content has been rolled into https://git.k8s.io/community/contributors/guide/coding-conventions.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/owners.md b/contributors/devel/owners.md
deleted file mode 100644
index 1be75e5f..00000000
--- a/contributors/devel/owners.md
+++ /dev/null
@@ -1,4 +0,0 @@
-This document has been moved to https://git.k8s.io/community/contributors/guide/owners.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
-
diff --git a/contributors/devel/pull-requests.md b/contributors/devel/pull-requests.md
deleted file mode 100644
index c793df8c..00000000
--- a/contributors/devel/pull-requests.md
+++ /dev/null
@@ -1,4 +0,0 @@
-This file has been moved to https://git.k8s.io/community/contributors/guide/pull-requests.md.
-<!--
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
---> \ No newline at end of file
diff --git a/contributors/devel/release/OWNERS b/contributors/devel/release/OWNERS
deleted file mode 100644
index afb042fa..00000000
--- a/contributors/devel/release/OWNERS
+++ /dev/null
@@ -1,8 +0,0 @@
-reviewers:
- - saad-ali
- - pwittrock
- - steveperry-53
- - chenopis
- - spiffxp
-approvers:
- - sig-release-leads
diff --git a/contributors/devel/release/README.md b/contributors/devel/release/README.md
deleted file mode 100644
index d6eb9d6c..00000000
--- a/contributors/devel/release/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/ephemera/README.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/release/issues.md b/contributors/devel/release/issues.md
deleted file mode 100644
index cccf12e9..00000000
--- a/contributors/devel/release/issues.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/ephemera/issues.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/release/patch-release-manager.md b/contributors/devel/release/patch-release-manager.md
deleted file mode 100644
index da1290e5..00000000
--- a/contributors/devel/release/patch-release-manager.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/release-process-documentation/release-team-guides/patch-release-manager-playbook.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/release/patch_release.md b/contributors/devel/release/patch_release.md
deleted file mode 100644
index 1b074759..00000000
--- a/contributors/devel/release/patch_release.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/ephemera/patch_release.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/release/scalability-validation.md b/contributors/devel/release/scalability-validation.md
deleted file mode 100644
index 8a943227..00000000
--- a/contributors/devel/release/scalability-validation.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/ephemera/scalability-validation.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/release/testing.md b/contributors/devel/release/testing.md
deleted file mode 100644
index 2ae76112..00000000
--- a/contributors/devel/release/testing.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/ephemera/testing.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/devel/scalability-good-practices.md b/contributors/devel/scalability-good-practices.md
deleted file mode 100644
index ef274c27..00000000
--- a/contributors/devel/scalability-good-practices.md
+++ /dev/null
@@ -1,4 +0,0 @@
-This document has been moved to https://git.k8s.io/community/contributors/guide/scalability-good-practices.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
-
diff --git a/contributors/devel/scheduler.md b/contributors/devel/scheduler.md
index d8da4631..486b04a9 100644
--- a/contributors/devel/scheduler.md
+++ b/contributors/devel/scheduler.md
@@ -84,7 +84,7 @@ scheduling policies to apply, and can add new ones.
The policies that are applied when scheduling can be chosen in one of two ways.
The default policies used are selected by the functions `defaultPredicates()` and `defaultPriorities()` in
[pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/pkg/scheduler/algorithmprovider/defaults/defaults.go).
-However, the choice of policies can be overridden by passing the command-line flag `--policy-config-file` to the scheduler, pointing to a JSON file specifying which scheduling policies to use. See [examples/scheduler-policy-config.json](http://releases.k8s.io/HEAD/examples/scheduler-policy-config.json) for an example
+However, the choice of policies can be overridden by passing the command-line flag `--policy-config-file` to the scheduler, pointing to a JSON file specifying which scheduling policies to use. See [examples/scheduler-policy-config.json](https://git.k8s.io/examples/staging/scheduler-policy-config.json) for an example
config file. (Note that the config file format is versioned; the API is defined in [pkg/scheduler/api](http://releases.k8s.io/HEAD/pkg/scheduler/api/)).
Thus to add a new scheduling policy, you should modify [pkg/scheduler/algorithm/predicates/predicates.go](http://releases.k8s.io/HEAD/pkg/scheduler/algorithm/predicates/predicates.go) or add to the directory [pkg/scheduler/algorithm/priorities](http://releases.k8s.io/HEAD/pkg/scheduler/algorithm/priorities/), and either register the policy in `defaultPredicates()` or `defaultPriorities()`, or use a policy config file.
diff --git a/contributors/devel/security-release-process.md b/contributors/devel/security-release-process.md
deleted file mode 100644
index e0b55f68..00000000
--- a/contributors/devel/security-release-process.md
+++ /dev/null
@@ -1,3 +0,0 @@
-The original content of this file has been migrated to https://git.k8s.io/sig-release/security-release-process-documentation/security-release-process.md
-
-This file is a placeholder to preserve links. Please remove after 3 months or the release of kubernetes 1.10, whichever comes first.
diff --git a/contributors/guide/README.md b/contributors/guide/README.md
index ddb65111..ad5cf2e1 100644
--- a/contributors/guide/README.md
+++ b/contributors/guide/README.md
@@ -208,7 +208,7 @@ If you haven't noticed by now, we have a large, lively, and friendly open-source
## Events
-Kubernetes is the main focus of CloudNativeCon/KubeCon, held twice per year in EMEA and in North America. Information about these and other community events is available on the CNCF [events](https://www.cncf.io/events/) pages.
+Kubernetes is the main focus of KubeCon + CloudNativeCon, held three times per year in China, Europe, and North America. Information about these and other community events is available on the CNCF [events](https://www.cncf.io/events/) pages.
### Meetups
diff --git a/contributors/guide/contributor-cheatsheet.md b/contributors/guide/contributor-cheatsheet.md
index 8f21cd84..e9591afc 100644
--- a/contributors/guide/contributor-cheatsheet.md
+++ b/contributors/guide/contributor-cheatsheet.md
@@ -17,10 +17,11 @@ A list of common resources when contributing to Kubernetes.
- [Gubernator Dashboard - k8s.reviews](https://k8s-gubernator.appspot.com/pr)
- [Submit Queue](https://submit-queue.k8s.io)
- [Bot commands](https://go.k8s.io/bot-commands)
-- [Release Buckets](http://gcsweb.k8s.io/gcs/kubernetes-release/)
+- [GitHub labels](https://go.k8s.io/github-labels)
+- [Release Buckets](https://gcsweb.k8s.io/gcs/kubernetes-release/)
- Developer Guide
- - [Cherry Picking Guide](/contributors/devel/cherry-picks.md) - [Queue](http://cherrypick.k8s.io/#/queue)
-- [https://k8s-code.appspot.com/](https://k8s-code.appspot.com/) - Kubernetes Code Search, maintained by [@dims](https://github.com/dims)
+ - [Cherry Picking Guide](/contributors/devel/cherry-picks.md) - [Queue](https://cherrypick.k8s.io/#/queue)
+- [Kubernetes Code Search](https://cs.k8s.io/), maintained by [@dims](https://github.com/dims)
## SIGs and Working Groups
@@ -39,8 +40,10 @@ A list of common resources when contributing to Kubernetes.
## Tests
- [Current Test Status](https://prow.k8s.io/)
-- [Aggregated Failures](https://storage.googleapis.com/k8s-gubernator/triage/index.html)
-- [Test Grid](https://k8s-testgrid.appspot.com/)
+- [Aggregated Failures](https://go.k8s.io/triage)
+- [Test Grid](https://testgrid.k8s.io)
+- [Test Health](https://go.k8s.io/test-health)
+- [Test History](https://go.k8s.io/test-history)
## Other
diff --git a/contributors/guide/github-workflow.md b/contributors/guide/github-workflow.md
index a1429258..ac747abc 100644
--- a/contributors/guide/github-workflow.md
+++ b/contributors/guide/github-workflow.md
@@ -74,6 +74,22 @@ git checkout -b myfeature
Then edit code on the `myfeature` branch.
#### Build
+The following section is a quick start on how to build Kubernetes locally; for more detailed information, see [kubernetes/build](https://git.k8s.io/kubernetes/build/README.md).
+The best way to validate your current setup is to build a small part of Kubernetes. This way you can address issues without waiting for the full build to complete. To build a specific part of Kubernetes, use the `WHAT` environment variable to tell the build scripts to build only a certain package or executable.
+
+```sh
+make WHAT=cmd/${package_you_want}
+```
+
+*Note:* This applies to all top level folders under kubernetes/cmd.
+
+So for the CLI, you can run:
+
+```sh
+make WHAT=cmd/kubectl
+```
+
+If everything checks out, you will have an executable in the `_output/bin` directory to play around with.
*Note:* If you are using `CDPATH`, you must either start it with a leading colon, or unset the variable. The make rules and scripts to build require the current directory to come first on the CD search path in order to properly navigate between directories.
diff --git a/contributors/new-contributor-playground/OWNERS b/contributors/new-contributor-playground/OWNERS
new file mode 100644
index 00000000..8a6b7bb7
--- /dev/null
+++ b/contributors/new-contributor-playground/OWNERS
@@ -0,0 +1,14 @@
+reviewers:
+ - parispittman
+ - guineveresaenger
+ - jberkus
+ - errordeveloper
+ - tpepper
+ - spiffxp
+approvers:
+ - parispittman
+ - guineveresaenger
+ - jberkus
+ - errordeveloper
+labels:
+ - area/new-contributor-track
diff --git a/contributors/new-contributor-playground/README.md b/contributors/new-contributor-playground/README.md
new file mode 100644
index 00000000..b5946581
--- /dev/null
+++ b/contributors/new-contributor-playground/README.md
@@ -0,0 +1,12 @@
+# Welcome to KubeCon Copenhagen's New Contributor Track!
+
+Hello new contributors!
+
+This subfolder of [kubernetes/community](https://github.com/kubernetes/community) will be used as a safe space for participants in the New Contributor Onboarding Track to familiarize themselves with (some of) the Kubernetes Project's review and pull request processes.
+
+The label associated with this track is `area/new-contributor-track`.
+
+*If you are not currently attending or organizing this event, please DO NOT create issues or pull requests against this section of the community repo.*
+
+A [YouTube playlist](https://www.youtube.com/playlist?list=PL69nYSiGNLP3M5X7stuD7N4r3uP2PZQUx) of this workshop has been posted, and an outline mapping the content to the videos can be found [here](http://git.k8s.io/community/events/2018/05-contributor-summit).
+
diff --git a/contributors/new-contributor-playground/hello-from-copenhagen.md b/contributors/new-contributor-playground/hello-from-copenhagen.md
new file mode 100644
index 00000000..29467efd
--- /dev/null
+++ b/contributors/new-contributor-playground/hello-from-copenhagen.md
@@ -0,0 +1,4 @@
+# Hello from Copenhagen!
+
+Hello everyone who's attending the Contributor Summit at KubeCon + CloudNativeCon in Copenhagen!
+Great to see so many amazing people interested in contributing to Kubernetes :)
diff --git a/contributors/new-contributor-playground/new-contributor-notes.md b/contributors/new-contributor-playground/new-contributor-notes.md
new file mode 100644
index 00000000..1858fd85
--- /dev/null
+++ b/contributors/new-contributor-playground/new-contributor-notes.md
@@ -0,0 +1,350 @@
+# Kubernetes New Contributor Workshop - KubeCon EU 2018 - Notes
+
+Joining the project in the early days was like boarding a yacht;
+now it is more like boarding a BIG cruise ship.
+
+The schedule is packed, so let's hope we can get through everything.
+SIG Contributor Experience helps contributors grow from non-member contributor to owner.
+
+## SIG presentation
+
+- SIG-docs & SIG-contributor-experience: **Docs and website** contribution
+- SIG-testing: **Testing** contribution
+- SIG-\* (*depends on the area to contribute on*): **Code** contribution
+
+**=> Find your first topics**: bug, feature, learning, community development and documentation
+
+Table exercise: introduce yourself and share where you would like to contribute in Kubernetes.
+
+
+## Communication in the community
+
+The Kubernetes community is like a capybara: community members are friendly with everyone and come from many different backgrounds.
+
+- Ask technical questions on Slack and Stack Overflow, not on GitHub
+- A lot of discussion will follow when GitHub issues and PRs are opened; don't be frustrated
+- Stay patient, because there are a lot of contributions
+
+When in doubt, **ask on Slack**
+
+Other communication channels:
+
+- Community meetings
+- Mailing lists
+- @ on Github
+- Office Hours
+- Kubernetes meetups https://www.meetup.com/topics/kubernetes
+
+On https://kubernetes.io/community there is the schedule for all the SIG and Working Group meetings.
+If you want to join or create a meetup, go to **slack#sig-contribex**.
+
+## SIG - Special Interest Group
+
+Semi-autonomous teams:
+- Own leaders & charters
+- Responsible for their code, GitHub repos, Slack channels, mailing lists and meetings
+
+### Types
+
+[SIG List](https://github.com/kubernetes/community/blob/master/sig-list.md)
+
+1. Feature areas
+ - sig-auth
+ - sig-apps
+ - sig-autoscaling
+ - sig-big-data
+ - sig-cli
+ - sig-multicluster
+ - sig-network
+ - sig-node
+ - sig-scalability
+ - sig-scheduling
+ - sig-service-catalog
+ - sig-storage
+ - sig-ui
+2. Plumbing
+ - sig-cluster-lifecycle
+ - sig-api-machinery
+ - sig-instrumentation
+3. Cloud Providers *(currently working on moving cloudprovider code out of Core)*
+ - sig-aws
+ - sig-azure
+ - sig-gcp
+ - sig-ibmcloud
+ - sig-openstack
+4. Meta
+ - sig-architecture: For all general architectural decisions
+ - sig-contributor-experience: Improving the contributor and community experience
+ - sig-product-management: Long-term decisions
+ - sig-release
+ - sig-testing: In charge of all the testing for Kubernetes
+5. Docs
+ - sig-docs: for documentation and website
+
+## Working groups and "Subproject"
+
+From working group to "subproject".
+
+For specific tools (e.g. Helm), goals (e.g. Resource Management) or areas (e.g. Machine Learning).
+
+Working groups change around more frequently than SIGs, and some might be temporary.
+
+- wg-app-def
+- wg-apply
+- wg-cloud-provider
+- wg-cluster-api
+- wg-container-identity
+- ...
+
+### Picking the right SIG:
+1. Figure out which area you would like to contribute to
+2. Find out which SIG / WG / subproject covers that (tip: ask on #sig-contribex Slack channel)
+3. Join that SIG / WG / subproject (you should also join the main SIG when joining a WG / subproject)
+
+## Tour of the repositories
+
+Everything will be refactored (cleaned up, moved, merged, ...)
+
+### Core repository
+- [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes)
+
+### Project
+
+- [kubernetes/community](https://github.com/kubernetes/community): KubeCon, proposals, code of conduct and contribution guidelines, SIG list
+- [kubernetes/features](https://github.com/kubernetes/features): feature proposals for future releases
+- [kubernetes/steering](https://github.com/kubernetes/steering)
+- [kubernetes/test-infra](https://github.com/kubernetes/test-infra): everything related to testing except performance
+- [kubernetes/perf-tests](https://github.com/kubernetes/perf-tests): performance tests
+
+### Docs/Website
+
+- website
+- kubernetes-cn
+- kubernetes-ko
+
+### Developer Tools
+
+- sample-controller*
+- sample-apiserver*
+- code-generator*
+- k8s.io
+- kubernetes-template-project: For new github repo
+
+### Staging repositories
+
+Mirrors of core components for easy vendoring
+
+### SIG repositories
+
+- release
+- federation
+- autoscaler
+
+### Cloud Providers
+
+No AWS
+
+### Tools & Products
+
+- kubeadm
+- kubectl
+- kops
+- helm
+- charts
+- kompose
+- ingress-nginx
+- minikube
+- dashboard
+- heapster
+- kubernetes-anywhere
+- kube-openapi
+
+### 2nd Namespace: Kubernetes-sigs
+
+Too many places for random/incubation projects.
+No working path for **promotion/deprecation**
+
+In the future:
+1. start in Kubernetes-sigs
+2. SIGs determine when and how the project will be **promoted/deprecated**
+
+Those repositories can have their own rules:
+- Approval
+- Ownership
+- ...
+
+## Contribution
+
+### First Bug report
+
+```
+- Bug or Feature
+
+- What happened
+
+- How to reproduce
+
+```
+
+### Issues as specifications
+
+
+Most Kubernetes changes start with an issue:
+
+- Feature proposal
+- API changes proposal
+- Specification
+
+### From Issue to Code/Docs
+
+1. Start with an issue
+2. Apply all appropriate labels
+3. cc SIG leads and concerned devs
+4. Raise the issue at a SIG meeting or on mailing list
+5. If there is *lazy consensus*, submit a PR
+
+### Required labels: https://github.com/kubernetes/test-infra/blob/master/label_sync/labels.md
+
+#### On creation
+- `sig/\*`: the SIG the issue belongs to
+- `kind/\*`:
+ - bug
+ - feature
+ - documentation
+ - design
+ - failing-test
+
+#### For issues closed as part of **triage**
+
+- `triage/duplicate`
+- `triage/needs-information`
+- `triage/support`
+- `triage/unreproduceable`
+- `triage/unresolved`
+
+#### Priority
+
+- `priority/critical-urgent`
+- `priority/important-soon`
+- `priority/important-longterm`
+- `priority/backlog`
+- `priority/awaiting-evidence`
+
+#### Area
+
+Free-form labels for dedicated issue areas
+
+- `area/kubectl`
+- `area/api`
+- `area/dns`
+- `area/platform/gcp`
+
+#### help-wanted
+
+Currently mostly applied to complicated issues
+
+#### SOON
+
+`good-first-issue`
+
+## Making a contribution by Pull Request
+
+We will go through the typical PR process on kubernetes repos.
+
+We will play there: [community/contributors/new-contributor-playground at master · kubernetes/community · GitHub](https://github.com/kubernetes/community/tree/master/contributors/new-contributor-playground)
+
+1. When we contribute to any kubernetes repository, **fork it**
+
+2. Make your modifications in your fork
+```
+$ git clone git@github.com:jgsqware/community.git $GOPATH/src/github.com/kubernetes/community
+$ git remote add upstream https://github.com/kubernetes/community.git
+$ git remote -v
+origin git@github.com:jgsqware/community.git (fetch)
+origin git@github.com:jgsqware/community.git (push)
+upstream git@github.com:kubernetes/community.git (fetch)
+upstream git@github.com:kubernetes/community.git (push)
+$ git checkout -b kubecon
+Switched to a new branch 'kubecon'
+
+## MAKE YOUR MODIFICATIONS IN THE CODE ##
+
+$ git add contributors/new-contributor-playground/new-contributor-playground-xyz.md
+$ git commit
+
+
+### IN YOUR COMMIT EDITOR ###
+
+ Adding a new contributors file
+
+ We are currently experimenting with the PR process in the kubernetes repository.
+
+$ git push -u origin kubecon
+```
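The walkthrough above glosses over one step that often trips up new contributors: upstream `master` keeps moving while you work, so fetch and rebase your branch before pushing and opening the PR. Below is a minimal, self-contained sketch of that flow using throwaway local repositories in a temp directory; names like `kubecon`, `upstream` and `fork` mirror the example above and are just placeholders (nothing here touches GitHub).

```shell
#!/bin/sh
# Toy demo of the fork-and-rebase flow: "upstream" stands in for
# kubernetes/community and "fork" for your GitHub fork.
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Create the "upstream" repository with one commit.
git init -q upstream && cd upstream
git config user.email you@example.com && git config user.name you
git checkout -qb master
echo v1 > README && git add README && git commit -qm "initial commit"
cd ..

# Clone it as your "fork" and wire up the second remote, as in the notes.
git clone -q upstream fork && cd fork
git config user.email you@example.com && git config user.name you
git remote add upstream ../upstream

# Work on a feature branch.
git checkout -qb kubecon
echo hello > hello.md && git add hello.md && git commit -qm "Add hello.md"

# Meanwhile upstream moves on...
(cd ../upstream && echo v2 >> README && git commit -qam "upstream change")

# ...so fetch and rebase before pushing / opening the PR.
git fetch -q upstream
git rebase -q upstream/master
git log --oneline
```

In the real workflow you would finish with `git push -u origin kubecon` (or `git push --force-with-lease` after rebasing an already-pushed branch) and then open the PR on GitHub.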
+
+3. Create a Pull request via Github
+4. If needed, sign the CLA to make your contribution valid
+5. Read the `k8s-ci-robot` message and `/assign @reviewer` as recommended by the `k8s-ci-robot`
+6. Wait for an `LGTM` label from one of the reviewers in the `OWNERS` file
+7. Wait for approval from one of the approvers in the `OWNERS` file
+8. `k8s-ci-robot` will automatically merge the PR
+
+The `needs-ok-to-test` label marks pull requests from non-member contributors; a member must comment `/ok-to-test` before the tests run
+
+## Test infrastructure
+
+> How the bot tells you when you mess up
+
+At the end of a PR there is a set of tests, of 2 types:
+ - required: always run, and must pass to validate the PR (e.g. end-to-end tests)
+ - not required: run only in specific conditions (e.g. when modifying only a specific part of the code)
+
+If something fails, click on `details` and check the test failure logs to see what happened.
+There is a `junit-XX.log` with the list of tests executed, and an `e2e-xxxxx` folder with all the component logs.
+To check whether the test failed because of your PR or another one, click on the **TOP** `pull-request-xxx` link; you will see the test grid and can check whether your failing test is failing in other PRs too.
+
+If you want to retrigger the tests manually, comment `/retest` on the PR and `k8s-ci-robot` will retrigger them.
+
+## SIG-Docs contribution
+
+Anyone can contribute to docs.
+
+### Kubernetes docs
+
+- Website URL
+- GitHub repository
+- k8s Slack: #sig-docs
+
+### Working with docs
+
+Docs use `k8s-ci-robot`. The approval process is the same as for any k8s repo.
+In docs, the `master` branch is the current version of the docs, so always branch from `master`; it is continuously deployed.
+For the docs of a specific release, branch from `release-1.X`.
+
+## Local build and Test
+
+The code: [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes)
+The process: [kubernetes/community](https://github.com/kubernetes/community)
+
+### Dev Env
+
+You need:
+- Go
+- Docker
+- A lot of RAM and CPU, and 10 GB of disk space
+- Best to use Linux
+
+- Place your k8s repo fork in:
+  - `$GOPATH/src/k8s.io/kubernetes`
+- `cd $GOPATH/src/k8s.io/kubernetes`
+- Build: `./build/run.sh make`
+  - The build is incremental; keep running `./build/run.sh make` until it works
+- To build a single binary: `make WHAT=cmd/kubectl`
+- Building kubectl on a Mac for Linux: `KUBE_BUILD_PLATFORMS=linux/amd64 make WHAT=cmd/kubectl`
+
+There is `build` documentation here: https://git.k8s.io/kubernetes/build
+
+### Testing
+There is `test` documentation here: https://git.k8s.io/community/contributors/guide
diff --git a/contributors/new-contributor-playground/new-contributors.md b/contributors/new-contributor-playground/new-contributors.md
new file mode 100644
index 00000000..6604eb65
--- /dev/null
+++ b/contributors/new-contributor-playground/new-contributors.md
@@ -0,0 +1,5 @@
+# Hello everyone!
+
+Please feel free to talk amongst yourselves or ask questions if you need help
+
+First commit at kubecon from @mitsutaka \ No newline at end of file