author    Brian Grant <bgrant0607@users.noreply.github.com>  2016-11-30 12:52:56 -0800
committer GitHub <noreply@github.com>  2016-11-30 12:52:56 -0800
commit    795d951bba8bad8bcb67feacc18d741eac8c3597 (patch)
tree      a86981bfd27dc3b4b4d43eac77f57b6d0d646dde
parent    5f8a42e8e8b2dc718206ed60e193f87d77ecd54e (diff)
parent    f4379b3c775e31b38df46fb174cdafceceaaca33 (diff)

Merge pull request #118 from michelleN/move-docs-from-kube

import devel, design, and proposals from kubernetes
-rw-r--r--contributors/design-proposals/Kubemark_architecture.pngbin0 -> 30417 bytes
-rw-r--r--contributors/design-proposals/README.md62
-rw-r--r--contributors/design-proposals/access.md376
-rw-r--r--contributors/design-proposals/admission_control.md106
-rw-r--r--contributors/design-proposals/admission_control_limit_range.md233
-rw-r--r--contributors/design-proposals/admission_control_resource_quota.md215
-rw-r--r--contributors/design-proposals/api-group.md119
-rw-r--r--contributors/design-proposals/apiserver-watch.md145
-rw-r--r--contributors/design-proposals/apparmor.md310
-rw-r--r--contributors/design-proposals/architecture.diabin0 -> 6523 bytes
-rw-r--r--contributors/design-proposals/architecture.md85
-rw-r--r--contributors/design-proposals/architecture.pngbin0 -> 268126 bytes
-rw-r--r--contributors/design-proposals/architecture.svg1943
-rw-r--r--contributors/design-proposals/aws_under_the_hood.md310
-rw-r--r--contributors/design-proposals/client-package-structure.md316
-rw-r--r--contributors/design-proposals/cluster-deployment.md171
-rw-r--r--contributors/design-proposals/clustering.md128
-rw-r--r--contributors/design-proposals/clustering/.gitignore1
-rw-r--r--contributors/design-proposals/clustering/Dockerfile26
-rw-r--r--contributors/design-proposals/clustering/Makefile41
-rw-r--r--contributors/design-proposals/clustering/README.md35
-rw-r--r--contributors/design-proposals/clustering/dynamic.pngbin0 -> 72373 bytes
-rw-r--r--contributors/design-proposals/clustering/dynamic.seqdiag24
-rw-r--r--contributors/design-proposals/clustering/static.pngbin0 -> 36583 bytes
-rw-r--r--contributors/design-proposals/clustering/static.seqdiag16
-rw-r--r--contributors/design-proposals/command_execution_port_forwarding.md158
-rw-r--r--contributors/design-proposals/configmap.md300
-rw-r--r--contributors/design-proposals/container-init.md444
-rw-r--r--contributors/design-proposals/container-runtime-interface-v1.md267
-rw-r--r--contributors/design-proposals/control-plane-resilience.md241
-rw-r--r--contributors/design-proposals/controller-ref.md102
-rw-r--r--contributors/design-proposals/daemon.md206
-rw-r--r--contributors/design-proposals/deploy.md147
-rw-r--r--contributors/design-proposals/deployment.md229
-rwxr-xr-xcontributors/design-proposals/disk-accounting.md615
-rw-r--r--contributors/design-proposals/downward_api_resources_limits_requests.md622
-rw-r--r--contributors/design-proposals/dramatically-simplify-cluster-creation.md266
-rw-r--r--contributors/design-proposals/enhance-pluggable-policy.md429
-rw-r--r--contributors/design-proposals/event_compression.md169
-rw-r--r--contributors/design-proposals/expansion.md417
-rw-r--r--contributors/design-proposals/extending-api.md203
-rw-r--r--contributors/design-proposals/external-lb-source-ip-preservation.md238
-rw-r--r--contributors/design-proposals/federated-api-servers.md209
-rw-r--r--contributors/design-proposals/federated-ingress.md223
-rw-r--r--contributors/design-proposals/federated-replicasets.md513
-rw-r--r--contributors/design-proposals/federated-services.md517
-rw-r--r--contributors/design-proposals/federation-high-level-arch.pngbin0 -> 31793 bytes
-rw-r--r--contributors/design-proposals/federation-lite.md201
-rw-r--r--contributors/design-proposals/federation-phase-1.md407
-rw-r--r--contributors/design-proposals/federation.md648
-rw-r--r--contributors/design-proposals/flannel-integration.md132
-rw-r--r--contributors/design-proposals/garbage-collection.md357
-rw-r--r--contributors/design-proposals/gpu-support.md279
-rw-r--r--contributors/design-proposals/ha_master.md236
-rw-r--r--contributors/design-proposals/high-availability.md8
-rw-r--r--contributors/design-proposals/horizontal-pod-autoscaler.md263
-rw-r--r--contributors/design-proposals/identifiers.md113
-rw-r--r--contributors/design-proposals/image-provenance.md331
-rw-r--r--contributors/design-proposals/images/.gitignore0
-rw-r--r--contributors/design-proposals/indexed-job.md900
-rw-r--r--contributors/design-proposals/initial-resources.md75
-rw-r--r--contributors/design-proposals/job.md159
-rw-r--r--contributors/design-proposals/kubectl-login.md220
-rw-r--r--contributors/design-proposals/kubelet-auth.md106
-rw-r--r--contributors/design-proposals/kubelet-cri-logging.md269
-rw-r--r--contributors/design-proposals/kubelet-eviction.md462
-rw-r--r--contributors/design-proposals/kubelet-hypercontainer-runtime.md45
-rw-r--r--contributors/design-proposals/kubelet-rkt-runtime.md103
-rw-r--r--contributors/design-proposals/kubelet-systemd.md407
-rw-r--r--contributors/design-proposals/kubelet-tls-bootstrap.md243
-rw-r--r--contributors/design-proposals/kubemark.md157
-rw-r--r--contributors/design-proposals/local-cluster-ux.md161
-rw-r--r--contributors/design-proposals/metadata-policy.md137
-rw-r--r--contributors/design-proposals/monitoring_architecture.md203
-rw-r--r--contributors/design-proposals/monitoring_architecture.pngbin0 -> 76662 bytes
-rw-r--r--contributors/design-proposals/multi-platform.md532
-rw-r--r--contributors/design-proposals/multiple-schedulers.md138
-rw-r--r--contributors/design-proposals/namespaces.md370
-rw-r--r--contributors/design-proposals/network-policy.md304
-rw-r--r--contributors/design-proposals/networking.md190
-rw-r--r--contributors/design-proposals/node-allocatable.md151
-rw-r--r--contributors/design-proposals/node-allocatable.pngbin0 -> 17673 bytes
-rw-r--r--contributors/design-proposals/nodeaffinity.md246
-rw-r--r--contributors/design-proposals/performance-related-monitoring.md116
-rw-r--r--contributors/design-proposals/persistent-storage.md292
-rw-r--r--contributors/design-proposals/pleg.pngbin0 -> 49079 bytes
-rw-r--r--contributors/design-proposals/pod-cache.pngbin0 -> 51394 bytes
-rw-r--r--contributors/design-proposals/pod-lifecycle-event-generator.md201
-rw-r--r--contributors/design-proposals/pod-resource-management.md416
-rw-r--r--contributors/design-proposals/pod-security-context.md374
-rw-r--r--contributors/design-proposals/podaffinity.md673
-rw-r--r--contributors/design-proposals/principles.md101
-rw-r--r--contributors/design-proposals/protobuf.md480
-rw-r--r--contributors/design-proposals/release-notes.md194
-rw-r--r--contributors/design-proposals/rescheduler.md123
-rw-r--r--contributors/design-proposals/rescheduling-for-critical-pods.md88
-rw-r--r--contributors/design-proposals/rescheduling.md493
-rw-r--r--contributors/design-proposals/resource-metrics-api.md151
-rw-r--r--contributors/design-proposals/resource-qos.md218
-rw-r--r--contributors/design-proposals/resource-quota-scoping.md333
-rw-r--r--contributors/design-proposals/resources.md370
-rw-r--r--contributors/design-proposals/runtime-client-server.md206
-rw-r--r--contributors/design-proposals/runtime-pod-cache.md173
-rw-r--r--contributors/design-proposals/runtimeconfig.md69
-rw-r--r--contributors/design-proposals/scalability-testing.md72
-rw-r--r--contributors/design-proposals/scheduledjob.md335
-rw-r--r--contributors/design-proposals/scheduler_extender.md105
-rw-r--r--contributors/design-proposals/seccomp.md266
-rw-r--r--contributors/design-proposals/secret-configmap-downwarapi-file-mode.md186
-rw-r--r--contributors/design-proposals/secrets.md628
-rw-r--r--contributors/design-proposals/security-context-constraints.md348
-rw-r--r--contributors/design-proposals/security.md218
-rw-r--r--contributors/design-proposals/security_context.md192
-rw-r--r--contributors/design-proposals/selector-generation.md180
-rw-r--r--contributors/design-proposals/self-hosted-kubelet.md135
-rw-r--r--contributors/design-proposals/selinux-enhancements.md209
-rw-r--r--contributors/design-proposals/selinux.md317
-rw-r--r--contributors/design-proposals/service-discovery.md69
-rw-r--r--contributors/design-proposals/service-external-name.md161
-rw-r--r--contributors/design-proposals/service_accounts.md210
-rw-r--r--contributors/design-proposals/simple-rolling-update.md131
-rw-r--r--contributors/design-proposals/stateful-apps.md363
-rw-r--r--contributors/design-proposals/synchronous-garbage-collection.md175
-rw-r--r--contributors/design-proposals/taint-toleration-dedicated.md291
-rw-r--r--contributors/design-proposals/templates.md569
-rw-r--r--contributors/design-proposals/ubernetes-cluster-state.pngbin0 -> 13824 bytes
-rw-r--r--contributors/design-proposals/ubernetes-design.pngbin0 -> 20358 bytes
-rw-r--r--contributors/design-proposals/ubernetes-scheduling.pngbin0 -> 39094 bytes
-rw-r--r--contributors/design-proposals/versioning.md174
-rw-r--r--contributors/design-proposals/volume-hostpath-qualifiers.md150
-rw-r--r--contributors/design-proposals/volume-ownership-management.md108
-rw-r--r--contributors/design-proposals/volume-provisioning.md500
-rw-r--r--contributors/design-proposals/volume-selectors.md268
-rw-r--r--contributors/design-proposals/volume-snapshotting.md523
-rw-r--r--contributors/design-proposals/volume-snapshotting.pngbin0 -> 49261 bytes
-rw-r--r--contributors/design-proposals/volumes.md482
-rw-r--r--contributors/devel/README.md83
-rw-r--r--contributors/devel/adding-an-APIGroup.md100
-rw-r--r--contributors/devel/api-conventions.md1350
-rwxr-xr-xcontributors/devel/api_changes.md732
-rw-r--r--contributors/devel/automation.md116
-rw-r--r--contributors/devel/bazel.md44
-rw-r--r--contributors/devel/cherry-picks.md64
-rw-r--r--contributors/devel/cli-roadmap.md11
-rw-r--r--contributors/devel/client-libraries.md27
-rw-r--r--contributors/devel/coding-conventions.md147
-rw-r--r--contributors/devel/collab.md87
-rw-r--r--contributors/devel/community-expectations.md87
-rw-r--r--contributors/devel/container-runtime-interface.md127
-rw-r--r--contributors/devel/controllers.md186
-rwxr-xr-xcontributors/devel/developer-guides/vagrant.md432
-rw-r--r--contributors/devel/development.md251
-rw-r--r--contributors/devel/e2e-node-tests.md231
-rw-r--r--contributors/devel/e2e-tests.md719
-rw-r--r--contributors/devel/faster_reviews.md218
-rw-r--r--contributors/devel/flaky-tests.md194
-rw-r--r--contributors/devel/generating-clientset.md41
-rw-r--r--contributors/devel/getting-builds.md52
-rw-r--r--contributors/devel/git_workflow.pngbin0 -> 114745 bytes
-rw-r--r--contributors/devel/go-code.md32
-rw-r--r--contributors/devel/godep.md123
-rw-r--r--contributors/devel/gubernator-images/filterpage.pngbin0 -> 408077 bytes
-rw-r--r--contributors/devel/gubernator-images/filterpage1.pngbin0 -> 375248 bytes
-rw-r--r--contributors/devel/gubernator-images/filterpage2.pngbin0 -> 372828 bytes
-rw-r--r--contributors/devel/gubernator-images/filterpage3.pngbin0 -> 362554 bytes
-rw-r--r--contributors/devel/gubernator-images/skipping1.pngbin0 -> 67007 bytes
-rw-r--r--contributors/devel/gubernator-images/skipping2.pngbin0 -> 114503 bytes
-rw-r--r--contributors/devel/gubernator-images/testfailures.pngbin0 -> 189178 bytes
-rw-r--r--contributors/devel/gubernator.md142
-rw-r--r--contributors/devel/how-to-doc.md205
-rw-r--r--contributors/devel/instrumentation.md52
-rw-r--r--contributors/devel/issues.md59
-rw-r--r--contributors/devel/kubectl-conventions.md411
-rwxr-xr-xcontributors/devel/kubemark-guide.md212
-rw-r--r--contributors/devel/local-cluster/docker.md269
-rw-r--r--contributors/devel/local-cluster/k8s-singlenode-docker.pngbin0 -> 31801 bytes
-rw-r--r--contributors/devel/local-cluster/local.md125
-rw-r--r--contributors/devel/local-cluster/vagrant.md397
-rw-r--r--contributors/devel/logging.md36
-rw-r--r--contributors/devel/mesos-style.md218
-rw-r--r--contributors/devel/node-performance-testing.md127
-rw-r--r--contributors/devel/on-call-build-cop.md151
-rw-r--r--contributors/devel/on-call-rotations.md43
-rw-r--r--contributors/devel/on-call-user-support.md89
-rw-r--r--contributors/devel/owners.md100
-rw-r--r--contributors/devel/pr_workflow.diabin0 -> 3189 bytes
-rw-r--r--contributors/devel/pr_workflow.pngbin0 -> 80793 bytes
-rw-r--r--contributors/devel/profiling.md46
-rw-r--r--contributors/devel/pull-requests.md105
-rw-r--r--contributors/devel/running-locally.md170
-rwxr-xr-xcontributors/devel/scheduler.md72
-rwxr-xr-xcontributors/devel/scheduler_algorithm.md44
-rw-r--r--contributors/devel/testing.md230
-rw-r--r--contributors/devel/update-release-docs.md115
-rw-r--r--contributors/devel/updating-docs-for-feature-changes.md76
-rw-r--r--contributors/devel/writing-a-getting-started-guide.md101
-rw-r--r--contributors/devel/writing-good-e2e-tests.md235
197 files changed, 41450 insertions, 0 deletions
diff --git a/contributors/design-proposals/Kubemark_architecture.png b/contributors/design-proposals/Kubemark_architecture.png
new file mode 100644
index 00000000..479ad8b1
--- /dev/null
+++ b/contributors/design-proposals/Kubemark_architecture.png
Binary files differ
diff --git a/contributors/design-proposals/README.md b/contributors/design-proposals/README.md
new file mode 100644
index 00000000..85fc8245
--- /dev/null
+++ b/contributors/design-proposals/README.md
@@ -0,0 +1,62 @@
+# Kubernetes Design Overview
+
+Kubernetes is a system for managing containerized applications across multiple
+hosts, providing basic mechanisms for deployment, maintenance, and scaling of
+applications.
+
+Kubernetes establishes robust declarative primitives for maintaining the desired
+state requested by the user. We see these primitives as the main value added by
+Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and
+replicating containers, require active controllers, not just imperative
+orchestration.
+
+Kubernetes is primarily targeted at applications composed of multiple
+containers, such as elastic, distributed micro-services. It is also designed to
+facilitate migration of non-containerized application stacks to Kubernetes. It
+therefore includes abstractions for grouping containers in both loosely coupled
+and tightly coupled formations, and provides ways for containers to find and
+communicate with each other in relatively familiar ways.
+
+Kubernetes enables users to ask a cluster to run a set of containers. The system
+automatically chooses hosts to run those containers on. While Kubernetes's
+scheduler is currently very simple, we expect it to grow in sophistication over
+time. Scheduling is a policy-rich, topology-aware, workload-specific function
+that significantly impacts availability, performance, and capacity. The
+scheduler needs to take into account individual and collective resource
+requirements, quality of service requirements, hardware/software/policy
+constraints, affinity and anti-affinity specifications, data locality,
+inter-workload interference, deadlines, and so on. Workload-specific
+requirements will be exposed through the API as necessary.
+
+Kubernetes is intended to run on a number of cloud providers, as well as on
+physical hosts.
+
+A single Kubernetes cluster is not intended to span multiple availability zones.
+Instead, we recommend building a higher-level layer to replicate complete
+deployments of highly available applications across multiple zones (see
+[the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md)
+for more details).
+
+Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS
+platform and toolkit. Therefore, architecturally, we want Kubernetes to be built
+as a collection of pluggable components and layers, with the ability to use
+alternative schedulers, controllers, storage systems, and distribution
+mechanisms, and we're evolving its current code in that direction. Furthermore,
+we want others to be able to extend Kubernetes functionality, such as with
+higher-level PaaS functionality or multi-cluster layers, without modification of
+core Kubernetes source. Therefore, its API isn't just (or even necessarily
+mainly) targeted at end users, but at tool and extension developers. Its APIs
+are intended to serve as the foundation for an open ecosystem of tools,
+automation systems, and higher-level API layers. Consequently, there are no
+"internal" inter-component APIs. All APIs are visible and available, including
+the APIs used by the scheduler, the node controller, the replication-controller
+manager, Kubelet's API, etc. There's no glass to break -- in order to handle
+more complex use cases, one can just access the lower-level APIs in a fully
+transparent, composable manner.
+
+For more about the Kubernetes architecture, see [architecture](architecture.md).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/access.md b/contributors/design-proposals/access.md
new file mode 100644
index 00000000..b23e463b
--- /dev/null
+++ b/contributors/design-proposals/access.md
@@ -0,0 +1,376 @@
+# K8s Identity and Access Management Sketch
+
+This document suggests a direction for identity and access management in the
+Kubernetes system.
+
+
+## Background
+
+High level goals are:
+ - Have a plan for how identity, authentication, and authorization will fit in
+to the API.
+ - Have a plan for partitioning resources within a cluster between independent
+organizational units.
+ - Ease integration with existing enterprise and hosted scenarios.
+
+### Actors
+
+Each of these can act as normal users or attackers.
+ - External Users: People who are accessing applications running on K8s (e.g.
+a web site served by webserver running in a container on K8s), but who do not
+have K8s API access.
+ - K8s Users: People who access the K8s API (e.g. create K8s API objects like
+Pods)
+ - K8s Project Admins: People who manage access for some K8s Users
+ - K8s Cluster Admins: People who control the machines, networks, or binaries
+that make up a K8s cluster.
+ - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
+
+### Threats
+
+Both intentional attacks and accidental use of privilege are concerns.
+
+For both cases it may be useful to think about these categories differently:
+ - Application Path - attack by sending network messages from the internet to
+the IP/port of any application running on K8s. May exploit weakness in
+application or misconfiguration of K8s.
+ - K8s API Path - attack by sending network messages to any K8s API endpoint.
+ - Insider Path - attack on K8s system components. Attacker may have
+privileged access to networks, machines or K8s software and data. Software
+errors in K8s system components and administrator error are some types of threat
+in this category.
+
+This document is primarily concerned with the K8s API path, and secondarily with
+the Insider path. The Application path also needs to be secure, but it is not the
+focus of this document.
+
+### Assets to protect
+
+External User assets:
+ - Personal information like private messages, or images uploaded by External
+Users.
+ - web server logs.
+
+K8s User assets:
+ - External User assets of each K8s User.
+ - things private to the K8s app, like:
+ - credentials for accessing other services (docker private repos, storage
+services, facebook, etc)
+ - SSL certificates for web servers
+ - proprietary data and code
+
+K8s Cluster assets:
+ - Assets of each K8s User.
+ - Machine Certificates or secrets.
+ - The value of K8s cluster computing resources (cpu, memory, etc).
+
+This document is primarily about protecting K8s User assets and K8s cluster
+assets from other K8s Users and K8s Project and Cluster Admins.
+
+### Usage environments
+
+Cluster in Small organization:
+ - K8s Admins may be the same people as K8s Users.
+ - Few K8s Admins.
+ - Prefer ease of use to fine-grained access control/precise accounting, etc.
+ - Product requirement that it be easy for potential K8s Cluster Admin to try
+out setting up a simple cluster.
+
+Cluster in Large organization:
+ - K8s Admins are typically distinct people from K8s Users. May need to divide
+K8s Cluster Admin access by roles.
+ - K8s Users need to be protected from each other.
+ - Auditing of K8s User and K8s Admin actions important.
+ - Flexible accurate usage accounting and resource controls important.
+ - Lots of automated access to APIs.
+ - Need to integrate with existing enterprise directory, authentication,
+accounting, auditing, and security policy infrastructure.
+
+Org-run cluster:
+ - Organization that runs K8s master components is same as the org that runs
+apps on K8s.
+ - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.
+
+Hosted cluster:
+ - Offering the K8s API as a service, or offering a PaaS or SaaS built on K8s.
+ - May already offer web services, and need to integrate with existing customer
+account concept, and existing authentication, accounting, auditing, and security
+policy infrastructure.
+ - May want to leverage K8s User accounts and accounting to manage their User
+accounts (not a priority to support this use case.)
+ - Precise and accurate accounting of resources needed. Resource controls
+needed for hard limits (Users given limited slice of data) and soft limits
+(Users can grow up to some limit and then be expanded).
+
+K8s ecosystem services:
+ - There may be companies that want to offer their existing services (Build, CI,
+A/B-test, release automation, etc) for use with K8s. There should be some story
+for this case.
+
+Pods configs should be largely portable between Org-run and hosted
+configurations.
+
+
+# Design
+
+Related discussion:
+- http://issue.k8s.io/442
+- http://issue.k8s.io/443
+
+This doc describes two security profiles:
+ - Simple profile: like single-user mode. Make it easy to evaluate K8s
+without lots of account and policy configuration. Protects from unauthorized
+users, but does not partition authorized users.
+ - Enterprise profile: Provide mechanisms needed for large numbers of users.
+Defense in depth. Should integrate with existing enterprise security
+infrastructure.
+
+K8s distribution should include templates of config, and documentation, for
+simple and enterprise profiles. System should be flexible enough for
+knowledgeable users to create intermediate profiles, but K8s developers should
+only reason about those two Profiles, not a matrix.
+
+Features in this doc are divided into "Initial Feature", and "Improvements".
+Initial features would be candidates for version 1.00.
+
+## Identity
+
+### userAccount
+
+K8s will have a `userAccount` API object.
+- `userAccount` has a UID which is immutable. This is used to associate users
+with objects and to record actions in audit logs.
+- `userAccount` has a name, which is a human-readable string that is unique among
+userAccounts. It is used to refer to users in Policies, to ensure that the
+Policies are human readable. It can be changed only when there are no Policy
+objects or other objects which refer to that name. An email address is a
+suggested format for this field.
+- `userAccount` is not related to the unix username of processes in Pods created
+by that userAccount.
+- `userAccount` API objects can have labels.
+
+The system may associate one or more Authentication Methods with a
+`userAccount` (but they are not formally part of the userAccount object.)
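+
+To make this concrete, a minimal Go rendering of such an object might look like
+the following sketch. The field names are illustrative only (taken from the
+properties listed above), not a committed API:
+
+```go
+package access
+
+// UserAccount is an illustrative sketch of the proposed userAccount object,
+// not the final API type.
+type UserAccount struct {
+	// UID is immutable; audit logs and object ownership refer to it.
+	UID string
+	// Name is a human-readable string, unique among userAccounts, and may be
+	// changed only while no Policy or other object refers to it. An email
+	// address is a suggested format.
+	Name string
+	// Labels can mark group membership or roles (Enterprise Profile).
+	Labels map[string]string
+	// DefaultNamespace is assumed when an API call omits a namespace.
+	DefaultNamespace string
+}
+```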
+
+In a simple deployment, the authentication method for a user might be an
+authentication token which is verified by a K8s server. In a more complex
+deployment, the authentication might be delegated to another system which is
+trusted by the K8s API to authenticate users, but where the authentication
+details are unknown to K8s.
+
+Initial Features:
+- There is no superuser `userAccount`.
+- `userAccount` objects are statically populated in the K8s API store by reading
+a config file. Only a K8s Cluster Admin can do this.
+- `userAccount` can have a default `namespace`. If an API call does not specify a
+`namespace`, the default `namespace` for that caller is assumed.
+- `userAccount` is global. A single human with access to multiple namespaces is
+recommended to only have one userAccount.
+
+Improvements:
+- Make `userAccount` part of a separate API group from core K8s objects like
+`pod`. This facilitates plugging in alternate Access Management.
+
+Simple Profile:
+ - Single `userAccount`, used by all K8s Users and Project Admins. One access
+token shared by all.
+
+Enterprise Profile:
+ - Every human user has own `userAccount`.
+ - `userAccount`s have labels that indicate both membership in groups, and
+ability to act in certain roles.
+ - Each service using the API has its own `userAccount` too (e.g. `scheduler`,
+`repcontroller`).
+ - Automated jobs to denormalize the LDAP group info into the local system
+list of users in the K8s userAccount file.
+
+### Unix accounts
+
+A `userAccount` is not a Unix user account. The fact that a pod is started by a
+`userAccount` does not mean that the processes in that pod's containers run as a
+Unix user with a corresponding name or identity.
+
+Initially:
+- The unix accounts available in a container, and used by the processes running
+in it, are those that are provided by the combination of the base
+operating system and the Docker manifest.
+- Kubernetes doesn't enforce any relation between `userAccount` and unix
+accounts.
+
+Improvements:
+- Kubelet allocates disjoint blocks of root-namespace uids for each container.
+This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
+- This requires Docker to integrate user namespace support, and a decision about
+what getpwnam() does for these uids.
+- any features that help users avoid use of privileged containers
+(http://issue.k8s.io/391)
+
+### Namespaces
+
+K8s will have a `namespace` API object. It is similar to a Google Compute
+Engine `project`. It provides a namespace for objects created by a group of
+people co-operating together, preventing name collisions with non-cooperating
+groups. It also serves as a reference point for authorization policies.
+
+Namespaces are described in [namespaces.md](namespaces.md).
+
+In the Enterprise Profile:
+ - a `userAccount` may have permission to access several `namespace`s.
+
+In the Simple Profile:
+ - There is a single `namespace` used by the single user.
+
+Namespaces vs. userAccounts vs. Labels:
+- `userAccount`s are intended for audit logging (both name and UID should be
+logged), and to define who has access to `namespace`s.
+- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md))
+should be used to distinguish pods, users, and other objects that cooperate
+towards a common goal but are different in some way, such as version, or
+responsibilities.
+- `namespace`s prevent name collisions between uncoordinated groups of people,
+and provide a place to attach common policies for co-operating groups of people.
+
+
+## Authentication
+
+Goals for K8s authentication:
+- Include a built-in authentication system with no configuration required to use
+in single-user mode, little configuration required to add several user
+accounts, and no https proxy required.
+- Allow for authentication to be handled by a system external to Kubernetes, to
+allow integration with existing enterprise authorization systems. The
+Kubernetes namespace itself should avoid taking contributions of multiple
+authorization schemes. Instead, a trusted proxy in front of the apiserver can be
+used to authenticate users.
+ - For organizations whose security requirements only allow FIPS compliant
+implementations (e.g. apache) for authentication.
+ - So the proxy can terminate SSL, and isolate the CA-signed certificate from
+less trusted, higher-touch APIserver.
+ - For organizations that already have existing SaaS web services (e.g.
+storage, VMs) and want a common authentication portal.
+- Avoid mixing authentication and authorization, so that authorization policies
+can be centrally managed, and to allow changes in authentication methods without
+affecting authorization code.
+
+Initially:
+- Tokens used to authenticate a user.
+- Long lived tokens identify a particular `userAccount`.
+- Administrator utility generates tokens at cluster setup.
+- OAuth2.0 Bearer tokens protocol (http://tools.ietf.org/html/rfc6750); see the
+sketch after this list.
+- No scopes for tokens. Authorization happens in the API server.
+- Tokens dynamically generated by apiserver to identify pods which are making
+API calls.
+- Tokens checked in a module of the APIserver.
+- Authentication in apiserver can be disabled by flag, to allow testing without
+authorization enabled, and to allow use of an authenticating proxy. In this
+mode, a query parameter or header added by the proxy will identify the caller.
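+
+As a concrete illustration of the bearer-token scheme referenced above, a client
+call might look like the following sketch. The apiserver address, path, and
+token are placeholders, and this is not the eventual client library:
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+)
+
+func main() {
+	// Placeholder address and resource path.
+	req, err := http.NewRequest("GET", "https://apiserver.example.com/api/v1/pods", nil)
+	if err != nil {
+		panic(err)
+	}
+	// RFC 6750 bearer token; an apiserver module maps it to a userAccount.
+	req.Header.Set("Authorization", "Bearer LONG_LIVED_TOKEN")
+
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		panic(err)
+	}
+	defer resp.Body.Close()
+	fmt.Println(resp.Status)
+}
+```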
+
+Improvements:
+- Refresh of tokens.
+- SSH keys to access inside containers.
+
+To be considered for subsequent versions:
+- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
+- Scoped tokens.
+- Tokens that are bound to the channel between the client and the api server
+ - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
+ - http://www.browserauth.net
+
+## Authorization
+
+K8s authorization should:
+- Allow for a range of maturity levels, from single-user for those test driving
+the system, to integration with existing enterprise authorization systems.
+- Allow for centralized management of users and policies. In some
+organizations, this will mean that the definition of users and access policies
+needs to reside on a system other than k8s and encompass other web services
+(such as a storage service).
+- Allow processes running in K8s Pods to take on identity, and to allow narrow
+scoping of permissions for those identities in order to limit damage from
+software faults.
+- Have Authorization Policies exposed as API objects so that a single config
+file can create or delete Pods, Replication Controllers, Services, and the
+identities and policies for those Pods and Replication Controllers.
+- Be separate as much as practical from Authentication, to allow Authentication
+methods to change over time and space, without impacting Authorization policies.
+
+K8s will implement a relatively simple
+[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
+
+The model will be described in more detail in a forthcoming document. The model
+will:
+- Be less complex than XACML
+- Be easily recognizable to those familiar with Amazon IAM Policies.
+- Have a subset/aliases/defaults which allow it to be used in a way comfortable
+to those users more familiar with Role-Based Access Control.
+
+Authorization policy is set by creating a set of Policy objects.
+
+The API Server will be the Enforcement Point for Policy. For each API call that
+it receives, it will construct the Attributes needed to evaluate the policy
+(what user is making the call, what resource they are accessing, what they are
+trying to do to that resource, etc) and pass those attributes to a Decision Point.
+The Decision Point code evaluates the Attributes against all the Policies and
+allows or denies the API call. The system will be modular enough that the
+Decision Point code can either be linked into the APIserver binary, or be
+another service that the apiserver calls for each Decision (with appropriate
+time-limited caching as needed for performance).
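+
+A minimal sketch of that flow in Go follows; the type and method names are
+hypothetical, not the eventual authorization API:
+
+```go
+package authz
+
+// Attributes are constructed by the API server for each request, per the text
+// above: who is calling, what they are accessing, and what they are doing.
+type Attributes struct {
+	User      string // userAccount name (the UID is logged separately)
+	Verb      string // e.g. "get", "create", "delete"
+	Resource  string // e.g. "pods"
+	Namespace string // empty for requests that are not namespace-scoped
+}
+
+// DecisionPoint evaluates Attributes against the stored Policy objects. It may
+// be linked into the apiserver binary or be a separate service it calls.
+type DecisionPoint interface {
+	// Authorize returns nil to allow the API call and an error to deny it.
+	Authorize(a Attributes) error
+}
+```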
+
+Some Policy objects may be applicable only to a single namespace; K8s Project
+Admins would be able to create those as needed. Other Policy objects may be
+applicable to all namespaces; a K8s Cluster Admin might create those in order to
+authorize a new type of controller to be used by all namespaces, or to make a
+K8s User into a K8s Project Admin.
+
+## Accounting
+
+The API should have a `quota` concept (see http://issue.k8s.io/442). A quota
+object relates a namespace (and optionally a label selector) to a maximum
+quantity of resources that may be used (see [resources design doc](resources.md)).
+
+Initially:
+- A `quota` object is immutable.
+- For hosted K8s systems that do billing, Project is the recommended level for
+billing accounts.
+- Every object that consumes resources should have a `namespace` so that
+Resource usage stats are roll-up-able to `namespace`.
+- K8s Cluster Admin sets quota objects by writing a config file.
+
+Improvements:
+- Allow one namespace to charge the quota for one or more other namespaces. This
+would be controlled by a policy which allows changing a `billing_namespace=`
+label on an object.
+- Allow quota to be set by namespace owners for (namespace x label) combinations
+(e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, don't
+allow the "webserver" namespace with "instance=test" to use more than 10 cores).
+- Tools to help write consistent quota config files based on number of nodes,
+historical namespace usages, QoS needs, etc.
+- Way for K8s Cluster Admin to incrementally adjust Quota objects.
+
+Simple profile:
+ - A single `namespace` with infinite resource limits.
+
+Enterprise profile:
+ - Multiple namespaces each with their own limits.
+
+Issues:
+- Need for locking or "eventual consistency" when multiple apiserver goroutines
+are accessing the object store and handling pod creations.
+
+
+## Audit Logging
+
+API actions can be logged.
+
+Initial implementation:
+- All API calls logged to nginx logs.
+
+Improvements:
+- API server does logging instead.
+- Policies to drop logging for high rate trusted API calls, or by users
+performing audit or other sensitive functions.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/admission_control.md b/contributors/design-proposals/admission_control.md
new file mode 100644
index 00000000..a7330104
--- /dev/null
+++ b/contributors/design-proposals/admission_control.md
@@ -0,0 +1,106 @@
+# Kubernetes Proposal - Admission Control
+
+**Related PR:**
+
+| Topic | Link |
+| ----- | ---- |
+| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
+
+## Background
+
+High level goals:
+* Enable an easy-to-use mechanism to provide admission control to the cluster.
+* Enable a provider to support multiple admission control strategies or author
+their own.
+* Ensure any rejected request can propagate errors back to the caller explaining
+why the request failed.
+
+Authorization via policy is focused on answering whether a user is authorized to
+perform an action.
+
+Admission Control is focused on whether the system will accept an authorized action.
+
+Kubernetes may choose to dismiss an authorized action based on any number of
+admission control strategies.
+
+This proposal documents the basic design, and describes how any number of
+admission control plug-ins could be injected.
+
+Implementations of specific admission control strategies are handled in separate
+documents.
+
+## kube-apiserver
+
+The kube-apiserver takes the following OPTIONAL arguments to enable admission
+control:
+
+| Option | Behavior |
+| ------ | -------- |
+| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. |
+| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. |
+
+An **AdmissionControl** plug-in is an implementation of the following interface:
+
+```go
+package admission
+
+// Attributes is an interface used by a plug-in to make an admission decision
+// on an individual request.
+type Attributes interface {
+ GetNamespace() string
+ GetKind() string
+ GetOperation() string
+ GetObject() runtime.Object
+}
+
+// Interface is an abstract, pluggable interface for Admission Control decisions.
+type Interface interface {
+ // Admit makes an admission decision based on the request attributes
+ // An error is returned if it denies the request.
+ Admit(a Attributes) (err error)
+}
+```
+
+A **plug-in** must be compiled into the binary, and is registered as an
+available option by providing a name and an implementation of admission.Interface.
+
+```go
+func init() {
+ admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil })
+}
+```
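+
+For illustration, a plug-in satisfying this contract can be as small as the
+following sketch. The real AlwaysDeny implementation lives in the plug-in tree;
+this version is shown in the same package as the Interface above for brevity:
+
+```go
+package admission
+
+import "errors"
+
+// alwaysDeny is the simplest possible plug-in: it rejects every
+// create/update/delete/connect request it is asked about.
+type alwaysDeny struct{}
+
+// Admit implements Interface by always returning an error, which causes the
+// API server to deny the request.
+func (alwaysDeny) Admit(a Attributes) error {
+	return errors.New("admission denied by AlwaysDeny plug-in")
+}
+
+// NewAlwaysDeny returns the instance referenced by RegisterPlugin above.
+func NewAlwaysDeny() Interface {
+	return alwaysDeny{}
+}
+```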
+
+A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go)
+
+```go
+ // Admission policies
+ _ "k8s.io/kubernetes/plugin/pkg/admission/admit"
+ _ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages"
+ _ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity"
+ ...
+ _ "<YOUR NEW PLUGIN>"
+```
+
+Invocation of admission control is handled by the **APIServer** and not
+individual **RESTStorage** implementations.
+
+This design assumes that **Issue 297** is adopted, and as a consequence, the
+general framework of the APIServer request/response flow will ensure the
+following:
+
+1. Incoming request
+2. Authenticate user
+3. Authorize user
+4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes)
+ - invoke each admission.Interface object in sequence
+5. Case on the operation:
+ - If operation=create|update, then validate(object) and persist
+ - If operation=delete, delete the object
+ - If operation=connect, exec
+
+If at any step, there is an error, the request is canceled.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/admission_control_limit_range.md b/contributors/design-proposals/admission_control_limit_range.md
new file mode 100644
index 00000000..06cce2cb
--- /dev/null
+++ b/contributors/design-proposals/admission_control_limit_range.md
@@ -0,0 +1,233 @@
+# Admission control plugin: LimitRanger
+
+## Background
+
+This document proposes a system for enforcing resource requirements constraints
+as part of admission control.
+
+## Use cases
+
+1. Ability to enumerate resource requirement constraints per namespace
+2. Ability to enumerate min/max resource constraints for a pod
+3. Ability to enumerate min/max resource constraints for a container
+4. Ability to specify default resource limits for a container
+5. Ability to specify default resource requests for a container
+6. Ability to enforce a ratio between request and limit for a resource.
+7. Ability to enforce min/max storage requests for persistent volume claims
+
+## Data Model
+
+The **LimitRange** resource is scoped to a **Namespace**.
+
+### Type
+
+```go
+// LimitType is a type of object that is limited
+type LimitType string
+
+const (
+ // Limit that applies to all pods in a namespace
+ LimitTypePod LimitType = "Pod"
+ // Limit that applies to all containers in a namespace
+ LimitTypeContainer LimitType = "Container"
+)
+
+// LimitRangeItem defines a min/max usage limit for any resource that matches
+// on kind.
+type LimitRangeItem struct {
+ // Type of resource that this limit applies to.
+ Type LimitType `json:"type,omitempty"`
+ // Max usage constraints on this kind by resource name.
+ Max ResourceList `json:"max,omitempty"`
+ // Min usage constraints on this kind by resource name.
+ Min ResourceList `json:"min,omitempty"`
+ // Default resource requirement limit value by resource name if resource limit
+ // is omitted.
+ Default ResourceList `json:"default,omitempty"`
+ // DefaultRequest is the default resource requirement request value by
+ // resource name if resource request is omitted.
+ DefaultRequest ResourceList `json:"defaultRequest,omitempty"`
+ // MaxLimitRequestRatio if specified, the named resource must have a request
+ // and limit that are both non-zero where limit divided by request is less
+ // than or equal to the enumerated value; this represents the max burst for
+ // the named resource.
+ MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"`
+}
+
+// LimitRangeSpec defines a min/max usage limit for resources that match
+// on kind.
+type LimitRangeSpec struct {
+ // Limits is the list of LimitRangeItem objects that are enforced.
+ Limits []LimitRangeItem `json:"limits"`
+}
+
+// LimitRange sets resource usage limits for each kind of resource in a
+// Namespace.
+type LimitRange struct {
+ TypeMeta `json:",inline"`
+ // Standard object's metadata.
+ // More info:
+ // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+ ObjectMeta `json:"metadata,omitempty"`
+
+ // Spec defines the limits enforced.
+ // More info:
+ // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+ Spec LimitRangeSpec `json:"spec,omitempty"`
+}
+
+// LimitRangeList is a list of LimitRange items.
+type LimitRangeList struct {
+ TypeMeta `json:",inline"`
+ // Standard list metadata.
+ // More info:
+ // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+ ListMeta `json:"metadata,omitempty"`
+
+ // Items is a list of LimitRange objects.
+ // More info:
+ // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
+ Items []LimitRange `json:"items"`
+}
+```
+
+### Validation
+
+Validation of a **LimitRange** enforces that for a given named resource the
+following rules apply:
+
+Min (if specified) <= DefaultRequest (if specified) <= Default (if specified)
+<= Max (if specified)
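+
+Expressed in Go, the ordering check for one named resource might look like this
+sketch (quantities are reduced to plain int64 values, with nil meaning "not
+specified"; the real implementation compares resource.Quantity values):
+
+```go
+package limitranger
+
+import "fmt"
+
+// validateOrdering checks Min <= DefaultRequest <= Default <= Max, skipping
+// any value that is not specified.
+func validateOrdering(min, defaultRequest, def, max *int64) error {
+	var prev *int64
+	for _, v := range []*int64{min, defaultRequest, def, max} {
+		if v == nil {
+			continue
+		}
+		if prev != nil && *prev > *v {
+			return fmt.Errorf("limit range values out of order: %d > %d", *prev, *v)
+		}
+		prev = v
+	}
+	return nil
+}
+```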
+
+### Default Value Behavior
+
+The following default value behaviors are applied to a LimitRange for a given
+named resource.
+
+```
+if LimitRangeItem.Default[resourceName] is undefined
+  if LimitRangeItem.Max[resourceName] is defined
+    LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName]
+```
+
+```
+if LimitRangeItem.DefaultRequest[resourceName] is undefined
+  if LimitRangeItem.Default[resourceName] is defined
+    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName]
+  else if LimitRangeItem.Min[resourceName] is defined
+    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName]
+```
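+
+The same defaulting rules, expressed as a Go sketch (ResourceList is simplified
+to a plain map of strings here rather than the real API type):
+
+```go
+package limitranger
+
+// ResourceList is simplified for this sketch; the real type maps ResourceName
+// to resource.Quantity.
+type ResourceList map[string]string
+
+// applyDefaults fills in Default and DefaultRequest for one named resource,
+// following the pseudocode above.
+func applyDefaults(max, min, def, defaultRequest ResourceList, resourceName string) {
+	if _, ok := def[resourceName]; !ok {
+		if v, ok := max[resourceName]; ok {
+			def[resourceName] = v
+		}
+	}
+	if _, ok := defaultRequest[resourceName]; !ok {
+		if v, ok := def[resourceName]; ok {
+			defaultRequest[resourceName] = v
+		} else if v, ok := min[resourceName]; ok {
+			defaultRequest[resourceName] = v
+		}
+	}
+}
+```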
+
+## AdmissionControl plugin: LimitRanger
+
+The **LimitRanger** plug-in introspects all incoming pod requests and evaluates
+the constraints defined on a LimitRange.
+
+If a constraint is not specified for an enumerated resource, it is not enforced
+or tracked.
+
+To enable the plug-in and support for LimitRange, the kube-apiserver must be
+configured as follows:
+
+```console
+$ kube-apiserver --admission-control=LimitRanger
+```
+
+### Enforcement of constraints
+
+**Type: Container**
+
+Supported Resources:
+
+1. memory
+2. cpu
+
+Supported Constraints:
+
+Per container, the following must hold true:
+
+| Constraint | Behavior |
+| ---------- | -------- |
+| Min | Min <= Request (required) <= Limit (optional) |
+| Max | Limit (required) <= Max |
+| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) |
+
+Supported Defaults:
+
+1. Default - if the named resource has no enumerated value, the Limit is equal
+to the Default
+2. DefaultRequest - if the named resource has no enumerated value, the Request
+is equal to the DefaultRequest
+
+**Type: Pod**
+
+Supported Resources:
+
+1. memory
+2. cpu
+
+Supported Constraints:
+
+Across all containers in a pod, the following must hold true:
+
+| Constraint | Behavior |
+| ---------- | -------- |
+| Min | Min <= Request (required) <= Limit (optional) |
+| Max | Limit (required) <= Max |
+| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) |
+
+**Type: PersistentVolumeClaim**
+
+Supported Resources:
+
+1. storage
+
+Supported Constraints:
+
+Across all claims in a namespace, the following must hold true:
+
+| Constraint | Behavior |
+| ---------- | -------- |
+| Min | Min <= Request (required) |
+| Max | Request (required) <= Max |
+
+Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time.
+
+## Run-time configuration
+
+The default `LimitRange` that is applied via Salt configuration will be
+updated as follows:
+
+```
+apiVersion: "v1"
+kind: "LimitRange"
+metadata:
+  name: "limits"
+  namespace: default
+spec:
+  limits:
+    - type: "Container"
+      defaultRequest:
+        cpu: "100m"
+```
+
+## Example
+
+An example LimitRange configuration:
+
+| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
+| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
+| Container | cpu | .1 | 1 | 500m | 250m | 4 |
+| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
+
+Assuming an incoming container that specifies no resource requirements,
+the following would happen.
+
+1. The incoming container cpu would request 250m with a limit of 500m.
+2. The incoming container memory would request 250Mi with a limit of 500Mi.
+3. If the container is later resized, its cpu would be constrained to between
+.1 and 1, and the ratio of limit to request could not exceed 4.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/admission_control_resource_quota.md b/contributors/design-proposals/admission_control_resource_quota.md
new file mode 100644
index 00000000..575db9a8
--- /dev/null
+++ b/contributors/design-proposals/admission_control_resource_quota.md
@@ -0,0 +1,215 @@
+# Admission control plugin: ResourceQuota
+
+## Background
+
+This document describes a system for enforcing hard resource usage limits per
+namespace as part of admission control.
+
+## Use cases
+
+1. Ability to enumerate resource usage limits per namespace.
+2. Ability to monitor resource usage for tracked resources.
+3. Ability to reject resource usage exceeding hard quotas.
+
+## Data Model
+
+The **ResourceQuota** object is scoped to a **Namespace**.
+
+```go
+// The following identify resource constants for Kubernetes object types
+const (
+ // Pods, number
+ ResourcePods ResourceName = "pods"
+ // Services, number
+ ResourceServices ResourceName = "services"
+ // ReplicationControllers, number
+ ResourceReplicationControllers ResourceName = "replicationcontrollers"
+ // ResourceQuotas, number
+ ResourceQuotas ResourceName = "resourcequotas"
+ // ResourceSecrets, number
+ ResourceSecrets ResourceName = "secrets"
+ // ResourcePersistentVolumeClaims, number
+ ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims"
+)
+
+// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
+type ResourceQuotaSpec struct {
+ // Hard is the set of desired hard limits for each named resource
+ Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+}
+
+// ResourceQuotaStatus defines the enforced hard limits and observed use
+type ResourceQuotaStatus struct {
+ // Hard is the set of enforced hard limits for each named resource
+ Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+ // Used is the current observed total usage of the resource in the namespace
+ Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
+}
+
+// ResourceQuota sets aggregate quota restrictions enforced per namespace
+type ResourceQuota struct {
+ TypeMeta `json:",inline"`
+ ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
+
+ // Spec defines the desired quota
+ Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
+
+ // Status defines the actual enforced quota and its current usage
+ Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
+}
+
+// ResourceQuotaList is a list of ResourceQuota items
+type ResourceQuotaList struct {
+ TypeMeta `json:",inline"`
+ ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
+
+ // Items is a list of ResourceQuota objects
+ Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+}
+```
+
+## Quota Tracked Resources
+
+The following resources are supported by the quota system:
+
+| Resource | Description |
+| ------------ | ----------- |
+| cpu | Total requested cpu usage |
+| memory | Total requested memory usage |
+| pods | Total number of active pods where phase is pending or active. |
+| services | Total number of services |
+| replicationcontrollers | Total number of replication controllers |
+| resourcequotas | Total number of resource quotas |
+| secrets | Total number of secrets |
+| persistentvolumeclaims | Total number of persistent volume claims |
+
+If a third-party wants to track additional resources, it must follow the
+resource naming conventions prescribed by Kubernetes. This means the resource
+must have a fully-qualified name (i.e. mycompany.org/shinynewresource)
+
+## Resource Requirements: Requests vs. Limits
+
+If a resource supports the ability to distinguish between a request and a limit
+for a resource, the quota tracking system will only cost the request value
+against the quota usage. If a resource is tracked by quota, and no request value
+is provided, the associated entity is rejected as part of admission.
+
+For example, consider the following scenarios relative to tracking quota on
+CPU:
+
+| Pod | Container | Request CPU | Limit CPU | Result |
+| --- | --------- | ----------- | --------- | ------ |
+| X | C1 | 100m | 500m | The quota usage is incremented 100m |
+| Y | C2 | 100m | none | The quota usage is incremented 100m |
+| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
+| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
+
+The rationale for accounting for the requested amount of a resource versus the
+limit is the belief that a user should only be charged for what they are
+scheduled against in the cluster. In addition, attempting to track usage against
+actual usage, where request < actual < limit, is considered highly volatile.
+
+As a consequence of this decision, the user is able to spread their usage of a
+resource across multiple tiers of service. Let's demonstrate this via an
+example with a 4 cpu quota.
+
+The quota may be allocated as follows:
+
+| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
+| --- | --------- | ----------- | --------- | ---- | ----------- |
+| X | C1 | 1 | 4 | Burstable | 1 |
+| Y | C2 | 2 | 2 | Guaranteed | 2 |
+| Z | C3 | 1 | 3 | Burstable | 1 |
+
+It is possible that the pods may consume 9 cpu over a given time period,
+depending on the available cpu of the nodes that held pods X and Z, but since we
+scheduled X and Z relative to their requests, we only track the requested value
+against their allocated quota. If one wants to restrict the ratio between the
+request and limit, it is encouraged that the user define a **LimitRange** with
+**LimitRequestRatio** to control burst-out behavior. This would, in effect, let
+an administrator keep the difference between request and limit more in line with
+tracked usage if desired.
+
+## Status API
+
+A REST API endpoint to update the status section of the **ResourceQuota** is
+exposed. It requires an atomic compare-and-swap in order to keep resource usage
+tracking consistent.
+
+## Resource Quota Controller
+
+A resource quota controller monitors observed usage for tracked resources in the
+**Namespace**.
+
+If there is an observed difference between the current usage stats and the
+current **ResourceQuota.Status**, the controller posts an update of the
+currently observed usage metrics to the **ResourceQuota** via the /status
+endpoint.
+
+The resource quota controller is the only component capable of monitoring and
+recording usage updates after a DELETE operation since admission control is
+incapable of guaranteeing a DELETE request actually succeeded.
+
+## AdmissionControl plugin: ResourceQuota
+
+The **ResourceQuota** plug-in introspects all incoming admission requests.
+
+To enable the plug-in and support for ResourceQuota, the kube-apiserver must be
+configured as follows:
+
+```
+$ kube-apiserver --admission-control=ResourceQuota
+```
+
+It makes decisions by evaluating the incoming object against all defined
+**ResourceQuota.Status.Hard** resource limits in the request namespace. If
+acceptance of the resource would cause the total usage of a named resource to
+exceed its hard limit, the request is denied.
+
+If the incoming request does not cause the total usage to exceed any of the
+enumerated hard resource limits, the plug-in will post a
+**ResourceQuota.Status** document to the server to atomically update the
+observed usage based on the previously read **ResourceQuota.ResourceVersion**.
+This keeps incremental usage atomically consistent, but does introduce a
+bottleneck (intentionally) into the system.
+
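+A rough sketch of that check-and-update cycle, with quantities reduced to
+integers and helper types simplified (the real implementation uses
+resource.Quantity and the generated client):
+
+```go
+package resourcequota
+
+import "fmt"
+
+// quotaStatus is a simplified stand-in for ResourceQuota.Status plus the
+// resourceVersion used for the compare-and-swap write to /status.
+type quotaStatus struct {
+	ResourceVersion string
+	Hard, Used      map[string]int64
+}
+
+// admit returns the updated status to write back via the /status endpoint, or
+// an error if the request would push any tracked resource over its hard limit.
+func admit(status quotaStatus, requested map[string]int64) (quotaStatus, error) {
+	if status.Used == nil {
+		status.Used = map[string]int64{}
+	}
+	for name, delta := range requested {
+		hard, tracked := status.Hard[name]
+		if !tracked {
+			continue // resource not covered by this quota
+		}
+		if status.Used[name]+delta > hard {
+			return status, fmt.Errorf("exceeded quota for %s: used %d, requested %d, hard %d",
+				name, status.Used[name], delta, hard)
+		}
+	}
+	// All limits respected: record the new usage. The write back to /status
+	// carries the ResourceVersion read above, so a concurrent update causes a
+	// conflict and the admission check is retried.
+	for name, delta := range requested {
+		if _, tracked := status.Hard[name]; tracked {
+			status.Used[name] += delta
+		}
+	}
+	return status, nil
+}
+```
+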
+To optimize system performance, it is encouraged that all resource quotas are
+tracked on the same **ResourceQuota** document in a **Namespace**. As a result,
+it is encouraged to impose a cap of 1, in the **ResourceQuota** document, on the
+total number of individual quotas tracked in the **Namespace**.
+
+## kubectl
+
+kubectl is modified to support the **ResourceQuota** resource.
+
+`kubectl describe` provides a human-readable output of quota.
+
+For example:
+
+```console
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml
+namespace "quota-example" created
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example
+resourcequota "quota" created
+$ kubectl describe quota quota --namespace=quota-example
+Name:                   quota
+Namespace:              quota-example
+Resource                Used    Hard
+--------                ----    ----
+cpu                     0       20
+memory                  0       1Gi
+persistentvolumeclaims  0       10
+pods                    0       10
+replicationcontrollers  0       20
+resourcequotas          1       1
+secrets                 1       10
+services                0       5
+```
+
+## More information
+
+See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/api-group.md b/contributors/design-proposals/api-group.md
new file mode 100644
index 00000000..435994fe
--- /dev/null
+++ b/contributors/design-proposals/api-group.md
@@ -0,0 +1,119 @@
+# Supporting multiple API groups
+
+## Goal
+
+1. Breaking the monolithic v1 API into modular groups and allowing groups to be enabled/disabled individually. This allows us to break the monolithic API server into smaller components in the future.
+
+2. Supporting different versions in different groups. This allows different groups to evolve at different speeds.
+
+3. Supporting identically named kinds to exist in different groups. This is useful when we experiment with new features of an API in the experimental group while supporting the stable API in the original group at the same time.
+
+4. Exposing the API groups and versions supported by the server. This is required to develop a dynamic client.
+
+5. Laying the basis for [API Plugin](../../docs/design/extending-api.md).
+
+6. Keeping the user interaction easy. For example, we should allow users to omit the group name when using kubectl if there is no ambiguity.
+
+
+## Bookkeeping for groups
+
+1. No changes to TypeMeta:
+
+    Currently many internal structures, such as RESTMapper and Scheme, are indexed and retrieved by APIVersion. For a fast implementation targeting the v1.1 deadline, we will concatenate group with version, in the form of "group/version", and use it where a version string is expected, so that much code can be reused. This implies we will not add a new field to TypeMeta; we will use TypeMeta.APIVersion to hold "group/version".
+
+ For backward compatibility, v1 objects belong to the group with an empty name, so existing v1 config files will remain valid.
+
+2. /pkg/conversion#Scheme:
+
+ The key of /pkg/conversion#Scheme.versionMap for versioned types will be "group/version". For now, the internal version types of all groups will be registered to versionMap[""], as we don't have any identically named kinds in different groups yet. In the near future, internal version types will be registered to versionMap["group/"], and pkg/conversion#Scheme.InternalVersion will have type []string.
+
+ We will need a mechanism to express if two kinds in different groups (e.g., compute/pods and experimental/pods) are convertible, and auto-generate the conversions if they are.
+
+3. meta.RESTMapper:
+
+ Each group will have its own RESTMapper (of type DefaultRESTMapper), and these mappers will be registered to pkg/api#RESTMapper (of type MultiRESTMapper).
+
+   To support identically named kinds in different groups, we need to expand the input of RESTMapper.VersionAndKindForResource from (resource string) to (group, resource string). If group is not specified and there is ambiguity (i.e., the resource exists in multiple groups), an error should be returned to force the user to specify the group (see the sketch below).
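+
+As an illustration of the two conventions above (hypothetical helpers, not the
+real Scheme/RESTMapper code), splitting a "group/version" string and resolving
+a possibly group-less resource could look like this:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// splitAPIVersion splits "group/version". For backward compatibility, a bare
+// version such as "v1" is treated as the legacy group with an empty name.
+func splitAPIVersion(apiVersion string) (group, version string) {
+    if i := strings.Index(apiVersion, "/"); i >= 0 {
+        return apiVersion[:i], apiVersion[i+1:]
+    }
+    return "", apiVersion
+}
+
+// resourceIndex maps a resource name to the groups that serve it.
+type resourceIndex map[string][]string
+
+// groupFor resolves the group for a resource. If the group is omitted and the
+// resource exists in more than one group, an error forces the user to be explicit.
+func (idx resourceIndex) groupFor(group, resource string) (string, error) {
+    if group != "" {
+        return group, nil
+    }
+    groups := idx[resource]
+    switch len(groups) {
+    case 0:
+        return "", fmt.Errorf("unknown resource %q", resource)
+    case 1:
+        return groups[0], nil
+    default:
+        return "", fmt.Errorf("resource %q is ambiguous across groups %v; please specify a group", resource, groups)
+    }
+}
+
+func main() {
+    group, version := splitAPIVersion("experimental/v1alpha1")
+    fmt.Println(group, version) // experimental v1alpha1
+
+    idx := resourceIndex{"pods": {"", "experimental"}}
+    _, err := idx.groupFor("", "pods")
+    fmt.Println(err) // ambiguity error
+}
+```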
+
+## Server-side implementation
+
+1. resource handlers' URL:
+
+ We will force the URL to be in the form of prefix/group/version/...
+
+ Prefix is used to differentiate API paths from other paths like /healthz. All groups will use the same prefix="apis", except when backward compatibility requires otherwise. No "/" is allowed in prefix, group, or version. Specifically,
+
+   * for /api/v1, we set the prefix="api" (which is populated from cmd/kube-apiserver/app#APIServer.APIPrefix), group="", version="v1", so the URL remains /api/v1.
+
+ * for new kube API groups, we will set the prefix="apis" (we will add a field in type APIServer to hold this prefix), group=GROUP_NAME, version=VERSION. For example, the URL of the experimental resources will be /apis/experimental/v1alpha1.
+
+ * for OpenShift v1 API, because it's currently registered at /oapi/v1, to be backward compatible, OpenShift may set prefix="oapi", group="".
+
+   * for other new third-party APIs, they should also use prefix="apis" and choose the group and version. This can be done through the thirdparty API plugin mechanism in [13000](http://pr.k8s.io/13000).
+
+2. supporting API discovery:
+
+   * At /prefix (e.g., /apis), API server will return the supported groups and their versions using the pkg/api/unversioned#APIVersions type, setting the Versions field to "group/version". This is backward compatible, because currently the API server does return "v1" encoded in pkg/api/unversioned#APIVersions at /api. (We will also rename the JSON field name from `versions` to `apiVersions`, to be consistent with the pkg/api#TypeMeta.APIVersion field.)
+
+ * At /prefix/group, API server will return all supported versions of the group. We will create a new type VersionList (name is open to discussion) in pkg/api/unversioned as the API.
+
+ * At /prefix/group/version, API server will return all supported resources in this group, and whether each resource is namespaced. We will create a new type APIResourceList (name is open to discussion) in pkg/api/unversioned as the API.
+
+   We will design how to handle deeper paths in other proposals.
+
+ * At /swaggerapi/swagger-version/prefix/group/version, API server will return the Swagger spec of that group/version in `swagger-version` (e.g. we may support both Swagger v1.2 and v2.0).
+
+3. handling common API objects:
+
+ * top-level common API objects:
+
+ To handle the top-level API objects that are used by all groups, we either have to register them to all schemes, or we can choose not to encode them to a version. We plan to take the latter approach and place such types in a new package called `unversioned`, because many of the common top-level objects, such as APIVersions, VersionList, and APIResourceList, which are used in the API discovery, and pkg/api#Status, are part of the protocol between client and server, and do not belong to the domain-specific parts of the API, which will evolve independently over time.
+
+ Types in the unversioned package will not have the APIVersion field, but may retain the Kind field.
+
+ For backward compatibility, when handling the Status, the server will encode it to v1 if the client expects the Status to be encoded in v1, otherwise the server will send the unversioned#Status. If an error occurs before the version can be determined, the server will send the unversioned#Status.
+
+ * non-top-level common API objects:
+
+ Assuming object o belonging to group X is used as a field in an object belonging to group Y, currently genconversion will generate the conversion functions for o in package Y. Hence, we don't need any special treatment for non-top-level common API objects.
+
+ TypeMeta is an exception, because it is a common object that is used by objects in all groups but does not logically belong to any group. We plan to move it to the package `unversioned`.
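+
+A minimal sketch of the Status fallback described in point 3, using
+hypothetical stand-ins for the real pkg/api/unversioned types:
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// Status keeps the Kind field but deliberately has no APIVersion field.
+type Status struct {
+    Kind    string `json:"kind,omitempty"`
+    Status  string `json:"status,omitempty"`
+    Message string `json:"message,omitempty"`
+}
+
+// v1Status is the legacy shape, kept for clients that expect "v1".
+type v1Status struct {
+    Kind       string `json:"kind,omitempty"`
+    APIVersion string `json:"apiVersion,omitempty"`
+    Status     string `json:"status,omitempty"`
+    Message    string `json:"message,omitempty"`
+}
+
+// encodeStatus picks the wire form: v1 when the client's negotiated version is
+// known to be "v1", the unversioned form otherwise (including when an error
+// occurred before the version could be determined, i.e. version == "").
+func encodeStatus(version string, s Status) ([]byte, error) {
+    if version == "v1" {
+        return json.Marshal(v1Status{Kind: "Status", APIVersion: "v1", Status: s.Status, Message: s.Message})
+    }
+    return json.Marshal(s)
+}
+
+func main() {
+    s := Status{Kind: "Status", Status: "Failure", Message: "not found"}
+    for _, v := range []string{"v1", ""} {
+        out, _ := encodeStatus(v, s)
+        fmt.Println(string(out))
+    }
+}
+```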
+
+## Client-side implementation
+
+1. clients:
+
+   Currently we have structured (pkg/client/unversioned#ExperimentalClient, pkg/client/unversioned#Client) and unstructured (pkg/kubectl/resource#Helper) clients. The structured clients are not scalable because each of them implements a specific interface, e.g., [here](../../pkg/client/unversioned/client.go#L32). Only the unstructured clients are scalable. We should either auto-generate the code for structured clients or migrate to use the unstructured clients as much as possible.
+
+ We should also move the unstructured client to pkg/client/.
+
+2. Spelling the URL:
+
+ The URL is in the form of prefix/group/version/. The prefix is hard-coded in the client/unversioned.Config. The client should be able to figure out `group` and `version` using the RESTMapper. For a third-party client which does not have access to the RESTMapper, it should discover the mapping of `group`, `version` and `kind` by querying the server as described in point 2 of #server-side-implementation.
+
+3. kubectl:
+
+   kubectl should accept arguments like `group/resource` and `group/resource/name`. The user can omit the `group`, in which case kubectl shall rely on RESTMapper.VersionAndKindForResource() to figure out the default group/version of the resource. For example, for resources (like `node`) that exist in both the k8s v1 API and a modularized k8s API (like `infra/v2`), we should set a kubectl default to use one of them. If there is no default group, kubectl should return an error for the ambiguity (see the sketch after this list).
+
+ When kubectl is used with a single resource type, the --api-version and --output-version flag of kubectl should accept values in the form of `group/version`, and they should work as they do today. For multi-resource operations, we will disable these two flags initially.
+
+ Currently, by setting pkg/client/unversioned/clientcmd/api/v1#Config.NamedCluster[x].Cluster.APIVersion ([here](../../pkg/client/unversioned/clientcmd/api/v1/types.go#L58)), user can configure the default apiVersion used by kubectl to talk to server. It does not make sense to set a global version used by kubectl when there are multiple groups, so we plan to deprecate this field. We may extend the version negotiation function to negotiate the preferred version of each group. Details will be in another proposal.
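+
+A minimal sketch of the kubectl-side resolution described in point 3, combined
+with the URL spelling from point 2 (the default table and helper below are
+hypothetical, not the real kubectl code):
+
+```go
+package main
+
+import "fmt"
+
+// defaultGroupVersion maps a resource to the group/version kubectl should use
+// when the user omits the group. An empty group means the legacy /api/v1 path.
+var defaultGroupVersion = map[string][2]string{
+    "pods": {"", "v1"},
+    "jobs": {"experimental", "v1alpha1"}, // hypothetical example
+}
+
+// pathFor spells the request path prefix/group/version/resource, or returns an
+// error when no default group is configured for an ambiguous resource.
+func pathFor(resource string) (string, error) {
+    gv, ok := defaultGroupVersion[resource]
+    if !ok {
+        return "", fmt.Errorf("no default group for resource %q; specify group/resource", resource)
+    }
+    group, version := gv[0], gv[1]
+    if group == "" {
+        return "/api/" + version + "/" + resource, nil
+    }
+    return "/apis/" + group + "/" + version + "/" + resource, nil
+}
+
+func main() {
+    for _, r := range []string{"pods", "jobs", "widgets"} {
+        p, err := pathFor(r)
+        fmt.Println(r, p, err)
+    }
+}
+```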
+
+## OpenShift integration
+
+OpenShift can take a similar approach to break up its monolithic v1 API: keeping the v1 objects where they are, and gradually adding groups.
+
+The v1 objects in OpenShift should keep doing what they do now: they should remain registered to the Scheme.versionMap["v1"] scheme and keep being added to originMapper.
+
+New OpenShift groups should do the same as native Kubernetes groups would do: each group should register to Scheme.versionMap["group/version"], and each should have a separate RESTMapper that is registered with the MultiRESTMapper.
+
+To expose a list of the supported OpenShift groups to clients, OpenShift just has to call pkg/cmd/server/origin#initAPIVersionRoute() as it does now, passing in the supported "group/versions" instead of "versions".
+
+
+## Future work
+
+1. Dependencies between groups: we need an interface to register the dependencies between groups. It is not our priority now as the use cases are not clear yet.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/api-group.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/apiserver-watch.md b/contributors/design-proposals/apiserver-watch.md
new file mode 100644
index 00000000..9764768f
--- /dev/null
+++ b/contributors/design-proposals/apiserver-watch.md
@@ -0,0 +1,145 @@
+## Abstract
+
+In the current system, most watch requests sent to apiserver are redirected to
+etcd. This means that for every watch request the apiserver opens a watch on
+etcd.
+
+The purpose of the proposal is to improve the overall performance of the system
+by solving the following problems:
+
+- having too many open watches on etcd
+- avoiding deserializing/converting the same objects multiple times in different
+watch results
+
+In the future, we would also like to add an indexing mechanism to the watch.
+Although Indexer is not part of this proposal, it is supposed to be compatible
+with it - in the future Indexer should be incorporated into the proposed new
+watch solution in apiserver without requiring any redesign.
+
+
+## High level design
+
+We are going to solve those problems by allowing many clients to watch the same
+storage in the apiserver, without being redirected to etcd.
+
+At the high level, apiserver will have a single watch open to etcd, watching all
+the objects (of a given type) without any filtering. The changes delivered from
+etcd will then be stored in a cache in apiserver. This cache is in fact a
+"rolling history window" that will support clients having some amount of latency
+between their list and watch calls. Thus it will have a limited capacity, and
+whenever a new change comes from etcd while the cache is full, the oldest change
+will be removed to make room for the new one.
+
+When a client sends a watch request to apiserver, instead of redirecting it to
+etcd, it will cause:
+
+ - registering a handler to receive all new changes coming from etcd
+ - iterating through a watch window, starting at the requested resourceVersion
+   to the head and sending filtered changes directly to the client, blocking
+   the above until this iteration has caught up
+
+This will be done by creating a go-routine per watcher that will be responsible
+for performing the above.
+
+The following section describes the proposal in more detail, analyzes some
+corner cases, and divides the whole design into more fine-grained steps.
+
+
+## Proposal details
+
+We would like the cache to be __per-resource-type__ and __optional__. Thanks to
+it we will be able to:
+ - have different cache sizes for different resources (e.g. bigger cache
+ [= longer history] for pods, which can significantly affect performance)
+ - avoid any overhead for objects that are watched very rarely (e.g. events
+ are almost not watched at all, but there are a lot of them)
+ - filter the cache for each watcher more effectively
+
+If we decide to support watches spanning different resources in the future and
+we have an efficient indexing mechanism, it should be relatively simple to unify
+the cache to be common for all the resources.
+
+The rest of this section describes the concrete steps that need to be done
+to implement the proposal.
+
+1. Since we want the watch in apiserver to be optional for different resource
+types, this needs to be self-contained and hidden behind a well defined API.
+This should be a layer very close to etcd - in particular all registries:
+"pkg/registry/generic/registry" should be built on top of it.
+We will solve this by extracting the interface of tools.EtcdHelper and
+treating this interface as the API - the whole watch mechanism in
+apiserver will be hidden behind that interface.
+Thanks to this we will get an initial implementation for free and will just
+need to reimplement a few relevant functions (probably just Watch and List).
+Moreover, this will not require any changes in other parts of the code.
+This step is about extracting the interface of tools.EtcdHelper.
+
+2. Create a FIFO cache with a given capacity. In its "rolling history window"
+we will store two things:
+
+ - the resourceVersion of the object (being an etcdIndex)
+ - the object watched from etcd itself (in a deserialized form)
+
+   This should be as simple as having an array and treating it as a cyclic
+   buffer (a minimal sketch of such a buffer follows this list).
+ Obviously resourceVersion of objects watched from etcd will be increasing, but
+ they are necessary for registering a new watcher that is interested in all the
+ changes since a given etcdIndex.
+
+   Additionally, we should support the LIST operation, otherwise clients can never
+   start watching from "now". We may consider passing lists through etcd; however,
+ this will not work once we have Indexer, so we will need that information
+ in memory anyway.
+ Thus, we should support LIST operation from the "end of the history" - i.e.
+ from the moment just after the newest cached watched event. It should be
+ pretty simple to do, because we can incrementally update this list whenever
+   a new watch event is received from etcd.
+ We may consider reusing existing structures cache.Store or cache.Indexer
+ ("pkg/client/cache") but this is not a hard requirement.
+
+3. Create the new implementation of the API, which will internally have a
+single watch open to etcd and will store the data received from etcd in
+the FIFO cache - this includes implementing registration of a new watcher
+which will start a new go-routine responsible for iterating over the cache
+and sending all the objects the watcher is interested in (by applying the
+filtering function) to the watcher.
+
+4. Add support for processing "error too old" from etcd, which will require:
+ - disconnect all the watchers
+ - clear the internal cache and relist all objects from etcd
+ - start accepting watchers again
+
+5. Enable watch in apiserver for some of the existing resource types - this
+should require only changes at the initialization level.
+
+6. The next step will be to incorporate some indexing mechanism, but details
+of it are TBD.
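+
+A minimal sketch of the rolling-history buffer from step 2 (illustrative only,
+not the proposed implementation): a fixed-capacity cyclic buffer of
+(resourceVersion, object) pairs that can replay everything newer than a given
+resourceVersion, or report that the requested version has already fallen out
+of the window:
+
+```go
+package main
+
+import "fmt"
+
+type event struct {
+    resourceVersion uint64      // etcdIndex of the change
+    object          interface{} // deserialized object watched from etcd
+}
+
+type watchCache struct {
+    buf   []event
+    start int // index of the oldest cached event
+    size  int
+}
+
+func newWatchCache(capacity int) *watchCache {
+    return &watchCache{buf: make([]event, capacity)}
+}
+
+// add appends a new event, evicting the oldest one when the window is full.
+func (c *watchCache) add(e event) {
+    if c.size < len(c.buf) {
+        c.buf[(c.start+c.size)%len(c.buf)] = e
+        c.size++
+        return
+    }
+    c.buf[c.start] = e
+    c.start = (c.start + 1) % len(c.buf)
+}
+
+// since returns all cached events newer than rv, or ok=false when rv is older
+// than the window (the caller must then relist, as in step 4).
+func (c *watchCache) since(rv uint64) (out []event, ok bool) {
+    if c.size > 0 && rv+1 < c.buf[c.start].resourceVersion {
+        return nil, false // "error too old" - history no longer covers rv
+    }
+    for i := 0; i < c.size; i++ {
+        e := c.buf[(c.start+i)%len(c.buf)]
+        if e.resourceVersion > rv {
+            out = append(out, e)
+        }
+    }
+    return out, true
+}
+
+func main() {
+    c := newWatchCache(3)
+    for rv := uint64(1); rv <= 5; rv++ {
+        c.add(event{resourceVersion: rv})
+    }
+    events, ok := c.since(3)
+    fmt.Println(len(events), ok) // 2 true (the events for 4 and 5)
+}
+```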
+
+
+
+### Future optimizations:
+
+1. The implementation of watch in apiserver internally will open a single
+watch to etcd, responsible for watching all the changes of objects of a given
+resource type. However, this watch can potentially expire at any time and
+reconnecting can return "too old resource version". In that case relisting is
+necessary. In such a case, to avoid LIST requests coming from all watchers at
+the same time, we can introduce an additional etcd event type:
+[EtcdResync](../../pkg/storage/etcd/etcd_watcher.go#L36)
+
+   Whenever relisting is done to refresh the internal watch to etcd, an
+   EtcdResync event will be sent to all the watchers. It will contain the
+ full list of all the objects the watcher is interested in (appropriately
+ filtered) as the parameter of this watch event.
+ Thus, we need to create the EtcdResync event, extend watch.Interface and
+ its implementations to support it and handle those events appropriately
+ in places like
+ [Reflector](../../pkg/client/cache/reflector.go)
+
+   However, this might turn out to be an unnecessary optimization if the
+   apiserver always keeps up (which is possible in the new design). We will work
+ out all necessary details at that point.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apiserver-watch.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/apparmor.md b/contributors/design-proposals/apparmor.md
new file mode 100644
index 00000000..d7051567
--- /dev/null
+++ b/contributors/design-proposals/apparmor.md
@@ -0,0 +1,310 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Overview](#overview)
+ - [Motivation](#motivation)
+ - [Related work](#related-work)
+- [Alpha Design](#alpha-design)
+ - [Overview](#overview-1)
+ - [Prerequisites](#prerequisites)
+ - [API Changes](#api-changes)
+ - [Pod Security Policy](#pod-security-policy)
+ - [Deploying profiles](#deploying-profiles)
+ - [Testing](#testing)
+- [Beta Design](#beta-design)
+ - [API Changes](#api-changes-1)
+- [Future work](#future-work)
+ - [System component profiles](#system-component-profiles)
+ - [Deploying profiles](#deploying-profiles-1)
+ - [Custom app profiles](#custom-app-profiles)
+ - [Security plugins](#security-plugins)
+ - [Container Runtime Interface](#container-runtime-interface)
+ - [Alerting](#alerting)
+ - [Profile authoring](#profile-authoring)
+- [Appendix](#appendix)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Overview
+
+AppArmor is a [mandatory access control](https://en.wikipedia.org/wiki/Mandatory_access_control)
+(MAC) system for Linux that supplements the standard Linux user and group based
+permissions. AppArmor can be configured for any application to reduce the potential attack surface
+and provide greater [defense in depth](https://en.wikipedia.org/wiki/Defense_in_depth_(computing)).
+It is configured through profiles tuned to whitelist the access needed by a specific program or
+container, such as Linux capabilities, network access, file permissions, etc. Each profile can be
+run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which
+only reports violations.
+
+AppArmor is similar to SELinux. Both are MAC systems implemented as a Linux security module (LSM),
+and are mutually exclusive. SELinux offers a lot of power and very fine-grained controls, but is
+generally considered very difficult to understand and maintain. AppArmor sacrifices some of that
+flexibility in favor of ease of use. Seccomp-bpf is another Linux kernel security feature for
+limiting attack surface, and can (and should!) be used alongside AppArmor.
+
+## Motivation
+
+AppArmor can enable users to run a more secure deployment, and / or provide better auditing and
+monitoring of their systems. Although it is not the only solution, we should enable AppArmor for
+users that want a simpler alternative to SELinux, or are already maintaining a set of AppArmor
+profiles. We have heard from multiple Kubernetes users already that AppArmor support is important to
+them. The [seccomp proposal](../../docs/design/seccomp.md#use-cases) details several use cases that
+also apply to AppArmor.
+
+## Related work
+
+Much of this design is drawn from the work already done to support seccomp profiles in Kubernetes,
+which is outlined in the [seccomp design doc](../../docs/design/seccomp.md). The designs should be
+kept close to apply lessons learned, and reduce cognitive and maintenance overhead.
+
+Docker has supported AppArmor profiles since version 1.3, and maintains a default profile which is
+applied to all containers on supported systems.
+
+AppArmor was upstreamed into the Linux kernel in version 2.6.36. It is currently maintained by
+[Canonical](http://www.canonical.com/), is shipped by default on all Ubuntu and openSUSE systems,
+and is supported on several
+[other distributions](http://wiki.apparmor.net/index.php/Main_Page#Distributions_and_Ports).
+
+# Alpha Design
+
+This section describes the proposed design for
+[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although
+additional features are described in [future work](#future-work). For AppArmor alpha support
+(targeted for Kubernetes 1.4) we will enable:
+
+- Specifying a pre-loaded profile to apply to a pod container
+- Restricting pod containers to a set of profiles (admin use case)
+
+We will also provide a reference implementation of a pod for loading profiles on nodes, but an
+official supported mechanism for deploying profiles is out of scope for alpha.
+
+## Overview
+
+An AppArmor profile can be specified for a container through the Kubernetes API with a pod
+annotation. If a profile is specified, the Kubelet will verify that the node meets the required
+[prerequisites](#prerequisites) (e.g. the profile is already configured on the node) before starting
+the container, and will not run the container if the profile cannot be applied. If the requirements
+are met, the container runtime will configure the appropriate options to apply the profile. Profile
+requirements and defaults can be specified on the
+[PodSecurityPolicy](security-context-constraints.md).
+
+## Prerequisites
+
+When an AppArmor profile is specified, the Kubelet will verify the prerequisites for applying the
+profile to the container. In order to [fail
+securely](https://www.owasp.org/index.php/Fail_securely), a container **will not be run** if any of
+the prerequisites are not met. The prerequisites are:
+
+1. **Kernel support** - The AppArmor kernel module is loaded. Can be checked by
+ [libcontainer](https://github.com/opencontainers/runc/blob/4dedd0939638fc27a609de1cb37e0666b3cf2079/libcontainer/apparmor/apparmor.go#L17).
+2. **Runtime support** - For the initial implementation, Docker will be required (rkt does not
+ currently have AppArmor support). All supported Docker versions include AppArmor support. See
+ [Container Runtime Interface](#container-runtime-interface) for other runtimes.
+3. **Installed profile** - The target profile must be loaded prior to starting the container. Loaded
+ profiles can be found in the AppArmor securityfs \[1\].
+
+If any of the prerequisites are not met an event will be generated to report the error and the pod
+will be
+[rejected](https://github.com/kubernetes/kubernetes/blob/cdfe7b7b42373317ecd83eb195a683e35db0d569/pkg/kubelet/kubelet.go#L2201)
+by the Kubelet.
+
+*[1] The securityfs can be found in `/proc/mounts`, and defaults to `/sys/kernel/security` on my
+Ubuntu system. The profiles can be found at `{securityfs}/apparmor/profiles`
+([example](http://bazaar.launchpad.net/~apparmor-dev/apparmor/master/view/head:/utils/aa-status#L137)).*
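+
+A minimal sketch of how the third prerequisite could be checked (assuming the
+default securityfs mount point rather than resolving it from `/proc/mounts`;
+the helper below is hypothetical, not Kubelet code):
+
+```go
+package main
+
+import (
+    "bufio"
+    "fmt"
+    "os"
+    "strings"
+)
+
+// loadedProfiles reads {securityfs}/apparmor/profiles and returns the names of
+// the loaded profiles; each line looks roughly like "<profile name> (<mode>)".
+func loadedProfiles(securityfs string) (map[string]bool, error) {
+    f, err := os.Open(securityfs + "/apparmor/profiles")
+    if err != nil {
+        return nil, err // typically means AppArmor is not enabled in the kernel
+    }
+    defer f.Close()
+
+    profiles := map[string]bool{}
+    scanner := bufio.NewScanner(f)
+    for scanner.Scan() {
+        fields := strings.Fields(scanner.Text())
+        if len(fields) > 0 {
+            profiles[fields[0]] = true
+        }
+    }
+    return profiles, scanner.Err()
+}
+
+func main() {
+    profiles, err := loadedProfiles("/sys/kernel/security")
+    if err != nil {
+        fmt.Println("AppArmor prerequisites not met:", err)
+        return
+    }
+    fmt.Println("docker-default loaded:", profiles["docker-default"])
+}
+```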
+
+## API Changes
+
+The initial alpha support of AppArmor will follow the pattern
+[used by seccomp](https://github.com/kubernetes/kubernetes/pull/25324) and specify profiles through
+annotations. Profiles can be specified per-container through pod annotations. The annotation format
+is a key matching the container, and a profile name value:
+
+```
+container.apparmor.security.alpha.kubernetes.io/<container_name>=<profile_name>
+```
+
+The profiles can be specified in the following formats (following the convention used by [seccomp](../../docs/design/seccomp.md#api-changes)):
+
+1. `runtime/default` - Applies the default profile for the runtime. For docker, the profile is
+ generated from a template
+ [here](https://github.com/docker/docker/blob/master/profiles/apparmor/template.go). If no
+ AppArmor annotations are provided, this profile is enabled by default if AppArmor is enabled in
+ the kernel. Runtimes may define this to be unconfined, as Docker does for privileged pods.
+2. `localhost/<profile_name>` - The profile name specifies the profile to load.
+
+*Note: There is no way to explicitly specify an "unconfined" profile, since it is discouraged. If
+ this is truly needed, the user can load an "allow-all" profile.*
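+
+As a rough illustration (hypothetical helpers, not the actual Kubelet or
+validation code), constructing the annotation key for a container and checking
+the two documented value formats could look like this:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+const annotationKeyPrefix = "container.apparmor.security.alpha.kubernetes.io/"
+
+// annotationKey builds the per-container annotation key described above.
+func annotationKey(containerName string) string {
+    return annotationKeyPrefix + containerName
+}
+
+// validProfile accepts the two documented formats: "runtime/default" and
+// "localhost/<profile_name>".
+func validProfile(profile string) bool {
+    if profile == "runtime/default" {
+        return true
+    }
+    return strings.HasPrefix(profile, "localhost/") && len(profile) > len("localhost/")
+}
+
+func main() {
+    fmt.Println(annotationKey("nginx"))                     // ...kubernetes.io/nginx
+    fmt.Println(validProfile("localhost/my-nginx-profile")) // true
+    fmt.Println(validProfile("unconfined"))                 // false - not expressible
+}
+```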
+
+### Pod Security Policy
+
+The [PodSecurityPolicy](security-context-constraints.md) allows cluster administrators to control
+the security context for a pod and its containers. An annotation can be specified on the
+PodSecurityPolicy to restrict which AppArmor profiles can be used, and specify a default if no
+profile is specified.
+
+The annotation key is `apparmor.security.alpha.kubernetes.io/allowedProfileNames`. The value is a
+comma-delimited list, with each item following the format described [above](#api-changes). If a list
+of profiles is provided and a pod does not have an AppArmor annotation, the first profile in the
+list will be used by default.
+
+Enforcement of the policy is standard. See the
+[seccomp implementation](https://github.com/kubernetes/kubernetes/pull/28300) as an example.
+
+## Deploying profiles
+
+We will provide a reference implementation of a DaemonSet pod for loading profiles on nodes, but
+there will not be an official mechanism or API in the initial version (see
+[future work](#deploying-profiles-1)). The reference container will contain the `apparmor_parser`
+tool and a script for using the tool to load all profiles in a set of (configurable)
+directories. The initial implementation will poll (with a configurable interval) the directories for
+additions, but will not update or unload existing profiles. The pod can be run in a DaemonSet to
+load the profiles onto all nodes. The pod will need to be run in privileged mode.
+
+This simple design should be sufficient to deploy AppArmor profiles from any volume source, such as
+a ConfigMap or PersistentDisk. Users seeking more advanced features should be able to extend this
+design easily.
+
+## Testing
+
+Our e2e testing framework does not currently run nodes with AppArmor enabled, but we can run a node
+e2e test suite on an AppArmor enabled node. The cases we should test are:
+
+- *PodSecurityPolicy* - These tests can be run on a cluster even if AppArmor is not enabled on the
+ nodes.
+ - No AppArmor policy allows pods with arbitrary profiles
+ - With a policy a default is selected
+ - With a policy arbitrary profiles are prevented
+ - With a policy allowed profiles are allowed
+- *Node AppArmor enforcement* - These tests need to run on AppArmor enabled nodes, in the node e2e
+ suite.
+ - A valid container profile gets applied
+ - An unloaded profile will be rejected
+
+# Beta Design
+
+The only part of the design that changes for beta is the API, which is upgraded from
+annotation-based to first class fields.
+
+## API Changes
+
+AppArmor profiles will be specified in the container's SecurityContext, as part of an
+`AppArmorOptions` struct. The options struct makes the API more flexible to future additions.
+
+```go
+type SecurityContext struct {
+ ...
+ // The AppArmor options to be applied to the container.
+ AppArmorOptions *AppArmorOptions `json:"appArmorOptions,omitempty"`
+ ...
+}
+
+// Reference to an AppArmor profile loaded on the host.
+type AppArmorProfileName string
+
+// Options specifying how to run Containers with AppArmor.
+type AppArmorOptions struct {
+ // The profile the Container must be run with.
+ Profile AppArmorProfileName `json:"profile"`
+}
+```
+
+The `AppArmorProfileName` format matches the format for the profile annotation values described
+[above](#api-changes).
+
+The `PodSecurityPolicySpec` receives a similar treatment with the addition of an
+`AppArmorStrategyOptions` struct. Here the `DefaultProfile` is separated from the `AllowedProfiles`
+in the interest of making the behavior more explicit.
+
+```go
+type PodSecurityPolicySpec struct {
+ ...
+ AppArmorStrategyOptions *AppArmorStrategyOptions `json:"appArmorStrategyOptions,omitempty"`
+ ...
+}
+
+// AppArmorStrategyOptions specifies AppArmor restrictions and requirements for pods and containers.
+type AppArmorStrategyOptions struct {
+ // If non-empty, all pod containers must be run with one of the profiles in this list.
+ AllowedProfiles []AppArmorProfileName `json:"allowedProfiles,omitempty"`
+ // The default profile to use if a profile is not specified for a container.
+ // Defaults to "runtime/default". Must be allowed by AllowedProfiles.
+ DefaultProfile AppArmorProfileName `json:"defaultProfile,omitempty"`
+}
+```
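+
+A minimal sketch of how these options could be applied (a hypothetical helper
+reusing the types above; assumes an `fmt` import):
+
+```go
+// resolveProfile defaults an unspecified profile and checks it against the
+// allowed list; an empty DefaultProfile falls back to "runtime/default".
+func resolveProfile(opts *AppArmorStrategyOptions, requested AppArmorProfileName) (AppArmorProfileName, error) {
+    if opts == nil {
+        return requested, nil // no policy: nothing to enforce
+    }
+    if requested == "" {
+        if opts.DefaultProfile != "" {
+            requested = opts.DefaultProfile
+        } else {
+            requested = "runtime/default"
+        }
+    }
+    if len(opts.AllowedProfiles) == 0 {
+        return requested, nil
+    }
+    for _, allowed := range opts.AllowedProfiles {
+        if requested == allowed {
+            return requested, nil
+        }
+    }
+    return "", fmt.Errorf("AppArmor profile %q is not allowed by the PodSecurityPolicy", requested)
+}
+```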
+
+# Future work
+
+Post-1.4 feature ideas. These are not fully-fleshed designs.
+
+## System component profiles
+
+We should publish (to GitHub) AppArmor profiles for all Kubernetes system components, including core
+components like the API server and controller manager, as well as addons like influxDB and
+Grafana. `kube-up.sh` and its successor should have an option to apply the profiles, if AppArmor
+is supported by the nodes. Distros that support AppArmor and provide a Kubernetes package should
+include the profiles out of the box.
+
+## Deploying profiles
+
+We could provide an official supported solution for loading profiles on the nodes. One option is to
+extend the reference implementation described [above](#deploying-profiles) into a DaemonSet that
+watches the directory sources to sync changes, or to watch a ConfigMap object directly. Another
+option is to add an official API for this purpose, and load the profiles on-demand in the Kubelet.
+
+## Custom app profiles
+
+[Profile stacking](http://wiki.apparmor.net/index.php/AppArmorStacking) is an AppArmor feature
+currently in development that will enable multiple profiles to be applied to the same object. If
+profiles are stacked, the allowed set of operations is the "intersection" of both profiles
+(i.e. stacked profiles are never more permissive). Taking advantage of this feature, the cluster
+administrator could restrict the allowed profiles on a PodSecurityPolicy to a few broad profiles,
+and then individual apps could apply more app specific profiles on top.
+
+## Security plugins
+
+AppArmor, SELinux, TOMOYO, grsecurity, SMACK, etc. are all Linux MAC implementations with similar
+requirements and features. At the very least, the AppArmor implementation should be factored in a
+way that makes it easy to add alternative systems. A more advanced approach would be to extract a
+set of interfaces for plugins implementing the alternatives. An even higher level approach would be
+to define a common API or profile interface for all of them. Work towards this last option is
+already underway for Docker, called
+[Docker Security Profiles](https://github.com/docker/docker/issues/17142#issuecomment-148974642).
+
+## Container Runtime Interface
+
+Other container runtimes will likely add AppArmor support eventually, so the
+[Container Runtime Interface](container-runtime-interface-v1.md) (CRI) needs to be made compatible
+with this design. The two important pieces are a way to report whether AppArmor is supported by the
+runtime, and a way to specify the profile to load (likely through the `LinuxContainerConfig`).
+
+## Alerting
+
+Whether AppArmor is running in enforcing or complain mode, it generates logs of policy
+violations. These logs can be important cues for intrusion detection, or at the very least indicate a
+bug in the profile. Violations should almost always generate alerts in production systems. We should
+provide reference documentation for setting up alerts.
+
+## Profile authoring
+
+A common method for writing AppArmor profiles is to start with a restrictive profile in complain
+mode, and then use the `aa-logprof` tool to build a profile from the logs. We should provide
+documentation for following this process in a Kubernetes environment.
+
+# Appendix
+
+- [What is AppArmor](https://askubuntu.com/questions/236381/what-is-apparmor)
+- [Debugging AppArmor on Docker](https://github.com/docker/docker/blob/master/docs/security/apparmor.md#debug-apparmor)
+- Load an AppArmor profile with `apparmor_parser` (required by Docker so it should be available):
+
+ ```
+ $ apparmor_parser --replace --write-cache /path/to/profile
+ ```
+
+- Unload with:
+
+ ```
+ $ apparmor_parser --remove /path/to/profile
+ ```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/apparmor.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/architecture.dia b/contributors/design-proposals/architecture.dia
new file mode 100644
index 00000000..5c87409f
--- /dev/null
+++ b/contributors/design-proposals/architecture.dia
Binary files differ
diff --git a/contributors/design-proposals/architecture.md b/contributors/design-proposals/architecture.md
new file mode 100644
index 00000000..95e3aef4
--- /dev/null
+++ b/contributors/design-proposals/architecture.md
@@ -0,0 +1,85 @@
+# Kubernetes architecture
+
+A running Kubernetes cluster contains node agents (`kubelet`) and master
+components (APIs, scheduler, etc), on top of a distributed storage solution.
+This diagram shows our desired eventual state, though we're still working on a
+few things, like making `kubelet` itself (all our components, really) run within
+containers, and making the scheduler 100% pluggable.
+
+![Architecture Diagram](architecture.png?raw=true "Architecture overview")
+
+## The Kubernetes Node
+
+When looking at the architecture of the system, we'll break it down to services
+that run on the worker node and services that compose the cluster-level control
+plane.
+
+The Kubernetes node has the services necessary to run application containers and
+be managed from the master systems.
+
+Each node runs Docker, of course. Docker takes care of the details of
+downloading images and running containers.
+
+### `kubelet`
+
+The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their
+images, their volumes, etc.
+
+### `kube-proxy`
+
+Each node also runs a simple network proxy and load balancer (see the
+[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for
+more details). This reflects `services` (see
+[the services doc](../user-guide/services.md) for more details) as defined in
+the Kubernetes API on each node and can do simple TCP and UDP stream forwarding
+(round robin) across a set of backends.
+
+Service endpoints are currently found via [DNS](../admin/dns.md) or through
+environment variables (both
+[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and
+Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are
+supported). These variables resolve to ports managed by the service proxy.
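+
+For illustration only, a client could resolve a service address from these
+variables like this (the service name `FOO` is a placeholder):
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+)
+
+// serviceAddr builds "host:port" from the {FOO}_SERVICE_HOST and
+// {FOO}_SERVICE_PORT variables injected for a service named "foo".
+func serviceAddr(name string) string {
+    host := os.Getenv(name + "_SERVICE_HOST")
+    port := os.Getenv(name + "_SERVICE_PORT")
+    return host + ":" + port
+}
+
+func main() {
+    fmt.Println(serviceAddr("FOO"))
+}
+```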
+
+## The Kubernetes Control Plane
+
+The Kubernetes control plane is split into a set of components. Currently they
+all run on a single _master_ node, but that is expected to change soon in order
+to support high-availability clusters. These components work together to provide
+a unified view of the cluster.
+
+### `etcd`
+
+All persistent master state is stored in an instance of `etcd`. This provides a
+great way to store configuration data reliably. With `watch` support,
+coordinating components can be notified very quickly of changes.
+
+### Kubernetes API Server
+
+The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a
+CRUD-y server, with most/all business logic implemented in separate components
+or in plug-ins. It mainly processes REST operations, validates them, and updates
+the corresponding objects in `etcd` (and eventually other stores).
+
+### Scheduler
+
+The scheduler binds unscheduled pods to nodes via the `/binding` API. The
+scheduler is pluggable, and we expect to support multiple cluster schedulers and
+even user-provided schedulers in the future.
+
+### Kubernetes Controller Manager Server
+
+All other cluster-level functions are currently performed by the Controller
+Manager. For instance, `Endpoints` objects are created and updated by the
+endpoints controller, and nodes are discovered, managed, and monitored by the
+node controller. These could eventually be split into separate components to
+make them independently pluggable.
+
+The [`replicationcontroller`](../user-guide/replication-controller.md) is a
+mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md)
+API. We eventually plan to port it to a generic plug-in mechanism, once one is
+implemented.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/architecture.png b/contributors/design-proposals/architecture.png
new file mode 100644
index 00000000..0ee8bceb
--- /dev/null
+++ b/contributors/design-proposals/architecture.png
Binary files differ
diff --git a/contributors/design-proposals/architecture.svg b/contributors/design-proposals/architecture.svg
new file mode 100644
index 00000000..d6b6aab0
--- /dev/null
+++ b/contributors/design-proposals/architecture.svg
@@ -0,0 +1,1943 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<svg
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:cc="http://creativecommons.org/ns#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+ xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+ width="68cm"
+ height="56cm"
+ viewBox="-55 -75 1348 1117"
+ id="svg2"
+ version="1.1"
+ inkscape:version="0.91 r13725"
+ sodipodi:docname="architecture.svg"
+ inkscape:export-filename="D:\Work\PaaS\V1R2\Kubernetes\Src\kubernetes\docs\design\architecture.png"
+ inkscape:export-xdpi="90"
+ inkscape:export-ydpi="90">
+ <metadata
+ id="metadata738">
+ <rdf:RDF>
+ <cc:Work
+ rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type
+ rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <defs
+ id="defs736" />
+ <sodipodi:namedview
+ pagecolor="#ffffff"
+ bordercolor="#666666"
+ borderopacity="1"
+ objecttolerance="10"
+ gridtolerance="10"
+ guidetolerance="10"
+ inkscape:pageopacity="0"
+ inkscape:pageshadow="2"
+ inkscape:window-width="1680"
+ inkscape:window-height="988"
+ id="namedview734"
+ showgrid="false"
+ inkscape:zoom="0.33640324"
+ inkscape:cx="1204.7244"
+ inkscape:cy="992.12598"
+ inkscape:window-x="-8"
+ inkscape:window-y="-8"
+ inkscape:window-maximized="1"
+ inkscape:current-layer="svg2" />
+ <g
+ id="g4">
+ <rect
+ style="fill: #ffffff"
+ x="662"
+ y="192"
+ width="630"
+ height="381"
+ id="rect6" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="662"
+ y="192"
+ width="630"
+ height="381"
+ id="rect8" />
+ </g>
+ <g
+ id="g10">
+ <rect
+ style="fill: #ffffff"
+ x="688"
+ y="321"
+ width="580"
+ height="227"
+ id="rect12" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="688"
+ y="321"
+ width="580"
+ height="227"
+ id="rect14" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="687"
+ y="224"
+ id="text16">
+ <tspan
+ x="687"
+ y="224"
+ id="tspan18">Node</tspan>
+ </text>
+ <g
+ id="g20">
+ <rect
+ style="fill: #ffffff"
+ x="723.2"
+ y="235"
+ width="69.6"
+ height="38"
+ id="rect22" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="723.2"
+ y="235"
+ width="69.6"
+ height="38"
+ id="rect24" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="758"
+ y="258.8"
+ id="text26">
+ <tspan
+ x="758"
+ y="258.8"
+ id="tspan28">kubelet</tspan>
+ </text>
+ </g>
+ <g
+ id="g30">
+ <rect
+ style="fill: #ffffff"
+ x="720.2"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect32" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="720.2"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect34" />
+ </g>
+ <g
+ id="g36">
+ <rect
+ style="fill: #ffffff"
+ x="760.55"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect38" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="760.55"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect40" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="805.2"
+ y="461.9"
+ id="text42">
+ <tspan
+ x="805.2"
+ y="461.9"
+ id="tspan44">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g46">
+ <rect
+ style="fill: #ffffff"
+ x="749.8"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect48" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="749.8"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect50" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="794.45"
+ y="452"
+ id="text52">
+ <tspan
+ x="794.45"
+ y="452"
+ id="tspan54">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g56">
+ <rect
+ style="fill: #ffffff"
+ x="739.4"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect58" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="739.4"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect60" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="784.05"
+ y="442.1"
+ id="text62">
+ <tspan
+ x="784.05"
+ y="442.1"
+ id="tspan64">cAdvisor</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="794.2"
+ y="434.6"
+ id="text66">
+ <tspan
+ x="794.2"
+ y="434.6"
+ id="tspan68" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="742.2"
+ y="394.6"
+ id="text70">
+ <tspan
+ x="742.2"
+ y="394.6"
+ id="tspan72">Pod</tspan>
+ </text>
+ <g
+ id="g74">
+ <g
+ id="g76">
+ <rect
+ style="fill: #ffffff"
+ x="1108.6"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect78" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1108.6"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect80" />
+ </g>
+ <g
+ id="g82">
+ <rect
+ style="fill: #ffffff"
+ x="1148.95"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect84" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1148.95"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect86" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1193.6"
+ y="461.9"
+ id="text88">
+ <tspan
+ x="1193.6"
+ y="461.9"
+ id="tspan90">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g92">
+ <rect
+ style="fill: #ffffff"
+ x="1138.2"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect94" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1138.2"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect96" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1182.85"
+ y="452"
+ id="text98">
+ <tspan
+ x="1182.85"
+ y="452"
+ id="tspan100">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g102">
+ <rect
+ style="fill: #ffffff"
+ x="1127.8"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect104" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1127.8"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect106" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1172.45"
+ y="442.1"
+ id="text108">
+ <tspan
+ x="1172.45"
+ y="442.1"
+ id="tspan110">container</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1182.6"
+ y="434.6"
+ id="text112">
+ <tspan
+ x="1182.6"
+ y="434.6"
+ id="tspan114" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1130.6"
+ y="394.6"
+ id="text116">
+ <tspan
+ x="1130.6"
+ y="394.6"
+ id="tspan118">Pod</tspan>
+ </text>
+ </g>
+ <g
+ id="g120">
+ <g
+ id="g122">
+ <rect
+ style="fill: #ffffff"
+ x="902.9"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect124" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="902.9"
+ y="368.1"
+ width="148"
+ height="133"
+ id="rect126" />
+ </g>
+ <g
+ id="g128">
+ <rect
+ style="fill: #ffffff"
+ x="943.25"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect130" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="943.25"
+ y="438.1"
+ width="89.3"
+ height="38"
+ id="rect132" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="987.9"
+ y="461.9"
+ id="text134">
+ <tspan
+ x="987.9"
+ y="461.9"
+ id="tspan136">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g138">
+ <rect
+ style="fill: #ffffff"
+ x="932.5"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect140" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="932.5"
+ y="428.2"
+ width="89.3"
+ height="38"
+ id="rect142" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="977.15"
+ y="452"
+ id="text144">
+ <tspan
+ x="977.15"
+ y="452"
+ id="tspan146">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g148">
+ <rect
+ style="fill: #ffffff"
+ x="922.1"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect150" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="922.1"
+ y="418.3"
+ width="89.3"
+ height="38"
+ id="rect152" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="966.75"
+ y="442.1"
+ id="text154">
+ <tspan
+ x="966.75"
+ y="442.1"
+ id="tspan156">container</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="976.9"
+ y="434.6"
+ id="text158">
+ <tspan
+ x="976.9"
+ y="434.6"
+ id="tspan160" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="924.9"
+ y="394.6"
+ id="text162">
+ <tspan
+ x="924.9"
+ y="394.6"
+ id="tspan164">Pod</tspan>
+ </text>
+ </g>
+ <g
+ id="g166">
+ <rect
+ style="fill: #ffffff"
+ x="949.748"
+ y="228"
+ width="57.1"
+ height="38"
+ id="rect168" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="949.748"
+ y="228"
+ width="57.1"
+ height="38"
+ id="rect170" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="978.298"
+ y="251.8"
+ id="text172">
+ <tspan
+ x="978.298"
+ y="251.8"
+ id="tspan174">Proxy</tspan>
+ </text>
+ </g>
+ <g
+ id="g176">
+ <rect
+ style="fill: #ffffff"
+ x="126.911"
+ y="92.49"
+ width="189.4"
+ height="38"
+ id="rect178" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="126.911"
+ y="92.49"
+ width="189.4"
+ height="38"
+ id="rect180" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="221.611"
+ y="116.29"
+ id="text182">
+ <tspan
+ x="221.611"
+ y="116.29"
+ id="tspan184">kubectl (user commands)</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="142.476"
+ y="866.282"
+ id="text186">
+ <tspan
+ x="142.476"
+ y="866.282"
+ id="tspan188" />
+ </text>
+ <g
+ id="g190">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="758"
+ y1="273"
+ x2="782.332"
+ y2="408.717"
+ id="line192" />
+ <polygon
+ style="fill: #000000"
+ points="783.655,416.099 776.969,407.138 782.332,408.717 786.812,405.374 "
+ id="polygon194" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="783.655,416.099 776.969,407.138 782.332,408.717 786.812,405.374 "
+ id="polygon196" />
+ </g>
+ <g
+ id="g198">
+ <rect
+ style="fill: #ffffff"
+ x="942.576"
+ y="75.6768"
+ width="70.2"
+ height="38"
+ id="rect200" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="942.576"
+ y="75.6768"
+ width="70.2"
+ height="38"
+ id="rect202" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="977.676"
+ y="99.4768"
+ id="text204">
+ <tspan
+ x="977.676"
+ y="99.4768"
+ id="tspan206">Firewall</tspan>
+ </text>
+ </g>
+ <g
+ id="g208">
+ <path
+ style="fill: #ffffff"
+ d="M 949.242 -47.953 C 939.87,-48.2618 921.694,-41.7773 924.25,-27.8819 C 926.806,-13.9865 939.018,-10.8988 944.13,-14.9129 C 949.242,-18.9271 936.178,4.54051 961.17,10.7162 C 986.161,16.8919 998.941,7.01079 995.249,-0.0912821 C 991.557,-7.19336 1017.12,16.5832 1029.04,2.99658 C 1040.97,-10.59 1016.83,-23.5589 1021.94,-21.7062 C 1027.06,-19.8535 1042.68,-22.3237 1037.56,-45.4827 C 1032.45,-68.6416 986.445,-50.7321 991.557,-54.1287 C 996.669,-57.5253 983.889,-74.5086 967.986,-71.112 C 952.082,-67.7153 950.954,-61.5516 949.25,-47.965 L 949.242,-47.953z"
+ id="path210" />
+ <path
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ d="M 949.242 -47.953 C 939.87,-48.2618 921.694,-41.7773 924.25,-27.8819 C 926.806,-13.9865 939.018,-10.8988 944.13,-14.9129 C 949.242,-18.9271 936.178,4.54051 961.17,10.7162 C 986.161,16.8919 998.941,7.01079 995.249,-0.0912821 C 991.557,-7.19336 1017.12,16.5832 1029.04,2.99658 C 1040.97,-10.59 1016.83,-23.5589 1021.94,-21.7062 C 1027.06,-19.8535 1042.68,-22.3237 1037.56,-45.4827 C 1032.45,-68.6416 986.445,-50.7321 991.557,-54.1287 C 996.669,-57.5253 983.889,-74.5086 967.986,-71.112 C 952.082,-67.7153 950.954,-61.5516 949.25,-47.965 L 949.242,-47.953"
+ id="path212" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="985.428"
+ y="-22.3971"
+ id="text214">
+ <tspan
+ x="985.428"
+ y="-22.3971"
+ id="tspan216">Internet</tspan>
+ </text>
+ </g>
+ <g
+ id="g218">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="975.985"
+ y1="12.703"
+ x2="977.415"
+ y2="65.9442"
+ id="line220" />
+ <polygon
+ style="fill: #000000"
+ points="977.616,73.4415 972.349,63.5793 977.415,65.9442 982.346,63.3109 "
+ id="polygon222" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="977.616,73.4415 972.349,63.5793 977.415,65.9442 982.346,63.3109 "
+ id="polygon224" />
+ </g>
+ <g
+ id="g226">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="977.676"
+ y1="113.677"
+ x2="978.245"
+ y2="218.264"
+ id="line228" />
+ <polygon
+ style="fill: #000000"
+ points="978.286,225.764 973.232,215.791 978.245,218.264 983.231,215.737 "
+ id="polygon230" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="978.286,225.764 973.232,215.791 978.245,218.264 983.231,215.737 "
+ id="polygon232" />
+ </g>
+ <g
+ id="g234">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="978.298"
+ y1="266"
+ x2="977.033"
+ y2="358.365"
+ id="line236" />
+ <polygon
+ style="fill: #000000"
+ points="976.931,365.864 972.068,355.797 977.033,358.365 982.067,355.934 "
+ id="polygon238" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="976.931,365.864 972.068,355.797 977.033,358.365 982.067,355.934 "
+ id="polygon240" />
+ </g>
+ <g
+ id="g242">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="992.572"
+ y1="266"
+ x2="1174.02"
+ y2="363.492"
+ id="line244" />
+ <polygon
+ style="fill: #000000"
+ points="1180.63,367.042 1169.45,366.713 1174.02,363.492 1174.19,357.904 "
+ id="polygon246" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="1180.63,367.042 1169.45,366.713 1174.02,363.492 1174.19,357.904 "
+ id="polygon248" />
+ </g>
+ <g
+ id="g250">
+ <rect
+ style="fill: #ffffff"
+ x="-54"
+ y="370.5"
+ width="562"
+ height="383.25"
+ id="rect252" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="-54"
+ y="370.5"
+ width="562"
+ height="383.25"
+ id="rect254" />
+ </g>
+ <g
+ id="g256">
+ <rect
+ style="fill: #ffffff"
+ x="-30"
+ y="416.75"
+ width="364"
+ height="146"
+ id="rect258" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="-30"
+ y="416.75"
+ width="364"
+ height="146"
+ id="rect260" />
+ </g>
+ <g
+ id="g262">
+ <rect
+ style="fill: #ffffff"
+ x="128"
+ y="598.318"
+ width="189"
+ height="54"
+ id="rect264" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="128"
+ y="598.318"
+ width="189"
+ height="54"
+ id="rect266" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="222.5"
+ y="622.118"
+ id="text268">
+ <tspan
+ x="222.5"
+ y="622.118"
+ id="tspan270">controller manager</tspan>
+ <tspan
+ x="222.5"
+ y="638.118"
+ id="tspan272">(replication controller etc.)</tspan>
+ </text>
+ </g>
+ <g
+ id="g274">
+ <rect
+ style="fill: #ffffff"
+ x="15.8884"
+ y="622.914"
+ width="86.15"
+ height="38"
+ id="rect276" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="15.8884"
+ y="622.914"
+ width="86.15"
+ height="38"
+ id="rect278" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="58.9634"
+ y="646.714"
+ id="text280">
+ <tspan
+ x="58.9634"
+ y="646.714"
+ id="tspan282">Scheduler</tspan>
+ </text>
+ </g>
+ <g
+ id="g284">
+ <rect
+ style="fill: #ffffff"
+ x="1.162"
+ y="599.318"
+ width="86.15"
+ height="38"
+ id="rect286" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1.162"
+ y="599.318"
+ width="86.15"
+ height="38"
+ id="rect288" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="44.237"
+ y="623.118"
+ id="text290">
+ <tspan
+ x="44.237"
+ y="623.118"
+ id="tspan292">Scheduler</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="-34.876"
+ y="699.256"
+ id="text294">
+ <tspan
+ x="-34.876"
+ y="699.256"
+ id="tspan296">Master components</tspan>
+ <tspan
+ x="-34.876"
+ y="715.256"
+ id="tspan298">Colocated, or spread across machines,</tspan>
+ <tspan
+ x="-34.876"
+ y="731.256"
+ id="tspan300">as dictated by cluster size.</tspan>
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="370.886"
+ y="731.5"
+ id="text302">
+ <tspan
+ x="370.886"
+ y="731.5"
+ id="tspan304" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="370.886"
+ y="731.5"
+ id="text306">
+ <tspan
+ x="370.886"
+ y="731.5"
+ id="tspan308" />
+ </text>
+ <g
+ id="g310">
+ <rect
+ style="fill: #ffffff"
+ x="136.717"
+ y="468.5"
+ width="172.175"
+ height="70"
+ id="rect312" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="136.717"
+ y="468.5"
+ width="172.175"
+ height="70"
+ id="rect314" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="222.804"
+ y="492.3"
+ id="text316">
+ <tspan
+ x="222.804"
+ y="492.3"
+ id="tspan318">REST</tspan>
+ <tspan
+ x="222.804"
+ y="508.3"
+ id="tspan320">(pods, services,</tspan>
+ <tspan
+ x="222.804"
+ y="524.3"
+ id="tspan322">rep. controllers)</tspan>
+ </text>
+ </g>
+ <g
+ id="g324">
+ <rect
+ style="fill: #ffffff"
+ x="165.958"
+ y="389.5"
+ width="115"
+ height="54"
+ id="rect326" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="165.958"
+ y="389.5"
+ width="115"
+ height="54"
+ id="rect328" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="223.458"
+ y="413.3"
+ id="text330">
+ <tspan
+ x="223.458"
+ y="413.3"
+ id="tspan332">authentication</tspan>
+ <tspan
+ x="223.458"
+ y="429.3"
+ id="tspan334">authorization</tspan>
+ </text>
+ </g>
+ <g
+ id="g336">
+ <rect
+ style="fill: #ffffff"
+ x="-0.65"
+ y="476.5"
+ width="91.3"
+ height="54"
+ id="rect338" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="-0.65"
+ y="476.5"
+ width="91.3"
+ height="54"
+ id="rect340" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="45"
+ y="500.3"
+ id="text342">
+ <tspan
+ x="45"
+ y="500.3"
+ id="tspan344">scheduling</tspan>
+ <tspan
+ x="45"
+ y="516.3"
+ id="tspan346">actuator</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="-13"
+ y="436.75"
+ id="text348">
+ <tspan
+ x="-13"
+ y="436.75"
+ id="tspan350">APIs</tspan>
+ </text>
+ <g
+ id="g352">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="44.237"
+ y1="599.318"
+ x2="44.8921"
+ y2="540.235"
+ id="line354" />
+ <polygon
+ style="fill: #000000"
+ points="44.9752,532.736 49.864,542.791 44.8921,540.235 39.8647,542.68 "
+ id="polygon356" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="44.9752,532.736 49.864,542.791 44.8921,540.235 39.8647,542.68 "
+ id="polygon358" />
+ </g>
+ <g
+ id="g360">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="44.237"
+ y1="599.318"
+ x2="170.878"
+ y2="542.486"
+ id="line362" />
+ <polygon
+ style="fill: #000000"
+ points="177.72,539.416 170.644,548.071 170.878,542.486 166.55,538.948 "
+ id="polygon364" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="177.72,539.416 170.644,548.071 170.878,542.486 166.55,538.948 "
+ id="polygon366" />
+ </g>
+ <g
+ id="g368">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="222.5"
+ y1="598.318"
+ x2="222.755"
+ y2="548.236"
+ id="line370" />
+ <polygon
+ style="fill: #000000"
+ points="222.793,540.736 227.742,550.761 222.755,548.236 217.742,550.71 "
+ id="polygon372" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="222.793,540.736 227.742,550.761 222.755,548.236 217.742,550.71 "
+ id="polygon374" />
+ </g>
+ <g
+ id="g376">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="223.458"
+ y1="443.5"
+ x2="223.059"
+ y2="458.767"
+ id="line378" />
+ <polygon
+ style="fill: #000000"
+ points="222.862,466.265 218.126,456.137 223.059,458.767 228.122,456.399 "
+ id="polygon380" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="222.862,466.265 218.126,456.137 223.059,458.767 228.122,456.399 "
+ id="polygon382" />
+ </g>
+ <g
+ id="g384">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="313.554"
+ y1="548.463"
+ x2="366.76"
+ y2="662.181"
+ id="line386" />
+ <polygon
+ style="fill: #000000"
+ points="318.082,546.344 309.316,539.406 309.025,550.582 "
+ id="polygon388" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="318.082,546.344 309.316,539.406 309.025,550.582 "
+ id="polygon390" />
+ <polygon
+ style="fill: #000000"
+ points="369.938,668.975 361.172,662.036 366.76,662.181 370.229,657.798 "
+ id="polygon392" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="369.938,668.975 361.172,662.036 366.76,662.181 370.229,657.798 "
+ id="polygon394" />
+ </g>
+ <g
+ id="g396">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="221.612"
+ y1="130.49"
+ x2="223.389"
+ y2="379.764"
+ id="line398" />
+ <polygon
+ style="fill: #000000"
+ points="223.442,387.264 218.371,377.3 223.389,379.764 228.371,377.229 "
+ id="polygon400" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="223.442,387.264 218.371,377.3 223.389,379.764 228.371,377.229 "
+ id="polygon402" />
+ </g>
+ <g
+ id="g404">
+ <path
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ d="M 319.892 503.5 C 392.964,503.5 639.13,254 713.464,254"
+ id="path406" />
+ <polygon
+ style="fill: #000000"
+ points="319.892,498.5 309.892,503.5 319.892,508.5 "
+ id="polygon408" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="319.892,498.5 309.892,503.5 319.892,508.5 "
+ id="polygon410" />
+ <polygon
+ style="fill: #000000"
+ points="720.964,254 710.964,259 713.464,254 710.964,249 "
+ id="polygon412" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="720.964,254 710.964,259 713.464,254 710.964,249 "
+ id="polygon414" />
+ </g>
+ <g
+ id="g416">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="90.65"
+ y1="503.5"
+ x2="126.981"
+ y2="503.5"
+ id="line418" />
+ <polygon
+ style="fill: #000000"
+ points="134.481,503.5 124.481,508.5 126.981,503.5 124.481,498.5 "
+ id="polygon420" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="134.481,503.5 124.481,508.5 126.981,503.5 124.481,498.5 "
+ id="polygon422" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="221.612"
+ y="111.49"
+ id="text424">
+ <tspan
+ x="221.612"
+ y="111.49"
+ id="tspan426" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1209"
+ y="339.5"
+ id="text428">
+ <tspan
+ x="1209"
+ y="339.5"
+ id="tspan430">docker</tspan>
+ </text>
+ <g
+ id="g432">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="793.753"
+ y1="272.636"
+ x2="968.266"
+ y2="363.6"
+ id="line434" />
+ <polygon
+ style="fill: #000000"
+ points="974.917,367.066 963.738,366.878 968.266,363.6 968.361,358.01 "
+ id="polygon436" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="974.917,367.066 963.738,366.878 968.266,363.6 968.361,358.01 "
+ id="polygon438" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="978"
+ y="434.5"
+ id="text440">
+ <tspan
+ x="978"
+ y="434.5"
+ id="tspan442">..</tspan>
+ </text>
+ <text
+ font-size="27.0933"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1067"
+ y="437"
+ id="text444">
+ <tspan
+ x="1067"
+ y="437"
+ id="tspan446">...</tspan>
+ </text>
+ <g
+ id="g448">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="792.8"
+ y1="273"
+ x2="1173.14"
+ y2="365.792"
+ id="line450" />
+ <polygon
+ style="fill: #000000"
+ points="1180.43,367.57 1169.53,370.057 1173.14,365.792 1171.9,360.342 "
+ id="polygon452" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="1180.43,367.57 1169.53,370.057 1173.14,365.792 1171.9,360.342 "
+ id="polygon454" />
+ </g>
+ <g
+ id="g456">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="792.8"
+ y1="273"
+ x2="794.057"
+ y2="358.365"
+ id="line458" />
+ <polygon
+ style="fill: #000000"
+ points="794.167,365.864 789.02,355.939 794.057,358.365 799.019,355.792 "
+ id="polygon460" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="794.167,365.864 789.02,355.939 794.057,358.365 799.019,355.792 "
+ id="polygon462" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="720"
+ y="220"
+ id="text464">
+ <tspan
+ x="720"
+ y="220"
+ id="tspan466" />
+ </text>
+ <g
+ id="g468">
+ <rect
+ style="fill: #ffffff"
+ x="660"
+ y="660"
+ width="630"
+ height="381"
+ id="rect470" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="660"
+ y="660"
+ width="630"
+ height="381"
+ id="rect472" />
+ </g>
+ <g
+ id="g474">
+ <rect
+ style="fill: #ffffff"
+ x="686"
+ y="789"
+ width="580"
+ height="227"
+ id="rect476" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="686"
+ y="789"
+ width="580"
+ height="227"
+ id="rect478" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="685"
+ y="692"
+ id="text480">
+ <tspan
+ x="685"
+ y="692"
+ id="tspan482">Node</tspan>
+ </text>
+ <g
+ id="g484">
+ <rect
+ style="fill: #ffffff"
+ x="721.2"
+ y="703"
+ width="69.6"
+ height="38"
+ id="rect486" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="721.2"
+ y="703"
+ width="69.6"
+ height="38"
+ id="rect488" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="756"
+ y="726.8"
+ id="text490">
+ <tspan
+ x="756"
+ y="726.8"
+ id="tspan492">kubelet</tspan>
+ </text>
+ </g>
+ <g
+ id="g494">
+ <rect
+ style="fill: #ffffff"
+ x="718.2"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect496" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="718.2"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect498" />
+ </g>
+ <g
+ id="g500">
+ <rect
+ style="fill: #ffffff"
+ x="758.55"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect502" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="758.55"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect504" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="803.2"
+ y="929.9"
+ id="text506">
+ <tspan
+ x="803.2"
+ y="929.9"
+ id="tspan508">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g510">
+ <rect
+ style="fill: #ffffff"
+ x="747.8"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect512" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="747.8"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect514" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="792.45"
+ y="920"
+ id="text516">
+ <tspan
+ x="792.45"
+ y="920"
+ id="tspan518">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g520">
+ <rect
+ style="fill: #ffffff"
+ x="737.4"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect522" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="737.4"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect524" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="782.05"
+ y="910.1"
+ id="text526">
+ <tspan
+ x="782.05"
+ y="910.1"
+ id="tspan528">cAdvisor</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="792.2"
+ y="902.6"
+ id="text530">
+ <tspan
+ x="792.2"
+ y="902.6"
+ id="tspan532" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="740.2"
+ y="862.6"
+ id="text534">
+ <tspan
+ x="740.2"
+ y="862.6"
+ id="tspan536">Pod</tspan>
+ </text>
+ <g
+ id="g538">
+ <g
+ id="g540">
+ <rect
+ style="fill: #ffffff"
+ x="1106.6"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect542" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1106.6"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect544" />
+ </g>
+ <g
+ id="g546">
+ <rect
+ style="fill: #ffffff"
+ x="1146.95"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect548" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1146.95"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect550" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1191.6"
+ y="929.9"
+ id="text552">
+ <tspan
+ x="1191.6"
+ y="929.9"
+ id="tspan554">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g556">
+ <rect
+ style="fill: #ffffff"
+ x="1136.2"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect558" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1136.2"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect560" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1180.85"
+ y="920"
+ id="text562">
+ <tspan
+ x="1180.85"
+ y="920"
+ id="tspan564">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g566">
+ <rect
+ style="fill: #ffffff"
+ x="1125.8"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect568" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="1125.8"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect570" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1170.45"
+ y="910.1"
+ id="text572">
+ <tspan
+ x="1170.45"
+ y="910.1"
+ id="tspan574">container</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1180.6"
+ y="902.6"
+ id="text576">
+ <tspan
+ x="1180.6"
+ y="902.6"
+ id="tspan578" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1128.6"
+ y="862.6"
+ id="text580">
+ <tspan
+ x="1128.6"
+ y="862.6"
+ id="tspan582">Pod</tspan>
+ </text>
+ </g>
+ <g
+ id="g584">
+ <g
+ id="g586">
+ <rect
+ style="fill: #ffffff"
+ x="900.9"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect588" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="900.9"
+ y="836.1"
+ width="148"
+ height="133"
+ id="rect590" />
+ </g>
+ <g
+ id="g592">
+ <rect
+ style="fill: #ffffff"
+ x="941.25"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect594" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="941.25"
+ y="906.1"
+ width="89.3"
+ height="38"
+ id="rect596" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="985.9"
+ y="929.9"
+ id="text598">
+ <tspan
+ x="985.9"
+ y="929.9"
+ id="tspan600">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g602">
+ <rect
+ style="fill: #ffffff"
+ x="930.5"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect604" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="930.5"
+ y="896.2"
+ width="89.3"
+ height="38"
+ id="rect606" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="975.15"
+ y="920"
+ id="text608">
+ <tspan
+ x="975.15"
+ y="920"
+ id="tspan610">container</tspan>
+ </text>
+ </g>
+ <g
+ id="g612">
+ <rect
+ style="fill: #ffffff"
+ x="920.1"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect614" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="920.1"
+ y="886.3"
+ width="89.3"
+ height="38"
+ id="rect616" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="964.75"
+ y="910.1"
+ id="text618">
+ <tspan
+ x="964.75"
+ y="910.1"
+ id="tspan620">container</tspan>
+ </text>
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="974.9"
+ y="902.6"
+ id="text622">
+ <tspan
+ x="974.9"
+ y="902.6"
+ id="tspan624" />
+ </text>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="922.9"
+ y="862.6"
+ id="text626">
+ <tspan
+ x="922.9"
+ y="862.6"
+ id="tspan628">Pod</tspan>
+ </text>
+ </g>
+ <g
+ id="g630">
+ <rect
+ style="fill: #ffffff"
+ x="947.748"
+ y="696"
+ width="57.1"
+ height="38"
+ id="rect632" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="947.748"
+ y="696"
+ width="57.1"
+ height="38"
+ id="rect634" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="976.298"
+ y="719.8"
+ id="text636">
+ <tspan
+ x="976.298"
+ y="719.8"
+ id="tspan638">Proxy</tspan>
+ </text>
+ </g>
+ <g
+ id="g640">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="756"
+ y1="741"
+ x2="780.332"
+ y2="876.717"
+ id="line642" />
+ <polygon
+ style="fill: #000000"
+ points="781.655,884.099 774.969,875.138 780.332,876.717 784.812,873.374 "
+ id="polygon644" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="781.655,884.099 774.969,875.138 780.332,876.717 784.812,873.374 "
+ id="polygon646" />
+ </g>
+ <g
+ id="g648">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="976.298"
+ y1="734"
+ x2="975.033"
+ y2="826.365"
+ id="line650" />
+ <polygon
+ style="fill: #000000"
+ points="974.931,833.864 970.068,823.797 975.033,826.365 980.067,823.934 "
+ id="polygon652" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="974.931,833.864 970.068,823.797 975.033,826.365 980.067,823.934 "
+ id="polygon654" />
+ </g>
+ <g
+ id="g656">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="990.572"
+ y1="734"
+ x2="1172.02"
+ y2="831.492"
+ id="line658" />
+ <polygon
+ style="fill: #000000"
+ points="1178.63,835.042 1167.45,834.713 1172.02,831.492 1172.19,825.904 "
+ id="polygon660" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="1178.63,835.042 1167.45,834.713 1172.02,831.492 1172.19,825.904 "
+ id="polygon662" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1207"
+ y="807.5"
+ id="text664">
+ <tspan
+ x="1207"
+ y="807.5"
+ id="tspan666">docker</tspan>
+ </text>
+ <g
+ id="g668">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="791.753"
+ y1="740.636"
+ x2="966.266"
+ y2="831.6"
+ id="line670" />
+ <polygon
+ style="fill: #000000"
+ points="972.917,835.066 961.738,834.878 966.266,831.6 966.361,826.01 "
+ id="polygon672" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="972.917,835.066 961.738,834.878 966.266,831.6 966.361,826.01 "
+ id="polygon674" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="976"
+ y="902.5"
+ id="text676">
+ <tspan
+ x="976"
+ y="902.5"
+ id="tspan678">..</tspan>
+ </text>
+ <text
+ font-size="27.0933"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="1065"
+ y="905"
+ id="text680">
+ <tspan
+ x="1065"
+ y="905"
+ id="tspan682">...</tspan>
+ </text>
+ <g
+ id="g684">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="790.8"
+ y1="741"
+ x2="1171.14"
+ y2="833.792"
+ id="line686" />
+ <polygon
+ style="fill: #000000"
+ points="1178.43,835.57 1167.53,838.057 1171.14,833.792 1169.9,828.342 "
+ id="polygon688" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="1178.43,835.57 1167.53,838.057 1171.14,833.792 1169.9,828.342 "
+ id="polygon690" />
+ </g>
+ <g
+ id="g692">
+ <line
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x1="790.8"
+ y1="741"
+ x2="792.057"
+ y2="826.365"
+ id="line694" />
+ <polygon
+ style="fill: #000000"
+ points="792.167,833.864 787.02,823.939 792.057,826.365 797.019,823.792 "
+ id="polygon696" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="792.167,833.864 787.02,823.939 792.057,826.365 797.019,823.792 "
+ id="polygon698" />
+ </g>
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:start;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="718"
+ y="688"
+ id="text700">
+ <tspan
+ x="718"
+ y="688"
+ id="tspan702" />
+ </text>
+ <g
+ id="g704">
+ <path
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ d="M 319.892 521 C 392.964,521 637.13,722 711.464,722"
+ id="path706" />
+ <polygon
+ style="fill: #000000"
+ points="319.892,516 309.892,521 319.892,526 "
+ id="polygon708" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="319.892,516 309.892,521 319.892,526 "
+ id="polygon710" />
+ <polygon
+ style="fill: #000000"
+ points="718.964,722 708.964,727 711.464,722 708.964,717 "
+ id="polygon712" />
+ <polygon
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ points="718.964,722 708.964,727 711.464,722 708.964,717 "
+ id="polygon714" />
+ </g>
+ <g
+ id="g716">
+ <rect
+ style="fill: #ffffff"
+ x="282.774"
+ y="671"
+ width="176.225"
+ height="121"
+ id="rect718" />
+ <rect
+ style="fill: none; fill-opacity:0; stroke-width: 2; stroke: #000000"
+ x="282.774"
+ y="671"
+ width="176.225"
+ height="121"
+ id="rect720" />
+ <text
+ font-size="12.8"
+ style="fill: #000000;text-anchor:middle;font-family:sans-serif;font-style:normal;font-weight:normal"
+ x="370.886"
+ y="704.3"
+ id="text722">
+ <tspan
+ x="370.886"
+ y="704.3"
+ id="tspan724">Distributed</tspan>
+ <tspan
+ x="370.886"
+ y="720.3"
+ id="tspan726">Watchable</tspan>
+ <tspan
+ x="370.886"
+ y="736.3"
+ id="tspan728">Storage</tspan>
+ <tspan
+ x="370.886"
+ y="752.3"
+ id="tspan730" />
+ <tspan
+ x="370.886"
+ y="768.3"
+ id="tspan732">(implemented via etcd)</tspan>
+ </text>
+ </g>
+</svg>
diff --git a/contributors/design-proposals/aws_under_the_hood.md b/contributors/design-proposals/aws_under_the_hood.md
new file mode 100644
index 00000000..6e3c5afb
--- /dev/null
+++ b/contributors/design-proposals/aws_under_the_hood.md
@@ -0,0 +1,310 @@
+# Peeking under the hood of Kubernetes on AWS
+
+This document provides high-level insight into how Kubernetes works on AWS and
+maps to AWS objects. We assume that you are familiar with AWS.
+
+We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
+clusters on AWS. We recommend that you avoid manual configuration but are aware
+that sometimes it's the only option.
+
+Tip: You should open an issue and let us know what enhancements can be made to
+the scripts to better suit your needs.
+
+That said, it's also useful to know what's happening under the hood when
+Kubernetes clusters are created on AWS. This can be particularly useful if
+problems arise or in circumstances where the provided scripts are lacking and
+you manually created or configured your cluster.
+
+**Table of contents:**
+ * [Architecture overview](#architecture-overview)
+ * [Storage](#storage)
+ * [Auto Scaling group](#auto-scaling-group)
+ * [Networking](#networking)
+ * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
+ * [Identity and access management (IAM)](#identity-and-access-management-iam)
+ * [Tagging](#tagging)
+ * [AWS objects](#aws-objects)
+ * [Manual infrastructure creation](#manual-infrastructure-creation)
+ * [Instance boot](#instance-boot)
+
+### Architecture overview
+
+A Kubernetes cluster on AWS consists of a Kubernetes master and a set of
+nodes (previously known as 'minions') for which the
+master is responsible. See the [Architecture](architecture.md) topic for
+more details.
+
+By default on AWS:
+
+* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
+ modern kernel that pairs well with Docker and doesn't require a
+ reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.)
+* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
+ because this is what Google Compute Engine uses).
+
+You can override these defaults by passing different environment variables to
+kube-up.
+
+### Storage
+
+AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore).
+These can then be attached to pods that should store persistent data (e.g. if
+you're running a database).
+
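+For illustration, a minimal (hypothetical) pod that mounts an existing EBS
+volume through the `awsElasticBlockStore` volume type might look like the
+sketch below; the pod name, image, mount path, and volume ID are placeholders,
+not values produced by kube-up:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: ebs-example                     # hypothetical pod name
+spec:
+  containers:
+  - name: db
+    image: mysql:5.6                    # assumed image for the example
+    volumeMounts:
+    - name: data
+      mountPath: /var/lib/mysql         # assumed mount path
+  volumes:
+  - name: data
+    awsElasticBlockStore:
+      volumeID: vol-0123456789abcdef0   # placeholder EBS volume ID
+      fsType: ext4
+```
+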
+By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
+unless you create pods with persistent volumes
+[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
+containers do not have persistent storage unless you attach a persistent
+volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
+often faster, and historically more reliable. Unless you can make do with
+whatever space is left on your root partition, you must choose an instance type
+that provides you with sufficient instance storage for your needs.
+
+To configure Kubernetes to use EBS storage, pass the environment variable
+`KUBE_AWS_STORAGE=ebs` to kube-up.
+
+Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
+track its state. Similar to nodes, containers are mostly run against instance
+storage, except that we repoint some important data onto the persistent volume.
+
+The default storage driver for Docker images is aufs. Specifying btrfs (by
+passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a
+good choice for a filesystem. btrfs is relatively reliable with Docker and has
+improved its reliability with modern kernels. It can easily span multiple
+volumes, which is particularly useful when we are using an instance type with
+multiple ephemeral instance disks.
+
+### Auto Scaling group
+
+Nodes (but not the master) are run in an
+[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
+on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
+([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
+that AWS will relaunch any nodes that are terminated.
+
+We do not currently run the master in an AutoScalingGroup, but we should
+([#11934](http://issues.k8s.io/11934)).
+
+### Networking
+
+Kubernetes uses an IP-per-pod model. This means that a node, which runs many
+pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
+routing support so that each node is assigned a /24 CIDR for its pods. The
+assigned CIDR is then configured in the VPC routing table to route to that
+instance.
+
+It is also possible to use overlay networking on AWS, but that is not the
+default configuration of the kube-up script.
+
+### NodePort and LoadBalancer services
+
+Kubernetes on AWS integrates with [Elastic Load Balancing
+(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
+When you create a service with `Type=LoadBalancer`, Kubernetes (the
+kube-controller-manager) will create an ELB, create a security group for the
+ELB which allows access on the service ports, attach all the nodes to the ELB,
+and modify the security group for the nodes to allow traffic from the ELB to
+the nodes. This traffic reaches kube-proxy where it is then forwarded to the
+pods.
+
+ELB has some restrictions:
+* ELB requires that all nodes listen on a single port,
+* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below
+on ELB annotations for pods speaking HTTP).
+
+To work with these restrictions, in Kubernetes, [LoadBalancer
+services](../user-guide/services.md#type-loadbalancer) are exposed as
+[NodePort services](../user-guide/services.md#type-nodeport). Then
+kube-proxy listens externally on the cluster-wide port that's assigned to
+NodePort services and forwards traffic to the corresponding pods.
+
+For example, if we configure a service of type `LoadBalancer` with a
+public port of 80 (see the sketch after this list):
+* Kubernetes will assign a NodePort to the service (e.g. port 31234)
+* ELB is configured to proxy traffic on the public port 80 to the NodePort
+assigned to the service (in this example port 31234).
+* Then any incoming traffic that ELB forwards to the NodePort (31234)
+is recognized by kube-proxy and sent to the correct pods for that service.
+
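+A minimal (hypothetical) Service manifest that would trigger this flow is
+sketched below; the names, labels, and target port are illustrative
+assumptions, and the NodePort itself is chosen by Kubernetes at creation time:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-web-service        # hypothetical name
+spec:
+  type: LoadBalancer          # asks the AWS cloud provider to create an ELB
+  selector:
+    app: my-web-app           # hypothetical pod label
+  ports:
+  - port: 80                  # public port exposed on the ELB
+    targetPort: 8080          # assumed port the pods listen on
+```
+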
+Note that we do not automatically open NodePort services in the AWS firewall
+(although we do open LoadBalancer services). This is because we expect that
+NodePort services are more of a building block for things like inter-cluster
+services or for LoadBalancer. To consume a NodePort service externally, you
+will likely have to open the port in the node security group
+(`kubernetes-node-<clusterid>`).
+
+For SSL support, starting with Kubernetes 1.3, two annotations can be added to
+a service:
+
+```
+service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
+```
+
+The first specifies which certificate to use. It can be either a
+certificate from a third party issuer that was uploaded to IAM or one created
+within AWS Certificate Manager.
+
+```
+service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
+```
+
+The second annotation specifies which protocol a pod speaks. For HTTPS and
+SSL, the ELB will expect the pod to authenticate itself over the encrypted
+connection.
+
+HTTP and HTTPS will select layer 7 proxying: the ELB will terminate
+the connection with the user, parse headers and inject the `X-Forwarded-For`
+header with the user's IP address (pods will only see the IP address of the
+ELB at the other end of its connection) when forwarding requests.
+
+TCP and SSL will select layer 4 proxying: the ELB will forward traffic without
+modifying the headers.
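+
+Putting the two annotations together, a Service requesting an HTTPS listener
+on its ELB might carry metadata like the following sketch; the certificate ARN
+is the example value from above, and the service name, labels, and ports are
+hypothetical:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-tls-service        # hypothetical name
+  annotations:
+    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
+    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
+spec:
+  type: LoadBalancer
+  selector:
+    app: my-tls-app           # hypothetical pod label
+  ports:
+  - port: 443                 # public HTTPS port terminated at the ELB
+    targetPort: 8080          # assumed plain-HTTP port on the pods
+```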
+
+### Identity and access management (IAM)
+
+kube-up sets up two IAM roles, one for the master called
+[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
+and one for the nodes called
+[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
+
+The master is responsible for creating ELBs and configuring them, as well as
+setting up advanced VPC routing. Currently it has blanket permissions on EC2,
+along with rights to create and destroy ELBs.
+
+The nodes do not need a lot of access to the AWS APIs. They need to download
+a distribution file, and they are responsible for attaching and detaching EBS
+volumes from themselves.
+
+The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
+authorization tokens, refresh them every 12 hours if needed, and fetch Docker
+images from ECR, as long as the appropriate permissions are enabled. The
+permissions in
+[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly)
+(which grants no write access) should suffice. The master policy is probably
+overly permissive. The security conscious may want to lock down the IAM policies
+further ([#11936](http://issues.k8s.io/11936)).
+
+We should make it easier to extend IAM permissions and also ensure that they
+are correctly configured ([#14226](http://issues.k8s.io/14226)).
+
+### Tagging
+
+All AWS resources are tagged with a tag named "KubernetesCluster", with a value
+that is the unique cluster-id. This tag is used to identify a particular
+'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
+Resources are considered to belong to the same cluster if and only if they have
+the same value in the tag named "KubernetesCluster". (The kube-up script is
+not configured to create multiple clusters in the same VPC by default, but it
+is possible to create another cluster in the same VPC.)
+
+Within the AWS cloud provider logic, we filter requests to the AWS APIs to
+match resources with our cluster tag. By filtering the requests, we ensure
+that we see only our own AWS objects.
+
+**Important:** If you choose not to use kube-up, you must pick a unique
+cluster-id value, and ensure that all AWS resources have a tag with
+`Name=KubernetesCluster,Value=<clusterid>`.
+
+### AWS objects
+
+The kube-up script does a number of things in AWS:
+* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
+distribution and the salt scripts into it. They are made world-readable and the
+HTTP URLs are passed to instances; this is how Kubernetes code gets onto the
+machines.
+* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
+ * `kubernetes-master` is used by the master.
+ * `kubernetes-node` is used by nodes.
+* Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is
+the OpenSSH key fingerprint, so that multiple users can run the script with
+different keys and their keys will not collide (with near-certainty). It will
+use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
+one there. (With the default Ubuntu images, if you have to SSH in: the user is
+`ubuntu` and that user can `sudo`).
+* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
+enables the `dns-support` and `dns-hostnames` options.
+* Creates an internet gateway for the VPC.
+* Creates a route table for the VPC, with the internet gateway as the default
+route.
+* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
+(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
+single AZ on AWS. There are two philosophies under discussion on how to
+achieve high availability (HA):
+ * cluster-per-AZ: An independent cluster for each AZ, where each cluster
+is entirely separate.
+ * cross-AZ-clusters: A single cluster spans multiple AZs.
+The debate is still open: cluster-per-AZ is considered more robust, but
+cross-AZ clusters are more convenient.
+* Associates the subnet to the route table
+* Creates security groups for the master (`kubernetes-master-<clusterid>`)
+and the nodes (`kubernetes-node-<clusterid>`).
+* Configures security groups so that masters and nodes can communicate. This
+includes intercommunication between masters and nodes, opening SSH publicly
+for both masters and nodes, and opening port 443 on the master for the HTTPS
+API endpoints.
+* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
+`MASTER_DISK_TYPE`.
+* Launches a master with a fixed IP address (172.20.0.9) that is also
+configured for the security group and all the necessary IAM credentials. An
+instance script is used to pass vital configuration information to Salt. Note:
+The hope is that over time we can reduce the amount of configuration
+information that must be passed in this way.
+* Once the instance is up, it attaches the EBS volume and sets up a manual
+routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
+10.246.0.0/24).
+* For auto-scaling, it creates a launch configuration and an auto-scaling
+group for the nodes. The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-node-group.
+The default name is kubernetes-node-group. The auto-scaling group has a min and
+max size that are both set to NUM_NODES. You can change the size of the
+auto-scaling group to add or remove nodes from within the AWS API or Console.
+Each node self-configures: it comes up, runs Salt with the stored configuration,
+connects to the master, and is assigned an internal CIDR; the master then
+configures the route table with the assigned CIDR. The kube-up script performs
+a health check on the nodes, but it's a self-check that is not required.
+
+If attempting this configuration manually, it is recommended to follow along
+with the kube-up script, being sure to tag everything with a tag named
+`KubernetesCluster` whose value is set to a unique cluster-id. Also, passing the
+right configuration options to Salt when not using the script is tricky: the
+plan here is to simplify this by having Kubernetes take on more node
+configuration, and even potentially remove Salt altogether.
+
+### Manual infrastructure creation
+
+While this work is not yet complete, advanced users might choose to manually
+create certain AWS objects while still making use of the kube-up script (to
+configure Salt, for example). These objects can currently be manually created:
+* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
+* Set the `VPC_ID` environment variable to reuse an existing VPC.
+* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
+* If your route table has a matching `KubernetesCluster` tag, it will be reused.
+* If your security groups are appropriately named, they will be reused.
+
+Currently there is no way to do the following with kube-up:
+* Use an existing AWS SSH key with an arbitrary name.
+* Override the IAM credentials in a sensible way
+([#14226](http://issues.k8s.io/14226)).
+* Use different security group permissions.
+* Configure your own auto-scaling groups.
+
+If any of the above items apply to your situation, open an issue to request an
+enhancement to the kube-up script. You should provide a complete description of
+the use-case, including all the details around what you want to accomplish.
+
+### Instance boot
+
+The instance boot procedure is currently pretty complicated, primarily because
+we must marshal configuration from Bash to Salt via the AWS instance script.
+As we move more post-boot configuration out of Salt and into Kubernetes, we
+will hopefully be able to simplify this.
+
+When the kube-up script launches instances, it builds an instance startup
+script which includes some configuration options passed to kube-up, and
+concatenates some of the scripts found in the cluster/aws/templates directory.
+These scripts are responsible for mounting and formatting volumes, downloading
+Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually
+install Kubernetes.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/client-package-structure.md b/contributors/design-proposals/client-package-structure.md
new file mode 100644
index 00000000..2d30021d
--- /dev/null
+++ b/contributors/design-proposals/client-package-structure.md
@@ -0,0 +1,316 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Client: layering and package structure](#client-layering-and-package-structure)
+ - [Desired layers](#desired-layers)
+ - [Transport](#transport)
+ - [RESTClient/request.go](#restclientrequestgo)
+ - [Mux layer](#mux-layer)
+ - [High-level: Individual typed](#high-level-individual-typed)
+ - [High-level, typed: Discovery](#high-level-typed-discovery)
+ - [High-level: Dynamic](#high-level-dynamic)
+ - [High-level: Client Sets](#high-level-client-sets)
+ - [Package Structure](#package-structure)
+ - [Client Guarantees (and testing)](#client-guarantees-and-testing)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Client: layering and package structure
+
+## Desired layers
+
+### Transport
+
+The transport layer is concerned with round-tripping requests to an apiserver
+somewhere. It consumes a Config object with options appropriate for this.
+(That's most of the current client.Config structure.)
+
+Transport delivers an object that implements http's RoundTripper interface
+and/or can be used in place of http.DefaultTransport to route requests.
+
+Transport objects are safe for concurrent use, and are cached and reused by
+subsequent layers.
+
+Tentative name: "Transport".
+
+It's expected that the transport config will be general enough that third
+parties (e.g., OpenShift) will not need their own implementation; rather, they
+can change the certs, tokens, etc., to be appropriate for their own servers.
+
+Action items:
+* Split out of current client package into a new package. (@krousey)
+
+### RESTClient/request.go
+
+RESTClient consumes a Transport and a Codec (and optionally a group/version),
+and produces something that implements the interface currently in request.go.
+That is, with a RESTClient, you can write chains of calls like:
+
+`c.Get().Path(p).Param("name", "value").Do()`
+
+RESTClient is generically usable by any client for servers exposing REST-like
+semantics. It provides helpers that benefit those following api-conventions.md,
+but does not mandate them. It provides a higher level http interface that
+abstracts transport, wire serialization, retry logic, and error handling.
+Kubernetes-like constructs that deviate from standard HTTP should be bypassable.
+Every non-trivial call made to a remote restful API from Kubernetes code should
+go through a rest client.
+
+The group and version may be empty when constructing a RESTClient. This is valid
+for executing discovery commands. The group and version may be overridable with
+a chained function call.
+
+Ideally, no semantic behavior is built into RESTClient, and RESTClient will use
+the Codec it was constructed with for all semantic operations, including turning
+options objects into URL query parameters. Unfortunately, that is not true of
+today's RESTClient, which may have some semantic information built in. We will
+remove this.
+
+RESTClient should not make assumptions about the format of data produced or
+consumed by the Codec. Currently, it is JSON, but we want to support binary
+protocols in the future.
+
+The Codec would look something like this:
+
+```go
+type Codec interface {
+ Encode(runtime.Object) ([]byte, error)
+ Decode([]byte) (runtime.Object, error)
+
+ // Used to version-control query parameters
+ EncodeParameters(optionsObject runtime.Object) (url.Values, error)
+
+ // Not included here since the client doesn't need it, but a corresponding
+ // DecodeParametersInto method would be available on the server.
+}
+```
+
+There should be one codec per version. RESTClient is *not* responsible for
+converting between versions; if a client wishes, they can supply a Codec that
+does that. But RESTClient will make the assumption that it's talking to a single
+group/version, and will not contain any conversion logic. (This is a slight
+change from the current state.)
+
+As with Transport, it is expected that 3rd party providers following the api
+conventions should be able to use RESTClient, and will not need to implement
+their own.
+
+Action items:
+* Split out of the current client package. (@krousey)
+* Possibly, convert to an interface (currently, it's a struct). This will allow
+ extending the error-checking monad that's currently in request.go up an
+ additional layer.
+* Switch from ParamX("x") functions to using types representing the collection
+ of parameters and the Codec for query parameter serialization.
+* Any other Kubernetes group specific behavior should also be removed from
+ RESTClient.
+
+### Mux layer
+
+(See TODO at end; this can probably be merged with the "client set" concept.)
+
+The client muxer layer has a map of group/version to cached RESTClient, and
+knows how to construct a new RESTClient in case of a cache miss (using the
+discovery client mentioned below). The ClientMux may need to deal with multiple
+transports pointing at differing destinations (e.g. OpenShift or other 3rd party
+provider API may be at a different location).
+
+When constructing a RESTClient generically, the muxer will just use the Codec
+the high-level dynamic client would use. Alternatively, the user should be able
+to pass in a Codec, for the case where the correct types are compiled in.
+
+Tentative name: ClientMux
+
+Action items:
+* Move client cache out of kubectl libraries into a more general home.
+* TODO: a mux layer may not be necessary, depending on what needs to be cached.
+ If transports are cached already, and RESTClients are extremely light-weight,
+ there may not need to be much code at all in this layer.
+
+### High-level: Individual typed
+
+Our current high-level client allows you to write things like
+`c.Pods("namespace").Create(p)`; we will insert a level for the group.
+
+That is, the system will be:
+
+`clientset.GroupName().NamespaceSpecifier().Action()`
+
+Where:
+* `clientset` is a thing that holds multiple individually typed clients (see
+ below).
+* `GroupName()` returns the generated client that this section is about.
+* `NamespaceSpecifier()` may take a namespace parameter or nothing.
+* `Action` is one of Create/Get/Update/Delete/Watch, or appropriate actions
+ from the type's subresources.
+* It is TBD how we'll represent subresources and their actions. This is
+ inconsistent in the current clients, so we'll need to define a consistent
+ format. Possible choices:
+ * Insert a `.Subresource()` before the `.Action()`
+ * Flatten subresources, such that they become special Actions on the parent
+ resource.
+
+The types returned/consumed by such functions will be e.g. api/v1, NOT the
+current version-agnostic internal types. The current internal-versioned client is
+inconvenient for users, as it does not protect them from having to recompile
+their code with every minor update. (We may continue to generate an
+internal-versioned client for our own use for a while, but even for our own
+components it probably makes sense to switch to specifically versioned clients.)
+
+We will provide this structure for each version of each group. It is infeasible
+to do this manually, so we will generate this. The generator will accept both
+swagger and the ordinary go types. The generator should operate on out-of-tree
+sources AND out-of-tree destinations, so it will be useful for consuming
+out-of-tree APIs and for others to build custom clients into their own
+repositories.
+
+Typed clients will be constructable given a ClientMux; the typed constructor will use
+the ClientMux to find or construct an appropriate RESTClient. Alternatively, a
+typed client should be constructable individually given a config, from which it
+will be able to construct the appropriate RESTClient.
+
+Typed clients do not require any version negotiation. The server either supports
+the client's group/version, or it does not. However, there are ways around this:
+* If you want to use a typed client against a server's API endpoint and the
+ server's API version doesn't match the client's API version, you can construct
+ the client with a RESTClient using a Codec that does the conversion (this is
+ basically what our client does now).
+* Alternatively, you could use the dynamic client.
+
+Action items:
+* Move current typed clients into new directory structure (described below)
+* Finish client generation logic. (@caesarxuchao, @lavalamp)
+
+#### High-level, typed: Discovery
+
+A `DiscoveryClient` is necessary to discover the api groups, versions, and
+resources a server supports. It's constructable given a RESTClient. It is
+consumed by both the ClientMux and users who want to iterate over groups,
+versions, or resources. (Example: namespace controller.)
+
+The DiscoveryClient is *not* required if you already know the group/version of
+the resource you want to use: you can simply try the operation without checking
+first, which is lower-latency anyway as it avoids an extra round-trip.
+
+Action items:
+* Refactor existing functions to present a sane interface, as close to that
+ offered by the other typed clients as possible. (@caesarxuchao)
+* Use a RESTClient to make the necessary API calls.
+* Make sure that no discovery happens unless it is explicitly requested. (Make
+ sure SetKubeDefaults doesn't call it, for example.)
+
+### High-level: Dynamic
+
+The dynamic client lets users consume apis which are not compiled into their
+binary. It will provide the same interface as the typed client, but will take
+and return `runtime.Object`s instead of typed objects. There is only one dynamic
+client, so it's not necessary to generate it, although optionally we may do so
+depending on whether the typed client generator makes it easy.
+
+A dynamic client is constructable given a config, group, and version. It will
+use this to construct a RESTClient with a Codec which encodes/decodes to
+'Unstructured' `runtime.Object`s. The group and version may be from a previous
+invocation of a DiscoveryClient, or they may be known by other means.
+
+For now, the dynamic client will assume that a JSON encoding is allowed. In the
+future, if we have binary-only APIs (unlikely?), we can add that to the
+discovery information and construct an appropriate dynamic Codec.
+
+Action items:
+* A rudimentary version of this exists in kubectl's builder. It needs to be
+ moved to a more general place.
+* Produce a useful 'Unstructured' runtime.Object, which allows for easy
+ Object/ListMeta introspection.
+
+### High-level: Client Sets
+
+Because there will be multiple groups with multiple versions, we will provide an
+aggregation layer that combines multiple typed clients in a single object.
+
+We do this to:
+* Deliver a concrete thing for users to consume, construct, and pass around. We
+ don't want people making 10 typed clients and making a random system to keep
+ track of them.
+* Constrain the testing matrix. Users can generate a client set at their whim
+ against their cluster, but we need to make guarantees that the clients we
+ shipped with v1.X.0 will work with v1.X+1.0, and vice versa. That's not
+ practical unless we "bless" a particular version of each API group and ship an
+ official client set with earch release. (If the server supports 15 groups with
+ 2 versions each, that's 2^15 different possible client sets. We don't want to
+ test all of them.)
+
+A client set is generated into its own package. The generator will take the list
+of group/versions to be included. Only one version from each group will be in
+the client set.
+
+A client set is constructable at runtime from either a ClientMux or a transport
+config (for easy one-stop-shopping).
+
+An example:
+
+```go
+import (
+ api_v1 "k8s.io/kubernetes/pkg/client/typed/generated/v1"
+ ext_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/extensions/v1beta1"
+ net_v1beta1 "k8s.io/kubernetes/pkg/client/typed/generated/net/v1beta1"
+ "k8s.io/kubernetes/pkg/client/typed/dynamic"
+)
+
+type Client interface {
+ API() api_v1.Client
+ Extensions() ext_v1beta1.Client
+ Net() net_v1beta1.Client
+ // ... other typed clients here.
+
+ // Included in every set
+ Discovery() discovery.Client
+ GroupVersion(group, version string) dynamic.Client
+}
+```
+
+Note that a particular version is chosen for each group. It is a general rule
+for our API structure that no client need care about more than one version of
+each group at a time.
+
+This is the primary deliverable that people would consume. It is also generated.
+
+Action items:
+* This needs to be built. It will replace the ClientInterface that everyone
+ passes around right now.
+
+## Package Structure
+
+```
+pkg/client/
+----------/transport/ # transport & associated config
+----------/restclient/
+----------/clientmux/
+----------/typed/
+----------------/discovery/
+----------------/generated/
+--------------------------/<group>/
+----------------------------------/<version>/
+--------------------------------------------/<resource>.go
+----------------/dynamic/
+----------/clientsets/
+---------------------/release-1.1/
+---------------------/release-1.2/
+---------------------/the-test-set-you-just-generated/
+```
+
+The `/clientsets/` directories will retain their contents until they reach
+their expiration date.
+e.g., when we release v1.N, we'll remove clientset v1.(N-3). Clients from old
+releases live on and continue to work (i.e., are tested) without any interface
+changes for multiple releases, to give users time to transition.
+
+## Client Guarantees (and testing)
+
+Once we release a clientset, we will not make interface changes to it. Users of
+that client will not have to change their code until they are deliberately
+upgrading their import. We probably will want to generate some sort of stub test
+with a clientset, to ensure that we don't change the interface.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/client-package-structure.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/cluster-deployment.md b/contributors/design-proposals/cluster-deployment.md
new file mode 100644
index 00000000..c6466f89
--- /dev/null
+++ b/contributors/design-proposals/cluster-deployment.md
@@ -0,0 +1,171 @@
+# Objective
+
+Simplify the cluster provisioning process for a cluster with one master and multiple worker nodes.
+It should be secured with SSL and have all the default add-ons. There should not be significant
+differences in the provisioning process across deployment targets (cloud provider + OS distribution)
+once machines meet the node specification.
+
+# Overview
+
+Cluster provisioning can be broken into a number of phases, each with their own exit criteria.
+In some cases, multiple phases will be combined together to more seamlessly automate the cluster setup,
+but in all cases the phases can be run sequentially to provision a functional cluster.
+
+It is possible that for some platforms we will provide an optimized flow that combines some of the steps
+together, but that is out of scope of this document.
+
+# Deployment flow
+
+**Note**: _Exit criteria_ in the following sections are not intended to list all tests that should pass;
+rather, they list those that must pass.
+
+## Step 1: Provision cluster
+
+**Objective**: Create a set of machines (master + nodes) where we will deploy Kubernetes.
+
+For this phase to be completed successfully, the following requirements must be completed for all nodes:
+- Basic connectivity between nodes (i.e. nodes can all ping each other)
+- Docker installed (and in production setups should be monitored to be always running)
+- One of the supported OS distributions
+
+We will provide a node specification conformance test that will verify if provisioning has been successful.
+
+This step is provider specific and will be implemented for each cloud provider + OS distribution separately
+using provider specific technology (cloud formation, deployment manager, PXE boot, etc).
+Some OS distributions may meet the provisioning criteria without needing to run any post-boot steps as they
+ship with all of the requirements for the node specification by default.
+
+**Substeps** (on the GCE example):
+
+1. Create network
+2. Create firewall rules to allow communication inside the cluster
+3. Create firewall rule to allow ```ssh``` to all machines
+4. Create firewall rule to allow ```https``` to master
+5. Create persistent disk for master
+6. Create static IP address for master
+7. Create master machine
+8. Create node machines
+9. Install docker on all machines
+
+**Exit criteria**:
+
+1. Can ```ssh``` to all machines and run a test docker image
+2. Can ```ssh``` to master and nodes and ping other machines
+
+## Step 2: Generate certificates
+
+**Objective**: Generate security certificates used to configure secure communication between client, master and nodes
+
+TODO: Enumerate certificates which have to be generated.
+
+## Step 3: Deploy master
+
+**Objective**: Run kubelet and all the required components (e.g. etcd, apiserver, scheduler, controllers) on the master machine.
+
+**Substeps**:
+
+1. copy certificates
+2. copy manifests for static pods (see the sketch after this list):
+ 1. etcd
+ 2. apiserver, controller manager, scheduler
+3. run kubelet in docker container (configuration is read from apiserver Config object)
+4. run kubelet-checker in docker container
+
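+As an illustration of substep 2, a (hypothetical) static pod manifest for etcd,
+dropped into the kubelet's manifest directory on the master, might look roughly
+like this; the image, version, flags, and paths are assumptions rather than the
+exact files shipped by the deployment scripts:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: etcd-server
+  namespace: kube-system
+spec:
+  hostNetwork: true
+  containers:
+  - name: etcd
+    image: gcr.io/google_containers/etcd:2.2.1    # assumed image and version
+    command:
+    - /usr/local/bin/etcd
+    - --listen-client-urls=http://127.0.0.1:2379
+    - --advertise-client-urls=http://127.0.0.1:2379
+    - --data-dir=/var/etcd/data
+    volumeMounts:
+    - name: etcd-data
+      mountPath: /var/etcd/data
+  volumes:
+  - name: etcd-data
+    hostPath:
+      path: /var/etcd/data                        # assumed host data directory
+```
+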
+**v1.2 simplifications**:
+
+1. kubelet-runner.sh - we will provide a custom docker image to run kubelet; it will contain
+the kubelet binary and will run it using ```nsenter``` to work around a problem with mount propagation
+1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will
+be generated locally and copied to all nodes.
+
+**Exit criteria**:
+
+1. Can run basic API calls (e.g. create, list and delete pods) from the client side (e.g. the replication
+controller works - a user can create an RC object and the RC manager creates pods based on it)
+2. Critical master components work:
+ 1. scheduler
+ 2. controller manager
+
+## Step 4: Deploy nodes
+
+**Objective**: Start kubelet on all nodes and configure the Kubernetes network.
+Each node can be deployed separately, and the implementation should make it nearly impossible to break this assumption.
+
+### Step 4.1: Run kubelet
+
+**Substeps**:
+
+1. copy certificates
+2. run kubelet in docker container (configuration is read from apiserver Config object)
+3. run kubelet-checker in docker container
+
+**v1.2 simplifications**:
+
+1. kubelet config file - we will read kubelet configuration file from disk instead of apiserver; it will
+be generated locally and copied to all nodes.
+
+**Exit criteria**:
+
+1. All nodes are registered, but not ready due to lack of kubernetes networking.
+
+### Step 4.2: Setup kubernetes networking
+
+**Objective**: Configure the Kubernetes networking to allow routing requests to pods and services.
+
+To keep the default setup consistent across open source deployments, we will use Flannel to configure
+Kubernetes networking. However, the implementation of this step will make it easy to plug in
+different network solutions.
+
+**Substeps**:
+
+1. copy manifest for flannel server to master machine
+2. create a daemonset with flannel daemon (it will read assigned CIDR and configure network appropriately).
+
+**v1.2 simplifications**:
+
+1. the flannel daemon will run as a standalone binary (not in a docker container)
+2. the flannel server will assign CIDRs to nodes outside of kubernetes; this will require restarting the kubelet
+after reconfiguring the network bridge on the local machine; it will also require running the master and nodes differently
+(```--configure-cbr0=false``` on nodes and ```--allocate-node-cidrs=false``` on the master), which breaks encapsulation
+between nodes
+
+**Exit criteria**:
+
+1. Pods correctly created, scheduled, run and accessible from all nodes.
+
+## Step 5: Add daemons
+
+**Objective:** Start all system daemons (e.g. kube-proxy)
+
+**Substeps**:
+
+1. Create daemonset for kube-proxy
+
+**Exit criteria**:
+
+1. Services work correctly on all nodes.
+
+## Step 6: Add add-ons
+
+**Objective**: Add default add-ons (e.g. dns, dashboard)
+
+**Substeps**:
+
+1. Create Deployments (and daemonsets if needed) for all add-ons
+
+## Deployment technology
+
+We will use Ansible as the default technology for deployment orchestration. It has low requirements on the cluster machines
+and is popular in the Kubernetes community, which will help us maintain it.
+
+For a simpler UX, we will provide simple bash scripts that wrap all basic deployment commands (e.g. ```up``` or ```down```).
+
+One disadvantage of using Ansible is that it adds a dependency on the machine that runs the deployment scripts. We will work around
+this by distributing the deployment scripts via a docker image, so that the user runs the following command to create a cluster:
+
+```docker run gcr.io/google_containers/deploy_kubernetes:v1.2 up --num-nodes=3 --provider=aws```
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/cluster-deployment.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/clustering.md b/contributors/design-proposals/clustering.md
new file mode 100644
index 00000000..ca42035b
--- /dev/null
+++ b/contributors/design-proposals/clustering.md
@@ -0,0 +1,128 @@
+# Clustering in Kubernetes
+
+
+## Overview
+
+The term "clustering" refers to the process of having all members of the
+Kubernetes cluster find and trust each other. There are multiple different ways
+to achieve clustering with different security and usability profiles. This
+document attempts to lay out the user experiences for clustering that Kubernetes
+aims to address.
+
+Once a cluster is established, the following is true:
+
+1. **Master -> Node** The master needs to know which nodes can take work and
+what their current status is with respect to capacity.
+ 1. **Location** The master knows the name and location of all of the nodes in
+the cluster.
+ * For the purposes of this doc, location and name should be enough
+information so that the master can open a TCP connection to the Node. Most
+probably we will make this either an IP address or a DNS name. It is going to be
+important to be consistent here (master must be able to reach kubelet on that
+DNS name) so that we can verify certificates appropriately.
+ 2. **Target AuthN** A way to securely talk to the kubelet on that node.
+Currently we call out to the kubelet over HTTP. This should be over HTTPS and
+the master should know what CA to trust for that node.
+ 3. **Caller AuthN/Z** This would be the master verifying itself (and
+permissions) when calling the node. Currently, this is only used to collect
+statistics as authorization isn't critical. This may change in the future
+though.
+2. **Node -> Master** The nodes currently talk to the master to know which pods
+have been assigned to them and to publish events.
+ 1. **Location** The nodes must know where the master is.
+ 2. **Target AuthN** Since the master is assigning work to the nodes, it is
+critical that they verify whom they are talking to.
+ 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to
+the master. Ideally this authentication is specific to each node so that
+authorization can be narrowly scoped. The details of the work to run (including
+things like environment variables) might be considered sensitive and should be
+locked down also.
+
+**Note:** While the description here refers to a singular Master, in the future
+we should enable multiple Masters operating in an HA mode. While the "Master" is
+currently the combination of the API Server, Scheduler and Controller Manager,
+we will restrict ourselves to thinking about the main API and policy engine --
+the API Server.
+
+## Current Implementation
+
+A central authority (generally the master) is responsible for determining the
+set of machines which are members of the cluster. Calls to create and remove
+worker nodes in the cluster are restricted to this single authority, and any
+other requests to add or remove worker nodes are rejected. (1.i.)
+
+Communication from the master to nodes is currently over HTTP and is not secured
+or authenticated in any way. (1.ii, 1.iii.)
+
+The location of the master is communicated out of band to the nodes. For GCE,
+this is done via Salt. Other cluster instructions/scripts use other methods.
+(2.i.)
+
+Currently most communication from the node to the master is over HTTP. When it
+is done over HTTPS, there is currently no verification of the master's certificate.
+(2.ii.)
+
+Currently, the node/kubelet is authenticated to the master via a token shared
+across all nodes. This token is distributed out of band (using Salt for GCE) and
+is optional. If it is not present then the kubelet is unable to publish events
+to the master. (2.iii.)
+
+Our current mix of out of band communication doesn't meet all of our needs from
+a security point of view and is difficult to set up and configure.
+
+## Proposed Solution
+
+The proposed solution will provide a range of options for setting up and
+maintaining a secure Kubernetes cluster. We want to allow both for centrally
+controlled systems (leveraging pre-existing trust and configuration systems) and for
+more ad-hoc, automagic systems that are incredibly easy to set up.
+
+The building blocks of an easier solution:
+
+* **Move to TLS** We will move to using TLS for all intra-cluster communication.
+We will explicitly identify the trust chain (the set of trusted CAs) as opposed
+to trusting the system CAs. We will also use client certificates for all AuthN.
+* [optional] **API driven CA** Optionally, we will run a CA in the master that
+will mint certificates for the nodes/kubelets. There will be pluggable policies
+that will automatically approve certificate requests here as appropriate.
+ * **CA approval policy** This is a pluggable policy object that can
+automatically approve CA signing requests. Stock policies will include
+`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would
+be an API for evaluating and accepting/rejecting requests. Cloud providers could
+implement a policy here that verifies other out of band information and
+automatically approves/rejects based on other external factors (a sketch of such a policy interface follows this list).
+* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give
+a node permission to register itself.
+ * To start with, we'd have the kubelets generate a cert/account in the form of
+`kubelet:<host>`. To start we would then hard code policy such that we give that
+particular account appropriate permissions. Over time, we can make the policy
+engine more generic.
+* [optional] **Bootstrap API endpoint** This is a helper service hosted outside
+of the Kubernetes cluster that helps with initial discovery of the master.
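+
+A minimal Go sketch of the pluggable CA approval policy described above (all names are illustrative, not a committed API):
+
+```go
+package approval
+
+// SigningRequest carries what a policy needs to make a decision.
+type SigningRequest struct {
+    NodeName string // e.g. "kubelet:<host>"
+    CSRPEM   []byte // PEM-encoded certificate signing request
+}
+
+// Decision is the outcome of evaluating a request.
+type Decision int
+
+const (
+    Pending Decision = iota // leave queued for manual review (the `queue` policy)
+    Approve
+    Reject
+)
+
+// Policy is the pluggable hook run by the in-master CA. Cloud providers
+// could supply their own implementation that checks out-of-band data.
+type Policy interface {
+    Evaluate(req SigningRequest) (Decision, error)
+}
+
+// Stock policies named in the list above.
+type alwaysReject struct{}
+
+func (alwaysReject) Evaluate(SigningRequest) (Decision, error) { return Reject, nil }
+
+type insecureAlwaysApprove struct{}
+
+func (insecureAlwaysApprove) Evaluate(SigningRequest) (Decision, error) { return Approve, nil }
+
+type queuePolicy struct{}
+
+func (queuePolicy) Evaluate(SigningRequest) (Decision, error) { return Pending, nil }
+```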
+
+### Static Clustering
+
+In this sequence diagram, there is an out-of-band admin entity that creates all
+certificates and distributes them. It also makes sure that the kubelets
+know where to find the master. This provides a lot of control but is more
+difficult to set up, as lots of information must be communicated outside of
+Kubernetes.
+
+![Static Sequence Diagram](clustering/static.png)
+
+### Dynamic Clustering
+
+This diagram shows dynamic clustering using the bootstrap API endpoint. This
+endpoint is used to both find the location of the master and communicate the
+root CA for the master.
+
+This flow has the admin manually approving the kubelet signing requests. This is
+the `queue` policy defined above. This manual intervention could be replaced by
+code that can verify the signing requests via other means.
+
+![Dynamic Sequence Diagram](clustering/dynamic.png)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/clustering/.gitignore b/contributors/design-proposals/clustering/.gitignore
new file mode 100644
index 00000000..67bcd6cb
--- /dev/null
+++ b/contributors/design-proposals/clustering/.gitignore
@@ -0,0 +1 @@
+DroidSansMono.ttf
diff --git a/contributors/design-proposals/clustering/Dockerfile b/contributors/design-proposals/clustering/Dockerfile
new file mode 100644
index 00000000..e7abc753
--- /dev/null
+++ b/contributors/design-proposals/clustering/Dockerfile
@@ -0,0 +1,26 @@
+# Copyright 2016 The Kubernetes Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM debian:jessie
+
+RUN apt-get update
+RUN apt-get -qy install python-seqdiag make curl
+
+WORKDIR /diagrams
+
+RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf
+
+ADD . /diagrams
+
+CMD bash -c 'make >/dev/stderr && tar cf - *.png' \ No newline at end of file
diff --git a/contributors/design-proposals/clustering/Makefile b/contributors/design-proposals/clustering/Makefile
new file mode 100644
index 00000000..e72d441e
--- /dev/null
+++ b/contributors/design-proposals/clustering/Makefile
@@ -0,0 +1,41 @@
+# Copyright 2016 The Kubernetes Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FONT := DroidSansMono.ttf
+
+PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag))
+
+.PHONY: all
+all: $(PNGS)
+
+.PHONY: watch
+watch:
+ fswatch *.seqdiag | xargs -n 1 sh -c "make || true"
+
+$(FONT):
+ curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT)
+
+%.png: %.seqdiag $(FONT)
+ seqdiag --no-transparency -a -f '$(FONT)' $<
+
+# Build the stuff via a docker image
+.PHONY: docker
+docker:
+ docker build -t clustering-seqdiag .
+ docker run --rm clustering-seqdiag | tar xvf -
+
+.PHONY: docker-clean
+docker-clean:
+ docker rmi clustering-seqdiag || true
+ docker images -q --filter "dangling=true" | xargs docker rmi
diff --git a/contributors/design-proposals/clustering/README.md b/contributors/design-proposals/clustering/README.md
new file mode 100644
index 00000000..d7e2e2e0
--- /dev/null
+++ b/contributors/design-proposals/clustering/README.md
@@ -0,0 +1,35 @@
+This directory contains diagrams for the clustering design doc.
+
+This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html).
+Assuming you have a non-borked python install, this should be installable with:
+
+```sh
+pip install seqdiag
+```
+
+Just call `make` to regenerate the diagrams.
+
+## Building with Docker
+
+If you are on a Mac or your pip install is messed up, you can easily build with
+docker:
+
+```sh
+make docker
+```
+
+The first run will be slow but things should be fast after that.
+
+To clean up the docker containers that are created (and other cruft that is left
+around) you can run `make docker-clean`.
+
+## Automatically rebuild on file changes
+
+If you have the fswatch utility installed, you can have it monitor the file
+system and automatically rebuild when files have changed. Just do a
+`make watch`.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/clustering/dynamic.png b/contributors/design-proposals/clustering/dynamic.png
new file mode 100644
index 00000000..92b40fee
--- /dev/null
+++ b/contributors/design-proposals/clustering/dynamic.png
Binary files differ
diff --git a/contributors/design-proposals/clustering/dynamic.seqdiag b/contributors/design-proposals/clustering/dynamic.seqdiag
new file mode 100644
index 00000000..567d5bf9
--- /dev/null
+++ b/contributors/design-proposals/clustering/dynamic.seqdiag
@@ -0,0 +1,24 @@
+seqdiag {
+ activation = none;
+
+
+ user[label = "Admin User"];
+ bootstrap[label = "Bootstrap API\nEndpoint"];
+ master;
+ kubelet[stacked];
+
+ user -> bootstrap [label="createCluster", return="cluster ID"];
+ user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"];
+
+ user ->> master [label="start\n- bootstrap-cluster-uri"];
+ master => bootstrap [label="setMaster\n- master-location\n- master-ca"];
+
+ user ->> kubelet [label="start\n- bootstrap-cluster-uri"];
+ kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"];
+ kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"];
+ user => master [label="getSignRequests"];
+ user => master [label="approveSignRequests"];
+ kubelet <<-- master [label="returns\n- kubelet-cert"];
+
+ kubelet => master [label="register\n- kubelet-location"]
+}
diff --git a/contributors/design-proposals/clustering/static.png b/contributors/design-proposals/clustering/static.png
new file mode 100644
index 00000000..bcdeca7e
--- /dev/null
+++ b/contributors/design-proposals/clustering/static.png
Binary files differ
diff --git a/contributors/design-proposals/clustering/static.seqdiag b/contributors/design-proposals/clustering/static.seqdiag
new file mode 100644
index 00000000..bdc54b76
--- /dev/null
+++ b/contributors/design-proposals/clustering/static.seqdiag
@@ -0,0 +1,16 @@
+seqdiag {
+ activation = none;
+
+ admin[label = "Manual Admin"];
+ ca[label = "Manual CA"]
+ master;
+ kubelet[stacked];
+
+ admin => ca [label="create\n- master-cert"];
+ admin ->> master [label="start\n- ca-root\n- master-cert"];
+
+ admin => ca [label="create\n- kubelet-cert"];
+ admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"];
+
+ kubelet => master [label="register\n- kubelet-location"];
+}
diff --git a/contributors/design-proposals/command_execution_port_forwarding.md b/contributors/design-proposals/command_execution_port_forwarding.md
new file mode 100644
index 00000000..a7175403
--- /dev/null
+++ b/contributors/design-proposals/command_execution_port_forwarding.md
@@ -0,0 +1,158 @@
+# Container Command Execution & Port Forwarding in Kubernetes
+
+## Abstract
+
+This document describes how to use Kubernetes to execute commands in containers,
+with stdin/stdout/stderr streams attached, and how to implement port forwarding
+to containers.
+
+## Background
+
+See the following related issues/PRs:
+
+- [Support attach](http://issue.k8s.io/1521)
+- [Real container ssh](http://issue.k8s.io/1513)
+- [Provide easy debug network access to services](http://issue.k8s.io/1863)
+- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576)
+
+## Motivation
+
+Users and administrators are accustomed to being able to access their systems
+via SSH to run remote commands, get shell access, and do port forwarding.
+
+Supporting SSH to containers in Kubernetes is a difficult task. You must
+specify a "user" and a hostname to make an SSH connection, and `sshd` requires
+real users (resolvable by NSS and PAM). Because a container belongs to a pod,
+and the pod belongs to a namespace, you need to specify namespace/pod/container
+to uniquely identify the target container. Unfortunately, a
+namespace/pod/container is not a real user as far as SSH is concerned. Also,
+most Linux systems limit user names to 32 characters, which is unlikely to be
+large enough to contain namespace/pod/container. We could devise some scheme to
+map each namespace/pod/container to a 32-character user name, adding entries to
+`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
+time. Alternatively, we could write custom NSS and PAM modules that allow the
+host to resolve a namespace/pod/container to a user without needing to keep
+files or LDAP in sync.
+
+As an alternative to SSH, we are using a multiplexed streaming protocol that
+runs on top of HTTP. There are no requirements about users being real users,
+nor is there any limitation on user name length, as the protocol is under our
+control. The only downside is that standard tooling that expects to use SSH
+won't be able to work with this mechanism, unless adapters can be written.
+
+## Constraints and Assumptions
+
+- SSH support is not currently in scope.
+- CGroup confinement is ultimately desired, but implementing that support is not
+currently in scope.
+- SELinux confinement is ultimately desired, but implementing that support is
+not currently in scope.
+
+## Use Cases
+
+- A user of a Kubernetes cluster wants to run arbitrary commands in a
+container with local stdin/stdout/stderr attached to the container.
+- A user of a Kubernetes cluster wants to connect to local ports on his computer
+and have them forwarded to ports in a container.
+
+## Process Flow
+
+### Remote Command Execution Flow
+
+1. The client connects to the Kubernetes Master to initiate a remote command
+execution request.
+2. The Master proxies the request to the Kubelet where the container lives.
+3. The Kubelet executes nsenter + the requested command and streams
+stdin/stdout/stderr back and forth between the client and the container.
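+
+A rough sketch of step 3 (a hypothetical helper, not the actual Kubelet code), joining the container's namespaces with nsenter and wiring the remote streams to the requested command:
+
+```go
+package remotecommand
+
+import (
+    "fmt"
+    "io"
+    "os/exec"
+)
+
+// run executes cmd inside the namespaces of the container whose init
+// process has PID `pid`, wiring the remote streams to the process.
+func run(pid int, cmd []string, stdin io.Reader, stdout, stderr io.Writer) error {
+    args := append([]string{"-t", fmt.Sprintf("%d", pid), "-m", "-u", "-i", "-n", "-p", "--"}, cmd...)
+    c := exec.Command("nsenter", args...)
+    c.Stdin, c.Stdout, c.Stderr = stdin, stdout, stderr
+    return c.Run()
+}
+```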
+
+### Port Forwarding Flow
+
+1. The client connects to the Kubernetes Master to initiate a port forwarding
+request.
+2. The Master proxies the request to the Kubelet where the container lives.
+3. The client listens on each specified local port, awaiting local connections.
+4. The client connects to one of the local listening ports.
+5. The client notifies the Kubelet of the new connection.
+6. The Kubelet executes nsenter + socat and streams data back and forth between
+the client and the port in the container.
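+
+A rough sketch of the last step above (a hypothetical helper, not the actual Kubelet code), bridging the client stream to a port inside the container's network namespace with nsenter and socat:
+
+```go
+package portforward
+
+import (
+    "fmt"
+    "io"
+    "os/exec"
+)
+
+// forward copies data between the client stream and TCP port `port`
+// inside the network namespace of the container whose init process
+// has PID `pid`.
+func forward(stream io.ReadWriter, pid int, port uint16) error {
+    cmd := exec.Command(
+        "nsenter", "-t", fmt.Sprintf("%d", pid), "-n", // join the container's network namespace
+        "socat", "-", fmt.Sprintf("TCP4:localhost:%d", port), // bridge stdio to the target port
+    )
+    cmd.Stdin = stream
+    cmd.Stdout = stream
+    return cmd.Run()
+}
+```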
+
+## Design Considerations
+
+### Streaming Protocol
+
+The current multiplexed streaming protocol used is SPDY. This is not the
+long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
+we will switch to that.
+
+### Master as First Level Proxy
+
+Clients should not be allowed to communicate directly with the Kubelet for
+security reasons. Therefore, the Master is currently the only suggested entry
+point to be used for remote command execution and port forwarding. This is not
+necessarily desirable, as it means that all remote command execution and port
+forwarding traffic must travel through the Master, potentially impacting other
+API requests.
+
+In the future, it might make more sense to retrieve an authorization token from
+the Master, and then use that token to initiate a remote command execution or
+port forwarding request with a load balanced proxy service dedicated to this
+functionality. This would keep the streaming traffic out of the Master.
+
+### Kubelet as Backend Proxy
+
+The kubelet is currently responsible for handling remote command execution and
+port forwarding requests. Just like with the Master described above, this means
+that all remote command execution and port forwarding streaming traffic must
+travel through the Kubelet, which could result in a degraded ability to service
+other requests.
+
+In the future, it might make more sense to use a separate service on the node.
+
+Alternatively, we could possibly inject a process into the container that only
+listens for a single request, expose that process's listening port on the node,
+and then issue a redirect to the client such that it would connect to the first
+level proxy, which would then proxy directly to the injected process's exposed
+port. This would minimize the amount of proxying that takes place.
+
+### Scalability
+
+There are at least 2 different ways to execute a command in a container:
+`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
+more obvious choice, it has some drawbacks.
+
+#### `docker exec`
+
+We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
+on the node), but this would require proxying from the edge and securing the
+Docker API. `docker exec` calls go through the Docker daemon, meaning that all
+stdin/stdout/stderr traffic is proxied through the Daemon, adding an extra hop.
+Additionally, you can't isolate 1 malicious `docker exec` call from normal
+usage, meaning an attacker could initiate a denial of service or other attack
+and take down the Docker daemon, or the node itself.
+
+We expect remote command execution and port forwarding requests to be long
+running and/or high bandwidth operations, and routing all the streaming data
+through the Docker daemon feels like a bottleneck we can avoid.
+
+#### `nsenter`
+
+The implementation currently uses `nsenter` to run commands in containers,
+joining the appropriate container namespaces. `nsenter` runs directly on the
+node and is not proxied through any single daemon process.
+
+### Security
+
+Authentication and authorization haven't specifically been tested yet with this
+functionality. We need to make sure that users are not allowed to execute
+remote commands or do port forwarding to containers they aren't allowed to
+access.
+
+Additional work is required to ensure that multiple command execution or port
+forwarding connections from different clients are not able to see each other's
+data. This can most likely be achieved via SELinux labeling and unique process
+contexts.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/configmap.md b/contributors/design-proposals/configmap.md
new file mode 100644
index 00000000..658ac73b
--- /dev/null
+++ b/contributors/design-proposals/configmap.md
@@ -0,0 +1,300 @@
+# Generic Configuration Object
+
+## Abstract
+
+The `ConfigMap` API resource stores data used for the configuration of
+applications deployed on Kubernetes.
+
+The main focus of this resource is to:
+
+* Provide dynamic distribution of configuration data to deployed applications.
+* Encapsulate configuration information and simplify `Kubernetes` deployments.
+* Create a flexible configuration model for `Kubernetes`.
+
+## Motivation
+
+A `Secret`-like API resource is needed to store configuration data that pods can
+consume.
+
+Goals of this design:
+
+1. Describe a `ConfigMap` API resource.
+2. Describe the semantics of consuming `ConfigMap` as environment variables.
+3. Describe the semantics of consuming `ConfigMap` as files in a volume.
+
+## Use Cases
+
+1. As a user, I want to be able to consume configuration data as environment
+variables.
+2. As a user, I want to be able to consume configuration data as files in a
+volume.
+3. As a user, I want my view of configuration data in files to be eventually
+consistent with changes to the data.
+
+### Consuming `ConfigMap` as Environment Variables
+
+A series of events for consuming `ConfigMap` as environment variables:
+
+1. Create a `ConfigMap` object.
+2. Create a pod to consume the configuration data via environment variables.
+3. The pod is scheduled onto a node.
+4. The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and
+starts the container processes with the appropriate configuration data from
+environment variables.
+
+### Consuming `ConfigMap` in Volumes
+
+A series of events for consuming `ConfigMap` as configuration files in a volume:
+
+1. Create a `ConfigMap` object.
+2. Create a new pod using the `ConfigMap` via a volume plugin.
+3. The pod is scheduled onto a node.
+4. The Kubelet creates an instance of the volume plugin and calls its `Setup()`
+method.
+5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
+and projects the appropriate configuration data into the volume.
+
+### Consuming `ConfigMap` Updates
+
+Any long-running system has configuration that is mutated over time. Changes
+made to configuration data must be made visible to pods consuming data in
+volumes so that they can respond to those changes.
+
+The `resourceVersion` of the `ConfigMap` object will be updated by the API
+server every time the object is modified. After an update, modifications will be
+made visible to the consumer container:
+
+1. Create a `ConfigMap` object.
+2. Create a new pod using the `ConfigMap` via the volume plugin.
+3. The pod is scheduled onto a node.
+4. During the sync loop, the Kubelet creates an instance of the volume plugin
+and calls its `Setup()` method.
+5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
+and projects the appropriate data into the volume.
+6. The `ConfigMap` referenced by the pod is updated.
+7. During the next iteration of the `syncLoop`, the Kubelet creates an instance
+of the volume plugin and calls its `Setup()` method.
+8. The volume plugin projects the updated data into the volume atomically.
+
+It is the consuming pod's responsibility to make use of the updated data once it
+is made visible.
+
+Because environment variables cannot be updated without restarting a container,
+configuration data consumed in environment variables will not be updated.
+
+### Advantages
+
+* Easy to consume in pods; consumer-agnostic
+* Configuration data is persistent and versioned
+* Consumers of configuration data in volumes can respond to changes in the data
+
+## Proposed Design
+
+### API Resource
+
+The `ConfigMap` resource will be added to the main API:
+
+```go
+package api
+
+// ConfigMap holds configuration data for pods to consume.
+type ConfigMap struct {
+ TypeMeta `json:",inline"`
+ ObjectMeta `json:"metadata,omitempty"`
+
+ // Data contains the configuration data. Each key must be a valid
+ // DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN.
+ Data map[string]string `json:"data,omitempty"`
+}
+
+type ConfigMapList struct {
+ TypeMeta `json:",inline"`
+ ListMeta `json:"metadata,omitempty"`
+
+ Items []ConfigMap `json:"items"`
+}
+```
+
+A `Registry` implementation for `ConfigMap` will be added to
+`pkg/registry/configmap`.
+
+### Environment Variables
+
+The `EnvVarSource` will be extended with a new selector for `ConfigMap`:
+
+```go
+package api
+
+// EnvVarSource represents a source for the value of an EnvVar.
+type EnvVarSource struct {
+ // other fields omitted
+
+ // Selects a key of a ConfigMap.
+ ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
+}
+
+// Selects a key from a ConfigMap.
+type ConfigMapKeySelector struct {
+ // The ConfigMap to select from.
+ LocalObjectReference `json:",inline"`
+ // The key to select.
+ Key string `json:"key"`
+}
+```
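+
+For illustration, a simplified sketch of how the Kubelet could resolve a `ConfigMapKeySelector` into an environment variable value (the lookup function and simplified types are hypothetical stand-ins, not the real implementation):
+
+```go
+package envresolve
+
+import "fmt"
+
+// configMapData stands in for the Data field of a fetched ConfigMap.
+type configMapData map[string]string
+
+// resolve returns the value for a (name, key) selector, using lookup
+// as a stand-in for an API-server fetch of the named ConfigMap.
+func resolve(name, key string, lookup func(name string) (configMapData, error)) (string, error) {
+    data, err := lookup(name)
+    if err != nil {
+        return "", err
+    }
+    v, ok := data[key]
+    if !ok {
+        return "", fmt.Errorf("key %q not found in ConfigMap %q", key, name)
+    }
+    return v, nil
+}
+```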
+
+### Volume Source
+
+A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap`
+object will be added to the `VolumeSource` struct in the API:
+
+```go
+package api
+
+type VolumeSource struct {
+ // other fields omitted
+ ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"`
+}
+
+// Represents a volume that holds configuration data.
+type ConfigMapVolumeSource struct {
+ LocalObjectReference `json:",inline"`
+ // A list of keys to project into the volume.
+ // If unspecified, each key-value pair in the Data field of the
+ // referenced ConfigMap will be projected into the volume as a file whose name
+ // is the key and content is the value.
+ // If specified, the listed keys will be projected into the specified paths, and
+ // unlisted keys will not be present.
+ Items []KeyToPath `json:"items,omitempty"`
+}
+
+// Represents a mapping of a key to a relative path.
+type KeyToPath struct {
+ // The name of the key to select
+ Key string `json:"key"`
+
+ // The relative path name of the file to be created.
+ // Must not be absolute or contain the '..' path. Must be utf-8 encoded.
+ // The first item of the relative path must not start with '..'
+ Path string `json:"path"`
+}
+```
+
+**Note:** The update logic used in the downward API volume plug-in will be
+extracted and re-used in the volume plug-in for `ConfigMap`.
+
+### Changes to Secret
+
+We will update the Secret volume plugin to have a similar API to the new
+`ConfigMap` volume plugin. The secret volume plugin will also begin updating
+secret content in the volume when secrets change.
+
+## Examples
+
+#### Consuming `ConfigMap` as Environment Variables
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: etcd-env-config
+data:
+ number-of-members: "1"
+ initial-cluster-state: new
+ initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN
+ discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN
+ discovery-url: http://etcd-discovery:2379
+ etcdctl-peers: http://etcd:2379
+```
+
+This pod consumes the `ConfigMap` as environment variables:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: config-env-example
+spec:
+ containers:
+ - name: etcd
+ image: openshift/etcd-20-centos7
+ ports:
+ - containerPort: 2379
+ protocol: TCP
+ - containerPort: 2380
+ protocol: TCP
+ env:
+ - name: ETCD_NUM_MEMBERS
+ valueFrom:
+ configMapKeyRef:
+ name: etcd-env-config
+ key: number-of-members
+ - name: ETCD_INITIAL_CLUSTER_STATE
+ valueFrom:
+ configMapKeyRef:
+ name: etcd-env-config
+ key: initial-cluster-state
+ - name: ETCD_DISCOVERY_TOKEN
+ valueFrom:
+ configMapKeyRef:
+ name: etcd-env-config
+ key: discovery-token
+ - name: ETCD_DISCOVERY_URL
+ valueFrom:
+ configMapKeyRef:
+ name: etcd-env-config
+ key: discovery-url
+ - name: ETCDCTL_PEERS
+ valueFrom:
+ configMapKeyRef:
+ name: etcd-env-config
+ key: etcdctl-peers
+```
+
+#### Consuming `ConfigMap` as Volumes
+
+`redis-volume-config` is intended to be used as a volume containing a config
+file:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: redis-volume-config
+data:
+ redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n"
+```
+
+The following pod consumes the `redis-volume-config` in a volume:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: config-volume-example
+spec:
+ containers:
+ - name: redis
+ image: kubernetes/redis
+ command: ["redis-server", "/mnt/config-map/etc/redis.conf"]
+ ports:
+ - containerPort: 6379
+ volumeMounts:
+ - name: config-map-volume
+ mountPath: /mnt/config-map
+ volumes:
+ - name: config-map-volume
+ configMap:
+ name: redis-volume-config
+ items:
+ - path: "etc/redis.conf"
+ key: redis.conf
+```
+
+## Future Improvements
+
+In the future, we may add the ability to specify an init-container that can
+watch the volume contents for updates and respond to changes when they occur.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/container-init.md b/contributors/design-proposals/container-init.md
new file mode 100644
index 00000000..6e9dbb4a
--- /dev/null
+++ b/contributors/design-proposals/container-init.md
@@ -0,0 +1,444 @@
+# Pod initialization
+
+@smarterclayton
+
+March 2016
+
+## Proposal and Motivation
+
+Within a pod there is a need to initialize local data or adapt to the current
+cluster environment that is not easily achieved in the current container model.
+Containers start in parallel after volumes are mounted, leaving no opportunity
+for coordination between containers without specialization of the image. If
+two containers need to share common initialization data, both images must
+be altered to cooperate using filesystem or network semantics, which introduces
+coupling between images. Likewise, if an image requires configuration in order
+to start and that configuration is environment dependent, the image must be
+altered to add the necessary templating or retrieval.
+
+This proposal introduces the concept of an **init container**, one or more
+containers started in sequence before the pod's normal containers are started.
+These init containers may share volumes, perform network operations, and perform
+computation prior to the start of the remaining containers. They may also, by
+virtue of their sequencing, block or delay the startup of application containers
+until some precondition is met. In this document we refer to the existing pod
+containers as **app containers**.
+
+This proposal also provides a high level design of **volume containers**, which
+initialize a particular volume, as a feature that specializes some of the tasks
+defined for init containers. The init container design anticipates the existence
+of volume containers and highlights where they will take on future work.
+
+## Design Points
+
+* Init containers should be able to:
+ * Perform initialization of shared volumes
+ * Download binaries that will be used in app containers as execution targets
+ * Inject configuration or extension capability to generic images at startup
+ * Perform complex templating of information available in the local environment
+ * Initialize a database by starting a temporary execution process and applying
+ schema info.
+ * Delay the startup of application containers until preconditions are met
+ * Register the pod with other components of the system
+* Reduce coupling:
+ * Between application images, eliminating the need to customize those images for
+ Kubernetes generally or specific roles
+ * Inside of images, by specializing which containers perform which tasks
+ (install git into init container, use filesystem contents
+ in web container)
+ * Between initialization steps, by supporting multiple sequential init containers
+* Init containers allow simple start preconditions to be implemented that are
+ decoupled from application code
+ * The order init containers start should be predictable and allow users to easily
+ reason about the startup of a container
+ * Complex ordering and failure will not be supported - all complex workflows can
+ if necessary be implemented inside of a single init container, and this proposal
+ aims to enable that ordering without adding undue complexity to the system.
+ Pods in general are not intended to support DAG workflows.
+* Both run-once and run-forever pods should be able to use init containers
+* As much as possible, an init container should behave like an app container
+ to reduce complexity for end users, for clients, and for divergent use cases.
+ An init container is a container with the minimum alterations to accomplish
+ its goal.
+* Volume containers should be able to:
+ * Perform initialization of a single volume
+ * Start in parallel
+ * Perform computation to initialize a volume, and delay start until that
+ volume is initialized successfully.
+ * Using a volume container that does not populate a volume to delay pod start
+ (in the absence of init containers) would be an abuse of the goal of volume
+ containers.
+* Container pre-start hooks are not sufficient for all initialization cases:
+ * They cannot easily coordinate complex conditions across containers
+ * They can only function with code in the image or code in a shared volume,
+ which would have to be statically linked (not a common pattern in wide use)
+ * They cannot be implemented with the current Docker implementation - see
+ [#140](https://github.com/kubernetes/kubernetes/issues/140)
+
+
+
+## Alternatives
+
+* Any mechanism that runs user code on a node before regular pod containers
+ should itself be a container and modeled as such - we explicitly reject
+ creating new mechanisms for running user processes.
+* The container pre-start hook (not yet implemented) requires execution within
+ the container's image and so cannot adapt existing images. It also cannot
+ block startup of containers
+* Running a "pre-pod" would defeat the purpose of the pod being an atomic
+ unit of scheduling.
+
+
+## Design
+
+Each pod may have 0..N init containers defined along with the existing
+1..M app containers.
+
+On startup of the pod, after the network and volumes are initialized, the
+init containers are started in order. Each container must exit successfully
+before the next is invoked. If a container fails to start (due to the runtime)
+or exits with failure, it is retried according to the pod RestartPolicy.
+RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
+pods will retry the failing init container with increasing backoff until it
+succeeds. To align with the design of application containers, init containers
+will only support "infinite retries" (RestartPolicyAlways) or "no retries"
+(RestartPolicyNever).
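+
+A small illustrative sketch of this ordering rule (the status type is a hypothetical stand-in for what the Kubelet tracks): run init containers one at a time, in order, and only start app containers once all of them have succeeded.
+
+```go
+package initorder
+
+// containerStatus is a simplified stand-in for per-container state.
+type containerStatus struct {
+    Name      string
+    Succeeded bool // exited with code 0 at least once
+}
+
+// nextInit returns the index of the init container to run (or retry)
+// next, or done=true once every init container has succeeded and the
+// app containers may start.
+func nextInit(statuses []containerStatus) (index int, done bool) {
+    for i, s := range statuses {
+        if !s.Succeeded {
+            return i, false
+        }
+    }
+    return 0, true
+}
+```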
+
+A pod cannot be ready until all init containers have succeeded. The ports
+on an init container are not aggregated under a service. A pod that is
+being initialized is in the `Pending` phase but should have a distinct
+condition. Each app container and all future init containers should have
+the reason `PodInitializing`. The pod should have a condition `Initializing`
+set to `false` until all init containers have succeeded, and `true` thereafter.
+If the pod is restarted, the `Initializing` condition should be set to `false`.
+
+If the pod is "restarted" all containers stopped and started due to
+a node restart, change to the pod definition, or admin interaction, all
+init containers must execute again. Restartable conditions are defined as:
+
+* An init container image is changed
+* The pod infrastructure container is restarted (shared namespaces are lost)
+* The Kubelet detects that all containers in a pod are terminated AND
+ no record of init container completion is available on disk (due to GC)
+
+Changes to the init container spec are limited to the container image field.
+Altering the container image field is equivalent to restarting the pod.
+
+Because init containers can be restarted, retried, or reexecuted, container
+authors should make their init behavior idempotent by handling volumes that
+are already populated or the possibility that this instance of the pod has
+already contacted a remote system.
+
+Each init container has all of the fields of an app container. The following
+fields are prohibited from being used on init containers by validation:
+
+* `readinessProbe` - init containers must exit for pod startup to continue,
+ are not included in rotation, and so cannot define readiness distinct from
+ completion.
+
+Init container authors may use `activeDeadlineSeconds` on the pod and
+`livenessProbe` on the container to prevent init containers from failing
+forever. The active deadline includes init containers.
+
+Because init containers are semantically different in lifecycle from app
+containers (they are run serially, rather than in parallel), for backwards
+compatibility and design clarity they will be identified as distinct fields
+in the API:
+
+ pod:
+ spec:
+ containers: ...
+ initContainers:
+ - name: init-container1
+ image: ...
+ ...
+ - name: init-container2
+ ...
+ status:
+ containerStatuses: ...
+ initContainerStatuses:
+ - name: init-container1
+ ...
+ - name: init-container2
+ ...
+
+This separation also serves to make the order of container initialization
+clear - init containers are executed in the order that they appear, then all
+app containers are started at once.
+
+The name of each app and init container in a pod must be unique - it is a
+validation error for any container to share a name.
+
+While init containers are in the alpha state, they will be serialized as an annotation
+on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status
+of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`.
+Mutation of these annotations is prohibited on existing pods.
+
+
+### Resources
+
+Given the ordering and execution for init containers, the following rules
+for resource usage apply:
+
+* The highest of any particular resource request or limit defined on all init
+ containers is the **effective init request/limit**
+* The pod's **effective request/limit** for a resource is the higher of:
+ * sum of all app containers request/limit for a resource
+ * effective init request/limit for a resource
+* Scheduling is done based on effective requests/limits, which means
+ init containers can reserve resources for initialization that are not used
+ during the life of the pod.
+* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
+ and the highest QoS tier of both init containers and regular containers is the
+ **effective pod QoS tier**.
+
+So the following pod:
+
+ pod:
+ spec:
+ initContainers:
+ - limits:
+ cpu: 100m
+ memory: 1GiB
+ - limits:
+ cpu: 50m
+ memory: 2GiB
+ containers:
+ - limits:
+ cpu: 10m
+ memory: 1100MiB
+ - limits:
+ cpu: 10m
+ memory: 1100MiB
+
+has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (the highest init
+container cpu limit is larger than the sum of the app containers' cpu limits, and the sum
+of the app containers' memory limits is larger than the highest init container memory limit). The scheduler, node,
+and quota must respect the effective pod request/limit.
+
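+A short sketch of the effective request/limit computation described above (not the scheduler's actual code; plain integers are used in place of resource quantities):
+
+```go
+package effective
+
+// effective returns max(max(init...), sum(apps...)) for one resource
+// dimension (e.g. CPU millicores or memory bytes), which is the
+// effective pod request/limit for that resource.
+func effective(initValues, appValues []int64) int64 {
+    var maxInit, sumApps int64
+    for _, v := range initValues {
+        if v > maxInit {
+            maxInit = v
+        }
+    }
+    for _, v := range appValues {
+        sumApps += v
+    }
+    if maxInit > sumApps {
+        return maxInit
+    }
+    return sumApps
+}
+```
+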
+In the absence of a defined request or limit on a container, the effective
+request/limit will be applied. For example, the following pod:
+
+ pod:
+ spec:
+ initContainers:
+ - limits:
+ cpu: 100m
+ memory: 1GiB
+ containers:
+ - request:
+ cpu: 10m
+ memory: 1100MiB
+
+will have an effective request of `10m / 1100MiB`, and an effective limit
+of `100m / 1GiB`, i.e.:
+
+ pod:
+ spec:
+ initContainers:
+ - request:
+ cpu: 10m
+ memory: 1GiB
+ - limits:
+ cpu: 100m
+ memory: 1100MiB
+ containers:
+ - request:
+ cpu: 10m
+ memory: 1GiB
+ - limits:
+ cpu: 100m
+ memory: 1100MiB
+
+and thus have the QoS tier **Burstable** (because request is not equal to
+limit).
+
+Quota and limits will be applied based on the effective pod request and
+limit.
+
+Pod-level cgroups will be based on the effective pod request and limit, the
+same as the scheduler.
+
+
+### Kubelet and container runtime details
+
+Container runtimes should treat the set of init and app containers as one
+large pool. An individual init container execution should be identical to
+an app container, including all standard container environment setup
+(network, namespaces, hostnames, DNS, etc).
+
+All app container operations are permitted on init containers. The
+logs for an init container should be available for the duration of the pod
+lifetime or until the pod is restarted.
+
+During initialization, app container status should be shown with the reason
+PodInitializing if any init containers are present. Each init container
+should show appropriate container status, and all init containers that are
+waiting for earlier init containers to finish should have the `reason`
+PendingInitialization.
+
+The container runtime should aggressively prune failed init containers.
+The container runtime should record whether all init containers have
+succeeded internally, and only invoke new init containers if a pod
+restart is needed (for Docker, if all containers terminate or if the pod
+infra container terminates). Init containers should follow backoff rules
+as necessary. The Kubelet *must* preserve at least the most recent instance
+of an init container to serve logs and data for end users and to track
+failure states. The Kubelet *should* prefer to garbage collect completed
+init containers over app containers, as long as the Kubelet is able to
+track that initialization has been completed. In the future, container
+state checkpointing in the Kubelet may remove or reduce the need to
+preserve old init containers.
+
+For the initial implementation, the Kubelet will use the last termination
+container state of the highest indexed init container to determine whether
+the pod has completed initialization. During a pod restart, initialization
+will be restarted from the beginning (all initializers will be rerun).
+
+
+### API Behavior
+
+All APIs that access containers by name should operate on both init and
+app containers. Because names are unique the addition of the init container
+should be transparent to use cases.
+
+A client with no knowledge of init containers should see appropriate
+container status `reason` and `message` fields while the pod is in the
+`Pending` phase, and so be able to communicate that to end users.
+
+
+### Example init containers
+
+* Wait for a service to be created
+
+ pod:
+ spec:
+ initContainers:
+ - name: wait
+ image: centos:centos7
+ command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; exit 1"]
+ containers:
+ - name: run
+ image: application-image
+ command: ["/my_application_that_depends_on_myservice"]
+
+* Register this pod with a remote server
+
+ pod:
+ spec:
+ initContainers:
+ - name: register
+ image: centos:centos7
+ command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
+ env:
+ - name: POD_NAME
+ valueFrom:
+ field: metadata.name
+ - name: POD_IP
+ valueFrom:
+ field: status.podIP
+ containers:
+ - name: run
+ image: application-image
+ command: ["/my_application_that_depends_on_myservice"]
+
+* Wait for an arbitrary period of time
+
+ pod:
+ spec:
+ initContainers:
+ - name: wait
+ image: centos:centos7
+ command: ["/bin/sh", "-c", "sleep 60"]
+ containers:
+ - name: run
+ image: application-image
+ command: ["/static_binary_without_sleep"]
+
+* Clone a git repository into a volume (can be implemented by volume containers in the future):
+
+ pod:
+ spec:
+ initContainers:
+ - name: download
+ image: image-with-git
+ command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
+ volumeMounts:
+ - mountPath: /var/lib/data
+ volumeName: git
+ containers:
+ - name: run
+ image: centos:centos7
+ command: ["/var/lib/data/binary"]
+ volumeMounts:
+ - mountPath: /var/lib/data
+ volumeName: git
+ volumes:
+ - emptyDir: {}
+ name: git
+
+* Execute a template transformation based on environment (can be implemented by volume containers in the future):
+
+ pod:
+ spec:
+ initContainers:
+ - name: copy
+ image: application-image
+ command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
+ volumeMounts:
+ - mountPath: /var/lib/data
+ volumeName: data
+ - name: transform
+ image: image-with-jinja
+ command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
+ volumeMounts:
+ - mountPath: /var/lib/data
+ volumeName: data
+ containers:
+ - name: run
+ image: application-image
+ command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
+ volumeMounts:
+ - mountPath: /var/lib/data
+ volumeName: data
+ volumes:
+ - emptyDir: {}
+ name: data
+
+* Perform a container build
+
+ pod:
+ spec:
+ initContainers:
+ - name: copy
+ image: base-image
+ workingDir: /home/user/source-tree
+ command: ["make"]
+ containers:
+ - name: commit
+ image: image-with-docker
+ command:
+ - /bin/sh
+ - -c
+ - docker commit $(complex_bash_to_get_container_id_of_copy) &&
+ docker push $(commit_id) myrepo:latest
+ volumeMounts:
+ - mountPath: /var/run/docker.sock
+ volumeName: dockersocket
+
+## Backwards compatibility implications
+
+Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
+be able to rely on Kubelets implementing init containers. The management of feature skew between
+master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).
+
+
+## Future work
+
+* Unify pod QoS class with init containers
+* Implement container / image volumes to make composition of runtime from images efficient
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-init.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/container-runtime-interface-v1.md b/contributors/design-proposals/container-runtime-interface-v1.md
new file mode 100644
index 00000000..36592727
--- /dev/null
+++ b/contributors/design-proposals/container-runtime-interface-v1.md
@@ -0,0 +1,267 @@
+# Redefine Container Runtime Interface
+
+The umbrella issue: [#22964](https://issues.k8s.io/22964)
+
+## Motivation
+
+Kubelet employs a declarative pod-level interface, which acts as the sole
+integration point for container runtimes (e.g., `docker` and `rkt`). The
+high-level, declarative interface has led to higher integration and maintenance
+costs, and has also slowed down feature velocity, for the following reasons.
+ 1. **Not every container runtime supports the concept of pods natively**.
+ When integrating with Kubernetes, a significant amount of work needs to
+ go into implementing a sizable shim to support all pod
+ features. This also adds maintenance overhead (e.g., `docker`).
+ 2. **High-level interface discourages code sharing and reuse among runtimes**.
+ E.g, each runtime today implements an all-encompassing `SyncPod()`
+ function, with the Pod Spec as the input argument. The runtime implements
+ logic to determine how to achieve the desired state based on the current
+ status, (re-)starts pods/containers and manages lifecycle hooks
+ accordingly.
+ 3. **Pod Spec is evolving rapidly**. New features are being added constantly.
+ Any pod-level change or addition requires changing of all container
+ runtime shims. E.g., init containers and volume containers.
+
+## Goals and Non-Goals
+
+The goals of defining the interface are to
+ - **improve extensibility**: Easier container runtime integration.
+ - **improve feature velocity**
+ - **improve code maintainability**
+
+The non-goals include
+ - proposing *how* to integrate with new runtimes, i.e., where the shim
+ resides. The discussion of adopting a client-server architecture is tracked
+ by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of
+ such an architecture are discussed.
+ - versioning the new interface/API. We intend to provide API versioning to
+ offer stability for runtime integrations, but the details are beyond the
+ scope of this proposal.
+ - adding support to Windows containers. Windows container support is a
+ parallel effort and is tracked by [#22623](https://issues.k8s.io/22623).
+ The new interface will not be augmented to support Windows containers, but
+ it will be made extensible such that the support can be added in the future.
+ - re-defining Kubelet's internal interfaces. These interfaces, though they may
+ affect Kubelet's maintainability, are not relevant to runtime integration.
+ - improving Kubelet's efficiency or performance, e.g., adopting event stream
+ from the container runtime [#8756](https://issues.k8s.io/8756),
+ [#16831](https://issues.k8s.io/16831).
+
+## Requirements
+
+ * Support the already integrated container runtimes: `docker` and `rkt`
+ * Support hypervisor-based container runtimes: `hyper`.
+
+The existing pod-level interface will remain as it is in the near future to
+ensure that support for all existing runtimes continues. Meanwhile, we will
+work with all parties involved to switch to the proposed interface.
+
+
+## Container Runtime Interface
+
+The main idea of this proposal is to adopt an imperative container-level
+interface, which allows Kubelet to directly control the lifecycles of the
+containers.
+
+A pod is composed of a group of containers in an isolated environment with
+resource constraints. In Kubernetes, a pod is also the smallest schedulable unit.
+After a pod has been scheduled to the node, Kubelet will create the environment
+for the pod, and add/update/remove containers in that environment to meet the
+Pod Spec. To distinguish between the environment and the pod as a whole, we
+will call the pod environment **PodSandbox.**
+
+The container runtimes may interpret the PodSandbox concept differently based
+on how they operate internally. For runtimes relying on a hypervisor, a sandbox
+naturally represents a virtual machine. For others, it can be a set of Linux namespaces.
+
+In short, a PodSandbox should have the following features.
+
+ * **Isolation**: E.g., Linux namespaces or a full virtual machine, or even
+ support additional security features.
+ * **Compute resource specifications**: A PodSandbox should implement pod-level
+ resource demands and restrictions.
+
+*NOTE: The resource specification does not include externalized costs to
+container setup that are not currently trackable as Pod constraints, e.g.,
+filesystem setup, container image pulling, etc.*
+
+A container in a PodSandbox maps to an application in the Pod Spec. For Linux
+containers, they are expected to share at least network and IPC namespaces,
+with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
+
+
+Below is an example of the proposed interfaces.
+
+```go
+// PodSandboxManager contains basic operations for sandbox.
+type PodSandboxManager interface {
+ Create(config *PodSandboxConfig) (string, error)
+ Delete(id string) (string, error)
+ List(filter PodSandboxFilter) []PodSandboxListItem
+ Status(id string) PodSandboxStatus
+}
+
+// ContainerRuntime contains basic operations for containers.
+type ContainerRuntime interface {
+ Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
+ Start(id string) error
+ Stop(id string, timeout int) error
+ Remove(id string) error
+ List(filter ContainerFilter) ([]ContainerListItem, error)
+ Status(id string) (ContainerStatus, error)
+ Exec(id string, cmd []string, streamOpts StreamOptions) error
+}
+
+// ImageService contains image-related operations.
+type ImageService interface {
+ List() ([]Image, error)
+ Pull(image ImageSpec, auth AuthConfig) error
+ Remove(image ImageSpec) error
+ Status(image ImageSpec) (Image, error)
+ Metrics(image ImageSpec) (ImageMetrics, error)
+}
+
+type ContainerMetricsGetter interface {
+ ContainerMetrics(id string) (ContainerMetrics, error)
+}
+```
+
+All functions listed above are expected to be thread-safe.
+
+### Pod/Container Lifecycle
+
+The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox
+is created before any containers, and can exist after all containers in it have
+terminated.
+
+Assume there is a pod with a single container C. To start a pod:
+
+```
+ create sandbox Foo --> create container C --> start container C
+```
+
+To delete a pod:
+
+```
+ stop container C --> remove container C --> delete sandbox Foo
+```
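+
+As an illustration only (assuming the interface and config types proposed above), the Kubelet-side driver for these two flows could look roughly like:
+
+```go
+// startPod creates the sandbox, then creates and starts container C.
+func startPod(sm PodSandboxManager, rt ContainerRuntime,
+    sandboxCfg *PodSandboxConfig, containerCfg *ContainerConfig) (sandboxID, containerID string, err error) {
+    if sandboxID, err = sm.Create(sandboxCfg); err != nil {
+        return "", "", err
+    }
+    if containerID, err = rt.Create(containerCfg, sandboxCfg, sandboxID); err != nil {
+        return "", "", err
+    }
+    return sandboxID, containerID, rt.Start(containerID)
+}
+
+// deletePod stops and removes container C, then deletes the sandbox.
+func deletePod(sm PodSandboxManager, rt ContainerRuntime,
+    sandboxID, containerID string, timeout int) error {
+    if err := rt.Stop(containerID, timeout); err != nil {
+        return err
+    }
+    if err := rt.Remove(containerID); err != nil {
+        return err
+    }
+    _, err := sm.Delete(sandboxID)
+    return err
+}
+```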
+
+The container runtime must not apply any transition (such as starting a new
+container) unless explicitly instructed by Kubelet. It is Kubelet's
+responsibility to enforce garbage collection, restart policy, and otherwise
+react to changes in lifecycle.
+
+The only transitions that are possible for a container are described below:
+
+```
+() -> Created // A container can only transition to created from the
+ // empty, nonexistent state. The ContainerRuntime.Create
+ // method causes this transition.
+Created -> Running // The ContainerRuntime.Start method may be applied to a
+ // Created container to move it to Running
+Running -> Exited // The ContainerRuntime.Stop method may be applied to a running
+                  // container to move it to Exited.
+                  // A container may also make this transition of its own volition.
+Exited -> () // An exited container can be moved to the terminal empty
+ // state via a ContainerRuntime.Remove call.
+```
+
+
+Kubelet is also responsible for gracefully terminating all the containers
+in the sandbox before deleting the sandbox. If Kubelet chooses to delete
+the sandbox with running containers in it, those containers should be forcibly
+deleted.
+
+Note that every PodSandbox/container lifecycle operation (create, start,
+stop, delete) should either return an error or block until the operation
+succeeds. A successful operation should include a state transition of the
+PodSandbox/container. E.g., if a `Create` call for a container does not
+return an error, the container state should be "created" when the runtime is
+queried.
+
+### Updates to PodSandbox or Containers
+
+Kubernetes supports updates only to a very limited set of fields in the Pod
+Spec. These updates may require containers to be re-created by Kubelet. This
+can be achieved through the proposed, imperative container-level interface.
+On the other hand, a PodSandbox update is currently not required.
+
+
+### Container Lifecycle Hooks
+
+Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing
+discussion for supporting pre-start and post-stop hooks in
+[#140](https://issues.k8s.io/140).
+
+These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the
+container runtime. This frees the runtimes from having to support hooks
+natively.
+
+Illustration of the container lifecycle and hooks:
+
+```
+ pre-start post-start pre-stop post-stop
+ | | | |
+ exec exec exec exec
+ | | | |
+ create --------> start ----------------> stop --------> remove
+```
+
+In order for the lifecycle hooks to function as expected, the `Exec` call
+will need access to the container's filesystem (e.g., mount namespaces).
+
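+As a rough illustration (not part of the proposed interface), a post-start
+hook driven through `Exec` could look like the sketch below; `StreamOptions{}`
+is assumed to be an acceptable zero value (no streams attached).
+
+```go
+// startWithPostStartHook starts a created container and then runs its
+// post-start hook command inside it via Exec, so the runtime needs no
+// native hook support.
+func startWithPostStartHook(runtime ContainerRuntime, containerID string, hookCmd []string) error {
+    if err := runtime.Start(containerID); err != nil {
+        return err
+    }
+    if len(hookCmd) == 0 {
+        return nil
+    }
+    // The hook command runs inside the container, so it sees the container's
+    // own filesystem and namespaces.
+    if err := runtime.Exec(containerID, hookCmd, StreamOptions{}); err != nil {
+        // Kubelet is then responsible for reacting to the failure (e.g.,
+        // stopping the container and applying the restart policy), per the
+        // lifecycle responsibilities described earlier.
+        return err
+    }
+    return nil
+}
+```
+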
+### Extensibility
+
+There are several dimensions for container runtime extensibility.
+ - Host OS (e.g., Linux)
+ - PodSandbox isolation mechanism (e.g., namespaces or VM)
+ - PodSandbox OS (e.g., Linux)
+
+As mentioned previously, this proposal will only address the Linux based
+PodSandbox and containers. All Linux-specific configuration will be grouped
+into one field. A container runtime is required to enforce all configuration
+applicable to its platform, and should return an error otherwise.
+
+### Keep it minimal
+
+The proposed interface is experimental, i.e., it will go through (many) changes
+until it stabilizes. The principle is to keep the interface minimal and
+extend it later if needed. This includes several features that are still under
+discussion and may be achieved by other means:
+
+ * `AttachContainer`: [#23335](https://issues.k8s.io/23335)
+ * `PortForward`: [#25113](https://issues.k8s.io/25113)
+
+## Alternatives
+
+**[Status quo] Declarative pod-level interface**
+ - Pros: No changes needed.
+ - Cons: All the issues stated in #motivation
+
+**Allow integration at both pod- and container-level interfaces**
+ - Pros: Flexibility.
+ - Cons: All the issues stated in #motivation
+
+**Imperative pod-level interface**
+The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod().
+This implies that the runtime needs to take over container lifecycle
+management (i.e., enforce restart policy), lifecycle hooks, liveness checks,
+etc. Kubelet will mainly be responsible for interfacing with the apiserver, and
+can potentially become a very thin daemon.
+ - Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker`
+ shim maintenance cost is discounted.
+ - Cons: This will incur higher integration cost because every new container
+ runtime needs to implement all the features and need to understand the
+ concept of pods. This would also lead to lower feature velocity because the
+ interface will need to be changed, and the new pod-level feature will need
+ to be supported in each runtime.
+
+## Related Issues
+
+ * Metrics: [#27097](https://issues.k8s.io/27097)
+ * Log management: [#24677](https://issues.k8s.io/24677)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-runtime-interface-v1.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/control-plane-resilience.md b/contributors/design-proposals/control-plane-resilience.md
new file mode 100644
index 00000000..8193fd97
--- /dev/null
+++ b/contributors/design-proposals/control-plane-resilience.md
@@ -0,0 +1,241 @@
+# Kubernetes and Cluster Federation Control Plane Resilience
+
+## Long Term Design and Current Status
+
+### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
+
+### December 14, 2015
+
+## Summary
+
+Some confusion exists around how we currently ensure, and in future want to
+ensure, resilience of the Kubernetes (and by implication Kubernetes Cluster
+Federation) control plane. This document is an attempt to capture that
+definitively. It covers areas including self-healing, high availability,
+bootstrapping and recovery. Most of the information in this document already
+exists in the form of GitHub comments, PRs/proposals, scattered documents, and
+corridor conversations, so this document is primarily a consolidation and
+clarification of existing ideas.
+
+## Terms
+
+* **Self-healing:** automatically restarting or replacing failed
+ processes and machines without human intervention
+* **High availability:** continuing to be available and work correctly
+ even if some components are down or uncontactable. This typically
+ involves multiple replicas of critical services, and a reliable way
+ to find available replicas. Note that it's possible (but not
+ desirable) to have high
+ availability properties (e.g. multiple replicas) in the absence of
+ self-healing properties (e.g. if a replica fails, nothing replaces
+ it). Fairly obviously, given enough time, such systems typically
+ become unavailable (after enough replicas have failed).
+* **Bootstrapping**: creating an empty cluster from nothing
+* **Recovery**: recreating a non-empty cluster after perhaps
+ catastrophic failure/unavailability/data corruption
+
+## Overall Goals
+
+1. **Resilience to single failures:** Kubernetes clusters constrained
+ to single availability zones should be resilient to individual
+ machine and process failures by being both self-healing and highly
+ available (within the context of such individual failures).
+1. **Ubiquitous resilience by default:** The default cluster creation
+ scripts for (at least) GCE, AWS and basic bare metal should adhere
+ to the above (self-healing and high availability) by default (with
+ options available to disable these features to reduce control plane
+ resource requirements if so required). It is hoped that other
+ cloud providers will also follow the above guidelines, but the
+ above 3 are the primary canonical use cases.
+1. **Resilience to some correlated failures:** Kubernetes clusters
+ which span multiple availability zones in a region should by
+ default be resilient to complete failure of one entire availability
+ zone (by similarly providing self-healing and high availability in
+ the default cluster creation scripts as above).
+1. **Default implementation shared across cloud providers:** The
+ differences between the default implementations of the above for
+ GCE, AWS and basic bare metal should be minimized. This implies
+ using shared libraries across these providers in the default
+ scripts in preference to highly customized implementations per
+ cloud provider. This is not to say that highly differentiated,
+ customized per-cloud cluster creation processes (e.g. for GKE on
+ GCE, or some hosted Kubernetes provider on AWS) are discouraged.
+ But those fall squarely outside the basic cross-platform OSS
+ Kubernetes distro.
+1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
+ for achieving system resilience (replication controllers, health
+ checking, service load balancing etc) should be used in preference
+ to building a separate set of mechanisms to achieve the same thing.
+ This implies that self hosting (the kubernetes control plane on
+ kubernetes) is strongly preferred, with the caveat below.
+1. **Recovery from catastrophic failure:** The ability to quickly and
+ reliably recover a cluster from catastrophic failure is critical,
+ and should not be compromised by the above goal to self-host
+ (i.e. it goes without saying that the cluster should be quickly and
+ reliably recoverable, even if the cluster control plane is
+ broken). This implies that such catastrophic failure scenarios
+ should be carefully thought out, and the subject of regular
+ continuous integration testing, and disaster recovery exercises.
+
+## Relative Priorities
+
+1. **(Possibly manual) recovery from catastrophic failures:** having a
+Kubernetes cluster, and all applications running inside it, disappear forever
+is perhaps the worst possible failure mode. So it is critical that we be able to
+recover the applications running inside a cluster from such failures in some
+well-bounded time period.
+ 1. In theory a cluster can be recovered by replaying all API calls
+ that have ever been executed against it, in order, but most
+ often that state has been lost, and/or is scattered across
+ multiple client applications or groups. So in general it is
+ probably infeasible.
+ 1. In theory a cluster can also be recovered to some relatively
+ recent non-corrupt backup/snapshot of the disk(s) backing the
+ etcd cluster state. But we have no default consistent
+ backup/snapshot, verification or restoration process. And we
+ don't routinely test restoration, so even if we did routinely
+ perform and verify backups, we have no hard evidence that we
+       can in practice effectively recover from catastrophic cluster
+ failure or data corruption by restoring from these backups. So
+ there's more work to be done here.
+1. **Self-healing:** Most major cloud providers provide the ability to
+ easily and automatically replace failed virtual machines within a
+ small number of minutes (e.g. GCE
+ [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
+ and Managed Instance Groups,
+ AWS[ Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
+ and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
+ can fairly trivially be used to reduce control-plane down-time due
+ to machine failure to a small number of minutes per failure
+ (i.e. typically around "3 nines" availability), provided that:
+    1. cluster persistent state (i.e. etcd disks) is either:
+        1. truly persistent (i.e. remote persistent disks), or
+        1. reconstructible (e.g. using etcd [dynamic member
+           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
+           or [backup and
+           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
+    1. and boot disks are either:
+        1. truly persistent (i.e. remote persistent disks), or
+        1. reconstructible (e.g. using boot-from-snapshot,
+           boot-from-pre-configured-image or
+           boot-from-auto-initializing image).
+1. **High Availability:** This has the potential to increase
+ availability above the approximately "3 nines" level provided by
+ automated self-healing, but it's somewhat more complex, and
+ requires additional resources (e.g. redundant API servers and etcd
+ quorum members). In environments where cloud-assisted automatic
+ self-healing might be infeasible (e.g. on-premise bare-metal
+ deployments), it also gives cluster administrators more time to
+ respond (e.g. replace/repair failed machines) without incurring
+ system downtime.
+
+## Design and Status (as of December 2015)
+
+<table>
+<tr>
+<td><b>Control Plane Component</b></td>
+<td><b>Resilience Plan</b></td>
+<td><b>Current Status</b></td>
+</tr>
+<tr>
+<td><b>API Server</b></td>
+<td>
+
+Multiple stateless, self-hosted, self-healing API servers behind a HA
+load balancer, built out by the default "kube-up" automation on GCE,
+AWS and basic bare metal (BBM). Note that the single-host approach of
+having etcd listen only on localhost, to ensure that only the API server
+can connect to it, will no longer work, so alternative security will be
+needed in this regard (either using firewall rules, SSL certs, or
+something else). All necessary flags are currently supported to enable
+SSL between API server and etcd (OpenShift runs like this out of the
+box), but this needs to be woven into the "kube-up" and related
+scripts. Detailed design of self-hosting and related bootstrapping
+and catastrophic failure recovery will be detailed in a separate
+design doc.
+
+</td>
+<td>
+
+No scripted self-healing or HA on GCE, AWS or basic bare metal
+currently exists in the OSS distro. To be clear, "no self healing"
+means that even if multiple e.g. API servers are provisioned for HA
+purposes, if they fail, nothing replaces them, so eventually the
+system will fail. Self-healing and HA can be set up
+manually by following documented instructions, but this is not
+currently an automated process, and it is not tested as part of
+continuous integration. So it's probably safest to assume that it
+doesn't actually work in practice.
+
+</td>
+</tr>
+<tr>
+<td><b>Controller manager and scheduler</b></td>
+<td>
+
+Multiple self-hosted, self healing warm standby stateless controller
+managers and schedulers with leader election and automatic failover of API
+server clients, automatically installed by default "kube-up" automation.
+
+</td>
+<td>As above.</td>
+</tr>
+<tr>
+<td><b>etcd</b></td>
+<td>
+
+Multiple (3-5) etcd quorum members behind a load balancer with session
+affinity (to prevent clients from being bounced from one to another).
+
+Regarding self-healing, if a node running etcd goes down, it is always necessary
+to do three things:
+<ol>
+<li>allocate a new node (not necessary if running etcd as a pod, in
+which case specific measures are required to prevent user pods from
+interfering with system pods, for example using node selectors),
+<li>start an etcd replica on that new node, and
+<li>have the new replica recover the etcd state.
+</ol>
+In the case of local disk (which fails in concert with the machine), the etcd
+state must be recovered from the other replicas. This is called
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">
+dynamic member addition</A>.
+
+In the case of remote persistent disk, the etcd state can be recovered by
+attaching the remote persistent disk to the replacement node, thus the state is
+recoverable even if all other replicas are down.
+
+There are also significant performance differences between local disks and remote
+persistent disks. For example, the
+<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">
+sustained throughput of local disks in GCE is approximately 20x that of remote
+disks</A>.
+
+Hence we suggest that self-healing be provided by remotely mounted persistent
+disks in non-performance critical, single-zone cloud deployments. For
+performance critical installations, faster local SSDs should be used, in which
+case remounting on node failure is not an option, so
+<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
+etcd runtime configuration</A> should be used to replace the failed machine.
+Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
+automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
+runtime configuration</A> is required. Similarly, basic bare metal deployments
+cannot generally rely on remote persistent disks, so the same approach applies
+there.
+</td>
+<td>
+<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
+Somewhat vague instructions exist</A> on how to set some of this up manually in
+a self-hosted configuration. But automatic bootstrapping and self-healing are not
+described (and are not implemented for the non-PD cases). This all still needs to
+be automated and continuously tested.
+</td>
+</tr>
+</table>
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/controller-ref.md b/contributors/design-proposals/controller-ref.md
new file mode 100644
index 00000000..09dfd684
--- /dev/null
+++ b/contributors/design-proposals/controller-ref.md
@@ -0,0 +1,102 @@
+# ControllerRef proposal
+
+Author: gmarek@
+Last edit: 2016-05-11
+Status: raw
+
+Approvers:
+- [ ] briangrant
+- [ ] dbsmith
+
+**Table of Contents**
+
+- [Goal of ControllerReference](#goal-of-controllerreference)
+- [Non goals](#non-goals)
+- [API and semantic changes](#api-and-semantic-changes)
+- [Upgrade/downgrade procedure](#upgradedowngrade-procedure)
+- [Orphaning/adoption](#orphaningadoption)
+- [Implementation plan (sketch)](#implementation-plan-sketch)
+- [Considered alternatives](#considered-alternatives)
+
+# Goal of ControllerReference
+
+The main goal of the `ControllerReference` effort is to solve the problem of overlapping controllers that fight over some resources (e.g. `ReplicaSets` fighting with `ReplicationControllers` over `Pods`), which causes serious [problems](https://github.com/kubernetes/kubernetes/issues/24433) such as exploding memory usage in the Controller Manager.
+
+We don’t want to have (just) an in-memory solution, as we don’t want a Controller Manager crash to cause massive changes in object ownership in the system. I.e. we need to persist the information about "owning controller".
+
+A secondary goal of this effort is to improve the performance of various controllers and schedulers by removing the need for an expensive lookup of all matching "controllers".
+
+# Non goals
+
+Cascading deletion is not a goal of this effort. Cascading deletion will use `ownerReferences`, which is a [separate effort](garbage-collection.md).
+
+`ControllerRef` will extend `OwnerReference` and reuse machinery written for it (GarbageCollector, adoption/orphaning logic).
+
+# API and semantic changes
+
+There will be a new API field in `OwnerReference` in which we will store information about whether the given owner is the managing controller:
+
+```
+OwnerReference {
+ …
+ Controller bool
+ …
+}
+```
+
+From now on by `ControllerRef` we mean an `OwnerReference` with `Controller=true`.
+
+Most controllers (all that manage collections of things defined by a label selector) will have slightly changed semantics: currently a controller owns an object if its selector matches the object’s labels and it doesn't notice an older controller of the same kind that also matches the object's labels; after the introduction of `ControllerReference`, a controller will own an object iff the selector matches the labels and an `OwnerReference` with `Controller=true` points to the controller.
+
+If the owner's selector or the owned object's labels change, the owning controller will be responsible for orphaning the objects (clearing the `Controller` field in the `OwnerReference` and/or deleting the `OwnerReference` altogether), after which an adoption procedure (setting the `Controller` field in one of the `OwnerReferences` and/or adding new `OwnerReferences`) might occur if another controller has a matching selector.
+
+For debugging purposes we want to add an `adoptionTime` annotation prefixed with `kubernetes.io/` which will keep the time of last controller ownership transfer.
+
+# Upgrade/downgrade procedure
+
+Because `ControllerRef` will be a part of `OwnerReference` effort it will have the same upgrade/downgrade procedures.
+
+# Orphaning/adoption
+
+Because `ControllerRef` will be a part of `OwnerReference` effort it will have the same orphaning/adoption procedures.
+
+Controllers will orphan objects they own in two cases:
+* Change of label/selector causing selector to stop matching labels (executed by the controller)
+* Deletion of a controller with `Orphaning=true` (executed by the GarbageCollector)
+
+We will need a secondary orphaning mechanism in case of unclean controller deletion:
+* GarbageCollector will remove from objects any `ControllerRef` that no longer points to an existing controller
+
+A controller will adopt an object whose labels match its selector (by setting the `Controller` field in the `OwnerReference` that points to the controller) iff:
+* there is no `OwnerReference` with `Controller` set to true in the `OwnerReferences` array,
+* the object's `DeletionTimestamp` is not set, and
+* the controller is the first, among all controllers with a matching label selector and no `DeletionTimestamp` set, to successfully adopt the object.
+
+By design there are possible races during adoption if multiple controllers can own a given object.
+
+To prevent re-adoption of an object during deletion the `DeletionTimestamp` will be set when deletion is starting. When a controller has a non-nil `DeletionTimestamp` it won’t take any actions except updating its `Status` (in particular it won’t adopt any objects).
+
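+A minimal sketch of the ownership and adoption checks described above; the
+helper names (`getControllerOf`, `canAdopt`) and the boolean parameters are
+illustrative only, not part of the proposal.
+
+```go
+// getControllerOf returns the OwnerReference with Controller=true, if any.
+func getControllerOf(refs []OwnerReference) *OwnerReference {
+    for i := range refs {
+        if refs[i].Controller {
+            return &refs[i]
+        }
+    }
+    return nil
+}
+
+// canAdopt sketches the preconditions for adoption: the selector must match,
+// neither the object nor the would-be adopter may be in the process of being
+// deleted, and the object must not already have a ControllerRef.
+func canAdopt(objRefs []OwnerReference, selectorMatches, objDeleting, controllerDeleting bool) bool {
+    if !selectorMatches || objDeleting || controllerDeleting {
+        return false
+    }
+    return getControllerOf(objRefs) == nil
+}
+```
+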
+# Implementation plan (sketch)
+
+* Add API field for `Controller`,
+* Extend `OwnerReference` adoption procedure to set a `Controller` field in one of the owners,
+* Update all affected controllers to respect `ControllerRef`.
+
+Necessary related work:
+* `OwnerReferences` are correctly added/deleted,
+* GarbageCollector removes dangling references,
+* Controllers don't take any meaningful actions when `DeletionTimestamps` is set.
+
+# Considered alternatives
+
+* Generic "ReferenceController": centralized component that managed adoption/orphaning
+ * Dropped because: hard to write something that will work for all imaginable 3rd party objects, adding hooks to framework makes it possible for users to write their own logic
+* Separate API field for `ControllerRef` in the ObjectMeta.
+ * Dropped because: nontrivial relationship between `ControllerRef` and `OwnerReferences` when it comes to deletion/adoption.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/controller-ref.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/daemon.md b/contributors/design-proposals/daemon.md
new file mode 100644
index 00000000..2c306056
--- /dev/null
+++ b/contributors/design-proposals/daemon.md
@@ -0,0 +1,206 @@
+# DaemonSet in Kubernetes
+
+**Author**: Ananya Kumar (@AnanyaKumar)
+
+**Status**: Implemented.
+
+This document presents the design of the Kubernetes DaemonSet, describes use
+cases, and gives an overview of the code.
+
+## Motivation
+
+Many users have requested a way to run a daemon on every node in a
+Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential
+for use cases such as building a sharded datastore, or running a logger on every
+node. In comes the DaemonSet, a way to conveniently create and manage
+daemon-like workloads in Kubernetes.
+
+## Use Cases
+
+The DaemonSet can be used for user-specified system services, cluster-level
+applications with strong node ties, and Kubernetes node services. Below are
+example use cases in each category.
+
+### User-Specified System Services:
+
+Logging: Some users want a way to collect statistics about nodes in a cluster
+and send those logs to an external database. For example, system administrators
+might want to know if their machines are performing as expected, if they need to
+add more machines to the cluster, or if they should switch cloud providers. The
+DaemonSet can be used to run a data collection service (for example fluentd) on
+every node and send the data to a service like ElasticSearch for analysis.
+
+### Cluster-Level Applications
+
+Datastore: Users might want to implement a sharded datastore in their cluster. A
+few nodes in the cluster, labeled ‘app=datastore’, might be responsible for
+storing data shards, and pods running on these nodes might serve data. This
+architecture requires a way to bind pods to specific nodes, so it cannot be
+achieved using a Replication Controller. A DaemonSet is a convenient way to
+implement such a datastore.
+
+For other uses, see the related [feature request](https://issues.k8s.io/1518)
+
+## Functionality
+
+The DaemonSet supports standard API features:
+ - create
+ - The spec for DaemonSets has a pod template field.
+ - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate
+over nodes that have a certain label. For example, suppose that in a cluster
+some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a
+datastore pod on exactly those nodes labeled ‘app=database’.
+ - Using the pod's nodeName field, DaemonSets can be restricted to operate on a
+specified node.
+ - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec
+used by the Replication Controller.
+ - The initial implementation will not guarantee that DaemonSet pods are
+created on nodes before other pods.
+ - The initial implementation of DaemonSet does not guarantee that DaemonSet
+pods show up on nodes (for example because of resource limitations of the node),
+but makes a best effort to launch DaemonSet pods (like Replication Controllers
+do with pods). Subsequent revisions might ensure that DaemonSet pods show up on
+nodes, preempting other pods if necessary.
+ - The DaemonSet controller adds an annotation:
+```"kubernetes.io/created-by: \<json API object reference\>"```
+ - YAML example:
+
+```YAML
+apiVersion: extensions/v1beta1
+kind: DaemonSet
+metadata:
+  labels:
+    app: datastore
+  name: datastore
+spec:
+  template:
+    metadata:
+      labels:
+        app: datastore-shard
+    spec:
+      nodeSelector:
+        app: datastore-node
+      containers:
+      - name: datastore-shard
+        image: kubernetes/sharded
+        ports:
+        - containerPort: 9042
+          name: main
+```
+
+ - commands that get info:
+ - get (e.g. kubectl get daemonsets)
+ - describe
+ - Modifiers:
+ - delete (if --cascade=true, then first the client turns down all the pods
+controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is
+unlikely to be set on any node); then it deletes the DaemonSet; then it deletes
+the pods)
+ - label
+ - annotate
+ - update operations like patch and replace (only allowed to selector and to
+nodeSelector and nodeName of pod template)
+ - DaemonSets have labels, so you could, for example, list all DaemonSets
+with certain labels (the same way you would for a Replication Controller).
+
+In general, for all the supported features like get, describe, update, etc,
+the DaemonSet works in a similar way to the Replication Controller. However,
+note that the DaemonSet and the Replication Controller are different constructs.
+
+### Persisting Pods
+
+ - Ordinary liveness probes specified in the pod template work to keep pods
+created by a DaemonSet running.
+ - If a daemon pod is killed or stopped, the DaemonSet will create a new
+replica of the daemon pod on the node.
+
+### Cluster Mutations
+
+ - When a new node is added to the cluster, the DaemonSet controller starts
+daemon pods on the node for DaemonSets whose pod template nodeSelectors match
+the node’s labels.
+ - Suppose the user launches a DaemonSet that runs a logging daemon on all
+nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label
+to a node (that did not initially have the label), the logging daemon will
+launch on the node. Additionally, if a user removes the label from a node, the
+logging daemon on that node will be killed.
+
+## Alternatives Considered
+
+We considered several alternatives that were deemed inferior to the approach of
+creating a new DaemonSet abstraction.
+
+One alternative is to include the daemon in the machine image. In this case it
+would run outside of Kubernetes proper, and thus not be monitored, health
+checked, usable as a service endpoint, easily upgradable, etc.
+
+A related alternative is to package daemons as static pods. This would address
+most of the problems described above, but they would still not be easily
+upgradable, and more generally could not be managed through the API server
+interface.
+
+A third alternative is to generalize the Replication Controller. We would do
+something like: if you set the `replicas` field of the ReplicationControllerSpec
+to -1, then it means "run exactly one replica on every node matching the
+nodeSelector in the pod template." The ReplicationController would pretend
+`replicas` had been set to some large number -- larger than the largest number
+of nodes ever expected in the cluster -- and would use some anti-affinity
+mechanism to ensure that no more than one Pod from the ReplicationController
+runs on any given node. There are two downsides to this approach. First,
+there would always be a large number of Pending pods in the scheduler (these
+will be scheduled onto new machines when they are added to the cluster). The
+second downside is more philosophical: DaemonSet and the Replication Controller
+are very different concepts. We believe that having small, targeted controllers
+for distinct purposes makes Kubernetes easier to understand and use, compared to
+having larger multi-functional controllers (see
+["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for
+some discussion of this topic).
+
+## Design
+
+#### Client
+
+- Add support for DaemonSet commands to kubectl and the client. Client code was
+added to pkg/client/unversioned. The main files in Kubectl that were modified are
+pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create,
+and Update, the client simply forwards the request to the backend via the REST
+API.
+
+#### Apiserver
+
+- Accept, parse, validate client commands
+- REST API calls are handled in pkg/registry/daemonset
+ - In particular, the api server will add the object to etcd
+ - DaemonManager listens for updates to etcd (using Framework.informer)
+- API objects for DaemonSet were created in expapi/v1/types.go and
+expapi/v1/register.go
+- Validation code is in expapi/validation
+
+#### Daemon Manager
+
+- Creates new DaemonSets when requested. Launches the corresponding daemon pod
+on all nodes with labels matching the new DaemonSet’s selector.
+- Listens for addition of new nodes to the cluster, by setting up a
+framework.NewInformer that watches for the creation of Node API objects. When a
+new node is added, the daemon manager will loop through each DaemonSet. If the
+label of the node matches the selector of the DaemonSet, then the daemon manager
+will create the corresponding daemon pod in the new node.
+- The daemon manager creates a pod on a node by sending a command to the API
+server, requesting that a pod be bound to the node (the node is specified
+via its hostname). A rough sketch of this path is shown below.
+
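+The sketch below is illustrative only; the `Node`, `DaemonSet`,
+`newPodFromTemplate` and `createPod` names are placeholders rather than actual
+Kubernetes client APIs, and the selector matching is simplified to exact
+key/value equality.
+
+```go
+// onNodeAdd creates, for every DaemonSet whose pod template nodeSelector
+// matches the new node's labels, the daemon pod pre-bound to that node.
+func onNodeAdd(node *Node, daemonSets []DaemonSet) {
+    for _, ds := range daemonSets {
+        if !selectorMatches(ds.Spec.Template.Spec.NodeSelector, node.Labels) {
+            continue
+        }
+        pod := newPodFromTemplate(ds.Spec.Template)
+        // Bind the pod directly to the node, bypassing the scheduler.
+        pod.Spec.NodeName = node.Name
+        createPod(pod)
+    }
+}
+
+// selectorMatches reports whether every key/value pair in the selector is
+// present in the node's labels.
+func selectorMatches(selector, nodeLabels map[string]string) bool {
+    for k, v := range selector {
+        if nodeLabels[k] != v {
+            return false
+        }
+    }
+    return true
+}
+```
+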
+#### Kubelet
+
+- Does not need to be modified, but health checking will occur for the daemon
+pods and will revive them if they are killed (we set the pod restartPolicy to
+Always). We reject DaemonSet objects with pod templates that don’t have
+restartPolicy set to Always.
+
+## Open Issues
+
+- Should work similarly to [Deployment](http://issues.k8s.io/1743).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/deploy.md b/contributors/design-proposals/deploy.md
new file mode 100644
index 00000000..a27fb01f
--- /dev/null
+++ b/contributors/design-proposals/deploy.md
@@ -0,0 +1,147 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Deploy through CLI](#deploy-through-cli)
+ - [Motivation](#motivation)
+ - [Requirements](#requirements)
+ - [Related `kubectl` Commands](#related-kubectl-commands)
+ - [`kubectl run`](#kubectl-run)
+ - [`kubectl scale` and `kubectl autoscale`](#kubectl-scale-and-kubectl-autoscale)
+ - [`kubectl rollout`](#kubectl-rollout)
+ - [`kubectl set`](#kubectl-set)
+ - [Mutating Operations](#mutating-operations)
+ - [Example](#example)
+ - [Support in Deployment](#support-in-deployment)
+ - [Deployment Status](#deployment-status)
+ - [Deployment Version](#deployment-version)
+ - [Pause Deployments](#pause-deployments)
+ - [Perm-failed Deployments](#perm-failed-deployments)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Deploy through CLI
+
+## Motivation
+
+Users can use [Deployments](../user-guide/deployments.md) or [`kubectl rolling-update`](../user-guide/kubectl/kubectl_rolling-update.md) to deploy in their Kubernetes clusters. A Deployment provides declarative update for Pods and ReplicationControllers, whereas `rolling-update` allows the users to update their earlier deployment without worrying about schemas and configurations. Users need a way that's similar to `rolling-update` to manage their Deployments more easily.
+
+`rolling-update` expects ReplicationController as the only resource type it deals with. It's not trivial to support exactly the same behavior with Deployment, which requires:
+- Printing out scaling up/down events.
+- Stopping the deployment if users press Ctrl-c.
+- Ensuring the controller makes no more changes once the process ends. (Delete the deployment when status.replicas=status.updatedReplicas=spec.replicas.)
+
+So, instead, this document proposes another way to support easier deployment management via Kubernetes CLI (`kubectl`).
+
+## Requirements
+
+The following operations need to be supported so that users can easily manage their deployments:
+
+- **Create**: To create deployments.
+- **Rollback**: To restore to an earlier version of deployment.
+- **Watch the status**: To watch for the status update of deployments.
+- **Pause/resume**: To pause a deployment mid-way, and to resume it. (A use case is to support canary deployment.)
+- **Version information**: To record and show version information that's meaningful to users. This can be useful for rollback.
+
+## Related `kubectl` Commands
+
+### `kubectl run`
+
+`kubectl run` should support the creation of Deployment (already implemented) and DaemonSet resources.
+
+### `kubectl scale` and `kubectl autoscale`
+
+Users may use `kubectl scale` or `kubectl autoscale` to scale up and down Deployments (both already implemented).
+
+### `kubectl rollout`
+
+`kubectl rollout` supports both Deployment and DaemonSet. It has the following subcommands:
+- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous version of deployment.
+- `kubectl rollout pause` allows the users to pause a deployment. See [pause deployments](#pause-deployments).
+- `kubectl rollout resume` allows the users to resume a paused deployment.
+- `kubectl rollout status` shows the status of a deployment.
+- `kubectl rollout history` shows meaningful version information of all previous deployments. See [deployment version](#deployment-version).
+- `kubectl rollout retry` retries a failed deployment. See [perm-failed deployments](#perm-failed-deployments).
+
+### `kubectl set`
+
+`kubectl set` has the following subcommands:
+- `kubectl set env` allows the users to set environment variables of Kubernetes resources. It should support any object that contains a single, primary PodTemplate (such as Pod, ReplicationController, ReplicaSet, Deployment, and DaemonSet).
+- `kubectl set image` allows the users to update multiple images of Kubernetes resources. Users will use `--container` and `--image` flags to update the image of a container. It should support anything that has a PodTemplate.
+
+`kubectl set` should be used for things that are common and commonly modified. Other possible future commands include:
+- `kubectl set volume`
+- `kubectl set limits`
+- `kubectl set security`
+- `kubectl set port`
+
+### Mutating Operations
+
+Other means of mutating Deployments and DaemonSets, including `kubectl apply`, `kubectl edit`, `kubectl replace`, `kubectl patch`, `kubectl label`, and `kubectl annotate`, may trigger rollouts if they modify the pod template.
+
+`kubectl create` and `kubectl delete`, for creating and deleting Deployments and DaemonSets, are also relevant.
+
+### Example
+
+With the commands introduced above, here's an example of deployment management:
+
+```console
+# Create a Deployment
+$ kubectl run nginx --image=nginx --replicas=2 --generator=deployment/v1beta1
+
+# Watch the Deployment status
+$ kubectl rollout status deployment/nginx
+
+# Update the Deployment
+$ kubectl set image deployment/nginx --container=nginx --image=nginx:<some-version>
+
+# Pause the Deployment
+$ kubectl rollout pause deployment/nginx
+
+# Resume the Deployment
+$ kubectl rollout resume deployment/nginx
+
+# Check the change history (deployment versions)
+$ kubectl rollout history deployment/nginx
+
+# Rollback to a previous version.
+$ kubectl rollout undo deployment/nginx --to-version=<version>
+```
+
+## Support in Deployment
+
+### Deployment Status
+
+Deployment status should summarize information about Pods, which includes:
+- The number of pods of each version.
+- The number of ready/not ready pods.
+
+See issue [#17164](https://github.com/kubernetes/kubernetes/issues/17164).
+
+### Deployment Version
+
+We store previous deployment version information in annotations `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` of replication controllers of the deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`.
+- `rollout.kubectl.kubernetes.io/change-source`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation.
+- `rollout.kubectl.kubernetes.io/version` records a version number to distinguish the change sequence of a deployment's
+replication controllers. A deployment obtains the largest version number from its replication controllers, increments that number by 1 upon update or creation of the deployment, and updates the version annotation of its new replication controller.
+
+When the users perform a rollback, i.e. `kubectl rollout undo`, the deployment first looks at its existing replication controllers, regardless of their number of replicas. It then finds the one whose `rollout.kubectl.kubernetes.io/version` annotation either contains the specified rollback version number or, if the user didn't specify a version number (i.e. wants to roll back to the last change), contains the second-largest version number among all the replication controllers (the current new replication controller should hold the largest version number). Lastly, it
+starts scaling up the replication controller it's rolling back to, scales down the current ones, and then updates the version counter and the rollout annotations accordingly.
+
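+To make the selection rule above concrete, here is a minimal sketch of choosing
+the rollback target. The `ReplicationController` type with an `Annotations`
+map, and the zero-means-unspecified convention for the version argument, are
+assumptions for illustration only.
+
+```go
+import (
+    "sort"
+    "strconv"
+)
+
+// findRollbackTarget chooses the replication controller that a rollback should
+// scale back up. rcs are the deployment's existing replication controllers;
+// toVersion == 0 means "the previous version".
+func findRollbackTarget(rcs []ReplicationController, toVersion int) *ReplicationController {
+    const versionAnnotation = "rollout.kubectl.kubernetes.io/version"
+    version := func(rc *ReplicationController) int {
+        v, _ := strconv.Atoi(rc.Annotations[versionAnnotation])
+        return v
+    }
+    // Sort by the version annotation, newest first.
+    sorted := make([]*ReplicationController, 0, len(rcs))
+    for i := range rcs {
+        sorted = append(sorted, &rcs[i])
+    }
+    sort.Slice(sorted, func(i, j int) bool { return version(sorted[i]) > version(sorted[j]) })
+
+    if toVersion != 0 {
+        for _, rc := range sorted {
+            if version(rc) == toVersion {
+                return rc
+            }
+        }
+        return nil // no RC recorded with the requested version
+    }
+    // No version given: the newest RC is the current one, so roll back to the
+    // one with the second-largest version, if any.
+    if len(sorted) < 2 {
+        return nil
+    }
+    return sorted[1]
+}
+```
+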
+Note that a deployment's replication controllers use PodTemplate hashes (i.e. the hash of `.spec.template`) to distinguish themselves from each other. When doing a rollout or rollback, a deployment reuses an existing replication controller if it has the same PodTemplate, and that controller's `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` annotations will be updated by the new rollout. At this point, the earlier state of this replication controller is lost in history. For example, if we had 3 replication controllers in the
+deployment history, and then we do a rollout with the same PodTemplate as version 1, then version 1 is lost and becomes version 4 after the rollout.
+
+To make deployment versions more meaningful and readable for the users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout:
+- `--description`: adds `description` annotation to an object when it's created to describe the object.
+- `--note`: adds `note` annotation to an object when it's updated to record the change.
+- `--commit`: adds `commit` annotation to an object with the commit id.
+
+### Pause Deployments
+
+Users sometimes need to temporarily disable a deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516).
+
+### Perm-failed Deployments
+
+The deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment. The users can retry a failed deployment with `kubectl rollout retry`. See issue [#14519](https://github.com/kubernetes/kubernetes/issues/14519).
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deploy.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/deployment.md b/contributors/design-proposals/deployment.md
new file mode 100644
index 00000000..f12ffc9e
--- /dev/null
+++ b/contributors/design-proposals/deployment.md
@@ -0,0 +1,229 @@
+# Deployment
+
+## Abstract
+
+A proposal for implementing a new resource - Deployment - which will enable
+declarative config updates for Pods and ReplicationControllers.
+
+Users will be able to create a Deployment, which will spin up
+a ReplicationController to bring up the desired pods.
+Users can also target the Deployment at existing ReplicationControllers, in
+which case the new RC will replace the existing ones. The exact mechanics of
+replacement depends on the DeploymentStrategy chosen by the user.
+DeploymentStrategies are explained in detail in a later section.
+
+## Implementation
+
+### API Object
+
+The `Deployment` API object will have the following structure:
+
+```go
+type Deployment struct {
+ TypeMeta
+ ObjectMeta
+
+ // Specification of the desired behavior of the Deployment.
+ Spec DeploymentSpec
+
+ // Most recently observed status of the Deployment.
+ Status DeploymentStatus
+}
+
+type DeploymentSpec struct {
+ // Number of desired pods. This is a pointer to distinguish between explicit
+ // zero and not specified. Defaults to 1.
+ Replicas *int
+
+ // Label selector for pods. Existing ReplicationControllers whose pods are
+ // selected by this will be scaled down. New ReplicationControllers will be
+ // created with this selector, with a unique label `pod-template-hash`.
+ // If Selector is empty, it is defaulted to the labels present on the Pod template.
+ Selector map[string]string
+
+ // Describes the pods that will be created.
+ Template *PodTemplateSpec
+
+ // The deployment strategy to use to replace existing pods with new ones.
+ Strategy DeploymentStrategy
+}
+
+type DeploymentStrategy struct {
+ // Type of deployment. Can be "Recreate" or "RollingUpdate".
+ Type DeploymentStrategyType
+
+ // TODO: Update this to follow our convention for oneOf, whatever we decide it
+ // to be.
+ // Rolling update config params. Present only if DeploymentStrategyType =
+ // RollingUpdate.
+ RollingUpdate *RollingUpdateDeploymentStrategy
+}
+
+type DeploymentStrategyType string
+
+const (
+ // Kill all existing pods before creating new ones.
+ RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate"
+
+	// Replace the old RCs by a new one using a rolling update, i.e., gradually scale down the old RCs and scale up the new one.
+ RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate"
+)
+
+// Spec to control the desired behavior of rolling update.
+type RollingUpdateDeploymentStrategy struct {
+ // The maximum number of pods that can be unavailable during the update.
+ // Value can be an absolute number (ex: 5) or a percentage of total pods at the start of update (ex: 10%).
+ // Absolute number is calculated from percentage by rounding up.
+ // This can not be 0 if MaxSurge is 0.
+ // By default, a fixed value of 1 is used.
+ // Example: when this is set to 30%, the old RC can be scaled down by 30%
+ // immediately when the rolling update starts. Once new pods are ready, old RC
+ // can be scaled down further, followed by scaling up the new RC, ensuring
+ // that at least 70% of original number of pods are available at all times
+ // during the update.
+ MaxUnavailable IntOrString
+
+ // The maximum number of pods that can be scheduled above the original number of
+ // pods.
+ // Value can be an absolute number (ex: 5) or a percentage of total pods at
+ // the start of the update (ex: 10%). This can not be 0 if MaxUnavailable is 0.
+ // Absolute number is calculated from percentage by rounding up.
+ // By default, a value of 1 is used.
+ // Example: when this is set to 30%, the new RC can be scaled up by 30%
+ // immediately when the rolling update starts. Once old pods have been killed,
+ // new RC can be scaled up further, ensuring that total number of pods running
+	// at any time during the update is at most 130% of original pods.
+ MaxSurge IntOrString
+
+ // Minimum number of seconds for which a newly created pod should be ready
+ // without any of its container crashing, for it to be considered available.
+ // Defaults to 0 (pod will be considered available as soon as it is ready)
+ MinReadySeconds int
+}
+
+type DeploymentStatus struct {
+ // Total number of ready pods targeted by this deployment (this
+ // includes both the old and new pods).
+ Replicas int
+
+ // Total number of new ready pods with the desired template spec.
+ UpdatedReplicas int
+}
+
+```
+
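+As a worked example of the bounds these two fields imply (using the
+rounding-up rule stated in the comments above), the illustrative helper below
+computes the minimum available and maximum total pod counts during a rolling
+update: for 10 desired replicas with MaxUnavailable=30% and MaxSurge=30%, at
+least 7 pods stay available and at most 13 pods run at any time.
+
+```go
+import "math"
+
+// rollingUpdateBounds returns the minimum number of pods that must remain
+// available and the maximum total number of pods allowed during a rolling
+// update, given percentage values for maxUnavailable and maxSurge.
+func rollingUpdateBounds(desired int, maxUnavailablePct, maxSurgePct float64) (minAvailable, maxTotal int) {
+    unavailable := int(math.Ceil(float64(desired) * maxUnavailablePct / 100))
+    surge := int(math.Ceil(float64(desired) * maxSurgePct / 100))
+    return desired - unavailable, desired + surge
+}
+```
+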
+### Controller
+
+#### Deployment Controller
+
+The DeploymentController will make Deployments happen.
+It will watch Deployment objects in etcd.
+For each pending deployment, it will:
+
+1. Find all RCs whose label selector is a superset of DeploymentSpec.Selector.
+ - For now, we will do this in the client - list all RCs and then filter the
+ ones we want. Eventually, we want to expose this in the API.
+2. The new RC can have the same selector as the old RC and hence we add a unique
+ selector to all these RCs (and the corresponding label to their pods) to ensure
+ that they do not select the newly created pods (or old pods get selected by
+ new RC).
+ - The label key will be "pod-template-hash".
+ - The label value will be hash of the podTemplateSpec for that RC without
+ this label. This value will be unique for all RCs, since PodTemplateSpec should be unique.
+   - If the RCs and pods don't already have this label and selector:
+ - We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
+ ensure that all new pods that they create will have this label.
+ - Then we will add this label to their existing pods and then add this as a selector
+ to that RC.
+3. Find if there exists an RC for which the value of the "pod-template-hash" label
+   is the same as the hash of DeploymentSpec.PodTemplateSpec. If it already exists, then
+   this is the RC that will be ramped up. If there is no such RC, then we create
+   a new one using DeploymentSpec and then add a "pod-template-hash" label
+   to it. RCSpec.replicas = 0 for a newly created RC.
+4. Scale up the new RC and scale down the old ones as per the DeploymentStrategy.
+ - Raise an event if we detect an error, like new pods failing to come up.
+5. Go back to step 1 unless the new RC has been ramped up to desired replicas
+ and the old RCs have been ramped down to 0.
+6. Cleanup.
+
+DeploymentController is stateless so that it can recover in case it crashes during a deployment.
+
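+As an illustration of the `pod-template-hash` computation in step 2 above,
+below is a minimal sketch. It assumes `PodTemplateSpec` is a value type with a
+`Metadata.Labels` map; JSON serialization plus FNV-1a is an illustrative
+choice, not an algorithm prescribed by this proposal.
+
+```go
+import (
+    "encoding/json"
+    "fmt"
+    "hash/fnv"
+)
+
+// podTemplateHash returns a deterministic hash of a pod template with the
+// pod-template-hash label itself excluded, so adding the label afterwards does
+// not change the hash.
+func podTemplateHash(template PodTemplateSpec) (string, error) {
+    // Copy the labels so the caller's template is not mutated.
+    labels := map[string]string{}
+    for k, v := range template.Metadata.Labels {
+        if k != "pod-template-hash" {
+            labels[k] = v
+        }
+    }
+    template.Metadata.Labels = labels
+
+    data, err := json.Marshal(template)
+    if err != nil {
+        return "", err
+    }
+    h := fnv.New32a()
+    h.Write(data)
+    return fmt.Sprintf("%d", h.Sum32()), nil
+}
+```
+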
+### MinReadySeconds
+
+We will implement MinReadySeconds using the Ready condition in Pod. We will add
+a LastTransitionTime to PodCondition and update kubelet to set Ready to false,
+each time any container crashes. Kubelet will set Ready condition back to true once
+all containers are ready. For containers without a readiness probe, we will
+assume that they are ready as soon as they are up.
+https://github.com/kubernetes/kubernetes/issues/11234 tracks updating kubelet
+and https://github.com/kubernetes/kubernetes/issues/12615 tracks adding
+LastTransitionTime to PodCondition.
+
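+A minimal sketch of the availability check implied by MinReadySeconds, using
+the Ready condition and its LastTransitionTime; the function and parameter
+shapes here are illustrative only.
+
+```go
+import "time"
+
+// podAvailable reports whether a pod counts as available: its Ready condition
+// must be true and must have been true for at least minReadySeconds.
+func podAvailable(ready bool, lastTransitionTime time.Time, minReadySeconds int, now time.Time) bool {
+    if !ready {
+        return false
+    }
+    if minReadySeconds == 0 {
+        // Default: available as soon as the pod is ready.
+        return true
+    }
+    return now.Sub(lastTransitionTime) >= time.Duration(minReadySeconds)*time.Second
+}
+```
+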
+## Changing Deployment mid-way
+
+### Updating
+
+Users can update an ongoing deployment before it is completed.
+In this case, the existing deployment will be stalled and the new one will
+begin.
+For example, consider the following case:
+- User creates a deployment to rolling-update 10 pods with image:v1 to
+ pods with image:v2.
+- User then updates this deployment to create pods with image:v3,
+ when the image:v2 RC had been ramped up to 5 pods and the image:v1 RC
+ had been ramped down to 5 pods.
+- When Deployment Controller observes the new deployment, it will create
+ a new RC for creating pods with image:v3. It will then start ramping up this
+ new RC to 10 pods and will ramp down both the existing RCs to 0.
+
+### Deleting
+
+Users can pause/cancel a deployment by deleting it before it is completed.
+Recreating the same deployment will resume it.
+For example, consider the following case:
+- User creates a deployment to rolling-update 10 pods with image:v1 to
+ pods with image:v2.
+- User then deletes this deployment while the old and new RCs are at 5 replicas each.
+ User will end up with 2 RCs with 5 replicas each.
+The user can then create the same deployment again, in which case DeploymentController will
+notice that the second RC already exists and can ramp it up while ramping down
+the first one.
+
+### Rollback
+
+We want to allow the user to roll back a deployment. To roll back a
+completed (or ongoing) deployment, the user can create (or update) a deployment with
+DeploymentSpec.PodTemplateSpec = oldRC.PodTemplateSpec.
+
+## Deployment Strategies
+
+DeploymentStrategy specifies how the new RC should replace existing RCs.
+To begin with, we will support 2 types of deployment:
+* Recreate: We kill all existing RCs and then bring up the new one. This results
+ in quick deployment but there is a downtime when old pods are down but
+ the new ones have not come up yet.
+* Rolling update: We gradually scale down old RCs while scaling up the new one.
+ This results in a slower deployment, but there is no downtime. At all times
+ during the deployment, there are a few pods available (old or new). The number
+  of available pods, and when a pod is considered "available", can be configured
+ using RollingUpdateDeploymentStrategy.
+
+In future, we want to support more deployment types.
+
+## Future
+
+Apart from the above, we want to add support for the following:
+* Running the deployment process in a pod: In future, we can run the deployment process in a pod. Then users can define their own custom deployments and we can run them using the image name.
+* More DeploymentStrategyTypes: https://github.com/openshift/origin/blob/master/examples/deployment/README.md#deployment-types lists most commonly used ones.
+* Triggers: Deployment will have a trigger field to identify what triggered the deployment. Options are: Manual/UserTriggered, Autoscaler, NewImage.
+* Automatic rollback on error: We want to support automatic rollback on error or timeout.
+
+## References
+
+- https://github.com/kubernetes/kubernetes/issues/1743 has most of the
+ discussion that resulted in this proposal.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/deployment.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/disk-accounting.md b/contributors/design-proposals/disk-accounting.md
new file mode 100755
index 00000000..19782356
--- /dev/null
+++ b/contributors/design-proposals/disk-accounting.md
@@ -0,0 +1,615 @@
+**Author**: Vishnu Kannan
+
+**Last** **Updated**: 11/16/2015
+
+**Status**: Pending Review
+
+This proposal is an attempt to come up with a means for accounting disk usage in Kubernetes clusters that are running docker as the container runtime. Some of the principles here might apply for other runtimes too.
+
+### Why is disk accounting necessary?
+
+As of Kubernetes v1.1, clusters become unusable over time due to the local disk becoming full. The kubelets on the nodes attempt to perform garbage collection of old containers and images, but that doesn’t prevent running pods from using up all the available disk space.
+
+Kubernetes users have no insight into how the disk is being consumed.
+
+Large images and rapid logging can lead to temporary downtime on the nodes. The node has to free up disk space by deleting images and containers. During this cleanup, existing pods can fail and new pods cannot be started. The node will also transition into an `OutOfDisk` condition, preventing more pods from being scheduled to the node.
+
+Automated eviction of pods that are hogging the local disk is not possible since proper accounting isn’t available.
+
+Since local disk is a non-compressible resource, users need means to restrict usage of local disk by pods and containers. Proper disk accounting is a prerequisite. As of today, a misconfigured low-QoS-class pod can end up bringing down the entire cluster by taking up all the available disk space (misconfigured logging, for example).
+
+### Goals
+
+1. Account for disk usage on the nodes.
+
+2. Compatibility with the most common docker storage backends - devicemapper, aufs and overlayfs
+
+3. Provide a roadmap for enabling disk as a schedulable resource in the future.
+
+4. Provide a plugin interface for extending support to non-default filesystems and storage drivers.
+
+### Non Goals
+
+1. Compatibility with all storage backends. The matrix is pretty large already and the priority is to get disk accounting working on the most widely deployed platforms.
+
+2. Support for filesystems other than ext4 and xfs.
+
+### Introduction
+
+Disk accounting in a Kubernetes cluster running docker is complex because of the plethora of ways in which disk gets utilized by a container.
+
+Disk can be consumed for:
+
+1. Container images
+
+2. Container’s writable layer
+
+3. Container’s logs - when written to stdout/stderr and default logging backend in docker is used.
+
+4. Local volumes - hostPath, emptyDir, gitRepo, etc.
+
+As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container’s writable layer for the aufs docker storage driver.
+This information is made available to end users via the heapster monitoring pipeline.
+
+#### Image layers
+
+Image layers are shared between containers (COW) and so accounting for images is complicated.
+
+Image layers will have to be accounted as system overhead.
+
+As of today, it is not possible to check if there is enough disk space available on the node before an image is pulled.
+
+#### Writable Layer
+
+Docker creates a writable layer for every container on the host. Depending on the storage driver, the location and the underlying filesystem of this layer will change.
+
+Any files that the container creates or updates (assuming there are no volumes) will be considered as writable layer usage.
+
+The underlying filesystem is whatever the docker storage directory resides on. It is ext4 by default on most distributions, and xfs on RHEL.
+
+#### Container logs
+
+Docker engine provides a pluggable logging interface. Kubernetes is currently using the default logging mode which is `local file`. In this mode, the docker daemon stores bytes written by containers to their stdout or stderr, to local disk. These log files are contained in a special directory that is managed by the docker daemon. These logs are exposed via `docker logs` interface which is then exposed via kubelet and apiserver APIs. Currently, there is a hard-requirement for persisting these log files on the disk.
+
+#### Local Volumes
+
+Volumes are slightly different from other local disk use cases. They are pod scoped. Their lifetime is tied to that of a pod. Due to this property accounting of volumes will also be at the pod level.
+
+As of now, the volume types that can use local disk directly are ‘HostPath’, ‘EmptyDir’, and ‘GitRepo’. Secrets and Downward API volumes wrap these primitive volumes.
+Everything else is a network based volume.
+
+‘HostPath’ volumes map existing directories on the host filesystem into a pod. Kubernetes manages only the mapping. It does not manage the source on the host filesystem.
+
+In addition to this, the changes introduced by a pod to the source of a hostPath volume are not cleaned up by Kubernetes once the pod exits. Due to these limitations, we will have to account hostPath volumes to system overhead. We should explicitly discourage use of HostPath in read-write mode.
+
+`EmptyDir`, `GitRepo` and other local storage volumes map to a directory on the host root filesystem, that is managed by Kubernetes (kubelet). Their contents are erased as soon as the pod exits. Tracking and potentially restricting usage for volumes is possible.
+
+### Docker storage model
+
+Before we start exploring solutions, let’s get familiar with how docker handles storage for images, writable layer and logs.
+
+On all storage drivers, logs are stored under `<docker root dir>/containers/<container-id>/`
+
+The default location of the docker root directory is `/var/lib/docker`.
+
+Volumes are handled by kubernetes.
+*Caveat: Volumes specified as part of Docker images are not handled by Kubernetes currently.*
+
+Container images and writable layers are managed by docker, and their location changes depending on the storage driver. Each image layer and writable layer is referred to by an ID. Image layers are read-only. An existing writable layer can be frozen by saving it as a new image, but this feature is not of importance to kubernetes since it works only with immutable images.
+
+*Note: Image layer IDs can be obtained by running `docker history -q --no-trunc <imagename>`*
+
+##### Aufs
+
+Image layers and writable layers are stored under `/var/lib/docker/aufs/diff/<id>`.
+
+The writable layer’s ID is the same as the container’s ID.
+
+##### Devicemapper
+
+Each container and each image gets its own block device, and a container’s device exists while the container is running. Since this driver works at the block level, it is not possible to access the layers directly without mounting them.
+
+##### Overlayfs
+
+Image layers and writable layers are stored under `/var/lib/docker/overlay/<id>`.
+
+Identical files are hardlinked between images.
+
+The image layers contain all their data under a `root` subdirectory.
+
+Everything under `/var/lib/docker/overlay/<id>` is required for running the container, including its writable layer.
+
+### Improve disk accounting
+
+Disk accounting is dependent on the storage driver in docker. A common solution that works across all storage drivers isn't available.
+
+I’m listing a few possible solutions for disk accounting below along with their limitations.
+
+We need a plugin model for disk accounting. Some storage drivers in docker will require special plugins.
+
+#### Container Images
+
+As of today, the partition that is holding docker images is flagged by cadvisor, and it uses filesystem stats to identify the overall disk usage of that partition.
+
+Isolated usage of just image layers is available today using `docker history <image name>`.
+But isolated usage isn't of much use because image layers are shared between containers and so it is not possible to charge a single pod for image disk usage.
+
+Continuing to use the availability of the entire partition for garbage collection purposes in the kubelet should not affect reliability;
+we might just garbage collect more often.
+As long as we do not expose features that require persisting old containers, computing image layer usage isn’t necessary.
+
+The main goals for images are:
+1. Capture total image disk usage.
+2. Check whether a new image will fit on disk.
+
+In case we choose to compute the size of image layers alone, the following are some of the ways to achieve that.
+
+*Note that some of the strategies mentioned below are applicable in general to other kinds of storage like volumes, etc.*
+
+##### Docker History
+
+It is possible to run `docker history` and then create a graph of all images and corresponding image layers.
+This graph will let us figure out the disk usage of all the images.
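+
+A rough sketch of this approach in shell, assuming the classic docker layout where every layer ID reported by `docker history -q --no-trunc` can be passed to `docker inspect` (layers reported as `<missing>` are skipped):
+
+```shell
+# Sum the sizes of all unique image layers on the node.
+# Layers shared between images are counted once because the IDs are de-duplicated.
+docker images -q --no-trunc | while read image; do
+  docker history -q --no-trunc "$image"
+done | grep -v '<missing>' | sort -u | while read layer; do
+  docker inspect -f '{{.Size}}' "$layer"
+done | awk '{total += $1} END {printf "total image layer usage: %d bytes\n", total}'
+```
+
+A real implementation would keep this graph in memory and update it on image pulls and removals instead of shelling out on every accounting pass.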
+
+**Pros**
+* Compatible across storage drivers.
+
+**Cons**
+* Requires maintaining an internal representation of images.
+
+##### Enhance docker
+
+Docker handles the upload and download of image layers. It can embed enough information about each layer. If docker is enhanced to expose this information, we can statically identify space about to be occupied by read-only image layers, even before the image layers are downloaded.
+
+A new [docker feature](https://github.com/docker/docker/pull/16450) (docker pull --dry-run) is pending review, which outputs the disk space that will be consumed by new images. Once this feature lands, we can perform feasibility checks and reject pods that would consume more disk space than what is currently available on the node.
+
+Another option is to expose disk usage of all images together as a first-class feature.
+
+**Pros**
+
+* Works across all storage drivers since docker abstracts the storage drivers.
+
+* Less code to maintain in kubelet.
+
+**Cons**
+
+* Not available today.
+
+* Requires serialized image pulls.
+
+* Metadata files are not tracked.
+
+##### Overlayfs and Aufs
+
+###### `du`
+
+We can list all the image layer specific directories, excluding container directories, and run `du` on each of those directories.
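+
+For example, a rough sketch for aufs, assuming the default `/var/lib/docker` location (the `-init` handling reflects the extra init layer aufs creates per container):
+
+```shell
+# Disk usage of image layers only: walk the aufs diff directories and skip
+# any directory whose ID belongs to a container (i.e. a writable layer).
+driver_dir=/var/lib/docker/aufs/diff
+containers=$(docker ps -aq --no-trunc)
+
+for layer in "$driver_dir"/*; do
+  id=$(basename "$layer")
+  echo "$containers" | grep -q "^${id%-init}$" && continue
+  du -sk "$layer"
+done | awk '{total += $1} END {print total " KiB used by image layers"}'
+```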
+
+**Pros**:
+
+* This is the least-intrusive approach.
+
+* It works out of the box without requiring any additional configuration.
+
+**Cons**:
+
+* `du` can consume a lot of cpu and memory. There have been several issues reported against the kubelet in the past that were related to `du`.
+
+* It is time consuming and cannot be run frequently. It requires special handling to constrain its resource usage - setting a lower nice value or running it in a sub-container.
+
+* Can block container deletion by keeping file descriptors open.
+
+
+###### Linux gid based Disk Quota
+
+The [Disk quota](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html) feature provided by the linux kernel can be used to track the usage of image layers. Ideally, we need `project` support for disk quota, which lets us track usage of directory hierarchies using `project ids`. Unfortunately, that feature is only available for xfs filesystems today. Since most of our distributions use `ext4` by default, we will have to use either `uid` or `gid` based quota tracking.
+
+Both `uids` and `gids` are meant for security. Overloading that concept for disk tracking is painful and ugly. But, that is what we have today.
+
+Kubelet needs to define a gid for tracking image layers and make that gid or group the owner of `/var/lib/docker/[aufs | overlay]` recursively. Once this is done, the quota sub-system in the kernel will report the blocks being consumed by the storage driver on the underlying partition.
+
+Since this number also includes the container’s writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking container’s writable layer. Once we apply a different `gid` to the container’s writable layer, which is located under `/var/lib/docker/<storage_driver>/diff/<container_id>`, the quota subsystem will not include the container’s writable layer usage.
+
+Xfs, on the other hand, supports project quota, which lets us track disk usage of arbitrary directories using a project ID. Support for this feature in ext4 is being reviewed. So on xfs, we can use quota without having to clobber the writable layer's uid and gid.
+
+**Pros**:
+
+* Low overhead tracking provided by the kernel.
+
+
+**Cons**
+
+* Requires updates to default ownership on docker’s internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native.
+
+* Requires additional node configuration - quota subsystem needs to be setup on the node. This can either be automated or made a requirement for the node.
+
+* Kubelet needs to perform gid management. A range of gids has to be allocated to the kubelet for the purposes of quota management. This range must not be used out of band for any other purpose. This is not required if project quota is available.
+
+* Breaks `docker save` semantics. Since kubernetes assumes immutable images, this is not a blocker. To support quota in docker, we will need user-namespaces along with custom gid mapping for each container. This feature does not exist today. This is not an issue with project quota.
+
+*Note: Refer to the [Appendix](#appendix) section for more real examples of using quota with docker.*
+
+**Project Quota**
+
+Project Quota support for ext4 is currently being reviewed upstream. If that feature lands upstream soon, project IDs will be used for disk tracking instead of uids and gids.
+
+
+##### Devicemapper
+
+The devicemapper storage driver sets up two volumes, metadata and data, that are used to store image layers and container writable layers. The volumes can be real devices or loopback files. A pool device is created which uses the underlying volumes for real storage.
+
+A new thinly-provisioned volume, based on the pool, is created for each running container.
+
+The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and containers’ writable layers.
+
+Since the kubelet has to track writable layer usage anyway, we can subtract the aggregated writable layer (container root filesystem) usage from the overall pool device usage to get the image layers’ disk usage.
+
+Linux quota and `du` will not work with device mapper.
+
+A docker dry run option (mentioned above) is another possibility.
+
+
+#### Container Writable Layer
+
+##### Overlayfs / Aufs
+
+Docker creates a separate directory for the container’s writable layer which is then overlayed on top of read-only image layers.
+
+Both the previously mentioned options of `du` and `Linux Quota` will work for this case as well.
+
+Kubelet can use `du` to track usage and enforce `limits` once disk becomes a schedulable resource. As mentioned earlier `du` is resource intensive.
+
+To use Disk quota, kubelet will have to allocate a separate gid per container. Kubelet can reuse the same gid for multiple instances of the same container (restart scenario). As and when kubelet garbage collects dead containers, the usage of the container will drop.
+
+If local disk becomes a schedulable resource, `linux quota` can be used to impose `request` and `limits` on the container writable layer.
+`limits` can be enforced using hard limits. Enforcing `request` will be tricky. One option is to enforce `requests` only when the disk availability drops below a threshold (10%). Kubelet can at this point evict pods that are exceeding their requested space. Other options include using `soft limits` with grace periods, but this option is complex.
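+
+A hypothetical enforcement sketch, reusing the gid-per-container scheme described above (the gid `9010` and the 1GiB limit are made-up values):
+
+```shell
+container_gid=9010              # assumption: gid allocated by kubelet for this container
+limit_blocks=$((1024 * 1024))   # 1GiB hard limit, expressed in 1KiB quota blocks
+
+# Soft block limit 0 (unused), hard block limit, no inode limits.
+setquota -g "$container_gid" -a 0 "$limit_blocks" 0 0
+
+# Read back current usage when making eviction decisions against `request`.
+quota -g "$container_gid" -v
+```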
+
+##### Devicemapper
+
+FIXME: How to calculate writable layer usage with devicemapper?
+
+To enforce `limits`, the volume created for the container’s writable layer can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) so that it cannot use more than the `limit`. `request` will have to be enforced by the kubelet.
+
+
+#### Container logs
+
+Container logs are not storage driver specific. We can use either `du` or `quota` to track log usage per container. Log files are stored under `/var/lib/docker/containers/<container-id>`.
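+
+As a concrete illustration with `du`, assuming the default docker root directory and the default json-file log driver (which writes `<container-id>-json.log` files):
+
+```shell
+# Per-container log usage in KiB.
+for dir in /var/lib/docker/containers/*; do
+  id=$(basename "$dir")
+  du -ck "$dir"/*-json.log 2>/dev/null | awk -v id="$id" '/total$/ {print id, $1 " KiB"}'
+done
+```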
+
+In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer’s usage individually.
+
+For the purposes of enforcing limits though, kubelet will use the sum of logs and writable layer.
+
+In the future, we can consider adding log rotation support for these log files either in kubelet or via docker.
+
+
+#### Volumes
+
+The local disk based volumes map to a directory on the disk. We can use `du` or `quota` to track the usage of volumes.
+
+There exists a concept called `FsGroup` today in kubernetes, which lets users specify a gid for all volumes in a pod. If that is set, we can use the `FsGroup` gid for quota purposes. This requires `limits` for volumes to be a pod level resource though.
+
+
+### Yet to be explored
+
+* Support for filesystems other than ext4 and xfs like `zfs`
+
+* Support for Btrfs
+
+It should be clear at this point that we need a plugin based model for disk accounting. Support for other filesystems both CoW and regular can be added as and when required. As we progress towards making accounting work on the above mentioned storage drivers, we can come up with an abstraction for storage plugins in general.
+
+
+### Implementation Plan and Milestones
+
+#### Milestone 1 - Get accounting to just work!
+
+This milestone targets exposing the following categories of disk usage from the kubelet - infrastructure (images, sys daemons, etc), containers (log + writable layer) and volumes.
+
+* `du` works today. Use `du` for all the categories and ensure that it works on both aufs and overlayfs.
+
+* Add device mapper support.
+
+* Define a storage driver based pluggable disk accounting interface in cadvisor.
+
+* Reuse that interface for accounting volumes in kubelet.
+
+* Define a disk manager module in kubelet that will serve as a source of disk usage information for the rest of the kubelet.
+
+* Ensure that the kubelet metrics APIs (/apis/metrics/v1beta1) exposes the disk usage information. Add an integration test.
+
+
+#### Milestone 2 - node reliability
+
+Improve user experience by doing whatever is necessary to keep the node running.
+
+NOTE: [`Out of Resource Killing`](https://github.com/kubernetes/kubernetes/issues/17186) design is a prerequisite.
+
+* Disk manager will evict pods and containers based on QoS class whenever the disk availability is below a critical level.
+
+* Explore combining existing container and image garbage collection logic into disk manager.
+
+Ideally, this phase should be completed before v1.2.
+
+
+#### Milestone 3 - Performance improvements
+
+In this milestone, we will add support for quota and make it opt-in. There should be no user visible changes in this phase.
+
+* Add gid allocation manager to kubelet
+
+* Reconcile gids allocated after restart.
+
+* Configure linux quota automatically on startup. Do not set any limits in this phase.
+
+* Allocate gids for pod volumes, container’s writable layer and logs, and also for image layers.
+
+* Update the docker runtime plugin in kubelet to perform the necessary `chown’s` and `chmod’s` between container creation and startup.
+
+* Pass the allocated gids as supplementary gids to containers.
+
+* Update disk manager in kubelet to use quota when configured.
+
+
+#### Milestone 4 - Users manage local disks
+
+In this milestone, we will make local disk a schedulable resource.
+
+* Finalize volume accounting - is it at the pod level or per-volume.
+
+* Finalize multi-disk management policy. Will additional disks be handled as whole units?
+
+* Set aside some space for image layers and the rest of the infra overhead - node allocatable resources include local disk.
+
+* `du` plugin triggers container or pod eviction whenever usage exceeds limit.
+
+* Quota plugin sets hard limits equal to user specified `limits`.
+
+* Devicemapper plugin resizes writable layer to not exceed the container’s disk `limit`.
+
+* Disk manager evicts pods based on `usage` - `request` delta instead of just QoS class.
+
+* Sufficient integration testing for this feature.
+
+
+### Appendix
+
+
+#### Implementation Notes
+
+The following is a rough outline of the testing I performed to corroborate my prior design ideas.
+
+Test setup information
+
+* Testing was performed on GCE virtual machines
+
+* All the test VMs were using ext4.
+
+* The distribution tested against is mentioned as part of each graph driver's testing notes.
+
+##### AUFS testing notes:
+
+Tested on Debian jessie
+
+1. Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ).
+
+2. Create a new group ‘x’ on the host and enable quota for that group
+
+ 1. `groupadd -g 9000 x`
+
+ 2. `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*)
+
+ 3. `quota -g 9000 -v` // Check that quota is enabled
+
+3. Create a docker container
+
+ 4. `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"`
+
+ 8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d
+
+4. Change group on the writable layer directory for this container
+
+ 5. `chmod a+s /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d`
+
+ 6. `chown :x /var/lib/docker/aufs/diff/8d8c56dcfbf5cda9f9bfec7c6615577753292d9772ab455f581951d9a92d169d`
+
+5. Start the docker container
+
+ 7. `docker start 8d`
+
+ 8. Check usage using quota and group ‘x’
+
+ ```shell
+ $ quota -g x -v
+
+ Disk quotas for group x (gid 9000):
+
+ Filesystem **blocks** quota limit grace files quota limit grace
+
+ /dev/sda1 **10248** 0 0 3 0 0
+ ```
+
+ Using the same workflow, we can add new sticky group IDs to emptyDir volumes and account for their usage against pods, as sketched below.
+
+ Since each container requires a gid for the purposes of quota, we will have to reserve ranges of gids for use by the kubelet. Since kubelet does not checkpoint its state, recovery of group id allocations will be an interesting problem. More on this later.
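+
+A hypothetical sketch for an emptyDir volume; the path layout assumes kubelet's default directory (`/var/lib/kubelet`), and the pod UID, volume name and gid are made-up values:
+
+```shell
+# Apply the sticky-gid workflow to a pod's emptyDir volume.
+pod_uid=11111111-2222-3333-4444-555555555555   # made-up pod UID
+volume=cache-volume                            # made-up volume name
+volume_gid=9002                                # made-up gid allocated for this volume
+
+dir=/var/lib/kubelet/pods/$pod_uid/volumes/kubernetes.io~empty-dir/$volume
+chown -R :$volume_gid "$dir"
+chmod -R a+s "$dir"
+
+# The volume's usage is now visible through the quota subsystem.
+quota -g $volume_gid -v
+```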
+
+Track the space occupied by images after they have been pulled locally as follows.
+
+*Note: This approach requires serialized image pulls to be of any use to the kubelet.*
+
+1. Create a group specifically for the graph driver
+
+ 1. `groupadd -g 9001 docker-images`
+
+2. Update group ownership on the ‘graph’ (tracks image metadata) and ‘storage driver’ directories.
+
+ 2. `chown -R :9001 /var/lib/docker/[overlay | aufs]`
+
+ 3. `chmod a+s /var/lib/docker/[overlay | aufs]`
+
+ 4. `chown -R :9001 /var/lib/docker/graph`
+
+ 5. `chmod a+s /var/lib/docker/graph`
+
+3. Any new images pulled or containers created will be accounted to the `docker-images` group by default.
+
+4. Once we update the group ownership on newly created containers to a different gid, the container writable layer’s specific disk usage gets dropped from this group.
+
+##### Overlayfs
+
+Tested on Ubuntu 15.10.
+
+Overlayfs works similarly to aufs; only the path to the container's writable layer directory changes.
+
+* Setup Linux Quota following this [tutorial](https://www.google.com/url?q=https://www.howtoforge.com/tutorial/linux-quota-ubuntu-debian/&sa=D&ust=1446146816105000&usg=AFQjCNHThn4nwfj1YLoVmv5fJ6kqAQ9FlQ).
+
+* Create a new group ‘x’ on the host and enable quota for that group
+
+ * `groupadd -g 9000 x`
+
+ * `setquota -g 9000 -a 0 100 0 100` // 100 blocks (4096 bytes each*)
+
+ * `quota -g 9000 -v` // Check that quota is enabled
+
+* Create a docker container
+
+ * `docker create -it busybox /bin/sh -c "dd if=/dev/zero of=/file count=10 bs=1M"`
+
+ * `b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61`
+
+* Change group on the writable layer’s directory for this container
+
+ * `chmod -R a+s /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
+
+ * `chown -R :9000 /var/lib/docker/overlay/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
+
+* Check quota before and after running the container.
+
+ ```shell
+ $ quota -g x -v
+
+ Disk quotas for group x (gid 9000):
+
+ Filesystem blocks quota limit grace files quota limit grace
+
+ /dev/sda1 48 0 0 19 0 0
+ ```
+
+ * Start the docker container
+
+ * `docker start b8`
+
+ * ```shell
+ quota -g x -v
+
+ Disk quotas for group x (gid 9000):
+
+ Filesystem **blocks** quota limit grace files quota limit grace
+
+ /dev/sda1 **10288** 0 0 20 0 0
+
+ ```
+
+##### Device mapper
+
+Usage of Linux Quota should be possible for the purposes of volumes and log files.
+
+The devicemapper storage driver in docker uses ["thin targets"](https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt). Underneath there are two block devices - “data” and “metadata” - from which more block devices are created for containers. More information [here](http://www.projectatomic.io/docs/filesystems/).
+
+These devices can be loopback or real storage devices.
+
+The base device has a maximum storage capacity. This means that the sum total of storage space occupied by images and containers cannot exceed this capacity.
+
+By default, all images and containers are created from an initial filesystem with a 10GB limit.
+
+A separate filesystem is created for each container as part of start (not create).
+
+It is possible to [resize](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/) the container filesystem.
+
+For the purposes of image space tracking, we can rely on the pool usage reported by the kernel and subtract the per-container writable layer usage, as illustrated in the testing notes below.
+
+###### Testing notes:
+
+```shell
+$ docker info
+
+...
+
+Storage Driver: devicemapper
+
+ Pool Name: **docker-8:1-268480-pool**
+
+ Pool Blocksize: 65.54 kB
+
+ Backing Filesystem: extfs
+
+ Data file: /dev/loop0
+
+ Metadata file: /dev/loop1
+
+ Data Space Used: 2.059 GB
+
+ Data Space Total: 107.4 GB
+
+ Data Space Available: 48.45 GB
+
+ Metadata Space Used: 1.806 MB
+
+ Metadata Space Total: 2.147 GB
+
+ Metadata Space Available: 2.146 GB
+
+ Udev Sync Supported: true
+
+ Deferred Removal Enabled: false
+
+ Data loop file: /var/lib/docker/devicemapper/devicemapper/data
+
+ Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
+
+ Library Version: 1.02.99 (2015-06-20)
+```
+
+```shell
+$ dmsetup table docker-8\:1-268480-pool
+
+0 209715200 thin-pool 7:1 7:0 **128** 32768 1 skip_block_zeroing
+```
+
+128 is the data block size, in 512-byte sectors (i.e. 64KiB, matching the pool blocksize reported by `docker info`)
+
+Usage from kernel for the primary block device
+
+```shell
+$ dmsetup status docker-8\:1-268480-pool
+
+0 209715200 thin-pool 37 441/524288 **31424/1638400** - rw discard_passdown queue_if_no_space -
+```
+
+Used/Total data blocks - 31424/1638400
+
+Usage in MB = 31424 * 512 * 128 (block size from above) bytes = 1964 MB
+
+Capacity in MB = 1638400 * 512 * 128 bytes = 100 GB
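+
+The same arithmetic can be scripted; a sketch assuming the pool name shown in the `docker info` output above:
+
+```shell
+pool=docker-8:1-268480-pool   # assumption: pool name taken from `docker info`
+
+block_sectors=$(dmsetup table "$pool" | awk '{print $6}')   # data block size in 512-byte sectors, e.g. 128
+used_total=$(dmsetup status "$pool" | awk '{print $6}')     # used/total data blocks, e.g. 31424/1638400
+used_blocks=${used_total%/*}
+total_blocks=${used_total#*/}
+
+echo "pool used:     $(( used_blocks  * block_sectors * 512 / 1024 / 1024 )) MB"
+echo "pool capacity: $(( total_blocks * block_sectors * 512 / 1024 / 1024 )) MB"
+```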
+
+##### Log file accounting
+
+* Setup Linux quota for a container as mentioned above.
+
+* Update group ownership on the following directories to the container-specific group ID created earlier. Adapting the examples above:
+
+    * `chmod -R a+s /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
+
+    * `chown -R :9000 /var/lib/docker/**containers**/b8cc9fae3851f9bcefe922952b7bca0eb33aa31e68e9203ce0639fc9d3f3c61b/*`
+
+##### Testing titbits
+
+* Ubuntu 15.10 doesn’t ship with the quota module on virtual machines. [Install ‘linux-image-extra-virtual’](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work.
+
+* Overlay storage driver needs kernels >= 3.18. I used Ubuntu 15.10 to test Overlayfs.
+
+* If you use a non-default location for docker storage, change `/var/lib/docker` in the examples to your storage location.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/disk-accounting.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/downward_api_resources_limits_requests.md b/contributors/design-proposals/downward_api_resources_limits_requests.md
new file mode 100644
index 00000000..ab17c321
--- /dev/null
+++ b/contributors/design-proposals/downward_api_resources_limits_requests.md
@@ -0,0 +1,622 @@
+# Downward API for resource limits and requests
+
+## Background
+
+Currently the downward API (via environment variables and volume plugin) only
+supports exposing a Pod's name, namespace, annotations, labels and its IP
+([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This
+document explains the need and design to extend them to expose resources
+(e.g. cpu, memory) limits and requests.
+
+## Motivation
+
+Software applications require configuration to work optimally with the resources they're allowed to use.
+Exposing the requested and limited amounts of available resources inside containers will allow
+these applications to be configured more easily. Although docker already
+exposes some of this information inside containers, the downward API helps
+expose this information in a runtime-agnostic manner in Kubernetes.
+
+## Use cases
+
+As an application author, I want to be able to use cpu or memory requests and
+limits to configure the operational requirements of my applications inside containers.
+For example, Java applications expect to be made aware of the available heap size via
+a command line argument to the JVM, such as java -Xmx`<heap-size>`. Similarly, an
+application may want to configure its thread pool based on available cpu resources and
+the exported value of GOMAXPROCS.
+
+## Design
+
+This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473).
+There are three approaches discussed in this document to obtain resources limits
+and requests to be exposed as environment variables and volumes inside
+containers:
+
+1. The first approach requires users to specify full json path selectors
+in which selectors are relative to the pod spec. The benefit of this
+approach is to specify pod-level resources, and since containers are
+also part of a pod spec, it can be used to specify container-level
+resources too.
+
+2. The second approach requires specifying partial json path selectors
+which are relative to the container spec. This approach helps
+in retrieving a container specific resource limits and requests, and at
+the same time, it is simpler to specify than full json path selectors.
+
+3. In the third approach, users specify fixed strings (magic keys) to retrieve
+resources limits and requests and do not specify any json path
+selectors. This approach is similar to the existing downward API
+implementation approach. The advantages of this approach are that it is
+simpler to specify than the first two, and does not require any type of
+conversion between internal and versioned objects or json selectors as
+discussed below.
+
+Before discussing a bit more about merits of each approach, here is a
+brief discussion about json path selectors and some implications related
+to their use.
+
+#### JSONpath selectors
+
+Versioned objects in kubernetes have json tags as part of their golang fields.
+Currently, objects in the internal API have json tags, but it is planned that
+these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933)
+for discussion). So for discussion in this proposal, we assume that
+internal objects do not have json tags. In the first two approaches
+(full and partial json selectors), when a user creates a pod and its
+containers, the user specifies a json path selector in the pod's
+spec to retrieve values of its limits and requests. The selector
+is composed of json tags similar to json paths used with kubectl
+([json](http://kubernetes.io/docs/user-guide/jsonpath/)). This proposal
+uses kubernetes' json path library to process the selectors to retrieve
+the values. As kubelet operates on internal objects (without json tags),
+and the selectors are part of versioned objects, retrieving values of
+the limits and requests can be handled using these two solutions:
+
+1. By converting an internal object to versioned object, and then using
+the json path library to retrieve the values from the versioned object
+by processing the selector.
+
+2. By converting a json selector of the versioned objects to internal
+object's golang expression and then using the json path library to
+retrieve the values from the internal object by processing the golang
+expression. However, converting a json selector of the versioned objects
+to internal object's golang expression will still require an instance
+of the versioned object, so it seems like more work than the first solution
+unless there is another way that does not require the versioned object.
+
+So there is a one time conversion cost associated with the first (full
+path) and second (partial path) approaches, whereas the third approach
+(magic keys) does not require any such conversion and can directly
+work on internal objects. If we want to avoid the conversion cost and
+keep the implementation simple, my opinion is that the magic keys approach
+is the easiest way to expose limits and requests with the
+least impact on existing functionality.
+
+To summarize merits/demerits of each approach:
+
+|Approach | Scope | Conversion cost | JSON selectors | Future extension|
+| ---------- | ------------------- | -------------------| ------------------- | ------------------- |
+|Full selectors | Pod/Container | Yes | Yes | Possible |
+|Partial selectors | Container | Yes | Yes | Possible |
+|Magic keys | Container | No | No | Possible|
+
+Note: pod resources can always be accessed using the existing `type ObjectFieldSelector` object
+in conjunction with the partial selectors and magic keys approaches.
+
+### API with full JSONpath selectors
+
+Full json path selectors specify the complete path to the resources
+limits and requests relative to pod spec.
+
+#### Environment variables
+
+This table shows how selectors can be used for various requests and
+limits to be exposed as environment variables. Environment variable names
+are examples only and not necessarily as specified, and the selectors do not
+have to start with a dot. A kubectl example follows the table.
+
+| Env Var Name | Selector |
+| ---- | ------------------- |
+| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
+| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory|
+| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
+| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory |
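+
+These selectors use the same jsonpath syntax that kubectl already understands, so an expression can be tried out against a live pod before wiring it into a pod spec; for example, assuming a pod named `dapi-test-pod` with a container `test-container` as in the examples below:
+
+```shell
+kubectl get pod dapi-test-pod \
+  -o jsonpath='{.spec.containers[?(@.name=="test-container")].resources.limits.cpu}'
+```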
+
+#### Volume plugin
+
+This table shows how selectors can be used for various requests and
+limits to be exposed as volumes. The path names are examples only and
+not necessarily as specified, and the selectors do not have to start with dot.
+
+
+| Path | Selector |
+| ---- | ------------------- |
+| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
+| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory|
+| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
+| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory|
+
+Volumes are pod scoped, so a selector must be specified with a container name.
+
+Full json path selectors will use existing `type ObjectFieldSelector`
+to extend the current implementation for resources requests and limits.
+
+```
+// ObjectFieldSelector selects an APIVersioned field of an object.
+type ObjectFieldSelector struct {
+ APIVersion string `json:"apiVersion"`
+ // Required: Path of the field to select in the specified API version
+ FieldPath string `json:"fieldPath"`
+}
+```
+
+#### Examples
+
+These examples show how to use full selectors with environment variables and volume plugin.
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: dapi-test-pod
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh","-c", "env" ]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ env:
+ - name: CPU_LIMIT
+ valueFrom:
+ fieldRef:
+ fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu
+```
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: kubernetes-downwardapi-volume-example
+spec:
+ containers:
+ - name: client-container
+ image: gcr.io/google_containers/busybox
+ command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi;sleep 5; done"]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ volumeMounts:
+ - name: podinfo
+ mountPath: /etc
+ readOnly: false
+ volumes:
+ - name: podinfo
+ downwardAPI:
+ items:
+ - path: "cpu_limit"
+ fieldRef:
+ fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu
+```
+
+#### Validations
+
+For APIs with full json path selectors, verify that selectors are
+valid relative to pod spec.
+
+
+### API with partial JSONpath selectors
+
+Partial json path selectors specify paths to resources limits and requests
+relative to the container spec. These will be implemented by introducing a
+`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current
+implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.
+
+```
+// ContainerSpecFieldSelector selects an APIVersioned field of an object.
+type ContainerSpecFieldSelector struct {
+ APIVersion string `json:"apiVersion"`
+ // Container name
+ ContainerName string `json:"containerName,omitempty"`
+ // Required: Path of the field to select in the specified API version
+ FieldPath string `json:"fieldPath"`
+}
+
+// Represents a single file containing information from the downward API
+type DownwardAPIVolumeFile struct {
+	// Required: Path is the relative path name of the file to be created.
+	Path string `json:"path"`
+	// Selects a field of the pod: only annotations, labels, name and
+	// namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a field of the container: only resources limits and requests
+	// (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
+	// resources.requests.memory) are currently supported.
+	ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
+}
+
+// EnvVarSource represents a source for the value of an EnvVar.
+// Only one of its fields may be set.
+type EnvVarSource struct {
+ // Selects a field of the container: only resources limits and requests
+ // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
+ // resources.requests.memory) are currently supported.
+ ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
+ // Selects a field of the pod; only name and namespace are supported.
+ FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+ // Selects a key of a ConfigMap.
+ ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
+ // Selects a key of a secret in the pod's namespace.
+ SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
+}
+```
+
+#### Environment variables
+
+This table shows how partial selectors can be used for various requests and
+limits to be exposed as environment variables. Environment variable names
+are examples only and not necessarily as specified, and the selectors do not
+have to start with dot.
+
+| Env Var Name | Selector |
+| -------------------- | -------------------|
+| CPU_LIMIT | resources.limits.cpu |
+| MEMORY_LIMIT | resources.limits.memory |
+| CPU_REQUEST | resources.requests.cpu |
+| MEMORY_REQUEST | resources.requests.memory |
+
+Since environment variables are container scoped, it is optional
+to specify container name as part of the partial selectors as they are
+relative to container spec. If container name is not specified, then
+it defaults to current container. However, container name could be specified
+to expose variables from other containers.
+
+#### Volume plugin
+
+This table shows volume paths and partial selectors used for resources cpu and memory.
+Volume path names are examples only and not necessarily as specified, and the
+selectors do not have to start with dot.
+
+| Path | Selector |
+| -------------------- | -------------------|
+| cpu_limit | resources.limits.cpu |
+| memory_limit | resources.limits.memory |
+| cpu_request | resources.requests.cpu |
+| memory_request | resources.requests.memory |
+
+Volumes are pod scoped, so the container name must be specified as part of
+`containerSpecFieldRef`.
+
+#### Examples
+
+These examples show how to use partial selectors with environment variables and volume plugin.
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: dapi-test-pod
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh","-c", "env" ]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ env:
+ - name: CPU_LIMIT
+ valueFrom:
+ containerSpecFieldRef:
+ fieldPath: resources.limits.cpu
+```
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: kubernetes-downwardapi-volume-example
+spec:
+ containers:
+ - name: client-container
+ image: gcr.io/google_containers/busybox
+ command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ volumeMounts:
+ - name: podinfo
+ mountPath: /etc
+ readOnly: false
+ volumes:
+ - name: podinfo
+ downwardAPI:
+ items:
+ - path: "cpu_limit"
+ containerSpecFieldRef:
+ containerName: "client-container"
+ fieldPath: resources.limits.cpu
+```
+
+#### Validations
+
+For APIs with partial json path selectors, verify
+that selectors are valid relative to container spec.
+Also verify that container name is provided with volumes.
+
+
+### API with magic keys
+
+In this approach, users specify fixed strings (or magic keys) to retrieve resources
+limits and requests. This approach is similar to the existing downward
+API implementation approach. The fixed strings used for resources limits and requests
+for cpu and memory are `limits.cpu`, `limits.memory`,
+`requests.cpu` and `requests.memory`. Though these strings look the same
+as json path selectors, they are processed as fixed strings. These will be implemented by
+introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current
+implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.
+
+The fields in ResourceFieldSelector are `containerName` to specify the name of a
+container, `resource` to specify the type of a resource (cpu or memory), and `divisor`
+to specify the output format of values of exposed resources. The default value of divisor
+is `1` which means cores for cpu and bytes for memory. For cpu, divisor's valid
+values are `1m` (millicores), `1`(cores), and for memory, the valid values in fixed point integer
+(decimal) are `1`(bytes), `1k`(kilobytes), `1M`(megabytes), `1G`(gigabytes),
+`1T`(terabytes), `1P`(petabytes), `1E`(exabytes), and in their power-of-two equivalents `1Ki`(kibibytes),
+`1Mi`(mebibytes), `1Gi`(gibibytes), `1Ti`(tebibytes), `1Pi`(pebibytes), `1Ei`(exbibytes).
+For more information about these resource formats, [see details](resources.md).
+
+Also, the exposed values will be the `ceiling` of the actual values in the format requested by the divisor.
+For example, if requests.cpu is `250m` (250 millicores) and the divisor by default is `1`, then
+exposed value will be `1` core. It is because 250 millicores when converted to cores will be 0.25 and
+the ceiling of 0.25 is 1.
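+
+A small sketch of that rounding in shell, using the raw values from the example (250 millicores with the default divisor of `1` core, and 128Mi of memory with a `1Mi` divisor):
+
+```shell
+# cpu: ceiling of 250m expressed in whole cores.
+millicores=250
+echo $(( (millicores + 999) / 1000 ))          # prints 1
+
+# memory: 128Mi exposed with a 1Mi divisor needs no rounding.
+echo $(( 128 * 1024 * 1024 / (1024 * 1024) ))  # prints 128
+```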
+
+```
+type ResourceFieldSelector struct {
+ // Container name
+ ContainerName string `json:"containerName,omitempty"`
+ // Required: Resource to select
+ Resource string `json:"resource"`
+ // Specifies the output format of the exposed resources
+ Divisor resource.Quantity `json:"divisor,omitempty"`
+}
+
+// Represents a single file containing information from the downward API
+type DownwardAPIVolumeFile struct {
+	// Required: Path is the relative path name of the file to be created.
+	Path string `json:"path"`
+	// Selects a field of the pod: only annotations, labels, name and
+	// namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a resource of the container: only resources limits and requests
+	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
+	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
+}
+
+// EnvVarSource represents a source for the value of an EnvVar.
+// Only one of its fields may be set.
+type EnvVarSource struct {
+ // Selects a resource of the container: only resources limits and requests
+ // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
+ ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
+ // Selects a field of the pod; only name and namespace are supported.
+ FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+ // Selects a key of a ConfigMap.
+ ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
+ // Selects a key of a secret in the pod's namespace.
+ SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
+}
+```
+
+#### Environment variables
+
+This table shows environment variable names and strings used for resources cpu and memory.
+The variable names are examples only and not necessarily as specified.
+
+| Env Var Name | Resource |
+| -------------------- | -------------------|
+| CPU_LIMIT | limits.cpu |
+| MEMORY_LIMIT | limits.memory |
+| CPU_REQUEST | requests.cpu |
+| MEMORY_REQUEST | requests.memory |
+
+Since environment variables are container scoped, it is optional
+to specify container name as part of the partial selectors as they are
+relative to container spec. If container name is not specified, then
+it defaults to current container. However, container name could be specified
+to expose variables from other containers.
+
+#### Volume plugin
+
+This table shows volume paths and strings used for resources cpu and memory.
+Volume path names are examples only and not necessarily as specified.
+
+| Path | Resource |
+| -------------------- | -------------------|
+| cpu_limit | limits.cpu |
+| memory_limit | limits.memory|
+| cpu_request | requests.cpu |
+| memory_request | requests.memory |
+
+Volumes are pod scoped, so the container name must be specified as part of
+`resourceFieldRef`.
+
+#### Examples
+
+These examples show how to use magic keys approach with environment variables and volume plugin.
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: dapi-test-pod
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh","-c", "env" ]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ env:
+ - name: CPU_LIMIT
+ valueFrom:
+ resourceFieldRef:
+ resource: limits.cpu
+ - name: MEMORY_LIMIT
+ valueFrom:
+ resourceFieldRef:
+ resource: limits.memory
+ divisor: "1Mi"
+```
+
+In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively.
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: kubernetes-downwardapi-volume-example
+spec:
+ containers:
+ - name: client-container
+ image: gcr.io/google_containers/busybox
+ command: ["sh", "-c","while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
+ resources:
+ requests:
+ memory: "64Mi"
+ cpu: "250m"
+ limits:
+ memory: "128Mi"
+ cpu: "500m"
+ volumeMounts:
+ - name: podinfo
+ mountPath: /etc
+ readOnly: false
+ volumes:
+ - name: podinfo
+ downwardAPI:
+ items:
+ - path: "cpu_limit"
+ resourceFieldRef:
+ containerName: client-container
+ resource: limits.cpu
+ divisor: "1m"
+ - path: "memory_limit"
+ resourceFieldRef:
+ containerName: client-container
+ resource: limits.memory
+```
+
+In the above example, the exposed values of the `cpu_limit` and `memory_limit` files will be 500 (in millicores) and 134217728 (in bytes), respectively.
+
+
+#### Validations
+
+For APIs with magic keys, verify that each resource string is valid and is one
+of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`.
+Also verify that the container name is provided with volumes.
+
+## Pod-level and container-level resource access
+
+Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with `type ObjectFieldSelector` object in
+all approaches. Container-level resources will be accessed by `type ObjectFieldSelector`
+with full selector approach; and by `type ContainerSpecFieldRef` and `type ResourceFieldRef`
+with partial and magic keys approaches, respectively. The following table
+summarizes resource access with these approaches.
+
+| Approach | Pod resources| Container resources |
+| -------------------- | -------------------|-------------------|
+| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector`|
+| Partial selectors | `ObjectFieldSelector`| `ContainerSpecFieldRef` |
+| Magic keys | `ObjectFieldSelector`| `ResourceFieldRef` |
+
+## Output format
+
+The output format for resources limits and requests will be the same as
+the cgroups output format, i.e. cpu in cpu shares (cores multiplied by 1024
+and rounded to an integer) and memory in bytes. For example, a memory request
+or limit of `64Mi` in the container spec will be output as `67108864`
+bytes, and a cpu request or limit of `250m` (millicores) will be output as
+`256` cpu shares.
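+
+A quick sanity check of those numbers as a shell sketch:
+
+```shell
+# 64Mi expressed in bytes.
+echo $(( 64 * 1024 * 1024 ))     # 67108864
+
+# 250m (0.25 cores) expressed in cpu shares (cores * 1024).
+echo $(( 250 * 1024 / 1000 ))    # 256
+```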
+
+## Implementation approach
+
+The current implementation of this proposal will focus on the API with magic keys
+approach. The main reason for selecting this approach is that it might be
+easier to incorporate and extend resource specific functionality.
+
+## Applied example
+
+Here we discuss how to use exposed resource values to set, for example, Java
+memory size or GOMAXPROCS for your applications. Let's say you expose a container's
+(running an application like tomcat, for example) requested memory as the `HEAP_SIZE`
+environment variable and its requested cpu as CPU_LIMIT (which could be GOMAXPROCS directly).
+One way to set the heap size or cpu for this application would be to wrap the binary
+in a shell script, and then export `JAVA_OPTS` (assuming your container image supports it)
+and GOMAXPROCS environment variables inside the container image. The spec file for the
+application pod could look like:
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+ name: kubernetes-downwardapi-volume-example
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh","-c", "env" ]
+ resources:
+ requests:
+ memory: "64M"
+ cpu: "250m"
+ limits:
+ memory: "128M"
+ cpu: "500m"
+ env:
+ - name: HEAP_SIZE
+ valueFrom:
+ resourceFieldRef:
+ resource: requests.memory
+ - name: CPU_LIMIT
+ valueFrom:
+ resourceFieldRef:
+ resource: requests.cpu
+```
+
+Note that the value of divisor by default is `1`. Now inside the container,
+the HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as:
+
+```
+export JAVA_OPTS="$JAVA_OPTS -Xmx${HEAP_SIZE}"
+
+and
+
+export GOMAXPROCS="${CPU_LIMIT}"
+```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/dramatically-simplify-cluster-creation.md b/contributors/design-proposals/dramatically-simplify-cluster-creation.md
new file mode 100644
index 00000000..d5bc8a38
--- /dev/null
+++ b/contributors/design-proposals/dramatically-simplify-cluster-creation.md
@@ -0,0 +1,266 @@
+# Proposal: Dramatically Simplify Kubernetes Cluster Creation
+
+> ***Please note: this proposal doesn't reflect final implementation, it's here for the purpose of capturing the original ideas.***
+> ***You should probably [read the `kubeadm` docs](http://kubernetes.io/docs/getting-started-guides/kubeadm/) to understand the end result of this effort.***
+
+Luke Marsden & many others in [SIG-cluster-lifecycle](https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle).
+
+17th August 2016
+
+*This proposal aims to capture the latest consensus and plan of action of SIG-cluster-lifecycle. It should satisfy the first bullet point [required by the feature description](https://github.com/kubernetes/features/issues/11).*
+
+See also: [this presentation to community hangout on 4th August 2016](https://docs.google.com/presentation/d/17xrFxrTwqrK-MJk0f2XCjfUPagljG7togXHcC39p0sM/edit?ts=57a33e24#slide=id.g158d2ee41a_0_76)
+
+## Motivation
+
+Kubernetes is hard to install, and there are many different ways to do it today. None of them are excellent. We believe this is hindering adoption.
+
+## Goals
+
+Have one recommended, official, tested, "happy path" which will enable a majority of new and existing Kubernetes users to:
+
+* Kick the tires and easily turn up a new cluster on infrastructure of their choice
+
+* Get a reasonably secure, production-ready cluster, with reasonable defaults and a range of easily-installable add-ons
+
+We plan to do so by improving and simplifying Kubernetes itself, rather than building lots of tooling which "wraps" Kubernetes by poking all the bits into the right place.
+
+## Scope of project
+
+There are logically 3 steps to deploying a Kubernetes cluster:
+
+1. *Provisioning*: Getting some servers - these may be VMs on a developer's workstation, VMs in public clouds, or bare-metal servers in a user's data center.
+
+2. *Install & Discovery*: Installing the Kubernetes core components on those servers (kubelet, etc) - and bootstrapping the cluster to a state of basic liveness, including allowing each server in the cluster to discover other servers: for example teaching etcd servers about their peers, having TLS certificates provisioned, etc.
+
+3. *Add-ons*: Now that basic cluster functionality is working, installing add-ons such as DNS or a pod network (should be possible using kubectl apply).
+
+Notably, this project is *only* working on dramatically improving 2 and 3 from the perspective of users typing commands directly into root shells of servers. The reason for this is that there are a great many different ways of provisioning servers, and users will already have their own preferences.
+
+What's more, once we've radically improved the user experience of 2 and 3, it will make the job of tools that want to do all three much easier.
+
+## User stories
+
+### Phase I
+
+**_In time to be an alpha feature in Kubernetes 1.4._**
+
+Note: the current plan is to deliver `kubeadm` which implements these stories as "alpha" packages built from master (after the 1.4 feature freeze), but which are capable of installing a Kubernetes 1.4 cluster.
+
+* *Install*: As a potential Kubernetes user, I can deploy a Kubernetes 1.4 cluster on a handful of computers running Linux and Docker by typing two commands on each of those computers. The process is so simple that it becomes obvious to me how to easily automate it if I so wish.
+
+* *Pre-flight check*: If any of the computers don't have working dependencies installed (e.g. bad version of Docker, too-old Linux kernel), I am informed early on and given clear instructions on how to fix it so that I can keep trying until it works.
+
+* *Control*: Having provisioned a cluster, I can gain user credentials which allow me to remotely control it using kubectl.
+
+* *Install-addons*: I can select from a set of recommended add-ons to install directly after installing Kubernetes on my set of initial computers with kubectl apply.
+
+* *Add-node*: I can add another computer to the cluster.
+
+* *Secure*: As an attacker with (presumed) control of the network, I cannot add malicious nodes I control to the cluster created by the user. I also cannot remotely control the cluster.
+
+### Phase II
+
+**_In time for Kubernetes 1.5:_**
+*Everything from Phase I as beta/stable feature, everything else below as beta feature in Kubernetes 1.5.*
+
+* *Upgrade*: Later, when Kubernetes 1.4.1 or any newer release is published, I can upgrade to it by typing one other command on each computer.
+
+* *HA*: If one of the computers in the cluster fails, the cluster carries on working. I can find out how to replace the failed computer, including if the computer was one of the masters.
+
+## Top-down view: UX for Phase I items
+
+We will introduce a new binary, kubeadm, which ships with the Kubernetes OS packages (and binary tarballs, for OSes without package managers).
+
+```
+laptop$ kubeadm --help
+kubeadm: bootstrap a secure kubernetes cluster easily.
+
+ /==========================================================\
+ | KUBEADM IS ALPHA, DO NOT USE IT FOR PRODUCTION CLUSTERS! |
+ | |
+ | But, please try it out! Give us feedback at: |
+ | https://github.com/kubernetes/kubernetes/issues |
+ | and at-mention @kubernetes/sig-cluster-lifecycle |
+ \==========================================================/
+
+Example usage:
+
+ Create a two-machine cluster with one master (which controls the cluster),
+ and one node (where workloads, like pods and containers run).
+
+ On the first machine
+ ====================
+ master# kubeadm init master
+ Your token is: <token>
+
+ On the second machine
+ =====================
+ node# kubeadm join node --token=<token> <ip-of-master>
+
+Usage:
+ kubeadm [command]
+
+Available Commands:
+ init Run this on the first server you deploy onto.
+ join Run this on other servers to join an existing cluster.
+ user Get initial admin credentials for a cluster.
+ manual Advanced, less-automated functionality, for power users.
+
+Use "kubeadm [command] --help" for more information about a command.
+```
+
+### Install
+
+*On first machine:*
+
+```
+master# kubeadm init master
+Initializing kubernetes master... [done]
+Cluster token: 73R2SIPM739TNZOA
+Run the following command on machines you want to become nodes:
+ kubeadm join node --token=73R2SIPM739TNZOA <master-ip>
+You can now run kubectl here.
+```
+
+*On N "node" machines:*
+
+```
+node# kubeadm join node --token=73R2SIPM739TNZOA <master-ip>
+Initializing kubernetes node... [done]
+Bootstrapping certificates... [done]
+Joined node to cluster, see 'kubectl get nodes' on master.
+```
+
+Note `[done]` would be colored green in all of the above.
+
+### Install: alternative for automated deploy
+
+*The user (or their config management system) creates a token and passes the same one to both init and join.*
+
+```
+master# kubeadm init master --token=73R2SIPM739TNZOA
+Initializing kubernetes master... [done]
+You can now run kubectl here.
+```
+
+### Pre-flight check
+
+```
+master# kubeadm init master
+Error: socat not installed. Unable to proceed.
+```
+
+### Control
+
+*On master, after Install, kubectl is automatically able to talk to localhost:8080:*
+
+```
+master# kubectl get pods
+[normal kubectl output]
+```
+
+*To mint new user credentials on the master:*
+
+```
+master# kubeadm user create -o kubeconfig-bob bob
+
+Waiting for cluster to become ready... [done]
+Creating user certificate for user... [done]
+Waiting for user certificate to be signed... [done]
+Your cluster configuration file has been saved in kubeconfig.
+
+laptop# scp <master-ip>:/root/kubeconfig-bob ~/.kubeconfig
+laptop# kubectl get pods
+[normal kubectl output]
+```
+
+### Install-addons
+
+*Using CNI network as example:*
+
+```
+master# kubectl apply --purge -f \
+ https://git.io/kubernetes-addons/<X>.yaml
+[normal kubectl apply output]
+```
+
+### Add-node
+
+*Same as Install – "on node machines".*
+
+### Secure
+
+```
+node# kubeadm join --token=GARBAGE node <master-ip>
+Unable to join mesh network. Check your token.
+```
+
+## Work streams – critical path – must have in 1.4 before feature freeze
+
+1. [TLS bootstrapping](https://github.com/kubernetes/features/issues/43) - so that kubeadm can mint credentials for kubelets and users
+
+ * Requires [#25764](https://github.com/kubernetes/kubernetes/pull/25764) and auto-signing [#30153](https://github.com/kubernetes/kubernetes/pull/30153) but does not require [#30094](https://github.com/kubernetes/kubernetes/pull/30094).
+ * @philips, @gtank & @yifan-gu
+
+1. Fix for [#30515](https://github.com/kubernetes/kubernetes/issues/30515) - so that kubeadm can install a kubeconfig which kubelet then picks up
+
+ * @smarterclayton
+
+## Work streams – can land after 1.4 feature freeze
+
+1. [Debs](https://github.com/kubernetes/release/pull/35) and [RPMs](https://github.com/kubernetes/release/pull/50) (and binaries?) - so that kubernetes can be installed in the first place
+
+ * @mikedanese & @dgoodwin
+
+1. [kubeadm implementation](https://github.com/lukemarsden/kubernetes/tree/kubeadm-scaffolding) - the kubeadm CLI itself, will get bundled into "alpha" kubeadm packages
+
+ * @lukemarsden & @errordeveloper
+
+1. [Implementation of JWS server](https://github.com/jbeda/kubernetes/blob/discovery-api/docs/proposals/super-simple-discovery-api.md#method-jws-token) from [#30707](https://github.com/kubernetes/kubernetes/pull/30707) - so that we can implement the simple UX with no dependencies
+
+ * @jbeda & @philips?
+
+1. Documentation - so that new users can see this in 1.4 (even if it’s caveated with alpha/experimental labels and flags all over it)
+
+ * @lukemarsden
+
+1. `kubeadm` alpha packages
+
+ * @lukemarsden, @mikedanese, @dgoodwin
+
+### Nice to have
+
+1. [Kubectl apply --purge](https://github.com/kubernetes/kubernetes/pull/29551) - so that addons can be maintained using k8s infrastructure
+
+ * @lukemarsden & @errordeveloper
+
+## kubeadm implementation plan
+
+Based on [@philips' comment here](https://github.com/kubernetes/kubernetes/pull/30361#issuecomment-239588596).
+The key point with this implementation plan is that it requires basically no changes to kubelet except [#30515](https://github.com/kubernetes/kubernetes/issues/30515).
+It also doesn't require kubelet to do TLS bootstrapping - kubeadm handles that.
+
+### kubeadm init master
+
+1. User installs and configures kubelet to look for manifests in `/etc/kubernetes/manifests`
+1. API server CA certs are generated by kubeadm
+1. kubeadm generates pod manifests to launch API server and etcd
+1. kubeadm pushes a replica set for the prototype jws-server and the JWS into the API server with host-networking so it is listening on the master node IP
+1. kubeadm prints out the IP of JWS server and JWS token
+
+### kubeadm join node --token IP
+
+1. User installs and configures kubelet to have a kubeconfig at `/var/lib/kubelet/kubeconfig` but the kubelet is in a crash loop and is restarted by host init system
+1. kubeadm talks to the jws-server at IP with the token and gets the CA cert, then talks to the apiserver TLS bootstrap API to get a client cert, etc., and generates a kubelet kubeconfig
+1. kubeadm places kubeconfig into `/var/lib/kubelet/kubeconfig` and waits for kubelet to restart
+1. Mission accomplished, we think.
+
+## See also
+
+* [Joe Beda's "K8s the hard way easier"](https://docs.google.com/document/d/1lJ26LmCP-I_zMuqs6uloTgAnHPcuT7kOYtQ7XSgYLMA/edit#heading=h.ilgrv18sg5t) which combines Kelsey's "Kubernetes the hard way" with history of proposed UX at the end (scroll all the way down to the bottom).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/dramatically-simplify-cluster-creation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/enhance-pluggable-policy.md b/contributors/design-proposals/enhance-pluggable-policy.md
new file mode 100644
index 00000000..2468d3c1
--- /dev/null
+++ b/contributors/design-proposals/enhance-pluggable-policy.md
@@ -0,0 +1,429 @@
+# Enhance Pluggable Policy
+
+While trying to develop an authorization plugin for Kubernetes, we found a few
+places where API extensions would ease development and add power. There are a
+few goals:
+ 1. Provide an authorization plugin that can evaluate a .Authorize() call based
+on the full content of the request to RESTStorage. This includes information
+like the full verb, the content of creates and updates, and the names of
+resources being acted upon.
+ 1. Provide a way to ask whether a user is permitted to take an action without
+ running in process with the API Authorizer. For instance, a proxy for exec
+ calls could ask whether a user can run the exec they are requesting.
+ 1. Provide a way to ask who can perform a given action on a given resource.
+This is useful for answering questions like, "who can create replication
+controllers in my namespace".
+
+This proposal adds to and extends the existing API so that authorizers may
+provide the functionality described above. It does not attempt to describe how
+the policies themselves can be expressed; that is up to the authorization
+plugins themselves.
+
+
+## Enhancements to existing Authorization interfaces
+
+The existing Authorization interfaces are described
+[here](../admin/authorization.md). A couple additions will allow the development
+of an Authorizer that matches based on different rules than the existing
+implementation.
+
+### Request Attributes
+
+The existing authorizer.Attributes only has 5 attributes (user, groups,
+isReadOnly, kind, and namespace). If we add more detailed verbs, content, and
+resource names, then Authorizer plugins will have the same level of information
+available to RESTStorage components in order to express more detailed policy.
+The replacement excerpt is below.
+
+An API request has the following attributes that can be considered for
+authorization:
+ - user - the user-string which a user was authenticated as. This is included
+in the Context.
+ - groups - the groups to which the user belongs. This is included in the
+Context.
+ - verb - string describing the requesting action. Today we have: get, list,
+watch, create, update, and delete. The old `readOnly` behavior is equivalent to
+allowing get, list, watch.
+ - namespace - the namespace of the object being accessed, or the empty string if
+the endpoint does not support namespaced objects. This is included in the
+Context.
+ - resourceGroup - the API group of the resource being accessed
+ - resourceVersion - the API version of the resource being accessed
+ - resource - which resource is being accessed
+ - applies only to the API endpoints, such as `/api/v1beta1/pods`. For
+miscellaneous endpoints, like `/version`, the resource is the empty string.
+ - resourceName - the name of the resource during a get, update, or delete
+action.
+ - subresource - which subresource is being accessed
+
+A non-API request has 2 attributes:
+ - verb - the HTTP verb of the request
+ - path - the path of the URL being requested
+
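+For illustration only, the expanded attribute set could be surfaced to plugins
+through an interface along these lines (a sketch; the method names and the
+`Attributes` shape here are assumptions, not part of the final API):
+
+```go
+// Attributes (sketch) exposes the request information listed above to
+// authorization plugins.
+type Attributes interface {
+  GetUser() string
+  GetGroups() []string
+  GetVerb() string
+
+  // Resource (API endpoint) requests.
+  GetNamespace() string
+  GetResourceGroup() string
+  GetResourceVersion() string
+  GetResource() string
+  GetSubresource() string
+  GetResourceName() string
+
+  // Non-API requests.
+  GetPath() string
+}
+```
+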
+
+### Authorizer Interface
+
+The existing Authorizer interface is very simple, but there isn't a way to
+provide details about allows, denies, or failures. The extended detail is useful
+for UIs that want to describe why certain actions are allowed or disallowed. Not
+all Authorizers will want to provide that information, but for those that do,
+having that capability is useful. In addition, adding a `GetAllowedSubjects`
+method that returns back the users and groups that can perform a particular
+action makes it possible to answer questions like, "who can see resources in my
+namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down).
+
+```go
+// OLD
+type Authorizer interface {
+ Authorize(a Attributes) error
+}
+```
+
+```go
+// NEW
+// Authorizer provides the ability to determine if a particular user can perform
+// a particular action
+type Authorizer interface {
+ // Authorize takes a Context (for namespace, user, and traceability) and
+ // Attributes to make a policy determination.
+ // reason is an optional return value that can describe why a policy decision
+ // was made. Reasons are useful during debugging when trying to figure out
+ // why a user or group has access to perform a particular action.
+ Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error)
+}
+
+// AuthorizerIntrospection is an optional interface that provides the ability to
+// determine which users and groups can perform a particular action. This is
+// useful for building caches of who can see what. For instance, "which
+// namespaces can this user see". That would allow someone to see only the
+// namespaces they are allowed to view instead of having to choose between
+// listing them all or listing none.
+type AuthorizerIntrospection interface {
+ // GetAllowedSubjects takes a Context (for namespace and traceability) and
+ // Attributes to determine which users and groups are allowed to perform the
+ // described action in the namespace. This API enables the ResourceBasedReview
+ // requests below
+ GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error)
+}
+```
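+
+For illustration, a caller in the API server request path could consume the new
+return values as sketched below (the `authorizeRequest` helper and its error
+handling are assumptions for this example, not part of the proposal):
+
+```go
+// Sketch: turn the (allowed, reason, evaluationError) triple into a request error.
+func authorizeRequest(authz Authorizer, ctx api.Context, attrs Attributes) error {
+  allowed, reason, evalErr := authz.Authorize(ctx, attrs)
+  if evalErr != nil {
+    // An evaluation error is not a denial; surface it separately (e.g. as a 500).
+    return fmt.Errorf("authorization evaluation error: %v", evalErr)
+  }
+  if !allowed {
+    // The optional reason helps explain denials in API responses and audit logs.
+    return fmt.Errorf("forbidden: %s", reason)
+  }
+  return nil
+}
+```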
+
+### SubjectAccessReviews
+
+This set of APIs answers the question: can a user or group (use authenticated
+user if none is specified) perform a given action. Given the Authorizer
+interface (proposed or existing), this endpoint can be implemented generically
+against any Authorizer by creating the correct Attributes and making an
+.Authorize() call.
+
+There are three different flavors:
+
+1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this
+checks to see if a specified user or group can perform a given action at the
+cluster scope or across all namespaces. This is a highly privileged operation.
+It allows a cluster-admin to inspect rights of any person across the entire
+cluster and against cluster level resources.
+2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` -
+this checks to see if the current user (including his groups) can perform a
+given action at any specified scope. This is an unprivileged operation. It
+doesn't expose any information that a user couldn't discover simply by trying an
+endpoint themselves.
+3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` -
+this checks to see if a specified user or group can perform a given action in
+**this** namespace. This is a moderately privileged operation. In a multi-tenant
+environment, having a namespace scoped resource makes it very easy to reason
+about powers granted to a namespace admin. It gives a namespace admin (someone
+able to manage permissions inside one namespace, but not all namespaces) the
+power to inspect whether a given user or group can manipulate resources in
+that namespace.
+
+SubjectAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a SubjectAccessReview to this URL and gets
+a SubjectAccessReviewResponse back. Here is an example of a call and its
+corresponding return:
+
+```
+// input
+{
+ "kind": "SubjectAccessReview",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "authorizationAttributes": {
+ "verb": "create",
+ "resource": "pods",
+ "user": "Clark",
+ "groups": ["admins", "managers"]
+ }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json
+// or
+accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject)
+
+// output
+{
+ "kind": "SubjectAccessReviewResponse",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "allowed": true
+}
+```
+
+PersonalSubjectAccessReview is a runtime.Object with associated RESTStorage
+that only accepts creates. The caller POSTs a PersonalSubjectAccessReview to
+this URL and gets a PersonalSubjectAccessReviewResponse back. Here is an
+example of a call and its corresponding return:
+
+```
+// input
+{
+ "kind": "PersonalSubjectAccessReview",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "authorizationAttributes": {
+ "verb": "create",
+ "resource": "pods",
+ "namespace": "any-ns",
+ }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json
+// or
+accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(subjectAccessReviewObject)
+
+// output
+{
+ "kind": "PersonalSubjectAccessReviewResponse",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "allowed": true
+}
+```
+
+LocalSubjectAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and
+gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and
+its corresponding return:
+
+```
+// input
+{
+ "kind": "LocalSubjectAccessReview",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "namespace": "my-ns"
+ "authorizationAttributes": {
+ "verb": "create",
+ "resource": "pods",
+ "user": "Clark",
+ "groups": ["admins", "managers"]
+ }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/localSubjectAccessReviews -d @local-subject-access-review.json
+// or
+accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject)
+
+// output
+{
+ "kind": "LocalSubjectAccessReviewResponse",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "namespace": "my-ns"
+ "allowed": true
+}
+```
+
+The actual Go objects look like this:
+
+```go
+type AuthorizationAttributes struct {
+ // Namespace is the namespace of the action being requested. Currently, there
+ // is no distinction between no namespace and all namespaces
+ Namespace string `json:"namespace" description:"namespace of the action being requested"`
+ // Verb is one of: get, list, watch, create, update, delete
+ Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`
+  // ResourceGroup is the API group of the resource being requested
+ ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"`
+  // ResourceVersion is the API version of the resource being requested
+ ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
+ // Resource is one of the existing resource types
+ Resource string `json:"resource" description:"one of the existing resource types"`
+ // ResourceName is the name of the resource being requested for a "get" or
+ // deleted for a "delete"
+ ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
+ // Subresource is one of the existing subresources types
+ Subresource string `json:"subresource" description:"one of the existing subresources"`
+}
+
+// SubjectAccessReview is an object for requesting information about whether a
+// user or group can perform an action
+type SubjectAccessReview struct {
+ kapi.TypeMeta `json:",inline"`
+
+ // AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+ // User is optional, but at least one of User or Groups must be specified
+ User string `json:"user" description:"optional, user to check"`
+ // Groups is optional, but at least one of User or Groups must be specified
+ Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
+}
+
+// SubjectAccessReviewResponse describes whether or not a user or group can
+// perform an action
+type SubjectAccessReviewResponse struct {
+ kapi.TypeMeta
+
+ // Allowed is required. True if the action would be allowed, false otherwise.
+ Allowed bool
+ // Reason is optional. It indicates why a request was allowed or denied.
+ Reason string
+}
+
+// PersonalSubjectAccessReview is an object for requesting information about
+// whether a user or group can perform an action
+type PersonalSubjectAccessReview struct {
+ kapi.TypeMeta `json:",inline"`
+
+ // AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+}
+
+// PersonalSubjectAccessReviewResponse describes whether this user can perform
+// an action
+type PersonalSubjectAccessReviewResponse struct {
+ kapi.TypeMeta
+
+ // Namespace is the namespace used for the access review
+ Namespace string
+ // Allowed is required. True if the action would be allowed, false otherwise.
+ Allowed bool
+ // Reason is optional. It indicates why a request was allowed or denied.
+ Reason string
+}
+
+// LocalSubjectAccessReview is an object for requesting information about
+// whether a user or group can perform an action
+type LocalSubjectAccessReview struct {
+ kapi.TypeMeta `json:",inline"`
+
+ // AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+ // User is optional, but at least one of User or Groups must be specified
+ User string `json:"user" description:"optional, user to check"`
+ // Groups is optional, but at least one of User or Groups must be specified
+ Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
+}
+
+// LocalSubjectAccessReviewResponse describes whether or not a user or group can
+// perform an action
+type LocalSubjectAccessReviewResponse struct {
+ kapi.TypeMeta
+
+ // Namespace is the namespace used for the access review
+ Namespace string
+ // Allowed is required. True if the action would be allowed, false otherwise.
+ Allowed bool
+ // Reason is optional. It indicates why a request was allowed or denied.
+ Reason string
+}
+```
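+
+Because the review objects carry everything needed to build `Attributes`, the
+RESTStorage for these kinds can be implemented once against any Authorizer. A
+minimal sketch (the storage type and the `attributesFrom` helper are
+illustrative, not existing code):
+
+```go
+type subjectAccessReviewStorage struct {
+  authorizer Authorizer
+}
+
+func (s *subjectAccessReviewStorage) Create(ctx api.Context, obj runtime.Object) (runtime.Object, error) {
+  review := obj.(*SubjectAccessReview)
+
+  // Build authorizer Attributes from the requested user/groups and action.
+  attrs := attributesFrom(review.User, review.Groups, review.AuthorizationAttributes)
+
+  allowed, reason, err := s.authorizer.Authorize(ctx, attrs)
+  if err != nil {
+    return nil, err
+  }
+  return &SubjectAccessReviewResponse{Allowed: allowed, Reason: reason}, nil
+}
+```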
+
+### ResourceAccessReview
+
+This set of APIs answers the question: which users and groups can perform the
+specified verb on the specified resource. Given the Authorizer interface
+described above, this endpoint can be implemented generically against any
+Authorizer by calling the .GetAllowedSubjects() function.
+
+There are two different flavors:
+
+1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReviews` - this
+checks to see which users and groups can perform a given action at the cluster
+scope or across all namespaces. This is a highly privileged operation. It allows
+a cluster-admin to inspect rights of all subjects across the entire cluster and
+against cluster level resources.
+2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` -
+this checks to see which users and groups can perform a given action in **this**
+namespace. This is a moderately privileged operation. In a multi-tenant
+environment, having a namespace scoped resource makes it very easy to reason
+about powers granted to a namespace admin. It gives a namespace admin (someone
+able to manage permissions inside one namespace, but not all namespaces) the
+power to inspect which users and groups can manipulate resources in that
+namespace.
+
+ResourceAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a ResourceAccessReview to this URL and gets
+a ResourceAccessReviewResponse back. Here is an example of a call and its
+corresponding return:
+
+```
+// input
+{
+ "kind": "ResourceAccessReview",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "authorizationAttributes": {
+ "verb": "list",
+ "resource": "replicationcontrollers"
+ }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json
+// or
+accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject)
+
+// output
+{
+ "kind": "ResourceAccessReviewResponse",
+ "apiVersion": "authorization.kubernetes.io/v1",
+ "namespace": "default"
+ "users": ["Clark", "Hubert"],
+ "groups": ["cluster-admins"]
+}
+```
+
+The actual Go objects look like this:
+
+```go
+// ResourceAccessReview is a means to request a list of which users and groups
+// are authorized to perform the action specified by spec
+type ResourceAccessReview struct {
+ kapi.TypeMeta `json:",inline"`
+
+ // AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+}
+
+// ResourceAccessReviewResponse describes who can perform the action
+type ResourceAccessReviewResponse struct {
+ kapi.TypeMeta
+
+ // Users is the list of users who can perform the action
+ Users []string
+ // Groups is the list of groups who can perform the action
+ Groups []string
+}
+
+// LocalResourceAccessReview is a means to request a list of which users and
+// groups are authorized to perform the action specified in a specific namespace
+type LocalResourceAccessReview struct {
+ kapi.TypeMeta `json:",inline"`
+
+ // AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+}
+
+// LocalResourceAccessReviewResponse describes who can perform the action
+type LocalResourceAccessReviewResponse struct {
+ kapi.TypeMeta
+
+ // Namespace is the namespace used for the access review
+ Namespace string
+ // Users is the list of users who can perform the action
+ Users []string
+ // Groups is the list of groups who can perform the action
+ Groups []string
+}
+```
+
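+Analogous to the sketch above, this endpoint can sit on top of the optional
+`AuthorizerIntrospection` interface. The storage type name and `attributesFrom`
+helper below are illustrative only, and `util.StringSet` is assumed to expose a
+sorted `List()`:
+
+```go
+type resourceAccessReviewStorage struct {
+  introspector AuthorizerIntrospection
+}
+
+func (s *resourceAccessReviewStorage) Create(ctx api.Context, obj runtime.Object) (runtime.Object, error) {
+  review := obj.(*ResourceAccessReview)
+
+  // No user/groups here; we are asking who can perform the action.
+  attrs := attributesFrom("", nil, review.AuthorizationAttributes)
+
+  users, groups, err := s.introspector.GetAllowedSubjects(ctx, attrs)
+  if err != nil {
+    return nil, err
+  }
+  return &ResourceAccessReviewResponse{Users: users.List(), Groups: groups.List()}, nil
+}
+```
+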
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/event_compression.md b/contributors/design-proposals/event_compression.md
new file mode 100644
index 00000000..7a1cbb33
--- /dev/null
+++ b/contributors/design-proposals/event_compression.md
@@ -0,0 +1,169 @@
+# Kubernetes Event Compression
+
+This document captures the design of event compression.
+
+## Background
+
+Kubernetes components can get into a state where they generate tons of events.
+
+The events can be categorized in one of two ways:
+
+1. same - The event is identical to previous events except it varies only on
+timestamp.
+2. similar - The event is identical to previous events except it varies on
+timestamp and message.
+
+For example, when pulling a non-existing image, Kubelet will repeatedly generate
+`image_not_existing` and `container_is_waiting` events until upstream components
+correct the image. When this happens, the spam from the repeated events makes
+the entire event mechanism useless. It also appears to cause memory pressure in
+etcd (see [#3853](http://issue.k8s.io/3853)).
+
+The goal is to introduce event counting to increment same events, and event
+aggregation to collapse similar events.
+
+## Proposal
+
+Each binary that generates events (for example, `kubelet`) should keep track of
+previously generated events so that it can collapse recurring events into a
+single event instead of creating a new instance for each new event. In addition,
+if many similar events are created, events should be aggregated into a single
+event to reduce spam.
+
+Event compression should be best effort (not guaranteed), meaning that in the
+worst case, `n` identical (minus timestamp) events may still result in `n`
+event entries.
+
+## Design
+
+Instead of a single Timestamp, each event object
+[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
+fields:
+ * `FirstTimestamp unversioned.Time`
+ * The date/time of the first occurrence of the event.
+ * `LastTimestamp unversioned.Time`
+ * The date/time of the most recent occurrence of the event.
+ * On first occurrence, this is equal to the FirstTimestamp.
+ * `Count int`
+ * The number of occurrences of this event between FirstTimestamp and
+LastTimestamp.
+ * On first occurrence, this is 1.
+
+Each binary that generates events:
+ * Maintains a historical record of previously generated events:
+ * Implemented with
+["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go)
+in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
+ * Implemented behind an `EventCorrelator` that manages two subcomponents:
+`EventAggregator` and `EventLogger`.
+ * The `EventCorrelator` observes all incoming events and lets each
+subcomponent visit and modify the event in turn.
+ * The `EventAggregator` runs an aggregation function over each event. This
+function buckets each event based on an `aggregateKey` and identifies the event
+uniquely with a `localKey` in that bucket.
+ * The default aggregation function groups similar events that differ only by
+`event.Message`. Its `localKey` is `event.Message` and its aggregate key is
+produced by joining:
+ * `event.Source.Component`
+ * `event.Source.Host`
+ * `event.InvolvedObject.Kind`
+ * `event.InvolvedObject.Namespace`
+ * `event.InvolvedObject.Name`
+ * `event.InvolvedObject.UID`
+ * `event.InvolvedObject.APIVersion`
+ * `event.Reason`
+ * If the `EventAggregator` observes a similar event produced 10 times in a 10
+minute window, it drops the event that was provided as input and creates a new
+event that differs only on the message. The message denotes that this event is
+used to group similar events that matched on reason. This aggregated `Event` is
+then used in the event processing sequence.
+  * The `EventLogger` observes the event coming out of the `EventAggregator` and tracks
+the number of times it has observed that event previously by incrementing a key
+in a cache associated with that matching event.
+    * The key in the cache is generated from the event object minus
+timestamps/count/transient fields; specifically, the following event fields are
+used to construct a unique key for an event (see the sketch after this list):
+ * `event.Source.Component`
+ * `event.Source.Host`
+ * `event.InvolvedObject.Kind`
+ * `event.InvolvedObject.Namespace`
+ * `event.InvolvedObject.Name`
+ * `event.InvolvedObject.UID`
+ * `event.InvolvedObject.APIVersion`
+ * `event.Reason`
+ * `event.Message`
+ * The LRU cache is capped at 4096 events for both `EventAggregator` and
+`EventLogger`. That means if a component (e.g. kubelet) runs for a long period
+of time and generates tons of unique events, the previously generated events
+cache will not grow unchecked in memory. Instead, after 4096 unique events are
+generated, the oldest events are evicted from the cache.
+ * When an event is generated, the previously generated events cache is checked
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
+ * If the key for the new event matches the key for a previously generated
+event (meaning all of the above fields match between the new event and some
+previously generated event), then the event is considered to be a duplicate and
+the existing event entry is updated in etcd:
+ * The new PUT (update) event API is called to update the existing event
+entry in etcd with the new last seen timestamp and count.
+ * The event is also updated in the previously generated events cache with
+an incremented count, updated last seen timestamp, name, and new resource
+version (all required to issue a future event update).
+ * If the key for the new event does not match the key for any previously
+generated event (meaning none of the above fields match between the new event
+and any previously generated events), then the event is considered to be
+new/unique and a new event entry is created in etcd:
+ * The usual POST/create event API is called to create a new event entry in
+etcd.
+ * An entry for the event is also added to the previously generated events
+cache.
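+
+As a sketch of the key construction described above (the helper name is
+illustrative and assumes the `strings` package and the `api` event types are
+imported), the `EventLogger` cache key could be derived like this; the
+`EventAggregator`'s aggregate key is built the same way but without
+`event.Message`, which serves as the `localKey` instead:
+
+```go
+// getEventKey joins the identifying fields listed above into a cache key.
+func getEventKey(event *api.Event) string {
+  return strings.Join([]string{
+    event.Source.Component,
+    event.Source.Host,
+    event.InvolvedObject.Kind,
+    event.InvolvedObject.Namespace,
+    event.InvolvedObject.Name,
+    string(event.InvolvedObject.UID),
+    event.InvolvedObject.APIVersion,
+    event.Reason,
+    event.Message,
+  }, "")
+}
+```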
+
+## Issues/Risks
+
+ * Compression is not guaranteed, because each component keeps track of event
+ history in memory
+ * An application restart causes event history to be cleared, meaning event
+history is not preserved across application restarts and compression will not
+occur across component restarts.
+ * Because an LRU cache is used to keep track of previously generated events,
+if too many unique events are generated, old events will be evicted from the
+cache, so events will only be compressed until they age out of the events cache,
+at which point any new instance of the event will cause a new entry to be
+created in etcd.
+
+## Example
+
+Sample kubectl output:
+
+```console
+FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE
+Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet.
+Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet.
+Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet.
+Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet.
+Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
+Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
+Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
+Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
+Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
+Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest"
+Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal
+```
+
+This demonstrates what would have been 20 separate entries (indicating
+scheduling failure) collapsed/compressed down to 5 entries.
+
+## Related Pull Requests/Issues
+
+ * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events.
+ * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API.
+ * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow
+compressing multiple recurring events in to a single event.
+ * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a
+single event to optimize etcd storage.
+ * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache
+instead of map.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/expansion.md b/contributors/design-proposals/expansion.md
new file mode 100644
index 00000000..ace1faf0
--- /dev/null
+++ b/contributors/design-proposals/expansion.md
@@ -0,0 +1,417 @@
+# Variable expansion in pod command, args, and env
+
+## Abstract
+
+A proposal for the expansion of environment variables using a simple `$(var)`
+syntax.
+
+## Motivation
+
+It is extremely common for users to need to compose environment variables or
+pass arguments to their commands using the values of environment variables.
+Kubernetes should provide a facility for the 80% cases in order to decrease
+coupling and the use of workarounds.
+
+## Goals
+
+1. Define the syntax format
+2. Define the scoping and ordering of substitutions
+3. Define the behavior for unmatched variables
+4. Define the behavior for unexpected/malformed input
+
+## Constraints and Assumptions
+
+* This design should describe the simplest possible syntax to accomplish the
+use-cases.
+* Expansion syntax will not support more complicated shell-like behaviors such
+as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
+
+## Use Cases
+
+1. As a user, I want to compose new environment variables for a container using
+a substitution syntax to reference other variables in the container's
+environment and service environment variables.
+1. As a user, I want to substitute environment variables into a container's
+command.
+1. As a user, I want to do the above without requiring the container's image to
+have a shell.
+1. As a user, I want to be able to specify a default value for a service
+variable which may not exist.
+1. As a user, I want to see an event associated with the pod if an expansion
+fails (ie, references variable names that cannot be expanded).
+
+### Use Case: Composition of environment variables
+
+Currently, containers are injected with docker-style environment variables for
+the services in their pod's namespace. There are several variables for each
+service, but users routinely need to compose URLs based on these variables
+because there is not a variable for the exact format they need. Users should be
+able to build new environment variables with the exact format they need.
+Eventually, it should also be possible to turn off the automatic injection of
+the docker-style variables into pods and let the users consume the exact
+information they need via the downward API and composition.
+
+#### Expanding expanded variables
+
+It should be possible to reference a variable which is itself the result of an
+expansion, if the referenced variable is declared in the container's environment
+prior to the one referencing it. Put another way -- a container's environment is
+expanded in order, and expanded variables are available to subsequent
+expansions.
+
+### Use Case: Variable expansion in command
+
+Users frequently need to pass the values of environment variables to a
+container's command. Currently, Kubernetes does not perform any expansion of
+variables. The workaround is to invoke a shell in the container's command and
+have the shell perform the substitution, or to write a wrapper script that sets
+up the environment and runs the command. This has a number of drawbacks:
+
+1. Solutions that require a shell are unfriendly to images that do not contain
+a shell.
+2. Wrapper scripts make it harder to use images as base images.
+3. Wrapper scripts increase coupling to Kubernetes.
+
+Users should be able to do the 80% case of variable expansion in command without
+writing a wrapper script or adding a shell invocation to their containers'
+commands.
+
+### Use Case: Images without shells
+
+The current workaround for variable expansion in a container's command requires
+the container's image to have a shell. This is unfriendly to images that do not
+contain a shell (`scratch` images, for example). Users should be able to perform
+the other use-cases in this design without regard to the content of their
+images.
+
+### Use Case: See an event for incomplete expansions
+
+It is possible that a container with incorrect variable values or command line
+may continue to run for a long period of time, and that the end-user would have
+no visual or obvious warning of the incorrect configuration. If the kubelet
+creates an event when an expansion references a variable that cannot be
+expanded, it will help users quickly detect problems with expansions.
+
+## Design Considerations
+
+### What features should be supported?
+
+In order to limit complexity, we want to provide the right amount of
+functionality so that the 80% cases can be realized and nothing more. We felt
+that the essentials boiled down to:
+
+1. Ability to perform direct expansion of variables in a string.
+2. Ability to specify default values via a prioritized mapping function but
+without support for defaults as a syntax-level feature.
+
+### What should the syntax be?
+
+The exact syntax for variable expansion has a large impact on how users perceive
+and relate to the feature. We considered implementing a very restrictive subset
+of the shell `${var}` syntax. This syntax is an attractive option on some level,
+because many people are familiar with it. However, this syntax also has a large
+number of lesser known features such as the ability to provide default values
+for unset variables, perform inline substitution, etc.
+
+In the interest of preventing conflation of the expansion feature in Kubernetes
+with the shell feature, we chose a different syntax similar to the one in
+Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since
+it is not required to implement the required use-cases.
+
+Nested references, ie, variable expansion within variable names, are not
+supported.
+
+#### How should unmatched references be treated?
+
+Ideally, it should be extremely clear when a variable reference couldn't be
+expanded. We decided the best experience for unmatched variable references would
+be to have the entire reference, syntax included, show up in the output. As an
+example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then
+`$(VARIABLE_NAME)` should be present in the output.
+
+#### Escaping the operator
+
+Although the `$(var)` syntax does overlap with the `$(command)` form of command
+substitution supported by many shells, because unexpanded variables are present
+verbatim in the output, we expect this will not present a problem to many users.
+If there is a collision between a variable name and command substitution syntax,
+the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate
+to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
+
+## Design
+
+This design encompasses the variable expansion syntax and specification and the
+changes needed to incorporate the expansion feature into the container's
+environment and command.
+
+### Syntax and expansion mechanics
+
+This section describes the expansion syntax, evaluation of variable values, and
+how unexpected or malformed inputs are handled.
+
+#### Syntax
+
+The inputs to the expansion feature are:
+
+1. A utf-8 string (the input string) which may contain variable references.
+2. A function (the mapping function) that maps the name of a variable to the
+variable's value, of type `func(string) string`.
+
+Variable references in the input string are indicated exclusively with the syntax
+`$(<variable-name>)`. The syntax tokens are:
+
+- `$`: the operator,
+- `(`: the reference opener, and
+- `)`: the reference closer.
+
+The operator has no meaning unless accompanied by the reference opener and
+closer tokens. The operator can be escaped using `$$`. One literal `$` will be
+emitted for each `$$` in the input.
+
+The reference opener and closer characters have no meaning when not part of a
+variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME`
+without a closing expression, the operator and expression opening characters are
+treated as ordinary characters without special meanings.
+
+#### Scope and ordering of substitutions
+
+The scope in which variable references are expanded is defined by the mapping
+function. Within the mapping function, any arbitrary strategy may be used to
+determine the value of a variable name. The most basic implementation of a
+mapping function is to use a `map[string]string` to lookup the value of a
+variable.
+
+In order to support default values for variables like service variables
+presented by the kubelet, which may not be bound because the service that
+provides them does not yet exist, there should be a mapping function that uses a
+list of `map[string]string` like:
+
+```go
+func MakeMappingFunc(maps ...map[string]string) func(string) string {
+ return func(input string) string {
+ for _, context := range maps {
+ val, ok := context[input]
+ if ok {
+ return val
+ }
+ }
+
+ return ""
+ }
+}
+
+// elsewhere
+containerEnv := map[string]string{
+ "FOO": "BAR",
+ "ZOO": "ZAB",
+ "SERVICE2_HOST": "some-host",
+}
+
+serviceEnv := map[string]string{
+ "SERVICE_HOST": "another-host",
+ "SERVICE_PORT": "8083",
+}
+
+// single-map variation
+mapping := MakeMappingFunc(containerEnv)
+
+// default variables not found in serviceEnv
+mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
+```
+
+### Implementation changes
+
+The necessary changes to implement this functionality are:
+
+1. Add a new interface, `ObjectEventRecorder`, which is like the
+`EventRecorder` interface, but scoped to a single object, and a function that
+returns an `ObjectEventRecorder` given an `ObjectReference` and an
+`EventRecorder`.
+2. Introduce `third_party/golang/expansion` package that provides:
+ 1. An `Expand(string, func(string) string) string` function.
+  2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string`
+function.
+3. Make the kubelet expand environment correctly.
+4. Make the kubelet expand command correctly.
+
+#### Event Recording
+
+In order to provide an event when an expansion references undefined variables,
+the mapping function must be able to create an event. In order to facilitate
+this, we should create a new interface in the `api/client/record` package which
+is similar to `EventRecorder`, but scoped to a single object:
+
+```go
+// ObjectEventRecorder knows how to record events about a single object.
+type ObjectEventRecorder interface {
+ // Event constructs an event from the given information and puts it in the queue for sending.
+ // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
+ // be used to automate handling of events, so imagine people writing switch statements to
+ // handle them. You want to make that easy.
+ // 'message' is intended to be human readable.
+ //
+ // The resulting event will be created in the same namespace as the reference object.
+ Event(reason, message string)
+
+ // Eventf is just like Event, but with Sprintf for the message field.
+ Eventf(reason, messageFmt string, args ...interface{})
+
+ // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
+ PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
+}
+```
+
+There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object`
+and an `EventRecorder`:
+
+```go
+type objectRecorderImpl struct {
+ object runtime.Object
+ recorder EventRecorder
+}
+
+func (r *objectRecorderImpl) Event(reason, message string) {
+ r.recorder.Event(r.object, reason, message)
+}
+
+func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
+ return &objectRecorderImpl{object, recorder}
+}
+```
+
+#### Expansion package
+
+The expansion package should provide two methods:
+
+```go
+// MappingFuncFor returns a mapping function for use with Expand that
+// implements the expansion semantics defined in the expansion spec; it
+// returns the input string wrapped in the expansion syntax if no mapping
+// for the input is found. If no expansion is found for a key, an event
+// is raised on the given recorder.
+func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
+ // ...
+}
+
+// Expand replaces variable references in the input string according to
+// the expansion spec using the given mapping function to resolve the
+// values of variables.
+func Expand(input string, mapping func(string) string) string {
+ // ...
+}
+```
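+
+For illustration, a minimal sketch of how `Expand` could implement the
+semantics in this spec is shown below. This is an assumed implementation, not
+the final package code; the `main` function simply exercises a few rows from
+the examples table later in this document:
+
+```go
+package main
+
+import (
+  "bytes"
+  "fmt"
+)
+
+const (
+  operator        = '$'
+  referenceOpener = '('
+  referenceCloser = ')'
+)
+
+// Expand replaces $(VAR) references in input using the mapping function.
+// $$ escapes the operator; malformed references are emitted verbatim.
+func Expand(input string, mapping func(string) string) string {
+  var buf bytes.Buffer
+  checkpoint := 0
+  for cursor := 0; cursor < len(input); cursor++ {
+    if input[cursor] == operator && cursor+1 < len(input) {
+      // Copy everything scanned so far, up to (but not including) the operator.
+      buf.WriteString(input[checkpoint:cursor])
+      read, isVar, advance := tryReadVariableName(input[cursor+1:])
+      if isVar {
+        buf.WriteString(mapping(read))
+      } else {
+        buf.WriteString(read)
+      }
+      cursor += advance
+      checkpoint = cursor + 1
+    }
+  }
+  return buf.String() + input[checkpoint:]
+}
+
+// tryReadVariableName inspects the text following an operator and reports
+// what to emit, whether it is a variable name, and how many bytes it consumed.
+func tryReadVariableName(input string) (string, bool, int) {
+  switch input[0] {
+  case operator:
+    // Escaped operator: $$ emits a single literal $.
+    return string(operator), false, 1
+  case referenceOpener:
+    // Scan for the reference closer.
+    for i := 1; i < len(input); i++ {
+      if input[i] == referenceCloser {
+        return input[1:i], true, i + 1
+      }
+    }
+    // No closer: the operator and opener are treated as ordinary characters.
+    return string(operator) + string(referenceOpener), false, 1
+  default:
+    // Operator followed by neither $ nor (: emit it verbatim.
+    return string(operator) + input[0:1], false, 1
+  }
+}
+
+func main() {
+  vars := map[string]string{"VAR_A": "A", "VAR_B": "B"}
+  mapping := func(name string) string {
+    if v, ok := vars[name]; ok {
+      return v
+    }
+    // Unmatched references come back wrapped in the expansion syntax.
+    return "$(" + name + ")"
+  }
+  fmt.Println(Expand("$(VAR_A)_$(VAR_B)", mapping))  // A_B
+  fmt.Println(Expand("$$(VAR_B)_$(VAR_A)", mapping)) // $(VAR_B)_A
+  fmt.Println(Expand("foo$(VAR_Awhoops!", mapping))  // foo$(VAR_Awhoops!
+}
+```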
+
+#### Kubelet changes
+
+The Kubelet should be made to correctly expand variables references in a
+container's environment, command, and args. Changes will need to be made to:
+
+1. The `makeEnvironmentVariables` function in the kubelet; this is used by
+`GenerateRunContainerOptions`, which is used by both the docker and rkt
+container runtimes.
+2. The docker manager `setEntrypointAndCommand` func has to be changed to
+perform variable expansion.
+3. The rkt runtime should be made to support expansion in command and args
+when support for it is implemented.
+
+### Examples
+
+#### Inputs and outputs
+
+These examples are in the context of the mapping:
+
+| Name | Value |
+|-------------|------------|
+| `VAR_A` | `"A"` |
+| `VAR_B` | `"B"` |
+| `VAR_C` | `"C"` |
+| `VAR_REF` | `$(VAR_A)` |
+| `VAR_EMPTY` | `""` |
+
+No other variables are defined.
+
+| Input | Result |
+|--------------------------------|----------------------------|
+| `"$(VAR_A)"` | `"A"` |
+| `"___$(VAR_B)___"` | `"___B___"` |
+| `"___$(VAR_C)"` | `"___C"` |
+| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` |
+| `"$(VAR_A)-1"` | `"A-1"` |
+| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` |
+| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` |
+| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` |
+| `"f000-$$VAR_A"` | `"f000-$VAR_A"` |
+| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` |
+| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` |
+| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` |
+| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` |
+| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` |
+| `"$(VAR_REF)"` | `"$(VAR_A)"` |
+| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
+| `"foo$(VAR_EMPTY)bar"` | `"foobar"` |
+| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` |
+| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` |
+| `"$?_boo_$!"` | `"$?_boo_$!"` |
+| `"$VAR_A"` | `"$VAR_A"` |
+| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` |
+| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` |
+| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` |
+| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` |
+| `"$$$$$$$(VAR_A)"` | `"$$$A"` |
+| `"$VAR_A)"` | `"$VAR_A)"` |
+| `"${VAR_A}"` | `"${VAR_A}"` |
+| `"$(VAR_B)_______$(A"` | `"B_______$(A"` |
+| `"$(VAR_C)_______$("` | `"C_______$("` |
+| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` |
+| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` |
+| `"--$($($($($--"` | `"--$($($($($--"` |
+| `"$($($($($--foo$("` | `"$($($($($--foo$("` |
+| `"foo0--$($($($("` | `"foo0--$($($($("` |
+| `"$(foo$$var)` | `$(foo$$var)` |
+
+#### In a pod: building a URL
+
+Notice the `$(var)` syntax.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: expansion-pod
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh", "-c", "env" ]
+ env:
+ - name: PUBLIC_URL
+ value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)"
+ restartPolicy: Never
+```
+
+#### In a pod: building a URL using downward API
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: expansion-pod
+spec:
+ containers:
+ - name: test-container
+ image: gcr.io/google_containers/busybox
+ command: [ "/bin/sh", "-c", "env" ]
+ env:
+ - name: POD_NAMESPACE
+ valueFrom:
+ fieldRef:
+ fieldPath: "metadata.namespace"
+ - name: PUBLIC_URL
+ value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)"
+ restartPolicy: Never
+```
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/extending-api.md b/contributors/design-proposals/extending-api.md
new file mode 100644
index 00000000..45a07ca5
--- /dev/null
+++ b/contributors/design-proposals/extending-api.md
@@ -0,0 +1,203 @@
+# Adding custom resources to the Kubernetes API server
+
+This document describes the design for implementing the storage of custom API
+types in the Kubernetes API Server.
+
+
+## Resource Model
+
+### The ThirdPartyResource
+
+The `ThirdPartyResource` resource describes the multiple versions of a custom
+resource that the user wants to add to the Kubernetes API. `ThirdPartyResource`
+is a non-namespaced resource; attempting to place it in a namespace will return
+an error.
+
+Each `ThirdPartyResource` resource has the following:
+ * Standard Kubernetes object metadata.
+ * ResourceKind - The kind of the resources described by this third party
+resource.
+ * Description - A free text description of the resource.
+ * APIGroup - An API group that this resource should be placed into.
+ * Versions - One or more `Version` objects.
+
+### The `Version` Object
+
+The `Version` object describes a single concrete version of a custom resource.
+The `Version` object currently only specifies:
+ * The `Name` of the version.
+ * The `APIGroup` this version should belong to.
+
+## Expectations about third party objects
+
+Every object that is added to a third-party Kubernetes object store is expected
+to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata).
+This requirement enables the Kubernetes API server to provide the following
+features:
+ * Filtering lists of objects via label queries.
+ * `resourceVersion`-based optimistic concurrency via compare-and-swap.
+ * Versioned storage.
+ * Event recording.
+ * Integration with basic `kubectl` command line tooling.
+ * Watch for resource changes.
+
+The `Kind` for an instance of a third-party object (e.g. CronTab) below is
+expected to be programmatically convertible to the name of the resource using
+the following conversion. Kinds are expected to be of the form
+`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be
+`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll
+use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`.
+
+For example, the `APIVersion` might be `mygroup.example.com/v1`, where
+`CamelCaseKind` is the specific type name.
+
+To convert this into the `metadata.name` for the `ThirdPartyResource` resource
+instance, the API group (domain name) is copied verbatim, and the
+`CamelCaseKind` is converted to lower case with a '-' inserted before each
+capital letter except the first ('camel-case'). In pseudo code:
+
+```go
+var result []rune
+for ix, r := range kindName {
+    // Insert a '-' before each capital letter except the leading one.
+    if isCapital(r) && ix > 0 {
+        result = append(result, '-')
+    }
+    result = append(result, toLowerCase(r))
+}
+```
+
+As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines
+resources of Kind `CamelCaseKind`, in the APIGroup with the prefix
+`mygroup.example.com/...`.
+
+The reason for this is to enable rapid lookup of a `ThirdPartyResource` object
+given the kind information. This is also the reason why `ThirdPartyResource` is
+not namespaced.
+
+## Usage
+
+When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts
+by creating a new, namespaced RESTful resource path. For now, non-namespaced
+objects are not supported. As with existing built-in objects, deleting a
+namespace deletes all third party resources in that namespace.
+
+For example, if a user creates:
+
+```yaml
+metadata:
+ name: cron-tab.mygroup.example.com
+apiVersion: extensions/v1beta1
+kind: ThirdPartyResource
+description: "A specification of a Pod to run on a cron style schedule"
+versions:
+- name: v1
+- name: v2
+```
+
+Then the API server will program in the new RESTful resource path:
+ * `/apis/mygroup.example.com/v1/namespaces/<namespace>/crontabs/...`
+
+**Note: It may take a while before the RESTful resource path is registered, so
+always check that the path exists before you create resource instances.**
+
+Now that this schema has been created, a user can `POST`:
+
+```json
+{
+ "metadata": {
+ "name": "my-new-cron-object"
+ },
+ "apiVersion": "mygroup.example.com/v1",
+ "kind": "CronTab",
+ "cronSpec": "* * * * /5",
+ "image": "my-awesome-cron-image"
+}
+```
+
+to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs`
+
+and the corresponding data will be stored into etcd by the APIServer, so that
+when the user issues:
+
+```
+GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object
+```
+
+they will get back the same data, but with additional Kubernetes metadata
+(e.g. `resourceVersion`, `creationTimestamp`) filled in.
+
+Likewise, to list all resources, a user can issue:
+
+```
+GET /apis/mygroup.example.com/v1/namespaces/default/crontabs
+```
+
+and get back:
+
+```json
+{
+ "apiVersion": "mygroup.example.com/v1",
+ "kind": "CronTabList",
+ "items": [
+ {
+ "metadata": {
+ "name": "my-new-cron-object"
+ },
+ "apiVersion": "mygroup.example.com/v1",
+ "kind": "CronTab",
+ "cronSpec": "* * * * /5",
+ "image": "my-awesome-cron-image"
+ }
+ ]
+}
+```
+
+Because all objects are expected to contain standard Kubernetes metadata fields,
+these list operations can also use label queries to filter requests down to
+specific subsets.
+
+Likewise, clients can use watch endpoints to watch for changes to stored
+objects.
+
+## Storage
+
+In order to store custom user data in a versioned fashion inside of etcd, we
+need to also introduce a `Codec`-compatible object for persistent storage in
+etcd. This object is `ThirdPartyResourceData` and it contains:
+ * Standard API Metadata.
+ * `Data`: The raw JSON data for this custom object.
+
+### Storage key specification
+
+Each custom object stored by the API server needs a custom key in storage;
+this is described below:
+
+#### Definitions
+
+ * `resource-namespace`: the namespace of the particular resource that is
+being stored
+ * `resource-name`: the name of the particular resource being stored
+ * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource`
+resource that represents the type for the specific instance being stored
+ * `third-party-resource-name`: the name of the `ThirdPartyResource` resource
+that represents the type for the specific instance being stored
+
+#### Key
+
+Given the definitions above, the key for a specific third-party object is:
+
+```
+${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
+```
+
+Thus, listing a third-party resource can be achieved by listing the directory:
+
+```
+${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
+```
+
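+A small illustrative helper (not existing code) that composes such a key,
+assuming the standard `path` package is imported:
+
+```go
+// thirdPartyObjectKey joins the components described above into an etcd key.
+func thirdPartyObjectKey(standardPrefix, tprNamespace, tprName, resourceNamespace, resourceName string) string {
+  return path.Join(standardPrefix, "third-party-resources",
+    tprNamespace, tprName, resourceNamespace, resourceName)
+}
+```
+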
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/external-lb-source-ip-preservation.md b/contributors/design-proposals/external-lb-source-ip-preservation.md
new file mode 100644
index 00000000..e1450e64
--- /dev/null
+++ b/contributors/design-proposals/external-lb-source-ip-preservation.md
@@ -0,0 +1,238 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Overview](#overview)
+ - [Motivation](#motivation)
+- [Alpha Design](#alpha-design)
+ - [Overview](#overview-1)
+ - [Traffic Steering using LB programming](#traffic-steering-using-lb-programming)
+ - [Traffic Steering using Health Checks](#traffic-steering-using-health-checks)
+ - [Choice of traffic steering approaches by individual Cloud Provider implementations](#choice-of-traffic-steering-approaches-by-individual-cloud-provider-implementations)
+ - [API Changes](#api-changes)
+ - [Local Endpoint Recognition Support](#local-endpoint-recognition-support)
+ - [Service Annotation to opt-in for new behaviour](#service-annotation-to-opt-in-for-new-behaviour)
+ - [NodePort allocation for HealthChecks](#nodeport-allocation-for-healthchecks)
+ - [Behavior Changes expected](#behavior-changes-expected)
+ - [External Traffic Blackholed on nodes with no local endpoints](#external-traffic-blackholed-on-nodes-with-no-local-endpoints)
+ - [Traffic Balancing Changes](#traffic-balancing-changes)
+ - [Cloud Provider support](#cloud-provider-support)
+ - [GCE 1.4](#gce-14)
+ - [GCE Expected Packet Source/Destination IP (Datapath)](#gce-expected-packet-sourcedestination-ip-datapath)
+ - [GCE Expected Packet Destination IP (HealthCheck path)](#gce-expected-packet-destination-ip-healthcheck-path)
+ - [AWS TBD](#aws-tbd)
+ - [Openstack TBD](#openstack-tbd)
+ - [Azure TBD](#azure-tbd)
+ - [Testing](#testing)
+- [Beta Design](#beta-design)
+ - [API Changes from Alpha to Beta](#api-changes-from-alpha-to-beta)
+- [Future work](#future-work)
+- [Appendix](#appendix)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Overview
+
+Kubernetes provides an external loadbalancer service type which creates a virtual external ip
+(in supported cloud provider environments) that can be used to load-balance traffic to
+the pods matching the service pod-selector.
+
+## Motivation
+
+The current implementation requires that the cloud loadbalancer balances traffic across all
+Kubernetes worker nodes, and this traffic is then equally distributed to all the backend
+pods for that service.
+Due to the DNAT required to redirect the traffic to its ultimate destination, the return
+path for each session MUST traverse the same node again. To ensure this, the node also
+performs a SNAT, replacing the source ip with its own.
+
+This causes the service endpoint to see the session as originating from a cluster-local IP address.
+*The original external source IP is lost.*
+
+This is not a satisfactory solution - the original external source IP MUST be preserved
+for many applications and customer use-cases.
+
+# Alpha Design
+
+This section describes the proposed design for
+[alpha-level](../../docs/devel/api_changes.md#alpha-beta-and-stable-versions) support, although
+additional features are described in [future work](#future-work).
+
+## Overview
+
+The double hop must be prevented by programming the external load balancer to direct traffic
+only to nodes that have local pods for the service. This can be accomplished in two ways, either
+by API calls to add/delete nodes from the LB node pool or by adding health checking to the LB and
+failing/passing health checks depending on the presence of local pods.
+
+## Traffic Steering using LB programming
+
+This approach requires that the Cloud LB be reprogrammed to be in sync with endpoint presence.
+Whenever the first service endpoint is scheduled onto a node, the node is added to the LB pool.
+Whenever the last service endpoint is unhealthy on a node, the node needs to be removed from the LB pool.
+
+This is a slow operation, on the order of 30-60 seconds, and it involves the Cloud Provider API path.
+If the API endpoint is temporarily unavailable, the datapath will remain misprogrammed until the
+reprogramming succeeds and the API->datapath tables are updated by the cloud provider backend.
+
+## Traffic Steering using Health Checks
+
+This approach requires that all worker nodes in the cluster be programmed into the LB target pool.
+To steer traffic only onto nodes that have endpoints for the service, we program the LB to perform
+node healthchecks. The kube-proxy daemons running on each node will be responsible for responding
+to these healthcheck requests (URL `/healthz`) from the cloud provider LB healthchecker. An additional nodePort
+will be allocated for these health checks.
+kube-proxy already watches for Service and Endpoint changes; it will maintain an in-memory lookup
+table indicating the number of local endpoints for each service.
+When the local endpoint count for a service is zero, it responds to that service's health check with a
+failure (503 Service Unavailable); for non-zero counts it responds with success (200 OK).
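+
+Below is a minimal, self-contained sketch (not the actual kube-proxy implementation) of such a
+healthcheck responder: it serves `/healthz` and answers 200 or 503 based on an in-memory count of
+local endpoints. The service key and port used here are placeholders only.
+
+```
+package main
+
+import (
+  "fmt"
+  "log"
+  "net/http"
+  "sync"
+)
+
+// localEndpoints tracks, per service, how many endpoints run on this node.
+// In kube-proxy this would be fed by the Service/Endpoints watch; here it is
+// just a guarded map for illustration.
+type localEndpoints struct {
+  mu     sync.RWMutex
+  counts map[string]int // key: "namespace/name"
+}
+
+// healthz returns a handler that fails the LB health check when the node has
+// no local endpoints for the given service, and succeeds otherwise.
+func (l *localEndpoints) healthz(service string) http.HandlerFunc {
+  return func(w http.ResponseWriter, r *http.Request) {
+    l.mu.RLock()
+    n := l.counts[service]
+    l.mu.RUnlock()
+    if n == 0 {
+      http.Error(w, "no local endpoints", http.StatusServiceUnavailable)
+      return
+    }
+    w.WriteHeader(http.StatusOK)
+    fmt.Fprintf(w, "%d local endpoints\n", n)
+  }
+}
+
+func main() {
+  state := &localEndpoints{counts: map[string]int{"default/my-service": 2}}
+  mux := http.NewServeMux()
+  mux.HandleFunc("/healthz", state.healthz("default/my-service"))
+  // The real listener would use the per-service healthcheck nodePort described
+  // below; the port here is a placeholder.
+  log.Fatal(http.ListenAndServe(":10256", mux))
+}
+```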
+
+Healthchecks on most cloud provider LBs can be programmed with a minimum period of 1 second, and the
+minimum number of failed checks required to trigger a node health state change is typically configurable from 2 through 5.
+
+This will allow much faster transition times on the order of 1-5 seconds, and involve no
+API calls to the cloud provider (and hence reduce the impact of API unreliability), keeping the
+time window where traffic might get directed to nodes with no local endpoints to a minimum.
+
+## Choice of traffic steering approaches by individual Cloud Provider implementations
+
+The cloud provider package may choose either of these approaches. kube-proxy will provide these
+healthcheck responder capabilities, regardless of the cloud provider configured on a cluster.
+
+## API Changes
+
+### Local Endpoint Recognition Support
+
+Allowing kube-proxy to recognize whether an endpoint is local requires that the EndpointAddress struct
+also contain the NodeName of the node it resides on. This new string field will be read-only and
+populated *only* by the Endpoints Controller.
+
+### Service Annotation to opt-in for new behaviour
+
+A new annotation `service.alpha.kubernetes.io/external-traffic` will be recognized
+by the service controller only for services of Type LoadBalancer. Services that wish to opt in to
+the new ESIPP LoadBalancer behaviour must set this annotation on the Service.
+Supported values for this annotation are OnlyLocal and Global.
+- OnlyLocal activates the new logic (described in this proposal) and balances locally within a node.
+- Global activates the old logic of balancing traffic across the entire cluster.
+
+### NodePort allocation for HealthChecks
+
+An additional nodePort allocation will be necessary for services that are of type LoadBalancer and
+have the new annotation specified. This additional nodePort is necessary for kube-proxy to listen for
+healthcheck requests on all nodes.
+This NodePort will be added as an annotation (`service.alpha.kubernetes.io/healthcheck-nodeport`) to
+the Service after allocation (in the alpha release). The value of this annotation may also be
+specified during the Create call and the allocator will reserve that specific nodePort.
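+
+As a small illustration, a controller might interpret these annotations roughly as in the sketch
+below. The annotation keys are the ones defined in this proposal, while the helper function and the
+representation of Service metadata as a plain string map are purely illustrative.
+
+```
+package main
+
+import (
+  "fmt"
+  "strconv"
+)
+
+// Annotation keys from this proposal.
+const (
+  externalTrafficAnnotation     = "service.alpha.kubernetes.io/external-traffic"
+  healthCheckNodePortAnnotation = "service.alpha.kubernetes.io/healthcheck-nodeport"
+)
+
+// wantsOnlyLocal reports whether a Service has opted in to the new OnlyLocal
+// (ESIPP) behaviour, and returns the allocated healthcheck nodePort if set.
+func wantsOnlyLocal(annotations map[string]string) (bool, int) {
+  onlyLocal := annotations[externalTrafficAnnotation] == "OnlyLocal"
+  port, _ := strconv.Atoi(annotations[healthCheckNodePortAnnotation])
+  return onlyLocal, port
+}
+
+func main() {
+  ann := map[string]string{
+    externalTrafficAnnotation:     "OnlyLocal",
+    healthCheckNodePortAnnotation: "30123",
+  }
+  local, port := wantsOnlyLocal(ann)
+  fmt.Println(local, port)
+}
+```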
+
+
+## Behavior Changes expected
+
+### External Traffic Blackholed on nodes with no local endpoints
+
+When the last endpoint on a node has gone away but the LB has not yet marked the node as unhealthy
+(worst-case window = (N+1) * HCP, where N = minimum number of failed healthchecks and HCP = Health Check Period),
+external traffic will still be steered to that node. This traffic will be blackholed and not forwarded
+to other endpoints elsewhere in the cluster.
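+For example, with N = 3 failed checks required and HCP = 2 seconds, the worst-case blackhole window is
+(3+1) * 2 = 8 seconds.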
+
+Internal pod to pod traffic should behave as before, with equal probability across all pods.
+
+### Traffic Balancing Changes
+
+GCE/AWS load balancers do not provide weights for their target pools. This was not an issue with the old LB
+kube-proxy rules which would correctly balance across all endpoints.
+
+With the new functionality, external traffic will not be equally load balanced across pods, but rather
+equally balanced at the node level (because GCE/AWS and other external LB implementations do not support
+specifying per-node weights, they balance equally across all target nodes, disregarding the number of
+pods on each node).
+
+We can, however, state that for NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal
+distribution will be seen, even without weights.
+
+Once the external load balancers provide weights, this functionality can be added to the LB programming path.
+*Future Work: No support for weights is provided for the 1.4 release, but may be added at a future date*
+
+## Cloud Provider support
+
+This feature is added as an opt-in annotation.
+Default behaviour of LoadBalancer type services will be unchanged for all Cloud providers.
+The annotation will be ignored by existing cloud provider libraries until they add support.
+
+### GCE 1.4
+
+For the 1.4 release, this feature will be implemented for the GCE cloud provider.
+
+#### GCE Expected Packet Source/Destination IP (Datapath)
+
+- Node: On the node, we expect to see the real source IP of the client. Destination IP will be the Service Virtual External IP.
+
+- Pod: For processes running inside the Pod network namespace, the source IP will be the real client source IP. The destination address will be the Pod IP.
+
+#### GCE Expected Packet Destination IP (HealthCheck path)
+
+kube-proxy listens for TCP health checks on the health check node port on all interfaces (`::`).
+This allows it to respond to health checks whether the destination IP is the VM IP or the Service Virtual External IP.
+In practice, tcpdump traces on GCE show the source IP is 169.254.169.254 and the destination address is the Service Virtual External IP.
+
+### AWS TBD
+
+TBD *discuss timelines and feasibility with Kubernetes sig-aws team members*
+
+### Openstack TBD
+
+This functionality may not be introduced in Openstack in the near term.
+
+*Note from Openstack team member @anguslees*
+Underlying vendor devices might be able to do this, but we only expose full-NAT/proxy loadbalancing through the OpenStack API (LBaaS v1/v2 and Octavia). So I'm afraid this will be unsupported on OpenStack, afaics.
+
+### Azure TBD
+
+*To be confirmed* For the 1.4 release, this feature will be implemented for the Azure cloud provider.
+
+## Testing
+
+The cases we should test are:
+
+1. Core Functionality Tests
+
+1.1 Source IP Preservation
+
+Test the main intent of this change, source ip preservation - use the all-in-one network tests container
+with new functionality that responds with the client IP. Verify the container is seeing the external IP
+of the test client.
+
+1.2 Health Check responses
+
+Test cases use pods explicitly pinned to nodes, and randomly add and delete pods on those nodes. Validate that healthchecks succeed
+and fail on the expected nodes as endpoints move around. Gather LB response times (from the time a pod declares ready to
+the time the Cloud LB declares the node healthy, and vice versa) for endpoint changes.
+
+2. Inter-Operability Tests
+
+Validate that internal cluster communications are still possible from nodes without local endpoints. This change
+is only for externally sourced traffic.
+
+3. Backward Compatibility Tests
+
+Validate that old and new functionality can simultaneously exist in a single cluster. Create services with and without
+the annotation, and validate datapath correctness.
+
+# Beta Design
+
+The only part of the design that changes for beta is the API, which is upgraded from
+annotation-based to first class fields.
+
+## API Changes from Alpha to Beta
+
+The `service.alpha.kubernetes.io/external-traffic` and `service.alpha.kubernetes.io/healthcheck-nodeport`
+annotations will switch to Service object fields.
+
+# Future work
+
+Post-1.4 feature ideas. These are not fully fleshed-out designs.
+
+
+
+# Appendix
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/external-lb-source-ip-preservation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federated-api-servers.md b/contributors/design-proposals/federated-api-servers.md
new file mode 100644
index 00000000..1d2d5ba1
--- /dev/null
+++ b/contributors/design-proposals/federated-api-servers.md
@@ -0,0 +1,209 @@
+# Federated API Servers
+
+## Abstract
+
+We want to divide the single monolithic API server into multiple federated
+servers. Anyone should be able to write their own federated API server to expose APIs they want.
+Cluster admins should be able to expose new APIs at runtime by bringing up new
+federated servers.
+
+## Motivation
+
+* Extensibility: We want to allow community members to write their own API
+ servers to expose APIs they want. Cluster admins should be able to use these
+ servers without having to require any change in the core kubernetes
+ repository.
+* Unblock new APIs from core kubernetes team review: A lot of new API proposals
+ are currently blocked on review from the core kubernetes team. By allowing
+ developers to expose their APIs as a separate server and enabling the cluster
+ admin to use it without any change to the core kubernetes repository, we
+ unblock these APIs.
+* Place for staging experimental APIs: New APIs can remain in separate
+ federated servers until they become stable, at which point, they can be moved
+ to the core kubernetes master, if appropriate.
+* Ensure that new APIs follow kubernetes conventions: Without the mechanism
+ proposed here, community members might be forced to roll their own thing which
+ may or may not follow kubernetes conventions.
+
+## Goal
+
+* Developers should be able to write their own API server and cluster admins
+ should be able to add them to their cluster, exposing new APIs at runtime. All
+ of this should not require any change to the core kubernetes API server.
+* These new APIs should be a seamless extension of the core kubernetes APIs (e.g.
+  they should be operated on via kubectl).
+
+## Non Goals
+
+The following are related but are not the goals of this specific proposal:
+* Make it easy to write a kubernetes API server.
+
+## High Level Architecture
+
+There will be 2 new components in the cluster:
+* A simple program to summarize discovery information from all the servers.
+* A reverse proxy to proxy client requests to individual servers.
+
+The reverse proxy is optional. Clients can discover server URLs using the
+summarized discovery information and contact them directly. Simple clients can
+always use the proxy.
+The same program can provide both discovery summarization and reverse proxy.
+
+### Constraints
+
+* Unique API groups across servers: Each API server (and groups of servers, in HA)
+ should expose unique API groups.
+* Follow API conventions: APIs exposed by every API server should adhere to [kubernetes API
+ conventions](../devel/api-conventions.md).
+* Support discovery API: Each API server should support the kubernetes discovery API
+  (list the supported groupVersions at `/apis` and list the supported resources
+  at `/apis/<groupVersion>/`).
+* No bootstrap problem: The core kubernetes server should not depend on any
+ other federated server to come up. Other servers can only depend on the core
+ kubernetes server.
+
+## Implementation Details
+
+### Summarizing discovery information
+
+We can have a very simple Go program to summarize discovery information from all
+servers. Cluster admins will register each federated API server (its baseURL and swagger
+spec path) with the proxy. The proxy will summarize the list of all group versions
+exposed by all registered API servers with their individual URLs at `/apis`.
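+
+As a rough sketch of this summarization step (with illustrative server URLs and minimal error handling),
+the program only needs to fetch `/apis` from each registered server and collect the group versions each
+one exposes:
+
+```
+package main
+
+import (
+  "encoding/json"
+  "fmt"
+  "net/http"
+)
+
+// apiGroupList mirrors just the fields of the /apis discovery document that
+// this sketch needs.
+type apiGroupList struct {
+  Groups []struct {
+    Name     string `json:"name"`
+    Versions []struct {
+      GroupVersion string `json:"groupVersion"`
+    } `json:"versions"`
+  } `json:"groups"`
+}
+
+// summarize fetches /apis from every registered federated API server and
+// prints which server serves each group version.
+func summarize(servers []string) error {
+  for _, base := range servers {
+    resp, err := http.Get(base + "/apis")
+    if err != nil {
+      return err
+    }
+    var list apiGroupList
+    err = json.NewDecoder(resp.Body).Decode(&list)
+    resp.Body.Close()
+    if err != nil {
+      return err
+    }
+    for _, g := range list.Groups {
+      for _, v := range g.Versions {
+        fmt.Printf("%s -> served by %s\n", v.GroupVersion, base)
+      }
+    }
+  }
+  return nil
+}
+
+func main() {
+  // Illustrative server URLs only.
+  if err := summarize([]string{"https://core.example.com", "https://addons.example.com"}); err != nil {
+    fmt.Println(err)
+  }
+}
+```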
+
+### Reverse proxy
+
+We can use any standard reverse proxy server like nginx or extend the same Go program that
+summarizes discovery information to act as reverse proxy for all federated servers.
+
+Cluster admins are also free to use any of the multiple open source API management tools
+(for example, there is [Kong](https://getkong.org/), which is written in lua and there is
+[Tyk](https://tyk.io/), which is written in Go). These API management tools
+provide a lot more functionality like: rate-limiting, caching, logging,
+transformations and authentication.
+In future, we can also use ingress. That will give cluster admins the flexibility to
+easily swap out the ingress controller by a Go reverse proxy, nginx, haproxy
+or any other solution they might want.
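+
+A minimal sketch of such a Go reverse proxy, assuming a static map from API group to backend URL
+(the group names and URLs below are made up for illustration), could look like this:
+
+```
+package main
+
+import (
+  "log"
+  "net/http"
+  "net/http/httputil"
+  "net/url"
+  "strings"
+)
+
+// backendForGroup maps an API group to the base URL of the federated API
+// server that owns it. The empty group routes legacy /api paths (and the
+// /apis discovery summary) to the core server.
+var backendForGroup = map[string]string{
+  "":                   "https://core.example.com",
+  "myteam.example.com":  "https://addons.example.com",
+}
+
+func main() {
+  http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
+    // Requests look like /apis/<group>/<version>/...; pick the backend by group.
+    parts := strings.Split(strings.TrimPrefix(r.URL.Path, "/"), "/")
+    group := ""
+    if len(parts) >= 2 && parts[0] == "apis" {
+      group = parts[1]
+    }
+    backend, ok := backendForGroup[group]
+    if !ok {
+      http.NotFound(w, r)
+      return
+    }
+    target, err := url.Parse(backend)
+    if err != nil {
+      http.Error(w, err.Error(), http.StatusInternalServerError)
+      return
+    }
+    httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
+  })
+  log.Fatal(http.ListenAndServe(":8443", nil))
+}
+```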
+
+### Storage
+
+Each API server is responsible for storing its own resources. It can have its
+own etcd or can use the kubernetes server's etcd using [third party
+resources](../design/extending-api.md#adding-custom-resources-to-the-kubernetes-api-server).
+
+### Health check
+
+Kubernetes server's `/api/v1/componentstatuses` will continue to report status
+of master components that it depends on (scheduler and various controllers).
+Since clients have access to server URLs, they can use that to do
+health check of individual servers.
+In future, if a global health check is required, we can expose a health check
+endpoint in the proxy that will report the status of all federated api servers
+in the cluster.
+
+### Auth
+
+Since the actual server which serves a client's request can be opaque to the client,
+all API servers need to have homogeneous authentication and authorization mechanisms.
+All API servers will handle authn and authz for their resources themselves.
+In future, we can also have the proxy do the auth and then have apiservers trust
+it (via client certs) to report the actual user in an X-something header.
+
+For now, we will trust system admins to configure homogeneous auth on all servers.
+Future proposals will refine how auth is managed across the cluster.
+
+### kubectl
+
+kubectl will talk to the discovery endpoint (or proxy) and use the discovery API to
+figure out the operations and resources supported in the cluster.
+Today, it uses RESTMapper to determine that. We will update kubectl code to populate
+RESTMapper using the discovery API so that we can add and remove resources
+at runtime.
+We will also need to make kubectl truly generic. Right now, a lot of operations
+(like get, describe) are hardcoded in the binary for all resources. A future
+proposal will provide details on moving those operations to server.
+
+Note that it is possible for kubectl to talk to individual servers directly in
+which case proxy will not be required at all, but this requires a bit more logic
+in kubectl. We can do this in future, if desired.
+
+### Handling global policies
+
+Now that we have resources spread across multiple API servers, we need to
+be careful to ensure that global policies (limit ranges, resource quotas, etc) are enforced.
+Future proposals will improve how this is done across the cluster.
+
+#### Namespaces
+
+When a namespaced resource is created in any of the federated servers, that
+server first needs to check with the kubernetes server that:
+
+* The namespace exists.
+* User has authorization to create resources in that namespace.
+* Resource quota for the namespace is not exceeded.
+
+To prevent race conditions, the kubernetes server might need to expose an atomic
+API for all these operations.
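+
+A minimal sketch of the namespace-existence part of this check is shown below; it assumes the federated
+server knows the core server's base URL, and it omits authentication as well as the authorization and
+quota checks:
+
+```
+package main
+
+import (
+  "fmt"
+  "net/http"
+)
+
+// namespaceExists asks the core kubernetes API server whether a namespace
+// exists by GETting /api/v1/namespaces/<name>. The server URL in main is a
+// placeholder.
+func namespaceExists(coreServer, namespace string) (bool, error) {
+  resp, err := http.Get(coreServer + "/api/v1/namespaces/" + namespace)
+  if err != nil {
+    return false, err
+  }
+  defer resp.Body.Close()
+  switch resp.StatusCode {
+  case http.StatusOK:
+    return true, nil
+  case http.StatusNotFound:
+    return false, nil
+  default:
+    return false, fmt.Errorf("unexpected status %s", resp.Status)
+  }
+}
+
+func main() {
+  ok, err := namespaceExists("http://localhost:8080", "my-namespace")
+  fmt.Println(ok, err)
+}
+```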
+
+While deleting a namespace, the kubernetes server needs to ensure that resources in
+that namespace maintained by other servers are deleted as well. We can do this
+using resource [finalizers](../design/namespaces.md#finalizers). Each server
+will add itself to the set of finalizers before it creates a resource in
+the corresponding namespace, and will delete all of its resources in that namespace
+whenever the namespace is to be deleted (the kubernetes API server already has this code; we
+will refactor it into a library to enable reuse).
+
+A future proposal will cover this in more detail and provide a better
+mechanism.
+
+#### Limit ranges and resource quotas
+
+kubernetes server maintains [resource quotas](../admin/resourcequota/README.md) and
+[limit ranges](../admin/limitrange/README.md) for all resources.
+Federated servers will need to check with the kubernetes server before creating any
+resource.
+
+## Running on hosted kubernetes cluster
+
+This proposal is not enough for hosted cluster users, but it allows us to improve
+that in the future.
+On a hosted kubernetes cluster, for example on GKE - where Google manages the kubernetes
+API server - users will have to bring up and maintain the proxy and federated servers
+themselves.
+Other system components, like the various controllers, will not be aware of the
+proxy and will only talk to the kubernetes API server.
+
+One possible solution to fix this is to update kubernetes API server to detect when
+there are federated servers in the cluster and then change its advertise address to
+the IP address of the proxy.
+A future proposal will cover this in more detail.
+
+## Alternatives
+
+There were other alternatives that we had discussed.
+
+* Instead of adding a proxy in front, let the core kubernetes server provide an
+  API for other servers to register themselves. It can also provide a discovery
+  API which the clients can use to discover other servers and then talk to them
+  directly. But this would have required another server API and a lot of client logic as well.
+* Validating federated servers: We can validate new servers when they are registered
+ with the proxy, or keep validating them at regular intervals, or validate
+ them only when explicitly requested, or not validate at all.
+ We decided that the proxy will just assume that all the servers are valid
+ (conform to our api conventions). In future, we can provide conformance tests.
+
+## Future Work
+
+* Validate servers: We should have some conformance tests that validate that the
+ servers follow kubernetes api-conventions.
+* Provide centralised auth service: It is very hard to ensure homogeneous auth
+ across multiple federated servers, especially in case of hosted clusters
+ (where different people control the different servers). We can fix it by
+ providing a centralised authentication and authorization service which all of
+ the servers can use.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-api-servers.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federated-ingress.md b/contributors/design-proposals/federated-ingress.md
new file mode 100644
index 00000000..05d36e1d
--- /dev/null
+++ b/contributors/design-proposals/federated-ingress.md
@@ -0,0 +1,223 @@
+# Kubernetes Federated Ingress
+
+ Requirements and High Level Design
+
+ Quinton Hoole
+
+ July 17, 2016
+
+## Overview/Summary
+
+[Kubernetes Ingress](https://github.com/kubernetes/kubernetes.github.io/blob/master/docs/user-guide/ingress.md)
+provides an abstraction for sophisticated L7 load balancing through a
+single IP address (and DNS name) across multiple pods in a single
+Kubernetes cluster. Multiple alternative underlying implementations
+are provided, including one based on GCE L7 load balancing and another
+using an in-cluster nginx/HAProxy deployment (for non-GCE
+environments). An AWS implementation, based on Elastic Load Balancers
+and Route53, is under way in the community.
+
+To extend the above to cover multiple clusters, Kubernetes Federated
+Ingress aims to provide a similar/identical API abstraction and,
+again, multiple implementations to cover various
+cloud-provider-specific as well as multi-cloud scenarios. The general
+model is to allow the user to instantiate a single Ingress object via
+the Federation API, and have it automatically provision all of the
+necessary underlying resources (L7 cloud load balancers, in-cluster
+proxies etc) to provide L7 load balancing across a service spanning
+multiple clusters.
+
+Four options are outlined:
+
+1. GCP only
+1. AWS only
+1. Cross-cloud via GCP in-cluster proxies (i.e. clients get to AWS and on-prem via GCP).
+1. Cross-cloud via AWS in-cluster proxies (i.e. clients get to GCP and on-prem via AWS).
+
+Option 1 is the:
+
+1. easiest/quickest,
+1. most featureful
+
+Recommendations:
+
++ Suggest tackling option 1 (GCP only) first (target beta in v1.4)
++ Thereafter option 3 (cross-cloud via GCP)
++ We should encourage/facilitate the community to tackle option 2 (AWS-only)
+
+## Options
+
+## Google Cloud Platform only - backed by GCE L7 Load Balancers
+
+This is an option for federations across clusters which all run on Google Cloud Platform (i.e. GCE and/or GKE)
+
+### Features
+
+In summary, all of [GCE L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/) features:
+
+1. Single global virtual (a.k.a. "anycast") IP address ("VIP" - no dependence on dynamic DNS)
+1. Geo-locality for both external and GCP-internal clients
+1. Load-based overflow to next-closest geo-locality (i.e. cluster). Based on either queries per second, or CPU load (unfortunately on the first-hop target VM, not the final destination K8s Service).
+1. URL-based request direction (different backend services can fulfill each different URL).
+1. HTTPS request termination (at the GCE load balancer, with server SSL certs)
+
+### Implementation
+
+1. Federation user creates (federated) Ingress object (the services
+ backing the ingress object must share the same nodePort, as they
+ share a single GCP health check).
+1. Federated Ingress Controller creates Ingress object in each cluster
+ in the federation (after [configuring each cluster ingress
+ controller to share the same ingress UID](https://gist.github.com/bprashanth/52648b2a0b6a5b637f843e7efb2abc97)).
+1. Each cluster-level Ingress Controller ("GLBC") creates Google L7
+ Load Balancer machinery (forwarding rules, target proxy, URL map,
+ backend service, health check) which ensures that traffic to the
+ Ingress (backed by a Service), is directed to the nodes in the cluster.
+1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance)
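+
+The per-cluster fan-out in step 2 could look roughly like the sketch below. The types here are
+hypothetical stand-ins (a real controller would use client-go informers and the actual Ingress API
+types against each member cluster); it only illustrates the create-or-update loop across clusters.
+
+```
+// Package sketch illustrates the per-cluster fan-out performed by the
+// Federated Ingress Controller in step 2 above.
+package sketch
+
+// IngressSpec is a minimal stand-in for the real Ingress object.
+type IngressSpec struct {
+  Namespace string
+  Name      string
+  Rules     []string // placeholder for real ingress rules
+}
+
+// ClusterClient is a hypothetical, minimal view of one member cluster's API.
+type ClusterClient interface {
+  GetIngress(namespace, name string) (*IngressSpec, bool, error)
+  CreateIngress(spec IngressSpec) error
+  UpdateIngress(spec IngressSpec) error
+}
+
+// ReconcileFederatedIngress ensures every member cluster carries an Ingress
+// equivalent to the federated one; the cluster-level ingress controllers then
+// build the cloud load balancer machinery themselves.
+func ReconcileFederatedIngress(fed IngressSpec, clusters []ClusterClient) error {
+  for _, c := range clusters {
+    existing, found, err := c.GetIngress(fed.Namespace, fed.Name)
+    if err != nil {
+      return err
+    }
+    switch {
+    case !found:
+      if err := c.CreateIngress(fed); err != nil {
+        return err
+      }
+    case !equalRules(existing.Rules, fed.Rules):
+      if err := c.UpdateIngress(fed); err != nil {
+        return err
+      }
+    }
+  }
+  return nil
+}
+
+func equalRules(a, b []string) bool {
+  if len(a) != len(b) {
+    return false
+  }
+  for i := range a {
+    if a[i] != b[i] {
+      return false
+    }
+  }
+  return true
+}
+```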
+
+An alternative implementation approach involves lifting the current
+Federated Ingress Controller functionality up into the Federation
+control plane. This alternative is not considered in any further
+detail in this document.
+
+### Outstanding work Items
+
+1. This should in theory all work out of the box. Need to confirm
+with a manual setup. ([#29341](https://github.com/kubernetes/kubernetes/issues/29341))
+1. Implement Federated Ingress:
+ 1. API machinery (~1 day)
+ 1. Controller (~3 weeks)
+1. Add DNS field to Ingress object (currently missing, but needs to be added, independent of federation)
+ 1. API machinery (~1 day)
+ 1. KubeDNS support (~ 1 week?)
+
+### Pros
+
+1. Global VIP is awesome - geo-locality, load-based overflow (but see caveats below)
+1. Leverages existing K8s Ingress machinery - not too much to add.
+1. Leverages existing Federated Service machinery - controller looks
+ almost identical, DNS provider also re-used.
+
+### Cons
+
+1. Only works across GCP clusters (but see below for a light at the end of the tunnel, for future versions).
+
+## Amazon Web Services only - backed by Route53
+
+This is an option for AWS-only federations. Parts of this are
+apparently work in progress; see e.g.
+[AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
+[[WIP/RFC] Simple ingress -> DNS controller, using AWS
+Route53](https://github.com/kubernetes/contrib/pull/841).
+
+### Features
+
+In summary, most of the features of [AWS Elastic Load Balancing](https://aws.amazon.com/elasticloadbalancing/) and [Route53 DNS](https://aws.amazon.com/route53/).
+
+1. Geo-aware DNS direction to closest regional elastic load balancer
+1. DNS health checks to route traffic to only healthy elastic load
+balancers
+1. A variety of possible DNS routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin
+1. Elastic Load Balancing automatically routes traffic across multiple
+ instances and multiple Availability Zones within the same region.
+1. Health checks ensure that only healthy Amazon EC2 instances receive traffic.
+
+### Implementation
+
+1. Federation user creates (federated) Ingress object
+1. Federated Ingress Controller creates Ingress object in each cluster in the federation
+1. Each cluster-level AWS Ingress Controller creates/updates
+ 1. (regional) AWS Elastic Load Balancer machinery which ensures that traffic to the Ingress (backed by a Service), is directed to one of the nodes in one of the clusters in the region.
+ 1. (global) AWS Route53 DNS machinery which ensures that clients are directed to the closest non-overloaded (regional) elastic load balancer.
+1. KubeProxy redirects to one of the backend Pods (currently round-robin, per KubeProxy instance) in the destination K8s cluster.
+
+### Outstanding Work Items
+
+Most of this is currently unimplemented; see [AWS Ingress controller](https://github.com/kubernetes/contrib/issues/346) and
+[[WIP/RFC] Simple ingress -> DNS controller, using AWS
+Route53](https://github.com/kubernetes/contrib/pull/841).
+
+1. K8s AWS Ingress Controller
+1. Re-uses all of the non-GCE specific Federation machinery discussed above under "GCP-only...".
+
+### Pros
+
+1. Geo-locality (via geo-DNS, not VIP)
+1. Load-based overflow
+1. Real load balancing (same caveats as for GCP above).
+1. L7 SSL connection termination.
+1. Seems it can be made to work for hybrid with on-premise (using VPC). More research required.
+
+### Cons
+
+1. K8s Ingress Controller still needs to be developed. Lots of work.
+1. geo-DNS based locality/failover is not as nice as VIP-based (but very useful, nonetheless)
+1. Only works on AWS (initial version, at least).
+
+## Cross-cloud via GCP
+
+### Summary
+
+Use GCP Federated Ingress machinery described above, augmented with additional HA-proxy backends in all GCP clusters to proxy to non-GCP clusters (via either Service External IP's, or VPN directly to KubeProxy or Pods).
+
+### Features
+
+As per GCP-only above, except that geo-locality would be to the closest GCP cluster (and possibly onwards to the closest AWS/on-prem cluster).
+
+### Implementation
+
+TBD - see the Summary above in the meantime.
+
+### Outstanding Work
+
+Assuming that GCP-only (see above) is complete:
+
+1. Wire-up the HA-proxy load balancers to redirect to non-GCP clusters
+1. Probably some more - additional detailed research and design necessary.
+
+### Pros
+
+1. Works for cross-cloud.
+
+### Cons
+
+1. Traffic to non-GCP clusters is proxied through GCP clusters, incurring additional bandwidth costs (3x?) in those cases.
+
+## Cross-cloud via AWS
+
+In theory the same approach as "Cross-cloud via GCP" above could be used, except that AWS infrastructure would be used to get traffic first to an AWS cluster, and then proxied onwards to non-AWS and/or on-prem clusters.
+Detail docs TBD.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federated-ingress.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federated-replicasets.md b/contributors/design-proposals/federated-replicasets.md
new file mode 100644
index 00000000..f1744ade
--- /dev/null
+++ b/contributors/design-proposals/federated-replicasets.md
@@ -0,0 +1,513 @@
+# Federated ReplicaSets
+
+# Requirements & Design Document
+
+This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion.
+
+Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
+Based on discussions with
+Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)
+
+## Overview
+
+### Summary & Vision
+
+When running a global application on a federation of Kubernetes
+clusters, the owner currently has to start it in multiple clusters and
+check whether they have enough application replicas running both
+locally in each of the clusters (so that, for example, users are
+handled by a nearby cluster, with low latency) and globally (so that
+there is always enough capacity to handle all traffic). If one of the
+clusters has issues or doesn't have enough capacity to run the given set of
+replicas, the replicas should be automatically moved to some other
+cluster to keep the application responsive.
+
+In single cluster Kubernetes there is a concept of ReplicaSet that
+manages the replicas locally. We want to expand this concept to the
+federation level.
+
+### Goals
+
++ Win large enterprise customers who want to easily run applications
+ across multiple clusters
++ Create a reference controller implementation to facilitate bringing
+ other Kubernetes concepts to Federated Kubernetes.
+
+## Glossary
+
+Federation Cluster - a cluster that is a member of federation.
+
+Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster
+that is a member of federation.
+
+Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of Federated K8S server.
+
+Federated ReplicaSet Controller (FRSC) - A controller running inside
+of the Federated K8S server that controls FRS.
+
+## User Experience
+
+### Critical User Journeys
+
++ [CUJ1] User wants to create a ReplicaSet in each of the federation
+  clusters. They create a definition of a federated ReplicaSet on the
+  federated master and (local) ReplicaSets are automatically created
+  in each of the federation clusters. The number of replicas in each
+  of the Local ReplicaSets is (perhaps indirectly) configurable by
+  the user.
++ [CUJ2] When the current number of replicas in a cluster drops below
+ the desired number and new replicas cannot be scheduled then they
+ should be started in some other cluster.
+
+### Features Enabling Critical User Journeys
+
+Feature #1 -> CUJ1:
+A component which looks for newly created Federated ReplicaSets and
+creates the appropriate Local ReplicaSet definitions in the federated
+clusters.
+
+Feature #2 -> CUJ2:
+A component that checks how many replicas are actually running in each
+of the subclusters and whether the number matches the
+FederatedReplicaSet preferences (by default, spread replicas evenly
+across the clusters, but custom preferences are allowed - see
+below). If it doesn't, and the situation is unlikely to improve soon,
+then the replicas should be moved to other subclusters.
+
+### API and CLI
+
+All interaction with a FederatedReplicaSet will be done by issuing
+kubectl commands pointing at the Federated Master API Server. All the
+commands will behave in a similar way as on the regular master;
+however, in later versions (1.5+) some of the commands may give
+slightly different output. For example, kubectl describe on a federated
+replica set should also give some information about the subclusters.
+
+Moreover, for safety, some defaults will be different. For example, for
+kubectl delete federatedreplicaset, cascade will be set to false.
+
+FederatedReplicaSet would have the same object as local ReplicaSet
+(although it will be accessible in a different part of the
+api). Scheduling preferences (how many replicas in which cluster) will
+be passed as annotations.
+
+### FederatedReplicaSet preferences
+
+The preferences are expressed by the following structure, passed as a
+serialized json inside annotations.
+
+```
+type FederatedReplicaSetPreferences struct {
+  // If set to true then already scheduled and running replicas may be moved to other clusters
+  // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false,
+  // up and running replicas will not be moved.
+  Rebalance bool `json:"rebalance,omitempty"`
+
+  // Map from cluster name to preferences for that cluster. It is assumed that if a cluster
+  // doesn't have a matching entry then it should not have local replicas. The cluster matches
+  // "*" if there is no entry with the real cluster name.
+  Clusters map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
+}
+
+// Preferences regarding the number of replicas assigned to a cluster replicaset within a federated replicaset.
+type ClusterReplicaSetPreferences struct {
+  // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
+  MinReplicas int64 `json:"minReplicas,omitempty"`
+
+  // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
+  MaxReplicas *int64 `json:"maxReplicas,omitempty"`
+
+  // A number expressing the preference to put an additional replica to this Local ReplicaSet. 0 by default.
+  Weight int64 `json:"weight,omitempty"`
+}
+```
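+
+As a minimal illustration of how these preferences travel as a serialized JSON annotation, consider
+the sketch below; the annotation key is a placeholder, since this proposal does not fix its name.
+
+```
+package main
+
+import (
+  "encoding/json"
+  "fmt"
+)
+
+// Minimal copies of the preference types above, just enough to show how the
+// preferences are carried as an annotation value.
+type ClusterReplicaSetPreferences struct {
+  MinReplicas int64  `json:"minReplicas,omitempty"`
+  MaxReplicas *int64 `json:"maxReplicas,omitempty"`
+  Weight      int64  `json:"weight,omitempty"`
+}
+
+type FederatedReplicaSetPreferences struct {
+  Rebalance bool                                    `json:"rebalance,omitempty"`
+  Clusters  map[string]ClusterReplicaSetPreferences `json:"clusters,omitempty"`
+}
+
+// Placeholder annotation key; the proposal does not define the final name.
+const preferencesAnnotation = "federation.alpha.kubernetes.io/replica-set-preferences"
+
+func main() {
+  prefs := FederatedReplicaSetPreferences{
+    Rebalance: true,
+    Clusters: map[string]ClusterReplicaSetPreferences{
+      "*": {Weight: 1},
+      "C": {MaxReplicas: int64Ptr(20), Weight: 1},
+    },
+  }
+  raw, _ := json.Marshal(prefs)
+  annotations := map[string]string{preferencesAnnotation: string(raw)}
+  fmt.Println(annotations[preferencesAnnotation])
+}
+
+func int64Ptr(v int64) *int64 { return &v }
+```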
+
+How this works in practice:
+
+**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ Weight: 1}
+ }
+}
+```
+
+Example:
+
++ Clusters A,B,C, all have capacity.
+ Replica layout: A=16 B=17 C=17.
++ Clusters A,B,C and C has capacity for 6 replicas.
+ Replica layout: A=22 B=22 C=6
++ Clusters A,B,C. B and C are offline:
+ Replica layout: A=50
+
+**Scenario 2**. I want to have only 2 replicas in each of the clusters.
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1}
+ }
+}
+```
+
+Or
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 }
+ }
+ }
+
+```
+
+Or
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2}
+ }
+}
+```
+
+There is a global target for 50, however if there are 3 clusters there will be only 6 replicas running.
+
+**Scenario 3**. I want to have 20 replicas in each of 3 clusters.
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0}
+ }
+}
+```
+
+There is a global target of 50, however the cluster minimums add up to 60, so some clusters will have fewer replicas.
+ Replica layout: A=20 B=20 C=10.
+
+**Scenario 4**. I want to have equal number of replicas in clusters A,B,C, however don’t put more than 20 replicas to cluster C.
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : true
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ Weight: 1}
+ “C” : LocalReplicaSet{ MaxReplicas: 20, Weight: 1}
+ }
+}
+```
+
+Example:
+
++ All have capacity.
+ Replica layout: A=16 B=17 C=17.
++ B is offline/has no capacity
+ Replica layout: A=30 B=0 C=20
++ A and B are offline:
+ Replica layout: C=20
+
+**Scenario 5**. I want to run my application in cluster A, however if there are troubles FRS can also use clusters B and C, equally.
+
+```
+FederatedReplicaSetPreferences {
+ Clusters : map[string]LocalReplicaSet {
+ “A” : LocalReplicaSet{ Weight: 1000000}
+ “B” : LocalReplicaSet{ Weight: 1}
+ “C” : LocalReplicaSet{ Weight: 1}
+ }
+}
+```
+
+Example:
+
++ All have capacity.
+ Replica layout: A=50 B=0 C=0.
++ A has capacity for only 40 replicas
+ Replica layout: A=40 B=5 C=5
+
+**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS than other clusters.
+
+```
+FederatedReplicaSetPreferences {
+ Clusters : map[string]LocalReplicaSet {
+ “A” : LocalReplicaSet{ Weight: 2}
+ “B” : LocalReplicaSet{ Weight: 1}
+ “C” : LocalReplicaSet{ Weight: 1}
+ }
+}
+```
+
+**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
+are already some replicas, please do not move them. Config:
+
+```
+FederatedReplicaSetPreferences {
+ Rebalance : false
+ Clusters : map[string]LocalReplicaSet {
+ "*" : LocalReplicaSet{ Weight: 1}
+ }
+}
+```
+
+Example:
+
++ Clusters A,B,C, all have capacity, but A already has 20 replicas
+ Replica layout: A=20 B=15 C=15.
++ Clusters A,B,C and C has capacity for 6 replicas, A has already 20 replicas.
+ Replica layout: A=22 B=22 C=6
++ Clusters A,B,C and C has capacity for 6 replicas, A has already 30 replicas.
+ Replica layout: A=30 B=14 C=6
+
+## The Idea
+
+A new federated controller - Federated Replica Set Controller (FRSC)
+will be created inside federated controller manager. Below are
+enumerated the key idea elements:
+
++ [I0] It is considered OK to have slightly higher number of replicas
+ globally for some time.
+
++ [I1] FRSC starts an informer on the FederatedReplicaSet that listens
+ on FRS being created, updated or deleted. On each create/update the
+ scheduling code will be started to calculate where to put the
+ replicas. The default behavior is to start the same amount of
+ replicas in each of the cluster. While creating LocalReplicaSets
+ (LRS) the following errors/issues can occur:
+
+ + [E1] Master rejects LRS creation (for known or unknown
+ reason). In this case another attempt to create a LRS should be
+ attempted in 1m or so. This action can be tied with
+    [[I5]](#heading=h.ififs95k9rng). Until the LRS is created
+ the situation is the same as [E5]. If this happens multiple
+ times all due replicas should be moved elsewhere and later moved
+ back once the LRS is created.
+
+ + [E2] LRS with the same name but different configuration already
+ exists. The LRS is then overwritten and an appropriate event
+ created to explain what happened. Pods under the control of the
+ old LRS are left intact and the new LRS may adopt them if they
+ match the selector.
+
+ + [E3] LRS is new but the pods that match the selector exist. The
+ pods are adopted by the RS (if not owned by some other
+ RS). However they may have a different image, configuration
+ etc. Just like with regular LRS.
+
++ [I2] For each of the cluster FRSC starts a store and an informer on
+ LRS that will listen for status updates. These status changes are
+ only interesting in case of troubles. Otherwise it is assumed that
+ LRS runs trouble free and there is always the right number of pod
+ created but possibly not scheduled.
+
+
+ + [E4] LRS is manually deleted from the local cluster. In this case
+ a new LRS should be created. It is the same case as
+ [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind
+ won’t be killed and will be adopted after the LRS is recreated.
+
+  + [E5] LRS fails to create (not necessarily schedule) the desired
+ number of pods due to master troubles, admission control
+ etc. This should be considered as the same situation as replicas
+ unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)).
+
+ + [E6] It is impossible to tell that an informer lost connection
+ with a remote cluster or has other synchronization problem so it
+ should be handled by cluster liveness probe and deletion
+ [[I6]](#heading=h.z90979gc2216).
+
++ [I3] For each of the cluster start an store and informer to monitor
+ whether the created pods are eventually scheduled and what is the
+ current number of correctly running ready pods. Errors:
+
+ + [E7] It is impossible to tell that an informer lost connection
+ with a remote cluster or has other synchronization problem so it
+ should be handled by cluster liveness probe and deletion
+ [[I6]](#heading=h.z90979gc2216)
+
++ [I4] It is assumed that an unscheduled pod is a normal situation
+and can last up to X min if there is heavy traffic on the
+cluster. However, if the replicas are not scheduled in that time, then
+FRSC should consider moving most of the unscheduled replicas
+elsewhere. For that purpose FRSC will maintain a data structure
+where for each FRS-controlled LRS we store a list of pods belonging
+to that LRS along with their current status and status change timestamp.
+
++ [I5] If a new cluster is added to the federation then it doesn’t
+ have a LRS and the situation is equal to
+ [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef).
+
++ [I6] If a cluster is removed from the federation then the situation
+ is equal to multiple [E4]. It is assumed that if a connection with
+  a cluster is lost completely then the cluster is removed from
+  the cluster list (or marked accordingly) so
+ [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda)
+ don’t need to be handled.
+
++ [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
+ checked against the current list of clusters, and all missing LRS
+ are created. This will be executed in combination with [I8].
+
++ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min
+ (configurable) to check whether some replica move between clusters
+ is needed or not.
+
++ FRSC never moves replicas to LRS that have not scheduled/running
+pods or that has pods that failed to be created.
+
+  + When FRSC notices that a number of pods are not scheduled/running,
+    or not even created, in one LRS for more than Y minutes, it takes
+    most of them away from that LRS, leaving a couple still waiting so that once
+    they are scheduled FRSC will know that it is ok to put some more
+    replicas in that cluster.
+
++ [I9] FRS becomes ToBeChecked if:
+ + It is newly created
+ + Some replica set inside changed its status
+ + Some pods inside cluster changed their status
+ + Some cluster is added or deleted.
+> FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
+
+## (RE)Scheduling algorithm
+
+To calculate the (re)scheduling moves for a given FRS:
+
+1. For each cluster, FRSC calculates the number of replicas that are placed
+(not necessarily up and running) in the cluster and the number of replicas that
+failed to be scheduled. Cluster capacity is the difference between
+the placed replicas and those that failed to be scheduled.
+
+2. Order all clusters by their weight and a hash of the name so that every time
+we process the same replica set we process the clusters in the same order.
+Include the federated replica set name in the cluster name hash so that we get
+slightly different ordering for different RSs, and not all RSs of size 1
+end up on the same cluster.
+
+3. Assign the minimum preferred number of replicas to each of the clusters, if
+there are enough replicas and capacity.
+
+4. If rebalance = false, assign the previously present replicas to the clusters and
+remember the number of extra replicas added (ER) - again, only if there
+are enough replicas and capacity.
+
+5. Distribute the remaining replicas with regard to weights and cluster capacity.
+In multiple iterations, calculate how many of the replicas should end up in each cluster.
+For each of the clusters, cap the number of assigned replicas by the maximum number of replicas and the
+cluster capacity. If there were some extra replicas added to the cluster in step
+4, don't actually add new replicas but balance them against the ER from step 4.
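+
+The sketch below is a simplified, illustrative version of steps 2, 3 and 5 only: step 4 (the
+rebalance=false handling) is omitted, clusters are ordered by a plain name sort instead of the
+weight/hash ordering, and the placement of remainder replicas may differ slightly from the scenario
+examples above.
+
+```
+package main
+
+import (
+  "fmt"
+  "sort"
+)
+
+// clusterState is what this sketch needs to know about one cluster: the
+// preferences plus the observed capacity.
+type clusterState struct {
+  weight   int64 // Weight from the preferences
+  min      int64 // MinReplicas from the preferences
+  max      int64 // MaxReplicas; -1 means unbounded
+  capacity int64 // placed minus failed-to-schedule replicas (step 1 above)
+}
+
+// distribute assigns minimums first, then hands out the remaining replicas one
+// at a time to the cluster with the lowest assigned/weight ratio, skipping
+// clusters with no room left.
+func distribute(total int64, clusters map[string]clusterState) map[string]int64 {
+  names := make([]string, 0, len(clusters))
+  for name := range clusters {
+    names = append(names, name)
+  }
+  sort.Strings(names) // deterministic order (simplified step 2)
+
+  assigned := map[string]int64{}
+  remaining := total
+
+  // Step 3: satisfy MinReplicas while replicas and capacity remain.
+  for _, name := range names {
+    c := clusters[name]
+    n := min64(c.min, min64(remaining, c.capacity))
+    assigned[name] = n
+    remaining -= n
+  }
+
+  // Step 5: weighted distribution of the rest, capped by MaxReplicas and capacity.
+  for remaining > 0 {
+    best := ""
+    for _, name := range names {
+      c := clusters[name]
+      if c.weight <= 0 || room(c, assigned[name]) <= 0 {
+        continue
+      }
+      // Prefer the cluster whose (assigned+1)/weight ratio is lowest.
+      if best == "" || (assigned[name]+1)*clusters[best].weight < (assigned[best]+1)*c.weight {
+        best = name
+      }
+    }
+    if best == "" {
+      break // no cluster can accept more replicas
+    }
+    assigned[best]++
+    remaining--
+  }
+  return assigned
+}
+
+// room is how many more replicas a cluster can take, given caps and capacity.
+func room(c clusterState, already int64) int64 {
+  r := c.capacity - already
+  if c.max >= 0 && c.max-already < r {
+    r = c.max - already
+  }
+  if r < 0 {
+    return 0
+  }
+  return r
+}
+
+func min64(a, b int64) int64 {
+  if a < b {
+    return a
+  }
+  return b
+}
+
+func main() {
+  // Scenario 4 style input: equal weights, at most 20 replicas in C, plenty of capacity.
+  fmt.Println(distribute(50, map[string]clusterState{
+    "A": {weight: 1, max: -1, capacity: 1000},
+    "B": {weight: 1, max: -1, capacity: 1000},
+    "C": {weight: 1, max: 20, capacity: 1000},
+  }))
+}
+```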
+
+## Goroutines layout
+
++ [GR1] Involved in FRS informer (see
+ [[I1]]). Whenever a FRS is created and
+ updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with
+ delay 0.
+
++ [GR2_1...GR2_N] Involved in informers/store on LRS (see
+ [[I2]]). On all changes the FRS is put on
+ FRS_TO_CHECK_QUEUE with delay 1min.
+
++ [GR3_1...GR3_N] Involved in informers/store on Pods
+ (see [[I3]] and [[I4]]). They maintain the status store
+ so that for each of the LRS we know the number of pods that are
+ actually running and ready in O(1) time. They also put the
+ corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min.
+
++ [GR4] Involved in cluster informer (see
+ [[I5]] and [[I6]] ). It puts all FRS on FRS_TO_CHECK_QUEUE
+ with delay 0.
+
++ [GR5_*] Go routines handling FRS_TO_CHECK_QUEUE that put FRS on
+ FRS_CHANNEL after the given delay (and remove from
+ FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
+ FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
+ shorter delay is used.
+
++ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever
+  a FRS is received it is put on a work queue. The work queue has no delay
+  and makes sure that a single replica set is processed by
+  only one goroutine.
+
++ [GR7_*] Goroutines handling the work queue. They fire DoFrsCheck on the FRS.
+  Multiple replica sets can be processed in parallel, but two goroutines cannot
+  process the same FRS at the same time.
+
+
+## Func DoFrsCheck
+
+The function does [[I7]] and [[I8]]. It is assumed that it is run on a
+single thread/goroutine per FRS, so we never check and evaluate the same FRS on many
+goroutines at once (however, if needed, the function can be parallelized across
+different FRS). It takes data only from the stores maintained by GR2_* and
+GR3_*. External communication is only required to:
+
++ Create LRS. If a LRS doesn't exist it is created after the
+  rescheduling, when we know how many replicas it should have.
+
++ Update LRS replica targets.
+
+If FRS is not in the desired state then it is put to
+FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing).
+
+## Monitoring and status reporting
+
+FRSC should expose a number of metrics from the run, such as:
+
++ FRSC -> LRS communication latency
++ Total times spent in various elements of DoFrsCheck
+
+FRSC should also expose the status of FRS as an annotation on FRS and
+as events.
+
+## Workflow
+
+Here is the sequence of tasks that need to be done in order for a
+typical FRS to be split into a number of LRS’s and to be created in
+the underlying federated clusters.
+
+Note a: the reason the workflow would be helpful at this phase is that
+for every one or two steps we can create PRs accordingly to start with
+the development.
+
+Note b: we assume that the federation is already in place and the
+federated clusters are added to the federation.
+
+Step 1. the client sends an RS create request to the
+federation-apiserver
+
+Step 2. federation-apiserver persists an FRS into the federation etcd
+
+Note c: federation-apiserver populates the clusterid field in the FRS
+before persisting it into the federation etcd
+
+Step 3: the federation-level “informer” in FRSC watches federation
+etcd for new/modified FRS’s, with empty clusterid or clusterid equal
+to federation ID, and if detected, it calls the scheduling code
+
+Step 4.
+
+Note d: scheduler populates the clusterid field in the LRS with the
+IDs of target clusters
+
+Note e: at this point let us assume that it only does the even
+distribution, i.e., equal weights for all of the underlying clusters
+
+Step 5. As soon as the scheduler function returns control to FRSC,
+the FRSC starts a number of cluster-level “informer”s, one per
+target cluster, to watch changes in every target cluster's etcd
+regarding the posted LRS's, and if any violation of the scheduled
+number of replicas is detected the scheduling code is re-called for
+re-scheduling purposes.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federated-services.md b/contributors/design-proposals/federated-services.md
new file mode 100644
index 00000000..b9d51c43
--- /dev/null
+++ b/contributors/design-proposals/federated-services.md
@@ -0,0 +1,517 @@
+# Kubernetes Cluster Federation (previously nicknamed "Ubernetes")
+
+## Cross-cluster Load Balancing and Service Discovery
+
+### Requirements and System Design
+
+### by Quinton Hoole, Dec 3 2015
+
+## Requirements
+
+### Discovery, Load-balancing and Failover
+
+1. **Internal discovery and connection**: Pods/containers (running in
+   a Kubernetes cluster) must be able to easily discover and connect
+   to endpoints for Kubernetes services on which they depend in a
+   consistent way, irrespective of whether those services exist in a
+   different kubernetes cluster within the same cluster federation.
+   These are henceforth referred to as "cluster-internal clients", or simply
+   "internal clients".
+1. **External discovery and connection**: External clients (running
+ outside a Kubernetes cluster) must be able to discover and connect
+ to endpoints for Kubernetes services on which they depend.
+   1. **External clients predominantly speak HTTP(S)**: External
+      clients are most often, but not always, web browsers, or at
+      least speak HTTP(S) - notable exceptions include Enterprise
+      Message Buses (Java, TLS), DNS servers (UDP),
+      SIP servers and databases.
+1. **Find the "best" endpoint:** Upon initial discovery and
+ connection, both internal and external clients should ideally find
+ "the best" endpoint if multiple eligible endpoints exist. "Best"
+ in this context implies the closest (by network topology) endpoint
+ that is both operational (as defined by some positive health check)
+ and not overloaded (by some published load metric). For example:
+ 1. An internal client should find an endpoint which is local to its
+ own cluster if one exists, in preference to one in a remote
+ cluster (if both are operational and non-overloaded).
+ Similarly, one in a nearby cluster (e.g. in the same zone or
+ region) is preferable to one further afield.
+ 1. An external client (e.g. in New York City) should find an
+ endpoint in a nearby cluster (e.g. U.S. East Coast) in
+ preference to one further away (e.g. Japan).
+1. **Easy fail-over:** If the endpoint to which a client is connected
+ becomes unavailable (no network response/disconnected) or
+ overloaded, the client should reconnect to a better endpoint,
+ somehow.
+ 1. In the case where there exist one or more connection-terminating
+ load balancers between the client and the serving Pod, failover
+ might be completely automatic (i.e. the client's end of the
+ connection remains intact, and the client is completely
+ oblivious of the fail-over). This approach incurs network speed
+ and cost penalties (by traversing possibly multiple load
+ balancers), but requires zero smarts in clients, DNS libraries,
+ recursing DNS servers etc, as the IP address of the endpoint
+ remains constant over time.
+ 1. In a scenario where clients need to choose between multiple load
+ balancer endpoints (e.g. one per cluster), multiple DNS A
+ records associated with a single DNS name enable even relatively
+ dumb clients to try the next IP address in the list of returned
+ A records (without even necessarily re-issuing a DNS resolution
+ request). For example, all major web browsers will try all A
+ records in sequence until a working one is found (TBD: justify
+ this claim with details for Chrome, IE, Safari, Firefox).
+ 1. In a slightly more sophisticated scenario, upon disconnection, a
+ smarter client might re-issue a DNS resolution query, and
+ (modulo DNS record TTL's which can typically be set as low as 3
+ minutes, and buggy DNS resolvers, caches and libraries which
+ have been known to completely ignore TTL's), receive updated A
+ records specifying a new set of IP addresses to which to
+ connect.
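+
+As a concrete sketch of the "easy fail-over" requirement above - a well-written client trying the
+returned DNS A records in sequence - the fallback might look roughly like this (the service DNS name
+and port are illustrative only):
+
+```
+package main
+
+import (
+  "fmt"
+  "net"
+  "time"
+)
+
+// dialService resolves a federated service's DNS name and tries the returned
+// A records in order until one accepts a TCP connection.
+func dialService(host, port string) (net.Conn, error) {
+  ips, err := net.LookupIP(host)
+  if err != nil {
+    return nil, err
+  }
+  var lastErr error
+  for _, ip := range ips {
+    conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip.String(), port), 2*time.Second)
+    if err == nil {
+      return conn, nil // first healthy endpoint wins
+    }
+    lastErr = err // endpoint unreachable: fall through to the next A record
+  }
+  return nil, fmt.Errorf("no reachable endpoint for %s: %v", host, lastErr)
+}
+
+func main() {
+  conn, err := dialService("my-service.my-namespace.federation-1.example.com", "2379")
+  if err != nil {
+    fmt.Println(err)
+    return
+  }
+  defer conn.Close()
+  fmt.Println("connected to", conn.RemoteAddr())
+}
+```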
+
+### Portability
+
+A Kubernetes application configuration (e.g. for a Pod, Replication
+Controller, Service etc) should be able to be successfully deployed
+into any Kubernetes Cluster or Federation of Clusters,
+without modification. More specifically, a typical configuration
+should work correctly (although possibly not optimally) across any of
+the following environments:
+
+1. A single Kubernetes Cluster on one cloud provider (e.g. Google
+ Compute Engine, GCE).
+1. A single Kubernetes Cluster on a different cloud provider
+ (e.g. Amazon Web Services, AWS).
+1. A single Kubernetes Cluster on a non-cloud, on-premise data center
+1. A Federation of Kubernetes Clusters all on the same cloud provider
+ (e.g. GCE).
+1. A Federation of Kubernetes Clusters across multiple different cloud
+ providers and/or on-premise data centers (e.g. one cluster on
+ GCE/GKE, one on AWS, and one on-premise).
+
+### Trading Portability for Optimization
+
+It should be possible to explicitly opt out of portability across some
+subset of the above environments in order to take advantage of
+non-portable load balancing and DNS features of one or more
+environments. More specifically, for example:
+
+1. For HTTP(S) applications running on GCE-only Federations,
+ [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+ should be usable. These provide single, static global IP addresses
+ which load balance and fail over globally (i.e. across both regions
+ and zones). These allow for really dumb clients, but they only
+ work on GCE, and only for HTTP(S) traffic.
+1. For non-HTTP(S) applications running on GCE-only Federations within
+ a single region,
+ [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
+ should be usable. These provide TCP (i.e. both HTTP/S and
+ non-HTTP/S) load balancing and failover, but only on GCE, and only
+ within a single region.
+ [Google Cloud DNS](https://cloud.google.com/dns) can be used to
+ route traffic between regions (and between different cloud
+ providers and on-premise clusters, as it's plain DNS, IP only).
+1. For applications running on AWS-only Federations,
+ [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+ should be usable. These provide both L7 (HTTP(S)) and L4 load
+ balancing, but only within a single region, and only on AWS
+ ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
+ used to load balance and fail over across multiple regions, and is
+ also capable of resolving to non-AWS endpoints).
+
+## Component Cloud Services
+
+Cross-cluster Federated load balancing is built on top of the following:
+
+1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+ provide single, static global IP addresses which load balance and
+ fail over globally (i.e. across both regions and zones). These
+ allow for really dumb clients, but they only work on GCE, and only
+ for HTTP(S) traffic.
+1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
+ provide both HTTP(S) and non-HTTP(S) load balancing and failover,
+ but only on GCE, and only within a single region.
+1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+ provide both L7 (HTTP(S)) and L4 load balancing, but only within a
+ single region, and only on AWS.
+1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
+ programmable DNS service, like
+   [CloudFlare](http://www.cloudflare.com)) can be used to route
+ traffic between regions (and between different cloud providers and
+ on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
+ doesn't provide any built-in geo-DNS, latency-based routing, health
+ checking, weighted round robin or other advanced capabilities.
+ It's plain old DNS. We would need to build all the aforementioned
+ on top of it. It can provide internal DNS services (i.e. serve RFC
+ 1918 addresses).
+ 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
+ be used to load balance and fail over across regions, and is also
+      capable of routing to non-AWS endpoints. It provides built-in
+ geo-DNS, latency-based routing, health checking, weighted
+ round robin and optional tight integration with some other
+ AWS services (e.g. Elastic Load Balancers).
+1. Kubernetes L4 Service Load Balancing: This provides both a
+ [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
+ and a
+ [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
+ service IP which is load-balanced (currently simple round-robin)
+ across the healthy pods comprising a service within a single
+ Kubernetes cluster.
+1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html):
+   a generic wrapper around cloud-provided L4 and L7 load balancing services,
+   and roll-your-own load balancers running in pods, e.g. HAProxy (a minimal
+   example follows below).
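+
+For illustration only, a minimal Ingress object of that era might look
+roughly as follows (the backend service name and port are borrowed from the
+examples later in this document, and are assumptions rather than
+prescriptions):
+
+    apiVersion: extensions/v1beta1
+    kind: Ingress
+    metadata:
+      name: my-service
+    spec:
+      backend:
+        serviceName: my-service
+        servicePort: 2379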
+
+## Cluster Federation API
+
+The Cluster Federation API for load balancing should be compatible with the equivalent
+Kubernetes API, to ease porting of clients between Kubernetes and
+federations of Kubernetes clusters.
+Further details below.
+
+## Common Client Behavior
+
+To be useful, our load balancing solution needs to work properly with real
+client applications. There are a few different classes of those...
+
+### Browsers
+
+These are the most common external clients. They are all well-written; see the next category below.
+
+### Well-written clients
+
+1. Do a DNS resolution every time they connect.
+1. Don't cache beyond TTL (although a small percentage of the DNS
+ servers on which they rely might).
+1. Do try multiple A records (in order) to connect.
+1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
+
+Examples:
+
++ all common browsers (except for SRV records)
++ ...
+
+### Dumb clients
+
+1. Don't do a DNS resolution every time they connect (or do cache beyond the
+TTL).
+1. Do try multiple A records.
+
+Examples:
+
++ ...
+
+### Dumber clients
+
+1. Only do a DNS lookup once on startup.
+1. Only try the first returned DNS A record.
+
+Examples:
+
++ ...
+
+### Dumbest clients
+
+1. Never do a DNS lookup - are pre-configured with a single (or possibly
+multiple) fixed server IP(s). Nothing else matters.
+
+## Architecture and Implementation
+
+### General Control Plane Architecture
+
+Each cluster hosts one or more Cluster Federation master components (Federation API
+servers, controller managers with leader election, and etcd quorum members). This
+is documented in more detail in a separate design doc:
+[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
+
+In the description below, assume that 'n' clusters, named 'cluster-1'...
+'cluster-n', have been registered against a Cluster Federation "federation-1",
+each with their own set of Kubernetes API endpoints, i.e.
+[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
+[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
+... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).
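+
+The examples that follow address the federation and the individual clusters
+via kubectl contexts named "federation-1", "cluster-1" and so on. Purely as an
+illustrative sketch (the federation API server address below is hypothetical),
+such contexts might be configured along these lines:
+
+    $ kubectl config set-cluster cluster-1 --server=http://endpoint-1.cluster-1
+    $ kubectl config set-context cluster-1 --cluster=cluster-1
+    $ kubectl config set-cluster federation-1 --server=https://federation-api.my-domain.com
+    $ kubectl config set-context federation-1 --cluster=federation-1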
+
+### Federated Services
+
+Federated Services are pretty straightforward. They're composed of multiple
+equivalent underlying Kubernetes Services, each with their own external
+endpoint, and a load balancing mechanism across them. Let's work through how
+exactly that works in practice.
+
+Our user creates the following Federated Service (against a Federation
+API endpoint):
+
+ $ kubectl create -f my-service.yaml --context="federation-1"
+
+where `my-service.yaml` contains the following:
+
+    apiVersion: v1
+    kind: Service
+ metadata:
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ spec:
+ ports:
+ - port: 2379
+ protocol: TCP
+ targetPort: 2379
+ name: client
+ - port: 2380
+ protocol: TCP
+ targetPort: 2380
+ name: peer
+ selector:
+ run: my-service
+ type: LoadBalancer
+
+The Cluster Federation control system in turn creates one equivalent service (identical config to the above)
+in each of the underlying Kubernetes clusters, each of which results in
+something like this:
+
+ $ kubectl get -o yaml --context="cluster-1" service my-service
+
+ apiVersion: v1
+ kind: Service
+ metadata:
+ creationTimestamp: 2015-11-25T23:35:25Z
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ resourceVersion: "147365"
+ selfLink: /api/v1/namespaces/my-namespace/services/my-service
+ uid: 33bfc927-93cd-11e5-a38c-42010af00002
+ spec:
+ clusterIP: 10.0.153.185
+ ports:
+ - name: client
+ nodePort: 31333
+ port: 2379
+ protocol: TCP
+ targetPort: 2379
+ - name: peer
+ nodePort: 31086
+ port: 2380
+ protocol: TCP
+ targetPort: 2380
+ selector:
+ run: my-service
+ sessionAffinity: None
+ type: LoadBalancer
+ status:
+ loadBalancer:
+ ingress:
+ - ip: 104.197.117.10
+
+Similar services are created in `cluster-2` and `cluster-3`, each of which is
+allocated its own `spec.clusterIP` and `status.loadBalancer.ingress.ip`.
+
+In the Cluster Federation `federation-1`, the resulting federated service looks as follows:
+
+ $ kubectl get -o yaml --context="federation-1" service my-service
+
+ apiVersion: v1
+ kind: Service
+ metadata:
+ creationTimestamp: 2015-11-25T23:35:23Z
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ resourceVersion: "157333"
+ selfLink: /api/v1/namespaces/my-namespace/services/my-service
+ uid: 33bfc927-93cd-11e5-a38c-42010af00007
+ spec:
+ clusterIP:
+ ports:
+ - name: client
+ nodePort: 31333
+ port: 2379
+ protocol: TCP
+ targetPort: 2379
+ - name: peer
+ nodePort: 31086
+ port: 2380
+ protocol: TCP
+ targetPort: 2380
+ selector:
+ run: my-service
+ sessionAffinity: None
+ type: LoadBalancer
+ status:
+ loadBalancer:
+ ingress:
+ - hostname: my-service.my-namespace.my-federation.my-domain.com
+
+Note that the federated service:
+
+1. Is API-compatible with a vanilla Kubernetes service.
+1. Has no clusterIP (as it is cluster-independent).
+1. Has a federation-wide load balancer hostname.
+
+In addition to the set of underlying Kubernetes services (one per cluster)
+described above, the Cluster Federation control system has also created a DNS name (e.g. on
+[Google Cloud DNS](https://cloud.google.com/dns) or
+[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration)
+which provides load balancing across all of those services. For example, in a
+very basic configuration:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
+
+Each of the above IP addresses (which are just the external load balancer
+ingress IP's of each cluster service) is of course load balanced across the pods
+comprising the service in each cluster.
+
+In a more sophisticated configuration (e.g. on GCE or GKE), the Cluster
+Federation control system
+automatically creates a
+[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+which exposes a single, globally load-balanced IP:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44
+
+Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS)
+in each Kubernetes cluster to preferentially return the local
+clusterIP for the service in that cluster, with other clusters'
+external service IP's (or a global load-balanced IP) also configured
+for failover purposes:
+
+ $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
+ my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157
+
+If Cluster Federation Global Service Health Checking is enabled, multiple service health
+checkers running across the federated clusters collaborate to monitor the health
+of the service endpoints, and automatically remove unhealthy endpoints from the
+DNS record (e.g. a majority quorum is required to vote a service endpoint
+unhealthy, to avoid false positives due to individual health checker network
+isolation).
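+
+For example (continuing the purely illustrative addresses above), if the
+service endpoint in the first cluster were voted unhealthy, the DNS answer
+would shrink to the remaining healthy clusters until it recovers:
+
+    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
+    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
+    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157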
+
+### Federated Replication Controllers
+
+So far we have a federated service defined, with a resolvable load balancer
+hostname by which clients can reach it, but no pods serving traffic directed
+there. So now we need a Federated Replication Controller. These are also fairly
+straightforward, being composed of multiple underlying Kubernetes Replication
+Controllers which do the hard work of keeping the desired number of Pod replicas
+alive in each Kubernetes cluster.
+
+ $ kubectl create -f my-service-rc.yaml --context="federation-1"
+
+where `my-service-rc.yaml` contains the following:
+
+    apiVersion: v1
+    kind: ReplicationController
+ metadata:
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ spec:
+ replicas: 6
+ selector:
+ run: my-service
+ template:
+ metadata:
+ labels:
+ run: my-service
+ spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+
+The Cluster Federation control system in turn creates one equivalent replication controller
+(identical config to the above, except for the replica count) in each
+of the underlying Kubernetes clusters, each of which results in
+something like this:
+
+    $ kubectl get -o yaml rc my-service --context="cluster-1"
+
+    apiVersion: v1
+    kind: ReplicationController
+ metadata:
+ creationTimestamp: 2015-12-02T23:00:47Z
+ labels:
+ run: my-service
+ name: my-service
+ namespace: my-namespace
+ selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
+ uid: 86542109-9948-11e5-a38c-42010af00002
+ spec:
+ replicas: 2
+ selector:
+ run: my-service
+ template:
+ metadata:
+ labels:
+ run: my-service
+ spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+            resources: {}
+ dnsPolicy: ClusterFirst
+ restartPolicy: Always
+ status:
+ replicas: 2
+
+The exact number of replicas created in each underlying cluster will of course
+depend on what scheduling policy is in force. In the above example, the
+scheduler created an equal number of replicas (2) in each of the three
+underlying clusters, to make up the total of 6 replicas required. To handle
+entire cluster failures, various approaches are possible, including:
+
+1. **simple overprovisioning**, such that sufficient replicas remain even if a
+   cluster fails. This wastes some resources, but is simple and reliable (see
+   the sizing sketch after this list).
+2. **pod autoscaling**, where the replication controller in each
+ cluster automatically and autonomously increases the number of
+ replicas in its cluster in response to the additional traffic
+ diverted from the failed cluster. This saves resources and is relatively
+ simple, but there is some delay in the autoscaling.
+3. **federated replica migration**, where the Cluster Federation
+ control system detects the cluster failure and automatically
+   increases the replica count in the remaining clusters to make up
+ for the lost replicas in the failed cluster. This does not seem to
+ offer any benefits relative to pod autoscaling above, and is
+ arguably more complex to implement, but we note it here as a
+ possibility.
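+
+As a rough illustration of option 1 above, the amount of overprovisioning
+needed to survive the loss of any single cluster can be estimated with some
+back-of-the-envelope arithmetic (a sketch only, not part of the design):
+
+    # N = replicas needed to serve all traffic, C = number of clusters.
+    # To keep N replicas available after losing any one cluster, each
+    # cluster must run ceil(N / (C - 1)) replicas.
+    N=6; C=3
+    PER_CLUSTER=$(( (N + C - 2) / (C - 1) ))   # ceil(6 / 2) = 3
+    TOTAL=$(( PER_CLUSTER * C ))               # 9 replicas federation-wide
+    echo "$PER_CLUSTER per cluster, $TOTAL in total"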
+
+### Implementation Details
+
+The implementation approach and architecture are very similar to those of Kubernetes, so
+if you're familiar with how Kubernetes works, none of what follows will be
+surprising. One additional design driver not present in Kubernetes is that
+the Cluster Federation control system aims to be resilient to individual cluster and availability zone
+failures. So the control plane spans multiple clusters. More specifically:
+
++ Cluster Federation runs its own distinct set of API servers (typically one
+ or more per underlying Kubernetes cluster). These are completely
+ distinct from the Kubernetes API servers for each of the underlying
+ clusters.
++ Cluster Federation runs its own distinct quorum-based metadata store (etcd,
+  by default). Approximately 1 quorum member runs in each underlying
+  cluster ("approximately" because we aim for an odd number of quorum
+  members, and typically don't want more than 5 quorum members, even
+  if we have a larger number of federated clusters, so 2 clusters->3
+  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc; see the sketch below).
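+
+The mapping above amounts to "the largest odd number that does not exceed the
+number of clusters, clamped to between 3 and 5". A hypothetical shell helper
+illustrating that rule (an illustration only, not shipped code):
+
+    # quorum_size N -> suggested number of etcd quorum members for N clusters
+    quorum_size() {
+      local n=$1
+      local q=$(( n % 2 == 0 ? n - 1 : n ))   # largest odd number <= n
+      (( q < 3 )) && q=3                      # never fewer than 3 members
+      (( q > 5 )) && q=5                      # never more than 5 members
+      echo "$q"
+    }
+    quorum_size 2   # -> 3
+    quorum_size 7   # -> 5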
+
+Cluster Controllers in the Federation control system watch the Federation API
+server/etcd state, and apply changes to the underlying Kubernetes clusters
+accordingly. They also provide an anti-entropy mechanism for reconciling
+Cluster Federation "desired desired" state against Kubernetes "actual desired"
+state.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federation-high-level-arch.png b/contributors/design-proposals/federation-high-level-arch.png
new file mode 100644
index 00000000..8a416cc1
--- /dev/null
+++ b/contributors/design-proposals/federation-high-level-arch.png
Binary files differ
diff --git a/contributors/design-proposals/federation-lite.md b/contributors/design-proposals/federation-lite.md
new file mode 100644
index 00000000..549f98df
--- /dev/null
+++ b/contributors/design-proposals/federation-lite.md
@@ -0,0 +1,201 @@
+# Kubernetes Multi-AZ Clusters
+
+## (previously nicknamed "Ubernetes-Lite")
+
+## Introduction
+
+Full Cluster Federation will offer sophisticated federation between multiple kubernetes
+clusters, offering true high-availability, multiple provider support &
+cloud-bursting, multiple region support etc. However, many users have
+expressed a desire for a "reasonably" highly-available cluster that runs in
+multiple zones on GCE or availability zones in AWS, and can tolerate the failure
+of a single zone without the complexity of running multiple clusters.
+
+Multi-AZ Clusters aim to deliver exactly that functionality: to run a single
+Kubernetes cluster in multiple zones. It will attempt to make reasonable
+scheduling decisions, in particular so that a replication controller's pods are
+spread across zones, and it will try to be aware of constraints - for example
+that a volume cannot be mounted on a node in a different zone.
+
+Multi-AZ Clusters are deliberately limited in scope; for many advanced functions
+the answer will be "use full Cluster Federation". For example, multiple-region
+support is not in scope. Routing affinity (e.g. so that a webserver will
+prefer to talk to a backend service in the same zone) is similarly not in
+scope.
+
+## Design
+
+These are the main requirements:
+
+1. kube-up must allow bringing up a cluster that spans multiple zones.
+1. pods in a replication controller should attempt to spread across zones.
+1. pods which require volumes should not be scheduled onto nodes in a different zone.
+1. load-balanced services should work reasonably.
+
+### kube-up support
+
+kube-up support for multiple zones will initially be considered
+advanced/experimental functionality, so the interface is not initially going to
+be particularly user-friendly. As we design the evolution of kube-up, we will
+make multiple zones better supported.
+
+For the initial implementation, kube-up must be run multiple times, once for
+each zone. The first kube-up will take place as normal, but then for each
+additional zone the user must run kube-up again, specifying
+`KUBE_USE_EXISTING_MASTER=true` and `KUBE_SUBNET_CIDR=172.20.x.0/24`. This will then
+create additional nodes in a different zone, but will register them with the
+existing master.
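+
+As a sketch of what this might look like on GCE (the zone variable and exact
+invocation are assumptions and may differ by provider and release):
+
+    # First zone: a normal kube-up run.
+    $ KUBE_GCE_ZONE=us-central1-a ./cluster/kube-up.sh
+
+    # Each additional zone: re-use the existing master, adding nodes only.
+    $ KUBE_USE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b \
+      KUBE_SUBNET_CIDR=172.20.1.0/24 ./cluster/kube-up.sh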
+
+### Zone spreading
+
+This will be implemented by modifying the existing scheduler priority function
+`SelectorSpread`. Currently this priority function aims to put pods in an RC
+on different hosts, but it will be extended first to spread across zones, and
+then to spread across hosts.
+
+So that the scheduler does not need to call out to the cloud provider on every
+scheduling decision, we must somehow record the zone information for each node.
+The implementation of this will be described in the implementation section.
+
+Note that zone spreading is 'best effort'; zones are just one of the factors
+in making scheduling decisions, and thus it is not guaranteed that pods will
+spread evenly across zones. However, this is likely desirable: if a zone is
+overloaded or failing, we still want to schedule the requested number of pods.
+
+### Volume affinity
+
+Most cloud providers (at least GCE and AWS) cannot attach their persistent
+volumes across zones. Thus when a pod is being scheduled, if there is a volume
+attached, that will dictate the zone. This will be implemented using a new
+scheduler predicate (a hard constraint): `VolumeZonePredicate`.
+
+When `VolumeZonePredicate` observes a pod scheduling request that includes a
+volume, if that volume is zone-specific, `VolumeZonePredicate` will exclude any
+nodes not in that zone.
+
+Again, to avoid the scheduler calling out to the cloud provider, this will rely
+on information attached to the volumes. This means that this will only support
+PersistentVolumeClaims, because direct mounts do not have a place to attach
+zone information. PersistentVolumes will then include zone information where
+volumes are zone-specific.
+
+### Load-balanced services should operate reasonably
+
+For both AWS & GCE, Kubernetes creates a native cloud load-balancer for each
+service of type LoadBalancer. The native cloud load-balancers on both AWS &
+GCE are region-level, and support load-balancing across instances in multiple
+zones (in the same region). For both clouds, the behaviour of the native cloud
+load-balancer is reasonable in the face of failures (indeed, this is why clouds
+provide load-balancing as a primitive).
+
+For multi-AZ clusters we will therefore simply rely on the native cloud provider
+load balancer behaviour, and we do not anticipate substantial code changes.
+
+One notable shortcoming here is that load-balanced traffic still goes through
+kube-proxy controlled routing, and kube-proxy does not (currently) favor
+targeting a pod running on the same instance or even the same zone. This will
+likely produce a lot of unnecessary cross-zone traffic (which is likely slower
+and more expensive). This might be sufficiently low-hanging fruit that we
+choose to address it in kube-proxy / multi-AZ clusters, but this can be addressed
+after the initial implementation.
+
+
+## Implementation
+
+The main implementation points are:
+
+1. how to attach zone information to Nodes and PersistentVolumes
+1. how nodes get zone information
+1. how volumes get zone information
+
+### Attaching zone information
+
+We must attach zone information to Nodes and PersistentVolumes, and possibly to
+other resources in future. There are two obvious alternatives: we can use
+labels/annotations, or we can extend the schema to include the information.
+
+For the initial implementation, we propose to use labels. The reasoning is:
+
+1. It is considerably easier to implement.
+1. We will reserve the two labels `failure-domain.alpha.kubernetes.io/zone` and
+`failure-domain.alpha.kubernetes.io/region` for the two pieces of information
+we need. By putting this under the `kubernetes.io` namespace there is no risk
+of collision, and by putting it under `alpha.kubernetes.io` we clearly mark
+this as an experimental feature.
+1. We do not yet know whether these labels will be sufficient for all
+environments, nor which entities will require zone information. Labels give us
+more flexibility here.
+1. Because the labels are reserved, we can move to schema-defined fields in
+future using our cross-version mapping techniques.
+
+### Node labeling
+
+We do not want to require an administrator to manually label nodes. We instead
+modify the kubelet to include the appropriate labels when it registers itself.
+The information is easily obtained by the kubelet from the cloud provider.
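+
+Once nodes have registered, the zone labels could be inspected with something
+like the following (illustrative; `-L` simply adds the named label as an
+output column):
+
+    $ kubectl get nodes -L failure-domain.alpha.kubernetes.io/zone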
+
+### Volume labeling
+
+As with nodes, we do not want to require an administrator to manually label
+volumes. We will create an admission controller `PersistentVolumeLabel`.
+`PersistentVolumeLabel` will intercept requests to create PersistentVolumes,
+and will label them appropriately by calling in to the cloud provider.
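+
+For example, a labelled GCE PD PersistentVolume might end up looking roughly
+like the following (an illustrative sketch; the disk name, size and zone are
+made up):
+
+    apiVersion: v1
+    kind: PersistentVolume
+    metadata:
+      name: pv-gce-disk-1
+      labels:
+        failure-domain.alpha.kubernetes.io/zone: us-central1-a
+        failure-domain.alpha.kubernetes.io/region: us-central1
+    spec:
+      capacity:
+        storage: 10Gi
+      accessModes:
+        - ReadWriteOnce
+      gcePersistentDisk:
+        pdName: my-gce-disk
+        fsType: ext4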
+
+## AWS Specific Considerations
+
+The AWS implementation here is fairly straightforward. The AWS API is
+region-wide, meaning that a single call will find instances and volumes in all
+zones. In addition, instance ids and volume ids are unique per-region (and
+hence also per-zone). I believe they are actually globally unique, but I do
+not know if this is guaranteed; in any case we only need global uniqueness if
+we are to span regions, which will not be supported by multi-AZ clusters (to do
+that correctly requires a full Cluster Federation type approach).
+
+## GCE Specific Considerations
+
+The GCE implementation is more complicated than the AWS implementation because
+GCE APIs are zone-scoped. To perform an operation, we must perform one REST
+call per zone and combine the results, unless we can determine in advance that
+an operation references a particular zone. For many operations, we can make
+that determination, but in some cases, such as listing all instances, we must
+combine results from calls in all relevant zones.
+
+A further complexity is that GCE volume names are scoped per-zone, not
+per-region. Thus it is permitted to have two volumes both named `myvolume` in
+two different GCE zones. (Instance names are currently unique per-region, and
+thus are not a problem for multi-AZ clusters).
+
+The volume scoping leads to a (small) behavioural change for multi-AZ clusters on
+GCE. If you had two volumes both named `myvolume` in two different GCE zones,
+this would not be ambiguous when Kubernetes is operating only in a single zone.
+But, when operating a cluster across multiple zones, `myvolume` is no longer
+sufficient to specify a volume uniquely. Worse, the fact that a volume happens
+to be unambiguous at a particular time is no guarantee that it will continue to
+be unambiguous in future, because a volume with the same name could
+subsequently be created in a second zone. While perhaps unlikely in practice,
+we cannot automatically enable multi-AZ clusters for GCE users if this then causes
+volume mounts to stop working.
+
+This suggests that (at least on GCE), multi-AZ clusters must be optional (i.e.
+there must be a feature-flag). It may be that we can make this feature
+semi-automatic in future, by detecting whether nodes are running in multiple
+zones, but it seems likely that kube-up could instead simply set this flag.
+
+For the initial implementation, creating volumes with identical names will
+yield undefined results. Later, we may add some way to specify the zone for a
+volume (and possibly require that volumes have their zone specified when
+running in multi-AZ cluster mode). We could add a new `zone` field to the
+PersistentVolume type for GCE PD volumes, or we could use a DNS-style dotted
+name for the volume name (`<name>.<zone>`).
+
+Initially therefore, the GCE changes will be to:
+
+1. change kube-up to support creation of a cluster in multiple zones
+1. pass a flag enabling multi-AZ clusters with kube-up
+1. change the kubernetes cloud provider to iterate through relevant zones when resolving items
+1. tag GCE PD volumes with the appropriate zone information
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation-lite.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federation-phase-1.md b/contributors/design-proposals/federation-phase-1.md
new file mode 100644
index 00000000..0a3a8f50
--- /dev/null
+++ b/contributors/design-proposals/federation-phase-1.md
@@ -0,0 +1,407 @@
+# Ubernetes Design Spec (phase one)
+
+**Huawei PaaS Team**
+
+## INTRODUCTION
+
+In this document we propose a design for the “Control Plane” of
+Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
+this work please refer to
+[this proposal](../../docs/proposals/federation.md).
+The document is arranged as follows. First we briefly list scenarios
+and use cases that motivate K8S federation work. These use cases both
+drive and validate the design. We summarize the
+functionality requirements from these use cases, and define the "in
+scope" functionalities that will be covered by this design (phase
+one). After that we give an overview of the proposed architecture, API
+and building blocks. We also walk through several activity flows to
+show how these building blocks work together to support the use cases.
+
+## REQUIREMENTS
+
+There are many reasons why customers may want to build a K8S
+federation:
+
++ **High Availability:** Customers want to be immune to the outage of
+ a single availability zone, region or even a cloud provider.
++ **Sensitive workloads:** Some workloads can only run on a particular
+ cluster. They cannot be scheduled to or migrated to other clusters.
++ **Capacity overflow:** Customers prefer to run workloads on a
+ primary cluster. But if the capacity of the cluster is not
+ sufficient, workloads should be automatically distributed to other
+ clusters.
++ **Vendor lock-in avoidance:** Customers want to spread their
+ workloads on different cloud providers, and can easily increase or
+ decrease the workload proportion of a specific provider.
++ **Cluster Size Enhancement:** Currently a K8S cluster can only support
+a limited size. While the community is actively improving it, it can
+be expected that cluster size will be a problem if K8S is used for
+large workloads or public PaaS infrastructure. While we can separate
+different tenants to different clusters, it would be good to have a
+unified view.
+
+Here are the functionality requirements derived from above use cases:
+
++ Clients of the federation control plane API server can register and deregister
+clusters.
++ Workloads should be spread to different clusters according to the
+ workload distribution policy.
++ Pods are able to discover and connect to services hosted in other
+ clusters (in cases where inter-cluster networking is necessary,
+ desirable and implemented).
++ Traffic to these pods should be spread across clusters (in a manner
+ similar to load balancing, although it might not be strictly
+ speaking balanced).
++ The control plane needs to know when a cluster is down, and migrate
+ the workloads to other clusters.
++ Clients have a unified view and a central control point for above
+ activities.
+
+## SCOPE
+
+It’s difficult to arrive in one step at a perfect design that implements
+all the above requirements. Therefore we will go with an iterative
+approach to design and build the system. This document describes the
+phase one of the whole work. In phase one we will cover only the
+following objectives:
+
++ Define the basic building blocks and API objects of control plane
++ Implement a basic end-to-end workflow
+ + Clients register federated clusters
+ + Clients submit a workload
+ + The workload is distributed to different clusters
+ + Service discovery
+ + Load balancing
+
+The following parts are NOT covered in phase one:
+
++ Authentication and authorization (other than basic client
+ authentication against the ubernetes API, and from ubernetes control
+ plane to the underlying kubernetes clusters).
++ Deployment units other than replication controller and service
++ Complex distribution policy of workloads
++ Service affinity and migration
+
+## ARCHITECTURE
+
+The overall architecture of the control plane is shown below:
+
+![Ubernetes Architecture](ubernetes-design.png)
+
+Some design principles we are following in this architecture:
+
+1. Keep the underlying K8S clusters independent. They should have no
+ knowledge of control plane or of each other.
+1. Keep the Ubernetes API interface compatible with K8S API as much as
+ possible.
+1. Re-use concepts from K8S as much as possible. This reduces
+customers’ learning curve and is good for adoption.
+
+Below is a brief description of each module contained in the above diagram.
+
+## Ubernetes API Server
+
+The API Server in the Ubernetes control plane works just like the API
+Server in K8S. It talks to a distributed key-value store to persist,
+retrieve and watch API objects. This store is completely distinct
+from the kubernetes key-value stores (etcd) in the underlying
+kubernetes clusters. We still use `etcd` as the distributed
+storage so customers don’t need to learn and manage a different
+storage system, although it is envisaged that other storage systems
+(consol, zookeeper) will probably be developedand supported over
+time.
+
+## Ubernetes Scheduler
+
+The Ubernetes Scheduler schedules resources onto the underlying
+Kubernetes clusters. For example it watches for unscheduled Ubernetes
+replication controllers (those that have not yet been scheduled onto
+underlying Kubernetes clusters) and performs the global scheduling
+work. For each unscheduled replication controller, it calls the policy
+engine to decide how to split workloads among clusters. It creates a
+Kubernetes Replication Controller on one or more underlying clusters,
+and posts them back to `etcd` storage.
+
+One subtlety worth noting here is that the scheduling decision is arrived at by
+combining the application-specific request from the user (which might
+include, for example, placement constraints), and the global policy specified
+by the federation administrator (for example, "prefer on-premise
+clusters over AWS clusters" or "spread load equally across clusters").
+
+## Ubernetes Cluster Controller
+
+The cluster controller
+performs the following two kinds of work:
+
+1. It watches all the sub-resources that are created by Ubernetes
+ components, like a sub-RC or a sub-service. And then it creates the
+ corresponding API objects on the underlying K8S clusters.
+1. It periodically retrieves the available resources metrics from the
+ underlying K8S cluster, and updates them as object status of the
+ `cluster` API object. An alternative design might be to run a pod
+ in each underlying cluster that reports metrics for that cluster to
+ the Ubernetes control plane. Which approach is better remains an
+ open topic of discussion.
+
+## Ubernetes Service Controller
+
+The Ubernetes service controller is a federation-level implementation
+of the K8S service controller. It watches service resources created on
+the control plane, and creates corresponding K8S services on each involved K8S
+cluster. Besides interacting with service resources on each
+individual K8S cluster, the Ubernetes service controller also
+performs some global DNS registration work.
+
+## API OBJECTS
+
+## Cluster
+
+Cluster is a new first-class API object introduced in this design. For
+each registered K8S cluster there will be such an API resource in
+the control plane. The way clients register or deregister a cluster is to
+send corresponding REST requests to the following URL:
+`/api/{$version}/clusters`. Because the control plane behaves like a
+regular K8S client to the underlying clusters, the spec of a cluster
+object contains necessary properties like the K8S cluster address and
+credentials. The status of a cluster API object will contain the
+following information:
+
+1. Which phase of its lifecycle it is in.
+1. Cluster resource metrics for scheduling decisions.
+1. Other metadata, like the version of the cluster.
+
+$version.clusterSpec
+
+<table style="border:1px solid #000000;border-collapse:collapse;">
+<tbody>
+<tr>
+<td style="padding:5px;"><b>Name</b><br>
+</td>
+<td style="padding:5px;"><b>Description</b><br>
+</td>
+<td style="padding:5px;"><b>Required</b><br>
+</td>
+<td style="padding:5px;"><b>Schema</b><br>
+</td>
+<td style="padding:5px;"><b>Default</b><br>
+</td>
+</tr>
+<tr>
+<td style="padding:5px;">Address<br>
+</td>
+<td style="padding:5px;">address of the cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">address<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">Credential<br>
+</td>
+<td style="padding:5px;">the type (e.g. bearer token, client
+certificate etc) and data of the credential used to access the cluster. It is used for system routines (not on behalf of users)<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">string <br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+</tbody>
+</table>
+
+$version.clusterStatus
+
+<table style="border:1px solid #000000;border-collapse:collapse;">
+<tbody>
+<tr>
+<td style="padding:5px;"><b>Name</b><br>
+</td>
+<td style="padding:5px;"><b>Description</b><br>
+</td>
+<td style="padding:5px;"><b>Required</b><br>
+</td>
+<td style="padding:5px;"><b>Schema</b><br>
+</td>
+<td style="padding:5px;"><b>Default</b><br>
+</td>
+</tr>
+<tr>
+<td style="padding:5px;">Phase<br>
+</td>
+<td style="padding:5px;">the recently observed lifecycle phase of the cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">enum<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">Capacity<br>
+</td>
+<td style="padding:5px;">represents the available resources of a cluster<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">any<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+<tr>
+<td style="padding:5px;">ClusterMeta<br>
+</td>
+<td style="padding:5px;">Other cluster metadata like the version<br>
+</td>
+<td style="padding:5px;">yes<br>
+</td>
+<td style="padding:5px;">ClusterMeta<br>
+</td>
+<td style="padding:5px;"><p></p></td>
+</tr>
+</tbody>
+</table>
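+
+To make the tables above concrete, a hypothetical cluster API object might
+look roughly like the following (an illustrative sketch only; the exact
+serialization, field names and values are not prescribed by this document):
+
+```
+apiVersion: v1
+kind: Cluster
+metadata:
+  name: cluster-foo
+spec:
+  address: https://101.102.103.104:443
+  credential: <bearer token or client certificate data>
+status:
+  phase: running
+  capacity:
+    cpu: "200"
+    memory: 800Gi
+  clusterMeta:
+    version: v1.1.2
+```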
+
+**For simplicity we didn’t introduce a separate “cluster metrics” API
+object here**. The cluster resource metrics are stored in the cluster
+status section, just as we do for nodes in K8S. In phase one it
+only contains available CPU resources and memory resources. The
+cluster controller will periodically poll the underlying cluster API
+Server to get the cluster capacity. In phase one it gets the metrics by
+simply aggregating metrics from all nodes. In future we will improve
+this with more efficient ways like leveraging Heapster, and also more
+metrics will be supported. Similar to node phases in K8S, the “phase”
+field includes the following values:
+
++ pending: newly registered clusters or clusters suspended by admin
+ for various reasons. They are not eligible for accepting workloads
++ running: clusters in normal status that can accept workloads
++ offline: clusters temporarily down or not reachable
++ terminated: clusters removed from federation
+
+Below is the state transition diagram.
+
+![Cluster State Transition Diagram](ubernetes-cluster-state.png)
+
+## Replication Controller
+
+A global workload submitted to the control plane is represented as a
+replication controller in the Cluster Federation control plane. When a replication controller
+is submitted to the control plane, clients need a way to express its
+requirements or preferences on clusters. Depending on the use
+case this may be complex. For example:
+
++ This workload can only be scheduled to cluster Foo. It cannot be
+ scheduled to any other clusters. (use case: sensitive workloads).
++ This workload prefers cluster Foo. But if there is no available
+  capacity on cluster Foo, it's OK to be scheduled to cluster Bar
+  (use case: capacity overflow).
++ Seventy percent of this workload should be scheduled to cluster Foo,
+  and thirty percent should be scheduled to cluster Bar (use case:
+  vendor lock-in avoidance).
+
+In phase one, we only introduce a
+_clusterSelector_ field to filter acceptable clusters. In the default
+case there is no such selector, which means any cluster is
+acceptable.
+
+Below is a sample of the YAML to create such a replication controller.
+
+```
+apiVersion: v1
+kind: ReplicationController
+metadata:
+ name: nginx-controller
+spec:
+ replicas: 5
+ selector:
+ app: nginx
+ template:
+ metadata:
+ labels:
+ app: nginx
+ spec:
+ containers:
+ - name: nginx
+ image: nginx
+ ports:
+ - containerPort: 80
+ clusterSelector:
+ name in (Foo, Bar)
+```
+
+Currently clusterSelector (implemented as a
+[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
+only supports a simple list of acceptable clusters. Workloads will be
+evenly distributed on these acceptable clusters in phase one. After
+phase one we will define syntax to represent more advanced
+constraints, like cluster preference ordering, desired number of
+split workloads, desired ratio of workloads spread on different
+clusters, etc.
+
+Besides this explicit “clusterSelector” filter, a workload may have
+some implicit scheduling restrictions. For example, it may define a
+“nodeSelector” which can only be satisfied on some particular
+clusters. How to handle this will be addressed after phase one.
+
+## Federated Services
+
+The Service API object exposed by the Cluster Federation is similar to service
+objects on Kubernetes. It defines the access to a group of pods. The
+federation service controller will create corresponding Kubernetes
+service objects on underlying clusters. These are detailed in a
+separate design document: [Federated Services](federated-services.md).
+
+## Pod
+
+In phase one we only support scheduling replication controllers. Pod
+scheduling will be supported in a later phase. This is primarily in
+order to keep the Cluster Federation API compatible with the Kubernetes API.
+
+## ACTIVITY FLOWS
+
+## Scheduling
+
+The diagram below shows how workloads are scheduled on the Cluster Federation
+control plane:
+
+1. A replication controller is created by the client.
+1. APIServer persists it into the storage.
+1. Cluster controller periodically polls the latest available resource
+ metrics from the underlying clusters.
+1. The scheduler watches all pending RCs. It picks up an RC, makes
+   policy-driven decisions and splits it into different sub RCs.
+1. Each cluster controller watches the sub RCs bound to its
+   corresponding cluster. It picks up the newly created sub RC.
+1. The cluster controller issues requests to the underlying cluster
+API Server to create the RC. In phase one we don’t support complex
+distribution policies. The scheduling rule is basically:
+    1. If an RC does not specify any nodeSelector, it will be scheduled
+       to the least loaded K8S cluster(s) that have enough available
+       resources.
+    1. If an RC specifies _N_ acceptable clusters in the
+       clusterSelector, all replicas will be evenly distributed among
+       these clusters.
+
+There is a potential race condition here. Say at time _T1_ the control
+plane learns there are _m_ available resources in a K8S cluster. As
+the cluster is working independently it still accepts workload
+requests from other K8S clients or even another Cluster Federation control
+plane. The Cluster Federation scheduling decision is based on this data of
+available resources. However when the actual RC creation happens to
+the cluster at time _T2_, the cluster may not have enough resources
+at that time. We will address this problem in later phases with some
+proposed solutions like resource reservation mechanisms.
+
+![Federated Scheduling](ubernetes-scheduling.png)
+
+## Service Discovery
+
+This part has been included in the section “Federated Service” of
+the document
+“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
+Please refer to that document for details.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/federation.md b/contributors/design-proposals/federation.md
new file mode 100644
index 00000000..fc595123
--- /dev/null
+++ b/contributors/design-proposals/federation.md
@@ -0,0 +1,648 @@
+# Kubernetes Cluster Federation
+
+## (previously nicknamed "Ubernetes")
+
+## Requirements Analysis and Product Proposal
+
+## _by Quinton Hoole ([quinton@google.com](mailto:quinton@google.com))_
+
+_Initial revision: 2015-03-05_
+_Last updated: 2015-08-20_
+This doc: [tinyurl.com/ubernetesv2](http://tinyurl.com/ubernetesv2)
+Original slides: [tinyurl.com/ubernetes-slides](http://tinyurl.com/ubernetes-slides)
+Updated slides: [tinyurl.com/ubernetes-whereto](http://tinyurl.com/ubernetes-whereto)
+
+## Introduction
+
+Today, each Kubernetes cluster is a relatively self-contained unit,
+which typically runs in a single "on-premise" data centre or single
+availability zone of a cloud provider (Google's GCE, Amazon's AWS,
+etc).
+
+Several current and potential Kubernetes users and customers have
+expressed a keen interest in tying together ("federating") multiple
+clusters in some sensible way in order to enable the following kinds
+of use cases (intentionally vague):
+
+1. _"Preferentially run my workloads in my on-premise cluster(s), but
+ automatically overflow to my cloud-hosted cluster(s) if I run out
+ of on-premise capacity"_.
+1. _"Most of my workloads should run in my preferred cloud-hosted
+ cluster(s), but some are privacy-sensitive, and should be
+ automatically diverted to run in my secure, on-premise
+ cluster(s)"_.
+1. _"I want to avoid vendor lock-in, so I want my workloads to run
+ across multiple cloud providers all the time. I change my set of
+ such cloud providers, and my pricing contracts with them,
+ periodically"_.
+1. _"I want to be immune to any single data centre or cloud
+ availability zone outage, so I want to spread my service across
+ multiple such zones (and ideally even across multiple cloud
+ providers)."_
+
+The above use cases are by necessity left imprecisely defined. The
+rest of this document explores these use cases and their implications
+in further detail, and compares a few alternative high level
+approaches to addressing them. The idea of cluster federation has
+informally become known as _"Ubernetes"_.
+
+## Summary/TL;DR
+
+Four primary customer-driven use cases are explored in more detail.
+The two highest priority ones relate to High Availability and
+Application Portability (between cloud providers, and between
+on-premise and cloud providers).
+
+Four primary federation primitives are identified (location affinity,
+cross-cluster scheduling, service discovery and application
+migration). Fortunately not all four of these primitives are required
+for each primary use case, so incremental development is feasible.
+
+## What exactly is a Kubernetes Cluster?
+
+A central design concept in Kubernetes is that of a _cluster_. While
+loosely speaking, a cluster can be thought of as running in a single
+data center, or cloud provider availability zone, a more precise
+definition is that each cluster provides:
+
+1. a single Kubernetes API entry point,
+1. a consistent, cluster-wide resource naming scheme
+1. a scheduling/container placement domain
+1. a service network routing domain
+1. an authentication and authorization model.
+
+The above in turn imply the need for a relatively performant, reliable
+and cheap network within each cluster.
+
+There is also assumed to be some degree of failure correlation across
+a cluster, i.e. whole clusters are expected to fail, at least
+occasionally (due to cluster-wide power and network failures, natural
+disasters etc). Clusters are often relatively homogeneous in that all
+compute nodes are typically provided by a single cloud provider or
+hardware vendor, and connected by a common, unified network fabric.
+But these are not hard requirements of Kubernetes.
+
+Other classes of Kubernetes deployments than the one sketched above
+are technically feasible, but come with some challenges of their own,
+and are not yet common or explicitly supported.
+
+More specifically, having a Kubernetes cluster span multiple
+well-connected availability zones within a single geographical region
+(e.g. US North East, UK, Japan etc) is worthy of further
+consideration, in particular because it potentially addresses
+some of these requirements.
+
+## What use cases require Cluster Federation?
+
+Let's name a few concrete use cases to aid the discussion:
+
+## 1. Capacity Overflow
+
+_"I want to preferentially run my workloads in my on-premise cluster(s), but automatically "overflow" to my cloud-hosted cluster(s) when I run out of on-premise capacity."_
+
+This idea is known in some circles as "[cloudbursting](http://searchcloudcomputing.techtarget.com/definition/cloud-bursting)".
+
+**Clarifying questions:** What is the unit of overflow? Individual
+ pods? Probably not always. Replication controllers and their
+ associated sets of pods? Groups of replication controllers
+ (a.k.a. distributed applications)? How are persistent disks
+ overflowed? Can the "overflowed" pods communicate with their
+ brethren and sistren pods and services in the other cluster(s)?
+ Presumably yes, at higher cost and latency, provided that they use
+ external service discovery. Is "overflow" enabled only when creating
+ new workloads/replication controllers, or are existing workloads
+ dynamically migrated between clusters based on fluctuating available
+ capacity? If so, what is the desired behaviour, and how is it
+ achieved? How, if at all, does this relate to quota enforcement
+ (e.g. if we run out of on-premise capacity, can all or only some
+ quotas transfer to other, potentially more expensive off-premise
+ capacity?)
+
+It seems that most of this boils down to:
+
+1. **location affinity** (pods relative to each other, and to other
+ stateful services like persistent storage - how is this expressed
+ and enforced?)
+1. **cross-cluster scheduling** (given location affinity constraints
+ and other scheduling policy, which resources are assigned to which
+ clusters, and by what?)
+1. **cross-cluster service discovery** (how do pods in one cluster
+ discover and communicate with pods in another cluster?)
+1. **cross-cluster migration** (how do compute and storage resources,
+ and the distributed applications to which they belong, move from
+ one cluster to another)
+1. **cross-cluster load-balancing** (how is user traffic directed
+ to an appropriate cluster?)
+1. **cross-cluster monitoring and auditing** (a.k.a. Unified Visibility)
+
+## 2. Sensitive Workloads
+
+_"I want most of my workloads to run in my preferred cloud-hosted
+cluster(s), but some are privacy-sensitive, and should be
+automatically diverted to run in my secure, on-premise cluster(s). The
+list of privacy-sensitive workloads changes over time, and they're
+subject to external auditing."_
+
+**Clarifying questions:**
+1. What kinds of rules determine which
+workloads go where?
+ 1. Is there in fact a requirement to have these rules be
+ declaratively expressed and automatically enforced, or is it
+ acceptable/better to have users manually select where to run
+ their workloads when starting them?
+ 1. Is a static mapping from container (or more typically,
+ replication controller) to cluster maintained and enforced?
+ 1. If so, is it only enforced on startup, or are things migrated
+ between clusters when the mappings change?
+
+This starts to look quite similar to "1. Capacity Overflow", and again
+seems to boil down to:
+
+1. location affinity
+1. cross-cluster scheduling
+1. cross-cluster service discovery
+1. cross-cluster migration
+1. cross-cluster monitoring and auditing
+1. cross-cluster load balancing
+
+## 3. Vendor lock-in avoidance
+
+_"My CTO wants us to avoid vendor lock-in, so she wants our workloads
+to run across multiple cloud providers at all times. She changes our
+set of preferred cloud providers and pricing contracts with them
+periodically, and doesn't want to have to communicate and manually
+enforce these policy changes across the organization every time this
+happens. She wants it centrally and automatically enforced, monitored
+and audited."_
+
+**Clarifying questions:**
+
+1. How does this relate to other use cases (high availability,
+capacity overflow etc), as they may all be across multiple vendors.
+It's probably not strictly speaking a separate
+use case, but it's brought up so often as a requirement, that it's
+worth calling out explicitly.
+1. Is a useful intermediate step to make it as simple as possible to
+ migrate an application from one vendor to another in a one-off fashion?
+
+Again, I think that this can probably be
+ reformulated as a Capacity Overflow problem - the fundamental
+ principles seem to be the same or substantially similar to those
+ above.
+
+## 4. "High Availability"
+
+_"I want to be immune to any single data centre or cloud availability
+zone outage, so I want to spread my service across multiple such zones
+(and ideally even across multiple cloud providers), and have my
+service remain available even if one of the availability zones or
+cloud providers "goes down"_.
+
+It seems useful to split this into multiple sets of sub use cases:
+
+1. Multiple availability zones within a single cloud provider (across
+ which feature sets like private networks, load balancing,
+ persistent disks, data snapshots etc are typically consistent and
+ explicitly designed to inter-operate).
+ 1. within the same geographical region (e.g. metro) within which network
+ is fast and cheap enough to be almost analogous to a single data
+ center.
+ 1. across multiple geographical regions, where high network cost and
+ poor network performance may be prohibitive.
+1. Multiple cloud providers (typically with inconsistent feature sets,
+ more limited interoperability, and typically no cheap inter-cluster
+ networking described above).
+
+The single cloud provider case might be easier to implement (although
+the multi-cloud provider implementation should just work for a single
+cloud provider). We propose a high-level design catering for both, with
+the initial implementation targeting a single cloud provider only.
+
+**Clarifying questions:**
+**How does global external service discovery work?** In the steady
+ state, which external clients connect to which clusters? GeoDNS or
+ similar? What is the tolerable failover latency if a cluster goes
+ down? Maybe something like (make up some numbers, notwithstanding
+ some buggy DNS resolvers, TTL's, caches etc) ~3 minutes for ~90% of
+ clients to re-issue DNS lookups and reconnect to a new cluster when
+ their home cluster fails is good enough for most Kubernetes users
+ (or at least way better than the status quo), given that these sorts
+ of failure only happen a small number of times a year?
+
+**How does dynamic load balancing across clusters work, if at all?**
+ One simple starting point might be "it doesn't". i.e. if a service
+ in a cluster is deemed to be "up", it receives as much traffic as is
+ generated "nearby" (even if it overloads). If the service is deemed
+ to "be down" in a given cluster, "all" nearby traffic is redirected
+ to some other cluster within some number of seconds (failover could
+ be automatic or manual). Failover is essentially binary. An
+ improvement would be to detect when a service in a cluster reaches
+ maximum serving capacity, and dynamically divert additional traffic
+ to other clusters. But how exactly does all of this work, and how
+ much of it is provided by Kubernetes, as opposed to something else
+ bolted on top (e.g. external monitoring and manipulation of GeoDNS)?
+
+**How does this tie in with auto-scaling of services?** More
+ specifically, if I run my service across _n_ clusters globally, and
+ one (or more) of them fail, how do I ensure that the remaining _n-1_
+ clusters have enough capacity to serve the additional, failed-over
+ traffic? Either:
+
+1. I constantly over-provision all clusters by 1/n (potentially expensive), or
+1. I "manually" (or automatically) update my replica count configurations in the
+ remaining clusters by 1/n when the failure occurs, and Kubernetes
+ takes care of the rest for me, or
+1. Auto-scaling in the remaining clusters takes
+ care of it for me automagically as the additional failed-over
+ traffic arrives (with some latency). Note that this implies that
+ the cloud provider keeps the necessary resources on hand to
+ accommodate such auto-scaling (e.g. via something similar to AWS reserved
+ and spot instances)
+
+Up to this point, this use case ("Unavailability Zones") seems materially different from all the others above. It does not require dynamic cross-cluster service migration (we assume that the service is already running in more than one cluster when the failure occurs). Nor does it necessarily involve cross-cluster service discovery or location affinity. As a result, I propose that we address this use case somewhat independently of the others (although I strongly suspect that it will become substantially easier once we've solved the others).
+
+All of the above (regarding "Unavailability Zones") refers primarily
+to already-running user-facing services, and minimizing the impact on
+end users of those services becoming unavailable in a given cluster.
+What about the people and systems that deploy Kubernetes services
+(devops etc)? Should they be automatically shielded from the impact
+of the cluster outage? i.e. have their new resource creation requests
+automatically diverted to another cluster during the outage? While
+this specific requirement seems non-critical (manual fail-over seems
+relatively non-arduous, ignoring the user-facing issues above), it
+smells a lot like the first three use cases listed above ("Capacity
+Overflow, Sensitive Services, Vendor lock-in..."), so if we address
+those, we probably get this one free of charge.
+
+## Core Challenges of Cluster Federation
+
+As we saw above, a few common challenges fall out of most of the use
+cases considered above, namely:
+
+## Location Affinity
+
+Can the pods comprising a single distributed application be
+partitioned across more than one cluster? More generally, how far
+apart, in network terms, can a given client and server within a
+distributed application reasonably be? A server need not necessarily
+be a pod, but could instead be a persistent disk housing data, or some
+other stateful network service. What is tolerable is typically
+application-dependent, primarily influenced by network bandwidth
+consumption, latency requirements and cost sensitivity.
+
+For simplicity, let's assume that all Kubernetes distributed
+applications fall into one of three categories with respect to relative
+location affinity:
+
+1. **"Strictly Coupled"**: Those applications that strictly cannot be
+ partitioned between clusters. They simply fail if they are
+ partitioned. When scheduled, all pods _must_ be scheduled to the
+ same cluster. To move them, we need to shut the whole distributed
+ application down (all pods) in one cluster, possibly move some
+   data, and then bring up all of the pods in another cluster. To
+ avoid downtime, we might bring up the replacement cluster and
+ divert traffic there before turning down the original, but the
+ principle is much the same. In some cases moving the data might be
+ prohibitively expensive or time-consuming, in which case these
+ applications may be effectively _immovable_.
+1. **"Strictly Decoupled"**: Those applications that can be
+ indefinitely partitioned across more than one cluster, to no
+ disadvantage. An embarrassingly parallel YouTube porn detector,
+ where each pod repeatedly dequeues a video URL from a remote work
+ queue, downloads and chews on the video for a few hours, and
+ arrives at a binary verdict, might be one such example. The pods
+ derive no benefit from being close to each other, or anything else
+ (other than the source of YouTube videos, which is assumed to be
+ equally remote from all clusters in this example). Each pod can be
+ scheduled independently, in any cluster, and moved at any time.
+1. **"Preferentially Coupled"**: Somewhere between Coupled and
+ Decoupled. These applications prefer to have all of their pods
+ located in the same cluster (e.g. for failure correlation, network
+ latency or bandwidth cost reasons), but can tolerate being
+ partitioned for "short" periods of time (for example while
+ migrating the application from one cluster to another). Most small
+ to medium sized LAMP stacks with not-very-strict latency goals
+ probably fall into this category (provided that they use sane
+ service discovery and reconnect-on-fail, which they need to do
+ anyway to run effectively, even in a single Kubernetes cluster).
+
+From a fault isolation point of view, there are also opposites of the
+above. For example, a master database and its slave replica might
+need to be in different availability zones. We'll refer to this as
+anti-affinity, although it is largely outside the scope of this
+document.
+
+Note that there is somewhat of a continuum with respect to network
+cost and quality between any two nodes, ranging from two nodes on the
+same L2 network segment (lowest latency and cost, highest bandwidth)
+to two nodes on different continents (highest latency and cost, lowest
+bandwidth). One interesting point on that continuum relates to
+multiple availability zones within a well-connected metro or region
+and single cloud provider. Despite being in different data centers,
+or areas within a mega data center, the network in this case is often very fast
+and effectively free or very cheap. For the purposes of this network location
+affinity discussion, this case is considered analogous to a single
+availability zone. Furthermore, if a given application doesn't fit
+cleanly into one of the above, shoe-horn it into the best fit,
+defaulting to the "Strictly Coupled and Immovable" bucket if you're
+not sure.
+
+And then there's what I'll call _absolute_ location affinity. Some
+applications are required to run in bounded geographical or network
+topology locations. The reasons for this are typically
+political/legislative (data privacy laws etc), or driven by network
+proximity to consumers (or data providers) of the application ("most
+of our users are in Western Europe, U.S. West Coast" etc).
+
+**Proposal:** First tackle Strictly Decoupled applications (which can
+ be trivially scheduled, partitioned or moved, one pod at a time).
+ Then tackle Preferentially Coupled applications (which must be
+ scheduled in their entirety in a single cluster, and can be moved, but
+ only as a whole, and necessarily within some bounded time).
+ Leave strictly coupled applications to be manually moved between
+ clusters as required for the foreseeable future.
+
+## Cross-cluster service discovery
+
+I propose having pods use standard discovery methods used by external
+clients of Kubernetes applications (i.e. DNS). DNS might resolve to a
+public endpoint in the local or a remote cluster. Other than for Strictly
+Coupled applications, software should be largely oblivious to which of
+the two occurs.
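+
+As a minimal illustration (the DNS name below is purely hypothetical and
+not a committed naming scheme), a client pod would simply resolve the
+federated service name and connect to whichever endpoint it gets back,
+local or remote:
+
+```
+package main
+
+import (
+	"fmt"
+	"log"
+	"net"
+)
+
+func main() {
+	// Hypothetical federated DNS name; it may resolve to an endpoint in the
+	// local cluster or in a remote one -- the client does not need to care.
+	addrs, err := net.LookupHost("myservice.mynamespace.myfederation.example.com")
+	if err != nil {
+		log.Fatalf("DNS lookup failed: %v", err)
+	}
+	fmt.Println("connecting to", addrs[0])
+}
+```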
+
+_Aside:_ How do we avoid "tromboning" through an external VIP when DNS
+resolves to a public IP on the local cluster? Strictly speaking this
+would be an optimization for some cases, and probably only matters to
+high-bandwidth, low-latency communications. We could potentially
+eliminate the trombone with some kube-proxy magic if necessary. More
+detail to be added here, but feel free to shoot down the basic DNS
+idea in the meantime. In addition, some applications rely on private
+networking between clusters for security (e.g. AWS VPC or more
+generally VPN). It should not be necessary to forsake this in
+order to use Cluster Federation, for example by being forced to use public
+connectivity between clusters.
+
+## Cross-cluster Scheduling
+
+This is closely related to location affinity above, and also discussed
+there. The basic idea is that some controller, logically outside of
+the basic Kubernetes control plane of the clusters in question, needs
+to be able to:
+
+1. Receive "global" resource creation requests.
+1. Make policy-based decisions as to which cluster(s) should be used
+ to fulfill each given resource request. In a simple case, the
+ request is just redirected to one cluster. In a more complex case,
+ the request is "demultiplexed" into multiple sub-requests, each to
+ a different cluster. Knowledge of the (albeit approximate)
+ available capacity in each cluster will be required by the
+ controller to sanely split the request. Similarly, knowledge of
+ the properties of the application (Location Affinity class --
+ Strictly Coupled, Strictly Decoupled etc, privacy class etc) will
+ be required. It is also conceivable that knowledge of service
+ SLAs and monitoring thereof might provide an input into
+   scheduling/placement algorithms (a rough sketch of this demultiplexing
+   step follows the list).
+1. Multiplex the responses from the individual clusters into an
+ aggregate response.
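+
+As a rough, purely illustrative sketch of the demultiplexing step above
+(all type and function names here are hypothetical, not part of any
+proposed API), a federation controller might split a replica count across
+clusters according to their approximate spare capacity:
+
+```
+package main
+
+import "fmt"
+
+// clusterCapacity is a hypothetical view of a member cluster's spare
+// capacity, as (approximately) known to the federation control plane.
+type clusterCapacity struct {
+	Name      string
+	FreeSlots int
+}
+
+// demux splits a request for n replicas across clusters, filling them in
+// the order given until the request is satisfied.
+func demux(n int, clusters []clusterCapacity) map[string]int {
+	out := map[string]int{}
+	for _, c := range clusters {
+		if n == 0 {
+			break
+		}
+		take := c.FreeSlots
+		if take > n {
+			take = n
+		}
+		if take > 0 {
+			out[c.Name] = take
+			n -= take
+		}
+	}
+	return out
+}
+
+func main() {
+	clusters := []clusterCapacity{{"us-east", 30}, {"eu-west", 10}}
+	fmt.Println(demux(25, clusters)) // map[us-east:25]
+	fmt.Println(demux(35, clusters)) // map[eu-west:5 us-east:30]
+}
+```
+
+A real Policy Engine would of course also weigh the location affinity
+class, privacy constraints and pricing described above, rather than
+capacity alone.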
+
+There is of course a lot of detail still missing from this section,
+including discussion of:
+
+1. admission control
+1. initial placement of instances of a new service vs. scheduling new
+   instances of an existing service in response to auto-scaling
+1. rescheduling pods due to failure (the response might differ depending
+   on whether it's a failure of a node, a rack, or a whole AZ)
+1. data placement relative to compute capacity, etc.
+
+## Cross-cluster Migration
+
+Again this is closely related to location affinity discussed above,
+and is in some sense an extension of Cross-cluster Scheduling. When
+certain events occur, it becomes necessary or desirable for the
+cluster federation system to proactively move distributed applications
+(either in part or in whole) from one cluster to another. Examples of
+such events include:
+
+1. A low capacity event in a cluster (or a cluster failure).
+1. A change of scheduling policy ("we no longer use cloud provider X").
+1. A change of resource pricing ("cloud provider Y dropped their
+ prices - let's migrate there").
+
+Strictly Decoupled applications can be trivially moved, in part or in
+whole, one pod at a time, to one or more clusters (within applicable
+policy constraints, for example "PrivateCloudOnly").
+
+For Preferentially Coupled applications, the federation system must
+first locate a single cluster with sufficient capacity to accommodate
+the entire application, then reserve that capacity, and incrementally
+move the application, one (or more) resources at a time, over to the
+new cluster, within some bounded time period (and possibly within a
+predefined "maintenance" window). Strictly Coupled applications (with
+the exception of those deemed completely immovable) require the
+federation system to:
+
+1. start up an entire replica application in the destination cluster
+1. copy persistent data to the new application instance (possibly
+ before starting pods)
+1. switch user traffic across
+1. tear down the original application instance
+
+It is proposed that support for automated migration of Strictly
+Coupled applications be deferred to a later date.
+
+## Other Requirements
+
+These are often left implicit by customers, but are worth calling out explicitly:
+
+1. Software failure isolation between Kubernetes clusters should be
+ retained as far as is practically possible. The federation system
+ should not materially increase the failure correlation across
+ clusters. For this reason the federation control plane software
+ should ideally be completely independent of the Kubernetes cluster
+ control software, and look just like any other Kubernetes API
+ client, with no special treatment. If the federation control plane
+ software fails catastrophically, the underlying Kubernetes clusters
+ should remain independently usable.
+1. Unified monitoring, alerting and auditing across federated Kubernetes clusters.
+1. Unified authentication, authorization and quota management across
+ clusters (this is in direct conflict with failure isolation above,
+ so there are some tough trade-offs to be made here).
+
+## Proposed High-Level Architectures
+
+Two distinct potential architectural approaches have emerged from discussions
+thus far:
+
+1. An explicitly decoupled and hierarchical architecture, where the
+ Federation Control Plane sits logically above a set of independent
+ Kubernetes clusters, each of which is (potentially) unaware of the
+ other clusters, and of the Federation Control Plane itself (other
+ than to the extent that it is an API client much like any other).
+ One possible example of this general architecture is illustrated
+ below, and will be referred to as the "Decoupled, Hierarchical"
+ approach.
+1. A more monolithic architecture, where a single instance of the
+ Kubernetes control plane itself manages a single logical cluster
+ composed of nodes in multiple availability zones and cloud
+ providers.
+
+A very brief, non-exhaustive list of the pros and cons of the two
+approaches follows. (In the interest of full disclosure, the author
+prefers the Decoupled Hierarchical model for the reasons stated below).
+
+1. **Failure isolation:** The Decoupled Hierarchical approach provides
+ better failure isolation than the Monolithic approach, as each
+ underlying Kubernetes cluster, and the Federation Control Plane,
+ can operate and fail completely independently of each other. In
+ particular, their software and configurations can be updated
+ independently. Such updates are, in our experience, the primary
+ cause of control-plane failures, in general.
+1. **Failure probability:** The Decoupled Hierarchical model incorporates
+ numerically more independent pieces of software and configuration
+ than the Monolithic one. But the complexity of each of these
+ decoupled pieces is arguably better contained in the Decoupled
+ model (per standard arguments for modular rather than monolithic
+ software design). Which of the two models presents higher
+ aggregate complexity and consequent failure probability remains
+ somewhat of an open question.
+1. **Scalability:** Conceptually the Decoupled Hierarchical model wins
+ here, as each underlying Kubernetes cluster can be scaled
+ completely independently w.r.t. scheduling, node state management,
+ monitoring, network connectivity etc. It is even potentially
+ feasible to stack federations of clusters (i.e. create
+ federations of federations) should scalability of the independent
+ Federation Control Plane become an issue (although the author does
+ not envision this being a problem worth solving in the short
+ term).
+1. **Code complexity:** I think that an argument can be made both ways
+ here. It depends on whether you prefer to weave the logic for
+ handling nodes in multiple availability zones and cloud providers
+ within a single logical cluster into the existing Kubernetes
+ control plane code base (which was explicitly not designed for
+ this), or separate it into a decoupled Federation system (with
+ possible code sharing between the two via shared libraries). The
+ author prefers the latter because it:
+ 1. Promotes better code modularity and interface design.
+ 1. Allows the code
+ bases of Kubernetes and the Federation system to progress
+ largely independently (different sets of developers, different
+ release schedules etc).
+1. **Administration complexity:** Again, I think that this could be argued
+ both ways. Superficially it would seem that administration of a
+ single Monolithic multi-zone cluster might be simpler by virtue of
+   being only "one thing to manage"; however, in practice each of the
+ underlying availability zones (and possibly cloud providers) has
+ its own capacity, pricing, hardware platforms, and possibly
+ bureaucratic boundaries (e.g. "our EMEA IT department manages those
+ European clusters"). So explicitly allowing for (but not
+ mandating) completely independent administration of each
+ underlying Kubernetes cluster, and the Federation system itself,
+ in the Decoupled Hierarchical model seems to have real practical
+ benefits that outweigh the superficial simplicity of the
+ Monolithic model.
+1. **Application development and deployment complexity:** It's not clear
+ to me that there is any significant difference between the two
+ models in this regard. Presumably the API exposed by the two
+ different architectures would look very similar, as would the
+ behavior of the deployed applications. It has even been suggested
+ to write the code in such a way that it could be run in either
+   configuration. It's not clear that this makes sense in practice
+ though.
+1. **Control plane cost overhead:** There is a minimum per-cluster
+   overhead -- two (possibly virtual) machines, or more for redundant HA
+ deployments. For deployments of very small Kubernetes
+ clusters with the Decoupled Hierarchical approach, this cost can
+ become significant.
+
+### The Decoupled, Hierarchical Approach - Illustrated
+
+![image](federation-high-level-arch.png)
+
+## Cluster Federation API
+
+It is proposed that this look a lot like the existing Kubernetes API
+but be explicitly multi-cluster.
+
++ Clusters become first class objects, which can be registered,
+ listed, described, deregistered etc via the API.
++ Compute resources can be explicitly requested in specific clusters,
+ or automatically scheduled to the "best" cluster by the Cluster
+ Federation control system (by a
+ pluggable Policy Engine).
++ There is a federated equivalent of a replication controller type (or
+ perhaps a [deployment](deployment.md)),
+ which is multicluster-aware, and delegates to cluster-specific
+ replication controllers/deployments as required (e.g. a federated RC for n
+ replicas might simply spawn multiple replication controllers in
+  different clusters to do the hard work). A rough sketch of what these
+  objects might look like follows this list.
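+
+Purely as an illustration of the shape such objects might take (the type
+and field names below are hypothetical, not drawn from any
+implementation), the first-class cluster object and its federated
+replication controller counterpart could look roughly like:
+
+```
+// Hypothetical sketch only; not a proposed API.
+type Cluster struct {
+	Name string
+	// Address of this cluster's Kubernetes API server.
+	Endpoint string
+	// Credentials the Federation Control Plane uses to act as an ordinary
+	// API client of this cluster.
+	CredentialsSecret string
+}
+
+type FederatedReplicationController struct {
+	Name     string
+	Replicas int
+	// Placement preferences consumed by the pluggable Policy Engine.
+	PlacementPolicy string
+	// Per-cluster replica counts decided by the Policy Engine; each entry is
+	// realized as an ordinary replication controller in that cluster.
+	ClusterReplicas map[string]int
+}
+```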
+
+## Policy Engine and Migration/Replication Controllers
+
+The Policy Engine decides which parts of each application go into each
+cluster at any point in time, and stores this desired state in the
+Desired Federation State store (an etcd or
+similar). Migration/Replication Controllers reconcile this against the
+desired states stored in the underlying Kubernetes clusters (by
+watching both, and creating or updating the underlying Replication
+Controllers and related Services accordingly).
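+
+A very rough sketch of one reconciliation pass, with entirely hypothetical
+types standing in for the desired-state store and the per-cluster API
+clients:
+
+```
+// clusterClient is an illustrative stand-in for an ordinary Kubernetes API
+// client scoped to one federated cluster.
+type clusterClient interface {
+	CurrentReplicas(name string) int
+	SetReplicas(name string, n int)
+}
+
+// reconcileOnce drives each cluster towards the Policy Engine's desired
+// per-cluster replica counts for a single application.
+func reconcileOnce(app string, desired map[string]int, clusters map[string]clusterClient) {
+	for clusterName, want := range desired {
+		c, ok := clusters[clusterName]
+		if !ok {
+			continue // cluster not registered (yet)
+		}
+		if have := c.CurrentReplicas(app); have != want {
+			// Create or resize the underlying replication controller so the
+			// cluster converges on the desired federation state.
+			c.SetReplicas(app, want)
+		}
+	}
+}
+```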
+
+## Authentication and Authorization
+
+This should ideally be delegated to some external auth system, shared
+by the underlying clusters, to avoid duplication and inconsistency.
+Either that, or we end up with multilevel auth. Local readonly
+eventually consistent auth slaves in each cluster and in the Cluster
+Federation control system
+could potentially cache auth, to mitigate an SPOF auth system.
+
+## Data consistency, failure and availability characteristics
+
+The services comprising the Cluster Federation control plane have to run
+somewhere. Several options exist here:
+* For high availability Cluster Federation deployments, these
+ services may run in either:
+ * a dedicated Kubernetes cluster, not co-located in the same
+ availability zone with any of the federated clusters (for fault
+ isolation reasons). If that cluster/availability zone, and hence the Federation
+ system, fails catastrophically, the underlying pods and
+ applications continue to run correctly, albeit temporarily
+ without the Federation system.
+ * across multiple Kubernetes availability zones, probably with
+ some sort of cross-AZ quorum-based store. This provides
+ theoretically higher availability, at the cost of some
+ complexity related to data consistency across multiple
+ availability zones.
+* For simpler, less highly available deployments, just co-locate the
+ Federation control plane in/on/with one of the underlying
+ Kubernetes clusters. The downside of this approach is that if
+ that specific cluster fails, all automated failover and scaling
+ logic which relies on the federation system will also be
+ unavailable at the same time (i.e. precisely when it is needed).
+ But if one of the other federated clusters fails, everything
+ should work just fine.
+
+There is some further thinking to be done around the data consistency
+ model upon which the Federation system is based, and its impact
+ on the detailed semantics, failure and availability
+ characteristics of the system.
+
+## Proposed Next Steps
+
+Identify concrete applications of each use case and configure a proof
+of concept service that exercises the use case. For example, cluster
+failure tolerance seems popular, so set up an Apache frontend with
+replicas in each of three availability zones, with either an Amazon Elastic
+Load Balancer or a Google Cloud Load Balancer pointing at them. What
+does the zookeeper config look like for N=3 across 3 AZs -- and how
+does each replica find the other replicas and how do clients find
+their primary zookeeper replica? And now how do I do a shared, highly
+available redis database? Use a few common specific use cases like
+this to flesh out the detailed API and semantics of Cluster Federation.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/federation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/flannel-integration.md b/contributors/design-proposals/flannel-integration.md
new file mode 100644
index 00000000..465ee5e6
--- /dev/null
+++ b/contributors/design-proposals/flannel-integration.md
@@ -0,0 +1,132 @@
+# Flannel integration with Kubernetes
+
+## Why?
+
+* Networking works out of the box.
+* Cloud gateway configuration is regulated by quota.
+* Consistent bare metal and cloud experience.
+* Lays foundation for integrating with networking backends and vendors.
+
+## How?
+
+Thus:
+
+```
+Master | Node1
+----------------------------------------------------------------------
+{192.168.0.0/16, 256 /24} | docker
+ | | | restart with podcidr
+apiserver <------------------ kubelet (sends podcidr)
+ | | | here's podcidr, mtu
+flannel-server:10253 <------------------ flannel-daemon
+Allocates a /24 ------------------> [config iptables, VXLan]
+ <------------------ [watch subnet leases]
+I just allocated ------------------> [config VXLan]
+another /24 |
+```
+
+## Proposal
+
+Explaining vxlan is out of scope for this document; however, it does take some basic understanding to grok the proposal. Assume some pod wants to communicate across nodes with the above setup. Check the flannel vxlan devices:
+
+```console
+node1 $ ip -d link show flannel.1
+4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT
+ link/ether a2:53:86:b5:5f:c1 brd ff:ff:ff:ff:ff:ff
+ vxlan
+node1 $ ip -d link show eth0
+2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000
+ link/ether 42:01:0a:f0:00:04 brd ff:ff:ff:ff:ff:ff
+
+node2 $ ip -d link show flannel.1
+4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN mode DEFAULT
+ link/ether 56:71:35:66:4a:d8 brd ff:ff:ff:ff:ff:ff
+ vxlan
+node2 $ ip -d link show eth0
+2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT qlen 1000
+ link/ether 42:01:0a:f0:00:03 brd ff:ff:ff:ff:ff:ff
+```
+
+Note that we're ignoring cbr0 for the sake of simplicity. Spin up a container on each node. We're using raw docker for this example only because we want control over where the container lands:
+
+```
+node1 $ docker run -it radial/busyboxplus:curl /bin/sh
+[ root@5ca3c154cde3:/ ]$ ip addr show
+1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
+8: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue
+ link/ether 02:42:12:10:20:03 brd ff:ff:ff:ff:ff:ff
+ inet 192.168.32.3/24 scope global eth0
+ valid_lft forever preferred_lft forever
+
+node2 $ docker run -it radial/busyboxplus:curl /bin/sh
+[ root@d8a879a29f5d:/ ]$ ip addr show
+1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
+16: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue
+ link/ether 02:42:12:10:0e:07 brd ff:ff:ff:ff:ff:ff
+ inet 192.168.14.7/24 scope global eth0
+ valid_lft forever preferred_lft forever
+[ root@d8a879a29f5d:/ ]$ ping 192.168.32.3
+PING 192.168.32.3 (192.168.32.3): 56 data bytes
+64 bytes from 192.168.32.3: seq=0 ttl=62 time=1.190 ms
+```
+
+__What happened?__:
+
+From 1000 feet:
+* vxlan device driver starts up on node1 and creates a udp tunnel endpoint on 8472
+* container 192.168.32.3 pings 192.168.14.7
+ - what's the MAC of 192.168.14.0?
+ - L2 miss, flannel looks up MAC of subnet
+ - Stores `192.168.14.0 <-> 56:71:35:66:4a:d8` in neighbor table
+ - what's tunnel endpoint of this MAC?
+ - L3 miss, flannel looks up destination VM ip
+ - Stores `10.240.0.3 <-> 56:71:35:66:4a:d8` in bridge database
+* Sends `[56:71:35:66:4a:d8, 10.240.0.3][vxlan: port, vni][02:42:12:10:20:03, 192.168.14.7][icmp]`
+
+__But will it blend?__
+
+Kubernetes integration is fairly straightforward once we understand the pieces involved, and can be prioritized as follows:
+* Kubelet understands flannel daemon in client mode, flannel server manages independent etcd store on master, node controller backs off CIDR allocation
+* Flannel server consults the Kubernetes master for everything network related
+* Flannel daemon works through network plugins in a generic way without bothering the kubelet: needs CNI x Kubernetes standardization
+
+The first is accomplished in this PR, while a timeline for 2. and 3. is TBD. To implement the flannel API we can either run a proxy per node and get rid of the flannel server, or service all requests in the flannel server with something like a goroutine per node:
+* `/network/config`: read network configuration and return
+* `/network/leases`:
+ - Post: Return a lease as understood by flannel
+  - Look up node by IP
+  - Store node metadata from the [flannel request](https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L34) in annotations
+  - Return a [Lease object](https://github.com/coreos/flannel/blob/master/subnet/subnet.go#L40) reflecting the node CIDR
+ - Get: Handle a watch on leases
+* `/network/leases/subnet`:
+  - Put: This is a request for a lease. If the node controller is allocating CIDRs we can probably just no-op.
+* `/network/reservations`: TBD, we can probably use this to accommodate the node controller allocating CIDRs instead of flannel requesting them
+
+The ickiest part of this implementation is going to be `GET /network/leases`, i.e. the watch proxy. We can side-step this by waiting for a more generic Kubernetes resource. However, we can also implement it as follows (a rough sketch follows this list):
+* Watch all nodes, ignore heartbeats
+* On each change, figure out the lease for the node, construct a [lease watch result](https://github.com/coreos/flannel/blob/0bf263826eab1707be5262703a8092c7d15e0be4/subnet/subnet.go#L72), and send it down the watch with the RV from the node
+* Implement a lease list that does a similar translation
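+
+Very roughly, and using simplified stand-in types rather than flannel's
+actual subnet types linked above (the annotation key below is also
+hypothetical), the per-node translation at the heart of the watch proxy
+might look like:
+
+```
+// Simplified stand-ins for the Kubernetes Node and flannel lease types.
+type node struct {
+	Name        string
+	PodCIDR     string            // e.g. "192.168.32.0/24"
+	Annotations map[string]string // flannel metadata such as the VTEP MAC
+}
+
+type lease struct {
+	Subnet string
+	Attrs  map[string]string
+}
+
+// nodeToLease is the translation the watch proxy would perform on every
+// observed node change before sending a watch result downstream.
+func nodeToLease(n node) lease {
+	return lease{
+		Subnet: n.PodCIDR,
+		Attrs: map[string]string{
+			// The VTEP data flannel sent in its lease request would be stored
+			// on the node as an annotation and echoed back here.
+			"vtepMAC": n.Annotations["flannel.alpha.kubernetes.io/vtep-mac"],
+		},
+	}
+}
+```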
+
+I say this is gross without an API object because for each node->lease translation one has to store and retrieve the node metadata sent by flannel (e.g. the VTEP) from node annotations. [Reference implementation](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/flannel_server.go) and [watch proxy](https://github.com/bprashanth/kubernetes/blob/network_vxlan/pkg/kubelet/watch_proxy.go).
+
+# Limitations
+
+* Integration is experimental
+* Flannel etcd data is not stored on a persistent disk
+* CIDR allocation does *not* flow from Kubernetes down to nodes anymore
+
+# Wishlist
+
+This proposal is really just a call for community help in writing a Kubernetes x flannel backend.
+
+* CNI plugin integration
+* Flannel daemon in privileged pod
+* Flannel server talks to apiserver, described in proposal above
+* HTTPS between flannel daemon/server
+* Investigate flannel server running on every node (as done in the reference implementation mentioned above)
+* Use flannel reservation mode to support node controller podcidr allocation
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/flannel-integration.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/garbage-collection.md b/contributors/design-proposals/garbage-collection.md
new file mode 100644
index 00000000..b24a7f21
--- /dev/null
+++ b/contributors/design-proposals/garbage-collection.md
@@ -0,0 +1,357 @@
+**Table of Contents**
+
+- [Overview](#overview)
+- [Cascading deletion with Garbage Collector](#cascading-deletion-with-garbage-collector)
+- [Orphaning the descendants with "orphan" finalizer](#orphaning-the-descendants-with-orphan-finalizer)
+ - [Part I. The finalizer framework](#part-i-the-finalizer-framework)
+ - [Part II. The "orphan" finalizer](#part-ii-the-orphan-finalizer)
+- [Related issues](#related-issues)
+ - [Orphan adoption](#orphan-adoption)
+ - [Upgrading a cluster to support cascading deletion](#upgrading-a-cluster-to-support-cascading-deletion)
+- [End-to-End Examples](#end-to-end-examples)
+ - [Life of a Deployment and its descendants](#life-of-a-deployment-and-its-descendants)
+- [Open Questions](#open-questions)
+- [Considered and Rejected Designs](#considered-and-rejected-designs)
+- [1. Tombstone + GC](#1-tombstone--gc)
+- [2. Recovering from abnormal cascading deletion](#2-recovering-from-abnormal-cascading-deletion)
+
+
+# Overview
+
+Currently most cascading deletion logic is implemented on the client side. For example, when deleting a replica set, kubectl uses a reaper to delete the created pods and then delete the replica set. We plan to move cascading deletion to the server to simplify the client-side logic. In this proposal, we present the garbage collector, which implements cascading deletion for all API resources in a generic way; we also present the finalizer framework, particularly the "orphan" finalizer, to allow flexibly choosing between cascading deletion and orphaning.
+
+Goals of the design include:
+* Supporting cascading deletion at the server-side.
+* Centralizing the cascading deletion logic, rather than spreading it across controllers.
+* Optionally allowing the dependent objects to be orphaned.
+
+Non-goals include:
+* Releasing the name of an object immediately, so it can be reused ASAP.
+* Propagating the grace period in cascading deletion.
+
+# Cascading deletion with Garbage Collector
+
+## API Changes
+
+```
+type ObjectMeta struct {
+ ...
+ OwnerReferences []OwnerReference
+}
+```
+
+**ObjectMeta.OwnerReferences**:
+List of objects depended on by this object. If ***all*** objects in the list have been deleted, this object will be garbage collected. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.OwnerReferences pointing to `D`, set by the deployment controller when `R` is created. This field can be updated by any client that has the privilege to both update ***and*** delete the object. For safety reasons, we can add validation rules to restrict what resources could be set as owners. For example, Events will likely be banned from being owners.
+
+```
+type OwnerReference struct {
+ // Version of the referent.
+ APIVersion string
+ // Kind of the referent.
+ Kind string
+ // Name of the referent.
+ Name string
+ // UID of the referent.
+ UID types.UID
+}
+```
+
+**OwnerReference struct**: OwnerReference contains enough information to let you identify an owning object. Please refer to the inline comments for the meaning of each field. Currently, an owning object must be in the same namespace as the dependent object, so there is no namespace field.
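+
+For example (a sketch only, using the fields defined above; `d` and `r`
+stand for the deployment and the freshly constructed replica set), the
+deployment controller would populate the owner reference roughly as
+follows:
+
+```
+// Link the new replica set r back to the deployment d that created it.
+r.ObjectMeta.OwnerReferences = append(r.ObjectMeta.OwnerReferences, OwnerReference{
+	APIVersion: "extensions/v1beta1",
+	Kind:       "Deployment",
+	Name:       d.Name,
+	UID:        d.UID,
+})
+```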
+
+## New components: the Garbage Collector
+
+The Garbage Collector is responsible for deleting an object if none of the owners listed in the object's OwnerReferences exist.
+The Garbage Collector consists of a scanner, a garbage processor, and a propagator.
+* Scanner:
+ * Uses the discovery API to detect all the resources supported by the system.
+ * Periodically scans all resources in the system and adds each object to the *Dirty Queue*.
+
+* Garbage Processor (its worker loop is sketched in code after this list):
+ * Consists of the *Dirty Queue* and workers.
+ * Each worker:
+ * Dequeues an item from *Dirty Queue*.
+ * If the item's OwnerReferences is empty, continues to process the next item in the *Dirty Queue*.
+ * Otherwise checks each entry in the OwnerReferences:
+      * If at least one owner exists, does nothing.
+ * If none of the owners exist, requests the API server to delete the item.
+
+* Propagator:
+ * The Propagator is for optimization, not for correctness.
+ * Consists of an *Event Queue*, a single worker, and a DAG of owner-dependent relations.
+ * The DAG stores only name/uid/orphan triplets, not the entire body of every item.
+ * Watches for create/update/delete events for all resources, enqueues the events to the *Event Queue*.
+ * Worker:
+ * Dequeues an item from the *Event Queue*.
+    * If the item is a creation or update, then updates the DAG accordingly.
+ * If the object has an owner and the owner doesn’t exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
+ * If the item is a deletion, then removes the object from the DAG, and enqueues all its dependent objects to the *Dirty Queue*.
+ * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
+ * With the Propagator, we *only* need to run the Scanner when starting the GC to populate the DAG and the *Dirty Queue*.
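+
+A minimal sketch of the Garbage Processor worker loop described above; the
+queue, item and client types are illustrative stand-ins, not real
+Kubernetes types:
+
+```
+type item struct {
+	Name            string
+	OwnerReferences []OwnerReference
+}
+
+type apiClient interface {
+	OwnerExists(ref OwnerReference) bool
+	Delete(it item)
+}
+
+// worker drains the Dirty Queue and deletes any object whose listed owners
+// have all disappeared.
+func worker(dirtyQueue <-chan item, client apiClient) {
+	for it := range dirtyQueue {
+		if len(it.OwnerReferences) == 0 {
+			continue // nothing owns this object; leave it alone
+		}
+		ownerExists := false
+		for _, ref := range it.OwnerReferences {
+			if client.OwnerExists(ref) {
+				ownerExists = true
+				break
+			}
+		}
+		if !ownerExists {
+			// None of the listed owners exist any more: garbage collect.
+			client.Delete(it)
+		}
+	}
+}
+```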
+
+# Orphaning the descendants with "orphan" finalizer
+
+Users may want to delete an owning object (e.g., a replicaset) while orphaning the dependent objects (e.g., pods), that is, leaving the dependent objects untouched. We support such use cases by introducing the "orphan" finalizer. Finalizers are a generic API mechanism with uses beyond orphaning, so we first describe the generic finalizer framework, then describe the specific design of the "orphan" finalizer.
+
+## Part I. The finalizer framework
+
+## API changes
+
+```
+type ObjectMeta struct {
+ …
+ Finalizers []string
+}
+```
+
+**ObjectMeta.Finalizers**: List of finalizers that need to run before deleting the object. This list must be empty before the object is deleted from the registry. Each string in the list is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed. For safety reasons, updating finalizers requires special privileges. To enforce the admission rules, we will expose finalizers as a subresource and disallow directly changing finalizers when updating the main resource.
+
+## New components
+
+* Finalizers:
+  * Like a controller, a finalizer is always running (a minimal control
+    loop is sketched after this list).
+ * A third party can develop and run their own finalizer in the cluster. A finalizer doesn't need to be registered with the API server.
+ * Watches for update events that meet two conditions:
+ 1. the updated object has the identifier of the finalizer in ObjectMeta.Finalizers;
+ 2. ObjectMeta.DeletionTimestamp is updated from nil to non-nil.
+ * Applies the finalizing logic to the object in the update event.
+ * After the finalizing logic is completed, removes itself from ObjectMeta.Finalizers.
+ * The API server deletes the object after the last finalizer removes itself from the ObjectMeta.Finalizers field.
+  * Because it's possible for the finalizing logic to be applied multiple times (e.g., the finalizer crashes after applying the finalizing logic but before being removed from ObjectMeta.Finalizers), the finalizing logic has to be idempotent.
+ * If a finalizer fails to act in a timely manner, users with proper privileges can manually remove the finalizer from ObjectMeta.Finalizers. We will provide a kubectl command to do this.
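+
+A minimal sketch of such a finalizer's control loop; the object and client
+types, the finalizer name and the `finalize` helper are all hypothetical:
+
+```
+const myFinalizer = "example.com/my-finalizer" // illustrative identifier
+
+type object struct {
+	Name              string
+	DeletionTimestamp *string // simplified; the real field is a timestamp
+	Finalizers        []string
+}
+
+type objClient interface {
+	RemoveFinalizer(obj object, name string)
+}
+
+// runFinalizer acts only on objects that are being deleted and still list
+// this finalizer, then removes itself; the API server deletes the object
+// once the Finalizers list is empty.
+func runFinalizer(updates <-chan object, client objClient, finalize func(object)) {
+	for obj := range updates {
+		if obj.DeletionTimestamp == nil || !contains(obj.Finalizers, myFinalizer) {
+			continue
+		}
+		finalize(obj) // must be idempotent; it may run more than once
+		client.RemoveFinalizer(obj, myFinalizer)
+	}
+}
+
+func contains(list []string, s string) bool {
+	for _, x := range list {
+		if x == s {
+			return true
+		}
+	}
+	return false
+}
+```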
+
+## Changes to existing components
+
+* API server:
+ * Deletion handler:
+ * If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, then updates the DeletionTimestamp, but does not delete the object.
+ * If the `ObjectMeta.Finalizers` is empty and the options.GracePeriod is zero, then deletes the object. If the options.GracePeriod is non-zero, then just updates the DeletionTimestamp.
+ * Update handler:
+ * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, and the DeletionGracePeriodSeconds is zero, then deletes the object from the registry.
+ * If the update removes the last finalizer, and the DeletionTimestamp is non-nil, but the DeletionGracePeriodSeconds is non-zero, then just updates the object.
+
+## Part II. The "orphan" finalizer
+
+## API changes
+
+```
+type DeleteOptions struct {
+ …
+ OrphanDependents bool
+}
+```
+
+**DeleteOptions.OrphanDependents**: allows a user to express whether the dependent objects should be orphaned. It defaults to true, because controllers before release 1.2 expect dependent objects to be orphaned.
+
+## Changes to existing components
+
+* API server:
+When handling a deletion request, depending on whether DeleteOptions.OrphanDependents is true, the API server updates the object to add or remove the "orphan" finalizer in ObjectMeta.Finalizers.
+
+
+## New components
+
+Adding a fourth component to the Garbage Collector, the "orphan" finalizer:
+* Watches for update events as described in [Part I](#part-i-the-finalizer-framework).
+* Removes the object in the event from the `OwnerReferences` of its dependents.
+ * dependent objects can be found via the DAG kept by the GC, or by relisting the dependent resource and checking the OwnerReferences field of each potential dependent object.
+* Also removes any dangling owner references the dependent objects have.
+* Finally, removes itself from the `ObjectMeta.Finalizers` of the object.
+
+# Related issues
+
+## Orphan adoption
+
+Controllers are responsible for adopting orphaned dependent resources. To do so, controllers
+* Check a potential dependent object’s OwnerReferences to determine if it is orphaned.
+* Fill in the OwnerReferences if the object matches the controller’s selector and is orphaned.
+
+There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. Imagine this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list, but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent object's OwnerReferences, the "orphan" finalizer checks the Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion.
+
+## Upgrading a cluster to support cascading deletion
+
+For the master, after upgrading to a version that supports cascading deletion, the OwnerReferences of existing objects remain empty, so the controllers will regard them as orphaned and start the adoption procedures. After the adoptions are done, server-side cascading will be effective for these existing objects.
+
+For nodes, cascading deletion does not affect them.
+
+For kubectl, we will keep the kubectl’s cascading deletion logic for one more release.
+
+# End-to-End Examples
+
+This section presents an example of all components working together to enforce the cascading deletion or orphaning.
+
+## Life of a Deployment and its descendants
+
+1. User creates a deployment `D1`.
+2. The Propagator of the GC observes the creation. It creates an entry of `D1` in the DAG.
+3. The deployment controller observes the creation of `D1`. It creates the replicaset `R1`, whose OwnerReferences field contains a reference to `D1`, and has the "orphan" finalizer in its ObjectMeta.Finalizers map.
+4. The Propagator of the GC observes the creation of `R1`. It creates an entry of `R1` in the DAG, with `D1` as its owner.
+5. The replicaset controller observes the creation of `R1` and creates Pods `P1`~`Pn`, all with `R1` in their OwnerReferences.
+6. The Propagator of the GC observes the creation of `P1`~`Pn`. It creates entries for them in the DAG, with `R1` as their owner.
+
+ ***In case the user wants to cascadingly delete `D1`'s descendants, then***
+
+7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=false`. The API server checks whether `D1` has the "orphan" finalizer in its Finalizers map; if so, it updates `D1` to remove the "orphan" finalizer. Then the API server deletes `D1`.
+8. The "orphan" finalizer does *not* take any action, because the observed deletion shows `D1` has an empty Finalizers map.
+9. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent object, replicaset `R1`, to the *dirty queue*.
+10. The Garbage Processor of the GC dequeues `R1` from the *dirty queue*. It finds `R1` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests API server to delete `R1`, with `DeleteOptions.OrphanDependents=false`. (The Garbage Processor should always set this field to false.)
+11. The API server updates `R1` to remove the "orphan" finalizer if it's in the `R1`'s Finalizers map. Then the API server deletes `R1`, as `R1` has an empty Finalizers map.
+12. The Propagator of the GC observes the deletion of `R1`. It deletes `R1` from the DAG. It adds its dependent objects, Pods `P1`~`Pn`, to the *Dirty Queue*.
+13. The Garbage Processor of the GC dequeues `Px` (1 <= x <= n) from the *Dirty Queue*. It finds that `Px` has an owner reference pointing to `R1`, and `R1` no longer exists, so it requests the API server to delete `Px`, with `DeleteOptions.OrphanDependents=false`.
+14. API server deletes the Pods.
+
+ ***In case the user wants to orphan `D1`'s descendants, then***
+
+7. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=true`.
+8. The API server first updates `D1`, with DeletionTimestamp=now and DeletionGracePeriodSeconds=0, increments the Generation by 1, and adds the "orphan" finalizer to ObjectMeta.Finalizers if it's not present yet. The API server does not delete `D1`, because its Finalizers map is not empty.
+9. The deployment controller observes the update, and acknowledges by updating the `D1`'s ObservedGeneration. The deployment controller won't create more replicasets on `D1`'s behalf.
+10. The "orphan" finalizer observes the update, and notes down the Generation. It waits until the ObservedGeneration becomes equal to or greater than the noted Generation. Then it updates `R1` to remove `D1` from its OwnerReferences. At last, it updates `D1`, removing itself from `D1`'s Finalizers map.
+11. The API server handles the update of `D1`: because *i)* the DeletionTimestamp is non-nil, *ii)* the DeletionGracePeriodSeconds is zero, and *iii)* the last finalizer has been removed from the Finalizers map, the API server deletes `D1`.
+12. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent, replicaset `R1`, to the *Dirty Queue*.
+13. The Garbage Processor of the GC dequeues `R1` from the *Dirty Queue* and skips it, because its OwnerReferences is empty.
+
+# Open Questions
+
+1. In case an object has multiple owners, some owners are deleted with DeleteOptions.OrphanDependents=true, and some are deleted with DeleteOptions.OrphanDependents=false, what should happen to the object?
+
+   The presented design will respect the setting in the deletion request of the last owner.
+
+2. How to propagate the grace period in a cascading deletion? For example, when deleting a ReplicaSet with a grace period of 5s, a user may expect the same grace period to be applied to the deletion of the Pods controlled by the ReplicaSet.
+
+   Propagating the grace period in a cascading deletion is a ***non-goal*** of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later the GC will remove it when all dependents are in the graceful deletion state.
+
+ [#25055](https://github.com/kubernetes/kubernetes/issues/25055) tracks this problem.
+
+3. How can a client know when the cascading deletion is completed?
+
+ A tentative solution is introducing a "completing cascading deletion" finalizer, which will be added to the finalizers list of the owning object, and removed by the GC when all dependents are deleted. The user can watch for the deletion event of the owning object to ensure the cascading deletion process has completed.
+
+
+---
+***THE REST IS FOR ARCHIVAL PURPOSES***
+---
+
+# Considered and Rejected Designs
+
+# 1. Tombstone + GC
+
+## Reasons of rejection
+
+* It would likely conflict with our future plan to use all resources as their own tombstones, once the registry supports multi-object transactions.
+* The TTL of the tombstone is hand-wavy; there is no guarantee that the value of the TTL is long enough.
+* This design is essentially the same as the selected design, with the tombstone as an extra element. The benefit the extra complexity buys is that a parent object can be deleted immediately even if the user wants to orphan the children. The benefit doesn't justify the complexity.
+
+
+## API Changes
+
+```
+type DeleteOptions struct {
+ …
+ OrphanChildren bool
+}
+```
+
+**DeleteOptions.OrphanChildren**: allows a user to express whether the child objects should be orphaned.
+
+```
+type ObjectMeta struct {
+ ...
+ ParentReferences []ObjectReference
+}
+```
+
+**ObjectMeta.ParentReferences**: links the resource to the parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created. It can be updated after the creation.
+
+```
+type Tombstone struct {
+ unversioned.TypeMeta
+ ObjectMeta
+ UID types.UID
+}
+```
+
+**Tombstone**: a tombstone is created when an object is deleted and the user requires the children to be orphaned.
+**Tombstone.UID**: the UID of the original object.
+
+## New components
+
+The only new component is the Garbage Collector, which consists of a scanner, a garbage processor, and a propagator.
+* Scanner:
+ * Uses the discovery API to detect all the resources supported by the system.
+  * For performance reasons, resources can be marked as not participating in cascading deletion in the discovery info, then the GC will not monitor them.
+ * Periodically scans all resources in the system and adds each object to the *Dirty Queue*.
+
+* Garbage Processor:
+ * Consists of the *Dirty Queue* and workers.
+ * Each worker:
+ * Dequeues an item from *Dirty Queue*.
+ * If the item's ParentReferences is empty, continues to process the next item in the *Dirty Queue*.
+ * Otherwise checks each entry in the ParentReferences:
+ * If a parent exists, continues to check the next parent.
+ * If a parent doesn't exist, checks if a tombstone standing for the parent exists.
+      * If the step above shows that no parent nor tombstone exists, requests the API server to delete the item. That is, the child object will be garbage collected only if ***all*** parents are non-existent and none of them have tombstones.
+ * Otherwise removes the item's ParentReferences to non-existent parents.
+
+* Propagator:
+ * The Propagator is for optimization, not for correctness.
+ * Maintains a DAG of parent-child relations. This DAG stores only name/uid/orphan triplets, not the entire body of every item.
+ * Consists of an *Event Queue* and a single worker.
+  * Watches for create/update/delete events for all resources that participate in cascading deletion, enqueues the events to the *Event Queue*.
+ * Worker:
+ * Dequeues an item from the *Event Queue*.
+    * If the item is a creation or update, then updates the DAG accordingly.
+ * If the object has a parent and the parent doesn’t exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
+ * If the item is a deletion, then removes the object from the DAG, and enqueues all its children to the *Dirty Queue*.
+ * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
+ * With the Propagator, we *only* need to run the Scanner when starting the Propagator to populate the DAG and the *Dirty Queue*.
+
+## Changes to existing components
+
+* Storage: we should add a REST storage for Tombstones. The index should be UID rather than namespace/name.
+
+* API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created.
+
+* Controllers: when creating child objects, controllers need to fill in their ObjectMeta.ParentReferences field. Objects that don’t have a parent should have the namespace object as the parent.
+
+## Comparison with the selected design
+
+The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry.
+
+* Advantages of "Tombstone + GC" design
+ * Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children’s ObjectMeta.ParentReferences.
+* Advantages of "Finalizer Framework + GC"
+ * The finalizer framework is needed for other purposes as well.
+
+
+# 2. Recovering from abnormal cascading deletion
+
+## Reasons of rejection
+
+* Not a goal
+* Tons of work, not feasible in the near future
+
+In case the garbage collector is mistakenly deleting objects, we should provide a mechanism to stop the garbage collector and restore the objects.
+
+* Stopping the garbage collector
+
+ We will add a "--enable-garbage-collector" flag to the controller manager binary to indicate if the garbage collector should be enabled. Admin can stop the garbage collector in a running cluster by restarting the kube-controller-manager with --enable-garbage-collector=false.
+
+* Restoring mistakenly deleted objects
+ * Guidelines
+ * The restoration should be implemented as a roll-forward rather than a roll-back, because likely the state of the cluster (e.g., available resources on a node) has changed since the object was deleted.
+ * Need to archive the complete specs of the deleted objects.
+    * The content of the archive is sensitive, so access to the archive is subject to the same authorization policy enforced on the original resource.
+ * States should be stored in etcd. All components should remain stateless.
+
+ * A preliminary design
+
+ This is a generic design for “undoing a deletion”, not specific to undoing cascading deletion.
+ * Add a `/archive` sub-resource to every resource, it's used to store the spec of the deleted objects.
+ * Before an object is deleted from the registry, the API server clears fields like DeletionTimestamp, then creates the object in /archive and sets a TTL.
+ * Add a `kubectl restore` command, which takes a resource/name pair as input, creates the object with the spec stored in the /archive, and deletes the archived object.
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/garbage-collection.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/gpu-support.md b/contributors/design-proposals/gpu-support.md
new file mode 100644
index 00000000..604f64bd
--- /dev/null
+++ b/contributors/design-proposals/gpu-support.md
@@ -0,0 +1,279 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [GPU support](#gpu-support)
+ - [Objective](#objective)
+ - [Background](#background)
+ - [Detailed discussion](#detailed-discussion)
+ - [Inventory](#inventory)
+ - [Scheduling](#scheduling)
+ - [The runtime](#the-runtime)
+ - [NVIDIA support](#nvidia-support)
+ - [Event flow](#event-flow)
+ - [Too complex for now: nvidia-docker](#too-complex-for-now-nvidia-docker)
+ - [Implementation plan](#implementation-plan)
+ - [V0](#v0)
+ - [Scheduling](#scheduling-1)
+ - [Runtime](#runtime)
+ - [Other](#other)
+ - [Future work](#future-work)
+ - [V1](#v1)
+ - [V2](#v2)
+ - [V3](#v3)
+ - [Undetermined](#undetermined)
+ - [Security considerations](#security-considerations)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# GPU support
+
+Author: @therc
+
+Date: Apr 2016
+
+Status: Design in progress, early implementation of requirements
+
+## Objective
+
+Users should be able to request GPU resources for their workloads, as easily as
+for CPU or memory. Kubernetes should keep an inventory of machines with GPU
+hardware, schedule containers on appropriate nodes and set up the container
+environment with all that's necessary to access the GPU. All of this should
+eventually be supported for clusters on either bare metal or cloud providers.
+
+## Background
+
+An increasing number of workloads, such as machine learning and seismic survey
+processing, benefit from offloading computations to graphics hardware. While not
+as tuned as traditional, dedicated high performance computing systems such as
+MPI, a Kubernetes cluster can still be a great environment for organizations
+that need a variety of additional, "classic" workloads, such as database, web
+serving, etc.
+
+GPU support is hard to provide extensively and will thus take time to tame
+completely, because
+
+- different vendors expose the hardware to users in different ways
+- some vendors require fairly tight coupling between the kernel driver
+controlling the GPU and the libraries/applications that access the hardware
+- it adds more resource types (whole GPUs, GPU cores, GPU memory)
+- it can introduce new security pitfalls
+- for systems with multiple GPUs, affinity matters, similarly to NUMA
+considerations for CPUs
+- running GPU code in containers is still a relatively novel idea
+
+## Detailed discussion
+
+Currently, this document is mostly focused on the basic use case: run GPU code
+on AWS `g2.2xlarge` EC2 machine instances using Docker. It constitutes a narrow
+enough scenario that it does not require large amounts of generic code yet. GCE
+doesn't support GPUs at all; bare metal systems throw a lot of extra variables
+into the mix.
+
+Later sections will outline future work to support a broader set of hardware,
+environments and container runtimes.
+
+### Inventory
+
+Before any scheduling can occur, we need to know what's available out there. In
+v0, we'll hardcode capacity detected by the kubelet based on a flag,
+`--experimental-nvidia-gpu`. This will result in the user-defined resource
+`alpha.kubernetes.io/nvidia-gpu` being reported in `NodeCapacity` and
+`NodeAllocatable`, as well as surfaced as a node label.
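+
+For illustration only (the helper below is hypothetical and uses plain
+maps instead of the real ResourceList/Quantity types), the kubelet-side
+wiring might amount to something like:
+
+```
+const resourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"
+
+// decorateNodeStatus adds the GPU resource to the node's capacity and
+// allocatable sets, and surfaces it as a node label, when the
+// --experimental-nvidia-gpu flag is set.
+func decorateNodeStatus(capacity, allocatable map[string]int64, labels map[string]string, gpuEnabled bool) {
+	if !gpuEnabled {
+		return
+	}
+	capacity[resourceNvidiaGPU] = 1    // v0: exactly one whole device, hardcoded
+	allocatable[resourceNvidiaGPU] = 1 // no system reservation carved out of the GPU
+	labels[resourceNvidiaGPU] = "true" // node label for simple selectors
+}
+```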
+
+### Scheduling
+
+GPUs will be visible as first-class resources. In v0, we'll only assign whole
+devices; sharing among multiple pods is left to future implementations. It's
+probable that GPUs will exacerbate the need for [a rescheduler](rescheduler.md)
+or pod priorities, especially if the nodes in a cluster are not homogeneous.
+Consider these two cases:
+
+> Only half of the machines have a GPU and they're all busy with other
+workloads. The other half of the cluster is doing very little work. A GPU
+workload comes, but it can't schedule, because the devices are sitting idle on
+nodes that are running something else and the nodes with little load lack the
+hardware.
+
+> Some or all the machines have two graphic cards each. A number of jobs get
+scheduled, requesting one device per pod. The scheduler puts them all on
+different machines, spreading the load, perhaps by design. Then a new job comes
+in, requiring two devices per pod, but it can't schedule anywhere, because all
+we can find, at most, is one unused device per node.
+
+### The runtime
+
+Once we know where to run the container, it's time to set up its environment. At
+a minimum, we'll need to map the host device(s) into the container. Because each
+manufacturer exposes different device nodes (`/dev/ati/card0`, `/dev/nvidia0`,
+but also the required `/dev/nvidiactl` and `/dev/nvidia-uvm`), some of the logic
+needs to be hardware-specific, mapping from a logical device to a list of device
+nodes necessary for software to talk to it.
+
+Support binaries and libraries are often versioned along with the kernel module,
+so there should be further hooks to project those under `/bin` and some kind of
+`/lib` before the application is started. This can be done for Docker with the
+use of a versioned [Docker
+volume](https://docs.docker.com/engine/tutorials/dockervolumes/) or
+with upcoming Kubernetes-specific hooks such as init containers and volume
+containers. In v0, images are expected to bundle everything they need.
+
+#### NVIDIA support
+
+The first implementation and testing ground will be for NVIDIA devices, by far
+the most common setup.
+
+In v0, the `--experimental-nvidia-gpu` flag will also result in the host devices
+(limited to those required to drive the first card, `nvidia0`) to be mapped into
+the container by the dockertools library.
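+
+Concretely, the v0 device mapping could look something like the following
+sketch (a simplified local struct is used here instead of the real
+engine-api types):
+
+```
+// deviceMapping is a simplified stand-in for Docker's per-device
+// host/container mapping.
+type deviceMapping struct {
+	PathOnHost        string
+	PathInContainer   string
+	CgroupPermissions string
+}
+
+// nvidiaDevices returns the device nodes needed to drive the first NVIDIA
+// card; they are appended to the container's device list whenever a pod
+// requests alpha.kubernetes.io/nvidia-gpu.
+func nvidiaDevices() []deviceMapping {
+	devs := []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}
+	out := make([]deviceMapping, 0, len(devs))
+	for _, d := range devs {
+		out = append(out, deviceMapping{PathOnHost: d, PathInContainer: d, CgroupPermissions: "mrw"})
+	}
+	return out
+}
+```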
+
+### Event flow
+
+This is what happens before and after a user schedules a GPU pod.
+
+1. Administrator installs a number of Kubernetes nodes with GPUs. The correct
+kernel modules and device nodes under `/dev/` are present.
+
+1. Administrator makes sure the latest CUDA/driver versions are installed.
+
+1. Administrator enables `--experimental-nvidia-gpu` on kubelets
+
+1. Kubelets update node status with information about the GPU device, in addition
+to cAdvisor's usual data about CPU/memory/disk
+
+1. User creates a Docker image compiling their application for CUDA, bundling
+the necessary libraries. We ignore any versioning requirements in the image
+using labels based on [NVIDIA's
+conventions](https://github.com/NVIDIA/nvidia-docker/blob/64510511e3fd0d00168eb076623854b0fcf1507d/tools/src/nvidia-docker/utils.go#L13).
+
+1. User creates a pod using the image, requiring
+`alpha.kubernetes.io/nvidia-gpu: 1`
+
+1. Scheduler picks a node for the pod
+
+1. The kubelet notices the GPU requirement and maps the three devices. In
+Docker's engine-api, this means it'll add them to the Resources.Devices list.
+
+1. Docker runs the container to completion
+
+1. The scheduler notices that the device is available again
+
+### Too complex for now: nvidia-docker
+
+For v0, we discussed at length, but decided to initially leave aside the
+[nvidia-docker plugin](https://github.com/NVIDIA/nvidia-docker). The plugin is
+an officially supported solution that would avoid a lot of new low-level code, as
+it takes care of functionality such as:
+
+- creating a Docker volume with binaries such as `nvidia-smi` and shared
+libraries
+- providing HTTP endpoints that monitoring tools can use to collect GPU metrics
+- abstracting details such as `/dev` entry names for each device, as well as
+control ones like `nvidiactl`
+
+The `nvidia-docker` wrapper also verifies that the CUDA version required by a
+given image is supported by the host drivers, through inspection of well-known
+image labels, if present. We should try to provide equivalent checks, either
+for CUDA or OpenCL.
+
+This is current sample output from `nvidia-docker-plugin`, wrapped for
+readability:
+
+ $ curl -s localhost:3476/docker/cli
+ --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
+ --volume-driver=nvidia-docker
+ --volume=nvidia_driver_352.68:/usr/local/nvidia:ro
+
+It runs as a daemon listening for HTTP requests on port 3476. The endpoint above
+returns flags that need to be added to the Docker command line in order to
+expose GPUs to the containers. There are optional URL arguments to request
+specific devices if more than one are present on the system, as well as specific
+versions of the support software. An obvious improvement is an additional
+endpoint for JSON output.
+
+The unresolved question is whether `nvidia-docker-plugin` would run standalone
+as it does today (called over HTTP, perhaps with endpoints for a new Kubernetes
+resource API) or whether the relevant code from its `nvidia` package should be
+linked directly into kubelet. A partial list of tradeoffs:
+
+| | External binary | Linked in |
+|---------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
+| Use of cgo | Confined to binary | Linked into kubelet, but with lazy binding |
+| Extensibility | Limited if we run the plugin, increased if the library is used to build a Kubernetes-tailored daemon. | Can reuse the `nvidia` library as we prefer |
+| Bloat | None | Larger kubelet, even for systems without GPUs |
+| Reliability | Need to handle the binary disappearing at any time | Fewer headaches |
+| (Un)Marshalling | Need to talk over JSON | None |
+| Administration cost | One more daemon to install, configure and monitor | No extra work required, other than perhaps configuring flags |
+| Releases | Potentially on its own schedule | Tied to Kubernetes' |
+
+## Implementation plan
+
+### V0
+
+The first two tracks can progress in parallel.
+
+#### Scheduling
+
+1. Define the new resource `alpha.kubernetes.io/nvidia-gpu` in `pkg/api/types.go`
+and co. (see the sketch after this list)
+1. Plug the resource into feasibility checks used by the kubelet, scheduler and
+schedulercache. Maybe gated behind a flag?
+1. Plug resource into resource_helpers.go
+1. Plug resource into the limitranger
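+
+A minimal sketch of what the scheduling changes amount to (simplified,
+self-contained types; the real code would use the resource types in
+`pkg/api`):
+
+```go
+package sketch
+
+// Simplified stand-ins for the Kubernetes resource types.
+type ResourceName string
+type ResourceList map[ResourceName]int64
+
+const ResourceNvidiaGPU ResourceName = "alpha.kubernetes.io/nvidia-gpu"
+
+// fitsGPU is a hypothetical feasibility check: a node fits a pod's GPU
+// request if its capacity covers the number of GPUs requested.
+func fitsGPU(podRequest, nodeCapacity ResourceList) bool {
+    return podRequest[ResourceNvidiaGPU] <= nodeCapacity[ResourceNvidiaGPU]
+}
+```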
+
+#### Runtime
+
+1. Add kubelet config parameter to enable the resource
+1. Make kubelet's `setNodeStatusMachineInfo` report the resource
+1. Add a Devices list to container.RunContainerOptions
+1. Use it from DockerManager's runContainer
+1. Do the same for rkt (stretch goal)
+1. When a pod requests a GPU, add the devices to the container options
+
+#### Other
+
+1. Add new resource to `kubectl describe` output. Optional for non-GPU users?
+1. Administrator documentation, with sample scripts
+1. User documentation
+
+## Future work
+
+Above all, we need to collect feedback from real users and use that to set
+priorities for any of the items below.
+
+### V1
+
+- Perform real detection of the installed hardware
+- Figure out a standard way to avoid bundling of shared libraries in images
+- Support fractional resources so multiple pods can share the same GPU
+- Support bare metal setups
+- Report resource usage
+
+### V2
+
+- Support multiple GPUs with resource hierarchies and affinities
+- Support versioning of resources (e.g. "CUDA v7.5+")
+- Build resource plugins into the kubelet?
+- Support other device vendors
+- Support Azure?
+- Support rkt?
+
+### V3
+
+- Support OpenCL (so images can be device-agnostic)
+
+### Undetermined
+
+It makes sense to turn the output of this project (external resource plugins,
+etc.) into a more generic abstraction at some point.
+
+
+## Security considerations
+
+There should be knobs for the cluster administrator to only allow certain users
+or roles to schedule GPU workloads. Overcommitting or sharing the same device
+across different pods is not considered safe. It should be possible to segregate
+such GPU-sharing pods by user, namespace or a combination thereof.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/gpu-support.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/ha_master.md b/contributors/design-proposals/ha_master.md
new file mode 100644
index 00000000..d4cf26a9
--- /dev/null
+++ b/contributors/design-proposals/ha_master.md
@@ -0,0 +1,236 @@
+# Automated HA master deployment
+
+**Author:** filipg@, jsz@
+
+# Introduction
+
+We want to allow users to easily replicate kubernetes masters to get a highly available cluster,
+initially using `kube-up.sh` and `kube-down.sh`.
+
+This document describes the technical design of this feature. It assumes that we are using the
+aforementioned scripts for cluster deployment. All of the ideas described in the following sections
+should be easy to implement on GCE, AWS and other cloud providers.
+
+It is a non-goal to design a specific setup for a bare-metal environment, which
+might be very different.
+
+# Overview
+
+In a cluster with a replicated master, we will have N VMs, each running the regular master components
+such as apiserver, etcd, scheduler or controller manager. These components will interact in the
+following way:
+* All etcd replicas will be clustered together and will use master election
+ and quorum mechanisms to agree on the state. All of these mechanisms are integral
+ parts of etcd and we will only have to configure them properly.
+* All apiserver replicas will work independently, talking to the etcd on
+ 127.0.0.1 (i.e. the local etcd replica), which if needed will forward requests to the current etcd master
+ (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
+* We will introduce provider-specific solutions to load balance traffic between master replicas
+ (see section `load balancing`).
+* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
+ only a single instance will be the active master. All others will wait in standby mode.
+* All add-on managers will work independently and each of them will try to keep add-ons in sync
+
+# Detailed design
+
+## Components
+
+### etcd
+
+```
+Note: This design for etcd clustering is quite pet-set like - each etcd
+replica has its name which is explicitly used in etcd configuration etc. In
+medium-term future we would like to have the ability to run masters as part of
+autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas
+automatically. This is pretty tricky and this design does not cover this.
+It will be covered in a separate doc.
+```
+
+All etcd instances will be clustered together and one of them will be the elected master.
+In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
+configured in such a way that all writes and reads go through the master (requests
+will be forwarded by the local etcd server such that this is invisible to the user). This will
+affect latency for all operations, but latency should not increase by much more than the network
+latency between master replicas (latency between GCE zones within a region is < 10ms).
+
+Currently etcd exposes its port only on the localhost interface. In order to allow clustering
+and inter-VM communication we will also have to use the public interface. To secure this
+communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).
+
+When generating the command line for etcd we will always assume it is part of a cluster
+(initially of size 1) and list all existing kubernetes master replicas.
+Based on that, we will set the following flags:
+* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
+* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
+ * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
+ * `existing` if there is already at least one replica, i.e. the list of existing master replicas is non-empty
+
+This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
+with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
+`INITIAL_ETCD_CLUSTER`.
+
+### apiservers
+
+All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
+the etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
+etcd leader. This functionality is completely hidden from the client (apiserver
+in our case).
+
+Caching mechanism, which is implemented in apiserver, will not be affected by
+replicating master because:
+* GET requests go directly to etcd
+* LIST requests go either directly to etcd or to cache populated via watch
+ (depending on the ResourceVersion in ListOptions). In the second scenario,
+ after a PUT/POST request, changes might not be visible in LIST response.
+ This is however not worse than it is with the current single master.
+* WATCH does not give any guarantees on when a change will be delivered.
+
+#### load balancing
+
+With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
+providers have different capabilities and limitations, we will not try to find a common lowest
+denominator that will work everywhere. Instead we will document various options and apply different
+solutions to different deployments. Below we list possible approaches:
+
+1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
+automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
+or Google Cloud DNS (GCP). For load balancing we will have two options:
+ 1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
+ 1.2. use the round-robin DNS technique to access all apiservers directly
+2. `Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
+will be manually managed by the user. We will provide detailed documentation for the entries we
+expect.
+3. [GCP only] `Promote master IP` - on GCP, when we create the first master replica, we generate a static
+external IP address that is later assigned to the master VM. When creating additional replicas we
+will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
+instead of a single master. When removing the second-to-last replica we will reverse this operation (assign
+the IP address to the remaining master VM and delete the load balancer). That way the user will not have to provide
+a domain name and all client configurations will keep working.
+
+This will also impact `kubelet <-> master` communication, which should go through the load-balanced
+endpoint as well. Depending on the chosen method we will configure the kubelet appropriately.
+
+#### `kubernetes` service
+
+Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
+list of IP addresses for all apiservers. As it relies on the command line flag
+`--apiserver-count`, it is not very dynamic: changing the number of master
+replicas would require restarting all masters.
+
+To allow dynamic changes to the number of apiservers in the cluster, we will
+introduce a `ConfigMap` in `kube-system` namespace, that will keep an expiration
+time for each apiserver (keyed by IP). Each apiserver will do three things:
+
+1. periodically update the expiration time for its own IP address
+2. remove all stale IP addresses from the endpoints list
+3. add its own IP address if it is not on the list yet
+
+That way we will not only solve the problem of a dynamically changing number
+of apiservers in the cluster, but also the problem of non-responsive apiservers
+that should be removed from the `kubernetes` service endpoints list.
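+
+A minimal sketch of this reconciliation loop, assuming the lease data is kept
+as a simple IP-to-expiration map (the real implementation would read and write
+the `ConfigMap` through the API and would need to handle conflicts):
+
+```go
+package sketch
+
+import "time"
+
+// endpointLeases is an illustrative stand-in for the proposed ConfigMap
+// contents: apiserver IP -> expiration time.
+type endpointLeases map[string]time.Time
+
+// reconcile refreshes the lease for this apiserver's own IP, drops stale
+// entries, and returns the resulting endpoints list for the `kubernetes`
+// service. The function shape and TTL handling are assumptions.
+func reconcile(leases endpointLeases, selfIP string, ttl time.Duration, now time.Time) []string {
+    leases[selfIP] = now.Add(ttl)
+
+    endpoints := []string{}
+    for ip, expires := range leases {
+        if expires.Before(now) {
+            delete(leases, ip)
+            continue
+        }
+        endpoints = append(endpoints, ip)
+    }
+    return endpoints
+}
+```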
+
+#### Certificates
+
+Certificate generation will work as today. In particular, on GCE, we will
+generate it for the public IP used to access the cluster (see `load balancing`
+section) and local IP of the master replica VM.
+
+That means that with multiple master replicas and a load balancer in front
+of them, accessing one of the replicas directly (using its ephemeral public
+IP) will not work on GCE without appropriate flags:
+
+- `kubectl --insecure-skip-tls-verify=true`
+- `curl --insecure`
+- `wget --no-check-certificate`
+
+For other deployment tools and providers the details of certificate generation
+may be different, but it must be possible to access the cluster by using either
+the main cluster endpoint (DNS name or IP address) or internal service called
+`kubernetes` that points directly to the apiservers.
+
+### controller manager, scheduler & cluster autoscaler
+
+Controller manager and scheduler will by default use a lease mechanism to choose an active instance
+among all masters. Only one instance will be performing any operations.
+All others will wait in standby mode.
+
+We will use the same configuration in non-replicated mode to simplify deployment scripts.
+
+### add-on manager
+
+All add-on managers will work independently. Each of them will observe the current state of
+add-ons and will try to sync it with the files on disk. As a result, due to races, a single add-on
+can be updated multiple times in a row after upgrading the master. Long-term we should fix this
+by using a similar mechanism as the controller manager or scheduler. However, currently the add-on
+manager is just a bash script and adding a master election mechanism would not be easy.
+
+## Adding replica
+
+Command to add a new replica on GCE using the kube-up script:
+
+```
+KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
+```
+
+The pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:
+
+```
+1. If there is no load balancer for this cluster:
+ 1. Create load balancer using ephemeral IP address
+ 2. Add existing apiserver to the load balancer
+ 3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
+ 4. Update DNS to point to the load balancer.
+2. Clone existing master (create a new VM with the same configuration) including
+ all env variables (certificates, IP ranges etc), with the exception of
+ `INITIAL_ETCD_CLUSTER`.
+3. SSH to an existing master and run the following command to extend etcd cluster
+ with the new instance:
+ `curl <existing_master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new_master>:2380"]}'`
+4. Add IP address of the new apiserver to the load balancer.
+```
+
+A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
+is identical to the DNS-based one, with a different step to set up the load balancer:
+
+```
+1. If there is no load balancer for this cluster:
+ 1. Unassign IP from the existing master replica
+ 2. Create load balancer using static IP reclaimed in the previous step
+ 3. Add existing apiserver to the load balancer
+ 4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
+...
+```
+
+## Deleting replica
+
+Command to delete one replica on GCE using the kube-down script:
+
+```
+KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
+```
+
+The pseudo-code for deleting an existing master replica is the following:
+
+```
+1. Remove replica IP address from the load balancer or DNS configuration
+2. SSH to one of the remaining masters and run the following command to remove replica from the cluster:
+ `curl etcd-0:4001/v2/members/<id> -XDELETE -L`
+3. Delete replica VM
+4. If load balancer has only a single target instance, then delete load balancer
+5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM.
+```
+
+## Upgrades
+
+Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
+(e.g. upgrade.sh for GCE). This will work out of the box because:
+* Requests from nodes will be correctly served by either the new or the old master because the apiserver is backward compatible.
+* Requests from the scheduler (and controllers) go to the local apiserver via the localhost interface, so both components
+will be in the same version.
+* The apiserver talks only to the local etcd replica, which will be in a compatible version.
+* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/ha_master.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/high-availability.md b/contributors/design-proposals/high-availability.md
new file mode 100644
index 00000000..da2f4fc9
--- /dev/null
+++ b/contributors/design-proposals/high-availability.md
@@ -0,0 +1,8 @@
+# High Availability of Scheduling and Controller Components in Kubernetes
+
+This document is deprecated. For more details about running a highly available
+cluster master, please see the [admin instructions document](../../docs/admin/high-availability.md).
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/high-availability.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/horizontal-pod-autoscaler.md b/contributors/design-proposals/horizontal-pod-autoscaler.md
new file mode 100644
index 00000000..1ac9c24b
--- /dev/null
+++ b/contributors/design-proposals/horizontal-pod-autoscaler.md
@@ -0,0 +1,263 @@
+<h2>Warning! This document might be outdated.</h2>
+
+# Horizontal Pod Autoscaling
+
+## Preface
+
+This document briefly describes the design of the horizontal autoscaler for
+pods. The autoscaler (implemented as a Kubernetes API resource and controller)
+is responsible for dynamically controlling the number of replicas of some
+collection (e.g. the pods of a ReplicationController) to meet some objective(s),
+for example a target per-pod CPU utilization.
+
+This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).
+
+## Overview
+
+The resource usage of a serving application usually varies over time: sometimes
+the demand for the application rises, and sometimes it drops. In Kubernetes
+version 1.0, a user can only manually set the number of serving pods. Our aim is
+to provide a mechanism for the automatic adjustment of the number of pods based
+on CPU utilization statistics (a future version will allow autoscaling based on
+other resources/metrics).
+
+## Scale Subresource
+
+In Kubernetes version 1.1, we are introducing the Scale subresource and implementing
+horizontal autoscaling of pods based on it. The Scale subresource is supported for
+replication controllers and deployments. The Scale subresource is a Virtual Resource
+(it does not correspond to an object stored in etcd). It is only present in the API
+as an interface that a controller (in this case the HorizontalPodAutoscaler) can
+use to dynamically scale the number of replicas controlled by some other API
+object (currently ReplicationController and Deployment) and to learn the current
+number of replicas. Scale is a subresource of the API object that it serves as
+the interface for. The Scale subresource is useful because whenever we introduce
+another type we want to autoscale, we just need to implement the Scale
+subresource for it. The wider discussion regarding Scale took place in issue
+[#1629](https://github.com/kubernetes/kubernetes/issues/1629).
+
+The Scale subresource is in the API for a replication controller or deployment under the
+following paths:
+
+`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`
+
+`apis/extensions/v1beta1/deployments/mydeployment/scale`
+
+It has the following structure:
+
+```go
+// represents a scaling request for a resource.
+type Scale struct {
+ unversioned.TypeMeta
+ api.ObjectMeta
+
+ // defines the behavior of the scale.
+ Spec ScaleSpec
+
+ // current status of the scale.
+ Status ScaleStatus
+}
+
+// describes the attributes of a scale subresource
+type ScaleSpec struct {
+ // desired number of instances for the scaled object.
+ Replicas int `json:"replicas,omitempty"`
+}
+
+// represents the current status of a scale subresource.
+type ScaleStatus struct {
+ // actual number of observed instances of the scaled object.
+ Replicas int `json:"replicas"`
+
+ // label query over pods that should match the replicas count.
+ Selector map[string]string `json:"selector,omitempty"`
+}
+```
+
+Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment
+associated with the given Scale subresource. `ScaleStatus.Replicas` reports how
+many pods are currently running in the replication controller/deployment, and
+`ScaleStatus.Selector` returns the selector for the pods.
+
+## HorizontalPodAutoscaler Object
+
+In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It
+is accessible under:
+
+`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`
+
+It has the following structure:
+
+```go
+// configuration of a horizontal pod autoscaler.
+type HorizontalPodAutoscaler struct {
+ unversioned.TypeMeta
+ api.ObjectMeta
+
+ // behavior of autoscaler.
+ Spec HorizontalPodAutoscalerSpec
+
+ // current information about the autoscaler.
+ Status HorizontalPodAutoscalerStatus
+}
+
+// specification of a horizontal pod autoscaler.
+type HorizontalPodAutoscalerSpec struct {
+ // reference to Scale subresource; horizontal pod autoscaler will learn the current resource
+ // consumption from its status,and will set the desired number of pods by modifying its spec.
+ ScaleRef SubresourceReference
+ // lower limit for the number of pods that can be set by the autoscaler, default 1.
+ MinReplicas *int
+ // upper limit for the number of pods that can be set by the autoscaler.
+ // It cannot be smaller than MinReplicas.
+ MaxReplicas int
+ // target average CPU utilization (represented as a percentage of requested CPU) over all the pods;
+ // if not specified it defaults to the target CPU utilization at 80% of the requested resources.
+ CPUUtilization *CPUTargetUtilization
+}
+
+type CPUTargetUtilization struct {
+ // fraction of the requested CPU that should be utilized/used,
+ // e.g. 70 means that 70% of the requested CPU should be in use.
+ TargetPercentage int
+}
+
+// current status of a horizontal pod autoscaler
+type HorizontalPodAutoscalerStatus struct {
+ // most recent generation observed by this autoscaler.
+ ObservedGeneration *int64
+
+ // last time the HorizontalPodAutoscaler scaled the number of pods;
+ // used by the autoscaler to control how often the number of pods is changed.
+ LastScaleTime *unversioned.Time
+
+ // current number of replicas of pods managed by this autoscaler.
+ CurrentReplicas int
+
+ // desired number of replicas of pods managed by this autoscaler.
+ DesiredReplicas int
+
+ // current average CPU utilization over all pods, represented as a percentage of requested CPU,
+ // e.g. 70 means that an average pod is using now 70% of its requested CPU.
+ CurrentCPUUtilizationPercentage *int
+}
+```
+
+`ScaleRef` is a reference to the Scale subresource.
+`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler
+configuration. We are also introducing HorizontalPodAutoscalerList object to
+enable listing all autoscalers in a namespace:
+
+```go
+// list of horizontal pod autoscaler objects.
+type HorizontalPodAutoscalerList struct {
+ unversioned.TypeMeta
+ unversioned.ListMeta
+
+ // list of horizontal pod autoscaler objects.
+ Items []HorizontalPodAutoscaler
+}
+```
+
+## Autoscaling Algorithm
+
+The autoscaler is implemented as a control loop. It periodically queries pods
+described by `Status.Selector` of the Scale subresource, and collects their CPU
+utilization. Then, it compares the arithmetic mean of the pods' CPU utilization
+with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of
+the Scale if needed to match the target (preserving condition: MinReplicas <=
+Replicas <= MaxReplicas).
+
+The period of the autoscaler is controlled by the
+`--horizontal-pod-autoscaler-sync-period` flag of controller manager. The
+default value is 30 seconds.
+
+
+CPU utilization is the recent CPU usage of a pod (average across the last 1
+minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU
+usage is taken directly from Heapster. In the future, there will be an API on the master
+for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).
+
+The target number of pods is calculated from the following formula:
+
+```
+TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
+```
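+
+For example (hypothetical numbers), with three pods whose CPU utilizations are
+90%, 80% and 70% of their requests, and a target of 60%:
+
+```
+TargetNumOfPods = ceil((90 + 80 + 70) / 60) = ceil(4.0) = 4
+```
+
+so the autoscaler would scale the collection from 3 to 4 replicas.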
+
+Starting and stopping pods may introduce noise to the metric (for instance,
+starting may temporarily increase CPU). So, after each action, the autoscaler
+should wait some time for reliable data. Scale-up can only happen if there was
+no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from
+the last rescaling. Moreover, any scaling will only be made if
+`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1
+(10% tolerance; see the sketch after the list below). Such an approach has two benefits:
+
+* Autoscaler works in a conservative way. If new user load appears, it is
+important for us to rapidly increase the number of pods, so that user requests
+will not be rejected. Lowering the number of pods is not that urgent.
+
+* Autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting
+decisions if the load is not stable.
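+
+The tolerance check can be sketched as follows (illustrative only; the name
+and exact comparison are assumptions, not the controller's actual code):
+
+```go
+package sketch
+
+// shouldRescale implements the 10% tolerance band described above: scaling is
+// considered only when the ratio of the current average utilization to the
+// target leaves the [0.9, 1.1] interval.
+func shouldRescale(currentAvgUtilization, target float64) bool {
+    ratio := currentAvgUtilization / target
+    return ratio < 0.9 || ratio > 1.1
+}
+```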
+
+## Relative vs. absolute metrics
+
+We chose the values of the target metric to be relative (e.g. 90% of requested CPU
+resource) rather than absolute (e.g. 0.6 core) for the following reason. If we
+chose an absolute metric, the user would need to guarantee that the target is lower
+than the request. Otherwise, overloaded pods may not be able to consume more
+than the autoscaler's absolute target utilization, thereby preventing the
+autoscaler from seeing high enough utilization to trigger it to scale up. This
+may be especially troublesome when the user changes the requested resources for a pod,
+because they would also need to change the autoscaler utilization threshold.
+Therefore, we decided to choose a relative metric. For the user, it is enough to set
+it to a value smaller than 100%, and further changes of requested resources will
+not invalidate it.
+
+## Support in kubectl
+
+To make manipulation of the HorizontalPodAutoscaler object simpler, we added support
+for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In
+addition, in the future, we are planning to add kubectl support for the following
+use-cases:
+* When creating a replication controller or deployment with
+`kubectl create [-f]`, there should be a possibility to specify an additional
+autoscaler object. (This should work out-of-the-box when creation of autoscaler
+is supported by kubectl as we may include multiple objects in the same config
+file).
+* *[future]* When running an image with `kubectl run`, there should be an
+additional option to create an autoscaler for it.
+* *[future]* We will add a new command `kubectl autoscale` that will allow for
+easy creation of an autoscaler object for already existing replication
+controller/deployment.
+
+## Next steps
+
+We list here some features that are not supported in Kubernetes version 1.1.
+However, we want to keep them in mind, as they will most probably be needed in
+the future.
+Our design is in general compatible with them.
+* *[future]* **Autoscale pods based on metrics different than CPU** (e.g.
+memory, network traffic, qps). This includes scaling based on a custom/application metric.
+* *[future]* **Autoscale pods based on an aggregate metric.** Autoscaler,
+instead of computing average for a target metric across pods, will use a single,
+external, metric (e.g. qps metric from load balancer). The metric will be
+aggregated while the target will remain per-pod (e.g. when observing 100 qps on
+load balancer while the target is 20 qps per pod, autoscaler will set the number
+of replicas to 5).
+* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers
+of pods for different metrics are different, choose the largest target number of
+pods.
+* *[future]* **Scale the number of pods starting from 0.** All pods can be
+turned off, and then turned on when there is a demand for them. When a request
+to service with no pods arrives, kube-proxy will generate an event for
+autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
+* *[future]* **When scaling down, make a more educated decision about which pods to
+kill.** E.g.: if two or more pods from the same replication controller are on
+the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301).
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/identifiers.md b/contributors/design-proposals/identifiers.md
new file mode 100644
index 00000000..a37411f9
--- /dev/null
+++ b/contributors/design-proposals/identifiers.md
@@ -0,0 +1,113 @@
+# Identifiers and Names in Kubernetes
+
+A summarization of the goals and recommendations for identifiers in Kubernetes.
+Described in GitHub issue [#199](http://issue.k8s.io/199).
+
+
+## Definitions
+
+`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time
+and space; intended to distinguish between historical occurrences of similar
+entities.
+
+`Name`: A non-empty string guaranteed to be unique within a given scope at a
+particular time; used in resource URLs; provided by clients at creation time and
+encouraged to be human friendly; intended to facilitate creation idempotence and
+space-uniqueness of singleton objects, distinguish distinct entities, and
+reference particular entities across operations.
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL):
+An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters,
+with the '-' character allowed anywhere except the first or last character,
+suitable for use as a hostname or segment in a domain name.
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN):
+One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum
+length of 253 characters.
+
+[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID):
+A 128 bit generated value that is extremely unlikely to collide across time and
+space and requires no central coordination.
+
+[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME):
+An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters,
+with the '-' character allowed anywhere except the first or the last character
+or adjacent to another '-' character; it must contain at least one letter
+(a-z).
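+
+As an illustration only (a sketch, not necessarily the exact expressions used
+by the validation code), DNS_LABEL and DNS_SUBDOMAIN checks could look like:
+
+```go
+package sketch
+
+import "regexp"
+
+var (
+    dnsLabel     = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)
+    dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
+)
+
+// isDNSLabel reports whether s is a plausible DNS_LABEL (max 63 characters).
+func isDNSLabel(s string) bool { return len(s) <= 63 && dnsLabel.MatchString(s) }
+
+// isDNSSubdomain reports whether s is a plausible DNS_SUBDOMAIN (max 253 characters).
+func isDNSSubdomain(s string) bool { return len(s) <= 253 && dnsSubdomain.MatchString(s) }
+```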
+
+## Objectives for names and UIDs
+
+1. Uniquely identify (via a UID) an object across space and time.
+2. Uniquely name (via a name) an object across space.
+3. Provide human-friendly names in API operations and/or configuration files.
+4. Allow idempotent creation of API resources (#148) and enforcement of
+space-uniqueness of singleton objects.
+5. Allow DNS names to be automatically generated for some objects.
+
+
+## General design
+
+1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must
+be specified. Name must be non-empty and unique within the apiserver. This
+enables idempotent and space-unique creation operations. Parts of the system
+(e.g. replication controller) may join strings (e.g. a base name and a random
+suffix) to create a unique Name. For situations where generating a name is
+impractical, some or all objects may support a param to auto-generate a name.
+Generating random names will defeat idempotency.
+ * Examples: "guestbook.user", "backend-x4eb1"
+2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN?
+format TBD via #1114) may be specified. Depending on the API receiver,
+namespaces might be validated (e.g. apiserver might ensure that the namespace
+actually exists). If a namespace is not specified, one will be assigned by the
+API receiver. This assignment policy might vary across API receivers (e.g.
+apiserver might have a default, kubelet might generate something semi-random).
+ * Example: "api.k8s.example.com"
+3. Upon acceptance of an object via an API, the object is assigned a UID
+(a UUID). UID must be non-empty and unique across space and time.
+ * Example: "01234567-89ab-cdef-0123-456789abcdef"
+
+## Case study: Scheduling a pod
+
+Pods can be placed onto a particular node in a number of ways. This case study
+demonstrates how the above design can be applied to satisfy the objectives.
+
+### A pod scheduled by a user through the apiserver
+
+1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
+2. The apiserver validates the input.
+ 1. A default Namespace is assigned.
+ 2. The pod name must be space-unique within the Namespace.
+ 3. Each container within the pod has a name which must be space-unique within
+the pod.
+3. The pod is accepted.
+ 1. A new UID is assigned.
+4. The pod is bound to a node.
+ 1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
+5. Kubelet validates the input.
+6. Kubelet runs the pod.
+ 1. Each container is started up with enough metadata to distinguish the pod
+from whence it came.
+ 2. Each attempt to run a container is assigned a UID (a string) that is
+unique across time.
+ * This may correspond to Docker's container ID.
+
+### A pod placed by a config file on the node
+
+1. A config file is stored on the node, containing a pod with UID="",
+Namespace="", and Name="cadvisor".
+2. Kubelet validates the input.
+ 1. Since UID is not provided, kubelet generates one.
+ 2. Since Namespace is not provided, kubelet generates one.
+ 1. The generated namespace should be deterministic and cluster-unique for
+the source, such as a hash of the hostname and file path.
+ * E.g. Namespace="file-f4231812554558a718a01ca942782d81"
+3. Kubelet runs the pod.
+ 1. Each container is started up with enough metadata to distinguish the pod
+from whence it came.
+ 2. Each attempt to run a container is assigned a UID (a string) that is
+unique across time.
+ 1. This may correspond to Docker's container ID.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/image-provenance.md b/contributors/design-proposals/image-provenance.md
new file mode 100644
index 00000000..7a5580d9
--- /dev/null
+++ b/contributors/design-proposals/image-provenance.md
@@ -0,0 +1,331 @@
+
+# Overview
+
+Organizations wish to avoid running "unapproved" images.
+
+The exact nature of "approval" is beyond the scope of Kubernetes, but may include reasons like:
+
+ - only run images that are scanned to confirm they do not contain vulnerabilities
+ - only run images that use a "required" base image
+ - only run images that contain binaries which were built from peer reviewed, checked-in source
+ by a trusted compiler toolchain.
+ - only allow images signed by certain public keys.
+
+ - etc...
+
+Goals of the design include:
+* Block creation of pods that would cause "unapproved" images to run.
+* Make it easy for users or partners to build "image provenance checkers" which check whether images are "approved".
+ * We expect there will be multiple implementations.
+* Allow users to request an "override" of the policy in a convenient way (subject to the override being allowed).
+ * "overrides" are needed to allow "emergency changes", but need to not happen accidentally, since they may
+ require tedious after-the-fact justification and affect audit controls.
+
+Non-goals include:
+* Encoding image policy into Kubernetes code.
+* Implementing objects in core kubernetes which describe complete policies for what images are approved.
+ * A third-party implementation of an image policy checker could optionally use ThirdPartyResource to store its policy.
+* Kubernetes core code dealing with concepts of image layers, build processes, source repositories, etc.
+ * We expect there will be multiple PaaSes and/or de-facto programming environments, each with different takes on
+ these concepts. At any rate, Kubernetes is not ready to be opinionated on these concepts.
+* Sending more information than strictly needed to a third-party service.
+ * Information sent by Kubernetes to a third-party service constitutes an API of Kubernetes, and we want to
+ avoid making these broader than necessary, as it restricts future evolution of Kubernetes, and makes
+ Kubernetes harder to reason about. Also, excessive information limits cache-ability of decisions. Caching
+ reduces latency and allows short outages of the backend to be tolerated.
+
+
+Detailed discussion in [Ensuring only images are from approved sources are run](
+https://github.com/kubernetes/kubernetes/issues/22888).
+
+# Implementation
+
+A new admission controller will be added. That will be the only change.
+
+## Admission controller
+
+An `ImagePolicyWebhook` admission controller will be written. The admission controller examines all pod objects which are
+created or updated. It can either admit the pod, or reject it. If it is rejected, the request sees a `403 FORBIDDEN`
+
+The admission controller code will go in `plugin/pkg/admission/imagepolicy`.
+
+There will be a cache of decisions in the admission controller.
+
+If the apiserver cannot reach the webhook backend, it will log a warning and either admit or deny the pod.
+A flag will control whether it admits or denies on failure.
+The rationale for deny is that an attacker could DoS the backend or wait for it to be down, and then sneak a
+bad pod into the system. The rationale for allow here is that, if the cluster admin also does
+after-the-fact auditing of what images were run (which we think will be common), this will catch
+any bad images run during periods of backend failure. With default-allow, the availability of Kubernetes does
+not depend on the availability of the backend.
+
+# Webhook Backend
+
+The admission controller code in that directory does not contain logic to make an admit/reject decision. Instead, it extracts
+relevant fields from the Pod creation/update request and sends those fields to a Backend (which we have been loosely calling "WebHooks"
+in Kubernetes). The request the admission controller sends to the backend is called a WebHook request to distinguish it from the
+request being admission-controlled. The server that accepts the WebHook request from Kubernetes is called the "Backend"
+to distinguish it from the WebHook request itself, and from the API server.
+
+The whole system will work similarly to the [Authentication WebHook](
+https://github.com/kubernetes/kubernetes/pull/24902
+) or the [AuthorizationWebHook](
+https://github.com/kubernetes/kubernetes/pull/20347).
+
+The WebHook request can optionally authenticate itself to its backend using a token from a `kubeconfig` file.
+
+The WebHook request and response are JSON, and correspond to the following `go` structures:
+
+```go
+// Filename: pkg/apis/imagepolicy.k8s.io/register.go
+package imagepolicy
+
+// ImageReview checks if the set of images in a pod are allowed.
+type ImageReview struct {
+ unversioned.TypeMeta
+
+ // Spec holds information about the pod being evaluated
+ Spec ImageReviewSpec
+
+ // Status is filled in by the backend and indicates whether the pod should be allowed.
+ Status ImageReviewStatus
+ }
+
+// ImageReviewSpec is a description of the pod creation request.
+type ImageReviewSpec struct {
+ // Containers is a list of a subset of the information in each container of the Pod being created.
+ Containers []ImageReviewContainerSpec
+ // Annotations is a list of key-value pairs extracted from the Pod's annotations.
+ // It only includes keys which match the pattern `*.image-policy.k8s.io/*`.
+ // It is up to each webhook backend to determine how to interpret these annotations, if at all.
+ Annotations map[string]string
+ // Namespace is the namespace the pod is being created in.
+ Namespace string
+}
+
+// ImageReviewContainerSpec is a description of a container within the pod creation request.
+type ImageReviewContainerSpec struct {
+ Image string
+ // In future, we may add command line overrides, exec health check command lines, and so on.
+}
+
+// ImageReviewStatus is the result of the image review request.
+type ImageReviewStatus struct {
+ // Allowed indicates that all images were allowed to be run.
+ Allowed bool
+ // Reason should be empty unless Allowed is false in which case it
+ // may contain a short description of what is wrong. Kubernetes
+ // may truncate excessively long errors when displaying to the user.
+ Reason string
+}
+```
+
+## Extending with Annotations
+
+All annotations on a Pod that match `*.image-policy.k8s.io/*` are sent to the webhook.
+Sending annotations allows users who are aware of the image policy backend to send
+extra information to it, and for different backend implementations to accept
+different information.
+
+Examples of information you might put here are
+
+- request to "break glass" to override a policy, in case of emergency.
+- a ticket number from a ticket system that documents the break-glass request
+- a hint to the policy server as to the imageID of the image being provided, to save it a lookup
+
+In any case, the annotations are provided by the user and are not validated by Kubernetes in any way. In the future, if an annotation is determined to be widely
+useful, we may promote it to a named field of ImageReviewSpec.
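+
+A rough sketch of the key filtering (illustrative; the function name and exact
+matching rules are assumptions, not the admission controller's actual code):
+
+```go
+package sketch
+
+import "strings"
+
+// filterImagePolicyAnnotations keeps only annotation keys of the form
+// "<prefix>.image-policy.k8s.io/<name>", which are the ones forwarded to the
+// backend according to this proposal.
+func filterImagePolicyAnnotations(in map[string]string) map[string]string {
+    out := map[string]string{}
+    for k, v := range in {
+        parts := strings.SplitN(k, "/", 2)
+        if len(parts) == 2 && parts[1] != "" && strings.HasSuffix(parts[0], ".image-policy.k8s.io") {
+            out[k] = v
+        }
+    }
+    return out
+}
+```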
+
+In the case of a Pod update, Kubernetes may send the backend either all images in the updated pod, or only the ones that
+changed, at its discretion.
+
+## Interaction with Controllers
+
+In the case of a Deployment object, no image check is done when the Deployment object is created or updated.
+Likewise, no check happens when the Deployment controller creates a ReplicaSet. The check only happens
+when the ReplicaSet controller creates a Pod. Checking Pods is necessary since users can directly create pods,
+and since third parties can write their own controllers, which kubernetes might not be aware of and which might
+not even contain pod templates.
+
+The ReplicaSet, or other controller, is responsible for recognizing when a 403 has happened
+(whether due to user not having permission due to bad image, or some other permission reason)
+and throttling itself and surfacing the error in a way that CLIs and UIs can show to the user.
+
+Issue [22298](https://github.com/kubernetes/kubernetes/issues/22298) needs to be resolved to
+propagate Pod creation errors up through a stack of controllers.
+
+## Changes in policy over time
+
+The Backend might change the policy over time. For example, yesterday `redis:v1` was allowed, but today `redis:v1` is not allowed
+due to a CVE that just came out (fictional scenario). In this scenario:
+
+- a newly created replicaSet will be unable to create Pods.
+- updating a deployment will be safe in the sense that it will detect that the new ReplicaSet is not scaling
+ up and not scale down the old one.
+- an existing replicaSet will be unable to create Pods that replace ones which are terminated. If this is due to
+ slow loss of nodes, then there should be time to react before significant loss of capacity.
+- For non-replicated things (size 1 ReplicaSet, StatefulSet), a single node failure may disable it.
+- a node rolling update will eventually check for liveness of replacements, and would be throttled
+ in the case when the image is no longer allowed and replacements cannot be started.
+- rapid node restarts will cause existing pod objects to be restarted by kubelet.
+- slow node restarts or network partitions will cause the node controller to delete pods, and there will be no replacements.
+
+It is up to the Backend implementor, and the cluster administrator who decides to use that backend, to decide
+whether the Backend should be allowed to change its mind. There is a tradeoff between responsiveness
+to changes in policy and keeping existing services running. The two models that make sense are:
+
+- never change a policy, unless some external process has ensured no active objects depend on the to-be-forbidden
+ images.
+- change a policy and assume that transition to new image happens faster than the existing pods decay.
+
+## Ubernetes
+
+If two clusters share an image policy backend, then they will have the same policies.
+
+The clusters can pass different tokens to the backend, and the backend can use this to distinguish
+between different clusters.
+
+## Image tags and IDs
+
+Image tags are like: `myrepo/myimage:v1`.
+
+Image IDs are like: `myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed`.
+You can see image IDs with `docker images --no-trunc`.
+
+The Backend needs to be able to resolve tags to IDs (by talking to the image repo).
+If the Backend resolves tags to IDs, there is some risk that the tag-to-ID mapping will be
+modified after approval by the Backend, but before Kubelet pulls the image. We will not address this
+race condition at this time.
+
+We will wait and see how much demand there is for closing this hole. If the community demands a solution,
+we may suggest one of these:
+
+1. Use a backend that refuses to accept images that are specified with tags, and require users to resolve to IDs
+ prior to creating a pod template.
+ - [kubectl could be modified to automate this process](https://github.com/kubernetes/kubernetes/issues/1697)
+ - a CI/CD system or templating system could be used that maps IDs to tags before Deployment modification/creation.
+1. Audit logs from kubelets to see which image IDs were actually run, to see if any unapproved images slipped through.
+1. Monitor tag changes in image repository for suspicious activity, or restrict remapping of tags after initial application.
+
+If none of these works well, we could do the following:
+
+- Image Policy Admission Controller adds a new field to Pod, e.g. `pod.spec.container[i].imageID` (or an annotation),
+ and kubelet will enforce that both the imageID and image match the image pulled.
+
+Since this adds complexity and interacts with imagePullPolicy, we avoid adding the above feature initially.
+
+### Caching
+
+There will be a cache of decisions in the admission controller.
+The TTL will be user-controllable, but will default to 1 hour for allows and 30s for denies.
+The low TTL for denies allows a user to correct a setting on the backend and see the fix
+rapidly. It is assumed that denies are infrequent.
+Caching permits the RC to scale up services even during short unavailability of the webhook backend.
+The ImageReviewSpec is used as the key to the cache.
+
+In the case of a cache miss and a timeout talking to the backend, the default is to allow Pod creation.
+Keeping services running is more important than a hypothetical threat from an un-verified image.
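+
+A minimal sketch of such a cache (illustrative names and layout; a real
+implementation would also need locking, bounded size, and a way to serialize
+ImageReviewSpec into the cache key):
+
+```go
+package sketch
+
+import "time"
+
+type cachedDecision struct {
+    allowed bool
+    expires time.Time
+}
+
+type decisionCache struct {
+    entries  map[string]cachedDecision // key: serialized ImageReviewSpec
+    allowTTL time.Duration             // default 1 hour per this proposal
+    denyTTL  time.Duration             // default 30 seconds per this proposal
+}
+
+// get returns a cached decision if present and not expired.
+func (c *decisionCache) get(key string, now time.Time) (allowed, found bool) {
+    e, ok := c.entries[key]
+    if !ok || now.After(e.expires) {
+        return false, false
+    }
+    return e.allowed, true
+}
+
+// put stores a decision, using the longer TTL for allows.
+func (c *decisionCache) put(key string, allowed bool, now time.Time) {
+    ttl := c.denyTTL
+    if allowed {
+        ttl = c.allowTTL
+    }
+    c.entries[key] = cachedDecision{allowed: allowed, expires: now.Add(ttl)}
+}
+```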
+
+
+### Post-pod-creation audit
+
+There are several cases where an image not currently allowed might still run. Users wanting a
+complete audit solution are advised to also do after-the-fact auditing of what images
+ran. This can catch:
+
+- images allowed due to backend not reachable
+- images that kept running after policy change (e.g. CVE discovered)
+- images started via local files or http option of kubelet
+- checking SHA of images allowed by a tag which was remapped
+
+This proposal does not include post-pod-creation audit.
+
+## Alternatives considered
+
+### Admission Control on Controller Objects
+
+We could have done admission control on Deployments, Jobs, ReplicationControllers, and anything else that creates a Pod, directly or indirectly.
+This approach is good because it provides immediate feedback to the user that the image is not allowed. However, we do not expect disallowed images
+to be used often. And controllers need to be able to surface problems creating pods for a variety of other reasons anyway.
+
+Other good things about this alternative are:
+
+- Fewer calls to Backend, once per controller rather than once per pod creation. Caching in backend should be able to help with this, though.
+- The end user that created the object is seen, rather than the user of the controller process. This can be fixed by implementing `Impersonate-User` for controllers.
+
+Other problems are:
+
+- Works only with "core" controllers. We would need to update the admission controller if we add more "core" controllers. It won't work with "third party controllers", e.g. the way we run open-source distributed systems like hadoop, spark, zookeeper, etc. on kubernetes, because those controllers don't have config that can be "admission controlled", or if they do, the schema is not known to the admission controller, which would have to "search" for pod templates in json. Yuck.
+- It is unclear how it would work if a user created a pod directly, which is allowed and is the recommended way to run something at most once.
+
+### Sending User to Backend
+
+We could have sent the username of the pod creator to the backend. The username could be used to allow different users to run
+different categories of images. This would require propagating the username from e.g. Deployment creation, through to
+Pod creation via, e.g. the `Impersonate-User:` header. This feature is [not ready](https://github.com/kubernetes/kubernetes/issues/27152).
+ When it is, we will re-evaluate adding user as a field of `ImagePolicyRequest`.
+
+### Enforcement at Docker level
+
+Docker supports plugins which can check any container creation before it happens. For example the [twistlock/authz](https://github.com/twistlock/authz)
+Docker plugin can audit the full request sent to the Docker daemon and approve or deny it. This could include checking if the image is allowed.
+
+We reject this option because:
+- it requires all nodes to be configured with how to reach the Backend, which complicates node setup.
+- it may not work with other runtimes
+- propagating error messages back to the user is more difficult
+- it requires plumbing additional information about requests to nodes (if we later want to consider `User` in policy).
+
+### Policy Stored in API
+
+We decided to store policy about what SecurityContexts a pod can have in the API, via PodSecurityPolicy.
+This is because Pods are a Kubernetes object, and the Policy is very closely tied to the definition of Pods,
+and grows in step as the Pods API grows.
+
+For Image policy, the connection is not as strong. To the Kubernetes API, an Image is just a string, and the API
+does not know any of the image metadata, which lives outside the API.
+
+Image policy may depend on the Dockerfile, the source code, the source repo, the source review tools,
+vulnerability databases, and so on. Kubernetes does not have these as built-in concepts or have plans to add
+them anytime soon.
+
+### Registry whitelist/blacklist
+
+We considered a whitelist/blacklist of registries and/or repositories. Basically, a prefix match on image strings.
+ The problem of approving images would then be pushed to a problem of controlling who has access to push to a
+trusted registry/repository. That approach is simple for kubernetes. Problems with it are:
+
+- tricky to allow users to share a repository but have different image policies per user or per namespace.
+- tricky to do things after image push, such as scan image for vulnerabilities (such as Docker Nautilus), and have those results considered by policy
+- tricky to block "older" versions from running, whose interaction with current system may not be well understood.
+- how to allow emergency override?
+- hard to change policy decision over time.
+
+We still want to use rkt trust, docker content trust, etc for any registries used. We just need additional
+image policy checks beyond what trust can provide.
+
+### Send every Request to a Generic Admission Control Backend
+
+Instead of just sending a subset of PodSpec to an Image Provenance backend, we could have sent every object
+that is created or updated (or deleted?) to one or more Generic Admission Control Backends.
+
+This might be a good idea, but it needs quite a bit more thought. A generic webhook
+would need a lot more discussion; some questions with that approach are:
+
+- a generic webhook needs to touch all objects, not just pods. So it won't have a fixed schema. How to express this in our IDL? Harder to write clients
+ that interpret unstructured data rather than a fixed schema. Harder to version, and to detect errors.
+- a generic webhook client needs to ignore kinds it does not care about, or the apiserver needs to know which backends care about which kinds. How
+ to specify which backends see which requests? Sending all requests, including high-rate requests like events and pod-status updates, might be
+ too high a rate for some backends.
+
+Additionally, sending all the fields of just the Pod kind also has problems:
+- it exposes our whole API to a webhook backend without giving us (the project) any chance to review or understand how it is being used.
+- because we do not know which fields of an object are inspected by the backend, caching of decisions is not effective. Sending fewer fields allows caching.
+- sending fewer fields makes it possible to rev the version of the webhook request slower than the version of our internal objects (e.g. pod v2 could still use imageReview v1), and
+probably lots more reasons.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/image-provenance.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/images/.gitignore b/contributors/design-proposals/images/.gitignore
new file mode 100644
index 00000000..e69de29b
--- /dev/null
+++ b/contributors/design-proposals/images/.gitignore
diff --git a/contributors/design-proposals/indexed-job.md b/contributors/design-proposals/indexed-job.md
new file mode 100644
index 00000000..5a089c22
--- /dev/null
+++ b/contributors/design-proposals/indexed-job.md
@@ -0,0 +1,900 @@
+# Design: Indexed Feature of Job object
+
+
+## Summary
+
+This design extends kubernetes with user-friendly support for
+running embarrassingly parallel jobs.
+
+Here, *parallel* means on multiple nodes, which means multiple pods.
+By *embarrassingly parallel*, it is meant that the pods
+have no dependencies between each other. In particular, neither
+ordering between pods nor gang scheduling are supported.
+
+Users already have two other options for running embarrassingly parallel
+Jobs (described in the next section), but both have ease-of-use issues.
+
+Therefore, this document proposes extending the Job resource type to support
+a third way to run embarrassingly parallel programs, with a focus on
+ease of use.
+
+This new style of Job is called an *indexed job*, because each Pod of the Job
+is specialized to work on a particular *index* from a fixed length array of work
+items.
+
+## Background
+
+The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
+the embarrassingly parallel use case through *workqueue jobs*.
+While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
+flexible, they can be difficult to use. They: (1) typically require running a
+message queue or other database service, (2) typically require modifications
+to existing binaries and images and (3) subtle race conditions are easy to
+ overlook.
+
+Users also have another option for parallel jobs: creating [multiple Job objects
+from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of
+Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
+objects at once. But, that approach also has its drawbacks: (1) for large levels
+of parallelism (hundreds or thousands of pods) this approach means that listing
+all jobs presents too much information, (2) users want a single source of
+information about the success or failure of what the user views as a single
+logical process.
+
+Indexed job provides a third option with better ease-of-use for common
+use cases.
+
+## Requirements
+
+### User Requirements
+
+- Users want an easy way to run a Pod to completion *for each* item within a
+[work list](#example-use-cases).
+
+- Users want to run these pods in parallel for speed, but to vary the level of
+parallelism as needed, independent of the number of work items.
+
+- Users want to do this without requiring changes to existing images,
+or source-to-image pipelines.
+
+- Users want a single object that encompasses the lifetime of the parallel
+program. Deleting it should delete all dependent objects. It should report the
+status of the overall process. Users should be able to wait for it to complete,
+and can refer to it from other resource types, such as
+[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).
+
+
+### Example Use Cases
+
+Here are several examples of *work lists*: lists of command lines that the user
+wants to run, each line its own Pod. (Note that in practice, a work list may not
+ever be written out in this form, but it exists in the mind of the Job creator,
+and it is a useful way to talk about the intent of the user when discussing
+alternatives for specifying Indexed Jobs).
+
+Note that we will not have the user express their requirements in work list
+form; it is just a format for presenting use cases. Subsequent discussion will
+reference these work lists.
+
+#### Work List 1
+
+Process several files with the same program:
+
+```
+/usr/local/bin/process_file 12342.dat
+/usr/local/bin/process_file 97283.dat
+/usr/local/bin/process_file 38732.dat
+```
+
+#### Work List 2
+
+Process a matrix (or image, etc) in rectangular blocks:
+
+```
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+#### Work List 3
+
+Build a program at several different git commits:
+
+```
+HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH
+HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH
+HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH
+```
+
+#### Work List 4
+
+Render several frames of a movie:
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
+```
+
+#### Work List 5
+
+Render several blocks of frames (Render blocks to avoid Pod startup overhead for
+every frame):
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
+```
+
+## Design Discussion
+
+### Converting Work Lists into Indexed Jobs.
+
+Given a work list, like in the [example use cases](#example-use-cases),
+the information from the work list needs to get into each Pod of the Job.
+
+Users will typically not want to create a new image for each job they
+run. They will want to use existing images. So, the image is not the place
+for the work list.
+
+A work list can be stored on networked storage, and mounted by pods of the job.
+Also, as a shortcut, for small worklists, it can be included in an annotation on
+the Job object, which is then exposed as a volume in the pod via the downward
+API.
+
+### What Varies Between Pods of a Job
+
+Pods need to differ in some way to do something different. (They do not differ
+in the work-queue style of Job, but that style has ease-of-use issues).
+
+A general approach would be to allow pods to differ from each other in arbitrary
+ways. For example, the Job object could have a list of PodSpecs to run.
+However, this is so general that it provides little value. It would:
+
+- make the Job Spec very verbose, especially for jobs with thousands of work
+items
+- make Job such a vague concept that it is hard to explain to users
+- not match practice: we do not see cases where many pods differ across many
+fields of their specs and need to run as a group with no ordering constraints
+- require CLIs and UIs to support more options for creating a Job
+- complicate aggregation: monitoring and accounting databases want to aggregate
+data for pods with the same controller, but pods with very different Specs may
+not make sense to aggregate
+- mean that profiling, debugging, accounting, auditing and monitoring tools
+cannot assume common images/files, behaviors, provenance and so on between Pods
+of a Job.
+
+Also, variety has another cost. Pods which differ in ways that affect scheduling
+(node constraints, resource requirements, labels) prevent the scheduler from
+treating them as fungible, which is an important optimization for the scheduler.
+
+Therefore, we will not allow Pods from the same Job to differ arbitrarily
+(anyway, users can use multiple Job objects for that case). We will try to
+allow as little as possible to differ between pods of the same Job, while still
+allowing users to express common parallel patterns easily. For users who need to
+run jobs which differ in other ways, they can create multiple Jobs, and manage
+them as a group using labels.
+
+From the above work lists, we see a need for Pods which differ in their command
+lines, and in their environment variables. These work lists do not require the
+pods to differ in other ways.
+
+Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
+has shown this model to be applicable to a very broad range of problems, despite
+this restriction.
+
+Therefore we will allow pods in the same Job to differ **only** in the following
+aspects:
+- command line
+- environment variables
+
+### Composition of existing images
+
+The docker image that is used in a job may not be maintained by the person
+running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
+If we require people to specify the complete command line to use Indexed Job,
+then they will not automatically pick up changes in the default
+command or args.
+
+This needs more thought.
+
+### Running Ad-Hoc Jobs using kubectl
+
+A user should be able to easily start an Indexed Job using `kubectl`. For
+example to run [work list 1](#work-list-1), a user should be able to type
+something simple like:
+
+```
+kubectl run process-files --image=myfileprocessor \
+ --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+ --restart=OnFailure \
+ -- \
+ /usr/local/bin/process_file '$F'
+```
+
+In the above example:
+
+- `--restart=OnFailure` implies creating a job instead of replicationController.
+- Each pod's command line is `/usr/local/bin/process_file $F`.
+- `--per-completion-env=` implies the job's `.spec.completions` is set to the
+length of the argument array (3 in the example).
+- `--per-completion-env=F=<values>` causes an env var named `F` to be available in
+the environment when the command line is evaluated.
+
+How exactly this happens is discussed later in the doc: this is a sketch of the
+user experience.
+
+In practice, the list of files might be much longer and stored in a file on the
+user's local host, like:
+
+```
+$ cat files-to-process.txt
+12342.dat
+97283.dat
+38732.dat
+...
+```
+
+So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
+
+However, `kubectl` should also support a format like:
+ `--per-completion-env=F=@files-to-process.txt`.
+That allows `kubectl` to parse the file, point out any syntax errors, and would
+not run up against command line length limits (2MB is common, as low as 4kB is
+POSIX compliant).
+
+One case we do not try to handle is where the file of work is stored on a cloud
+filesystem, and not accessible from the user's local host. Then we cannot easily
+use indexed job, because we do not know the number of completions. The user
+needs to copy the file locally first or use the Work-Queue style of Job (already
+supported).
+
+Another case we do not try to handle is where the input file does not exist yet
+because this Job is to be run at a future time, or depends on another job. The
+workflow and scheduled job proposal need to consider this case. For that case,
+you could use an indexed job which runs a program which shards the input file
+(map-reduce-style).
+
+#### Multiple parameters
+
+The user may also have multiple parameters, like in [work list 2](#work-list-2).
+One way is to just list all the command lines already expanded, one per line, in
+a file, like this:
+
+```
+$ cat matrix-commandlines.txt
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+and run the Job like this:
+
+```
+kubectl run process-matrix --image=my/matrix \
+ --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
+ --restart=OnFailure \
+ -- \
+ 'eval "$COMMAND_LINE"'
+```
+
+However, this may have some subtleties with shell escaping. Also, it depends on
+the user knowing all the correct arguments to the docker image being used (more
+on this later).
+
+Instead, kubectl should support multiple instances of the `--per-completion-env`
+flag. For example, to implement work list 2, a user could do:
+
+```
+kubectl run process-matrix --image=my/matrix \
+ --per-completion-env=SR="0 16 0 16" \
+ --per-completion-env=ER="15 31 15 31" \
+ --per-completion-env=SC="0 0 16 16" \
+ --per-completion-env=EC="15 15 31 31" \
+ --restart=OnFailure \
+ -- \
+    /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC
+```
+
+### Composition With Workflows and ScheduledJob
+
+A user should be able to create a job (Indexed or not) which runs at a specific
+time(s). For example:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+ --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+ --restart=OnFailure \
+    --runAt=2015-07-21T14:00:00Z \
+ -- \
+ /usr/local/bin/process_file '$F'
+created "scheduledJob/process-files-37dt3"
+```
+
+Kubectl should build the same JobSpec, and then put it into a ScheduledJob
+(#11980) and create that.
+
+For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
+complete workflow from a single command line would be messy, because of the need
+to specify all the arguments multiple times.
+
+For that use case, the user could create a workflow message by hand. Or the user
+could create a job template, and then make a workflow from the templates,
+perhaps like this:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+ --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+ --restart=OnFailure \
+ --asTemplate \
+ -- \
+ /usr/local/bin/process_file '$F'
+created "jobTemplate/process-files"
+$ kubectl run merge-files --image=mymerger \
+ --restart=OnFailure \
+ --asTemplate \
+ -- \
+    /usr/local/bin/mergefiles 12342.out 97283.out 38732.out
+created "jobTemplate/merge-files"
+$ kubectl create-workflow process-and-merge \
+    --job=jobTemplate/process-files \
+    --job=jobTemplate/merge-files \
+ --dependency=process-files:merge-files
+created "workflow/process-and-merge"
+```
+
+### Completion Indexes
+
+A JobSpec specifies the number of times a pod needs to complete successfully,
+through the `job.Spec.Completions` field. The number of completions will be
+equal to the number of work items in the work list.
+
+Each pod that the job controller creates is intended to complete one work item
+from the work list. Since a pod may fail, several pods may, serially, attempt to
+complete the same index. Therefore, we call it a *completion index* (or just
+*index*), but not a *pod index*.
+
+For each completion index, in the range 0 to `.job.Spec.Completions - 1`, the
+job controller will create a pod with that index, and keep creating pods for
+that index on failure, until each index is completed.
+
+A dense integer index, rather than a sparse string index (e.g. using just
+`metadata.generate-name`) makes it easy to use the index to look up parameters
+in, for example, an array in shared storage.
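+
+For illustration, a minimal Go sketch of a worker program that consumes such an
+index (the `INDEX` env var name matches the discussion below, but the
+`/work/items.txt` path is hypothetical; the work list could live in any shared
+storage):
+
+```go
+package main
+
+import (
+	"bufio"
+	"fmt"
+	"log"
+	"os"
+	"strconv"
+)
+
+func main() {
+	// The completion index is assumed to have been exposed as $INDEX
+	// (e.g. via the downward API, as described below).
+	idx, err := strconv.Atoi(os.Getenv("INDEX"))
+	if err != nil {
+		log.Fatalf("missing or invalid INDEX: %v", err)
+	}
+
+	// The work list is assumed to be mounted from shared storage.
+	f, err := os.Open("/work/items.txt")
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer f.Close()
+
+	// Pick the line whose position matches our completion index.
+	scanner := bufio.NewScanner(f)
+	for i := 0; scanner.Scan(); i++ {
+		if i == idx {
+			fmt.Printf("processing work item %d: %s\n", idx, scanner.Text())
+			return
+		}
+	}
+	log.Fatalf("no work item for index %d", idx)
+}
+```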
+
+### Pod Identity and Template Substitution in Job Controller
+
+The JobSpec contains a single pod template. When the job controller creates a
+particular pod, it copies the pod template and modifies it in some way to make
+that pod distinctive. Whatever is distinctive about that pod is its *identity*.
+
+We consider several options.
+
+#### Index Substitution Only
+
+The job controller substitutes only the *completion index* of the pod into the
+pod template when creating it. The JSON it POSTs differs only in a single
+field.
+
+We would put the completion index as a stringified integer, into an annotation
+of the pod. The user can extract it from the annotation into an env var via the
+downward API, or put it in a file via a Downward API volume, and parse it
+themselves.
+
+Once it is an environment variable in the pod (say `$INDEX`), then one of two
+things can happen.
+
+First, the main program can know how to map from an integer index to what it
+needs to do. For example, from Work List 4 above:
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
+```
+
+Second, a shell script can be prepended to the original command line which maps
+the index to one or more string parameters. For example, to implement Work List
+5 above, you could do:
+
+```
+/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
+```
+
+In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
+and exports `$START_FRAME` and `$END_FRAME`.
+
+The shell script could be part of the image, but more usefully, it could be generated
+by a program and stuffed in an annotation or a configMap, and from there added
+to a volume.
+
+The first approach may require the user to modify an existing image (see next
+section) to be able to accept an `$INDEX` env var or argument. The second
+approach requires that the image have a shell. We think that together these two
+options cover a wide range of use cases (though not all).
+
+#### Multiple Substitution
+
+In this option, the JobSpec is extended to include a list of values to
+substitute, and which fields to substitute them into. For example, a worklist
+like this:
+
+```
+FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
+FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
+FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
+```
+
+Can be broken down into a template like this, with three parameters:
+
+```
+<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 2>
+```
+
+and a list of parameter tuples, like this:
+
+```
+("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
+("FRUIT_COLOR=yellow", "-f banana.txt", "")
+("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
+```
+
+The JobSpec can be extended to hold a list of parameter tuples (which are more
+easily expressed as a list of lists of individual parameters). For example:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+ completions: 3
+ ...
+ template:
+ ...
+ perCompletionArgs:
+ container: 0
+ -
+ - "-f apple.txt"
+ - "-f banana.txt"
+ - "-f cherry.txt"
+ -
+ - "--remove-seeds"
+ - ""
+ - "--remove-pit"
+ perCompletionEnvVars:
+ - name: "FRUIT_COLOR"
+ - "green"
+ - "yellow"
+ - "red"
+```
+
+However, just providing custom env vars, and not arguments, is sufficient for
+many use cases: parameters can be put into env vars, and then substituted on the
+command line.
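+
+For concreteness, a small sketch of how per-completion env vars could be
+expanded for a given completion index (this illustrates the Multiple
+Substitution option only; the `PerCompletionEnvVar` type is hypothetical):
+
+```go
+package main
+
+import "fmt"
+
+// PerCompletionEnvVar is a hypothetical per-completion parameter list:
+// one env var name plus one value per completion index.
+type PerCompletionEnvVar struct {
+	Name   string
+	Values []string
+}
+
+// envForIndex returns the extra environment for completion index i.
+func envForIndex(vars []PerCompletionEnvVar, i int) map[string]string {
+	env := map[string]string{}
+	for _, v := range vars {
+		if i < len(v.Values) {
+			env[v.Name] = v.Values[i]
+		}
+	}
+	return env
+}
+
+func main() {
+	vars := []PerCompletionEnvVar{
+		{Name: "FRUIT_COLOR", Values: []string{"green", "yellow", "red"}},
+	}
+	for i := 0; i < 3; i++ {
+		fmt.Println(i, envForIndex(vars, i))
+	}
+}
+```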
+
+#### Comparison
+
+The multiple substitution approach:
+
+- keeps the *per completion parameters* in the JobSpec.
+- Drawback: makes the job spec large for jobs with thousands of completions. (But
+for very large jobs, the work-queue style or another type of controller, such as
+map-reduce or spark, may be a better fit.)
+- Drawback: is a form of server-side templating, which we want in Kubernetes but
+have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
+
+The index-only approach:
+
+- Requires that the user keep the *per completion parameters* in separate
+storage, such as a configData or networked storage.
+- Makes no changes to the JobSpec.
+- Drawback: while in separate storage, they could be mutated, which would have
+unexpected effects.
+- Drawback: Logic for using index to lookup parameters needs to be in the Pod.
+- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
+pod from a job. They cannot easily say, for example `repeated failures on the
+pod processing banana.txt`.
+
+The index-only approach relies on at least one of the following being true:
+
+1. The image contains a shell and certain shell commands (not all images have
+this).
+1. The user directly consumes the index from annotations (file or env var) and
+maps it to specific behavior in the main program.
+
+Also, using the index-only approach from non-kubectl clients requires that they
+mimic the script-generation step, or only use the second style.
+
+#### Decision
+
+It is decided to implement the Index-only approach now. Once the server-side
+templating design is complete for Kubernetes, and we have feedback from users,
+we can consider whether to add Multiple Substitution.
+
+## Detailed Design
+
+#### Job Resource Schema Changes
+
+No changes are made to the JobSpec.
+
+
+The JobStatus is also not changed. The user can gauge the progress of the job by
+the `.status.succeeded` count.
+
+
+#### Job Spec Compatibility
+
+A job spec written before this change will work exactly the same as before with
+the new controller. The Pods it creates will have the same environment as
+before. They will have a new annotation, but pods are expected to tolerate
+unfamiliar annotations.
+
+However, if the job controller version is reverted, to a version before this
+change, the jobs whose pod specs depend on the new annotation will fail.
+This is okay for a Beta resource.
+
+#### Job Controller Changes
+
+The Job controller will maintain for each Job a data structure which
+indicates the status of each completion index. We call this the
+*scoreboard* for short. It is an array of length `.spec.completions`.
+Elements of the array are `enum` type with possible values including
+`complete`, `running`, and `notStarted`.
+
+The scoreboard is stored in Job Controller memory for efficiency. It can be
+reconstructed from watching the pods of the job (such as on
+a controller manager restart). The index of the pods can be extracted from the
+pod annotation.
+
+When the Job controller sees that the number of running pods is less than the
+desired parallelism of the job, it finds the first index in the scoreboard with
+value `notStarted`. It creates a pod with this completion index.
+
+When it creates a pod with completion index `i`, it makes a copy of the
+`.spec.template`, and sets
+`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
+`i`. It does this in both the index-only and multiple-substitutions options.
+
+Then it creates the pod.
+
+When the controller notices that a pod has completed or is running or failed,
+it updates the scoreboard.
+
+When all entries in the scoreboard are `complete`, then the job is complete.
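+
+As a rough illustration, a simplified in-memory sketch of this scoreboard logic
+(the real controller would work from watch events and create pods through the
+API server; the names here are illustrative):
+
+```go
+package main
+
+import "fmt"
+
+type indexState int
+
+const (
+	notStarted indexState = iota
+	running
+	complete
+)
+
+// nextIndexToStart returns the first completion index that still needs a pod,
+// or -1 if none. The scoreboard has one entry per completion index.
+func nextIndexToStart(scoreboard []indexState) int {
+	for i, s := range scoreboard {
+		if s == notStarted {
+			return i
+		}
+	}
+	return -1
+}
+
+// reconcile creates pods (here just printed) until the number of running pods
+// reaches the desired parallelism or there is nothing left to start.
+func reconcile(scoreboard []indexState, parallelism int) {
+	runningCount := 0
+	for _, s := range scoreboard {
+		if s == running {
+			runningCount++
+		}
+	}
+	for runningCount < parallelism {
+		i := nextIndexToStart(scoreboard)
+		if i < 0 {
+			return
+		}
+		fmt.Printf("create pod with completion index %d\n", i)
+		scoreboard[i] = running
+		runningCount++
+	}
+}
+
+func main() {
+	sb := make([]indexState, 5) // .spec.completions == 5
+	reconcile(sb, 2)            // .spec.parallelism == 2
+}
+```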
+
+
+#### Downward API Changes
+
+The downward API is changed to support extracting specific key names into a
+single environment variable. So, the following would be supported:
+
+```
+kind: Pod
+version: v1
+spec:
+ containers:
+ - name: foo
+ env:
+ - name: MY_INDEX
+ valueFrom:
+ fieldRef:
+ fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+This requires kubelet changes.
+
+Users who fail to upgrade their kubelets at the same time as they upgrade their
+controller manager will see pods created by the controller fail to run. The
+Kubelet will send an event about the failure to create the pod.
+The `kubectl describe job` will show many failed pods.
+
+
+#### Kubectl Interface Changes
+
+The `--completions` and `--completion-index-var-name` flags are added to
+kubectl.
+
+For example, this command:
+
+```
+kubectl run say-number --image=busybox \
+ --completions=3 \
+ --completion-index-var-name=I \
+ -- \
+ sh -c 'echo "My index is $I" && sleep 5'
+```
+
+will run 3 pods to completion, each printing one of the following lines:
+
+```
+My index is 1
+My index is 2
+My index is 0
+```
+
+Kubectl would create a pod whose environment contains the variable `I`,
+extracted from the completion-index annotation; the full Pod template that
+kubectl generates is described below under "How Kubectl Creates Job Specs".
+
+Kubectl will also support the `--per-completion-env` flag, as described
+previously. For example, this command:
+
+```
+kubectl run say-fruit --image=busybox \
+ --per-completion-env=FRUIT="apple banana cherry" \
+ --per-completion-env=COLOR="green yellow red" \
+ -- \
+ sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or equivalently:
+
+```
+echo "apple banana cherry" > fruits.txt
+echo "green yellow red" > colors.txt
+
+kubectl run say-fruit --image=busybox \
+ --per-completion-env=FRUIT="$(cat fruits.txt)" \
+  --per-completion-env=COLOR="$(cat colors.txt)" \
+ -- \
+ sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or similarly:
+
+```
+kubectl run say-fruit --image=busybox \
+ --per-completion-env=FRUIT=@fruits.txt \
+  --per-completion-env=COLOR=@colors.txt \
+ -- \
+ sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+will all run 3 pods in parallel. Index 0 pod will log:
+
+```
+Have a nice green apple
+```
+
+and so on.
+
+
+Notes:
+
+- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
+quoted space separated list or `@` and the name of a text file containing a
+list.
+- `--per-completion-env=` can be specified several times, but all lists must
+have the same length.
+- `--completions=N` with `N` equal to list length is implied.
+- The flag `--completions=3` sets `job.spec.completions=3`.
+- The flag `--completion-index-var-name=I` causes an env var named `I` to be
+created in each pod, with the index in it.
+- The flag `--restart=OnFailure` is implied by `--completions` or any
+job-specific arguments. The user can also specify `--restart=Never` if they
+desire but may not specify `--restart=Always` with job-related flags.
+- Setting any of these flags in turn tells kubectl to create a Job, not a
+replicationController.
+
+#### How Kubectl Creates Job Specs.
+
+To pass in the parameters, kubectl will generate a shell script which
+can:
+- parse the index from the annotation
+- hold all the parameter lists
+- look up the correct index in each parameter list and set an env var.
+
+For example, consider this command:
+
+```
+kubectl run say-fruit --image=busybox \
+ --per-completion-env=FRUIT="apple banana cherry" \
+ --per-completion-env=COLOR="green yellow red" \
+ -- \
+ sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+First, kubectl generates the PodSpec as it normally does for `kubectl run`.
+
+But, then it will generate this script:
+
+```sh
+#!/bin/sh
+# Generated by kubectl run ...
+# Check for needed commands
+if ! command -v cat >/dev/null 2>&1
+then
+  echo "$0: Image does not include required command: cat"
+  exit 2
+fi
+if ! command -v grep >/dev/null 2>&1
+then
+  echo "$0: Image does not include required command: grep"
+  exit 2
+fi
+# Check that annotations are mounted from downward API
+if [ ! -e /etc/annotations ]
+then
+  echo "$0: Cannot find /etc/annotations"
+  exit 2
+fi
+# Get our index from the annotations file
+I=$(cat /etc/annotations | grep 'kubernetes.io/job/completion-index' | cut -f 2 -d '"') || echo "$0: failed to extract index"
+export I
+
+# Our parameter lists are stored inline in this script.
+FRUIT_0="apple"
+FRUIT_1="banana"
+FRUIT_2="cherry"
+# Extract the right parameter value based on our index.
+# This works on any Bourne-based shell.
+FRUIT=$(eval echo \$"FRUIT_$I")
+export FRUIT
+
+COLOR_0="green"
+COLOR_1="yellow"
+COLOR_2="red"
+
+COLOR=$(eval echo \$"COLOR_$I")
+export COLOR
+```
+
+Then it POSTs this script, encoded, inside a Secret (later, a configData).
+It attaches this volume to the PodSpec.
+
+Then it will edit the command line of the Pod to source this script before the
+rest of the command line.
+
+Then it appends a Downward API volume to the pod spec to expose the annotations in a file.
+It also appends the Secret (later configData) volume with the script in it.
+
+So, the Pod template that kubectl creates (inside the job template) looks like this:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+  ...
+  template:
+    ...
+    spec:
+      containers:
+      - name: c
+        image: gcr.io/google_containers/busybox
+        command:
+        - 'sh'
+        - '-c'
+        - '. /scripts/job-params.sh; echo "this is the rest of the command"'
+        volumeMounts:
+        - name: annotations
+          mountPath: /etc
+        - name: script
+          mountPath: /scripts
+      volumes:
+      - name: annotations
+        downwardAPI:
+          items:
+          - path: "annotations"
+            fieldRef:
+              fieldPath: metadata.annotations
+      - name: script
+        secret:
+          secretName: jobparams-abc123
+```
+
+###### Alternatives
+
+Kubectl could append a `valueFrom` line like this to
+get the index into the environment:
+
+```yaml
+apiVersion: extensions/v1beta1
+kind: Job
+metadata:
+ ...
+spec:
+ ...
+ template:
+ ...
+ spec:
+ containers:
+ - name: foo
+ ...
+ env:
+ # following block added:
+ - name: I
+ valueFrom:
+ fieldRef:
+            fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+However, in order to inject other env vars from parameter list,
+kubectl still needs to edit the command line.
+
+Parameter lists could be passed via a configData volume instead of a secret.
+Kubectl can be changed to work that way once the configData implementation is
+complete.
+
+Parameter lists could be passed inside an EnvVar. This would have length
+limitations, would pollute the output of `kubectl describe pods` and `kubectl
+get pods -o json`.
+
+Parameter lists could be passed inside an annotation. This would have length
+limitations, would pollute the output of `kubectl describe pods` and `kubectl
+get pods -o json`. Also, currently annotations can only be extracted into a
+single file. Complex logic is then needed to filter out exactly the desired
+annotation data.
+
+Bash array variables could simplify extraction of a particular parameter from a
+list of parameters. However, some popular base images do not include
+`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation
+that does not support array syntax.
+
+Kubelet does support [expanding variables without a
+shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not
+allow for recursive substitution, which is required to extract the correct
+parameter from a list based on the completion index of the pod. The syntax
+could be extended, but doing so seems complex and will be an unfamiliar syntax
+for users.
+
+Putting all the command line editing into a script and running that causes
+the least pollution to the original command line, and it allows
+for complex error handling.
+
+Kubectl could store the script in an [Inline Volume](
+https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
+is approved. That would remove the need to manage the lifetime of the
+configData/secret, and prevent the case where someone changes the
+configData mid-job, and breaks things in a hard-to-debug way.
+
+
+## Interactions with other features
+
+#### Supporting Work Queue Jobs too
+
+For Work Queue Jobs, completions has no meaning. Parallelism should be allowed
+to be greater than completions, and pods have no identity. So, the job controller should
+not maintain a scoreboard for such Jobs, just a count. Therefore, we need to
+add one of the following to JobSpec:
+
+- allow unset `.spec.completions` to indicate no scoreboard, and no index for
+tasks (identical tasks).
+- allow `.spec.completions=-1` to indicate the same.
+- add `.spec.indexed` to job to indicate need for scoreboard.
+
+#### Interaction with vertical autoscaling
+
+Since pods of the same job will not be created with different resources,
+a vertical autoscaler will need to:
+
+- if it has index-specific initial resource suggestions, suggest those at
+admission time; it will need to understand indexes.
+- mutate resource requests on already created pods based on usage trend or
+previous container failures.
+- modify the job template, affecting all indexes.
+
+#### Comparison to StatefulSets (previously named PetSets)
+
+The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b.
+The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more
+restrictive and thus less verbose.
+
+It would be easier for users if Indexed Job and StatefulSet are similar where
+possible. However, StatefulSet differs in several key respects:
+
+- StatefulSet is for ones to tens of instances. Indexed job should work with tens of
+thousands of instances.
+- When you have few instances, you may want to give them names. When you have many instances,
+integer indexes make more sense.
+- When you have thousands of instances, storing the work-list in the JobSpec
+is verbose. For StatefulSet, this is less of a problem.
+- StatefulSets (apparently) need to differ in more fields than indexed Jobs.
+
+This differs from StatefulSet in that StatefulSet uses names and not indexes. StatefulSet is
+intended to support ones to tens of things.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/initial-resources.md b/contributors/design-proposals/initial-resources.md
new file mode 100644
index 00000000..f383f14a
--- /dev/null
+++ b/contributors/design-proposals/initial-resources.md
@@ -0,0 +1,75 @@
+## Abstract
+
+Initial Resources is a data-driven feature that, based on historical data, tries to estimate resource usage of a container without Resources specified
+and to set them before the container is run. This document describes the design of the component.
+
+## Motivation
+
+Since we want to make Kubernetes as simple as possible for its users, we don’t want to require the container's owner to set [Resources](../design/resource-qos.md).
+On the other hand, having Resources filled in is critical for scheduling decisions.
+The current solution of setting Resources to a hardcoded value has obvious drawbacks.
+We need to implement a component which will set initial Resources to a reasonable value.
+
+## Design
+
+InitialResources component will be implemented as an [admission plugin](../../plugin/pkg/admission/) and invoked right before
+[LimitRanger](https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/cluster/gce/config-default.sh#L91).
+For every container without Resources specified it will try to predict the amount of resources that should be sufficient for it.
+So that a pod without specified resources will be treated as
+.
+
+InitialResources will set only the [request](../design/resource-qos.md#requests-and-limits) field (independently for each resource type: cpu, memory) in the first version, to avoid killing containers due to OOM (however the container still may be killed if it exceeds requested resources).
+To make the component work with LimitRanger the estimated value will be capped by the min and max possible values if defined.
+This prevents the situation where the pod is rejected due to a too low or too high estimation.
+
+The container won’t be marked as managed by this component in any way, however an appropriate event will be exported.
+The predicting algorithm should have very low latency so as not to significantly increase e2e pod startup latency
+[#3954](https://github.com/kubernetes/kubernetes/pull/3954).
+
+### Predicting algorithm details
+
+In the first version estimation will be made based on historical data for the Docker image being run in the container (both the name and the tag matter).
+CPU/memory usage of each container is exported periodically (by default with 1 minute resolution) to the backend (see more in [Monitoring pipeline](#monitoring-pipeline)).
+
+InitialResources will set the Request for both cpu and memory as the 90th percentile of the first set of samples (considered in the following order) that is available:
+
+* 7 days, same image:tag, assuming there are at least 60 samples (1 hour)
+* 30 days, same image:tag, assuming there are at least 60 samples (1 hour)
+* 30 days, same image, assuming there is at least 1 sample
+
+If there is still no data, the default value will be set by LimitRanger. These parameters will be configurable with appropriate flags.
+
+#### Example
+
+If we have at least 60 samples of image:tag over the past 7 days, we will use the 90th percentile of all of the samples of image:tag over the past 7 days.
+Otherwise, if we have at least 60 samples of image:tag over the past 30 days, we will use the 90th percentile of all of the samples of image:tag over the past 30 days.
+Otherwise, if we have at least 1 sample of image over the past 30 days, we will use the 90th percentile of all of the samples of image over the past 30 days.
+Otherwise we will use the default value.
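+
+A minimal sketch of this fallback logic, with the query to the monitoring
+backend stubbed out (all names here are illustrative, not the actual
+implementation):
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"time"
+)
+
+// samplesFor would query the monitoring backend (InfluxDB or GCM) for usage
+// samples of the given image (optionally restricted to a tag) over a window.
+// Stubbed out here for illustration.
+func samplesFor(image, tag string, window time.Duration) []float64 {
+	return nil
+}
+
+// percentile returns the p-th percentile (0-100) of the given samples.
+func percentile(samples []float64, p float64) float64 {
+	sorted := append([]float64(nil), samples...)
+	sort.Float64s(sorted)
+	idx := int(p / 100 * float64(len(sorted)-1))
+	return sorted[idx]
+}
+
+// estimateRequest implements the fallback order described above and reports
+// whether an estimate was found; if not, LimitRanger defaults apply.
+func estimateRequest(image, tag string) (float64, bool) {
+	day := 24 * time.Hour
+	windows := []struct {
+		tag        string
+		window     time.Duration
+		minSamples int
+	}{
+		{tag, 7 * day, 60},
+		{tag, 30 * day, 60},
+		{"", 30 * day, 1},
+	}
+	for _, w := range windows {
+		s := samplesFor(image, w.tag, w.window)
+		if len(s) >= w.minSamples {
+			return percentile(s, 90), true
+		}
+	}
+	return 0, false
+}
+
+func main() {
+	if req, ok := estimateRequest("nginx", "1.9"); ok {
+		fmt.Println("estimated request:", req)
+	} else {
+		fmt.Println("no data; falling back to LimitRanger defaults")
+	}
+}
+```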
+
+### Monitoring pipeline
+
+In the first version there will be 2 backend options available for the predicting algorithm:
+
+* [InfluxDB](../../docs/user-guide/monitoring.md#influxdb-and-grafana) - aggregation will be made in a SQL query
+* [GCM](../../docs/user-guide/monitoring.md#google-cloud-monitoring) - since GCM is not as powerful as InfluxDB, some aggregation will be made on the client side
+
+Both will be hidden under an abstraction layer, so it will be easy to add another option.
+The code will be a part of the InitialResources component so as not to block development, however in the future it should be a part of Heapster.
+
+
+## Next steps
+
+The first version will be quite simple, so there are a lot of possible improvements. Some of them seem to have high priority
+and should be introduced shortly after the first version is done:
+
+* observe OOM and then react to it by increasing estimation
+* add possibility to specify if estimation should be made, possibly as ```InitialResourcesPolicy``` with options: *always*, *if-not-set*, *never*
+* add other features to the model like *namespace*
+* remember predefined values for the most popular images like *mysql*, *nginx*, *redis*, etc.
+* dry mode, which allows users to ask the system for a resource recommendation for a container without running it
+* add estimation as annotations for those containers that already have resources set
+* support for other data sources like [Hawkular](http://www.hawkular.org/)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/initial-resources.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/job.md b/contributors/design-proposals/job.md
new file mode 100644
index 00000000..160b38dd
--- /dev/null
+++ b/contributors/design-proposals/job.md
@@ -0,0 +1,159 @@
+# Job Controller
+
+## Abstract
+
+A proposal for implementing a new controller - Job controller - which will be responsible
+for managing pod(s) that require running once to completion even if the machine
+the pod is running on fails, in contrast to what ReplicationController currently offers.
+
+Several existing issues and PRs were already created regarding that particular subject:
+* Job Controller [#1624](https://github.com/kubernetes/kubernetes/issues/1624)
+* New Job resource [#7380](https://github.com/kubernetes/kubernetes/pull/7380)
+
+
+## Use Cases
+
+1. Be able to start one or several pods tracked as a single entity.
+1. Be able to run batch-oriented workloads on Kubernetes.
+1. Be able to get the job status.
+1. Be able to specify the number of instances performing a job at any one time.
+1. Be able to specify the number of successfully finished instances required to finish a job.
+
+
+## Motivation
+
+Jobs are needed for executing multi-pod computation to completion; a good example
+here would be the ability to implement any type of batch-oriented task.
+
+
+## Implementation
+
+The Job controller is similar to the replication controller in that it manages pods.
+This implies it will follow the same controller framework that replication
+controllers have already defined. The biggest difference between a `Job` and a
+`ReplicationController` object is the purpose; `ReplicationController`
+ensures that a specified number of Pods are running at any one time, whereas
+`Job` is responsible for keeping the desired number of Pods running until a
+task completes. This difference will be represented by the `RestartPolicy`, which is
+required to always take the value `RestartPolicyNever` or `RestartPolicyOnFailure`.
+
+
+The new `Job` object will have the following content:
+
+```go
+// Job represents the configuration of a single job.
+type Job struct {
+ TypeMeta
+ ObjectMeta
+
+ // Spec is a structure defining the expected behavior of a job.
+ Spec JobSpec
+
+ // Status is a structure describing current status of a job.
+ Status JobStatus
+}
+
+// JobList is a collection of jobs.
+type JobList struct {
+ TypeMeta
+ ListMeta
+
+ Items []Job
+}
+```
+
+The `JobSpec` structure is defined to contain all the information about how the actual job
+execution will look.
+
+```go
+// JobSpec describes how the job execution will look like.
+type JobSpec struct {
+
+ // Parallelism specifies the maximum desired number of pods the job should
+ // run at any given time. The actual number of pods running in steady state will
+ // be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
+ // i.e. when the work left to do is less than max parallelism.
+ Parallelism *int
+
+ // Completions specifies the desired number of successfully finished pods the
+ // job should be run with. Defaults to 1.
+ Completions *int
+
+ // Selector is a label query over pods running a job.
+ Selector map[string]string
+
+ // Template is the object that describes the pod that will be created when
+ // executing a job.
+ Template *PodTemplateSpec
+}
+```
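+
+To illustrate the arithmetic described in the `Parallelism` comment above, a
+small sketch (not controller code, just the steady-state calculation):
+
+```go
+package main
+
+import "fmt"
+
+// desiredActive returns how many pods should be running at steady state,
+// given the job's parallelism, its completions target, and the number of
+// pods that have already succeeded.
+func desiredActive(parallelism, completions, succeeded int) int {
+	remaining := completions - succeeded
+	if remaining < 0 {
+		remaining = 0
+	}
+	if parallelism < remaining {
+		return parallelism
+	}
+	return remaining
+}
+
+func main() {
+	// With completions=5, parallelism=3 and 3 successes so far,
+	// only 2 pods still need to run.
+	fmt.Println(desiredActive(3, 5, 3)) // prints 2
+}
+```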
+
+The `JobStatus` structure is defined to contain information about the pods
+currently executing the specified job.
+
+```go
+// JobStatus represents the current state of a Job.
+type JobStatus struct {
+ Conditions []JobCondition
+
+ // CreationTime represents time when the job was created
+ CreationTime unversioned.Time
+
+ // StartTime represents time when the job was started
+ StartTime unversioned.Time
+
+ // CompletionTime represents time when the job was completed
+ CompletionTime unversioned.Time
+
+ // Active is the number of actively running pods.
+ Active int
+
+ // Successful is the number of pods successfully completed their job.
+ Successful int
+
+ // Unsuccessful is the number of pods failures, this applies only to jobs
+ // created with RestartPolicyNever, otherwise this value will always be 0.
+ Unsuccessful int
+}
+
+type JobConditionType string
+
+// These are valid conditions of a job.
+const (
+ // JobComplete means the job has completed its execution.
+ JobComplete JobConditionType = "Complete"
+)
+
+// JobCondition describes current state of a job.
+type JobCondition struct {
+ Type JobConditionType
+ Status ConditionStatus
+ LastHeartbeatTime unversioned.Time
+ LastTransitionTime unversioned.Time
+ Reason string
+ Message string
+}
+```
+
+## Events
+
+Job controller will be emitting the following events:
+* JobStart
+* JobFinish
+
+## Future evolution
+
+Below are the possible future extensions to the Job controller:
+* Be able to limit the execution time for a job, similarly to ActiveDeadlineSeconds for Pods. *now implemented*
+* Be able to create a chain of jobs dependent one on another. *will be implemented in a separate type called Workflow*
+* Be able to specify the work each of the workers should execute (see type 1 from
+ [this comment](https://github.com/kubernetes/kubernetes/issues/1624#issuecomment-97622142))
+* Be able to inspect Pods running a Job, especially after a Job has finished, e.g.
+ by providing pointers to Pods in the JobStatus ([see comment](https://github.com/kubernetes/kubernetes/pull/11746/files#r37142628)).
+* help users avoid non-unique label selectors ([see this proposal](../../docs/design/selector-generation.md))
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/job.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/kubectl-login.md b/contributors/design-proposals/kubectl-login.md
new file mode 100644
index 00000000..a333e9dc
--- /dev/null
+++ b/contributors/design-proposals/kubectl-login.md
@@ -0,0 +1,220 @@
+# Kubectl Login Subcommand
+
+**Authors**: Eric Chiang (@ericchiang)
+
+## Goals
+
+`kubectl login` is an entrypoint for any user attempting to connect to an
+existing server. It should provide a more tailored experience than the existing
+`kubectl config` including config validation, auth challenges, and discovery.
+
+Short term the subcommand should recognize and attempt to help:
+
+* New users with an empty configuration trying to connect to a server.
+* Users with no credentials, by prompting for any required information.
+* Fully configured users who want to validate credentials.
+* Users trying to switch servers.
+* Users trying to reauthenticate as the same user because credentials have expired.
+* Users trying to authenticate as a different user to the same server.
+
+Long term `kubectl login` should enable authentication strategies to be
+discoverable from a master to avoid the end-user having to know how their
+sysadmin configured the Kubernetes cluster.
+
+## Design
+
+The "login" subcommand helps users move towards a fully functional kubeconfig by
+evaluating the current state of the kubeconfig and trying to prompt the user for
+and validate the necessary information to login to the kubernetes cluster.
+
+This is inspired by similar tools such as:
+
+ * [os login](https://docs.openshift.org/latest/cli_reference/get_started_cli.html#basic-setup-and-login)
+ * [gcloud auth login](https://cloud.google.com/sdk/gcloud/reference/auth/login)
+ * [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html)
+
+The steps taken are:
+
+1. If no cluster configured, prompt user for cluster information.
+2. If no user is configured, discover the authentication strategies supported by the API server.
+3. Prompt the user for some information based on the authentication strategy they choose.
+4. Attempt to login as a user, including authentication challenges such as OAuth2 flows, and display user info.
+
+Importantly, each step is skipped if the existing configuration is validated or
+can be supplied without user interaction (refreshing an OAuth token, redeeming
+a Kerberos ticket, etc.). Users with fully configured kubeconfigs will only see
+the user they're logged in as, useful for opaque credentials such as X509 certs
+or bearer tokens.
+
+The command differs from `kubectl config` by:
+
+* Communicating with the API server to determine if the user is supplying valid credentials.
+* Validating input and being opinionated about the input it asks for.
+* Triggering authentication challenges for example:
+ * Basic auth: Actually try to communicate with the API server.
+ * OpenID Connect: Create an OAuth2 redirect.
+
+However `kubectl login` should still be seen as a supplement to, not a
+replacement for, `kubectl config` by helping validate any kubeconfig generated
+by the latter command.
+
+## Credential validation
+
+When clusters utilize authorization plugins access decisions are based on the
+correct configuration of an auth-N plugin, an auth-Z plugin, and client side
+credentials. Being rejected then raises several questions. Is the user's
+kubeconfig misconfigured? Is the authorization plugin setup wrong? Is the user
+authenticating as a different user than the one they assume?
+
+To help `kubectl login` diagnose misconfigured credentials, responses from the
+API server to authenticated requests SHOULD include the `Authentication-Info`
+header as defined in [RFC 7615](https://tools.ietf.org/html/rfc7615). The value
+will hold name value pairs for `username` and `uid`. Since usernames and IDs
+can be arbitrary strings, these values will be escaped using the `quoted-string`
+format noted in the RFC.
+
+```
+HTTP/1.1 200 OK
+Authentication-Info: username="janedoe@example.com", uid="123456"
+```
+
+If the user successfully authenticates this header will be set, regardless of
+auth-Z decisions. For example a 401 Unauthorized (user didn't provide valid
+credentials) would lack this header, while a 403 Forbidden response would
+contain it.
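+
+A rough sketch of how `kubectl login` could parse this header (the
+quoted-string handling is simplified for illustration):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// parseAuthenticationInfo extracts name="value" pairs from an
+// Authentication-Info header value, e.g.
+//   username="janedoe@example.com", uid="123456"
+// Escaped quotes inside values are handled only minimally here.
+func parseAuthenticationInfo(header string) map[string]string {
+	out := map[string]string{}
+	for _, part := range strings.Split(header, ",") {
+		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
+		if len(kv) != 2 {
+			continue
+		}
+		val := strings.Trim(kv[1], `"`)
+		val = strings.ReplaceAll(val, `\"`, `"`)
+		out[kv[0]] = val
+	}
+	return out
+}
+
+func main() {
+	h := `username="janedoe@example.com", uid="123456"`
+	info := parseAuthenticationInfo(h)
+	fmt.Println(info["username"], info["uid"])
+}
+```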
+
+## Authentication discovery
+
+A long term goal of `kubectl login` is to facilitate a customized experience
+for clusters configured with different auth providers. This will require some
+way for the API server to indicate to `kubectl` how a user is expected to
+login.
+
+Currently, this document doesn't propose a specific implementation for
+discovery. While it'd be preferable to utilize an existing standard (such as the
+`WWW-Authenticate` HTTP header), discovery may require a solution custom to the
+API server, such as an additional discovery endpoint with a custom type.
+
+## Use in non-interactive session
+
+For the initial implementation, if `kubectl login` requires prompting and is
+called from a non-interactive session (determined by if the session is using a
+TTY) it errors out, recommending using `kubectl config` instead. In future
+updates `kubectl login` may include options for non-interactive sessions so
+auth strategies which require custom behavior not built into `kubectl config`,
+such as the exchanges in Kerberos or OpenID Connect, can be triggered from
+scripts.
+
+## Examples
+
+If kubeconfig isn't configured, `kubectl login` will attempt to fully configure
+and validate the client's credentials.
+
+```
+$ kubectl login
+Cluster URL []: https://172.17.4.99:443
+Cluster CA [(defaults to host certs)]: ${PWD}/ssl/ca.pem
+Cluster Name ["cluster-1"]:
+
+The kubernetes server supports the following methods:
+
+ 1. Bearer token
+ 2. Username and password
+ 3. Keystone
+ 4. OpenID Connect
+ 5. TLS client certificate
+
+Enter login method [1]: 4
+
+Logging in using OpenID Connect.
+
+Issuer ["valuefromdiscovery"]: https://accounts.google.com
+Issuer CA [(defaults to host certs)]:
+Scopes ["profile email"]:
+Client ID []: client@localhost:foobar
+Client Secret []: *****
+
+Open the following address in a browser.
+
+ https://accounts.google.com/o/oauth2/v2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scopes=openid%20email&access_type=offline&...
+
+Enter security code: ****
+
+Logged in as "janedoe@gmail.com"
+```
+
+Human readable names are provided by a combination of the auth providers
+understood by `kubectl login` and the authenticator discovery. For instance,
+Keystone uses basic auth credentials in the same way as a static user file, but
+if the discovery indicates that the Keystone plugin is being used it should be
+presented to the user differently.
+
+Users with configured credentials will simply auth against the API server and see
+who they are. Running this command again simply validates the user's credentials.
+
+```
+$ kubectl login
+Logged in as "janedoe@gmail.com"
+```
+
+Users who are halfway through the flow will start where they left off. For
+instance, if a user has configured the cluster field but not a user field, they will
+be prompted for credentials.
+
+```
+$ kubectl login
+No auth type configured. The kubernetes server supports the following methods:
+
+ 1. Bearer token
+ 2. Username and password
+ 3. Keystone
+ 4. OpenID Connect
+ 5. TLS client certificate
+
+Enter login method [1]: 2
+
+Logging in with basic auth. Enter the following fields.
+
+Username: janedoe
+Password: ****
+
+Logged in as "janedoe@gmail.com"
+```
+
+Users who wish to switch servers can provide the `--switch-cluster` flag which
+will prompt the user for new cluster details and switch the current context. It
+behaves identically to `kubectl login` when a cluster is not set.
+
+```
+$ kubectl login --switch-cluster
+# ...
+```
+
+Switching users goes through a similar flow attempting to prompt the user for
+new credentials to the same server.
+
+```
+$ kubectl login --switch-user
+# ...
+```
+
+## Work to do
+
+Phase 1:
+
+* Provide a simple dialog for configuring authentication.
+* Kubectl can trigger authentication actions such as triggering OAuth2 redirects.
+* Validation of user credentials through the `Authentication-Info` header.
+
+Phase 2:
+
+* Update proposal with auth provider discovery mechanism.
+* Customize dialog using discovery data.
+
+Further improvements will require adding more authentication providers, and
+adapting existing plugins to take advantage of challenge based authentication.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubectl-login.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/kubelet-auth.md b/contributors/design-proposals/kubelet-auth.md
new file mode 100644
index 00000000..c4d35dd9
--- /dev/null
+++ b/contributors/design-proposals/kubelet-auth.md
@@ -0,0 +1,106 @@
+# Kubelet Authentication / Authorization
+
+Author: Jordan Liggitt (jliggitt@redhat.com)
+
+## Overview
+
+The kubelet exposes endpoints which give access to data of varying sensitivity,
+and allow performing operations of varying power on the node and within containers.
+There is no built-in way to limit or subdivide access to those endpoints,
+so deployers must secure the kubelet API using external, ad-hoc methods.
+
+This document proposes a method for authenticating and authorizing access
+to the kubelet API, using interfaces and methods that complement the existing
+authentication and authorization used by the API server.
+
+## Preliminaries
+
+This proposal assumes the existence of:
+
+* a functioning API server
+* the SubjectAccessReview and TokenReview APIs
+
+It also assumes each node is additionally provisioned with the following information:
+
+1. Location of the API server
+2. Any CA certificates necessary to trust the API server's TLS certificate
+3. Client credentials authorized to make SubjectAccessReview and TokenReview API calls
+
+## API Changes
+
+None
+
+## Kubelet Authentication
+
+Enable starting the kubelet with one or more of the following authentication methods:
+
+* x509 client certificate
+* bearer token
+* anonymous (current default)
+
+For backwards compatibility, the default is to enable anonymous authentication.
+
+### x509 client certificate
+
+Add a new `--client-ca-file=[file]` option to the kubelet.
+When started with this option, the kubelet authenticates incoming requests using x509
+client certificates, validated against the root certificates in the provided bundle.
+The kubelet will reuse the x509 authenticator already used by the API server.
+
+The master API server can already be started with `--kubelet-client-certificate` and
+`--kubelet-client-key` options in order to make authenticated requests to the kubelet.
+
+### Bearer token
+
+Add a new `--authentication-token-webhook=[true|false]` option to the kubelet.
+When true, the kubelet authenticates incoming requests with bearer tokens by making
+`TokenReview` API calls to the API server.
+
+The kubelet will reuse the webhook authenticator already used by the API server, configured
+to call the API server using the connection information already provided to the kubelet.
+
+To improve performance of repeated requests with the same bearer token, the
+`--authentication-token-webhook-cache-ttl` option already supported by the API server
+will also be supported by the kubelet.
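+
+For illustration, a sketch of the TokenReview object the kubelet would send (a
+minimal local mirror of the type is used here instead of the real client
+libraries):
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Minimal local mirror of the TokenReview object, abbreviated for illustration.
+type tokenReview struct {
+	APIVersion string `json:"apiVersion"`
+	Kind       string `json:"kind"`
+	Spec       struct {
+		Token string `json:"token"`
+	} `json:"spec"`
+}
+
+func main() {
+	// The kubelet would POST an object like this to the API server
+	// (using its own client credentials) and then inspect
+	// status.authenticated and status.user in the response.
+	tr := tokenReview{APIVersion: "authentication.k8s.io/v1beta1", Kind: "TokenReview"}
+	tr.Spec.Token = "<bearer token presented to the kubelet>"
+	out, _ := json.MarshalIndent(tr, "", "  ")
+	fmt.Println(string(out))
+}
+```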
+
+### Anonymous
+
+Add a new `--anonymous-auth=[true|false]` option to the kubelet.
+When true, requests to the secure port that are not rejected by other configured
+authentication methods are treated as anonymous requests, and given a username
+of `system:anonymous` and a group of `system:unauthenticated`.
+
+## Kubelet Authorization
+
+Add a new `--authorization-mode` option to the kubelet, specifying one of the following modes:
+* `Webhook`
+* `AlwaysAllow` (current default)
+
+For backwards compatibility, the authorization mode defaults to `AlwaysAllow`.
+
+### Webhook
+
+Webhook mode converts the request to authorization attributes, and makes a `SubjectAccessReview`
+API call to check if the authenticated subject is allowed to make a request with those attributes.
+This enables authorization policy to be centrally managed by the authorizer configured for the API server.
+
+The kubelet will reuse the webhook authorizer already used by the API server, configured
+to call the API server using the connection information already provided to the kubelet.
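+
+Similarly, a sketch of a SubjectAccessReview the kubelet could construct for an
+incoming request (the mapping from kubelet endpoints to authorization
+attributes is not specified by this proposal; the attribute values below are
+purely illustrative):
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// Minimal local mirror of the SubjectAccessReview object, abbreviated for
+// illustration.
+type subjectAccessReview struct {
+	APIVersion string `json:"apiVersion"`
+	Kind       string `json:"kind"`
+	Spec       struct {
+		User                  string `json:"user"`
+		NonResourceAttributes struct {
+			Verb string `json:"verb"`
+			Path string `json:"path"`
+		} `json:"nonResourceAttributes"`
+	} `json:"spec"`
+}
+
+func main() {
+	// For an authenticated GET to a kubelet endpoint, the kubelet would POST
+	// an object like this to the API server and allow the request only if
+	// status.allowed is true in the response.
+	sar := subjectAccessReview{APIVersion: "authorization.k8s.io/v1beta1", Kind: "SubjectAccessReview"}
+	sar.Spec.User = "janedoe@example.com"
+	sar.Spec.NonResourceAttributes.Verb = "get"
+	sar.Spec.NonResourceAttributes.Path = "/stats/summary"
+	out, _ := json.MarshalIndent(sar, "", "  ")
+	fmt.Println(string(out))
+}
+```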
+
+To improve performance of repeated requests with the same authenticated subject and request attributes,
+the same webhook authorizer caching options supported by the API server would be supported:
+
+* `--authorization-webhook-cache-authorized-ttl`
+* `--authorization-webhook-cache-unauthorized-ttl`
+
+### AlwaysAllow
+
+This mode allows any authenticated request.
+
+## Future Work
+
+* Add support for CRL revocation for x509 client certificate authentication (http://issue.k8s.io/18982)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-auth.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/kubelet-cri-logging.md b/contributors/design-proposals/kubelet-cri-logging.md
new file mode 100644
index 00000000..8cc6fac1
--- /dev/null
+++ b/contributors/design-proposals/kubelet-cri-logging.md
@@ -0,0 +1,269 @@
+# CRI: Log management for container stdout/stderr streams
+
+
+## Goals and non-goals
+
+Container Runtime Interface (CRI) is an ongoing project to allow container
+runtimes to integrate with kubernetes via a newly-defined API. The goal of this
+proposal is to define how container's *stdout/stderr* log streams should be
+handled in CRI.
+
+The explicit non-goal is to define how (non-stdout/stderr) application logs
+should be handled. Collecting and managing arbitrary application logs is a
+long-standing issue [1] in kubernetes and is worth a proposal of its own. Even
+though this proposal does not touch upon these logs, the direction of
+this proposal is aligned with one of the most-discussed solutions, logging
+volumes [1], for general logging management.
+
+*In this proposal, “logs” refer to the stdout/stderr streams of the
+containers, unless specified otherwise.*
+
+Previous CRI logging issues:
+ - Tracking issue: https://github.com/kubernetes/kubernetes/issues/30709
+ - Proposal (by @tmrtfs): https://github.com/kubernetes/kubernetes/pull/33111
+
+The scope of this proposal is narrower than the #33111 proposal, and hopefully
+this will encourage a more focused discussion.
+
+
+## Background
+
+Below is a brief overview of logging in kubernetes with docker, which is the
+only container runtime with fully functional integration today.
+
+**Log lifecycle and management**
+
+Docker supports various logging drivers (e.g., syslog, journal, and json-file),
+and allows users to configure the driver by passing flags to the docker daemon
+at startup. Kubernetes defaults to the "json-file" logging driver, in which
+docker writes the stdout/stderr streams to a file in the json format as shown
+below.
+
+```
+{“log”: “The actual log line”, “stream”: “stderr”, “time”: “2016-10-05T00:00:30.082640485Z”}
+```
+
+Docker deletes the log files when the container is removed, and a cron-job (or
+systemd timer-based job) on the node is responsible for rotating the logs (using
+`logrotate`). To preserve the logs for introspection and debuggability, kubelet
+keeps the terminated container until the pod object has been deleted from the
+apiserver.
+
+**Container log retrieval**
+
+The kubernetes CLI tool, kubectl, allows users to access the container logs
+using [`kubectl logs`]
+(http://kubernetes.io/docs/user-guide/kubectl/kubectl_logs/) command.
+`kubectl logs` supports flags such as `--since` that require understanding of
+the format and the metadata (i.e., timestamps) of the logs. In the current
+implementation, kubelet calls `docker logs` with parameters to return the log
+content. As of now, docker only supports `log` operations for the “journal” and
+“json-file” drivers [2]. In other words, *the support of `kubectl logs` is not
+universal in all kubernetes deployments*.
+
+**Cluster logging support**
+
+In a production cluster, logs are usually collected, aggregated, and shipped to
+a remote store where advanced analysis/search/archiving functions are
+supported. In kubernetes, the default cluster add-ons include a per-node log
+collection daemon, `fluentd`. To facilitate log collection, kubelet creates
+symbolic links to all the docker container logs under `/var/log/containers`,
+with pod and container metadata embedded in the filenames.
+
+```
+/var/log/containers/<pod_name>_<pod_namespace>_<container_name>-<container_id>.log
+```
+
+The fluentd daemon watches the `/var/log/containers/` directory and extracts the
+metadata associated with each log from its path. Note that this integration
+requires kubelet to know where the container runtime stores the logs, and so will
+not be directly applicable to CRI.
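+
+As an illustration of how a collector might recover this metadata, the sketch
+below (a hypothetical helper, not part of kubelet or fluentd; the 64-character
+hex container ID is an assumption based on docker's ID format) extracts the pod
+name, namespace, container name, and container ID from such a filename:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+)
+
+// containerLogRE matches <pod_name>_<pod_namespace>_<container_name>-<container_id>.log.
+var containerLogRE = regexp.MustCompile(`^(?P<pod>[^_]+)_(?P<ns>[^_]+)_(?P<container>.+)-(?P<id>[0-9a-f]{64})\.log$`)
+
+// parseContainerLogName recovers the embedded metadata, or reports ok=false
+// if the filename does not follow the convention.
+func parseContainerLogName(name string) (pod, namespace, container, id string, ok bool) {
+	m := containerLogRE.FindStringSubmatch(name)
+	if m == nil {
+		return "", "", "", "", false
+	}
+	return m[1], m[2], m[3], m[4], true
+}
+
+func main() {
+	pod, ns, c, id, ok := parseContainerLogName(
+		"web-0_default_nginx-0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef.log")
+	fmt.Println(pod, ns, c, id, ok)
+}
+```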
+
+
+## Requirements
+
+ 1. **Provide ways for CRI-compliant runtimes to support all existing logging
+ features, i.e., `kubectl logs`.**
+
+ 2. **Allow kubelet to manage the lifecycle of the logs to pave the way for
+ better disk management in the future.** This implies that the lifecycle
+ of containers and their logs need to be decoupled.
+
+ 3. **Allow log collectors to easily integrate with Kubernetes across
+ different container runtimes while preserving efficient storage and
+ retrieval.**
+
+Requirement (1) provides opportunities for runtimes to continue supporting
+`kubectl logs --since` and related features. Note that even though such
+features are only supported today for a limited set of log drivers, this is an
+important usability tool for a fresh, basic kubernetes cluster, and should not
+be overlooked. Requirement (2) stems from the fact that disk is managed by
+kubelet as a node-level resource (not per-pod) today, hence it is difficult to
+delegate disk management to the runtime by enforcing a per-pod disk quota policy.
+In addition, container disk quota is not well supported yet, and this limitation
+may not even be apparent to users. Requirement (3) is crucial to kubernetes'
+extensibility and usability across all deployments.
+
+## Proposed solution
+
+This proposal intends to satisfy the requirements by
+
+ 1. Enforcing where the container logs should be stored on the host
+    filesystem. Both kubelet and the log collector can interact with
+    the log files directly.
+
+ 2. Asking the runtime to decorate the logs in a format that kubelet understands.
+
+**Log directories and structures**
+
+Kubelet will be configured with a root directory (e.g., `/var/log/pods` or
+`/var/lib/kubelet/logs/`) to store all container logs. Below is an example of a
+path to the log of a container in a pod.
+
+```
+/var/log/pods/<podUID>/<containerName>_<instance#>.log
+```
+
+In CRI, this is implemented by setting the pod-level log directory when
+creating the pod sandbox, and passing the relative container log path
+when creating a container.
+
+```
+PodSandboxConfig.LogDirectory: /var/log/pods/<podUID>/
+ContainerConfig.LogPath: <containerName>_<instance#>.log
+```
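+
+For illustration only, the absolute path the runtime writes to is simply the
+join of these two fields; the pod UID and container name below are hypothetical
+values:
+
+```go
+package main
+
+import (
+	"fmt"
+	"path/filepath"
+)
+
+func main() {
+	// Hypothetical values mirroring the CRI fields above.
+	logDirectory := "/var/log/pods/8f2a63c1-9d4e-4b7a-9d8e-2f1c3a4b5c6d/" // PodSandboxConfig.LogDirectory
+	logPath := "nginx_0.log"                                              // ContainerConfig.LogPath
+
+	// The absolute log path the runtime is expected to write to.
+	fmt.Println(filepath.Join(logDirectory, logPath))
+}
+```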
+
+Because kubelet determines where the logs are stored and can access them
+directly, this meets requirement (2). As for requirement (3), the log collector
+can easily extract basic pod metadata (e.g., pod UID, container name) from
+the paths, and watch the directory for any changes. In the future, we can
+extend this by maintaining a metadata file in the pod directory.
+
+**Log format**
+
+The runtime should decorate each log entry with an RFC 3339Nano timestamp
+prefix and the stream type (i.e., "stdout" or "stderr"), and end each entry with a newline.
+
+```
+2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1
+2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2
+```
+
+With this knowledge, kubelet can parse the logs and serve them for `kubectl
+logs` requests. This meets requirement (1). Note that the format is deliberately
+kept simple to provide only the information necessary to serve these requests.
+We do not intend for kubelet to host various logging plugins. It is also worth
+mentioning again that the scope of this proposal is restricted to the stdout/stderr
+streams of the container, and we impose no restriction on the logging format of
+arbitrary container logs.
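+
+As a rough sketch of how a consumer such as kubelet might parse this format
+(the type and function names below are illustrative, not the actual kubelet
+code):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+	"time"
+)
+
+// logEntry is a hypothetical in-memory form of one decorated log line.
+type logEntry struct {
+	timestamp time.Time
+	stream    string // "stdout" or "stderr"
+	content   string
+}
+
+// parseLogLine splits a line of the form "<RFC 3339Nano timestamp> <stream> <content>".
+func parseLogLine(line string) (logEntry, error) {
+	parts := strings.SplitN(strings.TrimSuffix(line, "\n"), " ", 3)
+	if len(parts) != 3 {
+		return logEntry{}, fmt.Errorf("malformed log line: %q", line)
+	}
+	ts, err := time.Parse(time.RFC3339Nano, parts[0])
+	if err != nil {
+		return logEntry{}, fmt.Errorf("bad timestamp in log line: %v", err)
+	}
+	return logEntry{timestamp: ts, stream: parts[1], content: parts[2]}, nil
+}
+
+func main() {
+	e, err := parseLogLine("2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(e.timestamp, e.stream, e.content)
+}
+```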
+
+**Who should rotate the logs?**
+
+We assume that a separate task (e.g., cron job) will be configured on the node
+to rotate the logs periodically, similar to today’s implementation.
+
+We do not rule out the possibility of letting kubelet or a per-node daemon
+(`DaemonSet`) take up this responsibility, or even of declaring the rotation
+policy in the kubernetes API as part of the `PodSpec`, but this is beyond the
+scope of this proposal.
+
+**What about non-supported log formats?**
+
+If a runtime chooses to store logs in a non-supported format, it essentially
+opts out of the `kubectl logs` features, which are backed by kubelet today. It is
+assumed that the user can instead rely on the more advanced cluster logging
+infrastructure to examine the logs.
+
+It is also possible that in the future, `kubectl logs` can contact the cluster
+logging infrastructure directly to serve logs [1a]. Note that this does not
+eliminate the need to store the logs on the node locally for reliability.
+
+
+**How can existing runtimes (docker/rkt) comply with the logging requirements?**
+
+In the short term, the ongoing docker-CRI integration [3] will support the
+proposed solution only partially by (1) creating symbolic links for kubelet
+to access, but not manage, the logs, and (2) adding support for the json format
+in kubelet. A more sophisticated solution that either uses a custom logging
+plugin or launches a separate process to copy and decorate the logs will be
+considered as a mid-term solution.
+
+For rkt, the implementation will rely on providing external file descriptors for
+stdout/stderr to applications via systemd [4]. Those streams are currently
+managed by a journald sidecar, which collects the stream outputs and stores them
+in the journal file of the pod. This will be replaced by a custom sidecar which
+can produce logs in the format expected by this specification and can also
+handle attaching clients.
+
+## Alternatives
+
+There are ad-hoc solutions/discussions that address one or two of the
+requirements, but no comprehensive solution for CRI specifically has been
+proposed so far (with the exception of @tmrtfs's proposal
+[#33111](https://github.com/kubernetes/kubernetes/pull/33111), which has a much
+wider scope). It has come up in discussions that kubelet could delegate all log
+management to the runtime to allow maximum flexibility. However, it is
+difficult for this approach to meet either requirement (1) or (2) without
+defining a complex logging API.
+
+It is also possible to implement the current proposal by imposing the log file
+paths while leveraging the runtime to access and/or manage the logs. This would
+require the runtime to expose knobs in CRI to retrieve, remove, and examine
+the disk usage of logs. The upside of this approach is that kubelet need not
+mandate the logging format, assuming the runtime already includes plugins for
+various logging formats. Unfortunately, this is not true for existing runtimes
+such as docker, which supports log retrieval only for a very limited number of
+log drivers [2]. The downside is that we would be imposing more requirements on
+the runtime: the log storage location on the host, plus a potentially premature
+logging API that may change as disk management evolves.
+
+## References
+
+[1] Log management issues:
+ - a. https://github.com/kubernetes/kubernetes/issues/17183
+ - b. https://github.com/kubernetes/kubernetes/issues/24677
+ - c. https://github.com/kubernetes/kubernetes/pull/13010
+
+[2] Docker logging drivers:
+ - https://docs.docker.com/engine/admin/logging/overview/
+
+[3] Docker CRI integration:
+ - https://github.com/kubernetes/kubernetes/issues/31459
+
+[4] rkt support: https://github.com/systemd/systemd/pull/4179
+
+
+
diff --git a/contributors/design-proposals/kubelet-eviction.md b/contributors/design-proposals/kubelet-eviction.md
new file mode 100644
index 00000000..233956b8
--- /dev/null
+++ b/contributors/design-proposals/kubelet-eviction.md
@@ -0,0 +1,462 @@
+# Kubelet - Eviction Policy
+
+**Authors**: Derek Carr (@derekwaynecarr), Vishnu Kannan (@vishh)
+
+**Status**: Proposed (memory evictions WIP)
+
+This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.
+
+## Goals
+
+The node needs a mechanism to preserve stability when available compute resources are low.
+
+This is especially important when dealing with incompressible compute resources such
+as memory or disk. If either resource is exhausted, the node would become unstable.
+
+The `kubelet` has some support for influencing system behavior in response to a system OOM by
+having the system OOM killer see higher `oom_score_adj` values for containers that have consumed
+the largest amount of memory relative to their request. System OOM events are very compute
+intensive, and can stall the node until the OOM killing process has completed. In addition,
+the system is prone to return to an unstable state, since the containers that were killed due to OOM
+are either restarted or a new pod is scheduled onto the node.
+
+Instead, we would prefer a system where the `kubelet` can pro-actively monitor for
+and prevent total starvation of a compute resource, and in cases where starvation
+appears imminent, pro-actively fail one or more pods so the workload can be
+moved and scheduled elsewhere when/if its backing controller creates a new pod.
+
+## Scope of proposal
+
+This proposal defines a pod eviction policy for reclaiming compute resources.
+
+As of now, memory and disk based evictions are supported.
+The proposal focuses on a simple default eviction strategy
+intended to cover the broadest class of user workloads.
+
+## Eviction Signals
+
+The `kubelet` will support the ability to trigger eviction decisions on the following signals.
+
+| Eviction Signal | Description |
+|------------------|---------------------------------------------------------------------------------|
+| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
+| nodefs.available | nodefs.available := node.stats.fs.available |
+| nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree |
+| imagefs.available | imagefs.available := node.stats.runtime.imagefs.available |
+| imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree |
+
+Each of the above signals supports either a literal or a percentage-based value. The percentage-based value
+is calculated relative to the total capacity associated with each signal.
+
+`kubelet` supports only two filesystem partitions.
+
+1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
+1. The `imagefs` filesystem that the container runtime uses for storing images and container writable layers.
+
+`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor.
+`kubelet` does not care about any other filesystems; no other configurations are currently supported. For example, it is *not OK* to store volumes and logs in a dedicated `imagefs`.
+
+## Eviction Thresholds
+
+The `kubelet` will support the ability to specify eviction thresholds.
+
+An eviction threshold is of the following form:
+
+`<eviction-signal><operator><quantity | int%>`
+
+* valid `eviction-signal` tokens are as defined above
+* the only valid `operator` token is `<`
+* valid `quantity` tokens must match the quantity representation used by Kubernetes
+* an eviction threshold can be expressed as a percentage if it ends with the `%` token
+
+If threshold criteria are met, the `kubelet` will take pro-active action to attempt
+to reclaim the starved compute resource associated with the eviction signal.
+
+The `kubelet` will support soft and hard eviction thresholds.
+
+For example, if a node has `10Gi` of memory, and the desire is to induce eviction
+if available memory falls below `1Gi`, an eviction threshold can be specified as either
+of the following (but not both).
+
+* `memory.available<10%`
+* `memory.available<1Gi`
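+
+As a minimal illustration of the grammar above (not the actual kubelet code;
+quantity handling is reduced to plain strings), a single threshold expression
+could be parsed as follows:
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// threshold is a simplified, hypothetical representation of one eviction threshold.
+type threshold struct {
+	signal     string // e.g. "memory.available"
+	value      string // e.g. "1Gi" or "10%"
+	percentage bool
+}
+
+// parseThreshold handles the "<eviction-signal><operator><quantity | int%>" form
+// with the only valid operator, "<".
+func parseThreshold(expr string) (threshold, error) {
+	parts := strings.SplitN(expr, "<", 2)
+	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
+		return threshold{}, fmt.Errorf("invalid eviction threshold %q", expr)
+	}
+	return threshold{
+		signal:     parts[0],
+		value:      parts[1],
+		percentage: strings.HasSuffix(parts[1], "%"),
+	}, nil
+}
+
+func main() {
+	for _, expr := range []string{"memory.available<1Gi", "memory.available<10%"} {
+		t, err := parseThreshold(expr)
+		fmt.Println(t, err)
+	}
+}
+```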
+
+### Soft Eviction Thresholds
+
+A soft eviction threshold pairs an eviction threshold with a required,
+administrator-specified grace period. No action is taken by the `kubelet`
+to reclaim resources associated with the eviction signal until that grace
+period has been exceeded. If no grace period is provided, the `kubelet` will
+error on startup.
+
+In addition, if a soft eviction threshold has been met, an operator can
+specify a maximum allowed pod termination grace period to use when evicting
+pods from the node. If specified, the `kubelet` will use the lesser of
+`pod.Spec.TerminationGracePeriodSeconds` and the maximum allowed grace period.
+If not specified, the `kubelet` will kill pods immediately with no graceful
+termination.
+
+To configure soft eviction thresholds, the following flags will be supported:
+
+```
+--eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction.
+--eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
+--eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
+```
+
+### Hard Eviction Thresholds
+
+A hard eviction threshold has no grace period, and if observed, the `kubelet`
+will take immediate action to reclaim the associated starved resource. If a
+hard eviction threshold is met, the `kubelet` will kill the pod immediately
+with no graceful termination.
+
+To configure hard eviction thresholds, the following flag will be supported:
+
+```
+--eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
+```
+
+## Eviction Monitoring Interval
+
+The `kubelet` will initially evaluate eviction thresholds at the same
+housekeeping interval as `cAdvisor` housekeeping.
+
+In Kubernetes 1.2, this defaulted to `10s`.
+
+It is a goal to shrink the monitoring interval to a much shorter window.
+This may require changes to `cAdvisor` to let alternate housekeeping intervals
+be specified for selected data (https://github.com/google/cadvisor/issues/1247)
+
+For the purposes of this proposal, we expect the monitoring interval to be no
+more than `10s` to know when a threshold has been triggered, but we will strive
+to reduce that latency time permitting.
+
+## Node Conditions
+
+The `kubelet` will support a node condition that corresponds to each eviction signal.
+
+If a hard eviction threshold has been met, or a soft eviction threshold has been met
+independent of its associated grace period, the `kubelet` will report a condition that
+reflects that the node is under pressure.
+
+The following node conditions are defined that correspond to the specified eviction signal.
+
+| Node Condition | Eviction Signal | Description |
+|----------------|------------------|------------------------------------------------------------------|
+| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold |
+| DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
+
+The `kubelet` will continue to report node status updates at the frequency specified by
+`--node-status-update-frequency` which defaults to `10s`.
+
+### Oscillation of node conditions
+
+If a node is oscillating above and below a soft eviction threshold, but not exceeding
+its associated grace period, it would cause the corresponding node condition to
+constantly oscillate between true and false, and could cause poor scheduling decisions
+as a consequence.
+
+To protect against this oscillation, the following flag is defined to control how
+long the `kubelet` must wait before transitioning out of a pressure condition.
+
+```
+--eviction-pressure-transition-period=5m0s: Duration for which the kubelet has to wait
+before transitioning out of an eviction pressure condition.
+```
+
+The `kubelet` would ensure that it has not observed an eviction threshold being met
+for the specified pressure condition for the period specified before toggling the
+condition back to `false`.
+
+## Eviction scenarios
+
+### Memory
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-hard="memory.available<100Mi"
+--eviction-soft="memory.available<300Mi"
+--eviction-soft-grace-period="memory.available=30s"
+```
+
+The `kubelet` will run a sync loop that looks at the available memory
+on the node as reported from `cAdvisor` by calculating (capacity - workingSet).
+If available memory is observed to drop below 100Mi, the `kubelet` will immediately
+initiate eviction. If available memory is observed to fall below `300Mi`,
+the `kubelet` will record when that signal was first observed in an internal
+cache. If, at the next sync, that criterion is no longer satisfied, the cache
+is cleared for that signal. If that signal is observed as being satisfied for
+longer than the specified grace period, the `kubelet` will initiate eviction to
+attempt to reclaim the resource that has met its eviction threshold.
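+
+A minimal sketch of the caching behavior described above, assuming a simple map
+from signal name to the time the threshold was first observed as met (the type
+and method names are hypothetical):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// softThresholdTracker remembers when a soft threshold was first observed as met
+// and reports whether the configured grace period has been exceeded.
+type softThresholdTracker struct {
+	gracePeriod   time.Duration
+	firstObserved map[string]time.Time // signal name -> time the threshold was first met
+}
+
+func newSoftThresholdTracker(grace time.Duration) *softThresholdTracker {
+	return &softThresholdTracker{gracePeriod: grace, firstObserved: map[string]time.Time{}}
+}
+
+// observe is called on every sync with whether the threshold is currently met.
+// It returns true when eviction should be initiated for the signal.
+func (t *softThresholdTracker) observe(signal string, met bool, now time.Time) bool {
+	if !met {
+		delete(t.firstObserved, signal) // criterion no longer satisfied; clear the cache
+		return false
+	}
+	first, ok := t.firstObserved[signal]
+	if !ok {
+		t.firstObserved[signal] = now // first observation; start the grace period clock
+		return false
+	}
+	return now.Sub(first) >= t.gracePeriod
+}
+
+func main() {
+	tracker := newSoftThresholdTracker(30 * time.Second)
+	start := time.Now()
+	fmt.Println(tracker.observe("memory.available", true, start))                     // false: first observation
+	fmt.Println(tracker.observe("memory.available", true, start.Add(40*time.Second))) // true: grace period exceeded
+}
+```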
+
+### Disk
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-hard="nodefs.available<1Gi,nodefs.inodesFree<1,imagefs.available<10Gi,imagefs.inodesFree<10"
+--eviction-soft="nodefs.available<1.5Gi,nodefs.inodesFree<10,imagefs.available<20Gi,imagefs.inodesFree<100"
+--eviction-soft-grace-period="nodefs.available=1m,imagefs.available=2m"
+```
+
+The `kubelet` will run a sync loop that looks at the available disk
+on the node's supported partitions as reported from `cAdvisor`.
+If available disk space on the node's primary filesystem is observed to drop below 1Gi
+or the free inodes on the node's primary filesystem are less than 1,
+the `kubelet` will immediately initiate eviction.
+If available disk space on the node's image filesystem is observed to drop below 10Gi
+or the free inodes on the node's image filesystem are less than 10,
+the `kubelet` will immediately initiate eviction.
+
+If available disk space on the node's primary filesystem is observed to fall below `1.5Gi`,
+or if the free inodes on the node's primary filesystem are less than 10,
+or if available disk space on the node's image filesystem is observed to fall below `20Gi`,
+or if the free inodes on the node's image filesystem are less than 100,
+the `kubelet` will record when that signal was first observed in an internal
+cache. If, at the next sync, that criterion is no longer satisfied, the cache
+is cleared for that signal. If that signal is observed as being satisfied for
+longer than the specified grace period, the `kubelet` will initiate eviction to
+attempt to reclaim the resource that has met its eviction threshold.
+
+## Eviction of Pods
+
+If an eviction threshold has been met, the `kubelet` will initiate the
+process of evicting pods until it has observed the signal has gone below
+its defined threshold.
+
+The eviction sequence works as follows:
+
+* for each monitoring interval, if eviction thresholds have been met
+ * find candidate pod
+ * fail the pod
+ * block until pod is terminated on node
+
+If a pod is not terminated because a container does not happen to die
+(e.g., processes stuck in disk IO), the `kubelet` may select
+an additional pod to fail instead. The `kubelet` will invoke the `KillPod`
+operation exposed by the runtime interface. If an error is returned,
+the `kubelet` will select a subsequent pod.
+
+## Eviction Strategy
+
+The `kubelet` will implement a default eviction strategy oriented around
+the pod quality of service class.
+
+It will target pods that are the largest consumers of the starved compute
+resource relative to their scheduling request. It ranks pods within a
+quality of service tier in the following order.
+
+* `BestEffort` pods that consume the most of the starved resource are failed
+first.
+* `Burstable` pods that consume the greatest amount of the starved resource
+relative to their request for that resource are killed first. If no pod
+has exceeded its request, the strategy targets the largest consumer of the
+starved resource.
+* `Guaranteed` pods that consume the greatest amount of the starved resource
+relative to their request are killed first. If no pod has exceeded its request,
+the strategy targets the largest consumer of the starved resource.
+
+A guaranteed pod is guaranteed to never be evicted because of another pod's
+resource consumption. That said, guarantees are only as good as the underlying
+foundation they are built upon. If a system daemon
+(e.g., `kubelet`, `docker`, `journald`) is consuming more resources than
+were reserved via `system-reserved` or `kube-reserved` allocations, and the node
+only has guaranteed pod(s) remaining, then the node must choose to evict a
+guaranteed pod in order to preserve node stability, and to limit the impact
+of the unexpected consumption to other guaranteed pod(s).
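+
+To make the ranking within a single tier concrete, here is a minimal sketch that
+orders pods by how far their usage exceeds their request; this is one reading of
+"relative to their request", not the actual kubelet implementation:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// podUsage is a hypothetical summary used only for this illustration.
+type podUsage struct {
+	name    string
+	request int64 // requested amount of the starved resource, in bytes
+	usage   int64 // observed consumption of the starved resource, in bytes
+}
+
+// rankForEviction orders pods within a single QoS tier so that the pods
+// consuming the most of the starved resource above their request come first.
+func rankForEviction(pods []podUsage) {
+	sort.Slice(pods, func(i, j int) bool {
+		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
+	})
+}
+
+func main() {
+	pods := []podUsage{
+		{name: "a", request: 100 << 20, usage: 150 << 20},
+		{name: "b", request: 200 << 20, usage: 210 << 20},
+	}
+	rankForEviction(pods)
+	fmt.Println(pods[0].name) // "a": 50Mi over its request vs 10Mi for "b"
+}
+```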
+
+## Disk based evictions
+
+### With Imagefs
+
+If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete logs
+1. Evict Pods if required.
+
+If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete unused images
+1. Evict Pods if required.
+
+### Without Imagefs
+
+If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
+
+1. Delete logs
+1. Delete unused images
+1. Evict Pods if required.
+
+Let's explore the different options for freeing up disk space.
+
+### Delete logs of dead pods/containers
+
+As of today, logs are tied to a container's lifetime, and `kubelet` keeps dead
+containers around to provide access to their logs.
+In the future, if we store logs of dead containers outside of the containers themselves, then
+`kubelet` can delete these logs to free up disk space.
+Once the lifetimes of containers and logs are split, kubelet can support more user-friendly policies
+around log eviction, such as deleting the logs of the oldest containers first.
+Since the logs from the first and the most recent incarnations of a container are the most important for most applications,
+kubelet can try to preserve these logs and aggressively delete logs from other container incarnations.
+
+Until logs are split from the container's lifetime, `kubelet` can delete dead containers to free up disk space.
+
+### Delete unused images
+
+`kubelet` performs image garbage collection based on thresholds today. It uses a high and a low watermark.
+Whenever disk usage exceeds the high watermark, it removes images until the low watermark is reached.
+`kubelet` employs an LRU policy when it comes to deleting images.
+
+The existing policy will be replaced with a much simpler policy.
+Images will be deleted based on eviction thresholds. If kubelet can delete logs and keep disk space availability
+above eviction thresholds, then kubelet will not delete any images.
+If `kubelet` decides to delete unused images, it will delete *all* unused images.
+
+### Evict pods
+
+There is no ability to specify disk limits for pods/containers today.
+Disk is a best effort resource. When necessary, `kubelet` can evict pods one at a time.
+`kubelet` will follow the [Eviction Strategy](#eviction-strategy) mentioned above for making eviction decisions.
+`kubelet` will evict the pod that will free up the maximum amount of disk space on the filesystem that has hit eviction thresholds.
+Within each QoS bucket, `kubelet` will sort pods according to their disk usage,
+as follows:
+
+#### Without Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage:
+local volumes + logs and writable layers of all their containers.
+
+#### With Imagefs
+
+If `nodefs` is triggering evictions, `kubelet` will sort pods based on their `nodefs` usage:
+local volumes + logs of all their containers.
+
+If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all their containers.
+
+## Minimum eviction reclaim
+
+In certain scenarios, eviction of pods could result in reclamation of only a small amount of a resource. This can result in
+`kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming a resource like `disk`
+is time consuming.
+
+To mitigate these issues, `kubelet` will have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
+resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource.
+
+The following flag configures `minimum-reclaim` for each evictable resource:
+
+`--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`
+
+The default `eviction-minimum-reclaim` is `0` for all resources.
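+
+One way to read this is that, once eviction starts, the `kubelet` keeps reclaiming
+until the signal is back above its threshold by at least the minimum-reclaim
+amount. A trivial sketch under that assumption, with hypothetical values:
+
+```go
+package main
+
+import "fmt"
+
+// reclaimTarget is the availability at which eviction would stop under the
+// reading above: the eviction threshold plus the configured minimum reclaim.
+func reclaimTarget(threshold, minimumReclaim int64) int64 {
+	return threshold + minimumReclaim
+}
+
+func main() {
+	const (
+		gi = int64(1) << 30
+		mi = int64(1) << 20
+	)
+	// nodefs.available<1Gi with --eviction-minimum-reclaim nodefs.available=500Mi.
+	fmt.Printf("stop evicting once nodefs.available >= %d bytes\n", reclaimTarget(1*gi, 500*mi))
+}
+```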
+
+## Deprecation of existing features
+
+`kubelet` has been freeing up disk space on demand to keep the node stable. As part of this proposal,
+some of the existing features/flags around disk space reclamation will be deprecated in favor of this proposal.
+
+| Existing Flag | New Flag | Rationale |
+| ------------- | -------- | --------- |
+| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` | existing eviction signals can capture image garbage collection |
+| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` | eviction reclaims achieve the same behavior |
+| `--maximum-dead-containers` | | deprecated once old logs are stored outside of container's context |
+| `--maximum-dead-containers-per-container` | | deprecated once old logs are stored outside of container's context |
+| `--minimum-container-ttl-duration` | | deprecated once old logs are stored outside of container's context |
+| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` | this use case is better handled by this proposal |
+| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` | make the flag generic to suit all compute resources |
+
+## Kubelet Admission Control
+
+### Feasibility checks during kubelet admission
+
+#### Memory
+
+The `kubelet` will reject `BestEffort` pods if any of the memory
+eviction thresholds have been exceeded independent of the configured
+grace period.
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-soft="memory.available<256Mi"
+--eviction-soft-grace-period="memory.available=30s"
+```
+
+If the `kubelet` sees that it has less than `256Mi` of memory available
+on the node, but the `kubelet` has not yet initiated eviction since the
+grace period criterion has not yet been met, the `kubelet` will still immediately
+fail any incoming best effort pods.
+
+The reasoning for this decision is the expectation that the incoming pod is
+likely to further starve the particular compute resource and the `kubelet` should
+return to a steady state before accepting new workloads.
+
+#### Disk
+
+The `kubelet` will reject all pods if any of the disk eviction thresholds have been met.
+
+Let's assume the operator started the `kubelet` with the following:
+
+```
+--eviction-soft="nodefs.available<1500Mi"
+--eviction-soft-grace-period="nodefs.available=30s"
+```
+
+If the `kubelet` sees that it has less than `1500Mi` of disk available
+on the node, but the `kubelet` has not yet initiated eviction since the
+grace period criterion has not yet been met, the `kubelet` will still immediately
+fail any incoming pods.
+
+The rationale for failing **all** pods instead of just best effort pods is that disk is currently
+a best effort resource for all QoS classes.
+
+Kubelet will apply the same policy even if there is a dedicated `image` filesystem.
+
+## Scheduler
+
+The node will report a condition when a compute resource is under pressure. The
+scheduler should view that condition as a signal to dissuade placing additional
+best effort pods on the node.
+
+In this case, when the `MemoryPressure` condition is true, the scheduler should avoid placing
+new best effort pods on the node, since they will be rejected by the `kubelet` in admission.
+
+On the other hand, when the `DiskPressure` condition is true, the scheduler should avoid
+placing **any** new pods on the node, since they will be rejected by the `kubelet` in admission.
+
+## Best Practices
+
+### DaemonSet
+
+It is never desired for a `kubelet` to evict a pod that was derived from
+a `DaemonSet` since the pod will immediately be recreated and rescheduled
+back to the same node.
+
+At the moment, the `kubelet` has no ability to distinguish a pod created
+from `DaemonSet` versus any other object. If/when that information is
+available, the `kubelet` could pro-actively filter those pods from the
+candidate set of pods provided to the eviction strategy.
+
+In general, it is strongly recommended that a `DaemonSet` not
+create `BestEffort` pods, to avoid its pods being identified as candidates
+for eviction. Instead, a `DaemonSet` should ideally include `Guaranteed` pods only.
+
+## Known issues
+
+### kubelet may evict more pods than needed
+
+Pod eviction may evict more pods than needed due to a stats collection timing gap. This can be mitigated in the future by adding
+the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247).
+
+### How kubelet ranks pods for eviction in response to inode exhaustion
+
+At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
+inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
+to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods
+by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
+that pod over others.
+
diff --git a/contributors/design-proposals/kubelet-hypercontainer-runtime.md b/contributors/design-proposals/kubelet-hypercontainer-runtime.md
new file mode 100644
index 00000000..c3da7d9a
--- /dev/null
+++ b/contributors/design-proposals/kubelet-hypercontainer-runtime.md
@@ -0,0 +1,45 @@
+Kubelet HyperContainer Container Runtime
+=======================================
+
+Authors: Pengfei Ni (@feiskyer), Harry Zhang (@resouer)
+
+## Abstract
+
+This proposal aims to support the [HyperContainer](http://hypercontainer.io) container
+runtime in Kubelet.
+
+## Motivation
+
+HyperContainer is a hypervisor-agnostic container engine that allows you to run Docker images using
+hypervisors (KVM, Xen, etc.). By running containers within separate VM instances, it offers
+hardware-enforced isolation, which is required in multi-tenant environments.
+
+## Goals
+
+1. Complete pod/container/image lifecycle management with HyperContainer.
+2. Set up networking via network plugins.
+3. Pass 100% of node e2e tests.
+4. Easy to deploy for both local dev/test and production clusters.
+
+## Design
+
+The HyperContainer runtime will make use of the kubelet Container Runtime Interface. [Frakti](https://github.com/kubernetes/frakti) implements the CRI and exposes
+a local endpoint to Kubelet. Frakti communicates with [hyperd](https://github.com/hyperhq/hyperd)
+via its gRPC API to manage the lifecycle of sandboxes, containers, and images.
+
+![frakti](https://cloud.githubusercontent.com/assets/676637/18940978/6e3e5384-863f-11e6-9132-b638d862fd09.png)
+
+## Limitations
+
+Since pods run directly inside a hypervisor, host networking is not supported in the HyperContainer
+runtime.
+
+## Development
+
+The HyperContainer runtime is maintained at <https://github.com/kubernetes/frakti>.
+
+
+
diff --git a/contributors/design-proposals/kubelet-rkt-runtime.md b/contributors/design-proposals/kubelet-rkt-runtime.md
new file mode 100644
index 00000000..84aac8cc
--- /dev/null
+++ b/contributors/design-proposals/kubelet-rkt-runtime.md
@@ -0,0 +1,103 @@
+Next generation rkt runtime integration
+=======================================
+
+Authors: Euan Kemp (@euank), Yifan Gu (@yifan-gu)
+
+## Abstract
+
+This proposal describes the design and road path for integrating rkt with kubelet with the new container runtime interface.
+
+## Background
+
+Currently, the Kubernetes project supports rkt as a container runtime via an implementation under [pkg/kubelet/rkt package](https://github.com/kubernetes/kubernetes/tree/v1.5.0-alpha.0/pkg/kubelet/rkt).
+
+This implementation, for historical reasons, has required implementing a large amount of logic shared with the original Docker implementation.
+
+In order to make additional container runtime integrations easier, more clearly defined, and more consistent, a new [Container Runtime Interface](https://github.com/kubernetes/kubernetes/blob/v1.5.0-alpha.0/pkg/kubelet/api/v1alpha1/runtime/api.proto) (CRI) is being designed.
+The existing runtimes, in order to both prove the correctness of the interface and reduce maintenance burden, are incentivized to move to this interface.
+
+This document proposes how the rkt runtime integration will transition to using the CRI.
+
+## Goals
+
+### Full-featured
+
+The CRI integration must work as well as the existing integration in terms of features.
+
+Until that's the case, the existing integration will continue to be maintained.
+
+### Easy to Deploy
+
+The new integration should not be any more difficult to deploy and configure than the existing integration.
+
+### Easy to Develop
+
+This iteration should be as easy to work on and iterate on as the original one.
+
+It will be available in an initial usable form quickly in order to validate the CRI.
+
+## Design
+
+In order to fulfill the above goals, the rkt CRI integration will make the following choices:
+
+### Remain in-process with Kubelet
+
+The current rkt container runtime integration is able to be deployed simply by deploying the kubelet binary.
+
+This is, in no small part, to make it *Easy to Deploy*.
+
+Remaining in-process also helps this integration not regress on performance, one axis of being *Full-Featured*.
+
+### Communicate through gRPC
+
+Although the kubelet and rktlet will be compiled together, the runtime and kubelet will still communicate through a gRPC interface for better API abstraction.
+
+For the near term, they will talk over a unix socket until we implement a custom gRPC connection that skips the network stack.
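+
+A minimal sketch of dialing gRPC over a unix socket from Go is shown below; the
+socket path is a placeholder, and the actual rktlet endpoint and client wiring
+may differ:
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+	"time"
+
+	"google.golang.org/grpc"
+)
+
+func main() {
+	// Hypothetical socket path; the real rktlet endpoint may differ.
+	const socket = "/var/run/rktlet.sock"
+
+	// Dial gRPC over a unix domain socket instead of TCP.
+	conn, err := grpc.Dial(socket,
+		grpc.WithInsecure(),
+		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
+			return net.DialTimeout("unix", addr, timeout)
+		}))
+	if err != nil {
+		fmt.Println("dial failed:", err)
+		return
+	}
+	defer conn.Close()
+	// A generated CRI client stub would be created from conn here.
+}
+```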
+
+### Developed as a Separate Repository
+
+Brian Grant's discussion on splitting the Kubernetes project into [separate repos](https://github.com/kubernetes/kubernetes/issues/24343) is a compelling argument for why it makes sense to split this work into a separate repo.
+
+In order to be *Easy to Develop*, this iteration will be maintained as a separate repository, and re-vendored back in.
+
+This choice will also allow better long-term growth in terms of better issue-management, testing pipelines, and so on.
+
+Unfortunately, in the short term, it's possible that some aspects of this will also cause pain, and it is very difficult to weigh each side correctly.
+
+### Exec the rkt binary (initially)
+
+While significant work has been done on the rkt [api-service](https://coreos.com/rkt/docs/latest/subcommands/api-service.html),
+it has also been a source of problems and additional complexity,
+and was never transitioned to entirely.
+
+In addition, the rkt cli has historically been the primary interface to the rkt runtime.
+
+The initial integration will execute the rkt binary directly for app creation/start/stop/removal, as well as image pulling/removal.
+
+The creation of the pod sandbox is also done via the rkt command line, but it will run under `systemd-run` so that it is monitored by the init process.
+
+In the future, some of these decisions are expected to be changed such that rkt is vendored as a library dependency for all operations, and other init systems will be supported as well.
+
+
+## Roadmap and Milestones
+
+1. rktlet integrates with kubelet to support the basic pod/container lifecycle (pod creation, container creation/start/stop, pod stop/removal) [[Done]](https://github.com/kubernetes-incubator/rktlet/issues/9)
+2. rktlet integrates with kubelet to support more advanced features:
+ - Support kubelet networking, host network
+ - Support mount / volumes [[#33526]](https://github.com/kubernetes/kubernetes/issues/33526)
+ - Support exposing ports
+ - Support privileged containers
+ - Support selinux options [[#33139]](https://github.com/kubernetes/kubernetes/issues/33139)
+ - Support attach [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
+ - Support exec [[#29579]](https://github.com/kubernetes/kubernetes/issues/29579)
+ - Support logging [[#33111]](https://github.com/kubernetes/kubernetes/pull/33111)
+
+3. rktlet integrates with kubelet and passes 100% of e2e and node e2e tests, with the nspawn stage1.
+4. rktlet integrates with kubelet and passes 100% of e2e and node e2e tests, with the kvm stage1.
+5. Revendor rktlet into `pkg/kubelet/rktshim`, and start deprecating the `pkg/kubelet/rkt` package.
+6. Eventually replace the current `pkg/kubelet/rkt` package.
+
+
diff --git a/contributors/design-proposals/kubelet-systemd.md b/contributors/design-proposals/kubelet-systemd.md
new file mode 100644
index 00000000..b4277cfa
--- /dev/null
+++ b/contributors/design-proposals/kubelet-systemd.md
@@ -0,0 +1,407 @@
+# Kubelet and systemd interaction
+
+**Author**: Derek Carr (@derekwaynecarr)
+
+**Status**: Proposed
+
+## Motivation
+
+Many Linux distributions have either adopted, or plan to adopt `systemd` as their init system.
+
+This document describes how the node should be configured, and a set of enhancements that should
+be made to the `kubelet` to better integrate with these distributions independent of container
+runtime.
+
+## Scope of proposal
+
+This proposal does not account for running the `kubelet` in a container.
+
+## Background on systemd
+
+To help understand this proposal, we first provide a brief summary of `systemd` behavior.
+
+### systemd units
+
+`systemd` manages a hierarchy of `slice`, `scope`, and `service` units.
+
+* `service` - an application on the server that is launched by `systemd`; the unit defines how it should start/stop,
+when it should be started, under what circumstances it should be restarted, and any resource
+controls that should be applied to it.
+* `scope` - a process or group of processes which are not launched by `systemd` itself (e.g., via fork); as with
+a service, resource controls may be applied.
+* `slice` - organizes a hierarchy in which `scope` and `service` units are placed. A `slice` may
+contain `slice`, `scope`, or `service` units; processes are attached to `service` and `scope`
+units only, not to `slice` units. The hierarchy is intended to be unified, meaning a process may
+only belong to a single leaf node.
+
+### cgroup hierarchy: split versus unified hierarchies
+
+Classical `cgroup` hierarchies were split per resource group controller, and a process could
+exist in different parts of the hierarchy.
+
+For example, a process `p1` could exist in each of the following at the same time:
+
+* `/sys/fs/cgroup/cpu/important/`
+* `/sys/fs/cgroup/memory/unimportant/`
+* `/sys/fs/cgroup/cpuacct/unimportant/`
+
+In addition, controllers for one resource group could depend on another in ways that were not
+always obvious.
+
+For example, the `cpu` controller depends on the `cpuacct` controller yet they were treated
+separately.
+
+Many found it confusing for a single process to belong to different nodes in the `cgroup` hierarchy
+across controllers.
+
+The Kernel direction for `cgroup` support is to move toward a unified `cgroup` hierarchy, where the
+per-controller hierarchies are eliminated in favor of hierarchies like the following:
+
+* `/sys/fs/cgroup/important/`
+* `/sys/fs/cgroup/unimportant/`
+
+In a unified hierarchy, a process may only belong to a single node in the `cgroup` tree.
+
+### cgroupfs single writer
+
+The Kernel direction for `cgroup` management is to promote a single-writer model rather than
+allowing multiple processes to independently write to parts of the file-system.
+
+In distributions that run `systemd` as their init system, the cgroup tree is managed by `systemd`
+by default since it implicitly interacts with the cgroup tree when starting units. Manual changes
+made by other cgroup managers to the cgroup tree are not guaranteed to be preserved unless `systemd`
+is made aware. `systemd` can be told to ignore sections of the cgroup tree by configuring the unit
+to have the `Delegate=` option.
+
+See: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=
+
+### cgroup management with systemd and container runtimes
+
+A `slice` corresponds to an inner-node in the `cgroup` file-system hierarchy.
+
+For example, the `system.slice` is represented as follows:
+
+`/sys/fs/cgroup/<controller>/system.slice`
+
+A `slice` is nested in the hierarchy by its naming convention.
+
+For example, the `system-foo.slice` is represented as follows:
+
+`/sys/fs/cgroup/<controller>/system.slice/system-foo.slice/`
+
+A `service` or `scope` corresponds to leaf nodes in the `cgroup` file-system hierarchy managed by
+`systemd`. Services and scopes can have child nodes managed outside of `systemd` if they have been
+delegated with the `Delegate=` option.
+
+For example, if the `docker.service` is associated with the `system.slice`, it is
+represented as follows:
+
+`/sys/fs/cgroup/<controller>/system.slice/docker.service/`
+
+To demonstrate the use of `scope` units using the `docker` container runtime, if a
+user launches a container via `docker run -m 100M busybox`, a `scope` will be created
+because the process was not launched by `systemd` itself. The `scope` is parented by
+the `slice` associated with the launching daemon.
+
+For example:
+
+`/sys/fs/cgroup/<controller>/system.slice/docker-<container-id>.scope`
+
+`systemd` defines a set of slices. By default, service and scope units are placed in
+`system.slice`, virtual machines and containers registered with `systemd-machined` are
+found in `machine.slice`, and user sessions handled by `systemd-logind` in `user.slice`.
+
+## Node Configuration on systemd
+
+### kubelet cgroup driver
+
+The `kubelet` reads and writes to the `cgroup` tree during bootstrapping
+of the node. In the future, it will write to the `cgroup` tree to satisfy other
+purposes around quality of service, etc.
+
+The `kubelet` must cooperate with `systemd` in order to ensure proper function of the
+system. The bootstrapping requirements for a `systemd` system are different than one
+without it.
+
+The `kubelet` will accept a new flag to control how it interacts with the `cgroup` tree.
+
+* `--cgroup-driver=` - cgroup driver used by the kubelet. `cgroupfs` or `systemd`.
+
+The `kubelet` should default `--cgroup-driver` to `systemd` on `systemd` distributions.
+
+The `kubelet` should associate node bootstrapping semantics to the configured
+`cgroup driver`.
+
+### Node allocatable
+
+The proposal makes no changes to the definition as presented here:
+https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/node-allocatable.md
+
+The node will report a set of allocatable compute resources defined as follows:
+
+`[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]`
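+
+A trivial worked example of the formula, with hypothetical values in MiB:
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	nodeCapacity := int64(8192)  // [Node Capacity]
+	kubeReserved := int64(512)   // [Kube-Reserved]
+	systemReserved := int64(512) // [System-Reserved]
+
+	allocatable := nodeCapacity - kubeReserved - systemReserved
+	fmt.Printf("Allocatable: %dMi\n", allocatable) // 7168Mi
+}
+```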
+
+### Node capacity
+
+The `kubelet` will continue to interface with `cAdvisor` to determine node capacity.
+
+### System reserved
+
+The node may set aside a set of designated resources for non-Kubernetes components.
+
+The `kubelet` accepts the followings flags that support this feature:
+
+* `--system-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that
+describe resources reserved for host daemons.
+* `--system-container=` - Optional resource-only container in which to place all
+non-kernel processes that are not already in a container. Empty for no container.
+Rolling back the flag requires a reboot. (Default: "").
+
+The current meaning of `system-container` is inadequate in `systemd` environments.
+The `kubelet` should use the flag to know the location of the processes that
+are associated with `system-reserved`, but it should not modify the cgroups of
+existing processes on the system during bootstrapping of the node. This is
+because `systemd` is the `cgroup manager` on the host and has not delegated
+authority to the `kubelet` to change how it manages `units`.
+
+The following describes the type of things that can happen if this does not change:
+https://bugzilla.redhat.com/show_bug.cgi?id=1202859
+
+As a result, the `kubelet` needs to distinguish placement of non-kernel processes
+based on the cgroup driver, and only do its current behavior when not on `systemd`.
+
+The flag should be modified as follows:
+
+* `--system-container=` - Name of resource-only container that holds all
+non-kernel processes whose resource consumption is accounted under
+system-reserved. The default value is cgroup driver specific. systemd
+defaults to system, cgroupfs defines no default. Rolling back the flag
+requires a reboot.
+
+The `kubelet` will error if the defined `--system-container` does not exist
+in `systemd` environments. It will verify that the appropriate `cpu` and `memory`
+controllers are enabled.
+
+### Kubernetes reserved
+
+The node may set aside a set of resources for Kubernetes components:
+
+* `--kube-reserved=` - A set of `ResourceName`=`ResourceQuantity` pairs that
+describe resources reserved for kubernetes components.
+
+The `kubelet` does not enforce `--kube-reserved` at this time, but the ability
+to distinguish the static reservation from observed usage is important for node accounting.
+
+This proposal asserts that `kubernetes.slice` is the default slice associated with
+the `kubelet` and `kube-proxy` service units defined in the project. Keeping it
+separate from `system.slice` allows for accounting to be distinguished separately.
+
+The `kubelet` will detect its `cgroup` to track `kube-reserved` observed usage on `systemd`.
+If the `kubelet` detects that it is a child of the `system-container` based on the observed
+`cgroup` hierarchy, it will warn.
+
+If the `kubelet` is launched directly from a terminal, its most likely destination will
+be a `scope` that is a child of `user.slice`, as follows:
+
+`/sys/fs/cgroup/<controller>/user.slice/user-1000.slice/session-1.scope`
+
+In this context, the parent `scope` is what will be used to facilitate local developer
+debugging scenarios for tracking `kube-reserved` usage.
+
+The `kubelet` has the following flag:
+
+* `--resource-container="/kubelet":` Absolute name of the resource-only container to create
+and run the Kubelet in (Default: /kubelet).
+
+This flag will not be supported on `systemd` environments since the init system has already
+spawned the process and placed it in the corresponding container associated with its unit.
+
+### Kubernetes container runtime reserved
+
+This proposal asserts that the reservation of compute resources for any associated
+container runtime daemons is tracked by the operator under the `system-reserved` or
+`kubernetes-reserved` values and any enforced limits are set by the
+operator specific to the container runtime.
+
+**Docker**
+
+If the `kubelet` is configured with the `container-runtime` set to `docker`, the
+`kubelet` will detect the `cgroup` associated with the `docker` daemon and use that
+to do local node accounting. If an operator wants to impose runtime limits on the
+`docker` daemon to control resource usage, the operator should set those explicitly in
+the `service` unit that launches `docker`. The `kubelet` will not set any limits itself
+at this time and will assume whatever budget was set aside for `docker` was included in
+either `--kube-reserved` or `--system-reserved` reservations.
+
+Many OS distributions package `docker` by default, and it will often belong to the
+`system.slice` hierarchy, and therefore operators will need to budget it for there
+by default unless they explicitly move it.
+
+**rkt**
+
+rkt has no client/server daemon, and therefore has no explicit requirements on container-runtime
+reservation.
+
+### kubelet cgroup enforcement
+
+The `kubelet` does not enforce the `system-reserved` or `kube-reserved` values by default.
+
+The `kubelet` should support an additional flag to turn on enforcement:
+
+* `--system-reserved-enforce=false` - Optional flag that if true tells the `kubelet`
+to enforce the `system-reserved` constraints defined (if any)
+* `--kube-reserved-enforce=false` - Optional flag that if true tells the `kubelet`
+to enforce the `kube-reserved` constraints defined (if any)
+
+Usage of this flag requires that end-user containers are launched in a separate part
+of cgroup hierarchy via `cgroup-root`.
+
+If this flag is enabled, the `kubelet` will continually validate that the configured
+resource constraints are applied on the associated `cgroup`.
+
+### kubelet cgroup-root behavior under systemd
+
+The `kubelet` supports a `cgroup-root` flag which is the optional root `cgroup` to use for pods.
+
+This flag should be treated as a pass-through to the underlying configured container runtime.
+
+If enforcement is enabled via the flags above, `cgroup-root` warrants special consideration by the operator depending
+on how the node was configured. For example, if the container runtime is `docker` and it is using
+the `systemd` cgroup driver, then `docker` will take the daemon-wide default and launch containers
+in the same slice associated with the `docker.service`. By default, this would mean `system.slice`,
+which could cause end-user pods to be launched in the same part of the cgroup hierarchy as system daemons.
+
+In those environments, it is recommended that `cgroup-root` is configured to be a subtree of `machine.slice`.
+
+### Proposed cgroup hierarchy
+
+```
+$ROOT
+ |
+ +- system.slice
+ | |
+ | +- sshd.service
+ | +- docker.service (optional)
+ | +- ...
+ |
+ +- kubernetes.slice
+ | |
+ | +- kubelet.service
+ | +- docker.service (optional)
+ |
+ +- machine.slice (container runtime specific)
+ | |
+ | +- docker-<container-id>.scope
+ |
+ +- user.slice
+ | +- ...
+```
+
+* `system.slice` corresponds to `--system-reserved`, and contains any services the
+operator brought to the node as normal configuration.
+* `kubernetes.slice` corresponds to the `--kube-reserved`, and contains kube specific
+daemons.
+* `machine.slice` should parent all end-user containers on the system and serve as the
+root of the end-user cluster workloads run on the system.
+* `user.slice` is not explicitly tracked by the `kubelet`, but it is possible for users
+to launch actions directly via `ssh` sessions to the node. Any resource accounting
+reserved for those actions should be part of `system-reserved`.
+
+The container runtime daemon, `docker` in this outline, must be accounted for in either
+`system.slice` or `kubernetes.slice`.
+
+In the future, it is not recommended that the container hierarchy be rooted
+more than 2 layers below the root, as deep hierarchies have historically caused issues with node performance
+in other `cgroup`-aware systems (https://bugzilla.redhat.com/show_bug.cgi?id=850718). It
+is anticipated that the `kubelet` will parent containers based on quality of service
+in the future; in that environment, those changes will be relative to the configured
+`cgroup-root`.
+
+### Linux Kernel Parameters
+
+The `kubelet` will set the following:
+
+* `sysctl -w vm.overcommit_memory=1`
+* `sysctl -w vm.panic_on_oom=0`
+* `sysctl -w kernel/panic=10`
+* `sysctl -w kernel/panic_on_oops=1`
+
+### OOM Score Adjustment
+
+The `kubelet` at bootstrapping will set the `oom_score_adj` value for Kubernetes
+daemons, and any dependent container-runtime daemons.
+
+If `container-runtime` is set to `docker`, then set its `oom_score_adj=-999`
+
+## Implementation concerns
+
+### kubelet block-level architecture
+
+```
++----------+ +----------+ +----------+
+| | | | | Pod |
+| Node <-------+ Container<----+ Lifecycle|
+| Manager | | Manager | | Manager |
+| +-------> | | |
++---+------+ +-----+----+ +----------+
+ | |
+ | |
+ | +-----------------+
+ | | |
+ | | |
++---v--v--+ +-----v----+
+| cgroups | | container|
+| library | | runtimes |
++---+-----+ +-----+----+
+ | |
+ | |
+ +---------+----------+
+ |
+ |
+ +-----------v-----------+
+ | Linux Kernel |
+ +-----------------------+
+```
+
+The `kubelet` should move to an architecture that resembles the above diagram:
+
+* The `kubelet` should not interface directly with the `cgroup` file-system, but instead
+should use a common `cgroups library` that has the proper abstraction in place to
+work with either `cgroupfs` or `systemd`. The `kubelet` should just use `libcontainer`
+abstractions to facilitate this requirement. The `libcontainer` abstractions as
+currently defined only support an `Apply(pid)` pattern, and we need to separate that
+abstraction to allow a cgroup to be created and then later joined.
+* The existing `ContainerManager` should split node bootstrapping out into a separate
+`NodeManager` that is dependent on the configured `cgroup-driver`.
+* The `kubelet` flags for cgroup paths will be converted internally by the cgroup library,
+e.g. `/foo/bar` will convert to `foo-bar.slice`; a rough sketch of this conversion follows.
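+
+The following illustrates that conversion, ignoring the escaping rules of the
+real library:
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// toSystemdSlice converts a cgroupfs-style path such as "/foo/bar" into a
+// systemd slice name such as "foo-bar.slice".
+func toSystemdSlice(cgroupPath string) string {
+	trimmed := strings.Trim(cgroupPath, "/")
+	if trimmed == "" {
+		return "-.slice" // systemd's name for the root slice
+	}
+	return strings.Replace(trimmed, "/", "-", -1) + ".slice"
+}
+
+func main() {
+	fmt.Println(toSystemdSlice("/foo/bar")) // foo-bar.slice
+}
+```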
+
+### kubelet accounting for end-user pods
+
+This proposal re-enforces that it is inappropriate at this time to depend on `--cgroup-root` as the
+primary mechanism to distinguish and account for end-user pod compute resource usage.
+
+Instead, the `kubelet` can and should sum the usage of each running `pod` on the node to account for
+end-user pod usage separate from system-reserved and kubernetes-reserved accounting via `cAdvisor`.
+
+## Known issues
+
+### Docker runtime support for --cgroup-parent
+
+Docker versions <= 1.9 did not have proper support for the `--cgroup-parent` flag on `systemd`. This
+was fixed in this PR (https://github.com/docker/docker/pull/18612). As a result, it's expected
+that containers launched by the `docker` daemon may continue to go in the default `system.slice` and
+appear to be counted under system-reserved node usage accounting.
+
+If operators run with later versions of `docker`, they can avoid this issue via the use of the `cgroup-root`
+flag on the `kubelet`, but this proposal makes no requirement on operators to do that at this time, and
+this can be revisited if/when the project adopts docker 1.10.
+
+Some OS distributions will fix this bug in versions of docker <= 1.9, so operators should
+be aware of how their version of `docker` was packaged when using this feature.
+
+
+
diff --git a/contributors/design-proposals/kubelet-tls-bootstrap.md b/contributors/design-proposals/kubelet-tls-bootstrap.md
new file mode 100644
index 00000000..fbd98413
--- /dev/null
+++ b/contributors/design-proposals/kubelet-tls-bootstrap.md
@@ -0,0 +1,243 @@
+# Kubelet TLS bootstrap
+
+Author: George Tankersley (george.tankersley@coreos.com)
+
+## Preface
+
+This document describes a method for a kubelet to bootstrap itself
+into a TLS-secured cluster. Crucially, it automates the provision and
+distribution of signed certificates.
+
+## Overview
+
+When a kubelet runs for the first time, it must be given TLS assets
+or generate them itself. In the first case, this is a burden on the cluster
+admin and a significant logistical barrier to secure Kubernetes rollouts. In
+the second, the kubelet must self-sign its certificate and forfeits many of the
+advantages of a PKI system. Instead, we propose that the kubelet generate a
+private key and a CSR for submission to a cluster-level certificate signing
+process.
+
+## Preliminaries
+
+We assume the existence of a functioning control plane. The
+apiserver should be configured for TLS initially or possess the ability to
+generate valid TLS credentials for itself. If secret information is passed in
+the request (e.g. auth tokens supplied with the request or included in
+ExtraInfo) then all communications from the node to the apiserver must take
+place over a verified TLS connection.
+
+Each node is additionally provisioned with the following information:
+
+1. Location of the apiserver
+2. Any CA certificates necessary to trust the apiserver's TLS certificate
+3. Access tokens (if needed) to communicate with the CSR endpoint
+
+These should not change often and are thus simple to include in a static
+provisioning script.
+
+## API Changes
+
+### CertificateSigningRequest Object
+
+We introduce a new API object to represent PKCS#10 certificate signing
+requests. It will be accessible under:
+
+`/apis/certificates/v1beta1/certificatesigningrequests/mycsr`
+
+It will have the following structure:
+
+```go
+// Describes a certificate signing request
+type CertificateSigningRequest struct {
+ unversioned.TypeMeta `json:",inline"`
+ api.ObjectMeta `json:"metadata,omitempty"`
+
+ // The certificate request itself and any additional information.
+ Spec CertificateSigningRequestSpec `json:"spec,omitempty"`
+
+ // Derived information about the request.
+ Status CertificateSigningRequestStatus `json:"status,omitempty"`
+}
+
+// This information is immutable after the request is created.
+type CertificateSigningRequestSpec struct {
+ // Base64-encoded PKCS#10 CSR data
+ Request string `json:"request"`
+
+ // Any extra information the node wishes to send with the request.
+ ExtraInfo []string `json:"extrainfo,omitempty"`
+}
+
+// This information is derived from the request by Kubernetes and cannot be
+// modified by users. All information is optional since it might not be
+// available in the underlying request. This is intended to aid approval
+// decisions.
+type CertificateSigningRequestStatus struct {
+ // Information about the requesting user (if relevant)
+ // See user.Info interface for details
+ Username string `json:"username,omitempty"`
+ UID string `json:"uid,omitempty"`
+ Groups []string `json:"groups,omitempty"`
+
+ // Fingerprint of the public key in request
+ Fingerprint string `json:"fingerprint,omitempty"`
+
+ // Subject fields from the request
+ Subject internal.Subject `json:"subject,omitempty"`
+
+ // DNS SANs from the request
+ Hostnames []string `json:"hostnames,omitempty"`
+
+ // IP SANs from the request
+ IPAddresses []string `json:"ipaddresses,omitempty"`
+
+ Conditions []CertificateSigningRequestCondition `json:"conditions,omitempty"`
+}
+
+type RequestConditionType string
+
+// These are the possible states for a certificate request.
+const (
+ Approved RequestConditionType = "Approved"
+ Denied RequestConditionType = "Denied"
+)
+
+type CertificateSigningRequestCondition struct {
+ // request approval state, currently Approved or Denied.
+ Type RequestConditionType `json:"type"`
+ // brief reason for the request state
+ Reason string `json:"reason,omitempty"`
+ // human readable message with details about the request state
+ Message string `json:"message,omitempty"`
+ // If request was approved, the controller will place the issued certificate here.
+ Certificate []byte `json:"certificate,omitempty"`
+}
+
+type CertificateSigningRequestList struct {
+ unversioned.TypeMeta `json:",inline"`
+ unversioned.ListMeta `json:"metadata,omitempty"`
+
+ Items []CertificateSigningRequest `json:"items,omitempty"`
+}
+```
+
+We also introduce `CertificateSigningRequestList` (shown in the definition above) to allow
+listing all the CSRs in the cluster.
+
+## Certificate Request Process
+
+### Node initialization
+
+When the kubelet starts, it checks a location on disk for TLS assets
+(currently `/var/run/kubernetes/kubelet.{key,crt}` by default). If it finds
+them, it proceeds. If there are no TLS assets, the kubelet generates a keypair
+and a self-signed certificate. We propose the following optional behavior instead
+(steps 1 and 2 are sketched after this list):
+
+1. Generate a keypair
+2. Generate a CSR for that keypair with CN set to the hostname (or
+ `--hostname-override` value) and DNS/IP SANs supplied with whatever values
+ the host knows for itself.
+3. Post the CSR to the CSR API endpoint.
+4. Set a watch on the CSR object to be notified of approval or rejection.
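+
+Steps 1 and 2 could look roughly like the sketch below, which uses only the standard
+library's `crypto` packages; the choice of ECDSA, the placeholder IP SAN, and printing the
+base64-encoded DER are illustrative assumptions rather than requirements of this proposal:
+
+```go
+// Sketch of steps 1-2 (not the kubelet's actual implementation): generate a
+// keypair, then build a PKCS#10 CSR with CN set to the hostname and with
+// DNS/IP SANs the host knows for itself. The IP below is a placeholder.
+package main
+
+import (
+	"crypto/ecdsa"
+	"crypto/elliptic"
+	"crypto/rand"
+	"crypto/x509"
+	"crypto/x509/pkix"
+	"encoding/base64"
+	"fmt"
+	"net"
+	"os"
+)
+
+func main() {
+	hostname, err := os.Hostname()
+	if err != nil {
+		panic(err)
+	}
+
+	// Step 1: generate a keypair for the kubelet.
+	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
+	if err != nil {
+		panic(err)
+	}
+
+	// Step 2: build and sign the certificate signing request.
+	template := &x509.CertificateRequest{
+		Subject:     pkix.Name{CommonName: hostname},
+		DNSNames:    []string{hostname},
+		IPAddresses: []net.IP{net.ParseIP("10.0.0.1")},
+	}
+	der, err := x509.CreateCertificateRequest(rand.Reader, template, key)
+	if err != nil {
+		panic(err)
+	}
+
+	// Spec.Request would carry the base64-encoded CSR bytes.
+	fmt.Println(base64.StdEncoding.EncodeToString(der))
+}
+```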
+
+### Controller response
+
+The apiserver persists the CertificateSigningRequests and exposes the List of
+all CSRs for an administrator to approve or reject.
+
+A new certificate controller watches for certificate requests. It must first
+validate the signature on each CSR and add `Condition=Denied` on
+any requests with invalid signatures (with Reason and Message indicating
+such). For valid requests, the controller will derive the information in
+`CertificateSigningRequestStatus` and update that object. The controller should
+watch for updates to the approval condition of any CertificateSigningRequest.
+When a request is approved (signified by Conditions containing only Approved),
+the controller should generate and sign a certificate based on that CSR, then
+update the condition with the certificate data using the `/approval`
+subresource.
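+
+The signature-validation step could be sketched as follows; the decoding assumes the CSR
+was submitted as base64-encoded DER (as in the earlier sketch), and the function only
+returns an error where the real controller would set a Denied condition:
+
+```go
+// Sketch of the controller's first step: decode the CSR carried in
+// Spec.Request and verify its self-signature.
+package controller
+
+import (
+	"crypto/x509"
+	"encoding/base64"
+	"fmt"
+)
+
+// validateCSR returns the parsed request if its signature checks out.
+func validateCSR(specRequest string) (*x509.CertificateRequest, error) {
+	der, err := base64.StdEncoding.DecodeString(specRequest)
+	if err != nil {
+		return nil, fmt.Errorf("request is not valid base64: %v", err)
+	}
+	csr, err := x509.ParseCertificateRequest(der)
+	if err != nil {
+		return nil, fmt.Errorf("request is not a valid PKCS#10 CSR: %v", err)
+	}
+	if err := csr.CheckSignature(); err != nil {
+		return nil, fmt.Errorf("CSR signature is invalid: %v", err)
+	}
+	return csr, nil
+}
+```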
+
+### Manual CSR approval
+
+An administrator using `kubectl` or another API client can query the
+CertificateSigningRequestList and update the approval condition of
+CertificateSigningRequests. The default state is empty, indicating that there
+has been no decision so far. A state of "Approved" indicates that the admin has
+approved the request and the certificate controller should issue the
+certificate. A state of "Denied" indicates that the admin has denied the
+request. An admin may also supply Reason and Message fields to explain the
+rejection.
+
+## kube-apiserver support
+
+The apiserver will present the new endpoints mentioned above and support the
+relevant object types.
+
+## kube-controller-manager support
+
+To handle certificate issuance, the controller-manager will need access to CA
+signing assets. This could be as simple as a private key and a config file or
+as complex as a PKCS#11 client and supplementary policy system. For now, we
+will add flags for a signing key, a certificate, and a basic policy file.
+
+## kubectl support
+
+To support manual CSR inspection and approval, we will add support for listing,
+inspecting, and approving or denying CertificateSigningRequests to kubectl. The
+interaction will be similar to
+[salt-key](https://docs.saltstack.com/en/latest/ref/cli/salt-key.html).
+
+Specifically, the admin will have the ability to retrieve the full list of
+pending CSRs, inspect their contents, and set their approval conditions to one
+of:
+
+1. **Approved** if the controller should issue the cert
+2. **Denied** if the controller should not issue the cert
+
+The suggested command for listing is `kubectl get csrs`. The approve/deny
+interactions can be accomplished with normal updates, but would be more
+conveniently accessed by direct subresource updates. We leave this for future
+updates to kubectl.
+
+## Security Considerations
+
+### Endpoint Access Control
+
+The ability to post CSRs to the signing endpoint should be controlled. As a
+simple solution we propose that each node be provisioned with an auth token
+(possibly static across the cluster) that is scoped via ABAC to only allow
+access to the CSR endpoint.
+
+### Expiration & Revocation
+
+The node is responsible for monitoring its own certificate expiration date.
+When the certificate is close to expiration, the kubelet should begin repeating
+this flow until it successfully obtains a new certificate. If the expiring
+certificate has not been revoked and the previous certificate request is still
+approved, then it may do so using the same keypair unless the cluster policy
+(see "Future Work") requires fresh keys.
+
+Revocation is for the most part an unhandled problem in Go, requiring each
+application to produce its own logic around a variety of parsing functions. For
+now, our suggested best practice is to issue only short-lived certificates. In
+the future it may make sense to add CRL support to the apiserver's client cert
+auth.
+
+## Future Work
+
+- revocation UI in kubectl and CRL support at the apiserver
+- supplemental policy (e.g. cluster CA only issues 30-day certs for hostnames *.k8s.example.com, each new cert must have fresh keys, ...)
+- fully automated provisioning (using a handshake protocol or external list of authorized machines)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-tls-bootstrap.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/kubemark.md b/contributors/design-proposals/kubemark.md
new file mode 100644
index 00000000..1f28e2b0
--- /dev/null
+++ b/contributors/design-proposals/kubemark.md
@@ -0,0 +1,157 @@
+# Kubemark proposal
+
+## Goal of this document
+
+This document describes a design of Kubemark - a system that allows performance testing of a Kubernetes cluster. It describes the
+assumptions and the high-level design, and discusses possible solutions for lower-level problems. It is intended as a starting point for more
+detailed discussion.
+
+## Current state and objective
+
+Currently performance testing happens on ‘live’ clusters of up to 100 Nodes. It takes quite a while to start such a cluster or to push
+updates to all Nodes, and it uses quite a lot of resources. At this scale the amount of wasted time and used resources is still acceptable.
+In the next quarter or two we’re targeting 1000-Node clusters, which will push this way beyond the ‘acceptable’ level. Additionally we want to
+enable people without many resources to run scalability tests on bigger clusters than they can afford at a given time. Having the ability to
+cheaply run scalability tests will enable us to run some set of them on "normal" test clusters, which in turn would mean the ability to run
+them on every PR.
+
+This means that we need a system that will allow for realistic performance testing on a (much) smaller number of “real” machines. The first
+assumption we make is that Nodes are independent, i.e. the number of existing Nodes does not impact the performance of a single Node. This is not
+entirely true, as the number of Nodes can increase the latency of various components on the Master machine, which in turn may increase the latency of Node
+operations, but we’re not interested in measuring this effect here. Instead we want to measure how the number of Nodes and the load imposed by
+Node daemons affect the performance of Master components.
+
+## Kubemark architecture overview
+
+The high-level idea behind Kubemark is to write a library that allows running artificial "Hollow" Nodes that will be able to simulate the
+behavior of a real Kubelet and KubeProxy in a single, lightweight binary. Hollow components will need to correctly respond to Controllers
+(via the API server), and preferably, in the fullness of time, be able to ‘replay’ previously recorded real traffic (this is out of scope for
+the initial version). To teach Hollow components to replay recorded traffic, they will need to store data specifying when a given Pod/Container
+should die (e.g. observed lifetime). Such data can be extracted e.g. from etcd Raft logs, or it can be reconstructed from Events. In the
+initial version we only want them to be able to fool Master components and put some configurable (in what way TBD) load on them.
+
+When we have the Hollow Node ready, we’ll be able to test the performance of Master components by creating a real Master Node, with API server,
+Controllers, etcd and whatnot, and creating a number of Hollow Nodes that will register with the running Master.
+
+To make Kubemark easier to maintain as the system evolves, Hollow components will reuse real "production" code for Kubelet and KubeProxy, but
+will mock all the backends with no-op or very simple mocks. We believe that this approach is better in the long run than writing a special
+"performance-test-aimed" separate version of them. This may take more time to create an initial version, but we think the maintenance cost will
+be noticeably smaller.
+
+### Option 1
+
+For the initial version we will teach Master components to use the port number to identify a Kubelet/KubeProxy. This will allow running those
+components on non-default ports, and at the same time will allow running multiple Hollow Nodes on a single machine. During setup we will
+generate credentials for cluster communication and pass them to HollowKubelet/HollowProxy to use. The Master will treat all Hollow Nodes as
+normal ones.
+
+![Kubemark architecture diagram for option 1](Kubemark_architecture.png?raw=true "Kubemark architecture overview")
+*Kubemark architecture diagram for option 1*
+
+### Option 2
+
+As a second (equivalent) option we will run Kubemark on top of a 'real' Kubernetes cluster, where both the Master and the Hollow Nodes will be Pods.
+In this option we'll be able to use Kubernetes mechanisms to streamline setup, e.g. by using Kubernetes networking to ensure unique IPs for
+Hollow Nodes, or using Secrets to distribute Kubelet credentials. The downside of this configuration is that some noise is likely to
+appear in Kubemark results, from either CPU/memory pressure from other things running on the Nodes (e.g. FluentD, or Kubelet) or from running the
+cluster over an overlay network. We believe that it'll be possible to turn off cluster monitoring for Kubemark runs, so that the impact
+of real Node daemons will be minimized, but we don't know what the impact of using a higher-level networking stack will be. Running a
+comparison will be an interesting test in itself.
+
+### Discussion
+
+Before taking a closer look at the steps necessary to set up a minimal Hollow cluster, it's hard to tell which approach will be simpler. It's
+quite possible that the initial version will end up as a hybrid between running the Hollow cluster directly on top of VMs and running the
+Hollow cluster on top of a Kubernetes cluster that is itself running on top of VMs, e.g. running Nodes as Pods in a Kubernetes cluster and the Master
+directly on top of a VM.
+
+## Things to simulate
+
+In real Kubernetes on a single Node we run two daemons that communicate with the Master in some way: Kubelet and KubeProxy.
+
+### KubeProxy
+
+As a replacement for KubeProxy we'll use HollowProxy, which will be a real KubeProxy with injected no-op mocks everywhere it makes sense.
+
+### Kubelet
+
+As a replacement for Kubelet we'll use HollowKubelet, which will be a real Kubelet with injected no-op or simple mocks everywhere it makes
+sense.
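+
+To make the "injected mocks" idea concrete, here is a hedged sketch; `ContainerRuntime`
+below is a simplified stand-in for the Kubelet's real backend interfaces, not the actual
+interface the production code uses:
+
+```go
+// Illustrative sketch of the mocking approach: Hollow components reuse
+// production code but satisfy its backend interfaces with no-ops that only
+// record what they were asked to do.
+package hollow
+
+// ContainerRuntime is a simplified stand-in for a real Kubelet backend.
+type ContainerRuntime interface {
+	StartContainer(podName, containerName string) error
+	KillContainer(podName, containerName string) error
+}
+
+// fakeRuntime "runs" containers by recording them and doing nothing else.
+type fakeRuntime struct {
+	started map[string]bool
+}
+
+// NewFakeRuntime returns the no-op runtime a HollowKubelet would be wired with.
+func NewFakeRuntime() ContainerRuntime {
+	return &fakeRuntime{started: map[string]bool{}}
+}
+
+func (f *fakeRuntime) StartContainer(podName, containerName string) error {
+	f.started[podName+"/"+containerName] = true
+	return nil
+}
+
+func (f *fakeRuntime) KillContainer(podName, containerName string) error {
+	delete(f.started, podName+"/"+containerName)
+	return nil
+}
+```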
+
+Kubelet also exposes a cAdvisor endpoint, which is scraped by Heapster, and a healthz endpoint, which is read by supervisord, and we have FluentD running as a
+Pod on each Node that exports logs to Elasticsearch (or Google Cloud Logging). Both Heapster and Elasticsearch run in Pods in the
+cluster, so they do not add any load on the Master components by themselves. There can be other systems that scrape Heapster through the proxy running
+on the Master, which adds additional load, but they're not part of the default setup, so in the first version we won't simulate this behavior.
+
+In the first version we’ll assume that all started Pods will run indefinitely if not explicitly deleted. In the future we can add a model
+of short-running batch jobs, but in the initial version we’ll assume only serving-like Pods.
+
+### Heapster
+
+In addition to the system components, we run Heapster as a part of the cluster monitoring setup. Heapster currently watches Events, Pods and Nodes
+through the API server. In the test setup we can use the real Heapster for watching the API server, with the piece that scrapes cAdvisor
+data from Kubelets mocked out.
+
+### Elasticsearch and Fluentd
+
+Similarly to Heapster, Elasticsearch runs outside the Master machine but generates some traffic on it. The Fluentd “daemon” running on the Master
+periodically sends the Docker logs it has gathered to the Elasticsearch instance running on one of the Nodes. In the initial version we omit Elasticsearch,
+as it produces only a small, constant load on the Master Node that does not change with the size of the cluster.
+
+## Necessary work
+
+There are three more or less independent things that need to be worked on:
+- HollowNode implementation: creating a library/binary that will be able to listen to Watches and respond in a correct fashion with Status
+updates. This also involves the creation of a CloudProvider that can produce such Hollow Nodes, or making sure that Hollow Nodes can correctly
+self-register with a no-provider Master.
+- Kubemark setup: figuring out the networking model, the number of Hollow Nodes that will be allowed to run on a single “machine”, and writing
+setup/run/teardown scripts (in [option 1](#option-1)), or figuring out how to run the Master and Hollow Nodes on top of Kubernetes
+(in [option 2](#option-2)).
+- Creating a Player component that will send requests to the API server, putting a load on the cluster. This involves creating a way to
+specify the desired workload. This task is
+very well isolated from the rest, as it is about sending requests to the real API server. Because of that we can discuss its requirements
+separately.
+
+## Concerns
+
+Network performance most likely won't be a problem for the initial version if running directly on VMs rather than on top of a Kubernetes
+cluster, as Kubemark will be running on the standard networking stack (no cloud-provider software routes or overlay network are needed, as we
+don't need custom routing between Pods). Similarly we don't think that running Kubemark on Kubernetes' virtualized cluster networking will
+cause a noticeable performance impact, but this requires testing.
+
+On the other hand, when adding additional features it may turn out that we need to simulate the Kubernetes Pod network. In that case, when running
+'pure' Kubemark we may try one of the following:
+ - running an overlay network like Flannel or OVS instead of using cloud provider routes,
+ - writing a simple network multiplexer to multiplex communications from the Hollow Kubelets/KubeProxies on the machine.
+
+In the case of Kubemark on Kubernetes, it may turn out that we run into problems with adding yet another layer of network virtualization, but we
+don't need to solve this problem now.
+
+## Work plan
+
+- Teach/make sure that the Master can talk to multiple Kubelets on the same machine ([option 1](#option-1)):
+ - make sure that the Master can talk to a Kubelet on a non-default port,
+ - make sure that the Master can talk to all Kubelets on different ports,
+- Write the HollowNode library:
+ - new HollowProxy,
+ - new HollowKubelet,
+ - new HollowNode combining the two,
+ - make sure that the Master can talk to two HollowKubelets running on the same machine,
+- Make sure that we can run a Hollow cluster on top of Kubernetes ([option 2](#option-2)),
+- Write a Player that will automatically put some predefined load on the Master <- this is the moment when it’s possible to play with it, and it is useful by itself for
+scalability tests. Alternatively we can just use the current density/load tests,
+- Benchmark our machines - see how many Watch clients we can have before everything explodes,
+- See how many HollowNodes we can run on a single machine by attaching them to the real Master <- this is the moment it starts to be useful,
+- Update the kube-up/kube-down scripts to enable creating “HollowClusters” (or write new scripts), and integrate the HollowCluster with Elasticsearch/Heapster equivalents,
+- Allow passing custom configuration to the Player.
+
+## Future work
+
+In the future we want to add the following capabilities to the Kubemark system:
+- replaying real traffic reconstructed from the recorded Events stream,
+- simulating scraping things running on Nodes through Master proxy.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubemark.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/local-cluster-ux.md b/contributors/design-proposals/local-cluster-ux.md
new file mode 100644
index 00000000..c78a51b7
--- /dev/null
+++ b/contributors/design-proposals/local-cluster-ux.md
@@ -0,0 +1,161 @@
+# Kubernetes Local Cluster Experience
+
+This proposal attempts to improve the existing local cluster experience for Kubernetes.
+The current local cluster experience is sub-par and often not functional.
+There are several options to set up a local cluster (docker, vagrant, linux processes, etc.) and we do not test any of them continuously.
+Here are some highlighted issues:
+- The Docker-based solution breaks with docker upgrades, does not support DNS, and many kubelet features are not functional yet inside a container.
+- The Vagrant-based solutions are too heavy and have mostly failed on OS X.
+- The local Linux cluster is poorly documented and undiscoverable.
+
+From an end user's perspective, they want to run a Kubernetes cluster; they care less about *how* a cluster is set up locally and more about what they can do with a functional cluster.
+
+
+## Primary Goals
+
+From a high level, the goal is to make it easy for a new user to run a Kubernetes cluster and play with curated examples that require the least amount of knowledge about Kubernetes.
+These examples will only use kubectl, and only a subset of the available Kubernetes features will be exposed.
+
+- Works across multiple OSes - OS X, Linux and Windows primarily.
+- Single command setup and teardown UX.
+- Unified UX across OSes
+- Minimal dependencies on third party software.
+- Minimal resource overhead.
+- Eliminate any other alternatives to local cluster deployment.
+
+## Secondary Goals
+
+- Enable developers to use the local cluster for kubernetes development.
+
+## Non Goals
+
+- Simplifying kubernetes production deployment experience. [Kube-deploy](https://github.com/kubernetes/kube-deploy) is attempting to tackle this problem.
+- Supporting all possible deployment configurations of Kubernetes like various types of storage, networking, etc.
+
+
+## Local cluster requirements
+
+- Includes all the master components & DNS (Apiserver, scheduler, controller manager, etcd and kube dns)
+- Basic auth
+- Service accounts should be setup
+- Kubectl should be auto-configured to use the local cluster
+- Tested & maintained as part of Kubernetes core
+
+## Existing solutions
+
+Following are some of the existing solutions that attempt to simplify local cluster deployments.
+
+### [Spread](https://github.com/redspread/spread)
+
+Spread's UX is great!
+It is adapted from monokube and includes DNS as well.
+It satisfies almost all the requirements, except that it requires docker to be pre-installed.
+It has a loose dependency on docker.
+New releases of docker might break this setup.
+
+### [Kmachine](https://github.com/skippbox/kmachine)
+
+Kmachine is adapted from docker-machine.
+It exposes the entire docker-machine CLI.
+It is possible to repurpose Kmachine to meet all our requirements.
+
+### [Monokube](https://github.com/polvi/monokube)
+
+Monokube is a single binary that runs all the kube master components.
+It does not include DNS.
+This is only a part of the overall local cluster solution.
+
+### Vagrant
+
+The kube-up.sh script included in the Kubernetes release supports a few Vagrant-based local cluster deployments.
+kube-up.sh is not user friendly.
+It typically takes a long time for the cluster to be set up using Vagrant, and it is often unsuccessful on OS X.
+The [CoreOS single machine guide](https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant-single.html) uses Vagrant as well, and it just works.
+Since we are targeting a single-command install/teardown experience, Vagrant needs to be an implementation detail and not be exposed to our users.
+
+## Proposed Solution
+
+To avoid exposing users to third-party software and external dependencies, we will build a toolbox that ships with all the dependencies, including all Kubernetes components, a hypervisor, a base image, kubectl, etc.
+*Note: Docker provides a [similar toolbox](https://www.docker.com/products/docker-toolbox).*
+This "Localkube" tool will be referred to as "Minikube" in this proposal to avoid ambiguity with Spread's existing ["localkube"](https://github.com/redspread/localkube).
+The final name of this tool is TBD. Suggestions are welcome!
+
+Minikube will provide a unified CLI to interact with the local cluster.
+The CLI will support only a few operations (a minimal sketch of the command surface follows the list):
+ - **Start** - creates & starts a local cluster along with setting up kubectl & networking (if necessary)
+ - **Stop** - suspends the local cluster & preserves cluster state
+ - **Delete** - deletes the local cluster completely
+ - **Upgrade** - upgrades internal components to the latest available version (upgrades are not guaranteed to preserve cluster state)
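+
+A minimal sketch of that command surface, using only the standard library; the wiring to
+libmachine and localkube is intentionally omitted and the messages are placeholders:
+
+```go
+// Illustrative sketch of the proposed CLI surface only; no real cluster
+// management is performed here.
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+func main() {
+	if len(os.Args) < 2 {
+		fmt.Println("usage: minikube start|stop|delete|upgrade")
+		os.Exit(1)
+	}
+	switch os.Args[1] {
+	case "start":
+		fmt.Println("creating and starting the local cluster...")
+	case "stop":
+		fmt.Println("suspending the local cluster, preserving its state...")
+	case "delete":
+		fmt.Println("deleting the local cluster completely...")
+	case "upgrade":
+		fmt.Println("upgrading localkube and related components...")
+	default:
+		fmt.Printf("unknown command %q\n", os.Args[1])
+		os.Exit(1)
+	}
+}
+```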
+
+For running and managing the Kubernetes components themselves, we can re-use [Spread's localkube](https://github.com/redspread/localkube).
+Localkube is a self-contained Go binary that includes all the master components, including DNS, and runs them using multiple goroutines.
+Each Kubernetes release will include a localkube binary that has been tested exhaustively.
+
+To support Windows and OS X, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines.
+Minikube will be shipped with a hypervisor (VirtualBox) in the case of OS X.
+Minikube will include a base image that will be well tested.
+
+In the case of Linux, since the cluster can be run locally, we ideally want to avoid setting up a VM.
+Since docker is the only fully supported runtime as of Kubernetes v1.2, we can initially use docker to run and manage localkube.
+There is a risk of being incompatible with the user's existing version of docker.
+By using a VM, we could avoid such incompatibility issues, though.
+Feedback from the community will be helpful here.
+
+If the goal is to run outside of a VM, we can have minikube prompt the user if docker is unavailable or its version is incompatible.
+Alternatives to docker for running the localkube core include using [rkt](https://coreos.com/rkt/docs/latest/), setting up systemd services, or a System V init script, depending on the distro.
+
+To summarize, the pipeline is as follows:
+
+##### OS X / Windows
+
+minikube -> libmachine -> VirtualBox/Hyper-V -> Linux VM -> localkube
+
+##### Linux
+
+minikube -> docker -> localkube
+
+### Alternatives considered
+
+#### Bring your own docker
+
+##### Pros
+
+- Kubernetes users will probably already have it
+- No extra work for us
+- Only one VM/daemon, we can just reuse the existing one
+
+##### Cons
+
+- Not designed to be wrapped, may be unstable
+- Might make configuring networking difficult on OS X and Windows
+- Versioning and updates will be challenging. We can mitigate some of this with testing at HEAD, but we'll inevitably hit situations where it's infeasible to work with multiple versions of docker.
+- There are lots of different ways to install docker; networking might be challenging if we try to support many paths.
+
+#### Vagrant
+
+##### Pros
+
+- We control the entire experience
+- Networking might be easier to build
+- Docker can't break us since we'll include a pinned version of Docker
+- Easier to support rkt or hyper in the future
+- Would let us run some things outside of containers (kubelet, maybe ingress/load balancers)
+
+##### Cons
+
+- More work
+- Extra resources (if the user is also running docker-machine)
+- Confusing if there are two docker daemons (images built in one can't be run in another)
+- Always needs a VM, even on Linux
+- Requires installing and possibly understanding Vagrant.
+
+## Releases & Distribution
+
+- Minikube will be released independently of Kubernetes core in order to facilitate fixing issues that are outside of Kubernetes core.
+- The latest version of Minikube is guaranteed to support the latest release of Kubernetes, including documentation.
+- The Google Cloud SDK will package minikube and provide utilities for configuring kubectl to use it, but will not in any other way wrap minikube.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/local-cluster-ux.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/metadata-policy.md b/contributors/design-proposals/metadata-policy.md
new file mode 100644
index 00000000..57416f11
--- /dev/null
+++ b/contributors/design-proposals/metadata-policy.md
@@ -0,0 +1,137 @@
+# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system
+
+## Introduction
+
+This document describes a new API resource, `MetadataPolicy`, that configures an
+admission controller to take one or more actions based on an object's metadata.
+Initially the metadata fields that the predicates can examine are labels and
+annotations, and the actions are to add one or more labels and/or annotations,
+or to reject creation/update of the object. In the future other actions might be
+supported, such as applying an initializer.
+
+The first use of `MetadataPolicy` will be to decide which scheduler should
+schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md)
+Kubernetes system. In particular, the policy will add the scheduler name
+annotation to a pod based on an annotation that is already on the pod that
+indicates the QoS of the pod. (That annotation was presumably set by a simpler
+admission controller that uses code, rather than configuration, to map the
+resource requests and limits of a pod to QoS, and attaches the corresponding
+annotation.)
+
+We anticipate a number of other uses for `MetadataPolicy`, such as defaulting
+for labels and annotations, prohibiting/requiring particular labels or
+annotations, or choosing a scheduling policy within a scheduler. We do not
+discuss them in this doc.
+
+
+## API
+
+```go
+// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource.
+// Every rule is applied, in an unspecified order, but if the action for any rule
+// that matches is to reject the object, then the object is rejected without being mutated.
+type MetadataPolicySpec struct {
+ Rules []MetadataPolicyRule `json:"rules,omitempty"`
+}
+
+// If the PolicyPredicate is met, then the PolicyAction is applied.
+// Example rules:
+// reject object if label with key X is present (i.e. require X)
+// reject object if label with key X is not present (i.e. forbid X)
+// add label X=Y if label with key X is not present (i.e. default X)
+// add annotation A=B if object has annotation C=D or E=F
+type MetadataPolicyRule struct {
+ PolicyPredicate PolicyPredicate `json:"policyPredicate"`
+ PolicyAction PolicyAction `json:"policyAction"`
+}
+
+// All criteria must be met for the PolicyPredicate to be considered met.
+type PolicyPredicate struct {
+ // Note that Namespace is not listed here because MetadataPolicy is per-Namespace.
+ LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
+ AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"`
+}
+
+// Apply the indicated Labels and/or Annotations (if present), unless Reject is set
+// to true, in which case reject the object without mutating it.
+type PolicyAction struct {
+ // If true, the object will be rejected and not mutated.
+ Reject bool `json:"reject"`
+ // The labels to add or update, if any.
+ UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"`
+ // The annotations to add or update, if any.
+ UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"`
+}
+
+// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying
+// policies that should be applied to objects based on the objects' metadata. All MetadataPolicy's
+// are applied to all objects in the namespace; the order of evaluation is not guaranteed,
+// but if any of the matching policies have an action of rejecting the object, then the object
+// will be rejected without being mutated.
+type MetadataPolicy struct {
+ unversioned.TypeMeta `json:",inline"`
+ // Standard object's metadata.
+ // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+ ObjectMeta `json:"metadata,omitempty"`
+
+ // Spec defines the metadata policy that should be enforced.
+ // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+ Spec MetadataPolicySpec `json:"spec,omitempty"`
+}
+
+// MetadataPolicyList is a list of MetadataPolicy items.
+type MetadataPolicyList struct {
+ unversioned.TypeMeta `json:",inline"`
+ // Standard list metadata.
+ // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+ unversioned.ListMeta `json:"metadata,omitempty"`
+
+ // Items is a list of MetadataPolicy objects.
+ // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+ Items []MetadataPolicy `json:"items"`
+}
+```
+
+## Implementation plan
+
+1. Create `MetadataPolicy` API resource
+1. Create admission controller that implements policies defined in
+`MetadataPolicy`
+1. Create admission controller that sets the annotation
+`scheduler.alpha.kubernetes.io/qos: <QoS>`
+(where `QoS` is one of `Guaranteed`, `Burstable`, `BestEffort`)
+based on the pod's resource requests and limits (a sketch of this mapping follows).
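+
+The QoS mapping in step 3 could be sketched as below, following the usual Kubernetes rules
+(Guaranteed when limits are set and requests equal limits for every resource in every
+container, BestEffort when no requests or limits are set anywhere, Burstable otherwise);
+the `containerResources` type is a simplified stand-in for the pod's container specs:
+
+```go
+// Simplified sketch of mapping a pod's resource requests/limits to the QoS
+// value used in the scheduler.alpha.kubernetes.io/qos annotation. The real
+// admission controller would walk api.Pod container specs instead.
+package qos
+
+// containerResources is a stand-in: resource name -> quantity string.
+type containerResources struct {
+	Requests map[string]string
+	Limits   map[string]string
+}
+
+func qosClass(containers []containerResources) string {
+	allEmpty := true
+	allGuaranteed := true
+	for _, r := range containers {
+		if len(r.Requests) != 0 || len(r.Limits) != 0 {
+			allEmpty = false
+		}
+		// Guaranteed requires limits to be set, with requests (if given)
+		// equal to the corresponding limits.
+		if len(r.Limits) == 0 {
+			allGuaranteed = false
+		}
+		for name, req := range r.Requests {
+			if lim, ok := r.Limits[name]; !ok || lim != req {
+				allGuaranteed = false
+			}
+		}
+	}
+	if allEmpty {
+		return "BestEffort"
+	}
+	if allGuaranteed {
+		return "Guaranteed"
+	}
+	return "Burstable"
+}
+```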
+
+## Future work
+
+Longer-term we will have QoS set on create and update by the registry,
+similar to the `Pending` phase today, instead of having an admission controller
+(one that runs before the controller that takes `MetadataPolicy` as input) do it.
+
+We plan to eventually move from having an admission controller set the scheduler
+name as a pod annotation, to using the initializer concept. In particular, the
+scheduler will be an initializer, and the admission controller that decides
+which scheduler to use will add the scheduler's name to the list of initializers
+for the pod (presumably the scheduler will be the last initializer to run on
+each pod). The admission controller would still be configured using the
+`MetadataPolicy` described here, only the mechanism the admission controller
+uses to record its decision of which scheduler to use would change.
+
+## Related issues
+
+The main issue for multiple schedulers is #11793. There was also a lot of
+discussion in PRs #17197 and #17865.
+
+We could use the approach described here to choose a scheduling policy within a
+single scheduler, as opposed to choosing a scheduler, a desire mentioned in #9920.
+Issue #17097 describes a scenario unrelated to scheduler-choosing where
+`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized
+API for matching "claims" to "service classes"; matching a pod to a scheduler
+would be one use for such an API.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/monitoring_architecture.md b/contributors/design-proposals/monitoring_architecture.md
new file mode 100644
index 00000000..b819eeca
--- /dev/null
+++ b/contributors/design-proposals/monitoring_architecture.md
@@ -0,0 +1,203 @@
+# Kubernetes monitoring architecture
+
+## Executive Summary
+
+Monitoring is split into two pipelines:
+
+* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down
+Heapster called metrics-server, and the API server serving the master metrics API. These
+metrics are used by core system components, such as scheduling logic (e.g. scheduler and
+horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components
+(e.g. `kubectl top`). This pipeline is not intended for integration with third-party
+monitoring systems.
+* A **monitoring pipeline** used for collecting various metrics from the system and exposing
+them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore
+via adapters. Users can choose from many monitoring system vendors, or run none at all. In
+open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options
+will be easy to install. We expect that such pipelines will typically consist of a per-node
+agent and a cluster-level aggregator.
+
+The architecture is illustrated in the diagram in the Appendix of this doc.
+
+## Introduction and Objectives
+
+This document proposes a high-level monitoring architecture for Kubernetes. It covers
+a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc,
+specifically focusing on an architecture (components and their interactions) that
+hopefully meets the numerous requirements. We do not specify any particular timeframe
+for implementing this architecture, nor any particular roadmap for getting there.
+
+### Terminology
+
+There are two types of metrics, system metrics and service metrics. System metrics are
+generic metrics that are generally available from every entity that is monitored (e.g.
+usage of CPU and memory by container and node). Service metrics are explicitly defined
+in application code and exported (e.g. number of 500s served by the API server). Both
+system metrics and service metrics can originate from users’ containers or from system
+infrastructure components (master components like the API server, addon pods running on
+the master, and addon pods running on user nodes).
+
+We divide system metrics into
+
+* *core metrics*, which are metrics that Kubernetes understands and uses for operation
+of its internal components and core utilities -- for example, metrics used for scheduling
+(including the inputs to the algorithms for resource estimation, initial resources/vertical
+autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics),
+the kube dashboard, and “kubectl top.” As of now this would consist of cpu cumulative usage,
+memory instantaneous usage, disk usage of pods, and disk usage of containers.
+* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they
+include the core metrics (though not necessarily in a format Kubernetes understands) plus
+additional metrics.
+
+Service metrics can be divided into those produced by Kubernetes infrastructure components
+(and thus useful for operation of the Kubernetes cluster) and those produced by user applications.
+Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics.
+Of course horizontal pod autoscaling also uses core metrics.
+
+We consider logging to be separate from monitoring, so logging is outside the scope of
+this doc.
+
+### Requirements
+
+The monitoring architecture should
+
+* include a solution that is part of core Kubernetes and
+ * makes core system metrics about nodes, pods, and containers available via a standard
+ master API (today the master metrics API), such that core Kubernetes features do not
+ depend on non-core components
+ * requires Kubelet to only export a limited set of metrics, namely those required for
+ core Kubernetes components to correctly operate (this is related to #18770)
+ * can scale up to at least 5000 nodes
+ * is small enough that we can require that all of its components be running in all deployment
+ configurations
+* include an out-of-the-box solution that can serve historical data, e.g. to support Initial
+Resources and vertical pod autoscaling as well as cluster analytics queries, that depends
+only on core Kubernetes
+* allow for third-party monitoring solutions that are not part of core Kubernetes and can
+be integrated with components like Horizontal Pod Autoscaler that require service metrics
+
+## Architecture
+
+We divide our description of the long-term architecture plan into the core metrics pipeline
+and the monitoring pipeline. For each, it is necessary to think about how to deal with each
+type of metric (core metrics, non-core metrics, and service metrics) from both the master
+and minions.
+
+### Core metrics pipeline
+
+The core metrics pipeline collects a set of core system metrics. There are two sources for
+these metrics:
+
+* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that
+is part of Kubelet will be slimmed down to provide only core system metrics)
+* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from
+Kubelet into resource estimates (values used by the scheduler to enable more advanced,
+usage-based scheduling)
+
+These sources are scraped by a component we call *metrics-server*, which is like a slimmed-down
+version of today's Heapster. The metrics-server stores only the latest values locally and has no sinks.
+The metrics-server exposes the master metrics API. (The configuration described here is similar
+to the current Heapster in “standalone” mode.)
+[Discovery summarizer](../../docs/proposals/federated-api-servers.md)
+makes the master metrics API available to external clients such that from the client’s perspective
+it looks the same as talking to the API server.
+
+Core (system) metrics are handled as described above in all deployment environments. The only
+easily replaceable part is the resource estimator, which could be replaced by power users. In
+theory, metrics-server itself can also be substituted, but it’d be similar to substituting the
+apiserver itself or the controller-manager: possible, but not recommended and not supported.
+
+Eventually the core metrics pipeline might also collect metrics from Kubelet and Docker daemon
+themselves (e.g. CPU usage of Kubelet), even though they do not run in containers.
+
+The core metrics pipeline is intentionally small and not designed for third-party integrations.
+“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline
+(see next section) and can run on Kubernetes without having to make changes to upstream components.
+In this way we can remove the burden we have today that comes with maintaining Heapster as the
+integration point for every possible metrics source, sink, and feature.
+
+#### Infrastore
+
+We will build an open-source Infrastore component (most likely reusing existing technologies)
+for serving historical queries over core system metrics and events, which it will fetch from
+the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries --
+this is TBD) to handle the following use cases:
+
+* initial resources
+* vertical autoscaling
+* oldtimer API
+* decision-support queries for debugging, capacity planning, etc.
+* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
+
+In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes
+infrastructure containers), described in the upcoming sections.
+
+### Monitoring pipeline
+
+One of the goals of building a dedicated metrics pipeline for core metrics, as described in the
+previous section, is to allow for a separate monitoring pipeline that can be very flexible
+because core Kubernetes components do not need to rely on it. By default we will not provide
+one, but we will provide an easy way to install one (using a single command, most likely using
+Helm). We describe the monitoring pipeline in this section.
+
+Data collected by the monitoring pipeline may contain any sub- or superset of the following groups
+of metrics:
+
+* core system metrics
+* non-core system metrics
+* service metrics from user application containers
+* service metrics from Kubernetes infrastructure containers; these metrics are exposed using
+Prometheus instrumentation
+
+It is up to the monitoring solution to decide which of these are collected.
+
+In order to enable horizontal pod autoscaling based on custom metrics, the provider of the
+monitoring pipeline would also have to create a stateless API adapter that pulls the custom
+metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. Such an
+API will be a well-defined, versioned API similar to regular APIs. Details of how it will be
+exposed or discovered will be covered in a detailed design doc for this component.
+
+The same approach applies if it is desired to make monitoring pipeline metrics available in
+Infrastore. These adapters could be standalone components, libraries, or part of the monitoring
+solution itself.
+
+There are many possible combinations of node and cluster-level agents that could comprise a
+monitoring pipeline, including:
+* cAdvisor + Heapster + InfluxDB (or any other sink)
+* cAdvisor + collectd + Heapster
+* cAdvisor + Prometheus
+* snapd + Heapster
+* snapd + SNAP cluster-level agent
+* Sysdig
+
+As an example we’ll describe a potential integration with cAdvisor + Prometheus.
+
+Prometheus has the following metric sources on a node:
+* core and non-core system metrics from cAdvisor
+* service metrics exposed by containers via HTTP handler in Prometheus format
+* [optional] metrics about node itself from Node Exporter (a Prometheus component)
+
+All of them are polled by the Prometheus cluster-level agent. We can use the Prometheus
+cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a
+standalone API adapter that proxies/translates between the Prometheus Query Language endpoint
+on the Prometheus cluster-level agent and an HPA-specific API. Likewise an adapter can be
+used to make the metrics from the monitoring pipeline available in Infrastore. Neither
+adapter is necessary if the user does not need the corresponding feature.
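+
+As a rough illustration of the adapter's Prometheus-facing half, the sketch below issues an
+instant query against the Prometheus HTTP API and returns the first sample's value; the
+Prometheus address, the query, and how the result is re-exposed to the HPA are assumptions,
+since that API is to be defined in a separate design doc:
+
+```go
+// Sketch of the adapter's read path: ask the Prometheus cluster-level agent
+// for an instant vector via its HTTP API and decode the first sample.
+package adapter
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/http"
+	"net/url"
+)
+
+// promResponse mirrors the relevant parts of Prometheus' /api/v1/query reply.
+type promResponse struct {
+	Status string `json:"status"`
+	Data   struct {
+		Result []struct {
+			Metric map[string]string `json:"metric"`
+			Value  []interface{}     `json:"value"` // [timestamp, "value"]
+		} `json:"result"`
+	} `json:"data"`
+}
+
+// queryCustomMetric returns the value of the first sample matching query.
+func queryCustomMetric(promAddr, query string) (string, error) {
+	resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(query))
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+
+	var pr promResponse
+	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
+		return "", err
+	}
+	if pr.Status != "success" || len(pr.Data.Result) == 0 {
+		return "", fmt.Errorf("no samples for query %q", query)
+	}
+	sample := pr.Data.Result[0]
+	if len(sample.Value) < 2 {
+		return "", fmt.Errorf("malformed sample for query %q", query)
+	}
+	// Prometheus encodes sample values as strings.
+	value, _ := sample.Value[1].(string)
+	return value, nil
+}
+```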
+
+The command that installs cAdvisor+Prometheus should also automatically set up collection
+of the metrics from infrastructure containers. This is possible because the names of the
+infrastructure containers and metrics of interest are part of the Kubernetes control plane
+configuration itself, and because the infrastructure containers export their metrics in
+Prometheus format.
+
+## Appendix: Architecture diagram
+
+### Open-source monitoring pipeline
+
+![Architecture Diagram](monitoring_architecture.png?raw=true "Architecture overview")
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/monitoring_architecture.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/monitoring_architecture.png b/contributors/design-proposals/monitoring_architecture.png
new file mode 100644
index 00000000..570996b7
--- /dev/null
+++ b/contributors/design-proposals/monitoring_architecture.png
Binary files differ
diff --git a/contributors/design-proposals/multi-platform.md b/contributors/design-proposals/multi-platform.md
new file mode 100644
index 00000000..36eacefa
--- /dev/null
+++ b/contributors/design-proposals/multi-platform.md
@@ -0,0 +1,532 @@
+# Kubernetes for multiple platforms
+
+**Author**: Lucas Käldström ([@luxas](https://github.com/luxas))
+
+**Status** (25th of August 2016): Some parts are already implemented; but still there quite a lot of work to be done.
+
+## Abstract
+
+We obviously want Kubernetes to run on as many platforms as possible, in order to make Kubernetes an even more powerful system.
+This is a proposal that explains what should be done in order to achieve a true cross-platform container management system.
+
+Kubernetes is written in Go, and Go code is portable across platforms.
+Docker and rkt are also written in Go, and it's already possible to use them on various platforms.
+When it's possible to run containers on a specific architecture, people also want to use Kubernetes to manage the containers.
+
+In this proposal, a `platform` is defined as `operating system/architecture` or `${GOOS}/${GOARCH}` in Go terms.
+
+The following platforms are proposed to be built for in a Kubernetes release:
+ - linux/amd64
+ - linux/arm (GOARM=6 initially, but we will probably have to bump this to GOARM=7, since most other ARM things are ARMv7)
+ - linux/arm64
+ - linux/ppc64le
+
+If there's interest in running Kubernetes on `linux/s390x` too, it won't require many changes to the source now that we've already laid the groundwork for a multi-platform Kubernetes.
+
+There is also work going on with porting Kubernetes to Windows (`windows/amd64`). See [this issue](https://github.com/kubernetes/kubernetes/issues/22623) for more details.
+
+But note that when porting to a new OS like Windows, a lot of OS-specific changes have to be implemented before the cross-compiling, releasing, and other concerns this document describes apply.
+
+## Motivation
+
+Then the question probably is: Why?
+
+In fact, making it possible to run Kubernetes on other platforms will enable people to create customized and highly-optimized solutions that exactly fit their hardware needs.
+
+Example: [Paypal validates arm64 for real-time data analysis](http://www.datacenterdynamics.com/content-tracks/servers-storage/paypal-successfully-tests-arm-based-servers/93835.fullarticle)
+
+Also, by bringing other platforms into the Kubernetes party, a healthy competition between platforms can/will take place.
+
+Every platform obviously has both pros and cons. By adding the option to make clusters of mixed platforms, the end user may take advantage of the good sides of every platform.
+
+## Use Cases
+
+For a large enterprise where computing power is the king, one may imagine the following combinations:
+ - `linux/amd64`: For running most of the general-purpose computing tasks, cluster addons, etc.
+ - `linux/ppc64le`: For running highly-optimized software; especially massive compute tasks
+ - `windows/amd64`: For running services that are only compatible on windows; e.g. business applications written in C# .NET
+
+For a mid-sized business where efficiency is most important, these could be combinations:
+ - `linux/amd64`: For running most of the general-purpose computing tasks, plus tasks that require very high single-core performance.
+ - `linux/arm64`: For running webservices and high-density tasks => the cluster could autoscale in a way that `linux/amd64` machines could hibernate at night in order to minimize power usage.
+
+For a small business or university, arm is often sufficient:
+ - `linux/arm`: Draws very little power, and can run web sites and app backends efficiently on Scaleway for example.
+
+And last but not least; Raspberry Pi's should be used for [education at universities](http://kubecloud.io/) and are great for **demoing Kubernetes' features at conferences.**
+
+## Main proposal
+
+### Release binaries for all platforms
+
+First and foremost, binaries have to be released for all platforms.
+This affects the build-release tools. Fortunately, this is quite straightforward to implement, once you understand how Go cross-compilation works.
+
+Since Kubernetes' release and build jobs run on `linux/amd64`, binaries have to be cross-compiled and Docker images should be cross-built.
+Builds should be run in a Docker container in order to get reproducible builds, and `gcc` cross-compilers should be installed for all target platforms inside that image (`kube-cross`).
+
+All released binaries should be uploaded to `https://storage.googleapis.com/kubernetes-release/release/${version}/bin/${os}/${arch}/${binary}`
+
+This is a fairly long topic. If you're interested how to cross-compile, see [details about cross-compilation](#cross-compilation-details)
+
+### Support all platforms in a "run everywhere" deployment
+
+The easiest way of running Kubernetes on another architecture at the time of writing is probably by using the docker-multinode deployment. Of course, you may choose whatever deployment you want; the binaries are easily downloadable from the URL above.
+
+[docker-multinode](https://github.com/kubernetes/kube-deploy/tree/master/docker-multinode) is intended to be a "kick-the-tires" multi-platform solution with Docker as the only real dependency (but it's not production ready)
+
+But when we (`sig-cluster-lifecycle`) have standardized the deployments to about three and made them production ready, at least one deployment should support **all platforms**.
+
+### Set up a build and e2e CI's
+
+#### Build CI
+
+Kubernetes should always enforce that all binaries compile.
+**On every PR, `make release` has to be run** in order to ensure that the code proposed to be merged is compatible with all architectures.
+
+For more information, see [conflicts](#conflicts)
+
+#### e2e CI
+
+To ensure all functionality really is working on all other platforms, the community should be able to set up a CI.
+To be able to do that, all the test-specific images have to be ported to multiple architectures, and the test images should preferably be manifest lists.
+If the test images aren't manifest lists, the test code should automatically choose the right image based on the image naming.
+
+IBM volunteered to run continuously running e2e tests for `linux/ppc64le`.
+Still, it's hard to set up such a CI (even on `linux/amd64`), but that work belongs to `kubernetes/test-infra` proposals.
+
+When it's possible to test Kubernetes using Kubernetes; volunteers should be given access to publish their results on `k8s-testgrid.appspot.com`.
+
+### Official support level
+
+When all e2e tests are passing for a given platform; the platform should be officially supported by the Kubernetes team.
+At the time of writing, `amd64` is in the officially supported category.
+
+When a platform is building and it's possible to set up a cluster with the core functionality, the platform is supported on a "best-effort" and experimental basis.
+At the time of writing, `arm`, `arm64` and `ppc64le` are in the experimental category; the e2e tests aren't cross-platform yet.
+
+### Docker image naming and manifest lists
+
+#### Docker manifest lists
+
+Here's a good article about how the "manifest list" in the Docker image [manifest spec v2](https://github.com/docker/distribution/pull/1068) works: [A step towards multi-platform Docker images](https://integratedcode.us/2016/04/22/a-step-towards-multi-platform-docker-images/)
+
+A short summary: A manifest list is a list of Docker images with a single name (e.g. `busybox`), that holds layers for multiple platforms _when it's stored in a registry_.
+When the image is pulled by a client (`docker pull busybox`), only the layers for the client's platform are downloaded.
+Right now we have to write `busybox-${ARCH}` for example instead, but that leads to extra scripting and unnecessary logic.
+
+For reference see [docker/docker#24739](https://github.com/docker/docker/issues/24739) and [appc/docker2aci#193](https://github.com/appc/docker2aci/issues/193)
+
+#### Image naming
+
+There has been quite a lot of debate about how we should name non-amd64 docker images that are pushed to `gcr.io`. See [#23059](https://github.com/kubernetes/kubernetes/pull/23059) and [#23009](https://github.com/kubernetes/kubernetes/pull/23009).
+
+This means that the naming `gcr.io/google_containers/${binary}:${version}` should contain a _manifest list_ for future tags.
+The manifest list thereby becomes a wrapper that is pointing to the `-${arch}` images.
+This requires `docker-1.10` or newer, which probably means Kubernetes v1.4 and higher.
+
+TL;DR;
+ - `${binary}-${arch}:${version}` images should be pushed for all platforms
+ - `${binary}:${version}` images should point to the `-${arch}`-specific ones, and docker will then download the right image.
+
+### Components should expose their platform
+
+It should be possible to run clusters with mixed platforms smoothly. After all, bringing heterogeneous machines together into a single unit (a cluster) is one of Kubernetes' greatest strengths. And since the Kubernetes components communicate over HTTP, two binaries of different architectures may talk to each other normally.
+
+The crucial thing here is that the components that handle platform-specific tasks (e.g. kubelet) should expose their platform. In the kubelet case, we've initially solved it by exposing the labels `beta.kubernetes.io/{os,arch}` on every node. This way a user may run binaries for different platforms on a multi-platform cluster, but it still requires manual work to apply the label to every manifest.
+
+Also, [the apiserver now exposes](https://github.com/kubernetes/kubernetes/pull/19905) its platform at `GET /version`. But note that the value exposed at `/version` is only the apiserver's platform; there might be kubelets of various other platforms.
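+
+For illustration, a component can report its own platform with nothing more than the
+standard `runtime` package; the handler path and port below mirror the `/version` idea but
+are placeholders, not the apiserver's actual implementation:
+
+```go
+// Sketch: expose the binary's own platform over HTTP, similar in spirit to
+// the apiserver's GET /version. Path and port are illustrative only.
+package main
+
+import (
+	"encoding/json"
+	"net/http"
+	"runtime"
+)
+
+func main() {
+	http.HandleFunc("/version", func(w http.ResponseWriter, r *http.Request) {
+		json.NewEncoder(w).Encode(map[string]string{
+			"platform": runtime.GOOS + "/" + runtime.GOARCH,
+		})
+	})
+	http.ListenAndServe(":8080", nil)
+}
+```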
+
+### Standardize all image Makefiles to follow the same pattern
+
+All Makefiles should push for all platforms when doing `make push`, and build for all platforms when doing `make build`.
+Under the hood, they should compile binaries in a container for reproducibility, and use QEMU for emulating Dockerfile `RUN` commands if necessary.
+
+### Remove linux/amd64 hard-codings from the codebase
+
+All places where `linux/amd64` is hardcoded in the codebase should be rewritten.
+
+#### Make kubelet automatically use the right pause image
+
+The `pause` image is used for connecting containers into Pods. It's a binary that just sleeps forever.
+When Kubernetes starts up a Pod, it first starts a `pause` container, and lets all "real" containers join the same network by setting `--net=${pause_container_id}`.
+
+So in order to start Kubernetes Pods on any other architecture, an ever-sleeping image has to exist for that architecture.
+
+Fortunately, `kubelet` has the `--pod-infra-container-image` option, and it has been used when running Kubernetes on other platforms.
+
+But relying on the deployment setup to specify the right image for the platform isn't great; the kubelet should be smarter than that.
+
+This specific problem has been fixed in [#23059](https://github.com/kubernetes/kubernetes/pull/23059).
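+
+As a rough sketch of the idea behind that fix (the actual kubelet implementation may differ, and the image tag below is just an example), the default pause image name can be derived from the architecture the kubelet binary was compiled for:
+
+```go
+// pauseimage.go - illustrative sketch; image name and tag are examples only.
+package main
+
+import (
+    "fmt"
+    "runtime"
+)
+
+// defaultPauseImage returns an arch-suffixed pause image, e.g.
+// gcr.io/google_containers/pause-arm:3.0 when compiled for linux/arm.
+func defaultPauseImage() string {
+    return fmt.Sprintf("gcr.io/google_containers/pause-%s:3.0", runtime.GOARCH)
+}
+
+func main() {
+    fmt.Println(defaultPauseImage())
+}
+```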
+
+#### Vendored packages
+
+Here are two common problems a vendored package might have when we try to add or update it:
+ - Including constants combined with build tags
+
+```go
+// +build linux,amd64
+const AnAmd64OnlyConstant = 123
+```
+
+ - Relying on platform-specific syscalls (e.g. `syscall.Dup2`)
+
+If someone tries to add a dependency that has one of these problems, the CI will catch it and block the PR until the author has updated the vendored repo and fixed the problem.
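+
+For the second problem, a common workaround is to isolate the platform-specific syscall behind build tags, one file per platform. Below is a minimal, illustrative sketch (package and file names are made up); a sibling file with the opposite build tag would keep calling `syscall.Dup2` on the architectures that still provide it:
+
+```go
+// dup_linux_arm64.go (illustrative file and package names)
+
+// +build linux,arm64
+
+package fdutil
+
+import "syscall"
+
+// dupFD duplicates oldfd onto newfd. linux/arm64 has no dup2 syscall, so Dup3
+// is the closest replacement (unlike dup2, it fails if oldfd == newfd).
+func dupFD(oldfd, newfd int) error {
+    return syscall.Dup3(oldfd, newfd, 0)
+}
+```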
+
+### kubectl should be released for all platforms that are relevant
+
+kubectl is released for more platforms than the proposed server platforms. If you want to check out an up-to-date list of them, [see here](../../hack/lib/golang.sh).
+
+kubectl is trivial to cross-compile, so if there's interest in adding a new platform for it, it may be as easy as appending the platform to the list linked above.
+
+### Addons
+
+Addons like dns, heapster and ingress play a big role in a working Kubernetes cluster, and we should aim to be able to deploy these addons on multiple platforms too.
+
+`kube-dns`, `dashboard` and `addon-manager` are the most important images, and they are already ported for multiple platforms.
+
+These addons should also be converted to multiple platforms:
+ - heapster, influxdb + grafana
+ - nginx-ingress
+ - elasticsearch, fluentd + kibana
+ - registry
+
+### Conflicts
+
+What should we do if there's a conflict between keeping e.g. `linux/ppc64le` builds vs. merging a release blocker?
+
+In fact, we faced this problem while this proposal was being written, in [#25243](https://github.com/kubernetes/kubernetes/pull/25243). It is quite obvious that the release blocker is of higher priority.
+
+However, before temporarily [deactivating builds](https://github.com/kubernetes/kubernetes/commit/2c9b83f291e3e506acc3c08cd10652c255f86f79), the author of the breaking PR should first try to fix the problem. If it turns out to be really hard to solve, builds for the affected platform may be deactivated and a P1 issue should be made to activate them again.
+
+## Cross-compilation details (for reference)
+
+### Go language details
+
+Go 1.5 introduced many changes. To name a few that are relevant to Kubernetes:
+ - C was eliminated from the tree (it was earlier used for the bootstrap runtime).
+ - All processors are used by default, which means we should be able to remove [lines like this one](https://github.com/kubernetes/kubernetes/blob/v1.2.0/cmd/kubelet/kubelet.go#L37)
+ - The garbage collector became more efficient (but also [confused our latency test](https://github.com/golang/go/issues/14396)).
+ - `linux/arm64` and `linux/ppc64le` were added as new ports.
+ - The `GO15VENDOREXPERIMENT` was started. We switched from `Godeps/_workspace` to the native `vendor/` in [this PR](https://github.com/kubernetes/kubernetes/pull/24242).
+ - It's not required to pre-build the whole standard library `std` when cross-compiling. [Details](#prebuilding-the-standard-library-std)
+ - Builds are approximately twice as slow as earlier. That affects the CI. [Details](#releasing)
+ - The native Go DNS resolver will suffice in most situations. This makes static linking much easier.
+
+All release notes for Go 1.5 [are here](https://golang.org/doc/go1.5)
+
+Go 1.6 didn't introduce as many changes as Go 1.5 did, but here are some of note:
+ - It should perform a little bit better than Go 1.5.
+ - `linux/mips64` and `linux/mips64le` were added as new ports.
+ - Go < 1.6.2 for `ppc64le` had [bugs in it](https://github.com/kubernetes/kubernetes/issues/24922).
+
+All release notes for Go 1.6 [are here](https://golang.org/doc/go1.6)
+
+In Kubernetes 1.2, the only supported Go version was `1.4.2`, so `linux/arm` was the only possible extra architecture: [#19769](https://github.com/kubernetes/kubernetes/pull/19769).
+In Kubernetes 1.3, [we upgraded to Go 1.6](https://github.com/kubernetes/kubernetes/pull/22149), which made it possible to build Kubernetes for even more architectures [#23931](https://github.com/kubernetes/kubernetes/pull/23931).
+
+#### The `sync/atomic` bug on 32-bit platforms
+
+From https://golang.org/pkg/sync/atomic/#pkg-note-BUG:
+> On both ARM and x86-32, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a global variable or in an allocated struct or slice can be relied upon to be 64-bit aligned.
+
+`etcd` has had [issues](https://github.com/coreos/etcd/issues/2308) with this. See [how to fix it here](https://github.com/coreos/etcd/pull/3249)
+
+```go
+// 32-bit-atomic-bug.go
+package main
+import "sync/atomic"
+
+type a struct {
+ b chan struct{}
+ c int64
+}
+
+func main(){
+ d := a{}
+ atomic.StoreInt64(&d.c, 10 * 1000 * 1000 * 1000)
+}
+```
+
+```console
+$ GOARCH=386 go build 32-bit-atomic-bug.go
+$ file 32-bit-atomic-bug
+32-bit-atomic-bug: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
+$ ./32-bit-atomic-bug
+panic: runtime error: invalid memory address or nil pointer dereference
+[signal 0xb code=0x1 addr=0x0 pc=0x808cd9b]
+
+goroutine 1 [running]:
+panic(0x8098de0, 0x1830a038)
+ /usr/local/go/src/runtime/panic.go:481 +0x326
+sync/atomic.StoreUint64(0x1830e0f4, 0x540be400, 0x2)
+ /usr/local/go/src/sync/atomic/asm_386.s:190 +0xb
+main.main()
+ /tmp/32-bit-atomic-bug.go:11 +0x4b
+```
+
+This means that all structs should keep their `int64` and `uint64` fields at the top of the struct to be safe. If we moved `a.c` to the top of the `a` struct above, the operation would succeed.
+
+The bug affects 32-bit platforms when a `(u)int64` field is accessed with an `atomic` function.
+It would be great to write a tool that checks that all atomically accessed fields are aligned at the top of the struct, but it's hard: [coreos/etcd#5027](https://github.com/coreos/etcd/issues/5027).
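+
+For reference, here is the example above with the 64-bit field moved to the top of the struct; built with `GOARCH=386` it runs without panicking:
+
+```go
+// 32-bit-atomic-fixed.go
+package main
+
+import (
+    "fmt"
+    "sync/atomic"
+)
+
+type a struct {
+    c int64         // 64-bit fields first, so they stay 8-byte aligned on 32-bit platforms
+    b chan struct{}
+}
+
+func main() {
+    d := a{}
+    atomic.StoreInt64(&d.c, 10*1000*1000*1000)
+    fmt.Println(atomic.LoadInt64(&d.c))
+}
+```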
+
+## Prebuilding the Go standard library (`std`)
+
+There is a great blog post [describing this](https://medium.com/@rakyll/go-1-5-cross-compilation-488092ba44ec#.5jcd0owem).
+
+Before Go 1.5, the whole Go project had to be cross-compiled from source for **all** platforms that _might_ be used, and that was quite a slow process:
+
+```console
+# From build-tools/build-image/cross/Dockerfile when we used Go 1.4
+$ cd /usr/src/go/src
+$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} ./make.bash --no-clean; done
+```
+
+With Go 1.5+, cross-compiling the Go repository isn't required anymore. Go will automatically cross-compile the `std` packages that are used by the code being compiled, _and throw the result away after the compilation_.
+If you cross-compile multiple times, Go will build parts of `std`, throw them away, compile parts of them again, throw those away and so on.
+
+However, there is an easy way of cross-compiling all `std` packages in advance with Go 1.5+:
+
+```console
+# From build-tools/build-image/cross/Dockerfile when we're using Go 1.5+
+$ for platform in ${PLATFORMS}; do GOOS=${platform%/*} GOARCH=${platform##*/} go install std; done
+```
+
+### Static cross-compilation
+
+Static compilation with Go 1.5+ is dead easy:
+
+```go
+// main.go
+package main
+import "fmt"
+func main() {
+ fmt.Println("Hello Kubernetes!")
+}
+```
+
+```console
+$ go build main.go
+$ file main
+main: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
+$ GOOS=linux GOARCH=arm go build main.go
+$ file main
+main: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
+```
+
+The only thing you have to do is change the `GOARCH` and `GOOS` variables. Here's a list of valid values for [GOOS/GOARCH](https://golang.org/doc/install/source#environment)
+
+#### Static compilation with `net`
+
+Consider this:
+
+```go
+// main-with-net.go
+package main
+import "net"
+import "fmt"
+func main() {
+ fmt.Println(net.ParseIP("10.0.0.10").String())
+}
+```
+
+```console
+$ go build main-with-net.go
+$ file main-with-net
+main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
+ interpreter /lib64/ld-linux-x86-64.so.2, not stripped
+$ GOOS=linux GOARCH=arm go build main-with-net.go
+$ file main-with-net
+main-with-net: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
+```
+
+Wait, what? Just because we included `net` from the `std` library, the binary defaults to being dynamically linked when the target platform equals the host platform?
+Let's take a look at `go env` to get a clue why this happens:
+
+```console
+$ go env
+GOARCH="amd64"
+GOHOSTARCH="amd64"
+GOHOSTOS="linux"
+GOOS="linux"
+GOPATH="/go"
+GOROOT="/usr/local/go"
+GO15VENDOREXPERIMENT="1"
+CC="gcc"
+CXX="g++"
+CGO_ENABLED="1"
+```
+
+See the `CGO_ENABLED=1` at the end? That's where compilation for the host and cross-compilation differ. By default, Go will link statically if no `cgo` code is involved. `net` is one of the packages that prefer `cgo` but don't depend on it.
+
+When cross-compiling on the other hand, `CGO_ENABLED` is set to `0` by default.
+
+To always be safe, run this when compiling statically:
+
+```console
+$ CGO_ENABLED=0 go build -a -installsuffix cgo main-with-net.go
+$ file main-with-net
+main-with-net: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
+```
+
+See [golang/go#9344](https://github.com/golang/go/issues/9344) for more details.
+
+### Dynamic cross-compilation
+
+In order to dynamically compile a go binary with `cgo`, we need `gcc` installed at build time.
+
+The only Kubernetes binary that uses C code is the `kubelet`, or in fact `cAdvisor`, on which `kubelet` depends. `hyperkube` is also dynamically linked as long as `kubelet` is. We should aim to make `kubelet` statically linked.
+
+The normal `x86_64-linux-gnu` gcc can't cross-compile binaries, so we have to install gcc cross-compilers for every platform. We do this in the [`kube-cross`](../../build-tools/build-image/cross/Dockerfile) image,
+and depend on the [`emdebian.org` repository](https://wiki.debian.org/CrossToolchains). Depending on `emdebian` isn't ideal, so we should consider using the latest `gcc` cross-compiler packages from the `ubuntu` main repositories in the future.
+
+Here's an example when cross-compiling plain C code:
+
+```c
+// main.c
+#include <stdio.h>
+int main(void)
+{
+ printf("Hello Kubernetes!\n");
+ return 0;
+}
+```
+
+```console
+$ arm-linux-gnueabi-gcc -o main-c main.c
+$ file main-c
+main-c: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
+ interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
+```
+
+And here's an example when cross-compiling `go` and `c`:
+
+```go
+// main-cgo.go
+package main
+/*
+char* sayhello(void) { return "Hello Kubernetes!"; }
+*/
+import "C"
+import "fmt"
+func main() {
+ fmt.Println(C.GoString(C.sayhello()))
+}
+```
+
+```console
+$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOOS=linux GOARCH=arm go build main-cgo.go
+$ file main-cgo
+./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked,
+ interpreter /lib/ld-linux.so.3, for GNU/Linux 2.6.32, not stripped
+```
+
+The bad thing with dynamic compilation is that it adds an unnecessary dependency on `glibc` _at runtime_.
+
+### Static compilation with CGO code
+
+Lastly, it's even possible to cross-compile `cgo` code _statically_:
+
+```console
+$ CGO_ENABLED=1 CC=arm-linux-gnueabi-gcc GOARCH=arm go build -ldflags '-extldflags "-static"' main-cgo.go
+$ file main-cgo
+./main-cgo: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked,
+ for GNU/Linux 2.6.32, not stripped
+```
+
+This is especially useful if we want to include the binary in a container.
+If the binary is statically compiled, we may use `busybox` or even `scratch` as the base image.
+This should be the preferred way of compiling binaries that strictly require C code to be a part of it.
+
+#### GOARM
+
+32-bit ARM comes in multiple flavours: mainly ARMv5, ARMv6 and ARMv7. Go has the `GOARM` environment variable that controls which version of ARM Go should target. Here's a table of all ARM versions and how they play together:
+
+ARM Version | GOARCH | GOARM | GCC package | No. of bits
+----------- | ------ | ----- | ----------- | -----------
+ARMv5 | arm | 5 | armel | 32-bit
+ARMv6 | arm | 6 | - | 32-bit
+ARMv7 | arm | 7 | armhf | 32-bit
+ARMv8 | arm64 | - | aarch64 | 64-bit
+
+The compatibility between the versions is pretty straightforward: ARMv5 binaries may run on ARMv7 hosts, but not vice versa.
+
+## Cross-building docker images for linux
+
+After binaries have been cross-compiled, they should be distributed in some manner.
+
+The default and maybe the most intuitive way of doing this is by packaging them in a docker image.
+
+### Trivial Dockerfile
+
+All `Dockerfile` commands except for `RUN` work for any architecture without any modification.
+The base image has to be switched to an arch-specific one, but apart from that, a cross-built image is only a `docker build` away.
+
+```Dockerfile
+FROM armel/busybox
+ENV kubernetes=true
+COPY kube-apiserver /usr/local/bin/
+CMD ["/usr/local/bin/kube-apiserver"]
+```
+
+```console
+$ file kube-apiserver
+kube-apiserver: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked, not stripped
+$ docker build -t gcr.io/google_containers/kube-apiserver-arm:v1.x.y .
+Step 1 : FROM armel/busybox
+ ---> 9bb1e6d4f824
+Step 2 : ENV kubernetes true
+ ---> Running in 8a1bfcb220ac
+ ---> e4ef9f34236e
+Removing intermediate container 8a1bfcb220ac
+Step 3 : COPY kube-apiserver /usr/local/bin/
+ ---> 3f0c4633e5ac
+Removing intermediate container b75a054ab53c
+Step 4 : CMD /usr/local/bin/kube-apiserver
+ ---> Running in 4e6fe931a0a5
+ ---> 28f50e58c909
+Removing intermediate container 4e6fe931a0a5
+Successfully built 28f50e58c909
+```
+
+### Complex Dockerfile
+
+However, in most cases, `RUN` statements are needed when building the image.
+
+The `RUN` statement invokes `/bin/sh` inside the container, but in this example, `/bin/sh` is an ARM binary, which can't execute on an `amd64` processor.
+
+#### QEMU to the rescue
+
+Here's a way to run ARM Docker images on an amd64 host by using `qemu`:
+
+```console
+# Register other architectures' magic numbers in the binfmt_misc kernel module, so it's possible to run foreign binaries
+$ docker run --rm --privileged multiarch/qemu-user-static:register --reset
+# Download qemu 2.5.0
+$ curl -sSL https://github.com/multiarch/qemu-user-static/releases/download/v2.5.0/x86_64_qemu-arm-static.tar.xz \
+ | tar -xJ
+# Run a foreign docker image, and inject the amd64 qemu binary for translating all syscalls
+$ docker run -it -v $(pwd)/qemu-arm-static:/usr/bin/qemu-arm-static armel/busybox /bin/sh
+
+# Now we're inside an ARM container although we're running on an amd64 host
+$ uname -a
+Linux 0a7da80f1665 4.2.0-25-generic #30-Ubuntu SMP Mon Jan 18 12:31:50 UTC 2016 armv7l GNU/Linux
+```
+
+Here the Linux kernel module `binfmt_misc` registers the "magic numbers" of foreign binary formats in the kernel, so the kernel can detect which architecture a binary is built for and prepend the invocation with `/usr/bin/qemu-(arm|aarch64|ppc64le)-static`. For example, `/usr/bin/qemu-arm-static` is a statically linked `amd64` binary that translates all ARM syscalls to `amd64` syscalls.
+
+The multiarch guys have done a great job here; you may find the source for this and other images on [GitHub](https://github.com/multiarch)
+
+
+## Implementation
+
+## History
+
+32-bit ARM (`linux/arm`) was the first platform Kubernetes was ported to, and luxas' project [`Kubernetes on ARM`](https://github.com/luxas/kubernetes-on-arm) (released on GitHub at the end of September 2015)
+served as an easy way of running Kubernetes on ARM devices.
+On the 30th of November 2015, a tracking issue about making Kubernetes run on ARM was opened: [#17981](https://github.com/kubernetes/kubernetes/issues/17981). It later shifted focus to how to make Kubernetes a more platform-independent system.
+
+On the 27th of April 2016, Kubernetes `v1.3.0-alpha.3` was released, and it became the first release able to run the [docker getting started guide](http://kubernetes.io/docs/getting-started-guides/docker/) on `linux/amd64`, `linux/arm`, `linux/arm64` and `linux/ppc64le` without any modification.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multi-platform.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/multiple-schedulers.md b/contributors/design-proposals/multiple-schedulers.md
new file mode 100644
index 00000000..4f675b10
--- /dev/null
+++ b/contributors/design-proposals/multiple-schedulers.md
@@ -0,0 +1,138 @@
+# Multi-Scheduler in Kubernetes
+
+**Status**: Design & Implementation in progress.
+
+> Contact @HaiyangDING for questions & suggestions.
+
+## Motivation
+
+In the current Kubernetes design, there is only one default scheduler in a Kubernetes cluster.
+However it is common that multiple types of workload, such as traditional batch, DAG batch, streaming and user-facing production services,
+are running in the same cluster and they need to be scheduled in different ways. For example, in
+[Omega](http://research.google.com/pubs/pub41684.html) batch workload and service workload are scheduled by two types of schedulers:
+the batch workload is scheduled by a scheduler which looks at the current usage of the cluster to improve the resource usage rate
+and the service workload is scheduled by another one which considers the reserved resources in the
+cluster and many other constraints since their performance must meet some higher SLOs.
+[Mesos](http://mesos.apache.org/) has done great work to support multiple schedulers by building a
+two-level scheduling structure. This proposal describes how Kubernetes is going to support multiple schedulers
+so that users are able to run their own scheduler(s) to enable customized scheduling
+behavior as they need. As previously discussed in [#11793](https://github.com/kubernetes/kubernetes/issues/11793),
+[#9920](https://github.com/kubernetes/kubernetes/issues/9920) and [#11470](https://github.com/kubernetes/kubernetes/issues/11470),
+the design of the multi-scheduler support should be generic and include adding a scheduler name annotation to separate the pods.
+It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets
+set although it is reasonable to anticipate that it would be set by a component like admission controller/initializer,
+as the doc currently does.
+
+Before going into the details of this proposal, here is a list of methods to extend the scheduler:
+
+- Write your own scheduler and run it along with Kubernetes native scheduler. This is going to be detailed in this proposal
+- Use the callout approach such as the one implemented in [#13580](https://github.com/kubernetes/kubernetes/issues/13580)
+- Recompile the scheduler with a new policy
+- Restart the scheduler with a new [scheduler policy config file](../../examples/scheduler-policy-config.json)
+- Or maybe in future dynamically link a new policy into the running scheduler
+
+## Challenges in multiple schedulers
+
+- Separating the pods
+
+ Each pod should be scheduled by only one scheduler. As for implementation, a pod should
+ have an additional field to tell by which scheduler it wants to be scheduled. Besides,
+ each scheduler, including the default one, should have a unique logic of how to add unscheduled
+ pods to its to-be-scheduled pod queue. Details will be explained in later sections.
+
+- Dealing with conflicts
+
+ Different schedulers are essentially separated processes. When all schedulers try to schedule
+ their pods onto the nodes, there might be conflicts.
+
+ One example of such a conflict is resource racing: suppose there is a `pod1` scheduled by
+ `my-scheduler` with a *request* of 1 CPU, and a `pod2` scheduled by `kube-scheduler` (the Kubernetes native
+ scheduler, acting as the default scheduler) with a *request* of 2 CPUs, while `node-a` only has 2.5
+ free CPUs. If both schedulers try to put their pods on `node-a`, one of them will eventually
+ fail when the Kubelet on `node-a` performs the create action, due to insufficient CPU resources.
+
+ This conflict is complex to deal with in the api-server and etcd. Our current solution is to let the Kubelet
+ do the conflict check: if a conflict happens, the affected pods are put back to the scheduler
+ and wait to be scheduled again. Implementation details are in later sections.
+
+## Where to start: initial design
+
+We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes
+we want to make in the first step.
+
+- Add an annotation in pod template: `scheduler.alpha.kubernetes.io/name: scheduler-name`, this is used to
+separate pods between schedulers. `scheduler-name` should match one of the schedulers' `scheduler-name`
+- Add a `scheduler-name` to each scheduler. It is either hardcoded or passed as a command-line argument. The
+Kubernetes native scheduler (now the `kube-scheduler` process) would have the name `kube-scheduler`
+- The `scheduler-name` plays an important part in separating the pods between different schedulers.
+Pods are statically dispatched to different schedulers based on `scheduler.alpha.kubernetes.io/name: scheduler-name`
+annotation and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must
+NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if:
+ 1. The pod has no nodeName, **AND**
+ 2. The `scheduler-name` specified in the pod's annotation `scheduler.alpha.kubernetes.io/name: scheduler-name`
+ matches the `scheduler-name` of the scheduler.
+
+ The only one exception is the default scheduler. Any pod that has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
+ annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature,
+ the default scheduler would be the Kubernetes built-in scheduler with `scheduler-name` as `kube-scheduler`.
+ The Kubernetes built-in scheduler will claim any pod which has no `scheduler.alpha.kubernetes.io/name: scheduler-name`
+ annotation or which has `scheduler.alpha.kubernetes.io/name: kube-scheduler`. In the future, it may be possible to
+ change which scheduler is the default for a given cluster.
+
+- Dealing with conflicts. All schedulers must use predicate functions that are at least as strict as
+the ones that Kubelet applies when deciding whether to accept a pod, otherwise Kubelet and scheduler
+may get into an infinite loop where Kubelet keeps rejecting a pod and scheduler keeps re-scheduling
+it back the same node. To make it easier for people who write new schedulers to obey this rule, we will
+create a library containing the predicates Kubelet uses. (See issue [#12744](https://github.com/kubernetes/kubernetes/issues/12744).)
+
+In summary, in the initial version of this multi-scheduler design, we will achieve the following:
+
+- If a pod has the annotation `scheduler.alpha.kubernetes.io/name: kube-scheduler`, or the user does not explicitly
+set this annotation in the template, it will be picked up by the default scheduler
+- If the annotation is set and refers to a valid `scheduler-name`, it will be picked up by the scheduler of
+specified `scheduler-name`
+- If the annotation is set but refers to an invalid `scheduler-name`, the pod will not be picked up by any scheduler.
+The pod will remain Pending.
+
+### An example
+
+```yaml
+ kind: Pod
+ apiVersion: v1
+ metadata:
+ name: pod-abc
+ labels:
+ foo: bar
+ annotations:
+ scheduler.alpha.kubernetes.io/name: my-scheduler
+```
+
+This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler
+of name "my-scheduler", the pod will never be scheduled.
+
+## Next steps
+
+1. Use admission controller to add and verify the annotation, and do some modification if necessary. For example, the
+admission controller might add the scheduler annotation based on the namespace of the pod, and/or identify if
+there are conflicting rules, and/or set a default value for the scheduler annotation, and/or reject pods on
+which the client has set a scheduler annotation that does not correspond to a running scheduler.
+2. Dynamically launching scheduler(s) and registering them with the admission controller (as an external call). This also
+requires some work on authorization and authentication to control what schedulers can write the /binding
+subresource of which pods.
+3. Optimize the behavior of priority functions in the multi-scheduler scenario. In the case where multiple schedulers have
+the same predicate and priority functions (for example, when using multiple schedulers for parallelism rather than to
+customize the scheduling policies), all schedulers would tend to pick the same node as "best" when scheduling identical
+pods and therefore would be likely to conflict on the Kubelet. To solve this problem, we can pass
+an optional flag such as `--randomize-node-selection=N` to the scheduler; setting this flag would cause the scheduler to pick
+randomly among the top N nodes instead of the one with the highest score.
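+
+A sketch of what picking randomly among the top N nodes could look like (the `--randomize-node-selection=N` flag above is only proposed; this code is illustrative, not the scheduler's implementation):
+
+```go
+// randomize.go - illustrative: pick a node at random among the N highest-scored ones.
+package main
+
+import (
+    "fmt"
+    "math/rand"
+)
+
+// pickNode assumes sortedNodes is ordered by descending priority score (as the
+// scheduler's priority phase produces) and returns a random node among the top n.
+func pickNode(sortedNodes []string, n int) string {
+    if n > len(sortedNodes) {
+        n = len(sortedNodes)
+    }
+    return sortedNodes[rand.Intn(n)]
+}
+
+func main() {
+    nodes := []string{"node-a", "node-b", "node-c"} // best score first
+    fmt.Println(pickNode(nodes, 2))                 // node-a or node-b, never node-c
+}
+```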
+
+## Other issues/discussions related to scheduler design
+
+- [#13580](https://github.com/kubernetes/kubernetes/pull/13580): scheduler extension
+- [#17097](https://github.com/kubernetes/kubernetes/issues/17097): policy config file in pod template
+- [#16845](https://github.com/kubernetes/kubernetes/issues/16845): scheduling groups of pods
+- [#17208](https://github.com/kubernetes/kubernetes/issues/17208): guide to writing a new scheduler
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/multiple-schedulers.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/namespaces.md b/contributors/design-proposals/namespaces.md
new file mode 100644
index 00000000..8a9c97c8
--- /dev/null
+++ b/contributors/design-proposals/namespaces.md
@@ -0,0 +1,370 @@
+# Namespaces
+
+## Abstract
+
+A Namespace is a mechanism to partition resources created by users into
+a logically named group.
+
+## Motivation
+
+A single cluster should be able to satisfy the needs of multiple user
+communities.
+
+Each user community wants to be able to work in isolation from other
+communities.
+
+Each user community has its own:
+
+1. resources (pods, services, replication controllers, etc.)
+2. policies (who can or cannot perform actions in their community)
+3. constraints (this community is allowed this much quota, etc.)
+
+A cluster operator may create a Namespace for each unique user community.
+
+The Namespace provides a unique scope for:
+
+1. named resources (to avoid basic naming collisions)
+2. delegated management authority to trusted users
+3. ability to limit community resource consumption
+
+## Use cases
+
+1. As a cluster operator, I want to support multiple user communities on a
+single cluster.
+2. As a cluster operator, I want to delegate authority to partitions of the
+cluster to trusted users in those communities.
+3. As a cluster operator, I want to limit the amount of resources each
+community can consume in order to limit the impact to other communities using
+the cluster.
+4. As a cluster user, I want to interact with resources that are pertinent to
+my user community in isolation of what other user communities are doing on the
+cluster.
+
+## Design
+
+### Data Model
+
+A *Namespace* defines a logically named group for multiple *Kind*s of resources.
+
+```go
+type Namespace struct {
+ TypeMeta `json:",inline"`
+ ObjectMeta `json:"metadata,omitempty"`
+
+ Spec NamespaceSpec `json:"spec,omitempty"`
+ Status NamespaceStatus `json:"status,omitempty"`
+}
+```
+
+A *Namespace* name is a DNS compatible label.
+
+A *Namespace* must exist prior to associating content with it.
+
+A *Namespace* must not be deleted if there is content associated with it.
+
+To associate a resource with a *Namespace* the following conditions must be
+satisfied:
+
+1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with
+the server
+2. The resource's *TypeMeta.Namespace* field must have a value that references
+an existing *Namespace*
+
+The *Name* of a resource associated with a *Namespace* is unique to that *Kind*
+in that *Namespace*.
+
+It is intended to be used in resource URLs; provided by clients at creation
+time, and encouraged to be human friendly; intended to facilitate idempotent
+creation, space-uniqueness of singleton objects, distinguish distinct entities,
+and reference particular entities across operations.
+
+### Authorization
+
+A *Namespace* provides an authorization scope for accessing content associated
+with the *Namespace*.
+
+See [Authorization plugins](../admin/authorization.md)
+
+### Limit Resource Consumption
+
+A *Namespace* provides a scope to limit resource consumption.
+
+A *LimitRange* defines min/max constraints on the amount of resources a single
+entity can consume in a *Namespace*.
+
+See [Admission control: Limit Range](admission_control_limit_range.md)
+
+A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and
+allows cluster operators to define *Hard* resource usage limits that a
+*Namespace* may consume.
+
+See [Admission control: Resource Quota](admission_control_resource_quota.md)
+
+### Finalizers
+
+Upon creation of a *Namespace*, the creator may provide a list of *Finalizer*
+objects.
+
+```go
+type FinalizerName string
+
+// These are internal finalizers to Kubernetes, must be qualified name unless defined here
+const (
+ FinalizerKubernetes FinalizerName = "kubernetes"
+)
+
+// NamespaceSpec describes the attributes on a Namespace
+type NamespaceSpec struct {
+ // Finalizers is an opaque list of values that must be empty to permanently remove object from storage
+ Finalizers []FinalizerName
+}
+```
+
+A *FinalizerName* is a qualified name.
+
+The API Server enforces that a *Namespace* can be deleted from storage if
+and only if its *Namespace.Spec.Finalizers* list is empty.
+
+A *finalize* operation is the only mechanism to modify the
+*Namespace.Spec.Finalizers* field post creation.
+
+Each *Namespace* created has *kubernetes* as an item in its list of initial
+*Namespace.Spec.Finalizers* set by default.
+
+### Phases
+
+A *Namespace* may exist in the following phases.
+
+```go
+type NamespacePhase string
+const(
+ NamespaceActive NamespacePhase = "Active"
+ NamespaceTerminating NamespacePhase = "Terminating"
+)
+
+type NamespaceStatus struct {
+ ...
+ Phase NamespacePhase
+}
+```
+
+A *Namespace* is in the **Active** phase if it does not have an
+*ObjectMeta.DeletionTimestamp*.
+
+A *Namespace* is in the **Terminating** phase if it has an
+*ObjectMeta.DeletionTimestamp*.
+
+**Active**
+
+Upon creation, a *Namespace* goes into the *Active* phase. This means that content
+may be associated with a namespace, and all normal interactions with the
+namespace are allowed to occur in the cluster.
+
+If a DELETE request occurs for a *Namespace*, the
+*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
+*namespace controller* observes the change, and sets the
+*Namespace.Status.Phase* to *Terminating*.
+
+**Terminating**
+
+A *namespace controller* watches for *Namespace* objects that have a
+*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
+initiate graceful termination of the *Namespace* associated content that are
+known to the cluster.
+
+The *namespace controller* enumerates each known resource type in that namespace
+and deletes it one by one.
+
+Admission control blocks creation of new resources in that namespace in order to
+prevent a race-condition where the controller could believe all of a given
+resource type had been deleted from the namespace, when in fact some other rogue
+client agent had created new objects. Using admission control in this scenario
+allows each of the registry implementations for the individual objects to not
+need to take Namespace life-cycle into account.
+
+Once all objects known to the *namespace controller* have been deleted, the
+*namespace controller* executes a *finalize* operation on the namespace that
+removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.
+
+If the *namespace controller* sees a *Namespace* whose
+*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers*
+list is empty, it will signal the server to permanently remove the *Namespace*
+from storage by sending a final DELETE action to the API server.
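+
+A condensed sketch of that last check in the controller's sync loop (field names follow the Go types above; the function itself is illustrative, not the real controller code):
+
+```go
+// namespacegc.go - illustrative sketch of the namespace controller's final check.
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+type namespace struct {
+    deletionTimestamp *time.Time
+    finalizers        []string
+}
+
+// readyForDeletion reports whether the controller should send the final DELETE
+// to the API server: a deletion timestamp is set and every finalizer
+// (e.g. "kubernetes") has already been removed via the finalize operation.
+func readyForDeletion(ns namespace) bool {
+    return ns.deletionTimestamp != nil && len(ns.finalizers) == 0
+}
+
+func main() {
+    now := time.Now()
+    fmt.Println(readyForDeletion(namespace{deletionTimestamp: &now}))                                     // true
+    fmt.Println(readyForDeletion(namespace{deletionTimestamp: &now, finalizers: []string{"kubernetes"}})) // false
+}
+```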
+
+### REST API
+
+To interact with the Namespace API:
+
+| Action | HTTP Verb | Path | Description |
+| ------ | --------- | ---- | ----------- |
+| CREATE | POST | /api/{version}/namespaces | Create a namespace |
+| LIST | GET | /api/{version}/namespaces | List all namespaces |
+| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} |
+| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} |
+| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} |
+| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces |
+
+This specification reserves the name *finalize* as a sub-resource to namespace.
+
+As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*.
+
+To interact with content associated with a Namespace:
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
+| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
+| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
+| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
+| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
+| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
+| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
+| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
+
+The API server verifies the *Namespace* on resource creation matches the
+*{namespace}* on the path.
+
+The API server will associate a resource with a *Namespace* if it is not populated by
+the end-user, based on the *Namespace* context of the incoming request. If the
+*Namespace* of the resource being created or updated does not match the
+*Namespace* on the request, then the API server will reject the request.
+
+### Storage
+
+A namespace provides a unique identifier space and therefore must be in the
+storage path of a resource.
+
+In etcd, we want to continue to support efficient WATCH across namespaces.
+
+Resources that persist content in etcd will have storage paths as follows:
+
+/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
+
+This enables consumers to WATCH /registry/{resourceType} for changes across
+namespace of a particular {resourceType}.
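+
+For example, composing such a key (the `/registry` prefix is the conventional `k8s_storage_prefix`; the helper below is illustrative only):
+
+```go
+// storagepath.go - illustrative: compose the etcd key for a namespaced resource.
+package main
+
+import (
+    "fmt"
+    "path"
+)
+
+func storageKey(prefix, resourceType, namespace, name string) string {
+    return path.Join(prefix, resourceType, namespace, name)
+}
+
+func main() {
+    // Watching "/registry/pods" picks up pod changes in every namespace.
+    fmt.Println(storageKey("/registry", "pods", "development", "pod-abc"))
+    // Output: /registry/pods/development/pod-abc
+}
+```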
+
+### Kubelet
+
+The kubelet will register pods it sources from a file or http source with a
+namespace associated with the *cluster-id*.
+
+### Example: OpenShift Origin managing a Kubernetes Namespace
+
+In this example, we demonstrate how the design allows for agents built on-top of
+Kubernetes that manage their own set of resource types associated with a
+*Namespace* to take part in Namespace termination.
+
+OpenShift creates a Namespace in Kubernetes
+
+```json
+{
+ "apiVersion":"v1",
+ "kind": "Namespace",
+ "metadata": {
+ "name": "development",
+ "labels": {
+ "name": "development"
+ }
+ },
+ "spec": {
+ "finalizers": ["openshift.com/origin", "kubernetes"]
+ },
+ "status": {
+ "phase": "Active"
+ }
+}
+```
+
+OpenShift then goes and creates a set of resources (pods, services, etc)
+associated with the "development" namespace. It also creates its own set of
+resources in its own storage associated with the "development" namespace unknown
+to Kubernetes.
+
+User deletes the Namespace in Kubernetes, and the Namespace now has the following state:
+
+```json
+{
+ "apiVersion":"v1",
+ "kind": "Namespace",
+ "metadata": {
+ "name": "development",
+ "deletionTimestamp": "...",
+ "labels": {
+ "name": "development"
+ }
+ },
+ "spec": {
+ "finalizers": ["openshift.com/origin", "kubernetes"]
+ },
+ "status": {
+ "phase": "Terminating"
+ }
+}
+```
+
+The Kubernetes *namespace controller* observes the namespace has a
+*deletionTimestamp* and begins to terminate all of the content in the namespace
+that it knows about. Upon success, it executes a *finalize* action that modifies
+the *Namespace* by removing *kubernetes* from the list of finalizers:
+
+```json
+{
+ "apiVersion":"v1",
+ "kind": "Namespace",
+ "metadata": {
+ "name": "development",
+ "deletionTimestamp": "...",
+ "labels": {
+ "name": "development"
+ }
+ },
+ "spec": {
+ "finalizers": ["openshift.com/origin"]
+ },
+ "status": {
+ "phase": "Terminating"
+ }
+}
+```
+
+OpenShift Origin has its own *namespace controller* that is observing cluster
+state, and it observes the same namespace had a *deletionTimestamp* assigned to
+it. It too will go and purge resources from its own storage that it manages
+associated with that namespace. Upon completion, it executes a *finalize* action
+and removes the reference to "openshift.com/origin" from the list of finalizers.
+
+This results in the following state:
+
+```json
+{
+ "apiVersion":"v1",
+ "kind": "Namespace",
+ "metadata": {
+ "name": "development",
+ "deletionTimestamp": "...",
+ "labels": {
+ "name": "development"
+ }
+ },
+ "spec": {
+ "finalizers": []
+ },
+ "status": {
+ "phase": "Terminating"
+ }
+}
+```
+
+At this point, the Kubernetes *namespace controller* in its sync loop will see
+that the namespace has a deletion timestamp and that its list of finalizers is
+empty. As a result, it knows all content associated from that namespace has been
+purged. It performs a final DELETE action to remove that Namespace from the
+storage.
+
+At this point, all content associated with that Namespace, and the Namespace
+itself are gone.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/network-policy.md b/contributors/design-proposals/network-policy.md
new file mode 100644
index 00000000..ff75aa57
--- /dev/null
+++ b/contributors/design-proposals/network-policy.md
@@ -0,0 +1,304 @@
+# NetworkPolicy
+
+## Abstract
+
+A proposal for implementing a new resource - NetworkPolicy - which
+will enable definition of ingress policies for selections of pods.
+
+The design for this proposal has been created by, and discussed
+extensively within the Kubernetes networking SIG. It has been implemented
+and tested using Kubernetes API extensions by various networking solutions already.
+
+In this design, users can create various NetworkPolicy objects which select groups of pods and
+define how those pods should be allowed to communicate with each other. The
+implementation of that policy at the network layer is left up to the
+chosen networking solution.
+
+> Note that this proposal does not yet include egress / cidr-based policy, which is still actively undergoing discussion in the SIG. These are expected to augment this proposal in a backwards compatible way.
+
+## Implementation
+
+The implementation in Kubernetes consists of:
+- A v1beta1 NetworkPolicy API object
+- A structure on the `Namespace` object to control policy, to be developed as an annotation for now.
+
+### Namespace changes
+
+The following objects will be defined on a Namespace Spec.
+>NOTE: In v1beta1 the Namespace changes will be implemented as an annotation.
+
+```go
+type IngressIsolationPolicy string
+
+const (
+ // Deny all ingress traffic to pods in this namespace. Ingress means
+ // any incoming traffic to pods, whether that be from other pods within this namespace
+ // or any source outside of this namespace.
+ DefaultDeny IngressIsolationPolicy = "DefaultDeny"
+)
+
+// Standard NamespaceSpec object, modified to include a new
+// NamespaceNetworkPolicy field.
+type NamespaceSpec struct {
+ // This is a pointer so that it can be left undefined.
+ NetworkPolicy *NamespaceNetworkPolicy `json:"networkPolicy,omitempty"`
+}
+
+type NamespaceNetworkPolicy struct {
+ // Ingress configuration for this namespace. This config is
+ // applied to all pods within this namespace. For now, only
+ // ingress is supported. This field is optional - if not
+ // defined, then the cluster default for ingress is applied.
+ Ingress *NamespaceIngressPolicy `json:"ingress,omitempty"`
+}
+
+// Configuration for ingress to pods within this namespace.
+// For now, this only supports specifying an isolation policy.
+type NamespaceIngressPolicy struct {
+ // The isolation policy to apply to pods in this namespace.
+ // Currently this field only supports "DefaultDeny", but could
+ // be extended to support other policies in the future. When set to DefaultDeny,
+ // pods in this namespace are denied ingress traffic by default. When not defined,
+ // the cluster default ingress isolation policy is applied (currently allow all).
+ Isolation *IngressIsolationPolicy `json:"isolation,omitempty"`
+}
+```
+
+```yaml
+kind: Namespace
+apiVersion: v1
+spec:
+ networkPolicy:
+ ingress:
+ isolation: DefaultDeny
+```
+
+The above structures will be represented in v1beta1 as a json encoded annotation like so:
+
+```yaml
+kind: Namespace
+apiVersion: v1
+metadata:
+ annotations:
+ net.beta.kubernetes.io/network-policy: |
+ {
+ "ingress": {
+ "isolation": "DefaultDeny"
+ }
+ }
+```
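+
+For illustration, a minimal sketch (using the struct definitions above and the standard `encoding/json` package; not part of the proposal itself) of how an implementation might decode that annotation back into the Go types:
+
+```go
+// decodepolicy.go - illustrative: decode the v1beta1 namespace network-policy annotation.
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+type IngressIsolationPolicy string
+
+type NamespaceIngressPolicy struct {
+    Isolation *IngressIsolationPolicy `json:"isolation,omitempty"`
+}
+
+type NamespaceNetworkPolicy struct {
+    Ingress *NamespaceIngressPolicy `json:"ingress,omitempty"`
+}
+
+func main() {
+    // Value of the net.beta.kubernetes.io/network-policy annotation.
+    raw := `{"ingress": {"isolation": "DefaultDeny"}}`
+
+    var policy NamespaceNetworkPolicy
+    if err := json.Unmarshal([]byte(raw), &policy); err != nil {
+        panic(err)
+    }
+    fmt.Println(*policy.Ingress.Isolation) // DefaultDeny
+}
+```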
+
+### NetworkPolicy Go Definition
+
+For a namespace with ingress isolation, connections to pods in that namespace (from any source) are prevented.
+The user needs a way to explicitly declare which connections are allowed into pods of that namespace.
+
+This is accomplished through ingress rules on `NetworkPolicy`
+objects (of which there can be multiple in a single namespace). Pods selected by
+one or more NetworkPolicy objects should allow any incoming connections that match any
+ingress rule on those NetworkPolicy objects, per the network plugin’s capabilities.
+
+NetworkPolicy objects and the above namespace isolation both act on _connections_ rather than individual packets. That is to say that if traffic from pod A to pod B is allowed by the configured
+policy, then the return packets for that connection from B -> A are also allowed, even if the policy in place would not allow B to initiate a connection to A. NetworkPolicy objects act on a broad definition of _connection_ which includes both TCP and UDP streams. If new network policy is applied that would block an existing connection between two endpoints, the enforcer of policy
+should terminate and block the existing connection as soon as can be expected by the implementation.
+
+We propose adding the new NetworkPolicy object to the `extensions/v1beta1` API group for now.
+
+The SIG also considered the following while developing the proposed NetworkPolicy object:
+- A per-pod policy field. We discounted this in favor of the loose coupling that labels provide, similar to Services.
+- Per-Service policy. We chose not to attach network policy to services to avoid semantic overloading of a single object, and conflating the existing semantics of load-balancing and service discovery with those of network policy.
+
+```go
+type NetworkPolicy struct {
+ TypeMeta
+ ObjectMeta
+
+ // Specification of the desired behavior for this NetworkPolicy.
+ Spec NetworkPolicySpec
+}
+
+type NetworkPolicySpec struct {
+ // Selects the pods to which this NetworkPolicy object applies. The array of ingress rules
+ // is applied to any pods selected by this field. Multiple network policies can select the
+ // same set of pods. In this case, the ingress rules for each are combined additively.
+ // This field is NOT optional and follows standard unversioned.LabelSelector semantics.
+ // An empty podSelector matches all pods in this namespace.
+ PodSelector unversioned.LabelSelector `json:"podSelector"`
+
+ // List of ingress rules to be applied to the selected pods.
+ // Traffic is allowed to a pod if namespace.networkPolicy.ingress.isolation is undefined and cluster policy allows it,
+ // OR if the traffic source is the pod's local node,
+ // OR if the traffic matches at least one ingress rule across all of the NetworkPolicy
+ // objects whose podSelector matches the pod.
+ // If this field is empty then this NetworkPolicy does not affect ingress isolation.
+ // If this field is present and contains at least one rule, this policy allows any traffic
+ // which matches at least one of the ingress rules in this list.
+ Ingress []NetworkPolicyIngressRule `json:"ingress,omitempty"`
+}
+
+// This NetworkPolicyIngressRule matches traffic if and only if the traffic matches both ports AND from.
+type NetworkPolicyIngressRule struct {
+ // List of ports which should be made accessible on the pods selected for this rule.
+ // Each item in this list is combined using a logical OR.
+ // If this field is not provided, this rule matches all ports (traffic not restricted by port).
+ // If this field is empty, this rule matches no ports (no traffic matches).
+ // If this field is present and contains at least one item, then this rule allows traffic
+ // only if the traffic matches at least one port in the ports list.
+ Ports *[]NetworkPolicyPort `json:"ports,omitempty"`
+
+ // List of sources which should be able to access the pods selected for this rule.
+ // Items in this list are combined using a logical OR operation.
+ // If this field is not provided, this rule matches all sources (traffic not restricted by source).
+ // If this field is empty, this rule matches no sources (no traffic matches).
+ // If this field is present and contains at least one item, this rule allows traffic only if the
+ // traffic matches at least one item in the from list.
+ From *[]NetworkPolicyPeer `json:"from,omitempty"`
+}
+
+type NetworkPolicyPort struct {
+ // Optional. The protocol (TCP or UDP) which traffic must match.
+ // If not specified, this field defaults to TCP.
+ Protocol *api.Protocol `json:"protocol,omitempty"`
+
+ // If specified, the port on the given protocol. This can
+ // either be a numerical or named port. If this field is not provided,
+ // this matches all port names and numbers.
+ // If present, only traffic on the specified protocol AND port
+ // will be matched.
+ Port *intstr.IntOrString `json:"port,omitempty"`
+}
+
+type NetworkPolicyPeer struct {
+ // Exactly one of the following must be specified.
+
+ // This is a label selector which selects Pods in this namespace.
+ // This field follows standard unversioned.LabelSelector semantics.
+ // If present but empty, this selector selects all pods in this namespace.
+ PodSelector *unversioned.LabelSelector `json:"podSelector,omitempty"`
+
+ // Selects Namespaces using cluster scoped-labels. This
+ // matches all pods in all namespaces selected by this label selector.
+ // This field follows standard unversioned.LabelSelector semantics.
+ // If present but empty, this selector selects all namespaces.
+ NamespaceSelector *unversioned.LabelSelector `json:"namespaceSelector,omitempty"`
+}
+```
+
+### Behavior
+
+The following pseudo-code attempts to define when traffic is allowed to a given pod when using this API.
+
+```python
+def is_traffic_allowed(traffic, pod):
+ """
+ Returns True if traffic is allowed to this pod, False otherwise.
+ """
+ if not pod.Namespace.Spec.NetworkPolicy.Ingress.Isolation:
+ # If ingress isolation is disabled on the Namespace, use cluster default.
+ return clusterDefault(traffic, pod)
+ elif traffic.source == pod.node.kubelet:
+ # Traffic is from kubelet health checks.
+ return True
+ else:
+ # If namespace ingress isolation is enabled, only allow traffic
+ # that matches a network policy which selects this pod.
+ for network_policy in network_policies(pod.Namespace):
+ if not network_policy.Spec.PodSelector.selects(pod):
+ # This policy doesn't select this pod. Try the next one.
+ continue
+
+ # This policy selects this pod. Check each ingress rule
+ # defined on this policy to see if it allows the traffic.
+ # If at least one does, then the traffic is allowed.
+ for ingress_rule in network_policy.Ingress or []:
+ if ingress_rule.matches(traffic):
+ return True
+
+ # Ingress isolation is DefaultDeny and no policies match the given pod and traffic.
+ return False
+```
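+
+The pseudo-code above leaves `ingress_rule.matches(traffic)` abstract. Here is a hedged Go sketch of the "ports AND from" semantics described in the API comments (types are heavily simplified and illustrative; this is not the real implementation):
+
+```go
+// rulematch.go - illustrative sketch of NetworkPolicyIngressRule matching semantics.
+package main
+
+import "fmt"
+
+type traffic struct {
+    protocol  string
+    port      int
+    srcLabels map[string]string
+}
+
+type rule struct {
+    // nil means "not restricted"; an empty list means "matches nothing".
+    ports *[]int
+    from  *[]map[string]string // each entry: labels a source pod must carry
+}
+
+func matchLabels(selector, labels map[string]string) bool {
+    for k, v := range selector {
+        if labels[k] != v {
+            return false
+        }
+    }
+    return true
+}
+
+// matches implements "ports AND from": both lists must match, and each list is an OR internally.
+func (r rule) matches(t traffic) bool {
+    portOK := r.ports == nil
+    if r.ports != nil {
+        for _, p := range *r.ports {
+            if p == t.port {
+                portOK = true
+            }
+        }
+    }
+    fromOK := r.from == nil
+    if r.from != nil {
+        for _, sel := range *r.from {
+            if matchLabels(sel, t.srcLabels) {
+                fromOK = true
+            }
+        }
+    }
+    return portOK && fromOK
+}
+
+func main() {
+    r := rule{ports: &[]int{6379}, from: &[]map[string]string{{"role": "frontend"}}}
+    t := traffic{protocol: "TCP", port: 6379, srcLabels: map[string]string{"role": "frontend"}}
+    fmt.Println(r.matches(t)) // true
+}
+```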
+
+### Potential Future Work / Questions
+
+- A single podSelector per NetworkPolicy may lead to managing a large number of NetworkPolicy objects, each of which is small and easy to understand on its own. However, this may mean that a single policy change requires touching several policy objects. Allowing an optional podSelector per ingress rule, in addition to the podSelector per NetworkPolicy object, would allow the user to group rules into logical segments and choose the size/complexity trade-off where it makes sense. This may lead to a smaller number of objects with more complexity if the user opts in to the additional podSelector, but it also increases the complexity of the NetworkPolicy object itself. This proposal has opted to favor a larger number of smaller objects that are easier to understand, with the understanding that additional podSelectors could be added to this design in the future should the requirement become apparent.
+
+- Is the `Namespaces` selector in the `NetworkPolicyPeer` struct too coarse? Do we need to support the AND combination of `Namespaces` and `Pods`?
+
+### Examples
+
+1) Only allow traffic from frontend pods on TCP port 6379 to backend pods in the same namespace.
+
+```yaml
+kind: Namespace
+apiVersion: v1
+metadata:
+ name: myns
+ annotations:
+ net.beta.kubernetes.io/network-policy: |
+ {
+ "ingress": {
+ "isolation": "DefaultDeny"
+ }
+ }
+---
+kind: NetworkPolicy
+apiVersion: extensions/v1beta1
+metadata:
+ name: allow-frontend
+ namespace: myns
+spec:
+ podSelector:
+ matchLabels:
+ role: backend
+ ingress:
+ - from:
+ - podSelector:
+ matchLabels:
+ role: frontend
+ ports:
+ - protocol: TCP
+ port: 6379
+```
+
+2) Allow TCP 443 from any source in Bob's namespaces.
+
+```yaml
+kind: NetworkPolicy
+apiVersion: extensions/v1beta1
+metadata:
+ name: allow-tcp-443
+spec:
+ podSelector:
+ matchLabels:
+ role: frontend
+ ingress:
+ - ports:
+ - protocol: TCP
+ port: 443
+ from:
+ - namespaceSelector:
+ matchLabels:
+ user: bob
+```
+
+3) Allow all traffic to all pods in this namespace.
+
+```yaml
+kind: NetworkPolicy
+apiVersion: extensions/v1beta1
+metadata:
+ name: allow-all
+spec:
+ podSelector:
+ ingress:
+ - {}
+```
+
+## References
+
+- https://github.com/kubernetes/kubernetes/issues/22469 tracks network policy in kubernetes.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/network-policy.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/networking.md b/contributors/design-proposals/networking.md
new file mode 100644
index 00000000..6e269481
--- /dev/null
+++ b/contributors/design-proposals/networking.md
@@ -0,0 +1,190 @@
+# Networking
+
+There are 4 distinct networking problems to solve:
+
+1. Highly-coupled container-to-container communications
+2. Pod-to-Pod communications
+3. Pod-to-Service communications
+4. External-to-internal communications
+
+## Model and motivation
+
+Kubernetes deviates from the default Docker networking model (though as of
+Docker 1.8 their network plugins are getting closer). The goal is for each pod
+to have an IP in a flat shared networking namespace that has full communication
+with other physical computers and containers across the network. IP-per-pod
+creates a clean, backward-compatible model where pods can be treated much like
+VMs or physical hosts from the perspectives of port allocation, networking,
+naming, service discovery, load balancing, application configuration, and
+migration.
+
+Dynamic port allocation, on the other hand, requires supporting both static
+ports (e.g., for externally accessible services) and dynamically allocated
+ports, requires partitioning centrally allocated and locally acquired dynamic
+ports, complicates scheduling (since ports are a scarce resource), is
+inconvenient for users, complicates application configuration, is plagued by
+port conflicts and reuse and exhaustion, requires non-standard approaches to
+naming (e.g. consul or etcd rather than DNS), requires proxies and/or
+redirection for programs using standard naming/addressing mechanisms (e.g. web
+browsers), requires watching and cache invalidation for address/port changes
+for instances in addition to watching group membership changes, and obstructs
+container/pod migration (e.g. using CRIU). NAT introduces additional complexity
+by fragmenting the addressing space, which breaks self-registration mechanisms,
+among other problems.
+
+## Container to container
+
+All containers within a pod behave as if they are on the same host with regard
+to networking. They can all reach each other’s ports on localhost. This offers
+simplicity (static ports known a priori), security (ports bound to localhost
+are visible within the pod but never outside it), and performance. This also
+reduces friction for applications moving from the world of uncontainerized apps
+on physical or virtual hosts. People running application stacks together on
+the same host have already figured out how to make ports not conflict and have
+arranged for clients to find them.
+
+The approach does reduce isolation between containers within a pod &mdash;
+ports could conflict, and there can be no container-private ports, but these
+seem to be relatively minor issues with plausible future workarounds. Besides,
+the premise of pods is that containers within a pod share some resources
+(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
+Additionally, the user can control what containers belong to the same pod
+whereas, in general, they don't control what pods land together on a host.
+
+## Pod to pod
+
+Because every pod gets a "real" (not machine-private) IP address, pods can
+communicate without proxies or translations. The pod can use well-known port
+numbers and can avoid the use of higher-level service discovery systems like
+DNS-SD, Consul, or Etcd.
+
+When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
+it sees the same IP that any peer container would see it coming from &mdash;
+each pod has its own IP address that other pods can know. By making IP addresses
+and ports the same both inside and outside the pods, we create a NAT-less, flat
+address space. Running "ip addr show" should work as expected. This would enable
+all existing naming/discovery mechanisms to work out of the box, including
+self-registration mechanisms and applications that distribute IP addresses. We
+should be optimizing for inter-pod network communication. Within a pod,
+containers are more likely to use communication through volumes (e.g., tmpfs) or
+IPC.
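+
+As an illustration, the following Go sketch shows how a containerized process
+could discover the pod IP it is reachable on and advertise it; because pod IPs
+are not NATed, the address found this way is the same one peers see. This is a
+minimal sketch using only the standard library; the registration step itself is
+left as a comment.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+)
+
+// podIP returns the first non-loopback IPv4 address on the pod's interfaces,
+// which in the IP-per-pod model is the pod's routable address.
+func podIP() (string, error) {
+	addrs, err := net.InterfaceAddrs()
+	if err != nil {
+		return "", err
+	}
+	for _, addr := range addrs {
+		if ipnet, ok := addr.(*net.IPNet); ok && !ipnet.IP.IsLoopback() {
+			if ip4 := ipnet.IP.To4(); ip4 != nil {
+				return ip4.String(), nil
+			}
+		}
+	}
+	return "", fmt.Errorf("no non-loopback IPv4 address found")
+}
+
+func main() {
+	ip, err := podIP()
+	if err != nil {
+		panic(err)
+	}
+	// A real application would now self-register ip with its discovery system.
+	fmt.Println("advertising", ip)
+}
+```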
+
+This is different from the standard Docker model. In that model, each container
+gets an IP in the 172-dot space and would only see that 172-dot address from
+SIOCGIFADDR. If one of these containers connects to another container, the peer
+sees the connection coming from a different IP than the one the container itself
+knows. In short &mdash; you can never self-register anything from a container,
+because a container cannot be reached on its private IP.
+
+An alternative we considered was an additional layer of addressing: pod-centric
+IP per container. Each container would have its own local IP address, visible
+only within that pod. This would perhaps make it easier for containerized
+applications to move from physical/virtual hosts to pods, but would be more
+complex to implement (e.g., requiring a bridge per pod, split-horizon/VPN DNS)
+and to reason about, due to the additional layer of address translation, and
+would break self-registration and IP distribution mechanisms.
+
+As with Docker, ports can still be published to the host node's interface(s), but
+the need for this is radically diminished.
+
+## Implementation
+
+For the Google Compute Engine cluster configuration scripts, we use [advanced
+routing rules](https://developers.google.com/compute/docs/networking#routing)
+and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that
+get routed to it. This is in addition to the 'main' IP address assigned to the
+VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to
+differentiate it from `docker0`) is set up outside of Docker proper.
+
+Example of GCE's advanced routing rules:
+
+```sh
+gcloud compute routes add "${NODE_NAMES[$i]}" \
+ --project "${PROJECT}" \
+ --destination-range "${NODE_IP_RANGES[$i]}" \
+ --network "${NETWORK}" \
+ --next-hop-instance "${NODE_NAMES[$i]}" \
+ --next-hop-instance-zone "${ZONE}" &
+```
+
+GCE itself does not know anything about these IPs, though. This means that when
+a pod tries to egress beyond GCE's project the packets must be SNAT'ed
+(masqueraded) to the VM's IP, which GCE recognizes and allows.
+
+### Other implementations
+
+Other implementations exist that provide the IP-per-pod model outside of GCE:
+ - [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md)
+ - [Flannel](https://github.com/coreos/flannel#flannel)
+ - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
+ ("With Linux Bridge devices" section)
+ - [Weave](https://github.com/zettio/weave) is yet another way to build an
+ overlay network, primarily aiming at Docker integration.
+ - [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
+ container IPs.
+
+## Pod to service
+
+The [service](../user-guide/services.md) abstraction provides a way to group pods under a
+common access policy (e.g. load-balanced). The implementation of this creates a
+virtual IP which clients can access and which is transparently proxied to the
+pods in a Service. Each node runs a kube-proxy process which programs
+`iptables` rules to trap access to service IPs and redirect them to the correct
+backends. This provides a highly-available load-balancing solution with low
+performance overhead by balancing client traffic from a node on that same node.
+
+## External to internal
+
+So far the discussion has been about how to access a pod or service from within
+the cluster. Accessing a pod from outside the cluster is a bit trickier. We
+want to offer highly-available, high-performance load balancing to target
+Kubernetes Services. Most public cloud providers are simply not flexible enough
+yet.
+
+The way this is generally implemented is to set up external load balancers (e.g.
+GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
+traffic arrives at a node it is recognized as being part of a particular Service
+and routed to an appropriate backend Pod. This does mean that some traffic will
+get double-bounced on the network. Once cloud providers have better offerings
+we can take advantage of those.
+
+## Challenges and future work
+
+### Docker API
+
+Right now, `docker inspect` doesn't show the networking configuration of the
+containers, since they derive it from another container. That information should
+be exposed somehow.
+
+### External IP assignment
+
+We want to be able to assign IP addresses externally from Docker
+[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need
+to statically allocate fixed-size IP ranges to each node, so that IP addresses
+can be made stable across pod infra container restarts
+([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
+pod migration. Right now, if the pod infra container dies, all the user
+containers must be stopped and restarted because the netns of the pod infra
+container will change on restart, and any subsequent user container restart
+will join that new netns, thereby not being able to see its peers.
+Additionally, a change in IP address would encounter DNS caching/TTL problems.
+External IP assignment would also simplify DNS support (see below).
+
+### IPv6
+
+IPv6 support would be nice but requires significant internal changes in a few
+areas. First, pods should be able to report multiple IP addresses
+[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398)
+and the network plugin architecture Kubernetes uses needs to allow returning
+IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245).
+Kubernetes code that deals with IP addresses must then be audited and fixed to
+support both IPv4 and IPv6 addresses and not assume IPv4.
+Additionally, direct IPv6 assignment to instances doesn't appear to be supported
+by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
+requests from people running Kubernetes on bare metal, though. :-)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/node-allocatable.md b/contributors/design-proposals/node-allocatable.md
new file mode 100644
index 00000000..ac9f46c4
--- /dev/null
+++ b/contributors/design-proposals/node-allocatable.md
@@ -0,0 +1,151 @@
+# Node Allocatable Resources
+
+**Issue:** https://github.com/kubernetes/kubernetes/issues/13984
+
+## Overview
+
+Currently `Node.Status` has `Capacity`, but no concept of node `Allocatable`. We need additional
+parameters to serve several purposes:
+
+1. Kubernetes metrics provides "/docker-daemon", "/kubelet",
+ "/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage
+ patterns and detecting regressions. Eventually we want to cap system component usage to a certain
+ limit / request. However, this is not currently feasible for a variety of reasons, including:
+ 1. Docker still uses a large amount of computing resources (see
+ [#16943](https://github.com/kubernetes/kubernetes/issues/16943))
+ 2. We have not yet defined the minimal system requirements, so we cannot control Kubernetes
+ nodes or know about arbitrary daemons, which can make the system resources
+ unmanageable. Even with a resource cap we cannot do full resource management on the
+ node, but with the proposed parameters we can mitigate severe resource overcommitment.
+ 3. Usage scales with the number of pods running on the node.
+2. External schedulers (such as Mesos, Hadoop, etc.) might want to partition
+ compute resources on a given node, limiting how much the Kubelet can use. We should provide a
+ mechanism by which they can query the Kubelet and reserve some resources for their own purposes.
+
+### Scope of proposal
+
+This proposal deals with resource reporting through the [`Allocatable` field](#allocatable) for more
+reliable scheduling, and minimizing resource overcommitment. This proposal *does not* cover
+resource usage enforcement (e.g. limiting kubernetes component usage), pod eviction (e.g. when
+reservation grows), or running multiple Kubelets on a single node.
+
+## Design
+
+### Definitions
+
+![image](node-allocatable.png)
+
+1. **Node Capacity** - Already provided as
+ [`NodeStatus.Capacity`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus),
+ this is total capacity read from the node instance, and assumed to be constant.
+2. **System-Reserved** (proposed) - Compute resources reserved for processes which are not managed by
+ Kubernetes. Currently this covers all the processes lumped together in the `/system` raw
+ container.
+3. **Kubelet Allocatable** - Compute resources available for scheduling (including scheduled &
+ unscheduled resources). This value is the focus of this proposal. See [below](#api-changes) for
+ more details.
+4. **Kube-Reserved** (proposed) - Compute resources reserved for Kubernetes components such as the
+ docker daemon, kubelet, kube proxy, etc.
+
+### API changes
+
+#### Allocatable
+
+Add `Allocatable` (3) to
+[`NodeStatus`](https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html#_v1_nodestatus):
+
+```go
+type NodeStatus struct {
+ ...
+ // Allocatable represents schedulable resources of a node.
+ Allocatable ResourceList `json:"allocatable,omitempty"`
+ ...
+}
+```
+
+Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be:
+
+```
+ [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
+```
+
+The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet
+will use it when performing admission checks.
+
+*Note: Since kernel usage can fluctuate and is out of kubernetes control, it will be reported as a
+ separate value (probably via the metrics API). Reporting kernel usage is out-of-scope for this
+ proposal.*
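+
+For illustration, here is a simplified Go sketch of the computation above. It
+uses plain integer quantities (CPU in millicores, memory in bytes) instead of
+the API's `resource.Quantity` type, and clamping at zero is an assumption of
+the sketch rather than specified behavior.
+
+```go
+package main
+
+import "fmt"
+
+type resourceList map[string]int64
+
+// allocatable computes [Allocatable] = [Capacity] - [Kube-Reserved] - [System-Reserved]
+// per resource, clamping at zero so a misconfigured reservation cannot go negative.
+func allocatable(capacity, kubeReserved, systemReserved resourceList) resourceList {
+	out := resourceList{}
+	for name, c := range capacity {
+		v := c - kubeReserved[name] - systemReserved[name]
+		if v < 0 {
+			v = 0
+		}
+		out[name] = v
+	}
+	return out
+}
+
+func main() {
+	capacity := resourceList{"cpu": 4000, "memory": 16 << 30}
+	kube := resourceList{"cpu": 500, "memory": 5 << 20}     // e.g. --kube-reserved=cpu=500m,memory=5Mi
+	system := resourceList{"cpu": 200, "memory": 256 << 20} // analogous reservation for non-kubernetes daemons
+	fmt.Println(allocatable(capacity, kube, system))
+}
+```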
+
+#### Kube-Reserved
+
+`KubeReserved` is the parameter specifying resources reserved for kubernetes components (4). It is
+provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during
+normal Kubelet operation (this may change in the [future](#future-work)).
+
+The flag will be specified as a serialized `ResourceList`, with resources defined by the API
+`ResourceName` and values specified in `resource.Quantity` format, e.g.:
+
+```
+--kube-reserved=cpu=500m,memory=5Mi
+```
+
+Initially we will only support CPU and memory, but will eventually support more resources. See
+[#16889](https://github.com/kubernetes/kubernetes/pull/16889) for disk accounting.
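+
+As a rough illustration of the flag format, the following standard-library Go
+sketch splits such a value into resource-name/quantity pairs; the real Kubelet
+would parse the quantities with the API's `resource.Quantity` type, which is
+omitted here to keep the example self-contained.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// parseReserved splits "cpu=500m,memory=5Mi" into {"cpu": "500m", "memory": "5Mi"}.
+// Quantity strings are not validated here.
+func parseReserved(flag string) (map[string]string, error) {
+	out := map[string]string{}
+	if flag == "" {
+		return out, nil
+	}
+	for _, pair := range strings.Split(flag, ",") {
+		kv := strings.SplitN(pair, "=", 2)
+		if len(kv) != 2 || kv[0] == "" || kv[1] == "" {
+			return nil, fmt.Errorf("malformed resource reservation %q", pair)
+		}
+		out[kv[0]] = kv[1]
+	}
+	return out, nil
+}
+
+func main() {
+	reserved, err := parseReserved("cpu=500m,memory=5Mi")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(reserved) // map[cpu:500m memory:5Mi]
+}
+```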
+
+If KubeReserved is not set it defaults to a sane value (TBD) calculated from machine capacity. If it
+is explicitly set to 0 (along with `SystemReserved`), then `Allocatable == Capacity`, and the system
+behavior is equivalent to the 1.1 behavior with scheduling based on Capacity.
+
+#### System-Reserved
+
+In the initial implementation, `SystemReserved` will be functionally equivalent to
+[`KubeReserved`](#kube-reserved), but with a different semantic meaning. While KubeReserved
+designates resources set aside for kubernetes components, SystemReserved designates resources set
+aside for non-kubernetes components (currently this is reported as all the processes lumped
+together in the `/system` raw container).
+
+## Issues
+
+### Kubernetes reservation is smaller than kubernetes component usage
+
+**Solution**: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved
+resources and hope for the best. If the node usage is less than Allocatable, there will be some room
+for overflow and the node should continue to function. If the node has been scheduled to capacity
+(worst-case scenario) it may enter an unstable state, which is the current behavior in this
+situation.
+
+In the [future](#future-work) we may set a parent cgroup for kubernetes components, with limits set
+according to `KubeReserved`.
+
+### Version discrepancy
+
+**API server / scheduler is not allocatable-resources aware:** If the Kubelet rejects a Pod but the
+ scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop
+ scheduling a Pod onto the node only to have Kubelet repeatedly reject it. To avoid this situation,
+ we will do a 2-stage rollout of `Allocatable`. In stage 1 (targeted for 1.2), `Allocatable` will
+ be reported by the Kubelet and the scheduler will be updated to use it, but Kubelet will continue
+ to do admission checks based on `Capacity` (same as today). In stage 2 of the rollout (targeted
+ for 1.3 or later), the Kubelet will start doing admission checks based on `Allocatable`.
+
+**API server expects `Allocatable` but does not receive it:** If the kubelet is older and does not
+ provide `Allocatable` in the `NodeStatus`, then `Allocatable` will be
+ [defaulted](../../pkg/api/v1/defaults.go) to
+ `Capacity` (which will yield today's behavior of scheduling based on capacity).
+
+### 3rd party schedulers
+
+The community should be notified that an update to schedulers is recommended, but if a scheduler is
+not updated it falls under the above case of "scheduler is not allocatable-resources aware".
+
+## Future work
+
+1. Convert kubelet flags to Config API - Prerequisite to (2). See
+ [#12245](https://github.com/kubernetes/kubernetes/issues/12245).
+2. Set cgroup limits according to `KubeReserved` - as described in the [overview](#overview)
+3. Report kernel usage to be considered with scheduling decisions.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/node-allocatable.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/node-allocatable.png b/contributors/design-proposals/node-allocatable.png
new file mode 100644
index 00000000..d6f5383e
--- /dev/null
+++ b/contributors/design-proposals/node-allocatable.png
Binary files differ
diff --git a/contributors/design-proposals/nodeaffinity.md b/contributors/design-proposals/nodeaffinity.md
new file mode 100644
index 00000000..61e04169
--- /dev/null
+++ b/contributors/design-proposals/nodeaffinity.md
@@ -0,0 +1,246 @@
+# Node affinity and NodeSelector
+
+## Introduction
+
+This document proposes a new label selector representation, called
+`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit
+more flexible and is intended to be used only for selecting nodes.
+
+In addition, we propose to replace the `map[string]string` in `PodSpec` that the
+scheduler currently uses as part of restricting the set of nodes onto which a
+pod is eligible to schedule, with a field of type `Affinity` that contains one
+or more affinity specifications. In this document we discuss `NodeAffinity`,
+which contains one or more of the following:
+* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
+represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
+the current `map[string]string` but still serves the purpose of restricting
+the set of nodes onto which the pod can schedule. In addition, unlike the
+behavior of the current `map[string]string`, when it becomes violated the system
+will try to eventually evict the pod from its node.
+* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is
+identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the
+system may or may not try to eventually evict the pod from its node.
+* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that
+specifies which nodes are preferred for scheduling among those that meet all
+scheduling requirements.
+
+(In practice, as discussed later, we will actually *add* the `Affinity` field
+rather than replacing `map[string]string`, due to backward compatibility
+requirements.)
+
+The affinity specifications described above allow a pod to request various
+properties that are inherent to nodes, for example "run this pod on a node with
+an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
+([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
+some of the properties that a node might publish as labels, which affinity
+expressions can match against.) They do *not* allow a pod to request to schedule
+(or not schedule) on a node based on what other pods are running on the node.
+That feature is called "inter-pod topological affinity/anti-affinity" and is
+described [here](https://github.com/kubernetes/kubernetes/pull/18265).
+
+## API
+
+### NodeSelector
+
+```go
+// A node selector represents the union of the results of one or more label queries
+// over a set of nodes; that is, it represents the OR of the selectors represented
+// by the nodeSelectorTerms.
+type NodeSelector struct {
+ // nodeSelectorTerms is a list of node selector terms. The terms are ORed.
+ NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
+}
+
+// An empty node selector term matches all objects. A null node selector term
+// matches no objects.
+type NodeSelectorTerm struct {
+ // matchExpressions is a list of node selector requirements. The requirements are ANDed.
+ MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
+}
+
+// A node selector requirement is a selector that contains values, a key, and an operator
+// that relates the key and values.
+type NodeSelectorRequirement struct {
+ // key is the label key that the selector applies to.
+ Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+ // operator represents a key's relationship to a set of values.
+ // Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
+ Operator NodeSelectorOperator `json:"operator"`
+ // values is an array of string values. If the operator is In or NotIn,
+ // the values array must be non-empty. If the operator is Exists or DoesNotExist,
+ // the values array must be empty. If the operator is Gt or Lt, the values
+ // array must have a single element, which will be interpreted as an integer.
+ // This array is replaced during a strategic merge patch.
+ Values []string `json:"values,omitempty"`
+}
+
+// A node selector operator is the set of operators that can be used in
+// a node selector requirement.
+type NodeSelectorOperator string
+
+const (
+ NodeSelectorOpIn NodeSelectorOperator = "In"
+ NodeSelectorOpNotIn NodeSelectorOperator = "NotIn"
+ NodeSelectorOpExists NodeSelectorOperator = "Exists"
+ NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
+ NodeSelectorOpGt NodeSelectorOperator = "Gt"
+ NodeSelectorOpLt NodeSelectorOperator = "Lt"
+)
+```
+
+### NodeAffinity
+
+We will add one field to `PodSpec`:
+
+```go
+Affinity *Affinity `json:"affinity,omitempty"`
+```
+
+The `Affinity` type is defined as follows:
+
+```go
+type Affinity struct {
+ NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
+}
+
+type NodeAffinity struct {
+ // If the affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a node label update),
+ // the system will try to eventually evict the pod from its node.
+ RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
+ // If the affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a node label update),
+ // the system may or may not try to eventually evict the pod from its node.
+ RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
+ // The scheduler will prefer to schedule pods to nodes that satisfy
+ // the affinity expressions specified by this field, but it may choose
+ // a node that violates one or more of the expressions. The node that is
+ // most preferred is the one with the greatest sum of weights, i.e.
+ // for each node that meets all of the scheduling requirements (resource
+ // request, RequiredDuringScheduling affinity expressions, etc.),
+ // compute a sum by iterating through the elements of this field and adding
+ // "weight" to the sum if the node matches the corresponding MatchExpressions; the
+ // node(s) with the highest sum are the most preferred.
+ PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
+}
+
+// An empty preferred scheduling term matches all objects with implicit weight 0
+// (i.e. it's a no-op). A null preferred scheduling term matches no objects.
+type PreferredSchedulingTerm struct {
+ // weight is in the range 1-100
+ Weight int `json:"weight"`
+ // matchExpressions is a list of node selector requirements. The requirements are ANDed.
+ MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
+}
+```
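+
+As a usage sketch (assuming the types above are in scope), the "Run this pod on
+a node with an Intel or AMD CPU" example listed under Examples below could be
+expressed as follows; the label key `cpu-vendor` and its values are hypothetical
+node labels, not ones Kubernetes defines (see
+[#9044](https://github.com/kubernetes/kubernetes/issues/9044)).
+
+```go
+// Hard requirement: only schedule onto nodes labeled with an intel or amd CPU.
+var intelOrAMD = Affinity{
+	NodeAffinity: &NodeAffinity{
+		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
+			NodeSelectorTerms: []NodeSelectorTerm{
+				{
+					MatchExpressions: []NodeSelectorRequirement{
+						{
+							Key:      "cpu-vendor",
+							Operator: NodeSelectorOpIn,
+							Values:   []string{"intel", "amd"},
+						},
+					},
+				},
+			},
+		},
+	},
+}
+```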
+
+Unfortunately, the name of the existing `map[string]string` field in PodSpec is
+`NodeSelector` and we can't change it since this name is part of the API.
+Hopefully this won't cause too much confusion.
+
+## Examples
+
+**TODO: fill in this section**
+
+* Run this pod on a node with an Intel or AMD CPU
+
+* Run this pod on a node in availability zone Z
+
+
+## Backward compatibility
+
+When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
+current field in PodSpec
+
+```go
+NodeSelector map[string]string `json:"nodeSelector,omitempty"`
+```
+
+Old version of the scheduler will ignore the `Affinity` field. New versions of
+the scheduler will apply their scheduling predicates to both `Affinity` and
+`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
+of requirements. We will not attempt to convert between `Affinity` and
+`nodeSelector`.
+
+Old versions of non-scheduling clients will not know how to do anything
+semantically meaningful with `Affinity`, but we don't expect that this will
+cause a problem.
+
+See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
+for more discussion.
+
+Users should not start using `NodeAffinity` until the full implementation has
+been in Kubelet and the master for enough binary versions that we feel
+comfortable that we will not need to roll back either Kubelet or master to a
+version that does not support them. Longer-term we will use a programmatic
+approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
+
+## Implementation plan
+
+1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
+`PreferredDuringSchedulingIgnoredDuringExecution`, and
+`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
+2. Implement a scheduler predicate that takes
+`RequiredDuringSchedulingIgnoredDuringExecution` into account.
+3. Implement a scheduler priority function that takes
+`PreferredDuringSchedulingIgnoredDuringExecution` into account.
+4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
+marked as deprecated.
+5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
+6. Modify the scheduler predicate from step 2 to also take
+`RequiredDuringSchedulingRequiredDuringExecution` into account.
+7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
+decision.
+8. Implement code in Kubelet *or* the controllers that evicts a pod that no
+longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+
+We assume Kubelet publishes labels describing the node's membership in all of
+the relevant scheduling domains (e.g. node name, rack name, availability zone
+name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
+
+## Extensibility
+
+The design described here is the result of careful analysis of use cases, a
+decade of experience with Borg at Google, and a review of similar features in
+other open-source container orchestration systems. We believe that it properly
+balances the goal of expressiveness against the goals of simplicity and
+efficiency of implementation. However, we recognize that use cases may arise in
+the future that cannot be expressed using the syntax described here. Although we
+are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
+for Kubernetes users to get a consistent experience, etc.), the regular
+Kubernetes annotation mechanism can be used to add or replace affinity rules.
+The way this would work is:
+
+1. Define one or more annotations to describe the new affinity rule(s)
+1. User (or an admission controller) attaches the annotation(s) to pods to
+request the desired scheduling behavior. If the new rule(s) *replace* one or
+more fields of `Affinity` then the user would omit those fields from `Affinity`;
+if they are *additional rules*, then the user would fill in `Affinity` as well
+as the annotation(s).
+1. Scheduler takes the annotation(s) into account when scheduling.
+
+If some particular new syntax becomes popular, we would consider upstreaming it
+by integrating it into the standard `Affinity`.
+
+## Future work
+
+Are there any other fields we should convert from `map[string]string` to
+`NodeSelector`?
+
+## Related issues
+
+The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261).
+
+The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341).
+Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related.
+Those issues reference other related issues.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/performance-related-monitoring.md b/contributors/design-proposals/performance-related-monitoring.md
new file mode 100644
index 00000000..f70da39b
--- /dev/null
+++ b/contributors/design-proposals/performance-related-monitoring.md
@@ -0,0 +1,116 @@
+# Performance Monitoring
+
+## Reason for this document
+
+This document serves as a place to gather information about past performance regressions, their reasons and impact, and to discuss ideas for avoiding similar regressions in the future.
+The main reason for doing this is to understand what kind of monitoring needs to be in place to keep Kubernetes fast.
+
+## Known past and present performance issues
+
+### Higher logging level causing scheduler stair stepping
+
+Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in the 1.1 branch in comparison to the `old` 1.0
+cut. In the end it turned out to be caused by the `--v=4` (instead of the default `--v=2`) flag in the scheduler, together with the `--logtostderr` flag (which disables batching of
+log lines) and a number of log statements without an explicit V level. This caused odd behavior of the whole component.
+
+Because we now know that logging may have a big performance impact, we should consider instrumenting the logging mechanism and computing statistics such as the number of logged messages
+and their total and average size. Each binary should be responsible for exposing its metrics. An unaccounted-for but substantial number of days, if not weeks, of engineering time was
+lost because of this issue.
+
+### Adding per-pod probe-time, which increased the number of PodStatus updates, causing major slowdown
+
+In September 2015 we tried to add per-pod probe times to the PodStatus. It caused (https://github.com/kubernetes/kubernetes/issues/14273) a massive increase in both the number and
+total volume of object (PodStatus) changes. It drastically increased the load on the API server, which wasn't able to handle the new volume of requests quickly enough, violating our
+response time SLO. We had to revert this change.
+
+### Late Ready->Running PodPhase transition caused test failures as it seemed like slowdown
+
+In late September we encountered a strange problem (https://github.com/kubernetes/kubernetes/issues/14554): we observed increased latencies in small clusters (a few
+Nodes). It turned out to be caused by an added latency between the PodRunning and PodReady phases. This was not a real regression, but our tests thought it was, which shows
+how careful we need to be.
+
+### Huge number of handshakes slows down API server
+
+This was a long-standing performance issue and is/was an important bottleneck for scalability (https://github.com/kubernetes/kubernetes/issues/13671). The bug directly
+causing this problem was incorrect (from the Go standpoint) handling of TCP connections. A secondary issue was that elliptic curve encryption (the only option available in Go 1.4)
+is extremely slow.
+
+## Proposed metrics/statistics to gather/compute to avoid problems
+
+### Cluster-level metrics
+
+Basic ideas:
+- number of Pods/ReplicationControllers/Services in the cluster
+- number of running replicas of master components (if they are replicated)
+- currently elected master of the etcd cluster (if running a distributed version)
+- number of master component restarts
+- number of lost Nodes
+
+### Logging monitoring
+
+Log spam is a serious problem and we need to keep it under control. The simplest way to check for regressions, suggested by @brendandburns, is to compute the rate at which log files
+grow in e2e tests.
+
+Basic ideas:
+- log generation rate (B/s)
+
+### REST call monitoring
+
+We do measure REST call duration in the Density test, but we need API server monitoring as well, to avoid false failures caused, e.g., by network traffic. We already have
+some metrics in place (https://github.com/kubernetes/kubernetes/blob/master/pkg/apiserver/metrics/metrics.go), but we need to revisit the list and add some more.
+
+Basic ideas:
+- number of calls per verb, client, resource type
+- latency distribution per verb, client, resource type
+- number of calls that were rejected per client, resource type, and reason (invalid version number, already at maximum number of requests in flight)
+- number of relists in various watchers
+
+### Rate limit monitoring
+
+Reverse of REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for the number of
+requests sent, we can track how saturated the rate limiters are. This has the additional advantage of giving us the data needed to fine-tune rate limiter constants.
+
+Because we have rate limiting on both ends (client and API server) we should monitor number of inflight requests in API server and how it relates to `max-requests-inflight`.
+
+Basic ideas:
+- percentage of used non-burst limit,
+- amount of time in last hour with depleted burst tokens,
+- number of inflight requests in API server.
+
+### Network connection monitoring
+
+During development we have already observed incorrect use/reuse of HTTP connections multiple times. We should at least monitor the number of created connections.
+
+### ETCD monitoring
+
+@xiang-90 and @hongchaodeng - you probably have way more experience on what'd be good to look at from the ETCD perspective.
+
+Basic ideas:
+- ETCD memory footprint
+- number of objects per kind
+- read/write latencies per kind
+- number of requests from the API server
+- read/write counts per key (it may be too heavy though)
+
+### Resource consumption
+
+On top of all things mentioned above we need to monitor changes in resource usage in both: cluster components (API server, Kubelet, Scheduler, etc.) and system add-ons
+(Heapster, L7 load balancer, etc.). Monitoring memory usage is tricky, because if no limits are set, the system won't apply memory pressure to processes, which makes their memory
+footprint grow constantly. We argue that monitoring usage in tests still makes sense, as tests should be repeatable, and if memory usage grows drastically between two runs
+it can most likely be attributed to some kind of regression (assuming that nothing else has changed in the environment).
+
+Basic ideas:
+- CPU usage
+- memory usage
+
+### Other saturation metrics
+
+We should monitor other aspects of the system, which may indicate saturation of some component.
+
+Basic ideas:
+- queue length for queues in the system,
+- wait time for WaitGroups.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/performance-related-monitoring.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/persistent-storage.md b/contributors/design-proposals/persistent-storage.md
new file mode 100644
index 00000000..70bcde97
--- /dev/null
+++ b/contributors/design-proposals/persistent-storage.md
@@ -0,0 +1,292 @@
+# Persistent Storage
+
+This document proposes a model for managing persistent, cluster-scoped storage
+for applications requiring long lived data.
+
+### Abstract
+
+Two new API kinds:
+
+A `PersistentVolume` (PV) is a storage resource provisioned by an administrator.
+It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/)
+for how to use it.
+
+A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to
+use in a pod. It is analogous to a pod.
+
+One new system component:
+
+`PersistentVolumeClaimBinder` is a singleton running in master that watches all
+PersistentVolumeClaims in the system and binds them to the closest matching
+available PersistentVolume. The volume manager watches the API for newly created
+volumes to manage.
+
+One new volume:
+
+`PersistentVolumeClaimVolumeSource` references the user's PVC in the same
+namespace. This volume finds the bound PV and mounts that volume for the pod. A
+`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another
+type of volume that is owned by someone else (the system).
+
+Kubernetes makes no guarantees at runtime that the underlying storage exists or
+is available. High availability is left to the storage provider.
+
+### Goals
+
+* Allow administrators to describe available storage.
+* Allow pod authors to discover and request persistent volumes to use with pods.
+* Enforce security through access control lists and securing storage to the same
+namespace as the pod volume.
+* Enforce quotas through admission control.
+* Enforce scheduler rules by resource counting.
+* Ensure developers can rely on storage being available without being closely
+bound to a particular disk, server, network, or storage device.
+
+#### Describe available storage
+
+Cluster administrators use the API to manage *PersistentVolumes*. A custom store
+`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
+storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
+storage and binds them to an available volume by matching the volume's
+characteristics (AccessModes and storage size) to the user's request.
+
+PVs are system objects and, thus, have no namespace.
+
+Many means of dynamic provisioning will eventually be implemented for various
+storage types.
+
+
+##### PersistentVolume API
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
+| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
+| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
+| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
+| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
+| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
+
+
+#### Request Storage
+
+Kubernetes users request persistent storage for their pod by creating a
+```PersistentVolumeClaim```. Their request for storage is described by their
+requirements for resources and mount capabilities.
+
+Requests for volumes are bound to available volumes by the volume manager, if a
+suitable match is found. Requests for resources can go unfulfilled.
+
+Users attach their claim to their pod using a new
+```PersistentVolumeClaimVolumeSource``` volume source.
+
+
+##### PersistentVolumeClaim API
+
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
+| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
+| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
+
+
+
+#### Scheduling constraints
+
+Scheduling constraints are to be handled similarly to pod resource constraints.
+Pods will need to be annotated or decorated with the resources they
+require on a node. Similarly, a node will need to list how much it has used or has
+available.
+
+TBD
+
+
+#### Events
+
+The implementation of persistent storage will not require events to communicate
+to the user the state of their claim. The CLI for bound claims contains a
+reference to the backing persistent volume. This is always present in the API
+and CLI, making an event to communicate the same information unnecessary.
+
+Events that communicate the state of a mounted volume are left to the volume
+plugins.
+
+### Example
+
+#### Admin provisions storage
+
+An administrator provisions storage by posting PVs to the API. Various ways to
+automate this task can be scripted. Dynamic provisioning is a future feature
+that can maintain levels of PVs.
+
+```yaml
+POST:
+
+kind: PersistentVolume
+apiVersion: v1
+metadata:
+ name: pv0001
+spec:
+ capacity:
+ storage: 10
+ persistentDisk:
+ pdName: "abc123"
+ fsType: "ext4"
+```
+
+```console
+$ kubectl get pv
+
+NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
+pv0001 map[] 10737418240 RWO Pending
+```
+
+#### Users request storage
+
+A user requests storage by posting a PVC to the API. Their request contains the
+AccessModes they wish their volume to have and the minimum size needed.
+
+The user must be within a namespace to create PVCs.
+
+```yaml
+POST:
+
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+ name: myclaim-1
+spec:
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 3
+```
+
+```console
+$ kubectl get pvc
+
+NAME LABELS STATUS VOLUME
+myclaim-1 map[] pending
+```
+
+
+#### Matching and binding
+
+The ```PersistentVolumeClaimBinder``` attempts to find an available volume that
+most closely matches the user's request. If one exists, the two are bound by
+putting a reference to the PVC on the PV. Requests can go unfulfilled if a
+suitable match is not found.
+
+```console
+$ kubectl get pv
+
+NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
+pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
+
+
+kubectl get pvc
+
+NAME LABELS STATUS VOLUME
+myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
+```
+
+A claim must request access modes and storage capacity. This is because internally PVs are
+indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity.
+A claim may request one or more of the following attributes to better match a PV: volume name, selectors,
+and volume class (currently implemented as an annotation).
+
+A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which
+PVC it will match.
+A PV may also define labels, annotations, and a volume class (currently implemented as an
+annotation) to better target PVCs.
+
+As of Kubernetes version 1.4, the following algorithm describes in more detail how a claim is
+matched to a PV (a simplified sketch of the final selection step follows the list):
+
+1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered.
+"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines
+the mode requested by the claim.
+
+1. The potential PVs above are considered in order of the closest access mode match, with the best case
+being an exact match, and a worse case being more modes than requested by the claim.
+
+1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity
+is not less than the storage being requested by the claim then this PV will bind to the claim. Done.
+
+1. Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is
+skipped and will be handled by Dynamic Provisioning.
+
+1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a
+placeholder, then the PV is skipped.
+
+1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the
+PV is skipped. But even if a claim has selectors which match a PV, that does not guarantee a match,
+since capacities may differ.
+
+1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder
+for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped.
+If the annotations for the PV and PVC are empty they are treated as being equal.
+
+1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs,
+the PV with the smallest capacity that is also equal to or greater than the claim's requested storage
+is the matching PV and will be bound to the claim. Done. In the case of two or more PVs matching all
+of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner.
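+
+The following Go sketch illustrates only the final selection step above:
+picking, among PVs that survived the earlier filters, the smallest capacity
+that still satisfies the claim. It ignores access-mode ordering, claimRefs,
+selectors, and storage-class annotations, and uses plain byte counts instead of
+`resource.Quantity` values.
+
+```go
+package main
+
+import "fmt"
+
+type pv struct {
+	name     string
+	capacity int64 // bytes
+}
+
+// matchClaim returns the PV with the smallest capacity that is still at least
+// the requested size, or nil if none qualifies. Ties go to the earlier PV in
+// the (accessMode-ordered) candidate list.
+func matchClaim(requested int64, candidates []pv) *pv {
+	var best *pv
+	for i := range candidates {
+		c := &candidates[i]
+		if c.capacity < requested {
+			continue // too small for the claim
+		}
+		if best == nil || c.capacity < best.capacity {
+			best = c
+		}
+	}
+	return best
+}
+
+func main() {
+	pvs := []pv{{"pv-small", 1 << 30}, {"pv-medium", 5 << 30}, {"pv-large", 50 << 30}}
+	if m := matchClaim(3<<30, pvs); m != nil {
+		fmt.Println("bound to", m.name) // bound to pv-medium
+	}
+}
+```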
+
+*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default
+`StorageClass` has been defined) then a volume will be dynamically provisioned.
+
+#### Claim usage
+
+The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim
+and mount its volume for a pod.
+
+The claim holder owns the claim and its data for as long as the claim exists.
+The pod using the claim can be deleted, but the claim remains in the user's
+namespace. It can be used again and again by many pods.
+
+```yaml
+POST:
+
+kind: Pod
+apiVersion: v1
+metadata:
+ name: mypod
+spec:
+ containers:
+ - image: nginx
+ name: myfrontend
+ volumeMounts:
+ - mountPath: "/var/www/html"
+ name: mypd
+ volumes:
+ - name: mypd
+ source:
+ persistentVolumeClaim:
+ accessMode: ReadWriteOnce
+ claimRef:
+ name: myclaim-1
+```
+
+#### Releasing a claim and Recycling a volume
+
+When a claim holder is finished with their data, they can delete their claim.
+
+```console
+$ kubectl delete pvc myclaim-1
+```
+
+The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
+reference from the PV and changing the PV's status to 'Released'.
+
+Admins can script the recycling of released volumes. Future dynamic provisioners
+will understand how a volume should be recycled.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/pleg.png b/contributors/design-proposals/pleg.png
new file mode 100644
index 00000000..f15c5d83
--- /dev/null
+++ b/contributors/design-proposals/pleg.png
Binary files differ
diff --git a/contributors/design-proposals/pod-cache.png b/contributors/design-proposals/pod-cache.png
new file mode 100644
index 00000000..dee86c40
--- /dev/null
+++ b/contributors/design-proposals/pod-cache.png
Binary files differ
diff --git a/contributors/design-proposals/pod-lifecycle-event-generator.md b/contributors/design-proposals/pod-lifecycle-event-generator.md
new file mode 100644
index 00000000..207d6a17
--- /dev/null
+++ b/contributors/design-proposals/pod-lifecycle-event-generator.md
@@ -0,0 +1,201 @@
+# Kubelet: Pod Lifecycle Event Generator (PLEG)
+
+In Kubernetes, Kubelet is a per-node daemon that manages the pods on the node,
+driving the pod states to match their pod specifications (specs). To achieve
+this, Kubelet needs to react to changes in both (1) pod specs and (2) the
+container states. For the former, Kubelet watches the pod specs changes from
+multiple sources; for the latter, Kubelet polls the container runtime
+periodically (e.g., 10s) for the latest states for all containers.
+
+Polling incurs non-negligible overhead as the number of pods/containers increases,
+and is exacerbated by Kubelet's parallelism -- one worker (goroutine) per pod, each of which
+queries the container runtime individually. These periodic, concurrent, large numbers
+of requests cause high CPU usage spikes (even when there is no spec/state
+change), poor performance, and reliability problems due to an overwhelmed container
+runtime. Ultimately, this limits Kubelet's scalability.
+
+(Related issues reported by users: [#10451](https://issues.k8s.io/10451),
+[#12099](https://issues.k8s.io/12099), [#12082](https://issues.k8s.io/12082))
+
+## Goals and Requirements
+
+The goal of this proposal is to improve Kubelet's scalability and performance
+by lowering the pod management overhead.
+ - Reduce unnecessary work during inactivity (no spec/state changes)
+ - Lower the concurrent requests to the container runtime.
+
+The design should be generic so that it can support different container runtimes
+(e.g., Docker and rkt).
+
+## Overview
+
+This proposal aims to replace the periodic polling with a pod lifecycle event
+watcher.
+
+![pleg](pleg.png)
+
+## Pod Lifecycle Event
+
+A pod lifecycle event interprets the underlying container state change at the
+pod-level abstraction, making it container-runtime-agnostic. The abstraction
+shields Kubelet from the runtime specifics.
+
+```go
+type PodLifeCycleEventType string
+
+const (
+ ContainerStarted PodLifeCycleEventType = "ContainerStarted"
+ ContainerStopped PodLifeCycleEventType = "ContainerStopped"
+ NetworkSetupCompleted PodLifeCycleEventType = "NetworkSetupCompleted"
+ NetworkFailed PodLifeCycleEventType = "NetworkFailed"
+)
+
+// PodLifecycleEvent is an event that reflects a change in the pod's state.
+type PodLifecycleEvent struct {
+ // The pod ID.
+ ID types.UID
+ // The type of the event.
+ Type PodLifeCycleEventType
+ // The accompanying data, which varies based on the event type.
+ Data interface{}
+}
+```
+
+Using Docker as an example, the start of a pod infra container would be
+translated to a `NetworkSetupCompleted` pod lifecycle event.
+
+
+## Detect Changes in Container States Via Relisting
+
+In order to generate pod lifecycle events, PLEG needs to detect changes in
+container states. We can achieve this by periodically relisting all containers
+(e.g., `docker ps`). Although this is similar to Kubelet's polling today, it will
+only be performed by a single thread (PLEG). This means that we still
+benefit from not having all pod workers hitting the container runtime
+concurrently. Moreover, only the relevant pod worker would be woken up
+to perform a sync.
+
+The upside of relying on relisting is that it is container runtime-agnostic,
+and requires no external dependency.
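+
+A condensed Go sketch of what such a relist pass might look like follows: diff
+the container states seen in the previous relist against the current listing
+and emit events for pods whose containers changed. Listing containers from the
+runtime is elided, and the local `event` type stands in for the
+`PodLifecycleEvent` defined earlier.
+
+```go
+package main
+
+import "fmt"
+
+type containerState string
+
+const (
+	stateRunning containerState = "running"
+	stateExited  containerState = "exited"
+)
+
+type container struct {
+	id    string
+	podID string
+	state containerState
+}
+
+type event struct {
+	podID string
+	kind  string // e.g. "ContainerStarted", "ContainerStopped"
+}
+
+// relist compares the previous and current container listings (keyed by
+// container ID) and returns pod-level events for the differences.
+func relist(old, current map[string]container) []event {
+	var events []event
+	for id, c := range current {
+		prev, seen := old[id]
+		switch {
+		case !seen && c.state == stateRunning:
+			events = append(events, event{c.podID, "ContainerStarted"})
+		case seen && prev.state == stateRunning && c.state == stateExited:
+			events = append(events, event{c.podID, "ContainerStopped"})
+		}
+	}
+	return events
+}
+
+func main() {
+	old := map[string]container{"a": {"a", "pod-1", stateRunning}}
+	current := map[string]container{
+		"a": {"a", "pod-1", stateExited},
+		"b": {"b", "pod-2", stateRunning},
+	}
+	// Emits ContainerStopped for pod-1 and ContainerStarted for pod-2
+	// (map iteration order is not deterministic).
+	fmt.Println(relist(old, current))
+}
+```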
+
+### Relist period
+
+The shorter the relist period is, the sooner Kubelet can detect the
+change. A shorter relist period also implies higher CPU usage. Moreover, the
+relist latency depends on the underlying container runtime, and usually
+increases as the number of containers/pods grows. We should set a default
+relist period based on measurements. Regardless of what period we set, it will
+likely be significantly shorter than the current pod sync period (10s), i.e.,
+Kubelet will detect container changes sooner.
+
+
+## Impact on the Pod Worker Control Flow
+
+Kubelet is responsible for dispatching an event to the appropriate pod
+worker based on the pod ID. Only one pod worker would be woken up for
+each event.
+
+Today, the pod syncing routine in Kubelet is idempotent as it always
+examines the pod state and the spec, and tries to drive the state to
+match the spec by performing a series of operations. It should be
+noted that this proposal does not intend to change this property --
+the sync pod routine would still perform all necessary checks,
+regardless of the event type. This trades some efficiency for
+reliability and eliminates the need to build a state machine that is
+compatible with different runtimes.
+
+## Leverage Upstream Container Events
+
+Instead of relying on relisting, PLEG can leverage other components which
+provide container events, and translate these events into pod lifecycle
+events. This will further improve Kubelet's responsiveness and reduce the
+resource usage caused by frequent relisting.
+
+The upstream container events can come from:
+
+(1). *Event stream provided by each container runtime*
+
+Docker's API exposes an [event
+stream](https://docs.docker.com/reference/api/docker_remote_api_v1.17/#monitor-docker-s-events).
+rkt, however, does not support this yet, but will eventually support it
+(see [coreos/rkt#1193](https://github.com/coreos/rkt/issues/1193)).
+
+(2). *cgroups event stream by cAdvisor*
+
+cAdvisor is integrated in Kubelet to provide container stats. It watches containers'
+cgroups using inotify and exposes an event stream. Even though it does not
+support rkt yet, it should be straightforward to add such support.
+
+Option (1) may provide richer sets of events, but option (2) has the advantage
+of being more universal across runtimes, as long as the container runtime uses
+cgroups. Regardless of what one chooses to implement now, the container event
+stream should be easily swappable with a clearly defined interface.
+
+Note that we cannot solely rely on the upstream container events due to the
+possibility of missing events. PLEG should relist infrequently to ensure no
+events are missed.
+
+## Generate Expected Events
+
+*This is optional for PLEGs which perform only relisting, but required for
+PLEGs that watch upstream events.*
+
+A pod worker's actions could lead to pod lifecycle events (e.g.,
+create/kill a container), which the worker would not observe until
+later. The pod worker should ignore such events to avoid unnecessary
+work.
+
+For example, assume a pod has two containers, A and B. The worker
+
+ - Creates container A
+ - Receives an event `(ContainerStopped, B)`
+ - Receives an event `(ContainerStarted, A)`
+
+
+The worker should ignore the `(ContainerStarted, A)` event since it is
+expected. Arguably, the worker could process `(ContainerStopped, B)`
+as soon as it receives the event, before observing the creation of
+A. However, it is desirable to wait until the expected event
+`(ContainerStarted, A)` is observed to keep a consistent per-pod view
+at the worker. Therefore, the control flow of a single pod worker
+should adhere to the following rules:
+
+1. Pod worker should process the events sequentially.
+2. Pod worker should not start syncing until it observes the outcome of its own
+ actions in the last sync to maintain a consistent view.
+
+In other words, a pod worker should record the expected events, and
+only wake up to perform the next sync once all expectations are met.
+
+ - Creates container A, records an expected event `(ContainerStarted, A)`
+ - Receives `(ContainerStopped, B)`; stores the event and goes back to sleep.
+ - Receives `(ContainerStarted, A)`; clears the expectation. Proceeds to handle
+ `(ContainerStopped, B)`.
+
+We should set an expiration time for each expected event to prevent the worker
+from being stalled indefinitely by missing events.
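+
+A minimal Go sketch of this bookkeeping follows: the worker records the events
+its own actions should produce, clears them as they are observed, queues
+unexpected events, and only starts the next sync once no (unexpired)
+expectations remain. Concurrency and the actual sync logic are elided, and the
+expiration handling is reduced to a simple deadline check.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+type expectedEvent struct {
+	kind      string // e.g. "ContainerStarted"
+	container string
+	deadline  time.Time
+}
+
+type podWorker struct {
+	expectations []expectedEvent
+	pending      []string // observed events to handle at the next sync
+}
+
+// expect records an event that the worker's own action should eventually produce.
+func (w *podWorker) expect(kind, container string, ttl time.Duration) {
+	w.expectations = append(w.expectations, expectedEvent{kind, container, time.Now().Add(ttl)})
+}
+
+// observe either clears a matching expectation or queues the event for syncing.
+func (w *podWorker) observe(kind, container string) {
+	for i, e := range w.expectations {
+		if e.kind == kind && e.container == container {
+			w.expectations = append(w.expectations[:i], w.expectations[i+1:]...)
+			return // expected outcome of our own action: no extra sync needed
+		}
+	}
+	w.pending = append(w.pending, kind+"/"+container)
+}
+
+// readyToSync reports whether all expectations have been met or have expired.
+func (w *podWorker) readyToSync() bool {
+	now := time.Now()
+	remaining := w.expectations[:0]
+	for _, e := range w.expectations {
+		if now.Before(e.deadline) {
+			remaining = append(remaining, e)
+		}
+	}
+	w.expectations = remaining
+	return len(w.expectations) == 0
+}
+
+func main() {
+	w := &podWorker{}
+	w.expect("ContainerStarted", "A", time.Minute) // worker just created container A
+	w.observe("ContainerStopped", "B")             // unexpected: queue it
+	fmt.Println(w.readyToSync())                   // false, still waiting on A
+	w.observe("ContainerStarted", "A")             // expectation met
+	fmt.Println(w.readyToSync(), w.pending)        // true [ContainerStopped/B]
+}
+```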
+
+## TODOs for v1.2
+
+For v1.2, we will add a generic PLEG which relists periodically, and leave
+adopting container events for future work. We will also *not* implement the
+optimization that generates and filters out expected events to minimize
+redundant syncs.
+
+- Add a generic PLEG using relisting. Modify the container runtime interface
+ to provide all necessary information to detect container state changes
+ in `GetPods()` (#13571).
+
+- Benchmark docker to adjust relisting frequency.
+
+- Fix/adapt features that rely on frequent, periodic pod syncing.
+ * Liveness/Readiness probing: Create a separate probing manager using
+ an explicit container probing period [#10878](https://issues.k8s.io/10878).
+ * Instruct pod workers to set up a wake-up call if syncing failed, so that
+ it can retry.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-lifecycle-event-generator.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/pod-resource-management.md b/contributors/design-proposals/pod-resource-management.md
new file mode 100644
index 00000000..39f939e3
--- /dev/null
+++ b/contributors/design-proposals/pod-resource-management.md
@@ -0,0 +1,416 @@
+# Pod level resource management in Kubelet
+
+**Author**: Buddha Prakash (@dubstack), Vishnu Kannan (@vishh)
+
+**Last Updated**: 06/23/2016
+
+**Status**: Draft Proposal (WIP)
+
+This document proposes a design for introducing pod level resource accounting to Kubernetes, and outlines the implementation and rollout plan.
+
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Pod level resource management in Kubelet](#pod-level-resource-management-in-kubelet)
+ - [Introduction](#introduction)
+ - [Non Goals](#non-goals)
+ - [Motivations](#motivations)
+ - [Design](#design)
+ - [Proposed cgroup hierarchy:](#proposed-cgroup-hierarchy)
+ - [QoS classes](#qos-classes)
+ - [Guaranteed](#guaranteed)
+ - [Burstable](#burstable)
+ - [Best Effort](#best-effort)
+ - [With Systemd](#with-systemd)
+ - [Hierarchy Outline](#hierarchy-outline)
+ - [QoS Policy Design Decisions](#qos-policy-design-decisions)
+ - [Implementation Plan](#implementation-plan)
+ - [Top level Cgroups for QoS tiers](#top-level-cgroups-for-qos-tiers)
+ - [Pod level Cgroup creation and deletion (Docker runtime)](#pod-level-cgroup-creation-and-deletion-docker-runtime)
+ - [Container level cgroups](#container-level-cgroups)
+ - [Rkt runtime](#rkt-runtime)
+ - [Add Pod level metrics to Kubelet's metrics provider](#add-pod-level-metrics-to-kubelets-metrics-provider)
+ - [Rollout Plan](#rollout-plan)
+ - [Implementation Status](#implementation-status)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## Introduction
+
+As of now, [Quality of Service (QoS)](../../docs/design/resource-qos.md) is not enforced at the pod level. Except for pod evictions, none of the other QoS features apply at the pod level.
+To better support QoS, there is a need to add support for pod level resource accounting in Kubernetes.
+
+We propose a unified cgroup hierarchy with pod level cgroups for better resource management. The hierarchy will have top level cgroups for the three QoS classes Guaranteed, Burstable, and BestEffort. Pods (and their containers) belonging to a QoS class will be grouped under these top level QoS cgroups, and all containers in a pod will be nested under the pod cgroup.
+
+The proposed cgroup hierarchy would allow for more efficient resource management and lead to improvements in node reliability.
+This would also allow for significant latency optimizations in terms of pod eviction on nodes with the use of pod level resource usage metrics.
+This document provides a basic outline of how we plan to implement and rollout this feature.
+
+
+## Non Goals
+
+- Pod level disk accounting will not be tackled in this proposal.
+- Pod level resource specification in the Kubernetes API will not be tackled in this proposal.
+
+## Motivations
+
+Kubernetes currently supports container level isolation only and lets users specify resource requests/limits on containers ([Compute Resources](../../docs/design/resources.md)). The `kubelet` creates a cgroup sandbox (via its container runtime) for each container.
+
+
+There are a few shortcomings to the current model.
+ - Existing QoS support does not apply to pods as a whole. Ongoing work to support pod level eviction using QoS requires all containers in a pod to belong to the same class. With pod level cgroups, it is easy to track pod level usage and make eviction decisions.
+ - Infrastructure overhead per pod is currently charged to the node: the cost of setting up and managing the pod sandbox is accounted to the node rather than to the pod. If the pod sandbox is expensive, as in the case of hyper, pod level accounting becomes critical.
+ - For the docker runtime we have containerd-shim, a small library that sits in front of a runtime implementation, allowing it to be reparented to init, handle reattach from the caller, etc. With pod level cgroups, containerd-shim can be charged to the pod instead of the machine.
+ - If a container exits, all its anonymous pages (tmpfs) get accounted to the machine (root). With pod level cgroups, that usage can instead be attributed to the pod.
+ - Let containers share resources: with pod level limits, a pod with a Burstable container and a BestEffort container is classified as a Burstable pod. The BestEffort container is able to consume slack resources not used by the Burstable container, while still being capped by the overall pod level limits.
+
+## Design
+
+High level requirements for the design are as follows:
+ - Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics.
+ - Support multiple cgroup managers - systemd, cgroupfs, etc.
+
+How we intend to achieve these high level goals is covered in greater detail in the Implementation Plan.
+
+We use the following denotations in the sections below:
+
+For the three QoS classes:
+`G ⇒ Guaranteed QoS, Bu ⇒ Burstable QoS, BE ⇒ BestEffort QoS`
+
+For the value specified by the `--qos-memory-overcommitment` flag:
+`qom ⇒ qos-memory-overcommitment`
+
+Currently the Kubelet highly prioritizes resource utilization and thus allows `BE` pods to use as many resources as they want, and in case of OOM the `BE` pods are the first to be killed. We follow this policy because `G` pods often don't use the full amount of resources they request, and by overcommitting the node the `BE` pods are able to utilize these leftover resources. In case of OOM the `BE` pods are evicted by the eviction manager, but there is some latency involved in the pod eviction process, which can be a cause of concern for latency-sensitive servers. On such servers we would want to avoid OOM conditions on the node. Pod level cgroups allow us to restrict the amount of resources available to `BE` pods, so reserving the requested resources for the `G` and `Bu` pods would allow us to avoid invoking the OOM killer.
+
+
+We add a `qos-memory-overcommitment` flag to the kubelet which allows users to configure the percentage of memory overcommitment on the node. The default is 100, so by default we allow complete overcommitment on the node, let the `BE` pods use as much memory as they want, and do not reserve any resources for the `G` and `Bu` pods. As expected, if there is an OOM in such a case we first kill the `BE` pods before the `G` and `Bu` pods.
+On the other hand, a user who wants very predictable tail latency for latency-sensitive servers needs to set `qos-memory-overcommitment` to a low value (preferably 0). In that case memory resources are reserved for the `G` and `Bu` pods, and the `BE` pods are able to use only the leftover memory.
+
+Examples in the next section.
+
+### Proposed cgroup hierarchy:
+
+For the initial implementation we will only support limits for cpu and memory resources.
+
+#### QoS classes
+
+A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
+
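+The classification rules themselves are defined in the [resource QoS](../../docs/design/resource-qos.md) design. Purely as an illustrative sketch (not the authoritative implementation), and assuming that requests default to limits when only limits are specified, the pod level classification used in the examples below behaves roughly like this:
+
+```go
+package sketch
+
+// containerResources holds per-container values in millicores and bytes;
+// zero means "unset". This is a stand-in type for this sketch only.
+type containerResources struct {
+	cpuRequest, cpuLimit       int64
+	memoryRequest, memoryLimit int64
+}
+
+// qosClass sketches pod level QoS classification: Guaranteed (G) if every
+// container has cpu and memory limits and any explicit requests equal those
+// limits, BestEffort (BE) if no container sets any requests or limits, and
+// Burstable (Bu) otherwise.
+func qosClass(containers []containerResources) string {
+	allGuaranteed, allEmpty := true, true
+	for _, c := range containers {
+		if c.cpuRequest != 0 || c.cpuLimit != 0 || c.memoryRequest != 0 || c.memoryLimit != 0 {
+			allEmpty = false
+		}
+		if c.cpuLimit == 0 || c.memoryLimit == 0 ||
+			(c.cpuRequest != 0 && c.cpuRequest != c.cpuLimit) ||
+			(c.memoryRequest != 0 && c.memoryRequest != c.memoryLimit) {
+			allGuaranteed = false
+		}
+	}
+	switch {
+	case allEmpty:
+		return "BE"
+	case allGuaranteed:
+		return "G"
+	default:
+		return "Bu"
+	}
+}
+```
+
+With this sketch, Pod1 and Pod2 below classify as `G`, Pod3 and Pod4 as `Bu`, and Pod5 as `BE`.
+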
+#### Guaranteed
+
+`G` pods will be placed at the `$Root` cgroup by default. `$Root` is the system root, i.e. "/" by default, and if the `--cgroup-root` flag is used then the specified cgroup-root is used as `$Root`. To ensure the Kubelet's idempotent behaviour, we follow a pod cgroup naming format which is opaque and deterministic. For example, a pod with UID `5f9b19c9-3a30-11e6-8eea-28d2444e470d` would have its pod cgroup named `pod-5f9b19c93a3011e6-8eea28d2444e470d`.
+
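+The exact UID-to-name mapping is an implementation detail of the kubelet. Purely to illustrate the example above (and not as the kubelet's actual code), the mapping could be sketched as:
+
+```go
+package sketch
+
+import "strings"
+
+// podCgroupName illustrates the opaque, deterministic naming shown above:
+// the five dash-separated UID groups are collapsed into two and prefixed
+// with "pod-". Illustrative sketch of the example mapping only.
+func podCgroupName(podUID string) string {
+	parts := strings.Split(podUID, "-")
+	if len(parts) != 5 {
+		return "pod-" + strings.Replace(podUID, "-", "", -1)
+	}
+	return "pod-" + strings.Join(parts[:3], "") + "-" + strings.Join(parts[3:], "")
+}
+
+// podCgroupName("5f9b19c9-3a30-11e6-8eea-28d2444e470d")
+// returns "pod-5f9b19c93a3011e6-8eea28d2444e470d".
+```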
+
+__Note__: The `--cgroup-root` flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of the QoS cgroup hierarchy rather than the root of containers.
+
+```
+/PodUID/cpu.quota = cpu limit of Pod
+/PodUID/cpu.shares = cpu request of Pod
+/PodUID/memory.limit_in_bytes = memory limit of Pod
+```
+
+Example:
+We have two pods, Pod1 and Pod2, with the Pod Specs given below:
+
+```yaml
+kind: Pod
+metadata:
+  name: Pod1
+spec:
+  containers:
+  - name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+  - name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 2Gi
+```
+
+```yaml
+kind: Pod
+metadata:
+  name: Pod2
+spec:
+  containers:
+  - name: foo
+    resources:
+      limits:
+        cpu: 20m
+        memory: 2Gi
+```
+
+Pod1 and Pod2 are both classified as `G` and are nested under the `Root` cgroup.
+
+```
+/Pod1/cpu.quota = 110m
+/Pod1/cpu.shares = 110m
+/Pod2/cpu.quota = 20m
+/Pod2/cpu.shares = 20m
+/Pod1/memory.limit_in_bytes = 3Gi
+/Pod2/memory.limit_in_bytes = 2Gi
+```
+
+#### Burstable
+
+We have the following resource parameters for the `Bu` cgroup.
+
+```
+/Bu/cpu.shares = summation of cpu requests of all Bu pods
+/Bu/PodUID/cpu.quota = Pod Cpu Limit
+/Bu/PodUID/cpu.shares = Pod Cpu Request
+/Bu/memory.limit_in_bytes = Allocatable - {(summation of memory requests/limits of `G` pods)*(1-qom/100)}
+/Bu/PodUID/memory.limit_in_bytes = Pod memory limit
+```
+
+__Note__: For the `Bu` QoS class, when limits are not specified for any one of the containers, the pod limit defaults to the node allocatable resource quantity.
+
+Example:
+We have two pods, Pod3 and Pod4, with the Pod Specs given below:
+
+```yaml
+kind: Pod
+metadata:
+  name: Pod3
+spec:
+  containers:
+  - name: foo
+    resources:
+      limits:
+        cpu: 50m
+        memory: 2Gi
+      requests:
+        cpu: 20m
+        memory: 1Gi
+  - name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 1Gi
+```
+
+```yaml
+kind: Pod
+metadata:
+  name: Pod4
+spec:
+  containers:
+  - name: foo
+    resources:
+      limits:
+        cpu: 20m
+        memory: 2Gi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+```
+
+Pod3 and Pod4 are both classified as `Bu` and are hence nested under the `Bu` cgroup.
+For `qom` = 0:
+
+```
+/Bu/cpu.shares = 30m
+/Bu/Pod3/cpu.quota = 150m
+/Bu/Pod3/cpu.shares = 20m
+/Bu/Pod4/cpu.quota = 20m
+/Bu/Pod4/cpu.shares = 10m
+/Bu/memory.limit_in_bytes = Allocatable - 5Gi
+/Bu/Pod3/memory.limit_in_bytes = 3Gi
+/Bu/Pod4/memory.limit_in_bytes = 2Gi
+```
+
+#### Best Effort
+
+For pods belonging to the `BE` QoS we don't set any quota.
+
+```
+/BE/cpu.shares = 2
+/BE/cpu.quota= not set
+/BE/memory.limit_in_bytes = Allocatable - {(summation of memory requests of all `G` and `Bu` pods)*(1-qom/100)}
+/BE/PodUID/memory.limit_in_bytes = no limit
+```
+
+Example:
+We have a pod, Pod5, with the Pod Spec given below:
+
+```yaml
+kind: Pod
+metadata:
+  name: Pod5
+spec:
+  containers:
+  - name: foo
+    resources: {}
+  - name: bar
+    resources: {}
+```
+
+Pod5 is classified as `BE` and is hence nested under the `BE` cgroup.
+For `qom` = 0:
+
+```
+/BE/cpu.shares = 2
+/BE/cpu.quota= not set
+/BE/memory.limit_in_bytes = Allocatable - 7Gi
+/BE/Pod5/memory.limit_in_bytes = no limit
+```
+
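+To make the QoS level memory formulas above concrete, here is a minimal Go sketch of the computation, assuming the allocatable capacity, the per-class request sums, and `qom` are already known (illustrative only, not the kubelet's code):
+
+```go
+package sketch
+
+// buMemoryLimit and beMemoryLimit sketch the formulas above. The allocatable
+// capacity and request sums are in bytes; qom is the value of the
+// --qos-memory-overcommitment flag (0..100).
+func buMemoryLimit(allocatable, guaranteedRequests, qom int64) int64 {
+	// /Bu/memory.limit_in_bytes = Allocatable - (G requests) * (1 - qom/100)
+	return allocatable - guaranteedRequests*(100-qom)/100
+}
+
+func beMemoryLimit(allocatable, guaranteedAndBurstableRequests, qom int64) int64 {
+	// /BE/memory.limit_in_bytes = Allocatable - (G + Bu requests) * (1 - qom/100)
+	return allocatable - guaranteedAndBurstableRequests*(100-qom)/100
+}
+
+// With qom = 0 and the example pods above:
+//   buMemoryLimit(allocatable, 5Gi, 0) == Allocatable - 5Gi
+//   beMemoryLimit(allocatable, 7Gi, 0) == Allocatable - 7Gi
+// With qom = 100 both limits collapse to Allocatable (full overcommitment).
+```
+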
+### With Systemd
+
+On systemd systems we have slices for the three top level QoS classes. Further, each pod is a sub-slice of exactly one of the three QoS slices, and each container in a pod belongs to a scope nested under its qosclass-pod slice.
+
+Example: We plan to have the following cgroup hierarchy on systemd systems
+
+```
+/memory/G-PodUID.slice/containerUID.scope
+/cpu,cpuacct/G-PodUID.slice/containerUID.scope
+/memory/Bu.slice/Bu-PodUID.slice/containerUID.scope
+/cpu,cpuacct/Bu.slice/Bu-PodUID.slice/containerUID.scope
+/memory/BE.slice/BE-PodUID.slice/containerUID.scope
+/cpu,cpuacct/BE.slice/BE-PodUID.slice/containerUID.scope
+```
+
+### Hierarchy Outline
+
+- "$Root" is the system root of the node i.e. "/" by default and if `--cgroup-root` is specified then the specified cgroup-root is used as "$Root".
+- We have a top level QoS cgroup for the `Bu` and `BE` QoS classes.
+- But we __dont__ have a separate cgroup for the `G` QoS class. `G` pod cgroups are brought up directly under the `Root` cgroup.
+- Each pod has its own cgroup which is nested under the cgroup matching the pod's QoS class.
+- All containers brought up by the pod are nested under the pod's cgroup.
+- system-reserved cgroup contains the system specific processes.
+- kube-reserved cgroup contains the kubelet specific daemons.
+
+```
+$ROOT
+ |
+ +- Pod1
+ | |
+ | +- Container1
+ | +- Container2
+ | ...
+ +- Pod2
+ | +- Container3
+ | ...
+ +- ...
+ |
+ +- Bu
+ | |
+ | +- Pod3
+ | | |
+ | | +- Container4
+ | | ...
+ | +- Pod4
+ | | +- Container5
+ | | ...
+ | +- ...
+ |
+ +- BE
+ | |
+ | +- Pod5
+ | | |
+ | | +- Container6
+ | | +- Container7
+ | | ...
+ | +- ...
+ |
+ +- System-reserved
+ | |
+ | +- system
+ | +- docker (optional)
+ | +- ...
+ |
+ +- Kube-reserved
+ | |
+ | +- kubelet
+ | +- docker (optional)
+ | +- ...
+ |
+```
+
+#### QoS Policy Design Decisions
+
+- This hierarchy highly prioritizes resource guarantees to the `G` over `Bu` and `BE` pods.
+- By not having a separate cgroup for the `G` class, the hierarchy allows the `G` pods to burst and utilize all of the node's Allocatable capacity.
+- The `BE` and `Bu` pods are strictly restricted from bursting and hogging resources, and thus `G` pods are guaranteed resource isolation.
+- `BE` pods are treated as lowest priority, so for the `BE` QoS cgroup we set cpu shares to the lowest possible value, i.e. 2. This ensures that the `BE` containers get a relatively small share of cpu time.
+- We don't set any quota on the cpu resources, as the containers in the `BE` pods can use any amount of free resources on the node.
+- Having the memory limit of the `BE` cgroup be (Allocatable - summation of memory requests of `G` and `Bu` pods) makes `BE` pods more susceptible to being OOM killed. As more `G` and `Bu` pods are scheduled, the kubelet will be more likely to kill `BE` pods, even if the `G` and `Bu` pods are using less than their requests, since we will be dynamically reducing the `BE` cgroup's memory.limit_in_bytes. But this allows for better memory guarantees to the `G` and `Bu` pods.
+
+## Implementation Plan
+
+The implementation plan is outlined in the next sections.
+We will have an `experimental-cgroups-per-qos` flag to specify whether the user wants to use the QoS based cgroup hierarchy. The flag will be set to false by default, at least in v1.5.
+
+#### Top level Cgroups for QoS tiers
+
+Two top level cgroups, for the `Bu` and `BE` QoS classes, are created when the Kubelet starts to run on a node. All `G` pod cgroups are by default nested under the `Root`, so we don't create a top level cgroup for the `G` class. For raw cgroup systems we will use libcontainer's cgroup manager for general cgroup management (cgroup creation/destruction). For systemd we don't yet have equivalent support for slice management in libcontainer, so we will add that support in the Kubelet. These cgroups are created only once, on Kubelet initialization, as a part of node setup. On systemd these cgroups are transient units and will not survive a reboot.
+
+#### Pod level Cgroup creation and deletion (Docker runtime)
+
+- When a new pod is brought up, its QoS class is first determined.
+- We add an interface to the Kubelet’s ContainerManager to create and delete pod level cgroups under the cgroup that matches the pod’s QoS class (a sketch of such an interface follows this list).
+- This interface will be pluggable. The Kubelet will support both systemd and raw cgroups based __cgroup__ drivers. We will use the `--cgroup-driver` flag proposed in the [Systemd Node Spec](kubelet-systemd.md) to specify the cgroup driver.
+- We inject creation and deletion of pod level cgroups into the pod workers.
+- As new pods are added, the QoS class cgroup parameters are updated to match the pods' resource requests.
+
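+A hypothetical sketch of such an interface is shown below; the method names and signatures are illustrative only and are not the final API (the actual interface is introduced by the implementation PRs listed under Implementation Status):
+
+```go
+package sketch
+
+// Pod stands in for the kubelet's internal pod object in this sketch.
+type Pod struct {
+	UID      string
+	QOSClass string // "G", "Bu" or "BE"
+}
+
+// PodContainerManager is a hypothetical sketch of the per-pod cgroup
+// interface described above.
+type PodContainerManager interface {
+	// EnsureExists creates the pod level cgroup under the cgroup matching
+	// the pod's QoS class, if it does not already exist.
+	EnsureExists(pod *Pod) error
+	// Exists reports whether the pod level cgroup is already present.
+	Exists(pod *Pod) bool
+	// Destroy removes the pod level cgroup once the pod is terminated.
+	Destroy(pod *Pod) error
+}
+```
+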
+#### Container level cgroups
+
+Have the docker manager create container cgroups under the pod level cgroups. With the docker runtime, we will pass `--cgroup-parent` using the syntax expected by the cgroup driver the runtime was configured to use (see the sketch below).
+
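+As a rough illustration only (names follow the hierarchy shown earlier, and this is a sketch rather than the actual kubelet logic), the `--cgroup-parent` value for a `Bu` or `BE` pod might be derived like this:
+
+```go
+package sketch
+
+import "path"
+
+// podCgroupParent sketches the --cgroup-parent value passed to docker,
+// in the syntax each cgroup driver expects. qosClass is "Bu" or "BE" and
+// podCgroup is the pod's cgroup name (e.g. "pod-<uid>"). Illustrative only.
+func podCgroupParent(cgroupDriver, qosClass, podCgroup string) string {
+	if cgroupDriver == "systemd" {
+		// systemd expects a slice name, e.g. "Bu-pod<uid>.slice"
+		return qosClass + "-" + podCgroup + ".slice"
+	}
+	// cgroupfs expects a filesystem-style path, e.g. "/Bu/pod-<uid>"
+	return path.Join("/", qosClass, podCgroup)
+}
+```
+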
+#### Rkt runtime
+
+We want rkt to create pods under a root QoS class cgroup that the kubelet specifies, and to set the pod level cgroup parameters mentioned in this proposal by itself.
+
+#### Add Pod level metrics to Kubelet's metrics provider
+
+Update Kubelet’s metrics provider to include Pod level metrics. Use cAdvisor's cgroup subsystem information to determine various Pod level usage metrics.
+
+__Note__: Changes to cAdvisor might be necessary.
+
+## Rollout Plan
+
+This feature will be opt-in in v1.4 and opt-out in v1.5. We recommend that users drain their nodes and opt in before switching to v1.5, so that rolling out the v1.5 kubelet becomes a no-op.
+
+## Implementation Status
+
+The implementation goals of the first milestone are outlined below.
+- [x] Finalize and submit Pod Resource Management proposal for the project #26751
+- [x] Refactor qos package to be used globally throughout the codebase #27749 #28093
+- [x] Add interfaces for CgroupManager and CgroupManagerImpl which implements the CgroupManager interface and creates, destroys/updates cgroups using the libcontainer cgroupfs driver. #27755 #28566
+- [x] Inject top level QoS Cgroup creation in the Kubelet and add e2e tests to test that behaviour. #27853
+- [x] Add PodContainerManagerImpl Create and Destroy methods which implements the respective PodContainerManager methods using a cgroupfs driver. #28017
+- [x] Have docker manager create container cgroups under pod level cgroups. Inject creation and deletion of pod cgroups into the pod workers. Add e2e tests to test this behaviour. #29049
+- [x] Add support for updating policy for the pod cgroups. Add e2e tests to test this behaviour. #29087
+- [ ] Enabling the `cgroup-per-qos` flag in the Kubelet: The user is expected to drain the node and restart it before enabling this feature, but as a fallback we also want to allow the user to just restart the kubelet with the `cgroup-per-qos` flag enabled to use this feature. As a part of this we need to figure out a policy for pods with restart policy `Never`. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29946).
+- [ ] Removing a terminated pod's cgroup: We need to clean up the pod's cgroup once the pod is terminated. More details in this [issue](https://github.com/kubernetes/kubernetes/issues/29927).
+- [ ] Kubelet needs to ensure that the cgroup settings are what the kubelet expects them to be. If security is not a concern, one can assume that once the kubelet applies cgroup settings successfully, the values will never change unless the kubelet changes them. If security is a concern, then the kubelet will have to ensure that the cgroup values meet its requirements and then continue to watch for updates to cgroups via inotify and re-apply cgroup values if necessary.
+Updating QoS limits needs to happen before pod cgroup values are updated. When pod cgroups are being deleted, QoS limits have to be updated after the pod cgroup values have been updated for deletion or the pod cgroups have been removed. Given that the kubelet doesn't have any checkpoints and updates to QoS and pod cgroups are not atomic, the kubelet needs to reconcile cgroup status whenever it restarts to ensure that the cgroup values match the kubelet's expectations.
+- [ ] [TEST] Opting in to this feature and rolling back should be accompanied by detailed error messages when pods are killed intermittently.
+- [ ] Add a systemd implementation for Cgroup Manager interface
+
+
+Other smaller work items that would be good to have before the release of this feature:
+- [ ] Add Pod UID to the downward api which will help simplify the e2e testing logic.
+- [ ] Check if parent cgroups exist and error out if they don’t.
+- [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
+- [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
+- [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
+- [ ] Log basic cgroup management metrics, i.e. creation/deletion metrics.
+
+
+To better support our requirements, we needed to make some changes to, and add features to, libcontainer as well:
+
+- [x] Allowing or denying all devices by writing 'a' to devices.allow or devices.deny is
+not possible once the devices cgroup has children. Libcontainer didn’t have the option of skipping updates on the parent devices cgroup. opencontainers/runc/pull/958
+- [x] To use libcontainer for creating and managing cgroups in the Kubelet, we want to be able to create a cgroup with no pid attached and, if need be, apply a pid to the cgroup later on. But libcontainer did not support cgroup creation without attaching a pid. opencontainers/runc/pull/956
+
+
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-resource-management.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/pod-security-context.md b/contributors/design-proposals/pod-security-context.md
new file mode 100644
index 00000000..bfaffa59
--- /dev/null
+++ b/contributors/design-proposals/pod-security-context.md
@@ -0,0 +1,374 @@
+## Abstract
+
+A proposal for refactoring `SecurityContext` to have pod-level and container-level attributes in
+order to correctly model pod- and container-level security concerns.
+
+## Motivation
+
+Currently, containers have a `SecurityContext` attribute which contains information about the
+security settings the container uses. In practice, many of these attributes are uniform across all
+containers in a pod. Simultaneously, there is also a need to apply the security context pattern
+at the pod level to correctly model security attributes that apply only at a pod level.
+
+Users should be able to:
+
+1. Express security settings that are applicable to the entire pod
+2. Express base security settings that apply to all containers
+3. Override only the settings that need to be differentiated from the base in individual
+ containers
+
+This proposal is a dependency for other changes related to security context:
+
+1. [Volume ownership management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/12944)
+2. [Generic SELinux label management in the Kubelet](https://github.com/kubernetes/kubernetes/pull/14192)
+
+Goals of this design:
+
+1. Describe the use cases for which a pod-level security context is necessary
+2. Thoroughly describe the API backward compatibility issues that arise from the introduction of
+ a pod-level security context
+3. Describe all implementation changes necessary for the feature
+
+## Constraints and assumptions
+
+1. We will not design for intra-pod security; we are not currently concerned about isolating
+ containers in the same pod from one another
+1. We will design for backward compatibility with the current V1 API
+
+## Use Cases
+
+1. As a developer, I want to correctly model security attributes which belong to an entire pod
+2. As a user, I want to be able to specify container attributes that apply to all containers
+ without repeating myself
+3. As an existing user, I want to be able to use the existing container-level security API
+
+### Use Case: Pod level security attributes
+
+Some security attributes make sense only to model at the pod level. For example, it is a
+fundamental property of pods that all containers in a pod share the same network namespace.
+Therefore, using the host namespace makes sense to model at the pod level only, and indeed, today
+it is part of the `PodSpec`. Other host namespace support is currently being added and these will
+also be pod-level settings; it makes sense to model them as a pod-level collection of security
+attributes.
+
+### Use Case: Override pod security context for container
+
+Some use cases require the containers in a pod to run with different security settings. As an
+example, a user may want to have a pod with two containers, one of which runs as root with the
+privileged setting, and one that runs as a non-root UID. To support use cases like this, it should
+be possible to override appropriate (i.e., not intrinsically pod-level) security settings for
+individual containers.
+
+## Proposed Design
+
+### SecurityContext
+
+For posterity and ease of reading, note the current state of `SecurityContext`:
+
+```go
+package api
+
+type Container struct {
+ // Other fields omitted
+
+ // Optional: SecurityContext defines the security options the pod should be run with
+ SecurityContext *SecurityContext `json:"securityContext,omitempty"`
+}
+
+type SecurityContext struct {
+ // Capabilities are the capabilities to add/drop when running the container
+ Capabilities *Capabilities `json:"capabilities,omitempty"`
+
+ // Run the container in privileged mode
+ Privileged *bool `json:"privileged,omitempty"`
+
+ // SELinuxOptions are the labels to be applied to the container
+ // and volumes
+ SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+
+ // RunAsUser is the UID to run the entrypoint of the container process.
+ RunAsUser *int64 `json:"runAsUser,omitempty"`
+
+ // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser
+ // field is not explicitly set then the kubelet may check the image for a specified user or
+ // perform defaulting to specify a user.
+ RunAsNonRoot bool `json:"runAsNonRoot,omitempty"`
+}
+
+// SELinuxOptions contains the fields that make up the SELinux context of a container.
+type SELinuxOptions struct {
+ // SELinux user label
+ User string `json:"user,omitempty"`
+
+ // SELinux role label
+ Role string `json:"role,omitempty"`
+
+ // SELinux type label
+ Type string `json:"type,omitempty"`
+
+ // SELinux level label.
+ Level string `json:"level,omitempty"`
+}
+```
+
+### PodSecurityContext
+
+`PodSecurityContext` specifies two types of security attributes:
+
+1. Attributes that apply to the pod itself
+2. Attributes that apply to the containers of the pod
+
+In the internal API, fields of the `PodSpec` controlling the use of the host PID, IPC, and network
+namespaces are relocated to this type:
+
+```go
+package api
+
+type PodSpec struct {
+ // Other fields omitted
+
+ // Optional: SecurityContext specifies pod-level attributes and container security attributes
+ // that apply to all containers.
+ SecurityContext *PodSecurityContext `json:"securityContext,omitempty"`
+}
+
+// PodSecurityContext specifies security attributes of the pod and container attributes that apply
+// to all containers of the pod.
+type PodSecurityContext struct {
+ // Use the host's network namespace. If this option is set, the ports that will be
+ // used must be specified.
+ // Optional: Default to false.
+ HostNetwork bool
+ // Use the host's IPC namespace
+ HostIPC bool
+
+ // Use the host's PID namespace
+ HostPID bool
+
+ // Capabilities are the capabilities to add/drop when running containers
+ Capabilities *Capabilities `json:"capabilities,omitempty"`
+
+ // Run the container in privileged mode
+ Privileged *bool `json:"privileged,omitempty"`
+
+ // SELinuxOptions are the labels to be applied to the container
+ // and volumes
+ SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+
+ // RunAsUser is the UID to run the entrypoint of the container process.
+ RunAsUser *int64 `json:"runAsUser,omitempty"`
+
+ // RunAsNonRoot indicates that the container should be run as a non-root user. If the RunAsUser
+ // field is not explicitly set then the kubelet may check the image for a specified user or
+ // perform defaulting to specify a user.
+ RunAsNonRoot bool
+}
+
+// Comments and generated docs will change for the container.SecurityContext field to indicate
+// the precedence of these fields over the pod-level ones.
+
+type Container struct {
+ // Other fields omitted
+
+ // Optional: SecurityContext defines the security options the pod should be run with.
+ // Settings specified in this field take precedence over the settings defined in
+ // pod.Spec.SecurityContext.
+ SecurityContext *SecurityContext `json:"securityContext,omitempty"`
+}
+```
+
+In the V1 API, the pod-level security attributes which are currently fields of the `PodSpec` are
+retained on the `PodSpec` for backward compatibility purposes:
+
+```go
+package v1
+
+type PodSpec struct {
+ // Other fields omitted
+
+ // Use the host's network namespace. If this option is set, the ports that will be
+ // used must be specified.
+ // Optional: Default to false.
+ HostNetwork bool `json:"hostNetwork,omitempty"`
+ // Use the host's pid namespace.
+ // Optional: Default to false.
+ HostPID bool `json:"hostPID,omitempty"`
+ // Use the host's ipc namespace.
+ // Optional: Default to false.
+ HostIPC bool `json:"hostIPC,omitempty"`
+
+ // Optional: SecurityContext specifies pod-level attributes and container security attributes
+ // that apply to all containers.
+ SecurityContext *PodSecurityContext `json:"securityContext,omitempty"`
+}
+```
+
+The `pod.Spec.SecurityContext` specifies the security context of all containers in the pod.
+The containers' `securityContext` field is overlaid on the base security context to determine the
+effective security context for the container.
+
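+Using the `SecurityContext` and `PodSecurityContext` types shown earlier, a minimal sketch of that overlay might look like the following (illustrative only, not the kubelet's actual helper; container-level values win whenever they are set):
+
+```go
+package api
+
+// effectiveSecurityContext overlays a container's SecurityContext on the
+// pod-level PodSecurityContext, using the types defined earlier in this
+// document. This is an illustrative sketch only.
+func effectiveSecurityContext(pod *PodSecurityContext, container *SecurityContext) *SecurityContext {
+	effective := &SecurityContext{}
+	if pod != nil {
+		effective.Capabilities = pod.Capabilities
+		effective.Privileged = pod.Privileged
+		effective.SELinuxOptions = pod.SELinuxOptions
+		effective.RunAsUser = pod.RunAsUser
+		effective.RunAsNonRoot = pod.RunAsNonRoot
+	}
+	if container == nil {
+		return effective
+	}
+	if container.Capabilities != nil {
+		effective.Capabilities = container.Capabilities
+	}
+	if container.Privileged != nil {
+		effective.Privileged = container.Privileged
+	}
+	if container.SELinuxOptions != nil {
+		effective.SELinuxOptions = container.SELinuxOptions
+	}
+	if container.RunAsUser != nil {
+		effective.RunAsUser = container.RunAsUser
+	}
+	if container.RunAsNonRoot {
+		// RunAsNonRoot is a plain bool here, so "set" is approximated by "true".
+		effective.RunAsNonRoot = true
+	}
+	return effective
+}
+```
+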
+The new V1 API should be backward compatible with the existing API. Backward compatibility is
+defined as:
+
+> 1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before your change must
+> work the same after your change.
+> 2. Any API call that uses your change must not cause problems (e.g. crash or degrade behavior) when
+> issued against servers that do not include your change.
+> 3. It must be possible to round-trip your change (convert to different API versions and back) with
+> no loss of information.
+
+Previous versions of this proposal attempted to deal with backward compatibility by defining
+the effect of setting the pod-level fields on the container-level fields. While trying to find
+consensus on this design, it became apparent that this approach was going to be extremely complex
+to implement, explain, and support. Instead, we will approach backward compatibility as follows:
+
+1. Pod-level and container-level settings will not affect one another
+2. Old clients will be able to use container-level settings in the exact same way
+3. Container level settings always override pod-level settings if they are set
+
+#### Examples
+
+1. Old client using `pod.Spec.Containers[x].SecurityContext`
+
+ An old client creates a pod:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ containers:
+ - name: a
+ securityContext:
+ runAsUser: 1001
+ - name: b
+ securityContext:
+ runAsUser: 1002
+ ```
+
+ looks to old clients like:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ containers:
+ - name: a
+ securityContext:
+ runAsUser: 1001
+ - name: b
+ securityContext:
+ runAsUser: 1002
+ ```
+
+ looks to new clients like:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ containers:
+ - name: a
+ securityContext:
+ runAsUser: 1001
+ - name: b
+ securityContext:
+ runAsUser: 1002
+ ```
+
+2. New client using `pod.Spec.SecurityContext`
+
+ A new client creates a pod using a field of `pod.Spec.SecurityContext`:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ securityContext:
+ runAsUser: 1001
+ containers:
+ - name: a
+ - name: b
+ ```
+
+ appears to new clients as:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ securityContext:
+ runAsUser: 1001
+ containers:
+ - name: a
+ - name: b
+ ```
+
+ old clients will see:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ containers:
+ - name: a
+ - name: b
+ ```
+
+3. Pods created using `pod.Spec.SecurityContext` and `pod.Spec.Containers[x].SecurityContext`
+
+ If a field is set in both `pod.Spec.SecurityContext` and
+ `pod.Spec.Containers[x].SecurityContext`, the value in `pod.Spec.Containers[x].SecurityContext`
+ wins. In the following pod:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+ name: test-pod
+ spec:
+ securityContext:
+ runAsUser: 1001
+ containers:
+ - name: a
+ securityContext:
+ runAsUser: 1002
+ - name: b
+ ```
+
+ The effective setting for `runAsUser` for container A is `1002`.
+
+#### Testing
+
+A backward compatibility test suite will be established for the v1 API. The test suite will
+verify compatibility by converting objects into the internal API, back to the versioned API, and
+examining the results.
+
+All of the examples here will be used as test-cases. As more test cases are added, the proposal will
+be updated.
+
+An example of a test like this can be found in the
+[OpenShift API package](https://github.com/openshift/origin/blob/master/pkg/api/compatibility_test.go)
+
+E2E test cases will be added to test the correct determination of the security context for containers.
+
+### Kubelet changes
+
+1. The Kubelet will use the new fields on the `PodSecurityContext` for host namespace control
+2. The Kubelet will be modified to correctly implement the backward compatibility and effective
+ security context determination defined here
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-security-context.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/podaffinity.md b/contributors/design-proposals/podaffinity.md
new file mode 100644
index 00000000..9291b8b9
--- /dev/null
+++ b/contributors/design-proposals/podaffinity.md
@@ -0,0 +1,673 @@
+# Inter-pod topological affinity and anti-affinity
+
+## Introduction
+
+NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.
+
+This document describes a proposal for specifying and implementing inter-pod
+topological affinity and anti-affinity. By that we mean: rules that specify that
+certain pods should be placed in the same topological domain (e.g. same node,
+same rack, same zone, same power domain, etc.) as some other pods, or,
+conversely, should *not* be placed in the same topological domain as some other
+pods.
+
+Here are a few example rules; we explain how to express them using the API
+described in this doc later, in the section "Examples."
+* Affinity
+ * Co-locate the pods from a particular service or Job in the same availability
+zone, without specifying which zone that should be.
+ * Co-locate the pods from service S1 with pods from service S2 because S1 uses
+S2 and thus it is useful to minimize the network latency between them.
+Co-location might mean same nodes and/or same availability zone.
+* Anti-affinity
+ * Spread the pods of a service across nodes and/or availability zones, e.g. to
+reduce correlated failures.
+ * Give a pod "exclusive" access to a node to guarantee resource isolation --
+it must never share the node with other pods.
+ * Don't schedule the pods of a particular service on the same nodes as pods of
+another service that are known to interfere with the performance of the pods of
+the first service.
+
+For both affinity and anti-affinity, there are three variants. Two variants have
+the property of requiring the affinity/anti-affinity to be satisfied for the pod
+to be allowed to schedule onto a node; the difference between them is that if
+the condition ceases to be met later on at runtime, for one of them the system
+will try to eventually evict the pod, while for the other the system may not try
+to do so. The third variant simply provides scheduling-time *hints* that the
+scheduler will try to satisfy but may not be able to. These three variants are
+directly analogous to the three variants of [node affinity](nodeaffinity.md).
+
+Note that this proposal is only about *inter-pod* topological affinity and
+anti-affinity. There are other forms of topological affinity and anti-affinity.
+For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
+that a set of pods all be scheduled in some specific zone Z. Node affinity is
+not capable of expressing inter-pod dependencies, and conversely the API we
+describe in this document is not capable of expressing node affinity rules. For
+simplicity, we will use the terms "affinity" and "anti-affinity" to mean
+"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
+respectively, in the remainder of this document.
+
+## API
+
+We will add one field to `PodSpec`
+
+```go
+Affinity *Affinity `json:"affinity,omitempty"`
+```
+
+The `Affinity` type is defined as follows
+
+```go
+type Affinity struct {
+ PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
+ PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
+}
+
+type PodAffinity struct {
+ // If the affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a pod label update), the
+ // system will try to eventually evict the pod from its node.
+ // When there are multiple elements, the lists of nodes corresponding to each
+ // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
+ RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
+ // If the affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a pod label update), the
+ // system may or may not try to eventually evict the pod from its node.
+ // When there are multiple elements, the lists of nodes corresponding to each
+ // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
+ RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
+ // The scheduler will prefer to schedule pods to nodes that satisfy
+ // the affinity expressions specified by this field, but it may choose
+ // a node that violates one or more of the expressions. The node that is
+ // most preferred is the one with the greatest sum of weights, i.e.
+ // for each node that meets all of the scheduling requirements (resource
+ // request, RequiredDuringScheduling affinity expressions, etc.),
+ // compute a sum by iterating through the elements of this field and adding
+ // "weight" to the sum if the node matches the corresponding MatchExpressions; the
+ // node(s) with the highest sum are the most preferred.
+ PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
+}
+
+type PodAntiAffinity struct {
+ // If the anti-affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the anti-affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a pod label update), the
+ // system will try to eventually evict the pod from its node.
+ // When there are multiple elements, the lists of nodes corresponding to each
+ // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
+ RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
+ // If the anti-affinity requirements specified by this field are not met at
+ // scheduling time, the pod will not be scheduled onto the node.
+ // If the anti-affinity requirements specified by this field cease to be met
+ // at some point during pod execution (e.g. due to a pod label update), the
+ // system may or may not try to eventually evict the pod from its node.
+ // When there are multiple elements, the lists of nodes corresponding to each
+ // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
+ RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
+ // The scheduler will prefer to schedule pods to nodes that satisfy
+ // the anti-affinity expressions specified by this field, but it may choose
+ // a node that violates one or more of the expressions. The node that is
+ // most preferred is the one with the greatest sum of weights, i.e.
+ // for each node that meets all of the scheduling requirements (resource
+ // request, RequiredDuringScheduling anti-affinity expressions, etc.),
+ // compute a sum by iterating through the elements of this field and adding
+ // "weight" to the sum if the node matches the corresponding MatchExpressions; the
+ // node(s) with the highest sum are the most preferred.
+ PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
+}
+
+type WeightedPodAffinityTerm struct {
+ // weight is in the range 1-100
+ Weight int `json:"weight"`
+ PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"`
+}
+
+type PodAffinityTerm struct {
+ LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
+ // namespaces specifies which namespaces the LabelSelector applies to (matches against);
+ // nil list means "this pod's namespace," empty list means "all namespaces"
+ // The json tag here is not "omitempty" since we need to distinguish nil and empty.
+ // See https://golang.org/pkg/encoding/json/#Marshal for more details.
+	Namespaces []api.Namespace `json:"namespaces"`
+ // empty topology key is interpreted by the scheduler as "all topologies"
+ TopologyKey string `json:"topologyKey,omitempty"`
+}
+```
+
+Note that the `Namespaces` field is necessary because normal `LabelSelector` is
+scoped to the pod's namespace, but we need to be able to match against all pods
+globally.
+
+To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
+`Affinity` that is configured as follows (note that we've omitted and collapsed
+some fields for simplicity, but this should sufficiently convey the intent of
+the design):
+
+```go
+PodAffinity {
+ RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
+ PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
+}
+PodAntiAffinity {
+ RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
+ PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
+}
+```
+
+Then when scheduling pod P, the scheduler:
+* Can only schedule P onto nodes that are running pods that satisfy `P1`.
+(Assumes all nodes have a label with key `node` and value specifying their node
+name.)
+* Should try to schedule P onto zones that are running pods that satisfy `P2`.
+(Assumes all nodes have a label with key `zone` and value specifying their
+zone.)
+* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
+(Assumes all nodes have a label with key `rack` and value specifying their rack
+name.)
+* Should try not to schedule P onto any power domains that are running pods that
+satisfy `P4`. (Assumes all nodes have a label with key `power` and value
+specifying their power domain.)
+
+When `RequiredDuringScheduling` has multiple elements, the requirements are
+ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
+are satisfied for each node, and the node(s) with the highest weight(s) are the
+most preferred.
+
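+A minimal sketch of that scoring is below; `Satisfied` stands in for the real check of a term's `LabelSelector`, `Namespaces`, and `TopologyKey` against a candidate node, and is an assumption of this sketch:
+
+```go
+package sketch
+
+// WeightedTerm mirrors WeightedPodAffinityTerm for this sketch; Satisfied is
+// a placeholder for the real per-node evaluation of the term.
+type WeightedTerm struct {
+	Weight    int
+	Satisfied func(node string) bool
+}
+
+// preferenceScore adds the weight of every satisfied soft term for a node.
+// The node(s) with the highest total are the most preferred.
+func preferenceScore(node string, terms []WeightedTerm) int {
+	score := 0
+	for _, t := range terms {
+		if t.Satisfied(node) {
+			score += t.Weight
+		}
+	}
+	return score
+}
+```
+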
+In reality there are two variants of `RequiredDuringScheduling`: one suffixed
+with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
+For the first variant, if the affinity/anti-affinity ceases to be met at some
+point during pod execution (e.g. due to a pod label update), the system will try
+to eventually evict the pod from its node. In the second variant, the system may
+or may not try to eventually evict the pod from its node.
+
+## A comment on symmetry
+
+One thing that makes affinity and anti-affinity tricky is symmetry.
+
+Imagine a cluster that is running pods from two services, S1 and S2. Imagine
+that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
+run me on nodes that are running pods from S2." It is not sufficient just to
+check that there are no S2 pods on a node when you are scheduling a S1 pod. You
+also need to ensure that there are no S1 pods on a node when you are scheduling
+a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
+Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
+RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
+S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
+anti-affinity rule, then:
+* if a node is empty, you can schedule S1 or S2 onto the node
+* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
+
+Note that while RequiredDuringScheduling anti-affinity is symmetric,
+RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
+have a RequiredDuringScheduling affinity rule "run me on nodes that are running
+pods from S2," it is not required that there be S1 pods on a node in order to
+schedule a S2 pod onto that node. More specifically, if S1 has the
+aforementioned RequiredDuringScheduling affinity rule, then:
+* if a node is empty, you can schedule S2 onto the node
+* if a node is empty, you cannot schedule S1 onto the node
+* if a node is running S2, you can schedule S1 onto the node
+* if a node is running S1+S2 and S1 terminates, S2 continues running
+* if a node is running S1+S2 and S2 terminates, the system terminates S1
+(eventually)
+
+However, although RequiredDuringScheduling affinity is not symmetric, there is
+an implicit PreferredDuringScheduling affinity rule corresponding to every
+RequiredDuringScheduling affinity rule: if the pods of S1 have a
+RequiredDuringScheduling affinity rule "run me on nodes that are running pods
+from S2" then it is not required that there be S1 pods on a node in order to
+schedule a S2 pod onto that node, but it would be better if there are.
+
+PreferredDuringScheduling is symmetric. If the pods of S1 had a
+PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
+are running pods from S2" then we would prefer to keep a S1 pod that we are
+scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
+we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
+S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
+are running pods from S2" then we would prefer to place a S1 pod that we are
+scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
+we are scheduling onto a node that is running a S1 pod.
+
+## Examples
+
+Here are some examples of how you would express various affinity and
+anti-affinity rules using the API we described.
+
+### Affinity
+
+In the examples below, the word "put" is intentionally ambiguous; the rules are
+the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
+put" (PreferredDuringScheduling)--all that changes is which field the rule goes
+into. Also, we only discuss scheduling-time, and ignore the execution-time.
+Finally, some of the examples use "zone" and some use "node," just to make the
+examples more interesting; any of the examples with "zone" will also work for
+"node" if you change the `TopologyKey`, and vice-versa.
+
+* **Put the pod in zone Z**:
+Tricked you! It is not possible to express this using the API described here. For
+this you should use node affinity.
+
+* **Put the pod in a zone that is running at least one pod from service S**:
+`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
+
+* **Put the pod on a node that is already running a pod that requires a license
+for software package P**: Assuming pods that require a license for software
+package P have a label `{key=license, value=P}`:
+`{LabelSelector: "license" In "P", TopologyKey: "node"}`
+
+* **Put this pod in the same zone as other pods from its same service**:
+Assuming pods from this pod's service have some label `{key=service, value=S}`:
+`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
+
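+Written out with the `PodAffinityTerm` type defined earlier, that last rule might look roughly like this (a sketch; the `LabelSelector` literal assumes the standard match-expression fields of the Kubernetes label selector type):
+
+```go
+// Sketch of "put this pod in the same zone as other pods from its service",
+// assuming the standard LabelSelector match-expression types are in scope.
+var sameZoneAsMyService = PodAffinityTerm{
+	LabelSelector: &LabelSelector{
+		MatchExpressions: []LabelSelectorRequirement{
+			{Key: "service", Operator: LabelSelectorOpIn, Values: []string{"S"}},
+		},
+	},
+	// Namespaces is left nil, meaning "this pod's namespace".
+	TopologyKey: "zone",
+}
+```
+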
+This last example illustrates a small issue with this API when it is used with a
+scheduler that processes the pending queue one pod at a time, like the current
+Kubernetes scheduler. The RequiredDuringScheduling rule
+`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
+only "works" once one pod from service S has been scheduled. But if all pods in
+service S have this RequiredDuringScheduling rule in their PodSpec, then the
+RequiredDuringScheduling rule will block the first pod of the service from ever
+scheduling, since it is only allowed to run in a zone with another pod from the
+same service. And of course that means none of the pods of the service will be
+able to schedule. This problem *only* applies to RequiredDuringScheduling
+affinity, not PreferredDuringScheduling affinity or any variant of
+anti-affinity. There are at least three ways to solve this problem:
+* **short-term**: have the scheduler use a rule that if the
+RequiredDuringScheduling affinity requirement matches a pod's own labels, and
+there are no other such pods anywhere, then disregard the requirement. This
+approach has a corner case when running parallel schedulers that are allowed to
+schedule pods from the same replicated set (e.g. a single PodTemplate): both
+schedulers may try to schedule pods from the set at the same time and think
+there are no other pods from that set scheduled yet (e.g. they are trying to
+schedule the first two pods from the set), but by the time the second binding is
+committed, the first one has already been committed, leaving you with two pods
+running that do not respect their RequiredDuringScheduling affinity. There is no
+simple way to detect this "conflict" at scheduling time given the current system
+implementation.
+* **longer-term**: when a controller creates pods from a PodTemplate, for
+exactly *one* of those pods, it should omit any RequiredDuringScheduling
+affinity rules that select the pods of that PodTemplate.
+* **very long-term/speculative**: controllers could present the scheduler with a
+group of pods from the same PodTemplate as a single unit. This is similar to the
+first approach described above but avoids the corner case. No special logic is
+needed in the controllers. Moreover, this would allow the scheduler to do proper
+[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
+it could receive an entire gang simultaneously as a single unit.
+
+### Anti-affinity
+
+As with the affinity examples, the examples here can be RequiredDuringScheduling
+or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
+"must not" or as "try not to" depending on whether the rule appears in
+`RequiredDuringScheduling` or `PreferredDuringScheduling`.
+
+* **Spread the pods of this service S across nodes and zones**:
+`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
+{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
+(note that if this is specified as a RequiredDuringScheduling anti-affinity,
+then the first clause is redundant, since the second clause will force the
+scheduler to not put more than one pod from S in the same zone, and thus by
+definition it will not put more than one pod from S on the same node, assuming
+each node is in one zone. This rule is more useful as PreferredDuringScheduling
+anti-affinity, e.g. one might expect it to be common in
+[Cluster Federation](../../docs/proposals/federation.md) clusters.)
+
+* **Don't co-locate pods of this service with pods from service "evilService"**:
+`{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}`
+
+* **Don't co-locate pods of this service with any other pods including pods of this service**:
+`{LabelSelector: empty, TopologyKey: "node"}`
+
+* **Don't co-locate pods of this service with any other pods except other pods of this service**:
+Assuming pods from the service have some label `{key=service, value=S}`:
+`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
+Note that this works because `"service" NotIn "S"` matches pods with no key
+"service" as well as pods with key "service" and a corresponding value that is
+not "S."
+
+## Algorithm
+
+An example algorithm a scheduler might use to implement affinity and
+anti-affinity rules is as follows. There are certainly more efficient ways to
+do it; this is just intended to demonstrate that the API's semantics are
+implementable.
+
+Terminology definition: We say a pod P is "feasible" on a node N if P meets all
+of the scheduler predicates for scheduling P onto N. Note that this algorithm is
+only concerned about scheduling time, thus it makes no distinction between
+RequiredDuringExecution and IgnoredDuringExecution.
+
+To make the algorithm slightly more readable, we use the term "HardPodAffinity"
+as shorthand for "RequiredDuringScheduling pod affinity" and
+"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
+Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
+
+**TODO: Update this algorithm to take the weight for SoftPod{Affinity,AntiAffinity}
+into account; currently it assumes all terms have weight 1.**
+
+```
+Z = the pod you are scheduling
+{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z
+// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
+X = {Z's PodSpec's HardPodAffinity}
+foreach element H of {X}
+ P = {all pods in the system that match H.LabelSelector}
+ M map[string]int // topology value -> number of pods running on nodes with that topology value
+ foreach pod Q of {P}
+ L = {labels of the node on which Q is running, represented as a map from label key to label value}
+ M[L[H.TopologyKey]]++
+ {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
+// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
+// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
+X = {Z's PodSpec's HardPodAntiAffinity}
+foreach element H of {X}
+ P = {all pods in the system that match H.LabelSelector}
+ M map[string]int // topology value -> number of pods running on nodes with that topology value
+ foreach pod Q of {P}
+ L = {labels of the node on which Q is running, represented as a map from label key to label value}
+ M[L[H.TopologyKey]]++
+ {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
+// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
+foreach node A of {N}
+ foreach pod B that is bound to A
+ if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
+// At this point, all nodes in {N} are feasible for Z.
+// Step 3a: Soft version of Step 1a
+Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
+Initialize the keys of Y to all of the nodes in {N}, and the values to 0
+X = {Z's PodSpec's SoftPodAffinity}
+Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
+// Step 3b: Soft version of Step 1b
+X = {Z's PodSpec's SoftPodAntiAffinity}
+Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
+// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
+foreach node A of {N}
+ foreach pod B that is bound to A
+ increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A
+// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
+// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with
+// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
+```
+
+## Special considerations for RequiredDuringScheduling anti-affinity
+
+In this section we discuss three issues with RequiredDuringScheduling
+anti-affinity: Denial of Service (DoS), co-existing with daemons, and
+determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265)
+for additional discussion of these topics.
+
+### Denial of Service
+
+Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
+can intentionally or unintentionally cause various problems for other pods, due
+to the symmetry property of anti-affinity.
+
+The most notable danger is the ability for a pod that arrives first to some
+topology domain, to block all other pods from scheduling there by stating a
+conflict with all other pods. The standard approach to preventing resource
+hogging is quota, but simple resource quota cannot prevent this scenario because
+the pod may request very little resources. Addressing this using quota requires
+a quota scheme that charges based on "opportunity cost" rather than based simply
+on requested resources. For example, when handling a pod that expresses
+RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
+(i.e. exclusive access to a node), it could charge for the resources of the
+average or largest node in the cluster. Likewise if a pod expresses
+RequiredDuringScheduling anti-affinity for all pods using a "cluster"
+`TopologyKey`, it could charge for the resources of the entire cluster. If node
+affinity is used to constrain the pod to a particular topology domain, then the
+admission-time quota charging should take that into account (e.g. not charge for
+the average/largest machine if the PodSpec constrains the pod to a specific
+machine with a known size; instead charge for the size of the actual machine
+that the pod was constrained to). In all cases once the pod is scheduled, the
+quota charge should be adjusted down to the actual amount of resources allocated
+(e.g. the size of the actual machine that was assigned, not the
+average/largest). If a cluster administrator wants to overcommit quota, for
+example to allow more than N pods across all users to request exclusive node
+access in a cluster with N nodes, then a priority/preemption scheme should be
+added so that the most important pods run when resource demand exceeds supply.
+
+An alternative approach, which is a bit of a blunt hammer, is to use a
+capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
+to trusted users. A more complex capability mechanism might only restrict it
+when using a non-"node" TopologyKey.
+
+Our initial implementation will use a variant of the capability approach, which
+requires no configuration: we will simply reject ALL requests, regardless of
+user, that specify "all namespaces" with non-"node" TopologyKey for
+RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
+case while prohibiting the more dangerous ones.
+
+A weaker variant of the problem described in the previous paragraph is a pod's
+ability to use anti-affinity to degrade the scheduling quality of another pod,
+but not completely block it from scheduling. For example, a set of pods S1 could
+use node affinity to request to schedule onto a set of nodes that some other set
+of pods S2 prefers to schedule onto. If the pods in S1 have
+RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
+S2, then due to the symmetry property of anti-affinity, they can prevent the
+pods in S2 from scheduling onto their preferred nodes if they arrive first (for
+sure in the RequiredDuringScheduling case, and with some probability that
+depends on the weighting scheme for the PreferredDuringScheduling case). A very
+sophisticated priority and/or quota scheme could mitigate this, or alternatively
+we could eliminate the symmetry property of the implementation of
+PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
+anti-affinity could affect scheduling quality of another pod, and as we
+described in the previous paragraph, such pods could be charged quota for the
+full topology domain, thereby reducing the potential for abuse.
+
+We won't try to address this issue in our initial implementation; we can
+consider one of the approaches mentioned above if it turns out to be a problem
+in practice.
+
+### Co-existing with daemons
+
+A cluster administrator may wish to allow pods that express anti-affinity
+against all pods, to nonetheless co-exist with system daemon pods, such as those
+run by DaemonSet. In principle, we would like the specification for
+RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
+more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263)
+for a more detailed explanation of the toleration concept).
+There are at least two ways to accomplish this:
+
+* Scheduler special-cases the namespace(s) where daemons live, in the
+ sense that it ignores pods in those namespaces when it is
+ determining feasibility for pods with anti-affinity. The name(s) of
+ the special namespace(s) could be a scheduler configuration
+ parameter, and default to `kube-system`. We could allow
+ multiple namespaces to be specified if we want cluster admins to be
+ able to give their own daemons this special power (they would add
+ their namespace to the list in the scheduler configuration). And of
+ course this would be symmetric, so daemons could schedule onto a node
+ that is already running a pod with anti-affinity.
+
+* We could add an explicit "toleration" concept/field to allow the
+ user to specify namespaces that are excluded when they use
+ RequiredDuringScheduling anti-affinity, and use an admission
+ controller/defaulter to ensure these namespaces are always listed.
+
+Our initial implementation will use the first approach.
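+
+A sketch of the first approach, with simplified stand-in types, might look
+like the following; the anti-affinity predicate would consult only the
+filtered list of pods:
+
+```go
+// Illustrative helper: drop pods in namespaces the scheduler is configured to
+// ignore (defaulting to kube-system) before evaluating anti-affinity.
+package predicates
+
+type Pod struct {
+	Namespace string
+	Labels    map[string]string
+}
+
+func FilterIgnoredNamespaces(existing []Pod, ignored map[string]bool) []Pod {
+	if ignored == nil {
+		ignored = map[string]bool{"kube-system": true}
+	}
+	out := make([]Pod, 0, len(existing))
+	for _, p := range existing {
+		if !ignored[p.Namespace] {
+			out = append(out, p)
+		}
+	}
+	return out
+}
+```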
+
+### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)
+
+Because anti-affinity is symmetric, in the case of
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
+determine which pod(s) to kill when a pod's labels are updated in such as way as
+to cause them to conflict with one or more other pods'
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
+absence of a priority/preemption scheme, our rule will be that the pod with the
+anti-affinity rule that becomes violated should be the one killed. A pod should
+only specify constraints that apply to namespaces it trusts to not do malicious
+things. Once we have priority/preemption, we can change the rule to say that the
+lowest-priority pod(s) are killed until all
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
+
+## Special considerations for RequiredDuringScheduling affinity
+
+The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
+symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
+conflicting pods, and pods that conflict with P cannot schedule onto the node
+once P has been scheduled there. The design we have described says that the
+symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
+says it can only schedule onto nodes running pod Q, this does not mean Q can
+only run on a node that is running P, but the scheduler will try to schedule Q
+onto a node that is running P (i.e. treats the reverse direction as preferred).
+This raises the same scheduling quality concern as we mentioned at the end of
+the Denial of Service section above, and can be addressed in similar ways.
+
+The nature of affinity (as opposed to anti-affinity) means that there is no
+issue of determining which pod(s) to kill when a pod's labels change: it is
+obviously the pod with the affinity rule that becomes violated that must be
+killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
+"fix" violation an anti-affinity rule.) However, affinity does have a different
+question related to killing: how long should the system wait before declaring
+that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met
+at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q
+is temporarily killed so that it can be updated to a new binary version, should
+that trigger killing of P? More generally, how long should the system wait
+before declaring that P's affinity is violated? (Of course affinity is expressed
+in terms of label selectors, not for a specific pod, but the scenario is easier
+to describe using a concrete pod.) This is closely related to the concept of
+forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)).
+In theory we could make this time duration configurable by the user on a per-pod
+basis, but for the first version of this feature we will make it a configurable
+property of whichever component does the killing and that applies across all pods
+using the feature. Making it configurable by the user would require a nontrivial
+change to the API syntax (since the field would only apply to
+RequiredDuringSchedulingRequiredDuringExecution affinity).
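+
+Conceptually, the component doing the killing would apply a single, shared
+grace period before acting; a minimal sketch (the names are illustrative, not
+a proposed API):
+
+```go
+// Illustrative "forgiveness" check: an affinity violation only triggers
+// eviction after it has persisted longer than a configured grace period
+// shared by all pods using the feature.
+package eviction
+
+import "time"
+
+type AffinityViolation struct {
+	PodName string
+	Since   time.Time
+}
+
+func ShouldEvict(v AffinityViolation, grace time.Duration, now time.Time) bool {
+	return now.Sub(v.Since) > grace
+}
+```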
+
+## Implementation plan
+
+1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
+`PodAntiAffinity` types to the API along with all of their descendant types.
+2. Implement a scheduler predicate that takes
+`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
+account. Include a workaround for the issue described at the end of the Affinity
+section of the Examples section (can't schedule first pod).
+3. Implement a scheduler priority function that takes
+`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
+into account.
+4. Implement an admission controller that rejects requests that specify "all
+namespaces" with a non-"node" TopologyKey for `RequiredDuringScheduling`
+anti-affinity. This admission controller should be enabled by default.
+5. Implement the recommended solution to the "co-existing with daemons" issue.
+6. At this point, the feature can be deployed.
+7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
+and anti-affinity, and make sure the pieces of the system already implemented
+for `RequiredDuringSchedulingIgnoredDuringExecution` also take
+`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
+scheduler predicate, the quota mechanism, the "co-existing with daemons"
+solution).
+8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
+`TopologyKey` to Kubelet's admission decision.
+9. Implement code in Kubelet *or* the controllers that evicts a pod that no
+longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
+then only for the "node" `TopologyKey`; if controller, then potentially for all
+`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+Do so in a way that addresses the "determining which pod(s) to kill" issue.
+
+We assume Kubelet publishes labels describing the node's membership in all of
+the relevant scheduling domains (e.g. node name, rack name, availability zone
+name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
+
+## Backward compatibility
+
+Old versions of the scheduler will ignore `Affinity`.
+
+Users should not start using `Affinity` until the full implementation has been
+in Kubelet and the master for enough binary versions that we feel comfortable
+that we will not need to roll back either Kubelet or master to a version that
+does not support them. Longer-term we will use a programmatic approach to
+enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
+
+## Extensibility
+
+The design described here is the result of careful analysis of use cases, a
+decade of experience with Borg at Google, and a review of similar features in
+other open-source container orchestration systems. We believe that it properly
+balances the goal of expressiveness against the goals of simplicity and
+efficiency of implementation. However, we recognize that use cases may arise in
+the future that cannot be expressed using the syntax described here. Although we
+are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
+for Kubernetes users to get a consistent experience, etc.), the regular
+Kubernetes annotation mechanism can be used to add or replace affinity rules.
+The way this would work is:
+1. Define one or more annotations to describe the new affinity rule(s)
+1. User (or an admission controller) attaches the annotation(s) to pods to
+request the desired scheduling behavior. If the new rule(s) *replace* one or
+more fields of `Affinity` then the user would omit those fields from `Affinity`;
+if they are *additional rules*, then the user would fill in `Affinity` as well
+as the annotation(s).
+1. Scheduler takes the annotation(s) into account when scheduling.
+
+If some particular new syntax becomes popular, we would consider upstreaming it
+by integrating it into the standard `Affinity`.
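+
+As a purely hypothetical example of the annotation-based extension described
+above, an out-of-tree scheduler could read a custom annotation and apply its
+own rule alongside `Affinity`. The annotation key and payload format below
+are invented for illustration:
+
+```go
+// Hypothetical custom affinity rule carried in a pod annotation.
+package extension
+
+import "encoding/json"
+
+const customAffinityAnnotation = "example.com/rack-anti-affinity"
+
+type rackRule struct {
+	MaxPodsPerRack int `json:"maxPodsPerRack"`
+}
+
+// ParseCustomRule extracts the custom rule from a pod's annotations, if present.
+func ParseCustomRule(annotations map[string]string) (*rackRule, error) {
+	raw, ok := annotations[customAffinityAnnotation]
+	if !ok {
+		return nil, nil
+	}
+	var r rackRule
+	if err := json.Unmarshal([]byte(raw), &r); err != nil {
+		return nil, err
+	}
+	return &r, nil
+}
+```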
+
+## Future work and non-work
+
+One can imagine that in the anti-affinity RequiredDuringScheduling case one
+might want to associate a number with the rule, for example "do not allow this
+pod to share a rack with more than three other pods (in total, or from the same
+service as the pod)." We could allow this to be specified by adding an integer
+`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
+However, this flexibility complicates the system and we do not intend to
+implement it.
+
+It is likely that the specification and implementation of pod anti-affinity
+can be unified with [taints and tolerations](taint-toleration-dedicated.md),
+and likewise that the specification and implementation of pod affinity
+can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
+labels would be "inherited" by the node, and pods would only be able to specify
+affinity and anti-affinity for a node's labels. Our main motivation for not
+unifying taints and tolerations with pod anti-affinity is that we foresee taints
+and tolerations as being a concept that only cluster administrators need to
+understand (and indeed in some setups taints and tolerations wouldn't even be
+directly manipulated by a cluster administrator, instead they would only be set
+by an admission controller that is implementing the administrator's high-level
+policy about different classes of special machines and the users who belong to
+the groups allowed to access them). Moreover, the concept of nodes "inheriting"
+labels from pods seems complicated; it seems conceptually simpler to separate
+rules involving relatively static properties of nodes from rules involving which
+other pods are running on the same node or larger topology domain.
+
+Data/storage affinity is related to pod affinity, and is likely to draw on some
+of the ideas we have used for pod affinity. Today, data/storage affinity is
+expressed using node affinity, on the assumption that the pod knows which
+node(s) store(s) the data it wants. But a more flexible approach would allow the
+pod to name the data rather than the node.
+
+## Related issues
+
+The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265).
+
+The topic of affinity/anti-affinity has generated a lot of discussion. The main
+issue is [#367](https://github.com/kubernetes/kubernetes/issues/367)
+but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485),
+[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369),
+[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707),
+[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341),
+[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906)
+all have additional discussion and use cases.
+
+As the examples in this document have demonstrated, topological affinity is very
+useful in clusters that are spread across availability zones, e.g. to co-locate
+pods of a service in the same zone to avoid a wide-area network hop, or to
+spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059),
+[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063),
+and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant.
+
+Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related.
+
+This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816).
+
+## Related work
+
+** TODO: cite references **
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/principles.md b/contributors/design-proposals/principles.md
new file mode 100644
index 00000000..4e0b663c
--- /dev/null
+++ b/contributors/design-proposals/principles.md
@@ -0,0 +1,101 @@
+# Design Principles
+
+Principles to follow when extending Kubernetes.
+
+## API
+
+See also the [API conventions](../devel/api-conventions.md).
+
+* All APIs should be declarative.
+* API objects should be complementary and composable, not opaque wrappers.
+* The control plane should be transparent -- there are no hidden internal APIs.
+* The cost of API operations should be proportional to the number of objects
+intentionally operated upon. Therefore, common filtered lookups must be indexed.
+Beware of patterns of multiple API calls that would incur quadratic behavior.
+* Object status must be 100% reconstructable by observation. Any history kept
+must be just an optimization and not required for correct operation.
+* Cluster-wide invariants are difficult to enforce correctly. Try not to add
+them. If you must have them, don't enforce them atomically in master components;
+that is contention-prone and doesn't provide a recovery path in the case of a
+bug allowing the invariant to be violated. Instead, provide a series of checks
+to reduce the probability of a violation, and make every component involved able
+to recover from an invariant violation.
+* Low-level APIs should be designed for control by higher-level systems.
+Higher-level APIs should be intent-oriented (think SLOs) rather than
+implementation-oriented (think control knobs).
+
+## Control logic
+
+* Functionality must be *level-based*, meaning the system must operate correctly
+given the desired state and the current/observed state, regardless of how many
+intermediate state updates may have been missed. Edge-triggered behavior must be
+just an optimization.
+* Assume an open world: continually verify assumptions and gracefully adapt to
+external events and/or actors. Example: we allow users to kill pods under
+control of a replication controller; it just replaces them.
+* Do not define comprehensive state machines for objects with behaviors
+associated with state transitions and/or "assumed" states that cannot be
+ascertained by observation.
+* Don't assume a component's decisions will not be overridden or rejected, nor
+expect the component to always understand why. For example, etcd may reject writes.
+Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry,
+but back off and/or make alternative decisions.
+* Components should be self-healing. For example, if you must keep some state
+(e.g., a cache), the content needs to be periodically refreshed, so that if an
+item is erroneously stored or a deletion event is missed, etc., it will soon be
+fixed, ideally on timescales that are shorter than what will attract attention
+from humans.
+* Component behavior should degrade gracefully. Prioritize actions so that the
+most important activities can continue to function even when overloaded and/or
+in states of partial failure.
+
+## Architecture
+
+* Only the apiserver should communicate with etcd/store, and not other
+components (scheduler, kubelet, etc.).
+* Compromising a single node shouldn't compromise the cluster.
+* Components should continue to do what they were last told in the absence of
+new instructions (e.g., due to network partition or component outage).
+* All components should keep all relevant state in memory all the time. The
+apiserver should write through to etcd/store, other components should write
+through to the apiserver, and they should watch for updates made by other
+clients.
+* Watch is preferred over polling.
+
+## Extensibility
+
+TODO: pluggability
+
+## Bootstrapping
+
+* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
+* Minimize the number of dependencies, particularly those required for
+steady-state operation.
+* Stratify the dependencies that remain via principled layering.
+* Break any circular dependencies by converting hard dependencies to soft
+dependencies.
+ * Also accept the data needed from other components from another source, such
+as local files, which can be manually populated at bootstrap time and then
+continuously updated once those other components are available.
+ * State should be rediscoverable and/or reconstructable.
+ * Make it easy to run temporary, bootstrap instances of all components in
+order to create the runtime state needed to run the components in the steady
+state; use a lock (master election for distributed components, file lock for
+local components like Kubelet) to coordinate handoff. We call this technique
+"pivoting".
+ * Have a solution to restart dead components. For distributed components,
+replication works well. For local components such as Kubelet, a process manager
+or even a simple shell loop works.
+
+## Availability
+
+TODO
+
+## General principles
+
+* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/protobuf.md b/contributors/design-proposals/protobuf.md
new file mode 100644
index 00000000..6741bbab
--- /dev/null
+++ b/contributors/design-proposals/protobuf.md
@@ -0,0 +1,480 @@
+# Protobuf serialization and internal storage
+
+@smarterclayton
+
+March 2016
+
+## Proposal and Motivation
+
+The Kubernetes API server is a "dumb server" which offers storage, versioning,
+validation, update, and watch semantics on API resources. In a large cluster
+the API server must efficiently retrieve, store, and deliver large numbers
+of coarse-grained objects to many clients. In addition, Kubernetes traffic is
+heavily biased towards intra-cluster traffic - as much as 90% of the requests
+served by the APIs are for internal cluster components like nodes, controllers,
+and proxies. The primary format for intercluster API communication is JSON
+today for ease of client construction.
+
+At the current time, the latency of reaction to change in the cluster is
+dominated by the time required to load objects from persistent store (etcd),
+convert them to an output version, serialize them JSON over the network, and
+then perform the reverse operation in clients. The cost of
+serialization/deserialization and the size of the bytes on the wire, as well
+as the memory garbage created during those operations, dominate the CPU and
+network usage of the API servers.
+
+In order to reach clusters of 10k nodes, we need roughly an order of magnitude
+efficiency improvement in a number of areas of the cluster, starting with the
+masters but also including API clients like controllers, kubelets, and node
+proxies.
+
+We propose to introduce a Protobuf serialization for all common API objects
+that can optionally be used by intra-cluster components. Experiments have
+demonstrated a 10x reduction in CPU use during serialization and deserialization,
+a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount
+of objects created on the heap during serialization. The Protobuf schema
+for each object will be automatically generated from the external API Go structs
+we use to serialize to JSON.
+
+Benchmarking showed that the time spent on the server in a typical GET
+resembles:
+
+    etcd -> decode -> defaulting -> convert to internal -> process
+            JSON:   50us    5us      15us
+            Proto:   5us
+            JSON:  150 allocs        80 allocs
+            Proto: 100 allocs
+
+    process -> convert to external -> encode -> client
+               JSON:   15us            40us
+               Proto:                   5us
+               JSON:   80 allocs      100 allocs
+               Proto:                   4 allocs
+
+ Protobuf has a huge benefit on encoding because it does not need to allocate
+ temporary objects, just one large buffer. Changing to protobuf moves our
+ hotspot back to conversion, not serialization.
+
+
+## Design Points
+
+* Generate Protobuf schema from Go structs (like we do for JSON) to avoid
+ manual schema update and drift
+* Generate Protobuf schema that is field equivalent to the JSON fields (no
+ special types or enumerations), reducing drift for clients across formats.
+* Follow our existing API versioning rules (backwards compatible in major
+ API versions, breaking changes across major versions) by creating one
+ Protobuf schema per API type.
+* Continue to use the existing REST API patterns but offer an alternative
+ serialization, which means existing client and server tooling can remain
+ the same while benefiting from faster decoding.
+* Protobuf objects on disk or in etcd will need to be self identifying at
+ rest, like JSON, in order for backwards compatibility in storage to work,
+ so we must add an envelope with apiVersion and kind to wrap the nested
+ object, and make the data format recognizable to clients.
+* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal
+ operations, allowing us to bypass the expensive reflection used by the
+ golang JSON serialization.
+
+
+## Alternatives
+
+* We considered JSON compression to reduce size on wire, but that does not
+ reduce the amount of memory garbage created during serialization and
+ deserialization.
+* More efficient formats like Msgpack were considered, but they only offer
+ 2x speed up vs. the 10x observed for Protobuf
+* gRPC was considered, but is a larger change that requires more core
+ refactoring. This approach does not eliminate the possibility of switching
+ to gRPC in the future.
+* We considered attempting to improve JSON serialization, but the cost of
+ implementing a more efficient serializer library than ugorji is
+ significantly higher than creating a protobuf schema from our Go structs.
+
+
+## Schema
+
+The Protobuf schema for each API group and version will be generated from
+the objects in that API group and version. The schema will be named using
+the package identifier of the Go package, i.e.
+
+ k8s.io/kubernetes/pkg/api/v1
+
+Each top level object will be generated as a Protobuf message, i.e.:
+
+ type Pod struct { ... }
+
+ message Pod {}
+
+Since the Go structs are designed to be serialized to JSON (with only the
+int, string, bool, map, and array primitive types), we will use the
+canonical JSON serialization as the protobuf field type wherever possible,
+i.e.:
+
+ JSON Protobuf
+ string -> string
+ int -> varint
+ bool -> bool
+ array -> repeating message|primitive
+
+We disallow the use of the Go `int` type in external fields because it is
+ambiguous depending on compiler platform, and instead always use `int32` or
+`int64`.
+
+We will use maps (a protobuf 3 extension that can serialize to protobuf 2)
+to represent JSON maps:
+
+ JSON Protobuf Wire (proto2)
+ map -> map<string, ...> -> repeated Message { key string; value bytes }
+
+We will not convert known string constants to enumerations, since that
+would require extra logic we do not already have in JSON.
+
+To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and
+in the future investigate a Protobuf 3 serialization. We will introduce
+abstractions that let us have more than a single protobuf serialization if
+necessary. Protobuf 3 would require us to support message types for
+pointer primitive (nullable) fields, which is more complex than Protobuf 2's
+support for pointers.
+
+### Example of generated proto IDL
+
+Without gogo extensions:
+
+```
+syntax = 'proto2';
+
+package k8s.io.kubernetes.pkg.api.v1;
+
+import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
+import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
+import "k8s.io/kubernetes/pkg/runtime/generated.proto";
+import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
+
+// Package-wide variables from generator "generated".
+option go_package = "v1";
+
+// Represents a Persistent Disk resource in AWS.
+//
+// An AWS EBS disk must exist before mounting to a container. The disk
+// must also be in the same AWS zone as the kubelet. An AWS EBS disk
+// can only be mounted as read/write once. AWS EBS volumes support
+// ownership management and SELinux relabeling.
+message AWSElasticBlockStoreVolumeSource {
+ // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ optional string volumeID = 1;
+
+ // Filesystem type of the volume that you want to mount.
+ // Tip: Ensure that the filesystem type is supported by the host operating system.
+ // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ // TODO: how do we prevent errors in the filesystem from compromising the machine
+ optional string fsType = 2;
+
+ // The partition in the volume that you want to mount.
+ // If omitted, the default is to mount by volume name.
+ // Examples: For volume /dev/sda1, you specify the partition as "1".
+ // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
+ optional int32 partition = 3;
+
+ // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
+ // If omitted, the default is "false".
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ optional bool readOnly = 4;
+}
+
+// Affinity is a group of affinity scheduling rules, currently
+// only node affinity, but in the future also inter-pod affinity.
+message Affinity {
+ // Describes node affinity scheduling rules for the pod.
+ optional NodeAffinity nodeAffinity = 1;
+}
+```
+
+With extensions:
+
+```
+syntax = 'proto2';
+
+package k8s.io.kubernetes.pkg.api.v1;
+
+import "github.com/gogo/protobuf/gogoproto/gogo.proto";
+import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
+import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
+import "k8s.io/kubernetes/pkg/runtime/generated.proto";
+import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
+
+// Package-wide variables from generator "generated".
+option (gogoproto.marshaler_all) = true;
+option (gogoproto.sizer_all) = true;
+option (gogoproto.unmarshaler_all) = true;
+option (gogoproto.goproto_unrecognized_all) = false;
+option (gogoproto.goproto_enum_prefix_all) = false;
+option (gogoproto.goproto_getters_all) = false;
+option go_package = "v1";
+
+// Represents a Persistent Disk resource in AWS.
+//
+// An AWS EBS disk must exist before mounting to a container. The disk
+// must also be in the same AWS zone as the kubelet. An AWS EBS disk
+// can only be mounted as read/write once. AWS EBS volumes support
+// ownership management and SELinux relabeling.
+message AWSElasticBlockStoreVolumeSource {
+ // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false];
+
+ // Filesystem type of the volume that you want to mount.
+ // Tip: Ensure that the filesystem type is supported by the host operating system.
+ // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ // TODO: how do we prevent errors in the filesystem from compromising the machine
+ optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false];
+
+ // The partition in the volume that you want to mount.
+ // If omitted, the default is to mount by volume name.
+ // Examples: For volume /dev/sda1, you specify the partition as "1".
+ // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
+ optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false];
+
+ // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
+ // If omitted, the default is "false".
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false];
+}
+
+// Affinity is a group of affinity scheduling rules, currently
+// only node affinity, but in the future also inter-pod affinity.
+message Affinity {
+ // Describes node affinity scheduling rules for the pod.
+ optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"];
+}
+```
+
+## Wire format
+
+In order to make Protobuf serialized objects recognizable in a binary form,
+the encoded object must be prefixed by a magic number, and then wrap the
+non-self-describing Protobuf object in a Protobuf object that contains
+schema information. The protobuf object is referred to as the `raw` object
+and the encapsulation is referred to as `wrapper` object.
+
+The simplest serialization is the raw Protobuf object with no identifying
+information. In some use cases, we may wish to have the server identify the
+raw object type on the wire using a protocol dependent format (gRPC uses
+a type HTTP header). This works when all objects are of the same type, but
+we occasionally have reasons to encode different object types in the same
+context (watches, lists of objects on disk, and API calls that may return
+errors).
+
+To identify the type of a wrapped Protobuf object, we wrap it in a message
+in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown`
+having the following schema:
+
+ message Unknown {
+ optional TypeMeta typeMeta = 1;
+ optional bytes value = 2;
+ optional string contentEncoding = 3;
+ optional string contentType = 4;
+ }
+
+ message TypeMeta {
+ optional string apiVersion = 1;
+ optional string kind = 2;
+ }
+
+The `value` field is an encoded protobuf object that matches the schema
+defined in `typeMeta` and has optional `contentType` and `contentEncoding`
+fields. `contentType` and `contentEncoding` have the same meaning as in
+HTTP; if unspecified, `contentType` means "raw protobuf object", and
+`contentEncoding` defaults to no encoding. If `contentEncoding` is
+specified, the defined transformation should be applied to `value` before
+attempting to decode the value.
+
+The `contentType` field is required to support objects without a defined
+protobuf schema, like the ThirdPartyResource or templates. Those objects
+would have to be encoded as JSON or another structure compatible form
+when used with Protobuf. Generic clients must deal with the possibility
+that the returned value is not in the known type.
+
+We add the `contentEncoding` field here to preserve room for future
+optimizations like encryption-at-rest or compression of the nested content.
+Clients should error when receiving an encoding they do not support.
+Negotiating encoding is not defined here, but introducing new encodings
+is similar to introducing a schema change or new API version.
+
+A client should use the `kind` and `apiVersion` fields to identify the
+correct protobuf IDL for that message and version, and then decode the
+`bytes` field into that Protobuf message.
+
+Any Unknown value written to stable storage will be given a 4-byte prefix
+`0x6b, 0x38, 0x73, 0x00`, which corresponds to `k8s` followed by a zero byte.
+The content-type `application/vnd.kubernetes.protobuf` is defined as
+representing the following schema:
+
+ MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
+ UNKNOWN = <protobuf serialization of k8s.io/kubernetes/pkg/runtime#Unknown>
+
+A client should check for the first four bytes, then perform a protobuf
+deserialization of the remaining bytes into the `runtime.Unknown` type.
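+
+A minimal sketch of that check in Go (the decoding of `runtime.Unknown` itself
+would be done by the generated unmarshaller and is omitted here):
+
+```go
+// Recognize the stored protobuf envelope: verify the 4-byte magic prefix and
+// return the remaining bytes, which should decode as runtime.Unknown.
+package storage
+
+import (
+	"bytes"
+	"fmt"
+)
+
+var protoMagic = []byte{0x6b, 0x38, 0x73, 0x00} // "k8s" followed by a zero byte
+
+func SplitEnvelope(data []byte) ([]byte, error) {
+	if !bytes.HasPrefix(data, protoMagic) {
+		return nil, fmt.Errorf("data does not start with the kubernetes protobuf prefix")
+	}
+	return data[len(protoMagic):], nil
+}
+```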
+
+## Streaming wire format
+
+While the majority of Kubernetes APIs return single objects that can vary
+in type (Pod vs. Status, PodList vs. Status), the watch APIs return a stream
+of identical objects (Events). At the time of this writing, this is the only
+current or anticipated streaming RESTful protocol (logging, port-forwarding,
+and exec protocols use a binary protocol over Websockets or SPDY).
+
+In JSON, this API is implemented as a stream of JSON objects that are
+separated by their syntax (the closing `}` brace is followed by whitespace
+and the opening `{` brace starts the next object). There is no formal
+specification covering this pattern, nor a unique content-type. Each object
+is expected to be of type `watch.Event`, and is currently not self describing.
+
+For expediency and consistency, we define a format for Protobuf watch Events
+that is similar. Since protobuf messages are not self describing, we must
+identify the boundaries between Events (a `frame`). We do that by prefixing
+each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
+value N.
+
+    frame  = length body
+    length = 32-bit unsigned integer in big-endian order, denoting the length
+             of body in bytes
+    body   = <bytes>
+
+    # frame containing a single byte 0a
+    frame = 00 00 00 01 0a
+
+    # equivalent JSON
+    frame = {"type": "ADDED", ...}
+
+The body of each frame is a serialized Protobuf message `Event` in package
+`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
+format is `application/vnd.kubernetes.protobuf;type=watch`.
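+
+A sketch of the framing in Go (reader and writer only; encoding and decoding
+of the `Event` message itself is out of scope here):
+
+```go
+// Length-prefixed framing as described above: a 4-byte big-endian unsigned
+// integer holding the body length, followed by the body bytes.
+package framing
+
+import (
+	"encoding/binary"
+	"io"
+)
+
+// WriteFrame writes one frame containing body to w.
+func WriteFrame(w io.Writer, body []byte) error {
+	var length [4]byte
+	binary.BigEndian.PutUint32(length[:], uint32(len(body)))
+	if _, err := w.Write(length[:]); err != nil {
+		return err
+	}
+	_, err := w.Write(body)
+	return err
+}
+
+// ReadFrame reads one frame from r and returns its body.
+func ReadFrame(r io.Reader) ([]byte, error) {
+	var length [4]byte
+	if _, err := io.ReadFull(r, length[:]); err != nil {
+		return nil, err
+	}
+	body := make([]byte, binary.BigEndian.Uint32(length[:]))
+	if _, err := io.ReadFull(r, body); err != nil {
+		return nil, err
+	}
+	return body, nil
+}
+```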
+
+## Negotiation
+
+To allow clients to request protobuf serialization optionally, the `Accept`
+HTTP header is used by callers to indicate which serialization they wish
+returned in the response, and the `Content-Type` header is used to tell the
+server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
+requests). The server will return 406 if the `Accept` header is not
+recognized or 415 if the `Content-Type` is not recognized (as defined in
+RFC2616).
+
+To be backwards compatible, clients must consider that the server may not
+support protobuf serialization. A number of options are possible:
+
+### Preconfigured
+
+Clients can have a configuration setting that instructs them which version
+to use. This is the simplest option, but requires intervention when the
+component upgrades to protobuf.
+
+### Include serialization information in api-discovery
+
+Servers can define the list of content types they accept and return in
+their API discovery docs, and clients can use protobuf if they support it.
+Allows dynamic configuration during upgrade if the client is already using
+API-discovery.
+
+### Optimistically attempt to send and receive requests using protobuf
+
+Using multiple `Accept` values:
+
+ Accept: application/vnd.kubernetes.protobuf, application/json
+
+clients can indicate their preferences and handle the returned
+`Content-Type` using whatever the server responds. On update operations,
+clients can try protobuf and if they receive a 415 error, record that and
+fall back to JSON. Allows the client to be backwards compatible with
+any server, but comes at the cost of some implementation complexity.
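+
+A rough sketch of that client-side behavior (not the real client library; just
+the header handling and the 415 fallback):
+
+```go
+// Optimistic negotiation: prefer protobuf in Accept, send protobuf request
+// bodies, and fall back to JSON after the server rejects protobuf with a 415.
+package negotiate
+
+import "net/http"
+
+const (
+	protobufType = "application/vnd.kubernetes.protobuf"
+	jsonType     = "application/json"
+)
+
+type Codec struct {
+	useJSON bool // flips to true after a 415 response
+}
+
+// Prepare sets Accept and Content-Type on an outgoing request.
+func (c *Codec) Prepare(req *http.Request) {
+	req.Header.Set("Accept", protobufType+", "+jsonType)
+	if c.useJSON {
+		req.Header.Set("Content-Type", jsonType)
+	} else {
+		req.Header.Set("Content-Type", protobufType)
+	}
+}
+
+// Observe records a 415 so future request bodies are encoded as JSON.
+func (c *Codec) Observe(resp *http.Response) {
+	if resp.StatusCode == http.StatusUnsupportedMediaType {
+		c.useJSON = true
+	}
+}
+```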
+
+
+## Generation process
+
+Generation proceeds in five phases:
+
+1. Generate a gogo-protobuf annotated IDL from the source Go struct.
+2. Generate temporary Go structs from the IDL using gogo-protobuf.
+3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf.
+4. Take all tag numbers generated for the IDL and apply them as struct tags
+ to the original Go types.
+5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL.
+
+The output is a `generated.proto` file in each package containing a standard
+proto2 IDL, and a `generated.pb.go` file in each package that contains the
+generated marshal/unmarshallers.
+
+The Go struct generated by gogo-protobuf from the first IDL must be identical
+to the origin struct - a number of changes have been made to gogo-protobuf
+to ensure exact 1-1 conversion. A small number of additions may be necessary
+in the future if we introduce more exotic field types (Go type aliases, maps
+with aliased Go types, and embedded fields were fixed). If they are identical,
+the output marshallers/unmarshallers can then work on the origin struct.
+
+Whenever a new field is added, generation will assign that field a unique tag
+and the 4th phase will write that tag back to the origin Go struct as a `protobuf`
+struct tag. This ensures subsequent generation passes are stable, even in the
+face of internal refactors. The first time a field is added, the author will
+need to check in both the new IDL AND the protobuf struct tag changes.
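+
+For illustration only, the written-back tags on the origin struct could look
+roughly like this (the exact tag contents are whatever the generator emits):
+
+```go
+// Example of protobuf struct tags written back onto an origin Go struct.
+package v1
+
+type AWSElasticBlockStoreVolumeSource struct {
+	VolumeID  string `json:"volumeID" protobuf:"bytes,1,opt,name=volumeID"`
+	FSType    string `json:"fsType,omitempty" protobuf:"bytes,2,opt,name=fsType"`
+	Partition int32  `json:"partition,omitempty" protobuf:"varint,3,opt,name=partition"`
+	ReadOnly  bool   `json:"readOnly,omitempty" protobuf:"varint,4,opt,name=readOnly"`
+}
+```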
+
+The second IDL is generated without gogo-protobuf annotations to allow clients
+in other languages to generate easily.
+
+Any errors in the generation process are considered fatal and must be resolved
+early (being unable to identify a field type for conversion, duplicate fields,
+duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure
+that a Go struct can be round-tripped to protobuf and back, as we do for JSON
+and conversion testing.
+
+
+## Changes to development process
+
+All existing API change rules would still apply. New fields added would be
+automatically assigned a tag by the generation process. New API versions will
+have a new proto IDL, and field name and other changes across API versions
+would be handled using our existing API change rules. Tags cannot change within an
+API version.
+
+Generation would be done by developers and then checked into source control,
+like conversions and ugorji JSON codecs.
+
+Because protoc is not packaged well across all platforms, we will add it to
+the `kube-cross` Docker image and developers can use that to generate
+updated protobufs. Protobuf 3 beta is required.
+
+The generated protobuf will be checked with a verify script before merging.
+
+
+## Implications
+
+* The generated marshal code is large and will increase build times and binary
+ size. We may be able to remove ugorji after protobuf is added, since the
+ bulk of our decoding would switch to protobuf.
+* The protobuf schema is naive, which means it may not be as minimal as
+ possible.
+* Debugging of protobuf related errors is harder due to the binary nature of
+ the format.
+* Migrating API object storage from JSON to protobuf will require that all
+ API servers are upgraded before beginning to write protobuf to disk, since
+ old servers won't recognize protobuf.
+* Transport of protobuf between etcd and the api server will be less efficient
+ in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON).
+ Should still be smaller than current JSON request.
+* Third-party API objects must be stored as JSON inside of a protobuf wrapper
+ in etcd, and the API endpoints will not benefit from clients that speak
+ protobuf. Clients will have to deal with some API objects not supporting
+ protobuf.
+
+
+## Open Questions
+
+* Is supporting stored protobuf files on disk in the kubectl client worth it?
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/protobuf.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/release-notes.md b/contributors/design-proposals/release-notes.md
new file mode 100644
index 00000000..f602eead
--- /dev/null
+++ b/contributors/design-proposals/release-notes.md
@@ -0,0 +1,194 @@
+
+# Kubernetes Release Notes
+
+[djmm@google.com](mailto:djmm@google.com)<BR>
+Last Updated: 2016-04-06
+
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Kubernetes Release Notes](#kubernetes-release-notes)
+ - [Objective](#objective)
+ - [Background](#background)
+ - [The Problem](#the-problem)
+ - [The (general) Solution](#the-general-solution)
+ - [Then why not just list *every* change that was submitted, CHANGELOG-style?](#then-why-not-just-list-every-change-that-was-submitted-changelog-style)
+ - [Options](#options)
+ - [Collection Design](#collection-design)
+ - [Publishing Design](#publishing-design)
+ - [Location](#location)
+ - [Layout](#layout)
+ - [Alpha/Beta/Patch Releases](#alphabetapatch-releases)
+ - [Major/Minor Releases](#majorminor-releases)
+ - [Work estimates](#work-estimates)
+ - [Caveats / Considerations](#caveats--considerations)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## Objective
+
+Define a process and design tooling for collecting, arranging and publishing
+release notes for Kubernetes releases, automating as much of the process as
+possible.
+
+The goal is to introduce minor changes to the development workflow
+in a way that is mostly frictionless and allows for the capture of release notes
+as PRs are submitted to the repository.
+
+This direct association of release notes to PRs captures the intention of
+release visibility of the PR at the point an idea is submitted upstream.
+The release notes can then be more easily collected and published when the
+release is ready.
+
+## Background
+
+### The Problem
+
+Release notes are often an afterthought and clarifying and finalizing them
+is often left until the very last minute at the time the release is made.
+This is usually long after the feature or bug fix was added and is no longer on
+the mind of the author. Worse, collecting and summarizing the release
+notes is often left to those who may know little or nothing about these
+individual changes!
+
+Writing and editing release notes at the end of the cycle can be a rushed,
+interrupt-driven and often stressful process resulting in incomplete,
+inconsistent release notes often with errors and omissions.
+
+### The (general) Solution
+
+Like most things in the development/release pipeline, the earlier you do it,
+the easier it is for everyone and the better the outcome. Gather your release
+notes earlier in the development cycle, at the time the features and fixes are
+added.
+
+#### Then why not just list *every* change that was submitted, CHANGELOG-style?
+
+On larger projects like Kubernetes, showing every single change (PR) would mean
+hundreds of entries. The goal is to highlight the major changes for a release.
+
+## Options
+
+1. Use of pre-commit and other local git hooks
+ * Experiments here using `prepare-commit-msg` and `commit-msg` git hook files
+ were promising but less than optimal due to the fact that they would
+ require input/confirmation with each commit and there may be multiple
+ commits in a push and eventual PR.
+1. Use of [github templates](https://github.com/blog/2111-issue-and-pull-request-templates)
+ * Templates provide a great way to pre-fill PR comments, but there are no
+ server-side hooks available to parse and/or easily check the contents of
+ those templates to ensure that checkboxes were checked or forms were filled
+ in.
+1. Use of labels enforced by mungers/bots
+ * We already make great use of mungers/bots to manage labels on PRs and it
+ fits very nicely in the existing workflow
+
+## Collection Design
+
+The munger/bot option fits most cleanly into the existing workflow.
+
+All `release-note-*` labeling is managed on the master branch PR only.
+No `release-note-*` labels are needed on cherry-pick PRs and no information
+will be collected from that cherry-pick PR.
+
+The only exception to this rule is when a PR is not a cherry-pick and is
+targeted directly to the non-master branch. In this case, a `release-note-*`
+label is required for that non-master PR.
+
+1. New labels added to github: `release-note-none`, maybe others for new release note categories - see Layout section below
+1. A [new munger](https://github.com/kubernetes/kubernetes/issues/23409) that will:
+ * Add a `release-note-label-needed` label to all new master branch PRs
+ * Block merge by the submit queue on all PRs labeled as `release-note-label-needed`
+ * Auto-remove `release-note-label-needed` when one of the `release-note-*` labels is added
+
+## Publishing Design
+
+### Location
+
+With v1.2.0, the release notes were moved from their previous [github releases](https://github.com/kubernetes/kubernetes/releases)
+location to [CHANGELOG.md](../../CHANGELOG.md). Going forward this seems like a good plan.
+Other projects do similarly.
+
+The kubernetes.tar.gz download link is also displayed along with the release notes
+in [CHANGELOG.md](../../CHANGELOG.md).
+
+Is there any reason to continue publishing anything to github releases if
+the complete release story is published in [CHANGELOG.md](../../CHANGELOG.md)?
+
+### Layout
+
+Different types of releases will generally have different requirements in
+terms of layout. As expected, major releases like v1.2.0 are going
+to require much more detail than the automated release notes will provide.
+
+The idea is that these mechanisms will provide 100% of the release note
+content for alpha, beta and most minor releases and bootstrap the content
+with a release note 'template' for the authors of major releases like v1.2.0.
+
+The authors can then collaborate and edit the higher level sections of the
+release notes in a PR, updating [CHANGELOG.md](../../CHANGELOG.md) as needed.
+
+v1.2.0 demonstrated the need, at least for major releases like v1.2.0, for
+several sections in the published release notes.
+In order to provide a basic layout for release notes in the future,
+new releases can bootstrap [CHANGELOG.md](../../CHANGELOG.md) with the following template types:
+
+#### Alpha/Beta/Patch Releases
+
+These are automatically generated from `release-note*` labels, but can be modified as needed.
+
+```
+Action Required
+* PR titles from the release-note-action-required label
+
+Other notable changes
+* PR titles from the release-note label
+```
+
+#### Major/Minor Releases
+
+```
+Major Themes
+* Add to or delete this section
+
+Other notable improvements
+* Add to or delete this section
+
+Experimental Features
+* Add to or delete this section
+
+Action Required
+* PR titles from the release-note-action-required label
+
+Known Issues
+* Add to or delete this section
+
+Provider-specific Notes
+* Add to or delete this section
+
+Other notable changes
+* PR titles from the release-note label
+```
+
+## Work estimates
+
+* The [new munger](https://github.com/kubernetes/kubernetes/issues/23409)
+ * Owner: @eparis
+ * Time estimate: Mostly done
+* Updates to the tool that collects, organizes, publishes and sends release
+ notifications.
+ * Owner: @david-mcmahon
+ * Time estimate: A few days
+
+
+## Caveats / Considerations
+
+* As part of the planning and development workflow how can we capture
+ release notes for bigger features?
+ [#23070](https://github.com/kubernetes/kubernetes/issues/23070)
+ * For now contributors should simply use the first PR that enables a new
+ feature by default. We'll revisit if this does not work well.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/release-notes.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/rescheduler.md b/contributors/design-proposals/rescheduler.md
new file mode 100644
index 00000000..faf53564
--- /dev/null
+++ b/contributors/design-proposals/rescheduler.md
@@ -0,0 +1,123 @@
+# Rescheduler design space
+
+@davidopp, @erictune, @briangrant
+
+July 2015
+
+## Introduction and definition
+
+A rescheduler is an agent that proactively causes currently-running
+Pods to be moved, so as to optimize some objective function for
+goodness of the layout of Pods in the cluster. (The objective function
+doesn't have to be expressed mathematically; it may just be a
+collection of ad-hoc rules, but in principle there is an objective
+function. Implicitly an objective function is described by the
+scheduler's predicate and priority functions.) It might be triggered
+to run every N minutes, or whenever some event happens that is known
+to make the objective function worse (for example, whenever any Pod goes
+PENDING for a long time.)
+
+## Motivation and use cases
+
+A rescheduler is useful because without a rescheduler, scheduling
+decisions are only made at the time Pods are created. But later on,
+the state of the cell may have changed in some way such that it would
+be better to move the Pod to another node.
+
+There are two categories of movements a rescheduler might trigger: coalescing
+and spreading.
+
+### Coalesce Pods
+
+This is the most common use case. Cluster layout changes over time. For
+example, run-to-completion Pods terminate, producing free space in their wake, but that space
+is fragmented. This fragmentation might prevent a PENDING Pod from scheduling
+(there are enough free resources for the Pod in aggregate across the cluster,
+but not on any single node). A rescheduler can coalesce free space like a
+disk defragmenter, thereby producing enough free space on a node for a PENDING
+Pod to schedule. In some cases it can do this just by moving Pods into existing
+holes, but often it will need to evict (and reschedule) running Pods in order to
+create a large enough hole.
+
+A second use case for a rescheduler to coalesce pods is when it becomes possible
+to support the running Pods on a fewer number of nodes. The rescheduler can
+gradually move Pods off of some set of nodes to make those nodes empty so
+that they can then be shut down/removed. More specifically,
+the system could do a simulation to see whether after removing a node from the
+cluster, will the Pods that were on that node be able to reschedule,
+either directly or with the help of the rescheduler; if the answer is
+yes, then you can safely auto-scale down (assuming services will still
+meet their application-level SLOs).
+
+### Spread Pods
+
+The main use cases for spreading Pods revolve around relieving congestion on (a) highly
+utilized node(s). For example, some process might suddenly start receiving a significantly
+above-normal amount of external requests, leading to starvation of best-effort
+Pods on the node. We can use the rescheduler to move the best-effort Pods off of the
+node. (They are likely to have generous eviction SLOs, so are more likely to be movable
+than the Pod that is experiencing the higher load, but in principle we might move either.)
+Or even before any node becomes overloaded, we might proactively re-spread Pods from nodes
+with high-utilization, to give them some buffer against future utilization spikes. In either
+case, the nodes we move the Pods onto might have been in the system for a long time or might
+have been added by the cluster auto-scaler specifically to allow the rescheduler to
+rebalance utilization.
+
+A second spreading use case is to separate antagonists.
+Sometimes the processes running in two different Pods on the same node
+may have unexpected antagonistic
+behavior towards one another. A system component might monitor for such
+antagonism and ask the rescheduler to move one of the antagonists to a new node.
+
+### Ranking the use cases
+
+The vast majority of users probably only care about rescheduling for three scenarios:
+
+1. Move Pods around to get a PENDING Pod to schedule
+1. Redistribute Pods onto new nodes added by a cluster auto-scaler when there are no PENDING Pods
+1. Move Pods around when CPU starvation is detected on a node
+
+## Design considerations and design space
+
+Because rescheduling is disruptive--it causes one or more
+already-running Pods to die when they otherwise wouldn't--a key
+constraint on rescheduling is that it must be done subject to
+disruption SLOs. There are a number of ways to specify these SLOs--a
+global rate limit across all Pods, a rate limit across a set of Pods
+defined by some particular label selector, a maximum number of Pods
+that can be down at any one time among a set defined by some
+particular label selector, etc. These policies are presumably part of
+the Rescheduler's configuration.
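+
+As one possible shape for such configuration (purely illustrative types, not a
+proposed API), a disruption SLO could be expressed as a label selector plus a
+cap on concurrent disruptions:
+
+```go
+// Illustrative disruption SLO: at most MaxUnavailable pods among those
+// matching Selector may be down at any one time.
+package rescheduler
+
+type DisruptionSLO struct {
+	Selector       map[string]string // label selector for the protected set
+	MaxUnavailable int               // pods from the set that may be down at once
+}
+
+// AllowsEviction reports whether evicting one more pod from the selected set
+// would still respect the SLO, given the current number unavailable.
+func (s DisruptionSLO) AllowsEviction(currentlyUnavailable int) bool {
+	return currentlyUnavailable+1 <= s.MaxUnavailable
+}
+```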
+
+There are a lot of design possibilities for a rescheduler. To explain
+them, it's easiest to start with the description of a baseline
+rescheduler, and then describe possible modifications. The Baseline
+rescheduler
+* only kicks in when there are one or more PENDING Pods for some period of time; its objective function is binary: completely happy if there are no PENDING Pods, and completely unhappy if there are PENDING Pods; it does not try to optimize for any other aspect of cluster layout
+* is not a scheduler -- it simply identifies a node where a PENDING Pod could fit if one or more Pods on that node were moved out of the way, and then kills those Pods to make room for the PENDING Pod, which will then be scheduled there by the regular scheduler(s). [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless] Of course it should only do this if it is sure the killed Pods will be able to reschedule into already-free space in the cluster. Note that although it is not a scheduler, the Rescheduler needs to be linked with the predicate functions of the scheduling algorithm(s) so that it can know (1) that the PENDING Pod would actually schedule into the hole it has identified once the hole is created, and (2) that the evicted Pod(s) will be able to schedule somewhere else in the cluster.
+
+Possible variations on this Baseline rescheduler are
+
+1. it can kill the Pod(s) whose space it wants **and also schedule the Pod that will take that space and reschedule the Pod(s) that were killed**, rather than just killing the Pod(s) whose space it wants and relying on the regular scheduler(s) to schedule the Pod that will take that space (and to reschedule the Pod(s) that were evicted)
+1. it can run continuously in the background to optimize general cluster layout instead of just trying to get a PENDING Pod to schedule
+1. it can try to move groups of Pods instead of using a one-at-a-time / greedy approach
+1. it can formulate multi-hop plans instead of single-hop
+
+A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
+* For the Baseline rescheduler, it needs to know the predicate functions used by the cluster's scheduler(s) else it can't know how to create a hole that the PENDING Pod will fit into, nor be sure that the evicted Pod(s) will be able to reschedule elsewhere.
+* If it is going to run continuously in the background to optimize cluster layout but is still only going to kill Pods, then it still needs to know the predicate functions for the reason mentioned above. In principle it doesn't need to know the priority functions; it could just randomly kill Pods and rely on the regular scheduler to put them back in better places. However, this is a rather inexact approach. Thus it is useful for the rescheduler to know the priority functions, or at least some subset of them, so it can be sure that an action it takes will actually improve the cluster layout.
+* If it is going to run continuously in the background to optimize cluster layout and is going to act as a scheduler rather than just killing Pods, then it needs to know the predicate functions and some compatible (but not necessarily identical) priority functions. One example of a case where "compatible but not identical" might be useful is if the main scheduler(s) has a very simple scheduling policy optimized for low scheduling latency, while the Rescheduler has a more sophisticated/optimal scheduling policy that requires more computation time. The main thing to avoid is for the scheduler(s) and rescheduler to have incompatible priority functions, as this will cause them to "fight" (though it still can't lead to an infinite loop, since the scheduler(s) only ever touches a Pod once).
+
+## Appendix: Integrating rescheduler with cluster auto-scaler (scale up)
+
+For scaling up the cluster, a reasonable workflow might be:
+
+1. pod horizontal auto-scaler decides to add one or more Pods to a service, based on the metrics it is observing
+1. the Pod goes PENDING due to lack of a suitable node with sufficient resources
+1. rescheduler notices the PENDING Pod and determines that the Pod cannot schedule just by rearranging existing Pods (while respecting SLOs)
+1. rescheduler triggers cluster auto-scaler to add a node of the appropriate type for the PENDING Pod
+1. the PENDING Pod schedules onto the new node (and possibly the rescheduler also moves other Pods onto that node)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduler.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/rescheduling-for-critical-pods.md b/contributors/design-proposals/rescheduling-for-critical-pods.md
new file mode 100644
index 00000000..1d2d80ee
--- /dev/null
+++ b/contributors/design-proposals/rescheduling-for-critical-pods.md
@@ -0,0 +1,88 @@
+# Rescheduler: guaranteed scheduling of critical addons
+
+## Motivation
+
+In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
+there are a number of addons which, for various reasons, have to run on a regular cluster node rather than the master.
+Some of them are critical to having a fully functional cluster: Heapster, DNS, UI. Users can break their cluster
+by evicting a critical addon (either manually or as a side effect of another operation like an upgrade),
+which can then become pending (for example when the cluster is highly utilized).
+To avoid such a situation we want to have a mechanism which guarantees that
+critical addons are scheduled, assuming the cluster is big enough.
+This may affect other pods (including users’ production applications).
+
+## Design
+
+The Rescheduler will ensure that critical addons are always scheduled.
+In the first version it will implement only this policy, but later we may want to introduce other policies.
+It will be a standalone component running on the master machine, similarly to the scheduler.
+The two components will share common logic (initially the rescheduler will in fact import some of the scheduler packages).
+
+### Guaranteed scheduling of critical addons
+
+The Rescheduler will observe critical addons
+(marked with the annotation `scheduler.alpha.kubernetes.io/critical-pod`).
+If one of them is marked by the scheduler as unschedulable (pod condition `PodScheduled` set to `false`, with the reason set to `Unschedulable`),
+the component will try to find space for the addon by evicting some pods; the scheduler will then schedule the addon.
+
+#### Scoring nodes
+
+Initially we want to choose a random node with enough capacity
+(chosen as described in [Evicting pods](rescheduling-for-critical-pods.md#evicting-pods)) to schedule the given addon.
+Later we may want to introduce some heuristics:
+* minimize the number of evicted pods whose disruption budget is violated or whose termination grace period is shortened
+* minimize the total number of affected pods by choosing a node on which we have to evict fewer pods
+* increase the probability that evicted pods can be rescheduled by preferring the set of pods with the smallest total sum of requests
+* avoid nodes which are ‘non-drainable’ (according to the drain logic), for example those on which there is a pod which doesn’t belong to any RC/RS/Deployment
+
+#### Evicting pods
+
+There are two mechanisms which can delay a pod eviction: Disruption Budget and Termination Grace Period.
+
+While removing a pod we will try to avoid violating its Disruption Budget, though we can’t guarantee it,
+since there is a chance that doing so would block the operation for a longer period of time.
+We will also try to respect the Termination Grace Period, though without any guarantee.
+If we have to remove a pod with a termination grace period longer than 10s, it will be shortened to 10s.
+
+The proposed order of preference while choosing a node on which to schedule a critical addon, and the pods to remove, is:
+
+1. a node where the critical addon pod can fit after evicting only pods satisfying both of the following:
+(1) their disruption budget will not be violated by the eviction and (2) their grace period is <= 10 seconds
+1. a node where the critical addon pod can fit after evicting only pods whose disruption budget will not be violated by the eviction
+1. any node where the critical addon pod can fit after evicting some pods
+
+### Interaction with Scheduler
+
+To avoid a situation where the Scheduler schedules another pod into the space prepared for the critical addon,
+the chosen node has to be temporarily excluded from the list of nodes considered by the Scheduler while making decisions.
+For this purpose the node will get a temporary
+[Taint](../../docs/design/taint-toleration-dedicated.md) “CriticalAddonsOnly”,
+and each critical addon has to define a toleration for this taint (see the sketch below).
+Once the Rescheduler has no more work to do -- all critical addons are scheduled or the cluster is too small for them --
+all such taints will be removed.
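+
+As a minimal sketch, the toleration each critical addon would need might look like the
+following; the local `Toleration` type here is only a self-contained stand-in for the API
+defined in the taints/tolerations design, not the real type.
+
+```go
+package main
+
+import "fmt"
+
+// Toleration is a hypothetical stand-in for the real taints/tolerations API type.
+type Toleration struct {
+    Key      string // taint key being tolerated
+    Operator string // "Exists" tolerates any value of the key
+}
+
+func main() {
+    // The toleration every critical addon would carry so that it can land on a
+    // node tainted with "CriticalAddonsOnly" by the rescheduler.
+    t := Toleration{Key: "CriticalAddonsOnly", Operator: "Exists"}
+    fmt.Printf("tolerates taint %q with operator %s\n", t.Key, t.Operator)
+}
+```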
+
+### Interaction with Cluster Autoscaler
+
+The Rescheduler can possibly duplicate the responsibility of the Cluster Autoscaler:
+both components take action when there is an unschedulable pod.
+This may lead to a situation where CA adds an extra node for a pending critical addon
+while the Rescheduler evicts some running pods to make space for the addon.
+This situation would be rare, and usually an extra node would be needed for the evicted pods anyway.
+In the worst case CA will add and then remove the node.
+To avoid complicating the architecture by introducing interaction between those two components, we accept this overlap.
+
+We want to ensure that CA won’t remove nodes with critical addons by adding appropriate logic there.
+
+### Rescheduler control loop
+
+The rescheduler control loop will be as follows (a minimal sketch in Go appears after the list):
+
+* while there is an unschedulable critical addon, do the following:
+ * choose a node on which the addon should be scheduled (as described in Evicting pods)
+ * add a taint to the node to prevent the scheduler from using it
+ * delete the pods which block the addon from being scheduled
+ * wait until the scheduler schedules the critical addon
+* if there are no more critical addons we can help with, ensure there is no node with the taint
+
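+A minimal sketch of this loop in Go; every type and helper below
+(`findUnschedulableCriticalAddons`, `chooseNodeAndVictims`, etc.) is a hypothetical
+placeholder, not the real implementation.
+
+```go
+package main
+
+import "time"
+
+// Hypothetical placeholder types and helpers -- a real rescheduler would use the
+// Kubernetes client and scheduler packages instead of these stubs.
+type Pod struct{ Name string }
+type Node struct{ Name string }
+
+func findUnschedulableCriticalAddons() []Pod       { return nil }         // critical-pod annotation + PodScheduled=false
+func chooseNodeAndVictims(addon Pod) (Node, []Pod) { return Node{}, nil } // as described in "Evicting pods"
+func addTaint(n Node, key string)                  {}                     // keep the scheduler away from the node
+func evictPod(p Pod)                               {}                     // respect disruption budget / grace period where possible
+func waitUntilScheduled(addon Pod)                 {}
+func removeTaint(key string)                       {} // from every node that still carries it
+
+func main() {
+    for {
+        addons := findUnschedulableCriticalAddons()
+        for _, addon := range addons {
+            node, victims := chooseNodeAndVictims(addon)
+            addTaint(node, "CriticalAddonsOnly")
+            for _, victim := range victims {
+                evictPod(victim)
+            }
+            waitUntilScheduled(addon)
+        }
+        if len(addons) == 0 {
+            // Nothing left to do: make sure no node keeps the temporary taint.
+            removeTaint("CriticalAddonsOnly")
+        }
+        time.Sleep(10 * time.Second) // hypothetical resync period
+    }
+}
+```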
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduling-for-critical-pods.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/rescheduling.md b/contributors/design-proposals/rescheduling.md
new file mode 100644
index 00000000..b1bdb937
--- /dev/null
+++ b/contributors/design-proposals/rescheduling.md
@@ -0,0 +1,493 @@
+# Controlled Rescheduling in Kubernetes
+
+## Overview
+
+Although the Kubernetes scheduler(s) try to make good placement decisions for pods,
+conditions in the cluster change over time (e.g. jobs finish and new pods arrive, nodes
+are removed due to failures or planned maintenance or auto-scaling down, nodes appear due
+to recovery after a failure or re-joining after maintenance or auto-scaling up or adding
+new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are
+some interactions between pods, or between pods and nodes, that they cannot predict). As
+a result, the initial node selected for a pod may turn out to be a bad match, from the
+perspective of the pod and/or the cluster as a whole, at some point after the pod has
+started running.
+
+Today (Kubernetes version 1.2) once a pod is scheduled to a node, it never moves unless
+it terminates on its own, is deleted by the user, or experiences some unplanned event
+(e.g. the node where it is running dies). Thus in a cluster with long-running pods, the
+assignment of pods to nodes degrades over time, no matter how good an initial scheduling
+decision the scheduler makes. This observation motivates "controlled rescheduling," a
+mechanism by which Kubernetes will "move" already-running pods over time to improve their
+placement. Controlled rescheduling is the subject of this proposal.
+
+Note that the term "move" is not technically accurate -- the mechanism used is that
+Kubernetes will terminate a pod that is managed by a controller, and the controller will
+create a replacement pod that is then scheduled by the pod's scheduler. The terminated
+pod and replacement pod are completely separate pods, and no pod migration is
+implied. However, describing the process as "moving" the pod is approximately accurate
+and easier to understand, so we will use this terminology in the document.
+
+We use the term "rescheduling" to describe any action the system takes to move an
+already-running pod. The decision may be made and executed by any component; we will
+introduce the concept of a "rescheduler" component later, but it is not the only
+component that can do rescheduling.
+
+This proposal primarily focuses on the architecture and features/mechanisms used to
+achieve rescheduling, and only briefly discusses example policies. We expect that community
+experimentation will lead to a significantly better understanding of the range, potential,
+and limitations of rescheduling policies.
+
+## Example use cases
+
+Example use cases for rescheduling are
+
+* moving a running pod onto a node that better satisfies its scheduling criteria
+ * moving a pod onto an under-utilized node
+ * moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
+* moving a running pod off of a node in anticipation of a known or speculated future event
+ * draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
+ * "preempting" a running pod to make room for a pending pod to schedule
+ * proactively/speculatively make room for large and/or exclusive pods to facilitate
+ fast scheduling in the future (often called "defragmentation")
+ * (note that these last two cases are the only use cases where the first-order intent
+ is to move a pod specifically for the benefit of another pod)
+* moving a running pod off of a node from which it is receiving poor service
+ * anomalous crashlooping or other mysterious incompatibility between the pod and the node
+ * repeated out-of-resource killing (see #18724)
+ * repeated attempts by the scheduler to schedule the pod onto some node, but it is
+ rejected by Kubelet admission control due to incomplete scheduler knowledge
+ * poor performance due to interference from other containers on the node (CPU hogs,
+ cache thrashers, etc.) (note that in this case there is a choice of moving the victim
+ or the aggressor)
+
+## Some axes of the design space
+
+Among the key design decisions are
+
+* how does a pod specify its tolerance for these system-generated disruptions, and how
+ does the system enforce such disruption limits
+* for each use case, where is the decision made about when and which pods to reschedule
+ (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.)
+* rescheduler design issues: how much does a rescheduler need to know about pods'
+ schedulers' policies, how does the rescheduler specify its rescheduling
+ requests/decisions (e.g. just as an eviction, an eviction with a hint about where to
+ reschedule, or as an eviction paired with a specific binding), how does the system
+ implement these requests, does the rescheduler take into account the second-order
+ effects of decisions (e.g. whether an evicted pod will reschedule, will cause
+ a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans
+ (e.g. evict two pods at the same time with the intent of moving one into the space
+ vacated by the other, or even more complex plans)
+
+Additional musings on the rescheduling design space can be found [here](rescheduler.md).
+
+## Design proposal
+
+The key mechanisms and components of the proposed design are priority, preemption,
+disruption budgets, the `/evict` subresource, and the rescheduler.
+
+### Priority
+
+#### Motivation
+
+
+Just as it is useful to overcommit nodes to increase node-level utilization, it is useful
+to overcommit clusters to increase cluster-level utilization. Scheduling priority (which
+we abbreviate as *priority*), in combination with disruption budgets (described in the
+next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow
+it to safely overcommit nodes.
+
+Today, cluster sharing among users, workload types, etc. is regulated via the
+[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster
+administrator has two choices: (1) the sum of the quotas is less than or equal to the
+capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the
+cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster
+under-utilization, while (2) is unsafe in the sense that someone's pods may go pending
+indefinitely even though they are still within their quota. Priority makes cluster
+overcommitment (i.e. case (2)) safe by allowing users and/or administrators to identify
+which pods should be allowed to run, and which should go pending, when demand for cluster
+resources exceeds supply due to cluster overcommitment.
+
+Priority is also useful in some special-case scenarios, such as ensuring that system
+DaemonSets can always schedule and reschedule onto every node where they want to run
+(assuming they are given the highest priority), e.g. see #21767.
+
+#### Specifying priorities
+
+We propose to add a required `Priority` field to `PodSpec`. Its value type is string, and
+the cluster administrator defines a total ordering on these strings (for example
+`Critical`, `Normal`, `Preemptible`). We choose string instead of integer so that it is
+easy for an administrator to add new priority levels in between existing levels, to
+encourage thinking about priority in terms of user intent and avoid magic numbers, and to
+make the internal implementation more flexible.
+
+When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's
+scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at
+the same or lower priority than P (subject to disruption budgets, see next section) from
+a node in order to make room for P, i.e. in order to make the scheduling predicates
+satisfied for P on that node. (Note that when we add cluster-level resources (#19080),
+it might be necessary to preempt from multiple nodes, but that scenario is outside the
+scope of this document.) The preempted pod(s) may or may not be able to reschedule. The
+net effect of this process is that when demand for cluster resources exceeds supply, the
+higher-priority pods will be able to run while the lower-priority pods will be forced to
+wait. The detailed mechanics of preemption are described in a later section.
+
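+A minimal sketch of the proposed string-valued priority, using the example level names
+above; the ordering map and the preemption-eligibility helper are illustrative
+assumptions, not part of the proposal.
+
+```go
+package main
+
+import "fmt"
+
+// Hypothetical sketch of the proposed addition to PodSpec: a required,
+// string-valued Priority field whose total ordering is defined by the
+// cluster administrator.
+type PodSpec struct {
+    Priority string
+    // ... existing PodSpec fields elided ...
+}
+
+// Example administrator-defined ordering; a higher rank means a higher priority.
+var priorityRank = map[string]int{
+    "Preemptible": 0,
+    "Normal":      1,
+    "Critical":    2,
+}
+
+// mayPreempt reports whether a pending pod at priority p is allowed to preempt a
+// running pod at priority q (same or lower priority, per this proposal).
+func mayPreempt(p, q string) bool {
+    return priorityRank[p] >= priorityRank[q]
+}
+
+func main() {
+    fmt.Println(mayPreempt("Critical", "Normal"))      // true
+    fmt.Println(mayPreempt("Preemptible", "Critical")) // false
+}
+```
+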
+In addition to taking disruption budget into account, for equal-priority preemptions the
+scheduler will try to enforce fairness (across victim controllers, services, etc.).
+
+Priorities could be specified directly by users in the podTemplate, or assigned by an
+admission controller using
+properties of the pod. Either way, all schedulers must be configured to understand the
+same priorities (names and ordering). This could be done by making them constants in the
+API, or using ConfigMap to configure the schedulers with the information. The advantage of
+the former (at least making the names, if not the ordering, constants in the API) is that
+it allows the API server to do validation (e.g. to catch misspellings).
+
+In the future, which priorities are usable for a given namespace and pods with certain
+attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.
+
+Priority and resource QoS are independent.
+
+The priority we have described here might be used to prioritize the scheduling queue
+(i.e. the order in which a scheduler examines pods in its scheduling loop), but the two
+priority concepts do not have to be connected. It is somewhat logical to tie them
+together, since a higher priority generally indicates that a pod is more urgent to get
+running. Also, scheduling low-priority pods before high-priority pods might lead to
+avoidable preemptions if the high-priority pods end up preempting the low-priority pods
+that were just scheduled.
+
+TODO: Priority and preemption are global or namespace-relative? See
+[this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).
+
+#### Relationship of priority to quota
+
+Of course, if the decision of what priority to give a pod is solely up to the user, then
+users have no incentive to ever request any priority less than the maximum. Thus
+priority is intimately related to quota, in the sense that resource quotas must be
+allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM
+at priority B, etc.). The "guarantee" that highest-priority pods will always be able to
+schedule can only be achieved if the sum of the quotas at the top priority level is less
+than or equal to the cluster capacity. This is analogous to QoS, where safety can only be
+achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or
+equal to the node capacity. In terms of incentives, an organization could "charge"
+an amount proportional to the priority of the resources.
+
+The topic of how to allocate quota at different priority levels to achieve a desired
+balance between utilization and probability of schedulability is an extremely complex
+topic that is outside the scope of this document. For example, resource fragmentation and
+RequiredDuringScheduling node and pod affinity and anti-affinity means that even if the
+sum of the quotas at the top priority level is less than or equal to the total aggregate
+capacity of the cluster, some pods at the top priority level might still go pending. In
+general, priority provides *probabilistic* guarantees of pod schedulability in the face
+of overcommitment, by allowing prioritization of which pods should be allowed to run
+when demand for cluster resources exceeds supply.
+
+### Disruption budget
+
+While priority can protect pods from one source of disruption (preemption by a
+lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated
+causes, including preemption by an equal or higher-priority pod, or being evicted to
+achieve other rescheduling goals. In particular, each pod is optionally associated with a
+"disruption budget," a new API resource that limits Kubernetes-initiated terminations
+across a set of pods (e.g. the pods of a particular Service might all point to the same
+disruption budget object), regardless of cause. Initially we expect disruption budget
+(e.g. `DisruptionBudgetSpec`) to consist of
+
+* a rate limit on disruptions (preemption and other evictions) across the corresponding
+ set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
+* a minimum number of pods that must be up simultaneously (sometimes called "shard
+ strength") (of course this can also be expressed as the inverse, i.e. the number of
+ pods of the collection that can be down simultaneously)
+
+The second item merits a bit more explanation. One use case is to specify a quorum size,
+e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up
+at the same time. In practice, a service should ideally create enough replicas to survive
+at least one planned and one unplanned outage. So in our quorum example, we would specify
+that at least 4 replicas must be up at the same time; this allows for one intentional
+disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit
+of shard strength budget) and one unplanned disruption (bringing the number of live
+replicas down from 4 to 3) while still maintaining a quorum. Shard strength is also
+useful for simpler replicated services; for example, you might not want more than 10% of
+your front-ends to be down at the same time, so as to avoid overloading the remaining
+replicas.
+
+Initially, disruption budgets will be specified by the user. Thus as with priority,
+disruption budgets need to be tied into quota, to prevent users from saying none of their
+pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD,
+though a simple starting point would be to have an admission controller assign a default
+disruption budget based on priority level (more liberal with decreasing priority).
+We also likely need a quota that applies to Kubernetes *components*, to limit the rate
+at which any one component is allowed to consume disruption budget.
+
+Of course there should also be a `DisruptionBudgetStatus` that indicates the current
+disruption rate that the collection of pods is experiencing, and the number of pods that
+are up.
+
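+A rough Go sketch of the `DisruptionBudgetSpec` and `DisruptionBudgetStatus` described
+above; the field names and types are illustrative assumptions, not a final API.
+
+```go
+// Hypothetical sketch of the DisruptionBudget resource described above;
+// field names and types are illustrative only.
+type DisruptionBudgetSpec struct {
+    // Selects the set of pods the budget applies to (e.g. the pods of a
+    // particular Service).
+    Selector map[string]string
+
+    // Rate limit on Kubernetes-initiated disruptions across the selected
+    // pods, e.g. at most 1 disruption per hour.
+    MaxDisruptionsPerHour int
+
+    // Minimum number of selected pods that must be up simultaneously
+    // ("shard strength").
+    MinAvailable int
+}
+
+type DisruptionBudgetStatus struct {
+    // Number of selected pods currently up.
+    CurrentAvailable int
+
+    // Disruptions already consumed in the current rate-limit window.
+    RecentDisruptions int
+}
+```
+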
+For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
+graceful termination period starts.
+
+A pod that is not covered by a disruption budget but is managed by a controller
+gets an implicit disruption budget of infinity (though the system should try not to
+unduly victimize such pods). How a pod that is not managed by a controller is
+handled is TBD.
+
+TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
+(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
+budget (e.g. when instantiating a Service), or require the user to create it manually
+before they create a controller? Which objects should return the disruption budget object
+as part of the output on `kubectl get` other than (obviously) `kubectl get` for the
+disruption budget itself?
+
+TODO: Clean up distinction between "down due to voluntary action taken by Kubernetes"
+and "down due to unplanned outage" in spec and status.
+
+For now, there is nothing to prevent clients from circumventing the disruption budget
+protections. Of course, clients that do this are not being "good citizens." In the next
+section we describe a mechanism that at least makes it easy for well-behaved clients to
+obey the disruption budgets.
+
+See #12611 for additional discussion of disruption budgets.
+
+### /evict subresource and PreferAvoidPods
+
+Although we could put the responsibility for checking and updating disruption budgets
+solely on the client, it is safer and more convenient if we implement that functionality
+in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to
+today's "delete" on pod except
+
+ * It will be rejected if the deletion would violate disruption budget. (See how
+ Deployment handles failure of /rollback for ideas on how clients could handle failure
+ of `/evict`.) There are two possible ways to implement this:
+
+ * For the initial implementation, this will be accomplished by the API server just
+ looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the
+ `DisruptionBudgetSpec` (a minimal sketch of this check appears below). In this approach, we assume a disruption budget controller
+ keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and
+ creations in the cluster, so that an approved disruption is quickly reflected in the
+ `DisruptionBudgetStatus`. Of course this approach does allow a race in which one or
+ more additional disruptions could be approved before the first one is reflected in the
+ `DisruptionBudgetStatus`.
+
+ * Thus a subsequent implementation will have the API server explicitly debit the
+ `DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a
+ controller, to keep the shard strength status up-to-date when replacement pods are
+ created after an eviction; the controller may also be necessary for the rate status
+ depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.)
+ Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will
+ be placed in the same transaction.
+
+ * Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its
+ graceful termination period starts (so when we say "delete" here we do not mean
+ "deleted from etcd" but rather "graceful termination period has started").
+
+ * It will allow clients to communicate additional parameters when they wish to delete a
+ pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific
+ type analogous to `api.DeleteOptions`.)
+
+We will make `kubectl delete pod` use `/evict` by default, and require a command-line
+flag to delete the pod directly.
+
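+As a minimal sketch, the initial implementation's check described above (the API server
+consulting the status before accepting an `/evict`) might look like the following,
+reusing the hypothetical `DisruptionBudget` fields sketched in the previous section; it
+deliberately ignores the race discussed above.
+
+```go
+// evictionAllowed reports whether one more disruption would violate the budget,
+// judging purely from the (possibly slightly stale) status, as in the initial
+// implementation. Illustrative only.
+func evictionAllowed(spec DisruptionBudgetSpec, status DisruptionBudgetStatus) bool {
+    // Respect the rate limit on disruptions.
+    if status.RecentDisruptions+1 > spec.MaxDisruptionsPerHour {
+        return false
+    }
+    // Respect the shard-strength requirement: enough pods must remain up
+    // after this eviction.
+    if status.CurrentAvailable-1 < spec.MinAvailable {
+        return false
+    }
+    return true
+}
+```
+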
+We will add to `NodeStatus` a bounded-sized list of signatures of pods that should avoid
+that node (provisionally called `PreferAvoidPods`). One of the pieces of information
+specified in the `/evict` subresource is whether the eviction should add the evicted
+pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod
+signature will be a
+[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
+i.e. a reference to the pod's controller. Controllers are responsible for garbage
+collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API
+server will also enforce a bounded size on the list. All schedulers will have a
+highest-weighted priority function that gives a node the worst priority if the pod it is
+scheduling appears in that node's `PreferAvoidPods` list. Thus appearing in
+`PreferAvoidPods` is similar to
+[RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md)
+but it takes precedence over all other priority criteria and is not explicitly listed in
+the `NodeAffinity` of the pod.
+
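+A rough sketch of the proposed `NodeStatus` addition; the exact entry shape (a
+controllerRef plus a timestamp used for garbage collection) is an illustrative
+assumption, not a settled API.
+
+```go
+// Hypothetical sketch of the proposed NodeStatus addition.
+type PreferAvoidPodsEntry struct {
+    // Signature of the pods that should avoid this node; initially a
+    // reference to the pods' controller (controllerRef).
+    ControllerRef ObjectReference
+
+    // When the entry was added, so the owning controller (and the API
+    // server's bound on list size) can garbage-collect old entries.
+    EvictionTime Time
+}
+
+type NodeStatus struct {
+    // ... existing NodeStatus fields elided ...
+
+    // Bounded-size list of pod signatures that should avoid this node.
+    PreferAvoidPods []PreferAvoidPodsEntry
+}
+```
+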
+`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is
+receiving poor service" use case, as it reduces the chance that the replacement pod will
+end up on the same node (keep in mind that most of those cases are situations that the
+scheduler does not have explicit priority functions for, for example it cannot know in
+advance that a pod will be starved). Also, though we do not intend to implement any such
+policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts
+two pods A and B with the intention of moving A into the space vacated by B (it prevents
+B from rescheduling back into the space it vacated before A's scheduler has a chance to
+reschedule A there). Note that these two uses are subtly different; in the first
+case we want the avoidance to last a relatively long time, whereas in the second case we
+may only need it to last until A schedules.
+
+See #20699 for more discussion.
+
+### Preemption mechanics
+
+**NOTE: We expect a fuller design doc to be written on preemption before it is implemented.
+However, a sketch of some ideas are presented here, since preemption is closely related to the
+concepts discussed in this doc.**
+
+Pod schedulers will decide and enact preemptions, subject to the priority and disruption
+budget rules described earlier. (Though note that we currently do not have any mechanism
+to prevent schedulers from bypassing either the priority or disruption budget rules.)
+The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The
+eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption
+budget(s) of the victim(s), but it does not request to add the victim pod(s) to the
+nodes' `PreferAvoidPods`.
+
+Evicting victim(s) and binding the pending pod that the evictions are intended to enable
+to schedule, are not transactional. We expect the scheduler to issue the operations in
+sequence, but it is still possible that another scheduler could schedule its pod in
+between the eviction(s) and the binding, or that the set of pods running on the node in
+question changed between the time the scheduler made its decision and the time it sent
+the operations to the API server, thereby causing the eviction(s) to no longer be sufficient to get the
+pending pod to schedule. In general there are a number of race conditions that cannot be
+avoided without (1) making the evictions and binding be part of a single transaction, and
+(2) making the binding preconditioned on a version number that is associated with the
+node and is incremented on every binding. We may or may not implement those mechanisms in
+the future.
+
+Given a choice between a node where scheduling a pod requires preemption and one where it
+does not, all other things being equal, a scheduler should choose the one where
+preemption is not required. (TBD: Also, if the selected node does require preemption, the
+scheduler should preempt lower-priority pods before higher-priority pods (e.g. if the
+scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and
+one 4 GB high-priority pod, all of which have sufficient disruption budget, it should
+preempt the two low-priority pods). This is debatable, since all have sufficient
+disruption budget. But still better to err on the side of giving better disruption SLO to
+higher-priority pods when possible?)
+
+Preemption victims must be given their termination grace period. One possible sequence
+of events is
+
+1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the
+preempting pod) and sets `deletionTimestamp` on the victims
+2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their
+graceful termination period
+3. Kubelet sees the preempting pod. It runs the admission checks on the new pod
+assuming all pods that are in their graceful termination period are gone and that
+all pods that are in the waiting state (see (4)) are running.
+4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the
+new pod in a waiting state, and does not run it until the pod passes the
+admission checks using the set of actually running pods.
+
+Note that there are a lot of details to be figured out here; above is just a very
+hand-wavy sketch of one general approach that might work.
+
+See #22212 for additional discussion.
+
+### Node drain
+
+Node drain will be handled by one or more components not described in this document. They
+will respect disruption budgets. Initially, we will just make `kubectl drain`
+respect disruption budgets. See #17393 for other discussion.
+
+### Rescheduler
+
+All rescheduling other than preemption and node drain will be decided and enacted by a
+new component called the *rescheduler*. It runs continuously in the background, looking
+for opportunities to move pods to better locations. It acts when the degree of
+improvement meets some threshold and is allowed by the pod's disruption budget. The
+action is eviction of a pod using the `/evict` subresource, with the pod's signature
+enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any
+particular node. Thus it is really an "unscheduler"; only in combination with the evicted
+pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See
+the "Example use cases" section earlier for some example use cases.
+
+The rescheduler is a best-effort service that makes no guarantees about how quickly (or
+whether) it will resolve a suboptimal pod placement.
+
+The first version of the rescheduler will not take into consideration where or whether an
+evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the
+corresponding shard strength disruption budget indefinitely. By using the `/evict`
+subresource, the rescheduler ensures that there is sufficient budget for the
+evicted pod to go and stay pending. We expect future versions of the rescheduler may be
+linked with the "mandatory" predicate functions (currently, the ones that constitute the
+Kubelet admission criteria), and will only evict if the rescheduler determines that the
+pod can reschedule somewhere according to those criteria. (Note that this still does not
+guarantee that the pod actually will be able to reschedule, for at least two reasons: (1)
+the state of the cluster may change between the time the rescheduler evaluates it and
+when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the
+evicted pod's scheduler may have additional predicate functions in addition to the
+mandatory ones).
+
+(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968)).
+
+The first version of the rescheduler will only implement two objectives: moving a pod
+onto an under-utilized node, and moving a pod onto a node that meets more of the pod's
+affinity/anti-affinity preferences than wherever it is currently running. (We assume that
+nodes that are intentionally under-utilized, e.g. because they are being drained, are
+marked unschedulable, thus the first objective will not cause the rescheduler to "fight"
+a system that is draining nodes.) We assume that all schedulers sufficiently weight the
+priority functions for affinity/anti-affinity and avoiding very packed nodes,
+otherwise an evicted pod may not actually move onto a node that is better according to
+the criteria that caused it to be evicted. (But note that in all cases it will move to a
+node that is better according to the totality of its scheduler's priority functions,
+except in the case where the node where it was already running was the only node
+where it can run.) As a general rule, the rescheduler should only act when it sees
+particularly bad situations, since (1) an eviction for a marginal improvement is likely
+not worth the disruption--just because there is sufficient budget for an eviction doesn't
+mean an eviction is painless to the application, and (2) rescheduling the pod might not
+actually mitigate the identified problem if it is minor enough that other scheduling
+factors dominate the decision of where the replacement pod is scheduled.
+
+We assume schedulers' priority functions are at least vaguely aligned with the
+rescheduler's policies; otherwise the rescheduler will never accomplish anything useful,
+given that it relies on the schedulers to actually reschedule the evicted pods. (Even if
+the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want
+this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)
+
+The rescheduler will be configured using ConfigMap; the cluster administrator can enable
+or disable policies and can tune the rescheduler's aggressiveness (aggressive means it
+will use a relatively low threshold for triggering an eviction and may consume a lot of
+disruption budget, while non-aggressive means it will use a relatively high threshold for
+triggering an eviction and will try to leave plenty of buffer in disruption budgets). The
+first version of the rescheduler will not be extensible or pluggable, since we want to
+keep the code simple while we gain experience with the overall concept. In the future, we
+anticipate a version that will be extensible and pluggable.
+
+We might want some way to force the evicted pod to the front of the scheduler queue,
+independently of its priority.
+
+See #12140 for additional discussion.
+
+### Final comments
+
+In general, the design space for this topic is huge. This document describes some of the
+design considerations and proposes one particular initial implementation. We expect
+certain aspects of the design to be "permanent" (e.g. the notion and use of priorities,
+preemption, disruption budgets, and the `/evict` subresource) while others may change over time
+(e.g. the partitioning of functionality between schedulers, controllers, rescheduler,
+horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements;
+the factors the rescheduler takes into account when making decisions (e.g. knowledge of
+schedulers' predicate and priority functions, second-order effects like whether and where
+evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its
+decisions; and the complexity of the plans the rescheduler attempts to implement).
+
+## Implementation plan
+
+The highest-priority feature to implement is the rescheduler with the two use cases
+highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a
+node that meets more of the pod's affinity/anti-affinity preferences. The former is
+useful to rebalance pods after cluster auto-scale-up, and the latter is useful for
+Ubernetes. This requires implementing disruption budgets and the `/evict` subresource,
+but not priority or preemption.
+
+Because the general topic of rescheduling is very speculative, we have intentionally
+proposed that the first version of the rescheduler be very simple -- only uses eviction
+(no attempt to guide replacement pod to any particular node), doesn't know schedulers'
+predicate or priority functions, doesn't try to move two pods at the same time, and only
+implements two use cases. As alluded to in the previous subsection, we expect the design
+and implementation to evolve over time, and we encourage members of the community to
+experiment with more sophisticated policies and to report their results from using them
+on real workloads.
+
+## Alternative implementations
+
+TODO.
+
+## Additional references
+
+TODO.
+
+TODO: Add reference to this doc from docs/proposals/rescheduler.md
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduling.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/resource-metrics-api.md b/contributors/design-proposals/resource-metrics-api.md
new file mode 100644
index 00000000..fee416e0
--- /dev/null
+++ b/contributors/design-proposals/resource-metrics-api.md
@@ -0,0 +1,151 @@
+# Resource Metrics API
+
+This document describes the API part of the MVP version of the Resource Metrics API effort in Kubernetes.
+Once agreement is reached, the document will be extended to also cover implementation details.
+The shape of the effort may also be subject to change once we have more well-defined use cases.
+
+## Goal
+
+The goal for the effort is to provide resource usage metrics for pods and nodes through the API server.
+This will be a stable, versioned API which core Kubernetes components can rely on.
+In the first version only the well-defined use cases will be handled,
+although the API should be easily extensible for potential future use cases.
+
+## Main use cases
+
+This section describes well-defined use cases which should be handled in the first version.
+Use cases which are not listed below are out of the scope of the MVP version of the Resource Metrics API.
+
+#### Horizontal Pod Autoscaler
+
+HPA uses the latest value of cpu usage as an average aggregated across 1 minute
+(the window may change in the future). The data for a given set of pods
+(defined either by a pod list or a label selector) should be accessible in one request
+for performance reasons.
+
+#### Scheduler
+
+In order to schedule best-effort pods, the scheduler requires node-level resource usage metrics
+as an average aggregated across 1 minute (the window may change in the future).
+The metrics should be available for all resources supported in the scheduler.
+Currently the scheduler does not need this information, because it schedules best-effort pods
+without considering node usage. But having the metrics available in the API server is a blocker
+for adding the ability to take node usage into account when scheduling best-effort pods.
+
+## Other considered use cases
+
+This section describes the other considered use cases and explains why they are out
+of the scope of the MVP version.
+
+#### Custom metrics in HPA
+
+HPA requires the latest value of application level metrics.
+
+The design of the pipeline for collecting application level metrics should
+be revisited, and it's not clear whether application level metrics should be
+available in the API server, so this use case won't be supported initially.
+
+#### Cluster Federation
+
+The Cluster Federation control system might want to consider cluster-level usage (in addition to cluster-level request)
+of running pods when choosing where to schedule new pods. Although
+Cluster Federation is still in design,
+we expect the metrics API described here to be sufficient. Cluster-level usage can be
+obtained by summing over usage of all nodes in the cluster.
+
+#### kubectl top
+
+This feature is not yet specified/implemented, although it seems reasonable to provide users with information
+about resource usage at the pod/node level.
+
+Since this feature has not been fully specified yet, it will not be supported initially in the API, although
+it will probably be possible to provide a reasonable implementation of the feature anyway.
+
+#### Kubernetes dashboard
+
+[Kubernetes dashboard](https://github.com/kubernetes/dashboard), in order to draw graphs, requires resource usage
+in timeseries format over a relatively long period of time. Aggregations should also be possible at various levels,
+including replication controllers, deployments, services, etc.
+
+Since this use case is complicated, it will not be supported initially in the API; the dashboard will query Heapster
+directly using some custom API there.
+
+## Proposed API
+
+Initially the metrics API will be in a separate [API group](api-group.md) called `metrics`.
+Later, if we decide to have Node and Pod in different API groups,
+NodeMetrics and PodMetrics should also be in different API groups.
+
+#### Schema
+
+The proposed schema is as follows. Each top-level object has `TypeMeta` and `ObjectMeta` fields
+to be compatible with Kubernetes API standards.
+
+```go
+type NodeMetrics struct {
+ unversioned.TypeMeta
+ ObjectMeta
+
+ // The following fields define time interval from which metrics were
+ // collected in the following format [Timestamp-Window, Timestamp].
+ Timestamp unversioned.Time
+ Window unversioned.Duration
+
+ // The memory usage is the memory working set.
+ Usage v1.ResourceList
+}
+
+type PodMetrics struct {
+ unversioned.TypeMeta
+ ObjectMeta
+
+ // The following fields define time interval from which metrics were
+ // collected in the following format [Timestamp-Window, Timestamp].
+ Timestamp unversioned.Time
+ Window unversioned.Duration
+
+ // Metrics for all containers are collected within the same time window.
+ Containers []ContainerMetrics
+}
+
+type ContainerMetrics struct {
+ // Container name corresponding to the one from v1.Pod.Spec.Containers.
+ Name string
+ // The memory usage is the memory working set.
+ Usage v1.ResourceList
+}
+```
+
+By default `Usage` is the mean from samples collected within the returned time window.
+The default time window is 1 minute.
+
+#### Endpoints
+
+All endpoints are GET endpoints, rooted at `/apis/metrics/v1alpha1/`.
+There won't be support for the other REST methods.
+
+The list of supported endpoints:
+- `/nodes` - all node metrics; type `[]NodeMetrics`
+- `/nodes/{node}` - metrics for a specified node; type `NodeMetrics`
+- `/namespaces/{namespace}/pods` - all pod metrics within namespace with support for `all-namespaces`; type `[]PodMetrics`
+- `/namespaces/{namespace}/pods/{pod}` - metrics for a specified pod; type `PodMetrics`
+
+The following query parameters are supported:
+- `labelSelector` - restrict the list of returned objects by labels (list endpoints only)
+
+In the future we may want to introduce the following params:
+`aggregator` (`max`, `min`, `95th`, etc.) and `window` (`1h`, `1d`, `1w`, etc.)
+which will allow getting other aggregates over a custom time window.
+
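+For illustration, a client might read node metrics from the endpoints above as shown
+below; the use of `kubectl proxy` listening on localhost:8001 is an assumption for the
+example, not part of the proposal.
+
+```go
+package main
+
+import (
+    "fmt"
+    "io/ioutil"
+    "net/http"
+)
+
+func main() {
+    // Assumes `kubectl proxy` is serving the cluster API on localhost:8001.
+    resp, err := http.Get("http://localhost:8001/apis/metrics/v1alpha1/nodes")
+    if err != nil {
+        panic(err)
+    }
+    defer resp.Body.Close()
+
+    // The body is a JSON-encoded list of NodeMetrics objects as defined in the
+    // schema above; a real client would unmarshal it into those types.
+    body, err := ioutil.ReadAll(resp.Body)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(string(body))
+}
+```
+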
+## Further improvements
+
+Depending on further requirements, the following features may be added:
+- support for more metrics
+- support for application level metrics
+- watch for metrics
+- possibility to query for window sizes and aggregation functions (though single window size/aggregation function per request)
+- cluster level metrics
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-metrics-api.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/resource-qos.md b/contributors/design-proposals/resource-qos.md
new file mode 100644
index 00000000..cfbe4faf
--- /dev/null
+++ b/contributors/design-proposals/resource-qos.md
@@ -0,0 +1,218 @@
+# Resource Quality of Service in Kubernetes
+
+**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar)
+**Last Updated**: 5/17/2016
+
+**Status**: Implemented
+
+*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.*
+
+## Introduction
+
+This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*.
+Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
+
+Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use.
+The system computes pod level requests and limits by summing up per-resource requests and limits across all containers.
+When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers.
+This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees.
+Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
+
+## Requests and Limits
+
+For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](../proposals/node-allocatable.md) & `request <= limit <= Infinity`.
+If a pod is successfully scheduled, the container is guaranteed the amount of resources requested.
+Scheduling is based on `requests` and not `limits`.
+The pod and its containers will not be allowed to exceed the specified limit.
+How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
+
+### Compressible Resource Guarantees
+
+- For now, we are only supporting CPU.
+- Pods are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
+- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests 600 milli CPUs, and container B requests 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 10 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections; a small worked sketch of the split follows this list).
+- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
+
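+A small worked sketch of the 2:1 split in the example above, assuming excess CPU is
+divided in proportion to requests; the helper function and numbers are illustrative only.
+
+```go
+package main
+
+import "fmt"
+
+// shareOfExcess splits excess CPU (in milli CPUs) between containers in
+// proportion to their requests, as in the example above.
+func shareOfExcess(requests map[string]int, excess int) map[string]int {
+    total := 0
+    for _, r := range requests {
+        total += r
+    }
+    shares := map[string]int{}
+    for name, r := range requests {
+        shares[name] = excess * r / total
+    }
+    return shares
+}
+
+func main() {
+    // Container A requests 600m, container B requests 300m; both want more CPU.
+    split := shareOfExcess(map[string]int{"A": 600, "B": 300}, 10)
+    fmt.Printf("A: %dm, B: %dm\n", split["A"], split["B"]) // A: 6m, B: 3m -- a 2:1 ratio
+}
+```
+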
+### Incompressible Resource Guarantees
+
+- For now, we are only supporting memory.
+- Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
+- When Pods use more memory than their limit, the process using the most memory inside one of the pod's containers will be killed by the kernel.
+
+### Admission/Scheduling Policy
+
+- Pods will be admitted by Kubelet & scheduled by the scheduler based on the sum of requests of its containers. The scheduler & kubelet will ensure that sum of requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU).
+
+## QoS Classes
+
+In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
+
+The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use an (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
+
+Pods can be of one of 3 different classes:
+
+- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
+
+Examples:
+
+```yaml
+containers:
+- name: foo
+  resources:
+    limits:
+      cpu: 10m
+      memory: 1Gi
+- name: bar
+  resources:
+    limits:
+      cpu: 100m
+      memory: 100Mi
+```
+
+```yaml
+containers:
+- name: foo
+  resources:
+    limits:
+      cpu: 10m
+      memory: 1Gi
+    requests:
+      cpu: 10m
+      memory: 1Gi
+- name: bar
+  resources:
+    limits:
+      cpu: 100m
+      memory: 100Mi
+    requests:
+      cpu: 100m
+      memory: 100Mi
+```
+
+- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
+When `limits` are not specified, they default to the node capacity.
+
+Examples:
+
+Container `bar` has no resources specified.
+
+```yaml
+containers:
+- name: foo
+  resources:
+    limits:
+      cpu: 10m
+      memory: 1Gi
+    requests:
+      cpu: 10m
+      memory: 1Gi
+- name: bar
+```
+
+Containers `foo` and `bar` have limits set for different resources.
+
+```yaml
+containers:
+- name: foo
+  resources:
+    limits:
+      memory: 1Gi
+- name: bar
+  resources:
+    limits:
+      cpu: 100m
+```
+
+Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
+
+```yaml
+containers:
+- name: foo
+  resources:
+    requests:
+      cpu: 10m
+      memory: 1Gi
+- name: bar
+```
+
+- If `requests` and `limits` are not set for any of the resources, across all containers, then the pod is classified as **Best-Effort**.
+
+Examples:
+
+```yaml
+containers:
+- name: foo
+  resources:
+- name: bar
+  resources:
+```
+
+Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
+
+Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
+
+- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
+These containers can use any amount of free memory in the node though.
+
+- *Guaranteed* pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
+
+- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
+Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
+
+### OOM Score configuration at the Nodes
+
+Pod OOM score configuration
+- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed first.
+- The base OOM score is between 0 and 1000, so if process A’s OOM_SCORE_ADJ - process B’s OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.
+- The final OOM score of a process is also between 0 and 1000.
+
+*Best-effort*
+ - Set OOM_SCORE_ADJ: 1000
+ - So processes in best-effort containers will have an OOM_SCORE of 1000
+
+*Guaranteed*
+ - Set OOM_SCORE_ADJ: -998
+ - So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
+
+*Burstable* (a small worked sketch of this formula appears at the end of this section)
+ - If the total memory request > 99.8% of available memory, set OOM_SCORE_ADJ to 2
+ - Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested)
+ - This ensures that the OOM_SCORE of a burstable pod is > 1
+ - If the memory request is `0`, OOM_SCORE_ADJ is set to `999`.
+ - So burstable pods will be killed if they conflict with guaranteed pods
+ - If a burstable pod uses less memory than requested, its OOM_SCORE will be < 1000
+ - So best-effort pods will be killed if they conflict with burstable pods using less than their requested memory
+ - If a process in a burstable pod's container uses more memory than the container requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000
+ - Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
+ - If burstable pods' containers with multiple processes conflict, then the formula for OOM scores is a heuristic; it will not ensure "Request and Limit" guarantees.
+
+*Pod infra containers* or *Special Pod init process*
+ - OOM_SCORE_ADJ: -998
+
+*Kubelet, Docker*
+ - OOM_SCORE_ADJ: -999 (won’t be OOM killed)
+ - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.
+
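+A small worked sketch of the *Burstable* OOM_SCORE_ADJ rules above; the helper name and
+the rounding details are illustrative, not the exact Kubelet implementation.
+
+```go
+package main
+
+import "fmt"
+
+// burstableOOMScoreAdj computes OOM_SCORE_ADJ for a burstable pod from its total
+// memory request and the node's available memory, following the rules above.
+func burstableOOMScoreAdj(memoryRequestBytes, nodeMemoryBytes int64) int {
+    if memoryRequestBytes == 0 {
+        return 999
+    }
+    if float64(memoryRequestBytes) > 0.998*float64(nodeMemoryBytes) {
+        return 2
+    }
+    percentRequested := 100 * float64(memoryRequestBytes) / float64(nodeMemoryBytes)
+    return 1000 - int(10*percentRequested)
+}
+
+func main() {
+    // A pod requesting 1GiB on an 8GiB node: 1000 - 10*12.5 = 875.
+    fmt.Println(burstableOOMScoreAdj(1<<30, 8<<30))
+}
+```
+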
+## Known issues and possible improvements
+
+The above implementation provides for basic oversubscription with protection, but there are a few known limitations.
+
+#### Support for Swap
+
+- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.
+
+## Alternative QoS Class Policy
+
+An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed).
+A strict hierarchy of user-specified numerical priorities is not desirable because:
+
+1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively
+2. Changes to desired priority bands would require changes to all user pod configurations.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/resource-quota-scoping.md b/contributors/design-proposals/resource-quota-scoping.md
new file mode 100644
index 00000000..ac977d4e
--- /dev/null
+++ b/contributors/design-proposals/resource-quota-scoping.md
@@ -0,0 +1,333 @@
+# Resource Quota - Scoping resources
+
+## Problem Description
+
+### Ability to limit compute requests and limits
+
+The existing `ResourceQuota` API object constrains the total amount of compute
+resource requests. This is useful when a cluster-admin is interested in
+controlling explicit resource guarantees so that pods created by users who stay
+within their quota have a relatively strong guarantee of finding enough free
+resources in the cluster to be able to schedule. The end-user creating
+the pod is expected to have intimate knowledge of their minimum required resources
+as well as their potential limits.
+
+There are many environments where a cluster-admin does not extend this level
+of trust to their end-users because users often request too many resources, and
+they have trouble reasoning about what they hope to have available for their
+application versus what their application actually needs. In these environments,
+the cluster-admin will often just expose a single value (the limit) to the end-user.
+Internally, they may choose a variety of other strategies for setting the request.
+For example, some cluster operators are focused on satisfying a particular over-commit
+ratio and may choose to set the request as a factor of the limit to control for
+over-commit. Other cluster operators may defer to a resource estimation tool that
+sets the request based on known historical trends. In this environment, the
+cluster-admin is interested in exposing a quota to their end-users that maps
+to their desired limit instead of their request since that is the value the user
+manages.
+
+### Ability to limit impact to node and promote fair-use
+
+The current `ResourceQuota` API object does not provide the ability
+to quota best-effort pods separately from pods with resource guarantees.
+For example, if a cluster-admin applies a quota that caps requested
+cpu at 10 cores and memory at 10Gi, all pods in the namespace must
+make an explicit resource request for cpu and memory to satisfy
+quota. This prevents a namespace with a quota from supporting best-effort
+pods.
+
+In practice, the cluster-admin wants to control the impact of best-effort
+pods to the cluster, but not restrict the ability to run best-effort pods
+altogether.
+
+As a result, the cluster-admin requires the ability to control the
+max number of active best-effort pods. In addition, the cluster-admin
+requires the ability to scope a quota that limits compute resources to
+exclude best-effort pods.
+
+### Ability to quota long-running vs. bounded-duration compute resources
+
+The cluster-admin may want to quota end-users separately
+based on long-running vs. bounded-duration compute resources.
+
+For example, a cluster-admin may offer more compute resources
+for long running pods that are expected to have a more permanent residence
+on the node than bounded-duration pods. Many batch style workloads
+tend to consume as much resource as they can until something else applies
+the brakes. As a result, these workloads tend to operate at their limit,
+while many traditional web applications may often consume closer to their
+request if there is no active traffic. An operator that wants to control
+density will offer lower quota limits for batch workloads than web applications.
+
+A classic example is a PaaS deployment where the cluster-admin may
+allow a separate budget for pods that run their web application vs. pods that
+build web applications.
+
+Another example is providing more quota to a database pod than a
+pod that performs a database migration.
+
+## Use Cases
+
+* As a cluster-admin, I want the ability to quota
+ * compute resource requests
+ * compute resource limits
+ * compute resources for terminating vs. non-terminating workloads
+ * compute resources for best-effort vs. non-best-effort pods
+
+## Proposed Change
+
+### New quota tracked resources
+
+Support the following resources that can be tracked by quota.
+
+| Resource Name | Description |
+| ------------- | ----------- |
+| cpu | total cpu requests (backwards compatibility) |
+| memory | total memory requests (backwards compatibility) |
+| requests.cpu | total cpu requests |
+| requests.memory | total memory requests |
+| limits.cpu | total cpu limits |
+| limits.memory | total memory limits |
+
+### Resource Quota Scopes
+
+Add the ability to associate a set of `scopes` to a quota.
+
+A quota will only measure usage for a `resource` if it matches
+the intersection of enumerated `scopes`.
+
+Adding a `scope` to a quota limits the number of resources
+it supports to those that pertain to the `scope`. Specifying
+a resource on the quota object outside of the allowed set
+would result in a validation error.
+
+| Scope | Description |
+| ----- | ----------- |
+| Terminating | Match `kind=Pod` where `spec.activeDeadlineSeconds >= 0` |
+| NotTerminating | Match `kind=Pod` where `spec.activeDeadlineSeconds = nil` |
+| BestEffort | Match `kind=Pod` where `status.qualityOfService in (BestEffort)` |
+| NotBestEffort | Match `kind=Pod` where `status.qualityOfService not in (BestEffort)` |
+
+A `BestEffort` scope restricts a quota to tracking the following resources:
+
+* pod
+
+A `Terminating`, `NotTerminating`, `NotBestEffort` scope restricts a quota to
+tracking the following resources:
+
+* pod
+* memory, requests.memory, limits.memory
+* cpu, requests.cpu, limits.cpu
+
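+The following Go-style sketch illustrates how scope matching could work, using
+the `ResourceQuotaScope` constants introduced in the data model below. The
+`podInfo` type and function names are illustrative assumptions, not the actual
+quota admission code.
+
+```go
+type podInfo struct {
+	activeDeadlineSeconds *int64 // nil means the pod is not terminating
+	bestEffort            bool   // true if status.qualityOfService is BestEffort
+}
+
+// matchesScope implements the table above for a single scope.
+func matchesScope(scope ResourceQuotaScope, pod podInfo) bool {
+	switch scope {
+	case ResourceQuotaScopeTerminating:
+		return pod.activeDeadlineSeconds != nil
+	case ResourceQuotaScopeNotTerminating:
+		return pod.activeDeadlineSeconds == nil
+	case ResourceQuotaScopeBestEffort:
+		return pod.bestEffort
+	case ResourceQuotaScopeNotBestEffort:
+		return !pod.bestEffort
+	}
+	return false
+}
+
+// A quota with multiple scopes tracks a pod only if every scope matches.
+func matchesAllScopes(scopes []ResourceQuotaScope, pod podInfo) bool {
+	for _, s := range scopes {
+		if !matchesScope(s, pod) {
+			return false
+		}
+	}
+	return true
+}
+```
+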
+## Data Model Impact
+
+```go
+// The following identify resource constants for Kubernetes object types
+const (
+ // CPU request, in cores. (500m = .5 cores)
+ ResourceRequestsCPU ResourceName = "requests.cpu"
+ // Memory request, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
+ ResourceRequestsMemory ResourceName = "requests.memory"
+ // CPU limit, in cores. (500m = .5 cores)
+ ResourceLimitsCPU ResourceName = "limits.cpu"
+ // Memory limit, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
+ ResourceLimitsMemory ResourceName = "limits.memory"
+)
+
+// A scope is a filter that matches an object
+type ResourceQuotaScope string
+const (
+ ResourceQuotaScopeTerminating ResourceQuotaScope = "Terminating"
+ ResourceQuotaScopeNotTerminating ResourceQuotaScope = "NotTerminating"
+ ResourceQuotaScopeBestEffort ResourceQuotaScope = "BestEffort"
+ ResourceQuotaScopeNotBestEffort ResourceQuotaScope = "NotBestEffort"
+)
+
+// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
+// The quota matches by default on all objects in its namespace.
+// The quota can optionally match objects that satisfy a set of scopes.
+type ResourceQuotaSpec struct {
+ // Hard is the set of desired hard limits for each named resource
+ Hard ResourceList `json:"hard,omitempty"`
+ // A collection of filters that must match each object tracked by a quota.
+ // If not specified, the quota matches all objects.
+ Scopes []ResourceQuotaScope `json:"scopes,omitempty"`
+}
+```
+
+## Rest API Impact
+
+None.
+
+## Security Impact
+
+None.
+
+## End User Impact
+
+The `kubectl` commands that render quota should display its scopes.
+
+## Performance Impact
+
+This feature will make having more quota objects in a namespace
+more common in certain clusters. It increases the number of quota
+objects that need to be incremented during creation of an object
+in admission control, and the number of quota objects
+that need to be updated during controller loops.
+
+## Developer Impact
+
+None.
+
+## Alternatives
+
+This proposal initially enumerated a solution that leveraged a
+`FieldSelector` on a `ResourceQuota` object. A `FieldSelector`
+grouped an `APIVersion` and `Kind` with a selector over its
+fields that supported set-based requirements. It would have allowed
+a quota to track objects based on cluster defined attributes.
+
+For example, a quota could do the following:
+
+* match `Kind=Pod` where `spec.restartPolicy in (Always)`
+* match `Kind=Pod` where `spec.restartPolicy in (Never, OnFailure)`
+* match `Kind=Pod` where `status.qualityOfService in (BestEffort)`
+* match `Kind=Service` where `spec.type in (LoadBalancer)`
+ * see [#17484](https://github.com/kubernetes/kubernetes/issues/17484)
+
+Theoretically, it would enable support for fine-grained tracking
+on a variety of resource types. While extremely flexible, there
+are cons to this approach that make it premature to pursue
+at this time.
+
+* Generic field selectors are not yet settled art
+ * see [#1362](https://github.com/kubernetes/kubernetes/issues/1362)
+ * see [#19084](https://github.com/kubernetes/kubernetes/pull/19804)
+* Discovery API Limitations
+ * Not possible to discover the set of field selectors supported by a kind.
+ * Not possible to discover if a field is readonly, readwrite, or immutable
+ post-creation.
+
+The quota system would want to validate that a field selector is valid,
+and it would only want to select on those fields that are readonly/immutable
+post creation to make resource tracking work during update operations.
+
+The current proposal could grow to support a `FieldSelector` on a
+`ResourceQuotaSpec` and support a simple migration path to convert
+`scopes` to the matching `FieldSelector` once the project has identified
+how it wants to handle `fieldSelector` requirements longer term.
+
+This proposal previously discussed a solution that leveraged a
+`LabelSelector` as a mechanism to partition quota. This is potentially
+interesting to explore in the future to allow `namespace-admins` to
+quota workloads based on local knowledge. For example, a quota
+could match all kinds that match the selector
+`tier=cache, environment in (dev, qa)` separately from quota that
+matched `tier=cache, environment in (prod)`. This is interesting to
+explore in the future, but labels are insufficient selection targets
+for `cluster-administrators` to control footprint. In those instances,
+you need fields that are cluster controlled and not user-defined.
+
+## Example
+
+### Scenario 1
+
+The cluster-admin wants to restrict the following:
+
+* limit 2 best-effort pods
+* limit 2 terminating pods that cannot use more than 1Gi of memory and 2 cpu cores
+* limit 4 long-running pods that cannot use more than 4Gi of memory and 4 cpu cores
+* limit 6 pods in total, 10 replication controllers
+
+This would require the following quotas to be added to the namespace:
+
+```
+$ cat quota-best-effort
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+ name: quota-best-effort
+spec:
+ hard:
+ pods: "2"
+ scopes:
+ - BestEffort
+
+$ cat quota-terminating
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+ name: quota-terminating
+spec:
+ hard:
+ pods: "2"
+ limits.memory: 1Gi
+ limits.cpu: 2
+ scopes:
+ - Terminating
+ - NotBestEffort
+
+$ cat quota-longrunning
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+ name: quota-longrunning
+spec:
+ hard:
+ pods: "4"
+ limits.memory: 4Gi
+ limits.cpu: 4
+ scopes:
+ - NotTerminating
+ - NotBestEffort
+
+$ cat quota
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+ name: quota
+spec:
+ hard:
+ pods: "6"
+ replicationcontrollers: "10"
+```
+
+In the above scenario, every pod creation will result in its usage being
+tracked by `quota` since it has no additional scoping. The pod will then
+be tracked by 1 additional quota object based on the scopes it
+matches. In order for the pod creation to succeed, it must not violate
+the constraint of any matching quota. So for example, a best-effort pod
+would only be created if there was available quota in `quota-best-effort`
+and `quota`.
+
+## Implementation
+
+### Assignee
+
+@derekwaynecarr
+
+### Work Items
+
+* Add support for requests and limits
+* Add support for scopes in quota-related admission and controller code
+
+## Dependencies
+
+None.
+
+Longer term, we should evaluate what we want to do with `fieldSelector` as
+requests for different quota semantics continue to grow.
+
+## Testing
+
+Appropriate unit and e2e testing will be authored.
+
+## Documentation Impact
+
+Existing resource quota documentation and examples will be updated.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-quota-scoping.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/resources.md b/contributors/design-proposals/resources.md
new file mode 100644
index 00000000..bb66885b
--- /dev/null
+++ b/contributors/design-proposals/resources.md
@@ -0,0 +1,370 @@
+**Note: this is a design doc, which describes features that have not been
+completely implemented. User documentation of the current state is
+[here](../user-guide/compute-resources.md). The tracking issue for
+implementation of this model is [#168](http://issue.k8s.io/168). Currently, both
+limits and requests of memory and cpu on containers (not pods) are supported.
+"memory" is in bytes and "cpu" is in milli-cores.**
+
+# The Kubernetes resource model
+
+To do good pod placement, Kubernetes needs to know how big pods are, as well as
+the sizes of the nodes onto which they are being placed. The definition of "how
+big" is given by the Kubernetes resource model &mdash; the subject of this
+document.
+
+The resource model aims to be:
+* simple, for common cases;
+* extensible, to accommodate future growth;
+* regular, with few special cases; and
+* precise, to avoid misunderstandings and promote pod portability.
+
+## The resource model
+
+A Kubernetes _resource_ is something that can be requested by, allocated to, or
+consumed by a pod or container. Examples include memory (RAM), CPU, disk-time,
+and network bandwidth.
+
+Once resources on a node have been allocated to one pod, they should not be
+allocated to another until that pod is removed or exits. This means that
+Kubernetes schedulers should ensure that the sum of the resources allocated
+(requested and granted) to its pods never exceeds the usable capacity of the
+node. Testing whether a pod will fit on a node is called _feasibility checking_.
+
+Note that the resource model currently prohibits over-committing resources; we
+will want to relax that restriction later.
+
+### Resource types
+
+All resources have a _type_ that is identified by their _typename_ (a string,
+e.g., "memory"). Several resource types are predefined by Kubernetes (a full
+list is below), although only two will be supported at first: CPU and memory.
+Users and system administrators can define their own resource types if they wish
+(e.g., Hadoop slots).
+
+A fully-qualified resource typename is constructed from a DNS-style _subdomain_,
+followed by a slash `/`, followed by a name.
+* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt)
+(e.g., `kubernetes.io`, `example.com`).
+* The name must be no more than 63 characters, consisting of upper- or
+lower-case alphanumeric characters, with the `-`, `_`, and `.` characters
+allowed anywhere except the first or last character.
+* As a shorthand, any resource typename that does not start with a subdomain and
+a slash will automatically be prefixed with the built-in Kubernetes _namespace_,
+`kubernetes.io/` in order to fully-qualify it. This namespace is reserved for
+code in the open source Kubernetes repository; as a result, all user typenames
+MUST be fully qualified, and cannot be created in this namespace.
+
+Some example typenames include `memory` (which will be fully-qualified as
+`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
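+
+A minimal sketch of the fully-qualification shorthand described above (the
+function name is illustrative, not part of any existing API; it assumes the
+`strings` package):
+
+```go
+// fullyQualify prefixes a bare typename with the built-in Kubernetes namespace.
+func fullyQualify(typename string) string {
+	if strings.Contains(typename, "/") {
+		return typename // already qualified with a subdomain
+	}
+	return "kubernetes.io/" + typename
+}
+```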
+
+For future reference, note that some resources, such as CPU and network
+bandwidth, are _compressible_, which means that their usage can potentially be
+throttled in a relatively benign manner. All other resources are
+_incompressible_, which means that any attempt to throttle them is likely to
+cause grief. This distinction will be important if a Kubernetes implementation
+supports over-committing of resources.
+
+### Resource quantities
+
+Initially, all Kubernetes resource types are _quantitative_, and have an
+associated _unit_ for quantities of the associated resource (e.g., bytes for
+memory, bytes per seconds for bandwidth, instances for software licences). The
+units will always be a resource type's natural base units (e.g., bytes, not MB),
+to avoid confusion between binary and decimal multipliers and the underlying
+unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
+
+Resource quantities can be added and subtracted: for example, a node has a fixed
+quantity of each resource type that can be allocated to pods/containers; once
+such an allocation has been made, the allocated resources cannot be made
+available to other pods/containers without over-committing the resources.
+
+To make life easier for people, quantities can be represented externally as
+unadorned integers, or as fixed-point integers with one of these SI suffixes
+(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi,
+ Ki). For example, the following represent roughly the same value: 128974848,
+"129e6", "129M" , "123Mi". Small quantities can be represented directly as
+decimals (e.g., 0.3), or using milli-units (e.g., "300m").
+ * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML
+resource specifications that might be generated or read by people.
+ * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI
+suffix. There are no power-of-two equivalents for SI suffixes that represent
+multipliers less than 1.
+ * These conventions only apply to resource quantities, not arbitrary values.
+
+Internally (i.e., everywhere else), Kubernetes will represent resource
+quantities as integers so it can avoid problems with rounding errors, and will
+not use strings to represent numeric values. To achieve this, quantities that
+naturally have fractional parts (e.g., CPU seconds/second) will be scaled to
+integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in.
+Internal APIs, data structures, and protobufs will use these scaled integer
+units. Raw measurement data such as usage may still need to be tracked and
+calculated using floating point values, but internally they should be rescaled
+to avoid some values being in milli-units and some not.
+ * Note that reading in a resource quantity and writing it out again may change
+the way its values are represented, and truncate precision (e.g., 1.0001 may
+become 1.000), so comparison and difference operations (e.g., by an updater)
+must be done on the internal representations.
+ * Avoiding milli-units in external representations has advantages for people
+who will use Kubernetes, but runs the risk of developers forgetting to rescale
+or accidentally using floating-point representations. That seems like the right
+choice. We will try to reduce the risk by providing libraries that automatically
+do the quantization for JSON/YAML inputs.
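+
+As a rough illustration of the external-to-internal conversion (not the actual
+API machinery, which has a richer quantity type), the following sketch scales a
+few suffixed quantities to integer milli-units:
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+	"strconv"
+	"strings"
+)
+
+// multipliers covers only a handful of suffixes for illustration.
+var multipliers = map[string]float64{
+	"": 1, "m": 1e-3,
+	"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
+	"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30, "Ti": 1 << 40,
+}
+
+// toMilliUnits parses an external quantity such as "250m", "1.5", or "64Mi"
+// and returns the value scaled to integer milli-units.
+func toMilliUnits(s string) (int64, error) {
+	num := strings.TrimRight(s, "mKMGTi")
+	suffix := s[len(num):]
+	mult, ok := multipliers[suffix]
+	if !ok {
+		return 0, fmt.Errorf("unknown suffix %q", suffix)
+	}
+	v, err := strconv.ParseFloat(num, 64)
+	if err != nil {
+		return 0, err
+	}
+	return int64(math.Round(v * mult * 1000)), nil
+}
+
+func main() {
+	for _, q := range []string{"250m", "1.5", "64Mi"} {
+		mu, _ := toMilliUnits(q)
+		fmt.Printf("%s -> %d milli-units\n", q, mu)
+	}
+}
+```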
+
+### Resource specifications
+
+Both users and a number of system components, such as schedulers, (horizontal)
+auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers
+need to reason about resource requirements of workloads, resource capacities of
+nodes, and resource usage. Kubernetes divides specifications of *desired state*,
+aka the Spec, and representations of *current state*, aka the Status. Resource
+requirements and total node capacity fall into the specification category, while
+resource usage, characterizations derived from usage (e.g., maximum usage,
+histograms), and other resource demand signals (e.g., CPU load) clearly fall
+into the status category and are discussed in the Appendix for now.
+
+Resource requirements for a container or pod should have the following form:
+
+```yaml
+resourceRequirementSpec: [
+ request: [ cpu: 2.5, memory: "40Mi" ],
+ limit: [ cpu: 4.0, memory: "99Mi" ],
+]
+```
+
+Where:
+* _request_ [optional]: the amount of resources being requested, or that were
+requested and have been allocated. Scheduler algorithms will use these
+quantities to test feasibility (whether a pod will fit onto a node).
+If a container (or pod) tries to use more resources than its _request_, any
+associated SLOs are voided &mdash; e.g., the program it is running may be
+throttled (compressible resource types), or the attempt may be denied. If
+_request_ is omitted for a container, it defaults to _limit_ if that is
+explicitly specified, otherwise to an implementation-defined value; this will
+always be 0 for a user-defined resource type. If _request_ is omitted for a pod,
+it defaults to the sum of the (explicit or implicit) _request_ values for the
+containers it encloses.
+
+* _limit_ [optional]: an upper bound or cap on the maximum amount of resources
+that will be made available to a container or pod; if a container or pod uses
+more resources than its _limit_, it may be terminated. The _limit_ defaults to
+"unbounded"; in practice, this probably means the capacity of an enclosing
+container, pod, or node, but may result in non-deterministic behavior,
+especially for memory.
+
+Total capacity for a node should have a similar structure:
+
+```yaml
+resourceCapacitySpec: [
+ total: [ cpu: 12, memory: "128Gi" ]
+]
+```
+
+Where:
+* _total_: the total allocatable resources of a node. Initially, the resources
+at a given scope will bound the resources of the sum of inner scopes.
+
+#### Notes
+
+ * It is an error to specify the same resource type more than once in each
+list.
+
+ * It is an error for the _request_ or _limit_ values for a pod to be less than
+the sum of the (explicit or defaulted) values for the containers it encloses.
+(We may relax this later.)
+
+ * If multiple pods are running on the same node and attempting to use more
+resources than they have requested, the result is implementation-defined. For
+example: unallocated or unused resources might be spread equally across
+claimants, or the assignment might be weighted by the size of the original
+request, or as a function of limits, or priority, or the phase of the moon,
+perhaps modulated by the direction of the tide. Thus, although it's not
+mandatory to provide a _request_, it's probably a good idea. (Note that the
+_request_ could be filled in by an automated system that is observing actual
+usage and/or historical data.)
+
+ * Internally, the Kubernetes master can decide the defaulting behavior and the
+kubelet implementation may expect an absolute specification. For example, if
+the master decided that "the default is unbounded" it would pass 2^64 to the
+kubelet.
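+
+A minimal sketch of the feasibility check mentioned earlier (illustrative types
+and names, not the scheduler's actual code): a pod fits on a node if, for every
+resource type it requests, the already-allocated requests plus the pod's request
+do not exceed the node's total capacity.
+
+```go
+// ResourceList maps a fully-qualified resource typename to a quantity in the
+// resource's internal (integer) units.
+type ResourceList map[string]int64
+
+// fits reports whether podRequest can be allocated on a node with the given
+// total capacity and already-allocated requests.
+func fits(podRequest, allocated, capacity ResourceList) bool {
+	for typename, req := range podRequest {
+		if allocated[typename]+req > capacity[typename] {
+			return false
+		}
+	}
+	return true
+}
+```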
+
+
+## Kubernetes-defined resource types
+
+The following resource types are predefined ("reserved") by Kubernetes in the
+`kubernetes.io` namespace, and so cannot be used for user-defined resources.
+Note that the syntax of all resource types in the resource spec is deliberately
+similar, but some resource types (e.g., CPU) may receive significantly more
+support than simply tracking quantities in the schedulers and/or the Kubelet.
+
+### Processor cycles
+
+ * Name: `cpu` (or `kubernetes.io/cpu`)
+ * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to
+a canonical "Kubernetes CPU")
+ * Internal representation: milli-KCUs
+ * Compressible? yes
+ * Qualities: this is a placeholder for the kind of thing that may be supported
+in the future &mdash; see [#147](http://issue.k8s.io/147)
+ * [future] `schedulingLatency`: as per lmctfy
+ * [future] `cpuConversionFactor`: property of a node: the speed of a CPU
+core on the node's processor divided by the speed of the canonical Kubernetes
+CPU (a floating point value; default = 1.0).
+
+To reduce performance portability problems for pods, and to avoid worst-case
+provisioning behavior, the units of CPU will be normalized to a canonical
+"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be
+equivalent to a single CPU hyperthreaded core for some recent x86 processor. The
+normalization may be implementation-defined, although some reasonable defaults
+will be provided in the open-source Kubernetes code.
+
+Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
+be allocated &mdash; control of aspects like this will be handled by resource
+_qualities_ (a future feature).
+
+
+### Memory
+
+ * Name: `memory` (or `kubernetes.io/memory`)
+ * Units: bytes
+ * Compressible? no (at least initially)
+
+The precise meaning of "memory" is implementation dependent, but the
+basic idea is to rely on the underlying `memcg` mechanisms, support, and
+definitions.
+
+Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory
+quantities rather than decimal ones: "64MiB" rather than "64MB".
+
+
+## Resource metadata
+
+A resource type may have an associated read-only ResourceType structure, that
+contains metadata about the type. For example:
+
+```yaml
+resourceTypes: [
+ "kubernetes.io/memory": [
+ isCompressible: false, ...
+ ]
+ "kubernetes.io/cpu": [
+ isCompressible: true,
+ internalScaleExponent: 3, ...
+ ]
+ "kubernetes.io/disk-space": [ ... ]
+]
+```
+
+Kubernetes will provide ResourceType metadata for its predefined types. If no
+resource metadata can be found for a resource type, Kubernetes will assume that
+it is a quantified, incompressible resource that is not specified in
+milli-units, and has no default value.
+
+The defined properties are as follows:
+
+| field name | type | contents |
+| ---------- | ---- | -------- |
+| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
+| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
+| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
+| isCompressible | bool, default=false | true if the resource type is compressible |
+| defaultRequest | string, default=none | in the same format as a user-supplied value |
+| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
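+
+For illustration, a Go struct mirroring this table might look like the
+following (field names are taken from the table; the struct itself is an
+assumption, not an existing API type):
+
+```go
+type ResourceTypeMetadata struct {
+	Name                  string // fully-qualified typename, e.g. "kubernetes.io/cpu"
+	InternalScaleExponent int    // e.g. 3 if internal values are milli-units
+	Units                 string // e.g. "byte per second"; empty means dimensionless
+	IsCompressible        bool
+	DefaultRequest        string // same format as a user-supplied value; empty means none
+}
+```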
+
+
+# Appendix: future extensions
+
+The following are planned future extensions to the resource model, included here
+to encourage comments.
+
+## Usage data
+
+Because resource usage and related metrics change continuously, need to be
+tracked over time (i.e., historically), can be characterized in a variety of
+ways, and are fairly voluminous, we will not include usage in core API objects,
+such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs
+for accessing and managing that data. See the Appendix for possible
+representations of usage data, but the representation we'll use is TBD.
+
+Singleton values for observed and predicted future usage will rapidly prove
+inadequate, so we will support the following structure for extended usage
+information:
+
+```yaml
+resourceStatus: [
+ usage: [ cpu: <CPU-info>, memory: <memory-info> ],
+ maxusage: [ cpu: <CPU-info>, memory: <memory-info> ],
+ predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
+]
+```
+
+where a `<CPU-info>` or `<memory-info>` structure looks like this:
+
+```yaml
+{
+ mean: <value> # arithmetic mean
+ max: <value> # maximum value
+ min: <value> # minimum value
+ count: <value> # number of data points
+ percentiles: [ # map from %iles to values
+ "10": <10th-percentile-value>,
+ "50": <median-value>,
+ "99": <99th-percentile-value>,
+ "99.9": <99.9th-percentile-value>,
+ ...
+ ]
+}
+```
+
+All parts of this structure are optional, although we strongly encourage
+including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles.
+_[In practice, it will be important to include additional info such as the
+length of the time window over which the averages are calculated, the
+confidence level, and information-quality metrics such as the number of dropped
+or discarded data points.]_
+
+## Future resource types
+
+### _[future] Network bandwidth_
+
+ * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`)
+ * Units: bytes per second
+ * Compressible? yes
+
+### _[future] Network operations_
+
+ * Name: "network-iops" (or `kubernetes.io/network-iops`)
+ * Units: operations (messages) per second
+ * Compressible? yes
+
+### _[future] Storage space_
+
+ * Name: "storage-space" (or `kubernetes.io/storage-space`)
+ * Units: bytes
+ * Compressible? no
+
+The amount of secondary storage space available to a container. The main target
+is local disk drives and SSDs, although this could also be used to qualify
+remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a
+disk array, or a file system fronting any of these, is left for future work.
+
+### _[future] Storage time_
+
+ * Name: storage-time (or `kubernetes.io/storage-time`)
+ * Units: seconds per second of disk time
+ * Internal representation: milli-units
+ * Compressible? yes
+
+This is the amount of time a container spends accessing disk, including actuator
+and transfer time. A standard disk drive provides 1.0 diskTime seconds per
+second.
+
+### _[future] Storage operations_
+
+ * Name: "storage-iops" (or `kubernetes.io/storage-iops`)
+ * Units: operations per second
+ * Compressible? yes
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/runtime-client-server.md b/contributors/design-proposals/runtime-client-server.md
new file mode 100644
index 00000000..16cc677c
--- /dev/null
+++ b/contributors/design-proposals/runtime-client-server.md
@@ -0,0 +1,206 @@
+# Client/Server container runtime
+
+## Abstract
+
+A proposal of client/server implementation of kubelet container runtime interface.
+
+## Motivation
+
+Currently, any container runtime has to be linked into the kubelet. This makes
+experimentation difficult, and prevents users from landing an alternate
+container runtime without landing code in core kubernetes.
+
+To facilitate experimentation and to enable user choice, this proposal adds a
+client/server implementation of the [new container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899). The main goals
+of this proposal are:
+
+- make it easy to integrate new container runtimes
+- improve code maintainability
+
+## Proposed design
+
+**Design of client/server container runtime**
+
+The main idea of client/server container runtime is to keep main control logic in kubelet while letting remote runtime only do dedicated actions. An alpha [container runtime API](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) is introduced for integrating new container runtimes. The API is based on [protobuf](https://developers.google.com/protocol-buffers/) and [gRPC](http://www.grpc.io) for a number of benefits:
+
+- Performs faster than JSON
+- Get client bindings for free: gRPC supports ten languages
+- No encoding/decoding code needed
+- Manage API interfaces easily: server and client interfaces are generated automatically
+
+A new container runtime manager `KubeletGenericRuntimeManager` will be introduced to kubelet, which will
+
+- conform to kubelet's [Runtime](../../pkg/kubelet/container/runtime.go#L58) interface
+- manage the lifecycle of Pods and Containers according to kubelet policies
+- call the remote runtime's API to perform specific pod, container or image operations
+
+A simple workflow of invoking the remote runtime API when starting a Pod with two containers is shown below:
+
+```
+Kubelet KubeletGenericRuntimeManager RemoteRuntime
+ + + +
+ | | |
+ +---------SyncPod------------->+ |
+ | | |
+ | +---- Create PodSandbox ------->+
+ | +<------------------------------+
+ | | |
+ | XXXXXXXXXXXX |
+ | | X |
+ | | NetworkPlugin. |
+ | | SetupPod |
+ | | X |
+ | XXXXXXXXXXXX |
+ | | |
+ | +<------------------------------+
+ | +---- Pull image1 -------->+
+ | +<------------------------------+
+ | +---- Create container1 ------->+
+ | +<------------------------------+
+ | +---- Start container1 -------->+
+ | +<------------------------------+
+ | | |
+ | +<------------------------------+
+ | +---- Pull image2 -------->+
+ | +<------------------------------+
+ | +---- Create container2 ------->+
+ | +<------------------------------+
+ | +---- Start container2 -------->+
+ | +<------------------------------+
+ | | |
+ | <-------Success--------------+ |
+ | | |
+ + + +
+```
+
+Deleting a pod is shown below:
+
+```
+Kubelet KubeletGenericRuntimeManager RemoteRuntime
+ + + +
+ | | |
+ +---------SyncPod------------->+ |
+ | | |
+ | +---- Stop container1 ----->+
+ | +<------------------------------+
+ | +---- Delete container1 ----->+
+ | +<------------------------------+
+ | | |
+ | +---- Stop container2 ------>+
+ | +<------------------------------+
+ | +---- Delete container2 ------>+
+ | +<------------------------------+
+ | | |
+ | XXXXXXXXXXXX |
+ | | X |
+ | | NetworkPlugin. |
+ | | TeardownPod |
+ | | X |
+ | XXXXXXXXXXXX |
+ | | |
+ | | |
+ | +---- Delete PodSandbox ------>+
+ | +<------------------------------+
+ | | |
+ | <-------Success--------------+ |
+ | | |
+ + + +
+```
+
+**API definition**
+
+Since we are going to introduce more image formats and want to separate image management from containers and pods, this proposal introduces two services `RuntimeService` and `ImageService`. Both services are defined at [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto):
+
+```proto
+// Runtime service defines the public APIs for remote container runtimes
+service RuntimeService {
+ // Version returns the runtime name, runtime version and runtime API version
+ rpc Version(VersionRequest) returns (VersionResponse) {}
+
+ // CreatePodSandbox creates a pod-level sandbox.
+ // The definition of PodSandbox is at https://github.com/kubernetes/kubernetes/pull/25899
+ rpc CreatePodSandbox(CreatePodSandboxRequest) returns (CreatePodSandboxResponse) {}
+ // StopPodSandbox stops the sandbox. If there are any running containers in the
+ // sandbox, they should be force terminated.
+ rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
+ // DeletePodSandbox deletes the sandbox. If there are any running containers in the
+ // sandbox, they should be force deleted.
+ rpc DeletePodSandbox(DeletePodSandboxRequest) returns (DeletePodSandboxResponse) {}
+ // PodSandboxStatus returns the Status of the PodSandbox.
+ rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
+ // ListPodSandbox returns a list of SandBox.
+ rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}
+
+ // CreateContainer creates a new container in specified PodSandbox
+ rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
+ // StartContainer starts the container.
+ rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
+ // StopContainer stops a running container with a grace period (i.e., timeout).
+ rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
+ // RemoveContainer removes the container. If the container is running, the container
+ // should be force removed.
+ rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
+ // ListContainers lists all containers by filters.
+ rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
+ // ContainerStatus returns status of the container.
+ rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
+
+ // Exec executes the command in the container.
+ rpc Exec(stream ExecRequest) returns (stream ExecResponse) {}
+}
+
+// Image service defines the public APIs for managing images
+service ImageService {
+ // ListImages lists existing images.
+ rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
+ // ImageStatus returns the status of the image.
+ rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
+ // PullImage pulls an image with authentication config.
+ rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
+ // RemoveImage removes the image.
+ rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
+}
+```
+
+Note that some types in [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto) are already defined at [Container runtime interface/integration](https://github.com/kubernetes/kubernetes/pull/25899).
+We should decide how to integrate the types in [#25899](https://github.com/kubernetes/kubernetes/pull/25899) with gRPC services:
+
+* Auto-generate those types into protobuf by [go2idl](../../cmd/libs/go2idl/)
+ - Pros:
+ - trace type changes automatically, all type changes in Go will be automatically generated into proto files
+ - Cons:
+ - a type change may break existing API implementations, e.g. new fields added automatically may not be noticed by the remote runtime
+ - needs to convert Go types to gRPC-generated types, and vice versa
+ - needs to process attribute ordering carefully so as not to break generated protobufs (this could be done by using [protobuf tags](https://developers.google.com/protocol-buffers/docs/gotutorial))
+ - go2idl doesn't support gRPC, [protoc-gen-gogo](https://github.com/gogo/protobuf) is still required for generating gRPC client
+* Embed those types as raw protobuf definitions and generate Go files by [protoc-gen-gogo](https://github.com/gogo/protobuf)
+ - Pros:
+ - decouple type definitions, all type changes in Go will be added to proto manually, so it's easier to track gRPC API version changes
+ - Kubelet could reuse Go types generated by `protoc-gen-gogo` to avoid type conversions
+ - Cons:
+ - duplicate definition of same types
+ - hard to track type changes automatically
+ - need to manage proto files manually
+
+For better version controlling and fast iterations, this proposal embeds all those types in `api.proto` directly.
+
+## Implementation
+
+Each new runtime should implement the [gRPC](http://www.grpc.io) server based on [pkg/kubelet/api/v1alpha1/runtime/api.proto](../../pkg/kubelet/api/v1alpha1/runtime/api.proto). For version controlling, `KubeletGenericRuntimeManager` will request `RemoteRuntime`'s `Version()` interface with the runtime api version. To keep backward compatibility, the API follows standard [protobuf guide](https://developers.google.com/protocol-buffers/docs/proto) to deprecate or add new interfaces.
+
+A new flag `--container-runtime-endpoint` (overrides `--container-runtime`) will be introduced to kubelet which identifies the unix socket file of the remote runtime service. A second flag, `--image-service-endpoint`, will be introduced to kubelet which identifies the unix socket file of the image service.
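+
+As a hypothetical illustration of how the kubelet side might talk to a remote
+runtime over that unix socket, the sketch below assumes Go bindings generated
+from `api.proto` with standard gRPC naming (`NewRuntimeServiceClient`,
+`Version`, `VersionRequest`); the import path and endpoint value are
+placeholders, not the actual kubelet implementation.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"net"
+	"time"
+
+	"google.golang.org/grpc"
+
+	runtime "k8s.io/kubernetes/pkg/kubelet/api/v1alpha1/runtime" // assumed package path
+)
+
+func main() {
+	endpoint := "/var/run/remote-runtime.sock" // value of --container-runtime-endpoint
+	conn, err := grpc.Dial(endpoint, grpc.WithInsecure(),
+		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
+			// dial the unix socket instead of a TCP address
+			return net.DialTimeout("unix", addr, timeout)
+		}))
+	if err != nil {
+		panic(err)
+	}
+	defer conn.Close()
+
+	client := runtime.NewRuntimeServiceClient(conn)
+	resp, err := client.Version(context.Background(), &runtime.VersionRequest{})
+	if err != nil {
+		panic(err)
+	}
+	fmt.Printf("runtime version: %v\n", resp)
+}
+```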
+
+To facilitate switching the current container runtime (e.g. `docker` or `rkt`) to the new runtime API, `KubeletGenericRuntimeManager` will provide a plugin mechanism that allows either a local implementation or a gRPC implementation to be specified.
+
+## Community Discussion
+
+This proposal is first filed by [@brendandburns](https://github.com/brendandburns) at [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768):
+
+* [kubernetes/13768](https://github.com/kubernetes/kubernetes/issues/13768)
+* [kubernetes/13709](https://github.com/kubernetes/kubernetes/pull/13079)
+* [New container runtime interface](https://github.com/kubernetes/kubernetes/pull/25899)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-client-server.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/runtime-pod-cache.md b/contributors/design-proposals/runtime-pod-cache.md
new file mode 100644
index 00000000..d4926c3e
--- /dev/null
+++ b/contributors/design-proposals/runtime-pod-cache.md
@@ -0,0 +1,173 @@
+# Kubelet: Runtime Pod Cache
+
+This proposal builds on top of the Pod Lifecycle Event Generator (PLEG) proposed
+in [#12802](https://issues.k8s.io/12802). It assumes that Kubelet subscribes to
+the pod lifecycle event stream to eliminate periodic polling of pod
+states. Please see [#12802](https://issues.k8s.io/12802) for the motivation and
+design concept for PLEG.
+
+Runtime pod cache is an in-memory cache which stores the *status* of
+all pods, and is maintained by PLEG. It serves as a single source of
+truth for internal pod status, freeing Kubelet from querying the
+container runtime.
+
+## Motivation
+
+With PLEG, Kubelet no longer needs to perform comprehensive state
+checking for all pods periodically. It only instructs a pod worker to
+start syncing when there is a change of its pod status. Nevertheless,
+during each sync, a pod worker still needs to construct the pod status
+by examining all containers (whether dead or alive) in the pod, due to
+the lack of the caching of previous states. With the integration of
+pod cache, we can further improve Kubelet's CPU usage by
+
+ 1. Lowering the number of concurrent requests to the container
+ runtime since pod workers no longer have to query the runtime
+ individually.
+ 2. Lowering the total number of inspect requests because there is no
+ need to inspect containers with no state changes.
+
+***Don't we already have a [container runtime cache](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/runtime_cache.go)?***
+
+The runtime cache is an optimization that reduces the number of `GetPods()`
+calls from the workers. However,
+
+ * The cache does not store all information necessary for a worker to
+ complete a sync (e.g., `docker inspect`); workers still need to inspect
+ containers individually to generate `api.PodStatus`.
+ * Workers sometimes need to bypass the cache in order to retrieve the
+ latest pod state.
+
+This proposal generalizes the cache and instructs PLEG to populate the cache, so
+that the content is always up-to-date.
+
+**Why can't each worker cache its own pod status?**
+
+The short answer is yes, they can. The longer answer is that localized
+caching limits the use of the cache content -- other components cannot
+access it. This often leads to caching at multiple places and/or passing
+objects around, complicating the control flow.
+
+## Runtime Pod Cache
+
+![pod cache](pod-cache.png)
+
+Pod cache stores the `PodStatus` for all pods on the node. `PodStatus` encompasses
+all the information required from the container runtime to generate
+`api.PodStatus` for a pod.
+
+```go
+// PodStatus represents the status of the pod and its containers.
+// api.PodStatus can be derived from examining PodStatus and api.Pod.
+type PodStatus struct {
+ ID types.UID
+ Name string
+ Namespace string
+ IP string
+ ContainerStatuses []*ContainerStatus
+}
+
+// ContainerStatus represents the status of a container.
+type ContainerStatus struct {
+ ID ContainerID
+ Name string
+ State ContainerState
+ CreatedAt time.Time
+ StartedAt time.Time
+ FinishedAt time.Time
+ ExitCode int
+ Image string
+ ImageID string
+ Hash uint64
+ RestartCount int
+ Reason string
+ Message string
+}
+```
+
+`PodStatus` is defined in the container runtime interface, hence is
+runtime-agnostic.
+
+PLEG is responsible for updating the entries in the pod cache, hence always keeping
+the cache up-to-date.
+
+1. Detect change of container state
+2. Inspect the pod for details
+3. Update the pod cache with the new PodStatus
+ - If there is no real change of the pod entry, do nothing
+ - Otherwise, generate and send out the corresponding pod lifecycle event
+
+Note that in (3), PLEG can check if there is any disparity between the old
+and the new pod entry to filter out duplicated events if needed.
+
+### Evict cache entries
+
+Note that the cache represents all the pods/containers known by the container
+runtime. A cache entry should only be evicted if the pod is no longer visible
+to the container runtime. PLEG is responsible for deleting entries in the
+cache.
+
+### Generate `api.PodStatus`
+
+Because pod cache stores the up-to-date `PodStatus` of the pods, Kubelet can
+generate the `api.PodStatus` by interpreting the cache entry at any
+time. To avoid sending intermediate status (e.g., while a pod worker
+is restarting a container), we will instruct the pod worker to generate a new
+status at the beginning of each sync.
+
+### Cache contention
+
+Cache contention should not be a problem when the number of pods is
+small. When Kubelet scales, we can always shard the pods by ID to
+reduce contention.
+
+### Disk management
+
+The pod cache is not capable of fulfilling the needs of container/image garbage
+collectors as they may demand more than pod-level information. These components
+will still need to query the container runtime directly at times. We may
+consider extending the cache for these use cases, but they are beyond the scope
+of this proposal.
+
+
+## Impact on Pod Worker Control Flow
+
+A pod worker may perform various operations (e.g., start/kill a container)
+during a sync. They will expect to see the results of such operations reflected
+in the cache in the next sync. Alternatively, they can bypass the cache and
+query the container runtime directly to get the latest status. However, this
+is not desirable since the cache is introduced exactly to eliminate unnecessary,
+concurrent queries. Therefore, a pod worker should be blocked until all expected
+results have been updated to the cache by PLEG.
+
+Depending on the type of PLEG (see [#12802](https://issues.k8s.io/12802)) in
+use, the methods to check whether a requirement is met can differ. For a
+PLEG that solely relies on relisting, a pod worker can simply wait until the
+relist timestamp is newer than the end of the worker's last sync. On the other
+hand, if pod worker knows what events to expect, they can also block until the
+events are observed.
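+
+The following is an assumed sketch (not the actual kubelet cache) of how such
+blocking could work: readers wait until the cached entry for a pod is newer
+than a given timestamp, and PLEG signals waiters after every update. It reuses
+the `PodStatus` type above and assumes the `sync`, `time`, and `types` packages
+are imported.
+
+```go
+type cacheEntry struct {
+	status    *PodStatus
+	timestamp time.Time
+}
+
+type podCache struct {
+	mu      sync.Mutex
+	cond    *sync.Cond
+	entries map[types.UID]cacheEntry
+}
+
+func newPodCache() *podCache {
+	c := &podCache{entries: map[types.UID]cacheEntry{}}
+	c.cond = sync.NewCond(&c.mu)
+	return c
+}
+
+// GetNewerThan blocks until the cached PodStatus for id carries a timestamp
+// after minTime (e.g., the end of the worker's last sync), then returns it.
+func (c *podCache) GetNewerThan(id types.UID, minTime time.Time) *PodStatus {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	for {
+		if e, ok := c.entries[id]; ok && e.timestamp.After(minTime) {
+			return e.status
+		}
+		c.cond.Wait() // woken up by Set after each PLEG update
+	}
+}
+
+// Set is called by PLEG after (re)inspecting a pod.
+func (c *podCache) Set(id types.UID, status *PodStatus, now time.Time) {
+	c.mu.Lock()
+	c.entries[id] = cacheEntry{status: status, timestamp: now}
+	c.mu.Unlock()
+	c.cond.Broadcast()
+}
+```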
+
+It should be noted that `api.PodStatus` will only be generated by the pod
+worker *after* the cache has been updated. This means that the perceived
+responsiveness of Kubelet (from querying the API server) will be affected by
+how soon the cache can be populated. For the pure-relisting PLEG, the relist
+period can become the bottleneck. On the other hand, a PLEG which watches the
+upstream event stream (and knows what events to expect) is not restricted
+by such periods and should improve Kubelet's perceived responsiveness.
+
+## TODOs for v1.2
+
+ - Redefine container runtime types ([#12619](https://issues.k8s.io/12619))
+ and introduce `PodStatus`. Refactor dockertools and rkt to use the new type.
+
+ - Add cache and instruct PLEG to populate it.
+
+ - Refactor Kubelet to use the cache.
+
+ - Deprecate the old runtime cache.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtime-pod-cache.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/runtimeconfig.md b/contributors/design-proposals/runtimeconfig.md
new file mode 100644
index 00000000..896ca130
--- /dev/null
+++ b/contributors/design-proposals/runtimeconfig.md
@@ -0,0 +1,69 @@
+# Overview
+
+Proposes adding a `--feature-config` flag to core kube system components:
+apiserver, scheduler, controller-manager, kube-proxy, and selected addons.
+This flag will be used to enable/disable alpha features on a per-component basis.
+
+## Motivation
+
+The motivation is enabling/disabling features that are not tied to
+an API group. API groups can be selectively enabled/disabled in the
+apiserver via existing `--runtime-config` flag on apiserver, but there is
+currently no mechanism to toggle alpha features that are controlled by
+e.g. annotations. This means the burden of controlling whether such
+features are enabled in a particular cluster is on feature implementors;
+they must either define some ad hoc mechanism for toggling (e.g. flag
+on component binary) or else toggle the feature on/off at compile time.
+
+By adding a `--feature-config` flag to all kube-system components, alpha features
+can be toggled on a per-component basis by passing `enableAlphaFeature=true|false`
+to `--feature-config` for each component that the feature touches.
+
+## Design
+
+The following components will all get a `--feature-config` flag,
+which loads a `config.ConfigurationMap`:
+
+- kube-apiserver
+- kube-scheduler
+- kube-controller-manager
+- kube-proxy
+- kube-dns
+
+(Note that kubelet is omitted; its dynamic config story is being addressed
+by #29459.) Alpha features that are not accessed via an alpha API
+group should define an `enableFeatureName` flag and use it to toggle
+activation of the feature in each system component that the feature
+uses.
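+
+A hypothetical sketch of how a component could consume such a flag, assuming
+the value is a comma-separated `key=value` list (the `enableFoo`/`enableBar`
+names are placeholders, not real features):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+// parseConfigurationMap turns "enableFoo=true,enableBar=false" into a map.
+func parseConfigurationMap(s string) map[string]string {
+	m := map[string]string{}
+	for _, pair := range strings.Split(s, ",") {
+		if kv := strings.SplitN(pair, "=", 2); len(kv) == 2 {
+			m[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
+		}
+	}
+	return m
+}
+
+// featureEnabled defaults to false, matching the convention that alpha
+// features are off unless explicitly toggled on.
+func featureEnabled(cfg map[string]string, key string) bool {
+	v, ok := cfg[key]
+	if !ok {
+		return false
+	}
+	enabled, err := strconv.ParseBool(v)
+	return err == nil && enabled
+}
+
+func main() {
+	cfg := parseConfigurationMap("enableFoo=true,enableBar=false")
+	fmt.Println(featureEnabled(cfg, "enableFoo")) // true
+	fmt.Println(featureEnabled(cfg, "enableBar")) // false
+}
+```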
+
+## Suggested conventions
+
+This proposal only covers adding a mechanism to toggle features in
+system components. Implementation details will still depend on the alpha
+feature's owner(s). The following are suggested conventions:
+
+- Naming for feature config entries should follow the pattern
+ "enable<FeatureName>=true".
+- Features that touch multiple components should reserve the same key
+ in each component to toggle on/off.
+- Alpha features should be disabled by default. Beta features may
+ be enabled by default. Refer to docs/devel/api_changes.md#alpha-beta-and-stable-versions
+ for more detailed guidance on alpha vs. beta.
+
+## Upgrade support
+
+As the primary motivation for cluster config is toggling alpha
+features, upgrade support is not in scope. Enabling or disabling
+a feature is necessarily a breaking change, so config should
+not be altered in a running cluster.
+
+## Future work
+
+1. The eventual plan is for component config to be managed by versioned
+APIs and not flags (#12245). When that is added, toggling of features
+could be handled by versioned component config and the component flags
+deprecated.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/runtimeconfig.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/scalability-testing.md b/contributors/design-proposals/scalability-testing.md
new file mode 100644
index 00000000..d0fcd1be
--- /dev/null
+++ b/contributors/design-proposals/scalability-testing.md
@@ -0,0 +1,72 @@
+
+## Background
+
+We have a goal to be able to scale to 1000-node clusters by end of 2015.
+As a result, we need to be able to run some kind of regression tests and deliver
+a mechanism so that developers can test their changes with respect to performance.
+
+Ideally, we would like to run performance tests also on PRs - although it might
+be impossible to run them on every single PR, we may introduce a possibility for
+a reviewer to trigger them if the change has a non-obvious impact on performance
+(something like "k8s-bot run scalability tests please" should be feasible).
+
+However, running performance tests on 1000-node clusters (or even bigger ones in
+the future) is a non-starter. Thus, we need some more sophisticated infrastructure
+to simulate big clusters on a relatively small number of machines and/or cores.
+
+This document describes two approaches to tackling this problem.
+Once we have a better understanding of their consequences, we may want to
+decide to drop one of them, but we are not yet in that position.
+
+
+## Proposal 1 - Kubemark
+
+In this proposal we are focusing on scalability testing of master components.
+We do NOT focus on node-scalability - this issue should be handled separately.
+
+Since we do not focus on node performance, we don't need a real Kubelet or
+KubeProxy - in fact we don't even need to start real containers.
+All we actually need is to have some Kubelet-like and KubeProxy-like components
+that simulate the load on the apiserver that their real equivalents generate
+(e.g. sending NodeStatus updates, watching for pods, watching for
+endpoints (KubeProxy), etc.).
+
+What needs to be done:
+
+1. Determine what requests both KubeProxy and Kubelet are sending to apiserver.
+2. Create a KubeletSim that generates the same load on the apiserver as the
+   real Kubelet does, but does not start any containers. In the initial version we
+   can assume that pods never die, so it is enough to just react to the state
+   changes read from the apiserver.
+   TBD: Maybe we can reuse a real Kubelet for it by just injecting some "fake"
+   interfaces into it?
+3. Similarly, create a KubeProxySim that generates the same load on the apiserver
+   as a real KubeProxy. Again, since we are not planning to talk to those
+   containers, it basically doesn't need to do anything apart from that.
+   TBD: Maybe we can reuse a real KubeProxy for it by just injecting some "fake"
+   interfaces into it?
+4. Refactor kube-up/kube-down scripts (or create new ones) to allow starting
+ a cluster with KubeletSim and KubeProxySim instead of real ones and put
+ a bunch of them on a single machine.
+5. Create a load generator for it (probably initially it would be enough to
+ reuse tests that we use in gce-scalability suite).
+
+
+## Proposal 2 - Oversubscribing
+
+The other method we are proposing is to oversubscribe resources,
+or in essence enable a single host to look like many separate nodes even though
+they all reside on that single host. This is a well-established pattern in many different
+cluster managers (for more details see
+http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html ).
+There are a couple of different ways to accomplish this, but the most viable method
+is to run privileged kubelet pods under a host's kubelet process. These pods then
+register back with the master via the introspective service using modified names
+so as not to collide.
+
+Complications may currently exist around container tracking and ownership in docker.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scalability-testing.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/scheduledjob.md b/contributors/design-proposals/scheduledjob.md
new file mode 100644
index 00000000..9c7e8d9f
--- /dev/null
+++ b/contributors/design-proposals/scheduledjob.md
@@ -0,0 +1,335 @@
+# ScheduledJob Controller
+
+## Abstract
+
+A proposal for implementing a new controller - ScheduledJob controller - which
+will be responsible for managing time based jobs, namely:
+* once at a specified point in time,
+* repeatedly at a specified point in time.
+
+There is already a discussion regarding this subject:
+* Distributed CRON jobs [#2156](https://issues.k8s.io/2156)
+
+There are also similar solutions available, already:
+* [Mesos Chronos](https://github.com/mesos/chronos)
+* [Quartz](http://quartz-scheduler.org/)
+
+
+## Use Cases
+
+1. Be able to schedule a job execution at a given point in time.
+1. Be able to create a periodic job, e.g. database backup, sending emails.
+
+
+## Motivation
+
+ScheduledJobs are needed for performing all time-related actions, namely backups,
+report generation and the like. Each of these tasks should be allowed to run
+repeatedly (once a day/month, etc.) or once at a given point in time.
+
+
+## Design Overview
+
+Users create a ScheduledJob object. One ScheduledJob object
+is like one line of a crontab file. It has a schedule of when to run,
+in [Cron](https://en.wikipedia.org/wiki/Cron) format.
+
+
+The ScheduledJob controller creates a [Job](job.md) object
+about once per execution time of the schedule (e.g. once per
+day for a daily schedule). We say "about" because there are certain
+circumstances where two jobs might be created, or no job might be
+created. We attempt to make these rare, but do not completely prevent
+them. Therefore, Jobs should be idempotent.
+
+The Job object is responsible for any retrying of Pods, and any parallelism
+among pods it creates, and determining the success or failure of the set of
+pods. The ScheduledJob does not examine pods at all.
+
+
+### ScheduledJob resource
+
+The new `ScheduledJob` object will have the following contents:
+
+```go
+// ScheduledJob represents the configuration of a single scheduled job.
+type ScheduledJob struct {
+ TypeMeta
+ ObjectMeta
+
+ // Spec is a structure defining the expected behavior of a job, including the schedule.
+ Spec ScheduledJobSpec
+
+ // Status is a structure describing current status of a job.
+ Status ScheduledJobStatus
+}
+
+// ScheduledJobList is a collection of scheduled jobs.
+type ScheduledJobList struct {
+ TypeMeta
+ ListMeta
+
+ Items []ScheduledJob
+}
+```
+
+The `ScheduledJobSpec` structure contains all the information describing what the actual
+job execution will look like, including the `JobSpec` from the [Job API](job.md)
+and the schedule in [Cron](https://en.wikipedia.org/wiki/Cron) format. This implies
+that each ScheduledJob execution is created from the JobSpec as it exists at the point
+in time when the execution starts. It also implies that any changes
+to the ScheduledJobSpec are applied only to subsequent executions of the job.
+
+```go
+// ScheduledJobSpec describes what the job execution will look like and when it will actually run.
+type ScheduledJobSpec struct {
+
+ // Schedule contains the schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
+ Schedule string
+
+ // Optional deadline in seconds for starting the job if it misses scheduled
+ // time for any reason. Missed job executions will be counted as failed ones.
+ StartingDeadlineSeconds *int64
+
+ // ConcurrencyPolicy specifies how to treat concurrent executions of a Job.
+ ConcurrencyPolicy ConcurrencyPolicy
+
+ // Suspend flag tells the controller to suspend subsequent executions; it does
+ // not apply to already started executions. Defaults to false.
+ Suspend bool
+
+ // JobTemplate is the object that describes the job that will be created when
+ // executing a ScheduledJob.
+ JobTemplate *JobTemplateSpec
+}
+
+// JobTemplateSpec describes the Job that will be created when executing
+// a ScheduledJob, including its standard metadata.
+type JobTemplateSpec struct {
+ ObjectMeta
+
+ // Specification of the desired behavior of the job.
+ Spec JobSpec
+}
+
+// ConcurrencyPolicy describes how the job will be handled.
+// Only one of the following concurrent policies may be specified.
+// If none of the following policies is specified, the default one
+// is AllowConcurrent.
+type ConcurrencyPolicy string
+
+const (
+ // AllowConcurrent allows ScheduledJobs to run concurrently.
+ AllowConcurrent ConcurrencyPolicy = "Allow"
+
+ // ForbidConcurrent forbids concurrent runs, skipping next run if previous
+ // hasn't finished yet.
+ ForbidConcurrent ConcurrencyPolicy = "Forbid"
+
+ // ReplaceConcurrent cancels currently running job and replaces it with a new one.
+ ReplaceConcurrent ConcurrencyPolicy = "Replace"
+)
+```
+
+The `ScheduledJobStatus` structure contains information about scheduled
+job executions. The structure holds a list of currently running job instances
+and additional information about overall successful and unsuccessful job executions.
+
+```go
+// ScheduledJobStatus represents the current state of a Job.
+type ScheduledJobStatus struct {
+ // Active holds pointers to currently running jobs.
+ Active []ObjectReference
+
+ // Successful tracks the overall number of successful completions of this job.
+ Successful int64
+
+ // Failed tracks the overall number of failures of this job.
+ Failed int64
+
+ // LastScheduleTime records the last time the job was successfully scheduled.
+ LastScheduleTime Time
+}
+```
+
+Users must use a generated selector for the job.
+
+## Modifications to Job resource
+
+TODO for beta: forbid manual selectors since they could cause confusion between
+subsequent jobs.
+
+### Running ScheduledJobs using kubectl
+
+A user should be able to easily start a Scheduled Job using `kubectl` (similarly
+to running regular jobs). For example, to run a job with a specified schedule,
+a user should be able to type something simple like:
+
+```
+kubectl run pi --image=perl --restart=OnFailure --runAt="0 14 21 7 *" -- perl -Mbignum=bpi -wle 'print bpi(2000)'
+```
+
+In the above example:
+
+* `--restart=OnFailure` implies creating a job instead of a replicationController.
+* `--runAt="0 14 21 7 *"` implies the schedule with which the job should be run, here
+ July 21, 2pm. This value will be validated according to the same rules which
+ apply to `.spec.schedule`.
+
+## Fields Added to Job Template
+
+When the controller creates a Job from the JobTemplateSpec in the ScheduledJob, it
+adds the following fields to the Job:
+
+- a name, based on the ScheduledJob's name, but with a suffix to distinguish
+ multiple executions, which may overlap.
+- the standard created-by annotation on the Job, pointing to the SJ that created it.
+  The standard key is `kubernetes.io/created-by`. The value is a serialized JSON object, like
+  `{ "kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ScheduledJob","namespace":"default",`
+  `"name":"nightly-earnings-report","uid":"5ef034e0-1890-11e6-8935-42010af0003e","apiVersion":...`
+  This serialization contains the UID of the parent, which is used to match the Job to the SJ that
+  created it.
+
+## Updates to ScheduledJobs
+
+If the schedule is updated on a ScheduledJob, it will:
+- continue to use the Status.Active list of jobs to detect conflicts.
+- try to fulfill all recently-passed times for the new schedule, by starting
+  new jobs. But it will not try to fulfill times prior to
+  Status.LastScheduleTime.
+  - Example: If you have a schedule to run every 30 minutes and change that to hourly, then a previously started
+    top-of-the-hour run, still in Status.Active, will be seen and no new job will be started.
+  - Example: If you have a schedule to run every hour and change that to every 30 minutes at 31 minutes past the
+    hour, then one run will be started immediately for the start time that has just passed.
+
+If the job template of a ScheduledJob is updated, then future executions use the new template
+but old ones still satisfy the schedule and are not re-run just because the template changed.
+
+If you delete and replace a ScheduledJob with one of the same name, it will:
+- not use any old Status.Active, and not consider any existing running or terminated jobs from the previous
+  ScheduledJob (which had a different UID) at all when determining conflicts, what needs to be started, etc.
+- not be able to create a new instance of a Job if an existing Job with the same time-based hash in its
+  name (see below) still exists. So, delete the old Job if you want to re-run it.
+- not "re-run" jobs for "start times" before the creation time of the new ScheduledJob object.
+- not consider executions from the previous UID when making decisions about what executions to
+  start, or status, etc.
+- lose the history of the old SJ.
+
+To preserve status, you can suspend the old one, and make one with a new name, or make a note of the old status.
+
+
+## Fault-Tolerance
+
+### Starting Jobs in the face of controller failures
+
+If the process running the scheduledJob controller fails
+and takes a while to restart, the controller may miss the
+time window for a job, at which point it is too late to start it.
+
+With a single scheduledJob controller process, we cannot give
+very strong assurances that job starts will never be missed.
+
+With a suggested HA configuration, there are multiple controller
+processes, and they use master election to determine which one
+is active at any time.
+
+If the Job's StartingDeadlineSeconds is long enough, and the
+lease for the master lock is short enough, and other controller
+processes are running, then a Job will be started.
+
+TODO: consider hard-coding the minimum StartingDeadlineSeconds
+at say 1 minute. Then we can offer a clearer guarantee,
+assuming we know what the setting of the lock lease duration is.
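+
+For illustration, the deadline check could look roughly like the sketch below. The
+function and argument names (`tooLateToStart`, `startingDeadlineSeconds`) are
+illustrative stand-ins, not the actual controller code:
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// tooLateToStart reports whether a job whose nominal start time was
+// scheduledTime may no longer be started, given an optional
+// StartingDeadlineSeconds. A nil deadline means "no deadline".
+// Illustrative only; not the actual controller logic.
+func tooLateToStart(scheduledTime, now time.Time, startingDeadlineSeconds *int64) bool {
+	if startingDeadlineSeconds == nil {
+		return false
+	}
+	deadline := scheduledTime.Add(time.Duration(*startingDeadlineSeconds) * time.Second)
+	return now.After(deadline)
+}
+
+func main() {
+	deadline := int64(100)
+	scheduled := time.Date(2016, time.May, 19, 14, 0, 0, 0, time.UTC)
+	fmt.Println(tooLateToStart(scheduled, scheduled.Add(90*time.Second), &deadline))  // false: still within the deadline
+	fmt.Println(tooLateToStart(scheduled, scheduled.Add(120*time.Second), &deadline)) // true: the window was missed
+}
+```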
+
+### Ensuring jobs are run at most once
+
+There are three problems here:
+
+- ensure at most one Job created per "start time" of a schedule.
+- ensure that at most one Pod is created per Job
+- ensure at most one container start occurs per Pod
+
+#### Ensuring one Job
+
+Multiple jobs might be created in the following sequence:
+
+1. scheduled job controller sends request to start Job J1 to fulfill start time T.
+1. the create request is accepted by the apiserver and enqueued but not yet written to etcd.
+1. scheduled job controller crashes
+1. new scheduled job controller starts, and lists the existing jobs, and does not see one created.
+1. it creates a new one.
+1. the first one eventually gets written to etcd.
+1. there are now two jobs for the same start time.
+
+We can solve this in several ways:
+
+1. with three-phase protocol, e.g.:
+ 1. controller creates a "suspended" job.
+ 1. controller writes an annotation in the SJ saying that it created a job for this time.
+ 1. controller unsuspends that job.
+1. by picking a deterministic name, so that at most one object create can succeed.
+
+#### Ensuring one Pod
+
+The Job object does not currently have a way to ask for this.
+Even if it did, the Job controller is not written to support it.
+The same problem as above applies.
+
+#### Ensuring one container invocation per Pod
+
+Kubelet is not written to ensure at-most-one-container-start per pod.
+
+#### Decision
+
+This is too hard to do for the alpha version. We will await user
+feedback to see if the "at most once" property is needed in the beta version.
+
+This is awkward but possible for a containerized application to ensure on its own: it needs
+to know which ScheduledJob name and start time it is from, and then record the attempt
+in a shared storage system. We should ensure it can extract this data from its annotations
+using the downward API.
+
+## Name of Jobs
+
+A ScheduledJob creates one Job each time a Job should run.
+Since there may be concurrent jobs, and since we might want to keep failed
+non-overlapping Jobs around as a debugging record, each Job created by the same ScheduledJob
+needs a distinct name.
+
+To make the Jobs from the same ScheduledJob distinct, we could use a random string,
+in the way that pods have a `generateName`. For example, a scheduledJob named `nightly-earnings-report`
+in namespace `ns1` might create a job `nightly-earnings-report-3m4d3`, and later create
+a job called `nightly-earnings-report-6k7ts`. This is consistent with pods, but
+does not give the user much information.
+
+Alternatively, we can use time as a uniquifier. For example, the same scheduledJob could
+create a job called `nightly-earnings-report-2016-May-19`.
+However, for Jobs that run more than once per day, we would need to represent
+time as well as date. Standard date formats (e.g. RFC 3339) use colons for time.
+Kubernetes names cannot include colons. Using a non-standard date format without colons
+will annoy some users.
+
+Also, date strings are much longer than random suffixes, which means that
+the pods will also have long names, and that we are more likely to exceed the
+253 character name limit when combining the scheduled-job name,
+the time suffix, and pod random suffix.
+
+One option would be to compute a hash of the nominal start time of the job,
+and use that as a suffix. This would not provide the user with an indication
+of the start time, but it would prevent creation of the same execution
+by two instances (replicated or restarting) of the controller process.
+
+We chose to use the hashed-date suffix approach.
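+
+To illustrate the idea, here is a minimal sketch of deriving a deterministic,
+hash-based suffix from the nominal start time. The hash function and suffix
+length are illustrative; the real controller's choices are not specified here:
+
+```go
+package main
+
+import (
+	"crypto/sha256"
+	"fmt"
+	"time"
+)
+
+// jobNameForStartTime derives a deterministic Job name from the parent
+// ScheduledJob name and the nominal start time, so that two controller
+// instances acting on the same start time produce the same name and only
+// one create can succeed.
+func jobNameForStartTime(scheduledJobName string, nominalStart time.Time) string {
+	sum := sha256.Sum256([]byte(nominalStart.UTC().Format(time.RFC3339)))
+	return fmt.Sprintf("%s-%x", scheduledJobName, sum[:5])
+}
+
+func main() {
+	t := time.Date(2016, time.May, 19, 2, 0, 0, 0, time.UTC)
+	fmt.Println(jobNameForStartTime("nightly-earnings-report", t))
+	// The same inputs always yield the same name.
+	fmt.Println(jobNameForStartTime("nightly-earnings-report", t))
+}
+```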
+
+## Future evolution
+
+Below are possible future extensions to the ScheduledJob controller:
+* Be able to specify workflow template in `.spec` field. This relates to the work
+ happening in [#18827](https://issues.k8s.io/18827).
+* Be able to specify more general template in `.spec` field, to create arbitrary
+ types of resources. This relates to the work happening in [#18215](https://issues.k8s.io/18215).
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/scheduledjob.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/scheduler_extender.md b/contributors/design-proposals/scheduler_extender.md
new file mode 100644
index 00000000..1f362242
--- /dev/null
+++ b/contributors/design-proposals/scheduler_extender.md
@@ -0,0 +1,105 @@
+# Scheduler extender
+
+There are three ways to add new scheduling rules (predicates and priority
+functions) to Kubernetes: (1) by adding these rules to the scheduler and
+recompiling (described here:
+https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md),
+(2) by implementing your own scheduler process that runs instead of, or alongside,
+the standard Kubernetes scheduler, or (3) by implementing a "scheduler extender"
+process that the standard Kubernetes scheduler calls out to as a final pass when
+making scheduling decisions.
+
+This document describes the third approach. This approach is needed for use
+cases where scheduling decisions need to be made on resources not directly
+managed by the standard Kubernetes scheduler. The extender helps make scheduling
+decisions based on such resources. (Note that the three approaches are not
+mutually exclusive.)
+
+When scheduling a pod, the extender allows an external process to filter and
+prioritize nodes. Two separate http/https calls are issued to the extender, one
+for "filter" and one for "prioritize" actions. To use the extender, you must
+create a scheduler policy configuration file. The configuration specifies how to
+reach the extender, whether to use http or https, and the timeout.
+
+```go
+// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
+// it is assumed that the extender chose not to provide that extension.
+type ExtenderConfig struct {
+ // URLPrefix at which the extender is available
+ URLPrefix string `json:"urlPrefix"`
+ // Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
+ FilterVerb string `json:"filterVerb,omitempty"`
+ // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
+ PrioritizeVerb string `json:"prioritizeVerb,omitempty"`
+ // The numeric multiplier for the node scores that the prioritize call generates.
+ // The weight should be a positive integer
+ Weight int `json:"weight,omitempty"`
+ // EnableHttps specifies whether https should be used to communicate with the extender
+ EnableHttps bool `json:"enableHttps,omitempty"`
+ // TLSConfig specifies the transport layer security config
+ TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"`
+ // HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
+ // timeout is ignored, k8s/other extenders priorities are used to select the node.
+ HTTPTimeout time.Duration `json:"httpTimeout,omitempty"`
+}
+```
+
+A sample scheduler policy file with extender configuration:
+
+```json
+{
+ "predicates": [
+ {
+ "name": "HostName"
+ },
+ {
+ "name": "MatchNodeSelector"
+ },
+ {
+ "name": "PodFitsResources"
+ }
+ ],
+ "priorities": [
+ {
+ "name": "LeastRequestedPriority",
+ "weight": 1
+ }
+ ],
+ "extenders": [
+ {
+ "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
+ "filterVerb": "filter",
+ "enableHttps": false
+ }
+ ]
+}
+```
+
+Arguments passed to the FilterVerb endpoint on the extender are the set of nodes
+filtered through the k8s predicates and the pod. Arguments passed to the
+PrioritizeVerb endpoint on the extender are the set of nodes filtered through
+the k8s predicates and extender predicates and the pod.
+
+```go
+// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
+// nodes for a pod.
+type ExtenderArgs struct {
+ // Pod being scheduled
+ Pod api.Pod `json:"pod"`
+ // List of candidate nodes where the pod can be scheduled
+ Nodes api.NodeList `json:"nodes"`
+}
+```
+
+The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call
+returns priorities for each node (schedulerapi.HostPriorityList).
+
+The "filter" call may prune the set of nodes based on its predicates. Scores
+returned by the "prioritize" call are added to the k8s scores (computed through
+its priority functions) and used for final host selection.
+
+Multiple extenders can be configured in the scheduler policy.
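+
+To make the contract concrete, below is a minimal sketch of an extender server
+matching the sample policy above. The request and response structs are
+simplified stand-ins for the `schedulerapi` types (`ExtenderArgs`,
+`ExtenderFilterResult`, `HostPriorityList`), and the URL layout assumes the
+verb is appended to `urlPrefix` with a `/`:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"log"
+	"net/http"
+)
+
+// Simplified stand-ins for the scheduler API types; the shapes below keep
+// only what this sketch needs.
+type nodeList struct {
+	Items []struct {
+		Metadata struct {
+			Name string `json:"name"`
+		} `json:"metadata"`
+	} `json:"items"`
+}
+
+type extenderArgs struct {
+	Pod   json.RawMessage `json:"pod"`
+	Nodes nodeList        `json:"nodes"`
+}
+
+type hostPriority struct {
+	Host  string `json:"host"`
+	Score int    `json:"score"`
+}
+
+func filter(w http.ResponseWriter, r *http.Request) {
+	var args extenderArgs
+	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	// Trivial predicate: accept every candidate node unchanged.
+	json.NewEncoder(w).Encode(args.Nodes)
+}
+
+func prioritize(w http.ResponseWriter, r *http.Request) {
+	var args extenderArgs
+	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	// Trivial priority: equal score for every node; the scheduler multiplies
+	// these scores by the configured extender weight.
+	prios := make([]hostPriority, 0, len(args.Nodes.Items))
+	for _, n := range args.Nodes.Items {
+		prios = append(prios, hostPriority{Host: n.Metadata.Name, Score: 1})
+	}
+	json.NewEncoder(w).Encode(prios)
+}
+
+func main() {
+	// Assumes the scheduler builds the URL as urlPrefix + "/" + verb; the
+	// sample policy above only configures filterVerb, so the prioritize
+	// endpoint would additionally need a prioritizeVerb entry.
+	http.HandleFunc("/api/scheduler/filter", filter)
+	http.HandleFunc("/api/scheduler/prioritize", prioritize)
+	log.Fatal(http.ListenAndServe("127.0.0.1:12345", nil))
+}
+```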
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/seccomp.md b/contributors/design-proposals/seccomp.md
new file mode 100644
index 00000000..de00cbc0
--- /dev/null
+++ b/contributors/design-proposals/seccomp.md
@@ -0,0 +1,266 @@
+## Abstract
+
+A proposal for adding **alpha** support for
+[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a
+system call filtering facility in the Linux kernel which lets applications
+define limits on system calls they may make, and what should happen when
+system calls are made. Seccomp is used to reduce the attack surface available
+to applications.
+
+## Motivation
+
+Applications use seccomp to restrict the set of system calls they can make.
+Recently, container runtimes have begun adding features to allow the runtime
+to interact with seccomp on behalf of the application, which eliminates the
+need for applications to link against libseccomp directly. Adding support in
+the Kubernetes API for describing seccomp profiles will allow administrators
+greater control over the security of workloads running in Kubernetes.
+
+Goals of this design:
+
+1. Describe how to reference seccomp profiles in containers that use them
+
+## Constraints and Assumptions
+
+This design should:
+
+* build upon previous security context work
+* be container-runtime agnostic
+* allow use of custom profiles
+* facilitate containerized applications that link directly to libseccomp
+
+## Use Cases
+
+1. As an administrator, I want to be able to grant access to a seccomp profile
+ to a class of users
+2. As a user, I want to run an application with a seccomp profile similar to
+ the default one provided by my container runtime
+3. As a user, I want to run an application which is already libseccomp-aware
+ in a container, and for my application to manage interacting with seccomp
+ unmediated by Kubernetes
+4. As a user, I want to be able to use a custom seccomp profile and use
+ it with my containers
+
+### Use Case: Administrator access control
+
+Controlling access to seccomp profiles is a cluster administrator
+concern. It should be possible for an administrator to control which users
+have access to which profiles.
+
+The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893)
+API extension governs the ability of users to make requests that affect pod
+and container security contexts. The proposed design should deal with
+required changes to control access to new functionality.
+
+### Use Case: Seccomp profiles similar to container runtime defaults
+
+Many users will want to use images that make assumptions about running in the
+context of their chosen container runtime. Such images are likely to
+frequently assume that they are running in the context of the container
+runtime's default seccomp settings. Therefore, it should be possible to
+express a seccomp profile similar to a container runtime's defaults.
+
+As an example, all dockerhub 'official' images are compatible with the Docker
+default seccomp profile. So, any user who wanted to run one of these images
+with seccomp would want the default profile to be accessible.
+
+### Use Case: Applications that link to libseccomp
+
+Some applications already link to libseccomp and control seccomp directly. It
+should be possible to run these applications unmodified in Kubernetes; this
+implies there should be a way to disable seccomp control in Kubernetes for
+certain containers, or to run with a "no-op" or "unconfined" profile.
+
+Sometimes, applications that link to seccomp can use the default profile for a
+container runtime, and restrict further on top of that. It is important to
+note here that in this case, applications can only place _further_
+restrictions on themselves. It is not possible to re-grant the ability of a
+process to make a system call once it has been removed with seccomp.
+
+As an example, elasticsearch manages its own seccomp filters in its code.
+Currently, elasticsearch is capable of running in the context of the default
+Docker profile, but if in the future, elasticsearch needed to be able to call
+`ioperm` or `iopl` (both of which are disallowed in the default profile), it
+should be possible to run elasticsearch by delegating the seccomp controls to
+the pod.
+
+### Use Case: Custom profiles
+
+Different applications have different requirements for seccomp profiles; it
+should be possible to specify an arbitrary seccomp profile and use it in a
+container. This is more of a concern for applications which need a higher
+level of privilege than what is granted by the default profile for a cluster,
+since applications that want to restrict privileges further can always make
+additional calls in their own code.
+
+An example of an application that requires the use of a syscall disallowed in
+the Docker default profile is Chrome, which needs `clone` to create a new user
+namespace. Another example would be a program which uses `ptrace` to
+implement a sandbox for user-provided code, such as
+[eval.in](https://eval.in/).
+
+## Community Work
+
+### Container runtime support for seccomp
+
+#### Docker / opencontainers
+
+Docker supports the open container initiative's API for
+seccomp, which is very close to the libseccomp API. It allows full
+specification of seccomp filters, with arguments, operators, and actions.
+
+Docker allows the specification of a single seccomp filter. There are
+community requests for:
+
+Issues:
+
+* [docker/22109](https://github.com/docker/docker/issues/22109): composable
+ seccomp filters
+* [docker/22105](https://github.com/docker/docker/issues/22105): custom
+ seccomp filters for builds
+
+#### rkt / appcontainers
+
+The `rkt` runtime delegates to systemd for seccomp support; there is an open
+issue to add support once `appc` supports it. The `appc` project has an open
+issue to be able to describe seccomp as an isolator in an appc pod.
+
+The systemd seccomp facility is based on a whitelist of system calls that can
+be made, rather than a full filter specification.
+
+Issues:
+
+* [appc/529](https://github.com/appc/spec/issues/529)
+* [rkt/1614](https://github.com/coreos/rkt/issues/1614)
+
+#### HyperContainer
+
+[HyperContainer](https://hypercontainer.io) does not support seccomp.
+
+### Other platforms and seccomp-like capabilities
+
+FreeBSD has a seccomp/capability-like facility called
+[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
+
+#### lxd
+
+[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile.
+
+Issues:
+
+* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp
+
+## Proposed Design
+
+### Seccomp API Resource?
+
+An earlier draft of this proposal described a new global API resource that
+could be used to describe seccomp profiles. After some discussion, it was
+determined that without a feedback signal from users indicating a need to
+describe new profiles in the Kubernetes API, it is not possible to know
+whether a new API resource is warranted.
+
+That being the case, we will not propose a new API resource at this time. If
+there is strong community desire for such a resource, we may consider it in
+the future.
+
+Instead of implementing a new API resource, we propose that pods be able to
+reference seccomp profiles by name. Since this is an alpha feature, we will
+use annotations instead of extending the API with new fields.
+
+### API changes?
+
+In the alpha version of this feature we will use annotations to store the
+names of seccomp profiles. The keys will be:
+
+`container.seccomp.security.alpha.kubernetes.io/<container name>`
+
+which will be used to set the seccomp profile of a container, and:
+
+`seccomp.security.alpha.kubernetes.io/pod`
+
+which will set the seccomp profile for the containers of an entire pod. If a
+pod-level annotation is present, and a container-level annotation present for
+a container, then the container-level profile takes precedence.
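+
+A small sketch of this precedence rule, using the annotation keys above (the
+helper function is illustrative, not actual kubelet code):
+
+```go
+package main
+
+import "fmt"
+
+const (
+	podSeccompAnnotationKey       = "seccomp.security.alpha.kubernetes.io/pod"
+	containerSeccompAnnotationFmt = "container.seccomp.security.alpha.kubernetes.io/%s"
+)
+
+// profileForContainer returns the effective seccomp profile for a container:
+// a container-level annotation wins over the pod-level one, and an empty
+// string means no profile was requested.
+func profileForContainer(annotations map[string]string, containerName string) string {
+	if p, ok := annotations[fmt.Sprintf(containerSeccompAnnotationFmt, containerName)]; ok {
+		return p
+	}
+	return annotations[podSeccompAnnotationKey]
+}
+
+func main() {
+	ann := map[string]string{
+		podSeccompAnnotationKey: "runtime/default",
+		"container.seccomp.security.alpha.kubernetes.io/explorer": "localhost/example-explorer-profile",
+	}
+	fmt.Println(profileForContainer(ann, "explorer")) // container-level annotation wins
+	fmt.Println(profileForContainer(ann, "sidecar"))  // falls back to the pod-level annotation
+}
+```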
+
+The value of these keys should be container-runtime agnostic. We will
+establish a format that expresses the conventions for distinguishing between
+an unconfined profile, the container runtime's default, and a custom profile.
+Since the format of a profile is likely to be runtime dependent, we will consider
+profile contents to be opaque to Kubernetes for now.
+
+The following profile values are defined:
+
+1. `runtime/default` - the default profile for the container runtime
+2. `unconfined` - unconfined profile, i.e., no seccomp sandboxing
+3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root
+
+Since seccomp profile schemes may vary between container runtimes, we will
+treat the contents of profiles as opaque for now and avoid attempting to find
+a common way to describe them. It is up to the container runtime to be
+sensitive to the annotations proposed here and to interpret instructions about
+local profiles.
+
+A new area on disk (which we will call the seccomp profile root) must be
+established to hold seccomp profiles. A field will be added to the Kubelet
+for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to
+allow admins to set it. If unset, it should default to the `seccomp`
+subdirectory of the kubelet root directory.
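+
+For illustration, resolving an annotation value against these three forms might
+look like the sketch below. The profile root path is an assumed default, and
+the returned strings merely describe the intended action:
+
+```go
+package main
+
+import (
+	"fmt"
+	"path/filepath"
+	"strings"
+)
+
+// resolveSeccompProfile maps an annotation value onto an action, per the three
+// forms above. profileRoot stands in for the kubelet's --seccomp-profile-root.
+func resolveSeccompProfile(value, profileRoot string) (string, error) {
+	switch {
+	case value == "unconfined":
+		return "no seccomp filtering", nil
+	case value == "runtime/default":
+		return "use the container runtime's default profile", nil
+	case strings.HasPrefix(value, "localhost/"):
+		name := strings.TrimPrefix(value, "localhost/")
+		return "load profile from " + filepath.Join(profileRoot, name), nil
+	default:
+		return "", fmt.Errorf("unknown seccomp profile value %q", value)
+	}
+}
+
+func main() {
+	root := "/var/lib/kubelet/seccomp" // assumed default: <kubelet root dir>/seccomp
+	for _, v := range []string{"unconfined", "runtime/default", "localhost/example-explorer-profile"} {
+		action, _ := resolveSeccompProfile(v, root)
+		fmt.Printf("%-40s -> %s\n", v, action)
+	}
+}
+```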
+
+### Pod Security Policy annotation
+
+The `PodSecurityPolicy` type should be annotated with the allowed seccomp
+profiles using the key
+`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
+key should be a comma delimited list.
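+
+A sketch of how a check against this comma-delimited list might look
+(illustrative only, not the actual admission logic):
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// profileAllowed reports whether the requested profile appears in the
+// comma-delimited allowedProfileNames annotation value.
+func profileAllowed(allowedProfileNames, requested string) bool {
+	for _, p := range strings.Split(allowedProfileNames, ",") {
+		if strings.TrimSpace(p) == requested {
+			return true
+		}
+	}
+	return false
+}
+
+func main() {
+	allowed := "runtime/default,localhost/example-explorer-profile"
+	fmt.Println(profileAllowed(allowed, "runtime/default")) // true
+	fmt.Println(profileAllowed(allowed, "unconfined"))      // false
+}
+```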
+
+## Examples
+
+### Unconfined profile
+
+Here's an example of a pod that uses the unconfined profile:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: trustworthy-pod
+ annotations:
+ seccomp.security.alpha.kubernetes.io/pod: unconfined
+spec:
+ containers:
+ - name: trustworthy-container
+ image: sotrustworthy:latest
+```
+
+### Custom profile
+
+Here's an example of a pod that uses a profile called
+`example-explorer-profile` via the container-level annotation:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: explorer
+ annotations:
+ container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
+spec:
+ containers:
+ - name: explorer
+ image: gcr.io/google_containers/explorer:1.0
+ args: ["-port=8080"]
+ ports:
+ - containerPort: 8080
+ protocol: TCP
+ volumeMounts:
+ - mountPath: "/mount/test-volume"
+ name: test-volume
+ volumes:
+ - name: test-volume
+ emptyDir: {}
+```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md b/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md
new file mode 100644
index 00000000..42def9bf
--- /dev/null
+++ b/contributors/design-proposals/secret-configmap-downwarapi-file-mode.md
@@ -0,0 +1,186 @@
+# Secrets, configmaps and downwardAPI file mode bits
+
+Author: Rodrigo Campos (@rata), Tim Hockin (@thockin)
+
+Date: July 2016
+
+Status: Design in progress
+
+# Goal
+
+Allow users to specify permission mode bits for a secret/configmap/downwardAPI
+file mounted as a volume. For example, if a secret has several keys, a user
+should be able to specify the permission mode bits for any file, and they may
+all have different modes.
+
+Note that by "permission" I refer only to the file mode here, and I may use
+the two terms interchangeably. This is not about file ownership, although let me
+know if you prefer to discuss that here too.
+
+
+# Motivation
+
+There is currently no way to set permissions on secret files mounted as volumes.
+This can be a problem for applications that require files to be accessible
+only by their owner (like fetchmail, ssh, the pgpass file in postgres[1], etc.);
+it's simply not possible to run them without changing the file mode. In-house
+applications may have this restriction too.
+
+It also doesn't seem unreasonable to want a secret, which is by definition
+sensitive information, to not be world-readable (or group-readable) as it is by
+default. Admittedly the secret is already inside a container that is (hopefully)
+running only one process, so the default might not be so bad; but people running
+more than one process in a container have asked for this too[2].
+
+For example, my use case is that we are migrating to kubernetes. The migration
+is in progress (and will take a while), and we have already migrated our deployment
+web interface to kubernetes. This interface connects to the servers via ssh, so
+it needs the ssh keys, and ssh will only work if the ssh key file mode is the
+one it expects.
+
+This was asked on the mailing list here[2] and here[3], too.
+
+[1]: https://www.postgresql.org/docs/9.1/static/libpq-pgpass.html
+[2]: https://groups.google.com/forum/#!topic/kubernetes-dev/eTnfMJSqmaM
+[3]: https://groups.google.com/forum/#!topic/google-containers/EcaOPq4M758
+
+# Alternatives considered
+
+Several alternatives have been considered:
+
+ * Add a mode to the API definition when using secrets: this is backward
+ compatible as described in docs/devel/api_changes.md (IIUC) and seems like the
+ way to go. Also, @thockin said on the ML that he would consider such an
+ approach. It might be worth considering whether we want to do the same for
+ configmaps or owners, but there is no need to do that now either.
+
+ * Change the default file mode for secrets: I think this is unacceptable, as
+ stated in the api_changes doc, and besides, it doesn't feel correct IMHO, even
+ though it is technically an option. The argument for it might be that world- and
+ group-readable is not a nice default for a secret: we already take care of not
+ writing it to disk, etc., yet the file is created world-readable anyway. Such a
+ default change was done recently: the default was 0444 in kubernetes <= 1.2
+ and is now 0644 in kubernetes >= 1.3 (and the file is not a regular file,
+ it's a symlink now). That change was done to minimize differences between
+ configmaps and secrets: https://github.com/kubernetes/kubernetes/pull/25285. But
+ doing it again, and changing to something more restrictive (it is now 0644 and it
+ would need to be 0400 to work with ssh and most apps), seems too risky; it would
+ be even more restrictive than in k8s 1.2. Especially if there is no way to revert
+ to the old permissions and some use case is broken by this. And if we are adding
+ a way to change it, as in the option above, there is no need to rush changing the
+ default. So I would discard this.
+
+ * Don't let people change this, at least for now, and suggest that those who
+ need it do it in a "postStart" command. This is acceptable if we don't want to
+ change kubernetes core for some reason, although there seem to be valid use
+ cases. But if the user wants to use "postStart" for something else too, then it
+ is more cumbersome to do both things (either have a script in the docker image
+ that deals with this, which is probably not a concern of this project and so not
+ nice, or specify several commands by using "sh").
+
+# Proposed implementation
+
+The proposed implementation goes with the first alternative: adding a `mode`
+to the API.
+
+There will be a `defaultMode`, of type `int32`, in `type SecretVolumeSource`, `type
+ConfigMapVolumeSource` and `type DownwardAPIVolumeSource`, and a `mode`, also of
+type `int32`, in `type KeyToPath` and `DownwardAPIVolumeFile`.
+
+The mode provided in any of these fields will be ANDed with 0777 to disallow
+setting the setuid, setgid and sticky bits: that use case is neither clearly
+needed nor really understood. Directories within the volume will be created as
+before and are not affected by this setting.
+
+In other words, the fields will look like this:
+
+```
+type SecretVolumeSource struct {
+ // Name of the secret in the pod's namespace to use.
+ SecretName string `json:"secretName,omitempty"`
+ // If unspecified, each key-value pair in the Data field of the referenced
+ // Secret will be projected into the volume as a file whose name is the
+ // key and content is the value. If specified, the listed keys will be
+ // projected into the specified paths, and unlisted keys will not be
+ // present. If a key is specified which is not present in the Secret,
+ // the volume setup will error. Paths must be relative and may not contain
+ // the '..' path or start with '..'.
+ Items []KeyToPath `json:"items,omitempty"`
+ // Mode bits to use on created files by default. The used mode bits will
+ // be the provided AND 0777.
+ // Directories within the path are not affected by this setting
+ DefaultMode int32 `json:"defaultMode,omitempty"`
+}
+
+type ConfigMapVolumeSource struct {
+ LocalObjectReference `json:",inline"`
+ // If unspecified, each key-value pair in the Data field of the referenced
+ // ConfigMap will be projected into the volume as a file whose name is the
+ // key and content is the value. If specified, the listed keys will be
+ // projected into the specified paths, and unlisted keys will not be
+ // present. If a key is specified which is not present in the ConfigMap,
+ // the volume setup will error. Paths must be relative and may not contain
+ // the '..' path or start with '..'.
+ Items []KeyToPath `json:"items,omitempty"`
+ // Mode bits to use on created files by default. The used mode bits will
+ // be the provided AND 0777.
+ // Directories within the path are not affected by this setting
+ DefaultMode int32 `json:"defaultMode,omitempty"`
+}
+
+type KeyToPath struct {
+ // The key to project.
+ Key string `json:"key"`
+
+ // The relative path of the file to map the key to.
+ // May not be an absolute path.
+ // May not contain the path element '..'.
+ // May not start with the string '..'.
+ Path string `json:"path"`
+ // Mode bits to use on this file. The used mode bits will be the
+ // provided AND 0777.
+ Mode int32 `json:"mode,omitempty"`
+}
+
+type DownwardAPIVolumeSource struct {
+ // Items is a list of DownwardAPIVolume file
+ Items []DownwardAPIVolumeFile `json:"items,omitempty"`
+ // Mode bits to use on created files by default. The used mode bits will
+ // be the provided AND 0777.
+ // Directories within the path are not affected by this setting
+ DefaultMode int32 `json:"defaultMode,omitempty"`
+}
+
+type DownwardAPIVolumeFile struct {
+ // Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..'
+ Path string `json:"path"`
+ // Required: Selects a field of the pod: only annotations, labels, name and namespace are supported.
+ FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+ // Selects a resource of the container: only resources limits and requests
+ // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
+ ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
+ // Mode bits to use on this file. The used mode bits will be the
+ // provided AND 0777.
+ Mode int32 `json:"mode,omitempty"`
+}
+```
+
+Adding the fields there allows the user to change the mode bits of every file in the
+object, which achieves the goal, while still having the option to set a default and
+not list every file in the object.
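+
+For illustration, the interaction between `mode`, `defaultMode` and the 0777 mask
+could look like the sketch below (it treats a zero per-file mode as "unset" for
+simplicity; the real API would distinguish unset from zero via optional fields):
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+)
+
+// effectiveMode picks the per-file Mode if set, otherwise the volume's
+// DefaultMode, and masks the result with 0777 so setuid/setgid/sticky
+// bits can never be set. Illustrative helper, not the real volume plugin.
+func effectiveMode(fileMode, defaultMode int32) os.FileMode {
+	m := defaultMode
+	if fileMode != 0 {
+		m = fileMode
+	}
+	return os.FileMode(m) & 0777
+}
+
+func main() {
+	fmt.Printf("%#o\n", effectiveMode(0, 0644))     // 0644: falls back to defaultMode
+	fmt.Printf("%#o\n", effectiveMode(0400, 0644))  // 0400: per-file mode wins
+	fmt.Printf("%#o\n", effectiveMode(04755, 0644)) // 0755: setuid bit stripped by the 0777 mask
+}
+```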
+
+There are two downsides:
+
+ * The files are symlinks pointing to the real files, and the mode is only set
+ on the real files; the symlinks keep the classic symlink permissions.
+ This is already the case in 1.3, and applications like ssh seem to work just
+ fine with it. Worth mentioning, but it doesn't seem to be an issue.
+ * If the secret/configMap/downwardAPI is mounted in more than one container,
+ the file permissions will be the same in all of them. This is already the case
+ for key mappings and doesn't seem like a big issue either.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/secret-configmap-downwarapi-file-mode.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/secrets.md b/contributors/design-proposals/secrets.md
new file mode 100644
index 00000000..29d18411
--- /dev/null
+++ b/contributors/design-proposals/secrets.md
@@ -0,0 +1,628 @@
+## Abstract
+
+A proposal for the distribution of [secrets](../user-guide/secrets.md)
+(passwords, keys, etc) to the Kubelet and to containers inside Kubernetes using
+a custom [volume](../user-guide/volumes.md#secrets) type. See the
+[secrets example](../user-guide/secrets/) for more information.
+
+## Motivation
+
+Secrets are needed in containers to access internal resources like the
+Kubernetes master or external resources such as git repositories, databases,
+etc. Users may also want behaviors in the kubelet that depend on secret data
+(credentials for image pull from a docker registry) associated with pods.
+
+Goals of this design:
+
+1. Describe a secret resource
+2. Define the various challenges attendant to managing secrets on the node
+3. Define a mechanism for consuming secrets in containers without modification
+
+## Constraints and Assumptions
+
+* This design does not prescribe a method for storing secrets; storage of
+secrets should be pluggable to accommodate different use-cases
+* Encryption of secret data and node security are orthogonal concerns
+* It is assumed that node and master are secure and that compromising their
+security could also compromise secrets:
+ * If a node is compromised, the only secrets that could potentially be
+exposed should be the secrets belonging to containers scheduled onto it
+ * If the master is compromised, all secrets in the cluster may be exposed
+* Secret rotation is an orthogonal concern, but it should be facilitated by
+this proposal
+* A user who can consume a secret in a container can know the value of the
+secret; secrets must be provisioned judiciously
+
+## Use Cases
+
+1. As a user, I want to store secret artifacts for my applications and consume
+them securely in containers, so that I can keep the configuration for my
+applications separate from the images that use them:
+ 1. As a cluster operator, I want to allow a pod to access the Kubernetes
+master using a custom `.kubeconfig` file, so that I can securely reach the
+master
+ 2. As a cluster operator, I want to allow a pod to access a Docker registry
+using credentials from a `.dockercfg` file, so that containers can push images
+ 3. As a cluster operator, I want to allow a pod to access a git repository
+using SSH keys, so that I can push to and fetch from the repository
+2. As a user, I want to allow containers to consume supplemental information
+about services such as username and password which should be kept secret, so
+that I can share secrets about a service amongst the containers in my
+application securely
+3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a
+secret and have the kubelet implement some reserved behaviors based on the types
+of secrets the service account consumes:
+ 1. Use credentials for a docker registry to pull the pod's docker image
+ 2. Present Kubernetes auth token to the pod or transparently decorate
+traffic between the pod and master service
+4. As a user, I want to be able to indicate that a secret expires and for that
+secret's value to be rotated once it expires, so that the system can help me
+follow good practices
+
+### Use-Case: Configuration artifacts
+
+Many configuration files contain secrets intermixed with other configuration
+information. For example, a user's application may contain a properties file
+that contains database credentials, SaaS API tokens, etc. Users should be able
+to consume configuration artifacts in their containers and be able to control
+the path on the container's filesystem where the artifact will be presented.
+
+### Use-Case: Metadata about services
+
+Most pieces of information about how to use a service are secrets. For example,
+a service that provides a MySQL database needs to provide the username,
+password, and database name to consumers so that they can authenticate and use
+the correct database. Containers in pods consuming the MySQL service would also
+consume the secrets associated with the MySQL service.
+
+### Use-Case: Secrets associated with service accounts
+
+[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple
+capabilities and security contexts from individual human users. A
+`ServiceAccount` contains references to some number of secrets. A `Pod` can
+specify that it is associated with a `ServiceAccount`. Secrets should have a
+`Type` field to allow the Kubelet and other system components to take action
+based on the secret's type.
+
+#### Example: service account consumes auth token secret
+
+As an example, the service account proposal discusses service accounts consuming
+secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod
+associated with a service account which consumes this type of secret, the
+Kubelet may take a number of actions:
+
+1. Expose the secret in a `.kubernetes_auth` file in a well-known location in
+the container's file system
+2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod
+to the `kubernetes-master` service with the auth token, e.g. by adding a header
+to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)
+
+#### Example: service account consumes docker registry credentials
+
+Another example use case is where a pod is associated with a secret containing
+docker registry credentials. The Kubelet could use these credentials for the
+docker pull to retrieve the image.
+
+### Use-Case: Secret expiry and rotation
+
+Rotation is considered a good practice for many types of secret data. It should
+be possible to express that a secret has an expiry date; this would make it
+possible to implement a system component that could regenerate expired secrets.
+As an example, consider a component that rotates expired secrets. The rotator
+could periodically regenerate the values for expired secrets of common types and
+update their expiry dates.
+
+## Deferral: Consuming secrets as environment variables
+
+Some images will expect to receive configuration items as environment variables
+instead of files. We should consider what the best way to allow this is; there
+are a few different options:
+
+1. Force the user to adapt files into environment variables. Users can store
+secrets that need to be presented as environment variables in a format that is
+easy to consume from a shell:
+
+ $ cat /etc/secrets/my-secret.txt
+ export MY_SECRET_ENV=MY_SECRET_VALUE
+
+ The user could `source` the file at `/etc/secrets/my-secret` prior to
+executing the command for the image either inline in the command or in an init
+script.
+
+2. Give secrets an attribute that allows users to express the intent that the
+platform should generate the above syntax in the file used to present a secret.
+The user could consume these files in the same manner as the above option.
+
+3. Give secrets attributes that allow the user to express that the secret
+should be presented to the container as an environment variable. The container's
+environment would contain the desired values and the software in the container
+could use them without any accommodation in the command or setup script.
+
+For our initial work, we will treat all secrets as files to narrow the problem
+space. There will be a future proposal that handles exposing secrets as
+environment variables.
+
+## Flow analysis of secret data with respect to the API server
+
+There are two fundamentally different use-cases for access to secrets:
+
+1. CRUD operations on secrets by their owners
+2. Read-only access to the secrets needed for a particular node by the kubelet
+
+### Use-Case: CRUD operations by owners
+
+In use cases for CRUD operations, the user experience for secrets should be no
+different than for other API resources.
+
+#### Data store backing the REST API
+
+The data store backing the REST API should be pluggable because different
+cluster operators will have different preferences for the central store of
+secret data. Some possibilities for storage:
+
+1. An etcd collection alongside the storage for other API resources
+2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
+3. A secrets server like [Vault](https://www.vaultproject.io/) or
+[Keywhiz](https://square.github.io/keywhiz/)
+4. An external datastore such as an external etcd, RDBMS, etc.
+
+#### Size limit for secrets
+
+There should be a size limit for secrets in order to:
+
+1. Prevent DOS attacks against the API server
+2. Allow kubelet implementations that prevent secret data from touching the
+node's filesystem
+
+The size limit should satisfy the following conditions:
+
+1. Large enough to store common artifact types (encryption keypairs,
+certificates, small configuration files)
+2. Small enough to avoid large impact on node resource consumption (storage,
+RAM for tmpfs, etc)
+
+To begin discussion, we propose an initial value for this size limit of **1MB**.
+
+#### Other limitations on secrets
+
+Defining a policy for limitations on how a secret may be referenced by another
+API resource and how constraints should be applied throughout the cluster is
+tricky due to the number of variables involved:
+
+1. Should there be a maximum number of secrets a pod can reference via a
+volume?
+2. Should there be a maximum number of secrets a service account can reference?
+3. Should there be a total maximum number of secrets a pod can reference via
+its own spec and its associated service account?
+4. Should there be a total size limit on the amount of secret data consumed by
+a pod?
+5. How will cluster operators want to be able to configure these limits?
+6. How will these limits impact API server validations?
+7. How will these limits affect scheduling?
+
+For now, we will not implement validations around these limits. Cluster
+operators will decide how much node storage is allocated to secrets. It will be
+the operator's responsibility to ensure that the allocated storage is sufficient
+for the workload scheduled onto a node.
+
+For now, kubelets will only attach secrets to api-sourced pods, and not file-
+or http-sourced ones. Attaching secrets to the latter would:
+ - confuse the secrets admission controller in the case of mirror pods.
+ - create an apiserver-liveness dependency -- avoiding this dependency is a
+main reason to use non-api-sourced pods.
+
+### Use-Case: Kubelet read of secrets for node
+
+The use-case where the kubelet reads secrets has several additional requirements:
+
+1. Kubelets should only be able to receive secret data which is required by
+pods scheduled onto the kubelet's node
+2. Kubelets should have read-only access to secret data
+3. Secret data should not be transmitted over the wire insecurely
+4. Kubelets must ensure pods do not have access to each other's secrets
+
+#### Read of secret data by the Kubelet
+
+The Kubelet should only be allowed to read secrets which are consumed by pods
+scheduled onto that Kubelet's node and their associated service accounts.
+Authorization of the Kubelet to read this data would be delegated to an
+authorization plugin and associated policy rule.
+
+#### Secret data on the node: data at rest
+
+Consideration must be given to whether secret data should be allowed to be at
+rest on the node:
+
+1. If secret data is not allowed to be at rest, the size of secret data becomes
+another draw on the node's RAM - should it affect scheduling?
+2. If secret data is allowed to be at rest, should it be encrypted?
+ 1. If so, how should this be done?
+ 2. If not, what threats exist? What types of secret are appropriate to
+store this way?
+
+For the sake of limiting complexity, we propose that initially secret data
+should not be allowed to be at rest on a node; secret data should be stored on a
+node-level tmpfs filesystem. This filesystem can be subdivided into directories
+for use by the kubelet and by the volume plugin.
+
+#### Secret data on the node: resource consumption
+
+The Kubelet will be responsible for creating the per-node tmpfs file system for
+secret storage. It is hard to make a prescriptive declaration about how much
+storage is appropriate to reserve for secrets because different installations
+will vary widely in available resources, desired pod to node density, overcommit
+policy, and other operational dimensions. That being the case, we propose for
+simplicity that the amount of secret storage be controlled by a new parameter to
+the kubelet with a default value of **64MB**. It is the cluster operator's
+responsibility to handle choosing the right storage size for their installation
+and configuring their Kubelets correctly.
+
+Configuring each Kubelet is not the ideal story for operator experience; it is
+more intuitive that the cluster-wide storage size be readable from a central
+configuration store like the one proposed in [#1553](http://issue.k8s.io/1553).
+When such a store exists, the Kubelet could be modified to read this
+configuration item from the store.
+
+When the Kubelet is modified to advertise node resources (as proposed in
+[#4441](http://issue.k8s.io/4441)), the capacity calculation
+for available memory should factor in the potential size of the node-level tmpfs
+in order to avoid memory overcommit on the node.
+
+#### Secret data on the node: isolation
+
+Every pod will have a [security context](security_context.md).
+Secret data on the node should be isolated according to the security context of
+the container. The Kubelet volume plugin API will be changed so that a volume
+plugin receives the security context of a volume along with the volume spec.
+This will allow volume plugins to implement setting the security context of
+volumes they manage.
+
+## Community work
+
+Several proposals / upstream patches are notable as background for this
+proposal:
+
+1. [Docker vault proposal](https://github.com/docker/docker/issues/10310)
+2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277)
+3. [Kubernetes service account proposal](service_accounts.md)
+4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075)
+5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697)
+
+## Proposed Design
+
+We propose a new `Secret` resource which is mounted into containers with a new
+volume type. Secret volumes will be handled by a volume plugin that does the
+actual work of fetching the secret and storing it. Secrets contain multiple
+pieces of data that are presented as different files within the secret volume
+(example: SSH key pair).
+
+In order to remove the burden from the end user in specifying every file that a
+secret consists of, it should be possible to mount all files provided by a
+secret with a single `VolumeMount` entry in the container specification.
+
+### Secret API Resource
+
+A new resource for secrets will be added to the API:
+
+```go
+type Secret struct {
+ TypeMeta
+ ObjectMeta
+
+ // Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN.
+ // The serialized form of the secret data is a base64 encoded string,
+ // representing the arbitrary (possibly non-string) data value here.
+ Data map[string][]byte `json:"data,omitempty"`
+
+ // Used to facilitate programmatic handling of secret data.
+ Type SecretType `json:"type,omitempty"`
+}
+
+type SecretType string
+
+const (
+ SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default)
+ SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token
+ SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth
+ SecretTypeDockerConfigJson SecretType = "kubernetes.io/dockerconfigjson" // Latest Docker registry auth
+ // FUTURE: other type values
+)
+
+const MaxSecretSize = 1 * 1024 * 1024
+```
+
+A Secret can declare a type in order to provide type information to system
+components that work with secrets. The default type is `Opaque`, which
+represents arbitrary user-owned data.
+
+Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must
+be valid DNS subdomains.
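+
+A rough sketch of these two validations is shown below; the regular expression
+only approximates the DNS_SUBDOMAIN rule and is not the canonical validation code:
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+)
+
+const maxSecretSize = 1 * 1024 * 1024 // 1MB, as proposed above
+
+// Approximation of the DNS_SUBDOMAIN rule: dot-separated labels of lowercase
+// alphanumerics and '-', each label starting and ending with an alphanumeric.
+var dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
+
+// validateSecretData checks the total size of the data and that each key is
+// a valid DNS subdomain. Illustrative only.
+func validateSecretData(data map[string][]byte) error {
+	total := 0
+	for key, value := range data {
+		if len(key) > 253 || !dnsSubdomain.MatchString(key) {
+			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
+		}
+		total += len(key) + len(value)
+	}
+	if total > maxSecretSize {
+		return fmt.Errorf("secret data exceeds %d bytes", maxSecretSize)
+	}
+	return nil
+}
+
+func main() {
+	fmt.Println(validateSecretData(map[string][]byte{"id-rsa.pub": []byte("ssh-rsa AAAA...")})) // <nil>
+	fmt.Println(validateSecretData(map[string][]byte{"Bad_Key": []byte("x")}))                  // error
+}
+```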
+
+A new REST API and registry interface will be added to accompany the `Secret`
+resource. The default implementation of the registry will store `Secret`
+information in etcd. Future registry implementations could store the `TypeMeta`
+and `ObjectMeta` fields in etcd and store the secret data in another data store
+entirely, or store the whole object in another data store.
+
+#### Other validations related to secrets
+
+Initially there will be no validations for the number of secrets a pod
+references, or the number of secrets that can be associated with a service
+account. These may be added in the future as the finer points of secrets and
+resource allocation are fleshed out.
+
+### Secret Volume Source
+
+A new `SecretSource` type of volume source will be added to the `VolumeSource`
+struct in the API:
+
+```go
+type VolumeSource struct {
+ // Other fields omitted
+
+ // SecretSource represents a secret that should be presented in a volume
+ SecretSource *SecretSource `json:"secret"`
+}
+
+type SecretSource struct {
+ Target ObjectReference
+}
+```
+
+Secret volume sources are validated to ensure that the specified object
+reference actually points to an object of type `Secret`.
+
+In the future, the `SecretSource` will be extended to allow:
+
+1. Fine-grained control over which pieces of secret data are exposed in the
+volume
+2. The paths and filenames for how secret data are exposed
+
+### Secret Volume Plugin
+
+A new Kubelet volume plugin will be added to handle volumes with a secret
+source. This plugin will require access to the API server to retrieve secret
+data and therefore the volume `Host` interface will have to change to expose a
+client interface:
+
+```go
+type Host interface {
+ // Other methods omitted
+
+ // GetKubeClient returns a client interface
+ GetKubeClient() client.Interface
+}
+```
+
+The secret volume plugin will be responsible for:
+
+1. Returning a `volume.Mounter` implementation from `NewMounter` that:
+ 1. Retrieves the secret data for the volume from the API server
+ 2. Places the secret data onto the container's filesystem
+ 3. Sets the correct security attributes for the volume based on the pod's
+`SecurityContext`
+2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that
+cleans the volume from the container's filesystem
+
+### Kubelet: Node-level secret storage
+
+The Kubelet must be modified to accept a new parameter for the secret storage
+size and to create a tmpfs file system of that size to store secret data. Rough
+accounting of specific changes:
+
+1. The Kubelet should have a new field added called `secretStorageSize`; units
+are megabytes
+2. `NewMainKubelet` should accept a value for secret storage size
+3. The Kubelet server should have a new flag added for secret storage size
+4. The Kubelet's `setupDataDirs` method should be changed to create the secret
+storage
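+
+For illustration only, creating such a node-level tmpfs of the configured size
+might look roughly like this Linux-only sketch; the directory path and size are
+assumed values mirroring the defaults above, and this is not the actual Kubelet
+code:
+
+```go
+package main
+
+import (
+	"fmt"
+	"log"
+	"os"
+	"syscall" // Linux-only: syscall.Mount is not available on other platforms
+)
+
+func main() {
+	// Hypothetical values standing in for the kubelet's root dir and the
+	// proposed secretStorageSize parameter (default 64MB).
+	secretDir := "/var/lib/kubelet/secrets"
+	sizeMB := 64
+
+	if err := os.MkdirAll(secretDir, 0700); err != nil {
+		log.Fatal(err)
+	}
+	// Mount a tmpfs of the requested size so secret data never reaches disk.
+	opts := fmt.Sprintf("size=%dm", sizeMB)
+	if err := syscall.Mount("tmpfs", secretDir, "tmpfs", 0, opts); err != nil {
+		log.Fatal(err)
+	}
+	log.Printf("mounted %dMB tmpfs at %s", sizeMB, secretDir)
+}
+```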
+
+### Kubelet: New behaviors for secrets associated with service accounts
+
+For use-cases where the Kubelet's behavior is affected by the secrets associated
+with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example,
+if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the
+Kubelet will need to be changed to accommodate this. Subsequent proposals can
+address this on a type-by-type basis.
+
+## Examples
+
+For clarity, let's examine some detailed examples of some common use-cases in
+terms of the suggested changes. All of these examples are assumed to be created
+in a namespace called `example`.
+
+### Use-Case: Pod with ssh keys
+
+To create a pod that uses an ssh key stored as a secret, we first need to create
+a secret:
+
+```json
+{
+ "kind": "Secret",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "ssh-key-secret"
+ },
+ "data": {
+ "id-rsa": "dmFsdWUtMg0KDQo=",
+ "id-rsa.pub": "dmFsdWUtMQ0K"
+ }
+}
+```
+
+**Note:** The serialized JSON and YAML values of secret data are encoded as
+base64 strings. Newlines are not valid within these strings and must be
+omitted.
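+
+For example, the placeholder value `dmFsdWUtMQ0K` used in the example above is
+just the base64 encoding of the bytes `value-1\r\n`; a small sketch of producing
+such values:
+
+```go
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+)
+
+func main() {
+	// Raw secret bytes; base64.StdEncoding produces no newlines, which keeps
+	// the encoded value valid inside a JSON or YAML string.
+	raw := []byte("value-1\r\n")
+	fmt.Println(base64.StdEncoding.EncodeToString(raw)) // prints dmFsdWUtMQ0K
+}
+```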
+
+Now we can create a pod which references the secret with the ssh key and
+consumes it in a volume:
+
+```json
+{
+ "kind": "Pod",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "secret-test-pod",
+ "labels": {
+ "name": "secret-test"
+ }
+ },
+ "spec": {
+ "volumes": [
+ {
+ "name": "secret-volume",
+ "secret": {
+ "secretName": "ssh-key-secret"
+ }
+ }
+ ],
+ "containers": [
+ {
+ "name": "ssh-test-container",
+ "image": "mySshImage",
+ "volumeMounts": [
+ {
+ "name": "secret-volume",
+ "readOnly": true,
+ "mountPath": "/etc/secret-volume"
+ }
+ ]
+ }
+ ]
+ }
+}
+```
+
+When the container's command runs, the pieces of the key will be available in:
+
+ /etc/secret-volume/id-rsa.pub
+ /etc/secret-volume/id-rsa
+
+The container is then free to use the secret data to establish an ssh
+connection.
+
+### Use-Case: Pods with prod / test credentials
+
+This example illustrates a pod which consumes a secret containing prod
+credentials and another pod which consumes a secret with test environment
+credentials.
+
+The secrets:
+
+```json
+{
+ "apiVersion": "v1",
+ "kind": "List",
+ "items":
+ [{
+ "kind": "Secret",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "prod-db-secret"
+ },
+ "data": {
+ "password": "dmFsdWUtMg0KDQo=",
+ "username": "dmFsdWUtMQ0K"
+ }
+ },
+ {
+ "kind": "Secret",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "test-db-secret"
+ },
+ "data": {
+ "password": "dmFsdWUtMg0KDQo=",
+ "username": "dmFsdWUtMQ0K"
+ }
+ }]
+}
+```
+
+The pods:
+
+```json
+{
+ "apiVersion": "v1",
+ "kind": "List",
+ "items":
+ [{
+ "kind": "Pod",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "prod-db-client-pod",
+ "labels": {
+ "name": "prod-db-client"
+ }
+ },
+ "spec": {
+ "volumes": [
+ {
+ "name": "secret-volume",
+ "secret": {
+ "secretName": "prod-db-secret"
+ }
+ }
+ ],
+ "containers": [
+ {
+ "name": "db-client-container",
+ "image": "myClientImage",
+ "volumeMounts": [
+ {
+ "name": "secret-volume",
+ "readOnly": true,
+ "mountPath": "/etc/secret-volume"
+ }
+ ]
+ }
+ ]
+ }
+ },
+ {
+ "kind": "Pod",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "test-db-client-pod",
+ "labels": {
+ "name": "test-db-client"
+ }
+ },
+ "spec": {
+ "volumes": [
+ {
+ "name": "secret-volume",
+ "secret": {
+ "secretName": "test-db-secret"
+ }
+ }
+ ],
+ "containers": [
+ {
+ "name": "db-client-container",
+ "image": "myClientImage",
+ "volumeMounts": [
+ {
+ "name": "secret-volume",
+ "readOnly": true,
+ "mountPath": "/etc/secret-volume"
+ }
+ ]
+ }
+ ]
+ }
+ }]
+}
+```
+
+The specs for the two pods differ only in the value of the object referred to by
+the secret volume source. Both containers will have the following files present
+on their filesystems:
+
+ /etc/secret-volume/username
+ /etc/secret-volume/password
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/security-context-constraints.md b/contributors/design-proposals/security-context-constraints.md
new file mode 100644
index 00000000..ae966e21
--- /dev/null
+++ b/contributors/design-proposals/security-context-constraints.md
@@ -0,0 +1,348 @@
+## Abstract
+
+PodSecurityPolicy allows cluster administrators to control the creation and validation of a security
+context for a pod and containers.
+
+## Motivation
+
+Administration of a multi-tenant cluster requires the ability to provide varying sets of permissions
+among the tenants, the infrastructure components, and end users of the system who may themselves be
+administrators within their own isolated namespace.
+
+Actors in a cluster may include infrastructure that is managed by administrators, infrastructure
+that is exposed to end users (builds, deployments), the isolated end user namespaces in the cluster, and
+the individual users inside those namespaces. Infrastructure components that operate on behalf of a
+user (builds, deployments) should be allowed to run at an elevated level of permissions without
+granting the user themselves an elevated set of permissions.
+
+## Goals
+
+1. Associate [service accounts](../design/service_accounts.md), groups, and users with
+a set of constraints that dictate how a security context is established for a pod and the pod's containers.
+1. Provide the ability for users and infrastructure components to run pods with elevated privileges
+on behalf of another user or within a namespace where privileges are more restrictive.
+1. Secure the ability to reference elevated permissions or to change the constraints under which
+a user runs.
+
+## Use Cases
+
+Use case 1:
+As an administrator, I can create a namespace for a person that can't create privileged containers
+AND enforce that the UID of the containers is set to a certain value.
+
+Use case 2:
+As a cluster operator, an infrastructure component should be able to create a pod with elevated
+privileges in a namespace where regular users cannot create pods with these privileges or execute
+commands in that pod.
+
+Use case 3:
+As a cluster administrator, I can allow a given namespace (or service account) to create privileged
+pods or to run root pods.
+
+Use case 4:
+As a cluster administrator, I can allow a project administrator to control the security contexts of
+pods and service accounts within a project.
+
+
+## Requirements
+
+1. Provide a set of restrictions that controls how a security context is created for pods and containers
+as a new cluster-scoped object called `PodSecurityPolicy`.
+1. User information in `user.Info` must be available to admission controllers. (Completed in
+https://github.com/GoogleCloudPlatform/kubernetes/pull/8203)
+1. Some authorizers may restrict a user’s ability to reference a service account. Systems requiring
+the ability to secure service accounts on a user level must be able to add a policy that enables
+referencing specific service accounts themselves.
+1. Admission control must validate the creation of Pods against the allowed set of constraints.
+
+## Design
+
+### Model
+
+PodSecurityPolicy objects exist in the root scope, outside of a namespace. The
+PodSecurityPolicy will reference users and groups that are allowed
+to operate under the constraints. In order to support this, `ServiceAccounts` must be mapped
+to a user name or group list by the authentication/authorization layers. This allows the security
+context to treat users, groups, and service accounts uniformly.
+
+Below is a list of PodSecurityPolicies which will likely serve most use cases:
+
+1. A default policy object. This object is granted to a group that covers all actors, such
+as the `system:authenticated` group, and will likely be the most restrictive set of constraints.
+1. A default constraints object for service accounts. This object can be identified as serving
+the `system:service-accounts` group, which can be imposed by the service account authenticator / token generator.
+1. Cluster admin constraints, identified by the `system:cluster-admins` group - a set of constraints with elevated privileges that can be used
+by an administrative user or group.
+1. Infrastructure component constraints, which can be identified either by a specific service
+account or by a group containing all service accounts.
+
+```go
+// PodSecurityPolicy governs the ability to make requests that affect the SecurityContext
+// that will be applied to a pod and container.
+type PodSecurityPolicy struct {
+ unversioned.TypeMeta `json:",inline"`
+ api.ObjectMeta `json:"metadata,omitempty"`
+
+ // Spec defines the policy enforced.
+ Spec PodSecurityPolicySpec `json:"spec,omitempty"`
+}
+
+// PodSecurityPolicySpec defines the policy enforced.
+type PodSecurityPolicySpec struct {
+ // Privileged determines if a pod can request to be run as privileged.
+ Privileged bool `json:"privileged,omitempty"`
+ // Capabilities is a list of capabilities that can be added.
+ Capabilities []api.Capability `json:"capabilities,omitempty"`
+ // Volumes allows and disallows the use of different types of volume plugins.
+ Volumes VolumeSecurityPolicy `json:"volumes,omitempty"`
+ // HostNetwork determines if the policy allows the use of HostNetwork in the pod spec.
+ HostNetwork bool `json:"hostNetwork,omitempty"`
+ // HostPorts determines which host port ranges are allowed to be exposed.
+ HostPorts []HostPortRange `json:"hostPorts,omitempty"`
+ // HostPID determines if the policy allows the use of HostPID in the pod spec.
+ HostPID bool `json:"hostPID,omitempty"`
+ // HostIPC determines if the policy allows the use of HostIPC in the pod spec.
+ HostIPC bool `json:"hostIPC,omitempty"`
+ // SELinuxContext is the strategy that will dictate the allowable labels that may be set.
+ SELinuxContext SELinuxContextStrategyOptions `json:"seLinuxContext,omitempty"`
+ // RunAsUser is the strategy that will dictate the allowable RunAsUser values that may be set.
+ RunAsUser RunAsUserStrategyOptions `json:"runAsUser,omitempty"`
+
+ // The users who have permissions to use this policy
+ Users []string `json:"users,omitempty"`
+ // The groups that have permission to use this policy
+ Groups []string `json:"groups,omitempty"`
+}
+
+// HostPortRange defines a range of host ports that will be enabled by a policy
+// for pods to use. It requires both the start and end to be defined.
+type HostPortRange struct {
+ // Start is the beginning of the port range which will be allowed.
+ Start int `json:"start"`
+ // End is the end of the port range which will be allowed.
+ End int `json:"end"`
+}
+
+// VolumeSecurityPolicy allows and disallows the use of different types of volume plugins.
+type VolumeSecurityPolicy struct {
+ // HostPath allows or disallows the use of the HostPath volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#hostpath
+ HostPath bool `json:"hostPath,omitempty"`
+ // EmptyDir allows or disallows the use of the EmptyDir volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#emptydir
+ EmptyDir bool `json:"emptyDir,omitempty"`
+ // GCEPersistentDisk allows or disallows the use of the GCEPersistentDisk volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#gcepersistentdisk
+ GCEPersistentDisk bool `json:"gcePersistentDisk,omitempty"`
+ // AWSElasticBlockStore allows or disallows the use of the AWSElasticBlockStore volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#awselasticblockstore
+ AWSElasticBlockStore bool `json:"awsElasticBlockStore,omitempty"`
+ // GitRepo allows or disallows the use of the GitRepo volume plugin.
+ GitRepo bool `json:"gitRepo,omitempty"`
+ // Secret allows or disallows the use of the Secret volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#secrets
+ Secret bool `json:"secret,omitempty"`
+ // NFS allows or disallows the use of the NFS volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/volumes#nfs
+ NFS bool `json:"nfs,omitempty"`
+ // ISCSI allows or disallows the use of the ISCSI volume plugin.
+ // More info: http://releases.k8s.io/HEAD/examples/volumes/iscsi/README.md
+ ISCSI bool `json:"iscsi,omitempty"`
+ // Glusterfs allows or disallows the use of the Glusterfs volume plugin.
+ // More info: http://releases.k8s.io/HEAD/examples/volumes/glusterfs/README.md
+ Glusterfs bool `json:"glusterfs,omitempty"`
+ // PersistentVolumeClaim allows or disallows the use of the PersistentVolumeClaim volume plugin.
+ // More info: http://kubernetes.io/docs/user-guide/persistent-volumes#persistentvolumeclaims
+ PersistentVolumeClaim bool `json:"persistentVolumeClaim,omitempty"`
+ // RBD allows or disallows the use of the RBD volume plugin.
+ // More info: http://releases.k8s.io/HEAD/examples/volumes/rbd/README.md
+ RBD bool `json:"rbd,omitempty"`
+ // Cinder allows or disallows the use of the Cinder volume plugin.
+ // More info: http://releases.k8s.io/HEAD/examples/mysql-cinder-pd/README.md
+ Cinder bool `json:"cinder,omitempty"`
+ // CephFS allows or disallows the use of the CephFS volume plugin.
+ CephFS bool `json:"cephfs,omitempty"`
+ // DownwardAPI allows or disallows the use of the DownwardAPI volume plugin.
+ DownwardAPI bool `json:"downwardAPI,omitempty"`
+ // FC allows or disallows the use of the FC volume plugin.
+ FC bool `json:"fc,omitempty"`
+}
+
+// SELinuxContextStrategyOptions defines the strategy type and any options used to create the strategy.
+type SELinuxContextStrategyOptions struct {
+ // Type is the strategy that will dictate the allowable labels that may be set.
+ Type SELinuxContextStrategy `json:"type"`
+ // seLinuxOptions required to run as; required for MustRunAs
+ // More info: http://releases.k8s.io/HEAD/docs/design/security_context.md#security-context
+ SELinuxOptions *api.SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+
+// SELinuxContextStrategy denotes strategy types for generating SELinux options for a
+// SecurityContext.
+type SELinuxContextStrategy string
+
+const (
+ // container must have SELinux labels of X applied.
+ SELinuxStrategyMustRunAs SELinuxContextStrategy = "MustRunAs"
+ // container may make requests for any SELinux context labels.
+ SELinuxStrategyRunAsAny SELinuxContextStrategy = "RunAsAny"
+)
+
+// RunAsUserStrategyOptions defines the strategy type and any options used to create the strategy.
+type RunAsUserStrategyOptions struct {
+ // Type is the strategy that will dictate the allowable RunAsUser values that may be set.
+ Type RunAsUserStrategy `json:"type"`
+ // UID is the user id that containers must run as. Required for the MustRunAs strategy if not using
+ // a strategy that supports pre-allocated uids.
+ UID *int64 `json:"uid,omitempty"`
+ // UIDRangeMin defines the min value for a strategy that allocates by a range based strategy.
+ UIDRangeMin *int64 `json:"uidRangeMin,omitempty"`
+ // UIDRangeMax defines the max value for a strategy that allocates by a range based strategy.
+ UIDRangeMax *int64 `json:"uidRangeMax,omitempty"`
+}
+
+// RunAsUserStrategy denotes strategy types for generating RunAsUser values for a
+// SecurityContext.
+type RunAsUserStrategy string
+
+const (
+ // container must run as a particular uid.
+ RunAsUserStrategyMustRunAs RunAsUserStrategy = "MustRunAs"
+ // container must run as a uid within a particular range.
+ RunAsUserStrategyMustRunAsRange RunAsUserStrategy = "MustRunAsRange"
+ // container must run as a non-root uid
+ RunAsUserStrategyMustRunAsNonRoot RunAsUserStrategy = "MustRunAsNonRoot"
+ // container may make requests for any uid.
+ RunAsUserStrategyRunAsAny RunAsUserStrategy = "RunAsAny"
+)
+```
+
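+For illustration only, a restrictive default policy expressed with the types above
+might look like the following; the object name and group binding are example
+values, not prescribed defaults.
+
+```go
+// Example only: a restrictive default policy bound to all authenticated users.
+var defaultPolicy = PodSecurityPolicy{
+	ObjectMeta: api.ObjectMeta{Name: "default"},
+	Spec: PodSecurityPolicySpec{
+		Privileged:  false,
+		HostNetwork: false,
+		HostPID:     false,
+		HostIPC:     false,
+		// Only the safest volume plugins are allowed.
+		Volumes: VolumeSecurityPolicy{
+			EmptyDir:              true,
+			Secret:                true,
+			DownwardAPI:           true,
+			PersistentVolumeClaim: true,
+		},
+		// Any SELinux labels may be requested, but containers must not run as root.
+		SELinuxContext: SELinuxContextStrategyOptions{Type: SELinuxStrategyRunAsAny},
+		RunAsUser:      RunAsUserStrategyOptions{Type: RunAsUserStrategyMustRunAsNonRoot},
+		Groups:         []string{"system:authenticated"},
+	},
+}
+```
+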
+### PodSecurityPolicy Lifecycle
+
+As reusable objects in the root scope, PodSecurityPolicy follows the lifecycle of the
+cluster itself. Maintenance of constraints such as adding, assigning, or changing them is the
+responsibility of the cluster administrator.
+
+Creating a new user within a namespace should not require the cluster administrator to
+define the user's PodSecurityPolicy. They should receive the default set of policies
+that the administrator has defined for the groups they are assigned.
+
+
+## Default PodSecurityPolicy And Overrides
+
+In order to establish policy for service accounts and users, there must be a way
+to identify the default set of constraints that is to be used. This is best accomplished by using
+groups. As mentioned above, groups may be used by the authentication/authorization layer to ensure
+that every user maps to at least one group (with a default example of `system:authenticated`) and it
+is up to the cluster administrator to ensure that a `PodSecurityPolicy` object exists that
+references the group.
+
+If an administrator would like to provide a user with a changed set of security context permissions,
+they may do the following:
+
+1. Create a new `PodSecurityPolicy` object and add a reference to the user or a group
+that the user belongs to.
+1. Add the user (or group) to an existing `PodSecurityPolicy` object with the proper
+elevated privileges.
+
+## Admission
+
+Admission control using an authorizer provides the ability to control the creation of resources
+based on capabilities granted to a user. In terms of the `PodSecurityPolicy`, it means
+that an admission controller may inspect the user info made available in the context to retrieve
+an appropriate set of policies for validation.
+
+The appropriate set of PodSecurityPolicies is defined as all of the policies
+available that have reference to the user or groups that the user belongs to.
+
+Admission will use the PodSecurityPolicy to ensure that any requests for a
+specific security context setting are valid and to generate settings using the following approach:
+
+1. Determine all the available `PodSecurityPolicy` objects that are allowed to be used.
+1. Sort the `PodSecurityPolicy` objects from most restrictive to least restrictive.
+1. For each `PodSecurityPolicy`, generate a `SecurityContext` for each container. The generation phase will not override
+any user-requested settings in the `SecurityContext`, and will rely on the validation phase to ensure that
+the user requests are valid.
+1. Validate the generated `SecurityContext` to ensure it falls within the boundaries of the `PodSecurityPolicy`.
+1. If all containers validate under a single `PodSecurityPolicy`, then the pod will be admitted.
+1. If all containers DO NOT validate under the `PodSecurityPolicy`, then try the next `PodSecurityPolicy`.
+1. If no `PodSecurityPolicy` validates for the pod, then the pod will not be admitted.
+
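+The following sketch illustrates that flow; the helper functions (`policiesFor`,
+`sortByRestrictiveness`, `generateSecurityContext`, `validateAgainstPolicy`) are
+placeholders for illustration, not proposed APIs.
+
+```go
+// Admit is an illustrative sketch of the admission flow described above;
+// the helpers it calls are placeholders, and imports are omitted.
+func Admit(userInfo user.Info, pod *api.Pod) error {
+	policies := sortByRestrictiveness(policiesFor(userInfo))
+	for _, policy := range policies {
+		admissible := true
+		for i := range pod.Spec.Containers {
+			container := &pod.Spec.Containers[i]
+			// Generation fills in unset fields; it never overrides
+			// values the user explicitly requested.
+			sc := generateSecurityContext(policy, pod, container)
+			if errs := validateAgainstPolicy(policy, pod, container, sc); len(errs) > 0 {
+				admissible = false
+				break
+			}
+		}
+		if admissible {
+			// All containers validated under a single policy: admit the pod.
+			return nil
+		}
+	}
+	return fmt.Errorf("no PodSecurityPolicy allows pod %q", pod.Name)
+}
+```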
+
+## Creation of a SecurityContext Based on PodSecurityPolicy
+
+The creation of a `SecurityContext` based on a `PodSecurityPolicy` is based upon the configured
+settings of the `PodSecurityPolicy`.
+
+There are three scenarios under which a `PodSecurityPolicy` field may fall:
+
+1. Governed by a boolean: fields of this type will be defaulted to the most restrictive value.
+For instance, `AllowPrivileged` will always be set to false if unspecified.
+
+1. Governed by an allowable set: fields of this type will be checked against the set to ensure
+their value is allowed. For example, `AllowCapabilities` will ensure that only capabilities
+that are allowed to be requested are considered valid. `HostNetworkSources` will ensure that
+only pods created from source X are allowed to request access to the host network.
+1. Governed by a strategy: Items that have a strategy to generate a value will provide a
+mechanism to generate the value as well as a mechanism to ensure that a specified value falls into
+the set of allowable values. See the Types section for the description of the interfaces that
+strategies must implement.
+
+Strategies may also be dynamic. To support a dynamic strategy, it should be possible for a
+strategy either to be pre-populated with dynamic data by another component (such as an admission
+controller) or to retrieve the information itself based on the data in the pod. An example of this
+would be a pre-allocated UID for the namespace. A dynamic `RunAsUser` strategy could inspect the
+namespace of the pod in order to find the required pre-allocated UID and generate or validate
+requests based on that information.
+
+
+```go
+// SELinuxStrategy defines the interface for all SELinux constraint strategies.
+type SELinuxStrategy interface {
+ // Generate creates the SELinuxOptions based on constraint rules.
+ Generate(pod *api.Pod, container *api.Container) (*api.SELinuxOptions, error)
+ // Validate ensures that the specified values fall within the range of the strategy.
+ Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList
+}
+
+// RunAsUserStrategy defines the interface for all uid constraint strategies.
+type RunAsUserStrategy interface {
+ // Generate creates the uid based on policy rules.
+ Generate(pod *api.Pod, container *api.Container) (*int64, error)
+ // Validate ensures that the specified values fall within the range of the strategy.
+ Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList
+}
+```
+
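+As an example of the strategy pattern, a minimal `MustRunAs` uid strategy might
+look like the sketch below. The error constructor shown is illustrative; the real
+implementation would use whatever helpers `fielderrors` provides.
+
+```go
+// mustRunAs is an illustrative RunAsUserStrategy that generates and
+// validates a single fixed uid taken from the policy options.
+type mustRunAs struct {
+	uid int64
+}
+
+// Generate returns the policy's required uid.
+func (s *mustRunAs) Generate(pod *api.Pod, container *api.Container) (*int64, error) {
+	uid := s.uid
+	return &uid, nil
+}
+
+// Validate ensures that any requested uid matches the required uid.
+func (s *mustRunAs) Validate(pod *api.Pod, container *api.Container) fielderrors.ValidationErrorList {
+	errs := fielderrors.ValidationErrorList{}
+	sc := container.SecurityContext
+	if sc == nil || sc.RunAsUser == nil || *sc.RunAsUser != s.uid {
+		// The exact error constructor is illustrative.
+		errs = append(errs, fielderrors.NewFieldInvalid("securityContext.runAsUser", sc, "uid must match the policy's required uid"))
+	}
+	return errs
+}
+```
+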
+## Escalating Privileges by an Administrator
+
+An administrator may wish to create a resource in a namespace that runs with
+escalated privileges. By allowing security context
+constraints to operate on both the requesting user and the pod's service account, administrators are able to
+create pods in namespaces with elevated privileges based on the administrator's security context
+constraints.
+
+This also allows the system to guard against commands being executed in a non-conforming container. For
+instance, an `exec` command can first check the security context of the pod against the security
+context constraints of the user or the user's ability to reference a service account.
+If it does not validate, then the system can block users from executing the command. Since the validation
+will be user-aware, administrators would still be able to run commands that are restricted for normal users.
+
+## Interaction with the Kubelet
+
+In certain cases, the Kubelet may need to provide information about
+the image in order to validate the security context. An example of this is a cluster
+that is configured to run with a UID strategy of `MustRunAsNonRoot`.
+
+In this case the admission controller can set the existing `MustRunAsNonRoot` flag on the `SecurityContext`
+based on the UID strategy of the `PodSecurityPolicy`. It should still validate any requests on the pod
+for a specific UID and fail early if possible. However, if `RunAsUser` is not set on the pod,
+it should still admit the pod and allow the Kubelet to ensure that the image does not run as
+`root` with the existing non-root checks.
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/security-context-constraints.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/security.md b/contributors/design-proposals/security.md
new file mode 100644
index 00000000..b1aeacbd
--- /dev/null
+++ b/contributors/design-proposals/security.md
@@ -0,0 +1,218 @@
+# Security in Kubernetes
+
+Kubernetes should define a reasonable set of security best practices that allows
+processes to be isolated from each other and from the cluster infrastructure, and
+that preserves important boundaries between those who manage the cluster and
+those who use the cluster.
+
+While Kubernetes today is not primarily a multi-tenant system, the long term
+evolution of Kubernetes will increasingly rely on proper boundaries between
+users and administrators. The code running on the cluster must be appropriately
+isolated and secured to prevent malicious parties from affecting the entire
+cluster.
+
+
+## High Level Goals
+
+1. Ensure a clear isolation between the container and the underlying host it
+runs on
+2. Limit the ability of the container to negatively impact the infrastructure
+or other containers
+3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) -
+ensure components are only authorized to perform the actions they need, and
+limit the scope of a compromise by limiting the capabilities of individual
+components
+4. Reduce the number of systems that have to be hardened and secured by
+defining clear boundaries between components
+5. Allow users of the system to be cleanly separated from administrators
+6. Allow administrative functions to be delegated to users where necessary
+7. Allow applications to be run on the cluster that have "secret" data (keys,
+certs, passwords) which is properly abstracted from "public" data.
+
+## Use cases
+
+### Roles
+
+We define "user" as a unique identity accessing the Kubernetes API server, which
+may be a human or an automated process. Human users fall into the following
+categories:
+
+1. k8s admin - administers a Kubernetes cluster and has access to the underlying
+components of the system
+2. k8s project administrator - administers the security of a small subset of
+the cluster
+3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster
+resources
+
+Automated process users fall into the following categories:
+
+1. k8s container user - the identity that processes running inside a container (on
+the cluster) use to access other cluster resources, independent of the human
+users attached to a project
+2. k8s infrastructure user - the user that Kubernetes infrastructure components
+use to perform cluster functions with clearly defined roles
+
+### Description of roles
+
+* Developers:
+ * write pod specs.
+ * make some of their own images, and use some "community" docker images
+ * know which pods need to talk to which other pods
+ * decide which pods should share files with other pods, and which should not.
+ * reason about application level security, such as containing the effects of a
+local-file-read exploit in a webserver pod.
+ * do not often reason about operating system or organizational security.
+ * are not necessarily comfortable reasoning about the security properties of a
+system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
+
+* Project Admins:
+ * allocate identity and roles within a namespace
+ * reason about organizational security within a namespace
+ * don't give a developer permissions that are not needed for their role.
+ * protect files on shared storage from unnecessary cross-team access
+ * are less focused on application security
+
+* Administrators:
+ * are less focused on application security and more focused on operating
+system security.
+ * protect the node from bad actors in containers, and properly-configured
+innocent containers from bad actors in other containers.
+ * comfortable reasoning about the security properties of a system at the level
+of detail of Linux Capabilities, SELinux, AppArmor, etc.
+ * decide who can use which Linux Capabilities, run privileged containers, use
+hostPath, etc.
+ * e.g. a team that manages Ceph or a mysql server might be trusted to have
+raw access to storage devices in some organizations, but teams that develop the
+applications at higher layers would not.
+
+
+## Proposed Design
+
+A pod runs in a *security context* under a *service account* that is defined by
+an administrator or project administrator, and the *secrets* a pod has access to
+are limited by that *service account*.
+
+
+1. The API should authenticate and authorize user actions [authn and authz](access.md)
+2. All infrastructure components (kubelets, kube-proxies, controllers,
+scheduler) should have an infrastructure user that they can authenticate with
+and be authorized to perform only the functions they require against the API.
+3. Most infrastructure components should use the API as a way of exchanging data
+and changing the system, and only the API should have access to the underlying
+data store (etcd)
+4. When containers run on the cluster and need to talk to other containers or
+the API server, they should be identified and authorized clearly as an
+autonomous process via a [service account](service_accounts.md)
+ 1. If the user who started a long-lived process is removed from access to
+the cluster, the process should be able to continue without interruption
+ 2. If the users who started processes are removed from the cluster,
+administrators may wish to terminate their processes in bulk
+ 3. When containers run with a service account, the user that created /
+triggered the service account behavior must be associated with the container's
+action
+5. When container processes run on the cluster, they should run in a
+[security context](security_context.md) that isolates those processes via Linux
+user security, user namespaces, and permissions.
+ 1. Administrators should be able to configure the cluster to automatically
+confine all container processes as a non-root, randomly assigned UID
+ 2. Administrators should be able to ensure that container processes within
+the same namespace are all assigned the same unix user UID
+ 3. Administrators should be able to limit which developers and project
+administrators have access to higher privilege actions
+ 4. Project administrators should be able to run pods within a namespace
+under different security contexts, and developers must be able to specify which
+of the available security contexts they may use
+ 5. Developers should be able to run their own images or images from the
+community and expect those images to run correctly
+ 6. Developers may need to ensure their images work within higher security
+requirements specified by administrators
+ 7. When available, Linux kernel user namespaces can be used to ensure 5.2
+and 5.4 are met.
+ 8. When application developers want to share filesystem data via distributed
+filesystems, the Unix user ids on those filesystems must be consistent across
+different container processes
+6. Developers should be able to define [secrets](secrets.md) that are
+automatically added to the containers when pods are run
+ 1. Secrets are files injected into the container whose values should not be
+displayed within a pod. Examples:
+ 1. An SSH private key for git cloning remote data
+ 2. A client certificate for accessing a remote system
+ 3. A private key and certificate for a web server
+ 4. A .kubeconfig file with embedded cert / token data for accessing the
+Kubernetes master
+ 5. A .dockercfg file for pulling images from a protected registry
+ 2. Developers should be able to define the pod spec so that a secret lands
+in a specific location
+ 3. Project administrators should be able to limit developers within a
+namespace from viewing or modifying secrets (anyone who can launch an arbitrary
+pod can view secrets)
+ 4. Secrets are generally not copied from one namespace to another when a
+developer's application definitions are copied
+
+
+### Related design discussion
+
+* [Authorization and authentication](access.md)
+* [Secret distribution via files](http://pr.k8s.io/2030)
+* [Docker secrets](https://github.com/docker/docker/pull/6697)
+* [Docker vault](https://github.com/docker/docker/issues/10310)
+* [Service Accounts](service_accounts.md)
+* [Secret volumes](http://pr.k8s.io/4126)
+
+## Specific Design Points
+
+### TODO: authorization, authentication
+
+### Isolate the data store from the nodes and supporting infrastructure
+
+Access to the central data store (etcd) in Kubernetes allows an attacker to run
+arbitrary containers on hosts, to gain access to any protected information
+stored in either volumes or in pods (such as access tokens or shared secrets
+provided as environment variables), to intercept and redirect traffic from
+running services by inserting middlemen, or to simply delete the entire history
+of the cluster.
+
+As a general principle, access to the central data store should be restricted to
+the components that need full control over the system and which can apply
+appropriate authorization and authentication of change requests. In the future,
+etcd may offer granular access control, but that granularity will require an
+administrator to understand the schema of the data to properly apply security.
+An administrator must be able to properly secure Kubernetes at a policy level,
+rather than at an implementation level, and schema changes over time should not
+risk unintended security leaks.
+
+Both the Kubelet and Kube Proxy need information related to their specific roles -
+for the Kubelet, the set of pods it should be running, and for the Proxy, the
+set of services and endpoints to load balance. The Kubelet also needs to provide
+information about running pods and historical termination data. The access
+pattern for both Kubelet and Proxy to load their configuration is an efficient
+"wait for changes" request over HTTP. It should be possible to limit the Kubelet
+and Proxy to only access the information they need to perform their roles and no
+more.
+
+The controller manager for Replication Controllers and other future controllers
+act on behalf of a user via delegation to perform automated maintenance on
+Kubernetes resources. Their ability to access or modify resource state should be
+strictly limited to their intended duties and they should be prevented from
+accessing information not pertinent to their role. For example, a replication
+controller needs only to create a copy of a known pod configuration, to
+determine the running state of an existing pod, or to delete an existing pod
+that it created - it does not need to know the contents or current state of a
+pod, nor have access to any data in the pod's attached volumes.
+
+The Kubernetes pod scheduler is responsible for reading data from the pod to fit
+it onto a node in the cluster. At a minimum, it needs access to view the ID of a
+pod (to craft the binding), its current state, any resource information
+necessary to identify placement, and other data relevant to concerns like
+anti-affinity, zone or region preference, or custom logic. It does not need the
+ability to modify pods or see other resources, only to create bindings. It
+should not need the ability to delete bindings unless the scheduler takes
+control of relocating components on failed hosts (which could be implemented by
+a separate component that can delete bindings but not create them). The
+scheduler may need read access to user or project-container information to
+determine preferential location (underspecified at this time).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/security_context.md b/contributors/design-proposals/security_context.md
new file mode 100644
index 00000000..76bc8ee8
--- /dev/null
+++ b/contributors/design-proposals/security_context.md
@@ -0,0 +1,192 @@
+# Security Contexts
+
+## Abstract
+
+A security context is a set of constraints that are applied to a container in
+order to achieve the following goals (from [security design](security.md)):
+
+1. Ensure a clear isolation between container and the underlying host it runs
+on
+2. Limit the ability of the container to negatively impact the infrastructure
+or other containers
+
+## Background
+
+The problem of securing containers in Kubernetes has come up
+[before](http://issue.k8s.io/398) and the potential problems with container
+security are [well known](http://opensource.com/business/14/7/docker-security-selinux).
+Although it is not possible to completely isolate Docker containers from their
+hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304)
+make it possible to greatly reduce the attack surface.
+
+## Motivation
+
+### Container isolation
+
+In order to improve container isolation from host and other containers running
+on the host, containers should only be granted the access they need to perform
+their work. To this end it should be possible to take advantage of Docker
+features such as the ability to
+[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration)
+and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
+to the container process.
+
+Support for user namespaces has recently been
+[merged](https://github.com/docker/libcontainer/pull/304) into Docker's
+libcontainer project and should soon surface in Docker itself. It will make it
+possible to assign a range of unprivileged uids and gids from the host to each
+container, improving the isolation between host and container and between
+containers.
+
+### External integration with shared storage
+
+In order to support external integration with shared storage, processes running
+in a Kubernetes cluster should be able to be uniquely identified by their Unix
+UID, such that a chain of ownership can be established. Processes in pods will
+need to have consistent UID/GID/SELinux category labels in order to access
+shared disks.
+
+## Constraints and Assumptions
+
+* It is out of the scope of this document to prescribe a specific set of
+constraints to isolate containers from their host. Different use cases need
+different settings.
+* The concept of a security context should not be tied to a particular security
+mechanism or platform (i.e. SELinux, AppArmor)
+* Applying a different security context to a scope (namespace or pod) requires
+a solution such as the one proposed for [service accounts](service_accounts.md).
+
+## Use Cases
+
+In order of increasing complexity, following are example use cases that would
+be addressed with security contexts:
+
+1. Kubernetes is used to run a single cloud application. In order to protect
+nodes from containers:
+ * All containers run as a single non-root user
+ * Privileged containers are disabled
+ * All containers run with a particular MCS label
+ * Kernel capabilities like CHOWN and MKNOD are removed from containers
+
+2. Just like case #1, except that I have more than one application running on
+the Kubernetes cluster.
+ * Each application is run in its own namespace to avoid name collisions
+ * For each application a different uid and MCS label is used
+
+3. Kubernetes is used as the base for a PAAS with multiple projects, each
+project represented by a namespace.
+ * Each namespace is associated with a range of uids/gids on the node that
+are mapped to uids/gids on containers using linux user namespaces.
+ * Certain pods in each namespace have special privileges to perform system
+actions such as talking back to the server for deployment, run docker builds,
+etc.
+ * External NFS storage is assigned to each namespace and permissions set
+using the range of uids/gids assigned to that namespace.
+
+## Proposed Design
+
+### Overview
+
+A *security context* consists of a set of constraints that determine how a
+container is secured before getting created and run. A security context resides
+on the container and represents the runtime parameters that will be used to
+create and run the container via container APIs. A *security context provider*
+is passed to the Kubelet so it can have a chance to mutate Docker API calls in
+order to apply the security context.
+
+It is recommended that this design be implemented in two phases:
+
+1. Implement the security context provider extension point in the Kubelet
+so that a default security context can be applied on container run and creation.
+2. Implement a security context structure that is part of a service account. The
+default context provider can then be used to apply a security context based on
+the service account associated with the pod.
+
+### Security Context Provider
+
+The Kubelet will have an interface that points to a `SecurityContextProvider`.
+The `SecurityContextProvider` is invoked before creating and running a given
+container:
+
+```go
+type SecurityContextProvider interface {
+ // ModifyContainerConfig is called before the Docker createContainer call.
+ // The security context provider can make changes to the Config with which
+ // the container is created.
+ // An error is returned if it's not possible to secure the container as
+ // requested with a security context.
+ ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config)
+
+ // ModifyHostConfig is called before the Docker runContainer call.
+ // The security context provider can make changes to the HostConfig, affecting
+ // security options, whether the container is privileged, volume binds, etc.
+ // An error is returned if it's not possible to secure the container as requested
+ // with a security context.
+ ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig)
+}
+```
+
+If the value of the SecurityContextProvider field on the Kubelet is nil, the
+kubelet will create and run the container as it does today.
+
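+For illustration, a very simple provider that applies an administrator-chosen
+default might look like the sketch below. The `docker.Config`/`docker.HostConfig`
+field names used here are assumptions about the Docker client types, and the
+specific defaults are examples only.
+
+```go
+// simpleSecurityContextProvider is an illustrative provider that applies a
+// fixed default; the docker.Config/docker.HostConfig field names are assumed.
+type simpleSecurityContextProvider struct {
+	defaultSELinuxLevel string
+}
+
+// ModifyContainerConfig runs before createContainer; here it forces a
+// non-root user when the image has not set one.
+func (p *simpleSecurityContextProvider) ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) {
+	if config.User == "" {
+		config.User = "1001"
+	}
+}
+
+// ModifyHostConfig runs before runContainer; here it disables privileged
+// mode and applies a default SELinux level.
+func (p *simpleSecurityContextProvider) ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) {
+	hostConfig.Privileged = false
+	if p.defaultSELinuxLevel != "" {
+		hostConfig.SecurityOpt = append(hostConfig.SecurityOpt, "label:level:"+p.defaultSELinuxLevel)
+	}
+}
+```
+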
+### Security Context
+
+A security context resides on the container and represents the runtime
+parameters that will be used to create and run the container via container APIs.
+Following is an example of an initial implementation:
+
+```go
+type Container struct {
+ ... other fields omitted ...
+ // Optional: SecurityContext defines the security options the pod should be run with
+ SecurityContext *SecurityContext
+}
+
+// SecurityContext holds security configuration that will be applied to a container. SecurityContext
+// contains duplication of some existing fields from the Container resource. These duplicate fields
+// will be populated based on the Container configuration if they are not set. Defining them on
+// both the Container AND the SecurityContext will result in an error.
+type SecurityContext struct {
+ // Capabilities are the capabilities to add/drop when running the container
+ Capabilities *Capabilities
+
+ // Run the container in privileged mode
+ Privileged *bool
+
+ // SELinuxOptions are the labels to be applied to the container
+ // and volumes
+ SELinuxOptions *SELinuxOptions
+
+ // RunAsUser is the UID to run the entrypoint of the container process.
+ RunAsUser *int64
+}
+
+// SELinuxOptions are the labels to be applied to the container.
+type SELinuxOptions struct {
+ // SELinux user label
+ User string
+
+ // SELinux role label
+ Role string
+
+ // SELinux type label
+ Type string
+
+ // SELinux level label.
+ Level string
+}
+```
+
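+For illustration, a non-root, non-privileged context that drops a few capabilities
+and requests specific SELinux labels might look like the following; the
+`Capabilities` field names are assumed from the existing container API, and all
+values are examples only.
+
+```go
+// Example values only; the Capabilities field names are assumed from the
+// existing container API.
+var (
+	uid        int64 = 1001
+	privileged       = false
+)
+
+var exampleContext = SecurityContext{
+	Capabilities: &Capabilities{
+		Drop: []Capability{"CHOWN", "MKNOD", "SETUID", "SETGID"},
+	},
+	Privileged: &privileged,
+	SELinuxOptions: &SELinuxOptions{
+		User:  "system_u",
+		Role:  "system_r",
+		Type:  "svirt_lxc_net_t",
+		Level: "s0:c123,c456",
+	},
+	RunAsUser: &uid,
+}
+```
+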
+### Admission
+
+It is up to an admission plugin to determine if the security context is
+acceptable or not. At the time of writing, the admission control plugin for
+security contexts will only allow a context that defines capabilities or
+privileged mode. Contexts that attempt to define a UID or SELinux options will be
+denied by default. In the future the admission plugin will base this decision
+upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/selector-generation.md b/contributors/design-proposals/selector-generation.md
new file mode 100644
index 00000000..efb32cf2
--- /dev/null
+++ b/contributors/design-proposals/selector-generation.md
@@ -0,0 +1,180 @@
+Design
+=============
+
+# Goals
+
+Make it really hard to accidentally create a job which has an overlapping
+selector, while still making it possible to choose an arbitrary selector, and
+without adding complex constraint solving to the API server.
+
+# Use Cases
+
+1. user can leave all label and selector fields blank and the system will fill in
+reasonable ones: non-overlappingness guaranteed.
+2. user can put on the pod template some labels that are useful to the user,
+without reasoning about non-overlappingness. The system adds an additional label
+to ensure the selector does not overlap.
+3. If user wants to reparent pods to new job (very rare case) and knows what
+they are doing, they can completely disable this behavior and specify explicit
+selector.
+4. If a controller that makes jobs, like scheduled job, wants to use different
+labels, such as the time and date of the run, it can do that.
+5. If a user reads v1beta1 documentation or reuses v1beta1 Job definitions and
+just changes the API group, the user should not automatically be allowed to
+specify a selector, since this is very rarely what people want to do and is
+error prone.
+6. If a user downloads an existing job definition, e.g. with
+`kubectl get jobs/old -o yaml` and tries to modify and post it, he should not
+create an overlapping job.
+7. If a user downloads an existing job definition, e.g. with
+`kubectl get jobs/old -o yaml` and tries to modify and post it, and he
+accidentally copies the uniquifying label from the old one, then he should not
+get an error from a label-key conflict, nor get erratic behavior.
+8. If a user reads swagger docs and sees the selector field, he should not be able
+to set it without realizing the risks.
+9. (Deferred requirement:) If a user wants to specify a preferred name for the
+non-overlappingness key, they can pick a name.
+# Proposed changes
+
+## API
+
+`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as
+follows.
+
+Field `job.spec.manualSelector` is added. It controls whether selectors are
+automatically generated. In automatic mode, user cannot make the mistake of
+creating non-unique selectors. In manual mode, certain rare use cases are
+supported.
+
+Validation is not changed. A selector must be provided, and it must select the
+pod template.
+
+Defaulting changes. Defaulting happens in one of two modes:
+
+### Automatic Mode
+
+- User does not specify `job.spec.selector`.
+- User is probably unaware of the `job.spec.manualSelector` field and does not
+think about it.
+- User optionally puts labels on the pod template. User does not think
+about uniqueness, just labeling for the user's own reasons.
+- Defaulting logic sets `job.spec.selector` to
+`matchLabels["controller-uid"]="$UIDOFJOB"`
+- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
+ - The first label is `controller-uid=$UIDOFJOB`.
+ - The second label is `job-name=$NAMEOFJOB`.
+
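+The defaulting described above amounts to something like the following sketch;
+the function, package, and field names are illustrative approximations of the
+batch API, not the actual defaulting code.
+
+```go
+// defaultJobSelector is an illustrative sketch of automatic-mode defaulting.
+func defaultJobSelector(job *batch.Job) {
+	if job.Spec.Selector != nil {
+		return // manual mode; nothing to default
+	}
+	uid := string(job.UID)
+	job.Spec.Selector = &unversioned.LabelSelector{
+		MatchLabels: map[string]string{"controller-uid": uid},
+	}
+	if job.Spec.Template.Labels == nil {
+		job.Spec.Template.Labels = map[string]string{}
+	}
+	// Append the two generated labels to the pod template.
+	job.Spec.Template.Labels["controller-uid"] = uid
+	job.Spec.Template.Labels["job-name"] = job.Name
+}
+```
+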
+### Manual Mode
+
+- "User" means a user or a controller for the rest of this list.
+- User does specify `job.spec.selector`.
+- User does specify `job.spec.manualSelector=true`
+- User puts a unique label or label(s) on pod template (required). User does
+think carefully about uniqueness.
+- No defaulting of pod labels or the selector happen.
+
+### Rationale
+
+UID is better than Name in that:
+- it allows cross-namespace control someday if we need it.
+- it is unique across all kinds. `controller-name=foo` does not ensure
+uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a
+problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the
+latter cannot use label `job-name=foo`, though there is a temptation to do so.
+- it uniquely identifies the controller across time. This prevents the case
+where, for example, someone deletes a job via the REST api or client
+(where cascade=false), leaving pods around. We don't want those to be picked up
+unintentionally. It also prevents the case where a user looks at an old job that
+finished but is not deleted, and tries to select its pods, and gets the wrong
+impression that it is still running.
+
+Job name is more user-friendly and self-documenting.
+
+Commands like `kubectl get pods -l job-name=myjob` should do exactly what is
+wanted 99.9% of the time. Automated control loops should still use the
+`controller-uid` label.
+
+Using both gets the benefits of both, at the cost of some label verbosity.
+
+The field is a `*bool`. Since false is expected to be much more common,
+and since the feature is complex, it is better to leave it unspecified so that
+users looking at a stored job spec do not need to be aware of this field.
+
+### Overriding Unique Labels
+
+If user does specify `job.spec.selector` then the user must also specify
+`job.spec.manualSelector`. This ensures the user knows that what he is doing is
+not the normal thing to do.
+
+To prevent users from copying the `job.spec.manualSelector` flag from existing
+jobs, it will be optional and default to false, which means that when you GET an
+existing job back that didn't use this feature, you don't even see the
+`job.spec.manualSelector` flag, so you are not tempted to wonder if you should
+fiddle with it.
+
+## Job Controller
+
+No changes
+
+## Kubectl
+
+No required changes. Suggest moving SELECTOR to wide output of `kubectl get
+jobs` since users do not write the selector.
+
+## Docs
+
+Remove examples that use selector and remove labels from pod templates.
+Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job.
+
+# Conversion
+
+The following applies to Job, as well as to other types that adopt this pattern:
+
+- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`.
+- Both the internal type and the `batch/v1` type will get
+`job.spec.manualSelector`.
+- The fields `manualSelector` and `autoSelector` have opposite meanings.
+- Each field defaults to false when unset, and so v1beta1 has a different
+default than v1 and internal. This is intentional: we want new uses to default
+to the less error-prone behavior, and we do not want to change the behavior of
+v1beta1.
+
+*Note*: since the internal default is changing, client library consumers that
+create Jobs may need to add "job.spec.manualSelector=true" to keep working, or
+switch to auto selectors.
+
+Conversion is as follows:
+- `extensions/__internal` to `extensions/v1beta1`: the value of
+`__internal.Spec.ManualSelector` is defaulted to false if nil, negated,
+defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
+- `extensions/v1beta1` to `extensions/__internal`: the value of
+`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to
+nil if false, and written to `__internal.Spec.ManualSelector`.
+
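+A sketch of the two conversion directions follows; the package, type, and function
+names are illustrative.
+
+```go
+// Illustrative sketch of the manualSelector/autoSelector conversion.
+func convertInternalToV1beta1(in *extensions.JobSpec, out *v1beta1.JobSpec) {
+	manual := in.ManualSelector != nil && *in.ManualSelector // default false if nil
+	auto := !manual                                          // negate
+	if auto {
+		out.AutoSelector = &auto
+	} else {
+		out.AutoSelector = nil // defaulted to nil if false
+	}
+}
+
+func convertV1beta1ToInternal(in *v1beta1.JobSpec, out *extensions.JobSpec) {
+	auto := in.AutoSelector != nil && *in.AutoSelector // default false if nil
+	manual := !auto                                    // negate
+	if manual {
+		out.ManualSelector = &manual
+	} else {
+		out.ManualSelector = nil // defaulted to nil if false
+	}
+}
+```
+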
+This conversion gives the following properties.
+
+1. Users that previously used v1beta1 do not start seeing a new field when they
+get back objects.
+2. Distinction between originally unset versus explicitly set to false is not
+preserved (would have been nice to do so, but it requires a more complicated
+solution).
+3. Users who only created v1beta1 examples or v1 examples will not ever see the
+existence of either field.
+4. Since v1beta1 is convertible to/from v1, the storage location (path in etcd)
+does not need to change, allowing scriptable rollforward/rollback.
+
+# Future Work
+
+Follow this pattern for Deployment, ReplicaSet, and DaemonSet when going to v1, if
+it works well for Job.
+
+Docs will be edited to show examples without a `job.spec.selector`.
+
+As much as possible, we probably want the same behavior for Job and
+ReplicationController.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/self-hosted-kubelet.md b/contributors/design-proposals/self-hosted-kubelet.md
new file mode 100644
index 00000000..d2318bea
--- /dev/null
+++ b/contributors/design-proposals/self-hosted-kubelet.md
@@ -0,0 +1,135 @@
+# Proposal: Self-hosted kubelet
+
+## Abstract
+
+In a self-hosted Kubernetes deployment (see [this
+comment](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959)
+for background on self-hosted Kubernetes), we have the initial bootstrap problem.
+When running self-hosted components, there needs to be a mechanism for pivoting
+from the initial bootstrap state to the kubernetes-managed (self-hosted) state.
+In the case of a self-hosted kubelet, this means pivoting from the initial
+kubelet defined and run on the host, to the kubelet pod which has been scheduled
+to the node.
+
+This proposal presents a solution to the kubelet bootstrap, and assumes a
+functioning control plane (e.g. an apiserver, controller-manager, scheduler, and
+etcd cluster), and a kubelet that can securely contact the API server. This
+functioning control plane can be temporary, and not necessarily the "production"
+control plane that will be used after the initial pivot / bootstrap.
+
+## Background and Motivation
+
+In order to understand the goals of this proposal, one must understand what
+"self-hosted" means. This proposal defines "self-hosted" as a kubernetes cluster
+that is installed and managed by the kubernetes installation itself. This means
+that each kubernetes component is described by a kubernetes manifest (Daemonset,
+Deployment, etc) and can be updated via kubernetes.
+
+The overall goal of this proposal is to make kubernetes easier to install and
+upgrade. We can then treat kubernetes itself just like any other application
+hosted in a kubernetes cluster, and have access to easy upgrades, monitoring,
+and durability for core kubernetes components themselves.
+
+We intend to achieve this by using kubernetes to manage itself. However, in
+order to do that we must first "bootstrap" the cluster, by using kubernetes to
+install kubernetes components. This is where this proposal fits in, by
+describing the necessary modifications, and required procedures, needed to run a
+self-hosted kubelet.
+
+The approach being proposed for a self-hosted kubelet is a "pivot" style
+installation. This procedure assumes a short-lived "bootstrap" kubelet will run
+and start a long-running "self-hosted" kubelet. Once the self-hosted kubelet is
+running, the bootstrap kubelet will exit. As part of this, we propose introducing
+a new `--bootstrap` flag to the kubelet. The behavior of that flag is
+explained in detail below.
+
+## Proposal
+
+We propose adding a new flag to the kubelet, the `--bootstrap` flag, which is
+assumed to be used in conjunction with the `--lock-file` flag. The `--lock-file`
+flag is used to ensure only a single kubelet is running at any given time during
+this pivot process. When the `--bootstrap` flag is provided, after the kubelet
+acquires the file lock, it will begin asynchronously waiting on
+[inotify](http://man7.org/linux/man-pages/man7/inotify.7.html) events. Once an
+"open" event is received, the kubelet will assume another kubelet is attempting
+to take control and will exit by calling `exit(0)`.
+
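+A rough sketch of that lock-plus-inotify sequence using raw inotify syscalls is
+shown below; paths, flags, and error handling are simplified for illustration and
+are not the proposed implementation.
+
+```go
+// bootstrapWait is an illustrative sketch of the --bootstrap behavior:
+// hold the lock file, then exit once another kubelet opens it.
+func bootstrapWait(lockPath string) error {
+	fd, err := unix.Open(lockPath, unix.O_CREAT|unix.O_RDWR, 0600)
+	if err != nil {
+		return err
+	}
+	// Block until this process owns the lock (the --lock-file behavior).
+	if err := unix.Flock(fd, unix.LOCK_EX); err != nil {
+		return err
+	}
+
+	// The kubelet's normal work would run concurrently with the watch below.
+
+	// Watch for "open" events on the lock file; another kubelet opening it
+	// means a self-hosted kubelet is trying to take over.
+	ifd, err := unix.InotifyInit1(0)
+	if err != nil {
+		return err
+	}
+	if _, err := unix.InotifyAddWatch(ifd, lockPath, unix.IN_OPEN); err != nil {
+		return err
+	}
+	buf := make([]byte, unix.SizeofInotifyEvent+unix.NAME_MAX+1)
+	if _, err := unix.Read(ifd, buf); err != nil {
+		return err
+	}
+	// Another kubelet wants the lock: exit so it can take over.
+	os.Exit(0)
+	return nil
+}
+```
+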
+Thus, the initial bootstrap becomes:
+
+1. "bootstrap" kubelet is started by $init system.
+1. "bootstrap" kubelet pulls down "self-hosted" kubelet as a pod from a
+ daemonset
+1. "self-hosted" kubelet attempts to acquire the file lock, causing "bootstrap"
+ kubelet to exit
+1. "self-hosted" kubelet acquires lock and takes over
+1. "bootstrap" kubelet is restarted by $init system and blocks on acquiring the
+ file lock
+
+During an upgrade of the kubelet, for simplicity we will consider 3 kubelets,
+namely "bootstrap", "v1", and "v2". We imagine the following scenario for
+upgrades:
+
+1. Cluster administrator introduces "v2" kubelet daemonset
+1. "v1" kubelet pulls down and starts "v2"
+1. Cluster administrator removes "v1" kubelet daemonset
+1. "v1" kubelet is killed
+1. Both "bootstrap" and "v2" kubelets race for file lock
+1. If "v2" kubelet acquires lock, process has completed
+1. If "bootstrap" kubelet acquires lock, it is assumed that "v2" kubelet will
+ fail a health check and be killed. Once restarted, it will try to acquire the
+ lock, triggering the "bootstrap" kubelet to exit.
+
+Alternatively, it would also be possible via this mechanism to delete the "v1"
+daemonset first, allow the "bootstrap" kubelet to take over, and then introduce
+the "v2" kubelet daemonset, effectively eliminating the race between "bootstrap"
+and "v2" for lock acquisition, and the reliance on the failing health check
+procedure.
+
+Eventually this could be handled by a DaemonSet upgrade policy.
+
+This will allow a "self-hosted" kubelet with minimal new concepts introduced
+into the core Kubernetes code base, and remains flexible enough to work well
+with future [bootstrapping
+services](https://github.com/kubernetes/kubernetes/issues/5754).
+
+## Production readiness considerations / Out of scope issues
+
+* Deterministically pulling and running kubelet pod: we would prefer not to have
+ to loop until we finally get a kubelet pod.
+* It is possible that the bootstrap kubelet version is incompatible with the
+ newer versions that were run in the node. For example, the cgroup
+ configurations might be incompatible. In the beginning, we will require
+ cluster admins to keep the configuration in sync. Since we want the bootstrap
+ kubelet to come up and run even if the API server is not available, we should
+ persist the configuration for bootstrap kubelet on the node. Once we have
+ checkpointing in kubelet, we will checkpoint the updated config and have the
+ bootstrap kubelet use the updated config, if it were to take over.
+* Currently best practice when upgrading the kubelet on a node is to drain all
+ pods first. Automatically draining of the node during kubelet upgrade is out
+ of scope for this proposal. It is assumed that either the cluster
+ administrator or the daemonset upgrade policy will handle this.
+
+## Other discussion
+
+Various similar approaches have been discussed
+[here](https://github.com/kubernetes/kubernetes/issues/246#issuecomment-64533959)
+and
+[here](https://github.com/kubernetes/kubernetes/issues/23073#issuecomment-198478997).
+Other discussion around the kubelet being able to be run inside a container is
+[here](https://github.com/kubernetes/kubernetes/issues/4869). Note this isn't a
+strict requirement, as the kubelet could be run in a chroot jail via rkt fly or
+another similar approach.
+
+Additionally, [Taints and
+Tolerations](../../docs/design/taint-toleration-dedicated.md), whose design has
+already been accepted, would make the overall kubelet bootstrap more
+deterministic. With this, we would also need the ability for a kubelet to
+register itself with a given taint when it first contacts the API server. Given
+that, a kubelet could register itself with a given taint such as
+“component=kubelet”, and a kubelet pod could exist that has a toleration to that
+taint, ensuring it is the only pod the “bootstrap” kubelet runs.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/self-hosted-kubelet.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/selinux-enhancements.md b/contributors/design-proposals/selinux-enhancements.md
new file mode 100644
index 00000000..3b3e168a
--- /dev/null
+++ b/contributors/design-proposals/selinux-enhancements.md
@@ -0,0 +1,209 @@
+## Abstract
+
+Presents a proposal for enhancing the security of Kubernetes clusters using
+SELinux and simplifying the implementation of SELinux support within the
+Kubelet by removing the need to label the Kubelet directory with an SELinux
+context usable from a container.
+
+## Motivation
+
+The current Kubernetes codebase relies upon the Kubelet directory being
+labeled with an SELinux context usable from a container. This means that a
+container escaping namespace isolation will be able to use any file within the
+Kubelet directory without defeating kernel
+[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
+In order to limit the attack surface, we should enhance the Kubelet to relabel
+any bind-mounts into containers with a usable SELinux context without depending
+on the Kubelet directory's SELinux context.
+
+## Constraints and Assumptions
+
+1. No API changes allowed
+2. Behavior must be fully backward compatible
+3. No new admission controllers - make incremental improvements without huge
+ refactorings
+
+## Use Cases
+
+1. As a cluster operator, I want to avoid having to label the Kubelet
+ directory with a label usable from a container, so that I can limit the
+ attack surface available to a container escaping its namespace isolation
+2. As a user, I want to run a pod without an SELinux context explicitly
+ specified and be isolated using MCS (multi-category security) on systems
+ where SELinux is enabled, so that the pods on each host are isolated from
+ one another
+3. As a user, I want to run a pod that uses the host IPC or PID namespace and
+ want the system to do the right thing with regard to SELinux, so that no
+ unnecessary relabel actions are performed
+
+### Labeling the Kubelet directory
+
+As previously stated, the current codebase relies on the Kubelet directory
+being labeled with an SELinux context usable from a container. The Kubelet
+uses the SELinux context of this directory to determine what SELinux context
+`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
+The problem with this is that it opens an attack surface to a container that
+escapes its namespace isolation; such a container would be able to use any
+file in the Kubelet directory without defeating kernel MAC.
+
+### SELinux when no context is specified
+
+When no SELinux context is specified, Kubernetes should just do the right
+thing, where doing the right thing is defined as isolating pods with a
+node-unique set of categories. Node-uniqueness means unique among the pods
+scheduled onto the node. Long-term, we want to have a cluster-wide allocator
+for MCS labels. Node-unique MCS labels are a good middle ground that is
+possible without a new, large feature.
+
+### SELinux and host IPC and PID namespaces
+
+Containers in pods that use the host IPC or PID namespaces need access to
+other processes and IPC mechanisms on the host. Therefore, these containers
+should be run with the `spc_t` SELinux type by the container runtime. The
+`spc_t` type is an unconfined type that other SELinux domains are allowed to
+connect to. In the case where a pod uses one of these host namespaces, it
+should be unnecessary to relabel the pod's volumes.
+
+## Analysis
+
+### Libcontainer SELinux library
+
+Docker and rkt both use the libcontainer SELinux library. This library
+provides a method, `GetLxcContexts`, that returns a unique SELinux context
+for container processes and the files they use. `GetLxcContexts` reads the
+base SELinux context information from a file at
+`/etc/selinux/<policy-name>/contexts/lxc_contexts` and then adds a
+process-unique MCS label.
+
+Docker and rkt both leverage this call to determine the 'starting' SELinux
+contexts for containers.
+
+### Docker
+
+Docker's behavior when no SELinux context is defined for a container is to
+give the container a node-unique MCS label.
+
+#### Sharing IPC namespaces
+
+On the Docker runtime, the containers in a Kubernetes pod share the IPC and
+PID namespaces of the pod's infra container.
+
+Docker's behavior for containers sharing these namespaces is as follows: if a
+container B shares the IPC namespace of another container A, container B is
+given the SELinux context of container A. Therefore, for Kubernetes pods
+running on docker, in a vacuum the containers in a pod should have the same
+SELinux context.
+
+[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
+the seccomp profile is set on a docker container that shares the IPC namespace
+of another container, that container will not receive the other container's
+SELinux context.
+
+#### Host IPC and PID namespaces
+
+In the case of a pod that shares the host IPC or PID namespace, any specified
+SELinux context is simply ignored and the container receives the `spc_t`
+SELinux type. The `spc_t` type is unconfined, and so no relabeling needs to be
+done for volumes for these pods. Currently, however, there is code which
+relabels volumes into explicitly specified SELinux contexts for these pods.
+This code is unnecessary and should be removed.
+
+#### Relabeling bind-mounts
+
+Docker is capable of relabeling bind-mounts into containers using the `:Z`
+bind-mount flag. However, in the current implementation of the docker runtime
+in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
+contains an SELinux context. We could easily implement the correct behaviors
+by always setting `:Z` on systems where SELinux is enabled.
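+
+As a rough sketch of that behavior (the helper below is illustrative, not the
+actual kubelet code), the bind string for a mount gains the `:Z` option
+whenever SELinux is enabled on the node and the mount is marked for
+relabeling, regardless of whether the pod specifies an SELinux context:
+
+```go
+package main
+
+import "fmt"
+
+// makeBind is a hypothetical helper: it builds a Docker bind string and
+// appends the ":Z" relabel option when SELinux is enabled on the node and
+// the mount is flagged for relabeling.
+func makeBind(hostPath, containerPath string, selinuxEnabled, relabel bool) string {
+    bind := fmt.Sprintf("%s:%s", hostPath, containerPath)
+    if selinuxEnabled && relabel {
+        bind += ":Z"
+    }
+    return bind
+}
+
+func main() {
+    fmt.Println(makeBind("/var/lib/kubelet/pods/uid/volumes/v", "/data", true, true))
+    // Output: /var/lib/kubelet/pods/uid/volumes/v:/data:Z
+}
+```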
+
+### rkt
+
+rkt's behavior when no SELinux context is defined for a pod is similar to
+Docker's -- an SELinux context with a node-unique MCS label is given to the
+containers of a pod.
+
+#### Sharing IPC namespaces
+
+Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
+namespace by default.
+
+#### Relabeling bind-mounts
+
+Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
+context.
+
+#### Host IPC and PID namespaces
+
+Using the host IPC and PID namespaces is not currently supported by rkt.
+
+## Proposed Changes
+
+### Refactor `pkg/util/selinux`
+
+1. The `selinux` package should provide a method `SELinuxEnabled` that returns
+ whether SELinux is enabled, and is built for all platforms (the
+   libcontainer SELinux package is only built on Linux)
+2. The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
+ and be changed to have the same method names and signatures as the
+ libcontainer methods its implementations wrap
+3. The `SELinuxRunner` interface only needs `Getfilecon`, which is used by
+ the rkt code
+
+```go
+package selinux
+
+// Note: the libcontainer SELinux package is only built for Linux, so it is
+// necessary to have a NOP wrapper which is built for non-Linux platforms to
+// allow code that links to this package not to differentiate its own methods
+// for Linux and non-Linux platforms.
+//
+// SELinuxRunner wraps certain libcontainer SELinux calls. For more
+// information, see:
+//
+// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
+type SELinuxRunner interface {
+ // Getfilecon returns the SELinux context for the given path or returns an
+ // error.
+ Getfilecon(path string) (string, error)
+}
+```
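+
+A minimal sketch of the platform split for `SELinuxEnabled` using Go build
+tags follows; the import path and the `SelinuxEnabled` call are assumptions
+based on the libcontainer SELinux package and may differ between versions:
+
+```go
+// +build linux
+
+package selinux
+
+import lcselinux "github.com/opencontainers/runc/libcontainer/selinux"
+
+// SELinuxEnabled reports whether SELinux is enabled on this node. It simply
+// defers to the libcontainer SELinux package on Linux.
+func SELinuxEnabled() bool {
+    return lcselinux.SelinuxEnabled()
+}
+```
+
+and the NOP wrapper for non-Linux platforms:
+
+```go
+// +build !linux
+
+package selinux
+
+// SELinuxEnabled always reports false on platforms where the libcontainer
+// SELinux package is not built.
+func SELinuxEnabled() bool {
+    return false
+}
+```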
+
+### Kubelet Changes
+
+1. The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
+ be removed
+2. The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
+ longer call `relabelVolumes`
+3. The `makeHostsMount` method in `kubelet_pods.go` should set the
+ `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`
+
+### Changes to `pkg/kubelet/dockertools/`
+
+1. The `makeMountBindings` method should be changed to:
+ 1. No longer accept the `podHasSELinuxLabel` parameter
+ 2. Always use the `:Z` bind-mount flag when SELinux is enabled and the mount
+ has the `SELinuxRelabel` attribute set to `true`
+2. The `runContainer` method should be changed to always use the `:Z`
+ bind-mount flag on the termination message mount when SELinux is enabled
+
+### Changes to `pkg/kubelet/rkt`
+
+There should not be any required changes for the rkt runtime; we should test to
+ensure things work as expected under rkt.
+
+### Changes to volume plugins and infrastructure
+
+1. The `VolumeHost` interface contains a method called `GetRootContext`; this
+ is an artifact of the old assumptions about the Kubelet directory's SELinux
+ context and can be removed
+2. The `empty_dir.go` file should be changed to be completely agnostic of
+ SELinux; no behavior in this plugin needs to be differentiated when SELinux
+ is enabled
+
+### Changes to `pkg/controller/...`
+
+A couple of PV controllers use NOP implementations of the `VolumeHost`
+abstraction. These implementations should be altered to no longer include
+`GetRootContext`.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selinux-enhancements.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/selinux.md b/contributors/design-proposals/selinux.md
new file mode 100644
index 00000000..ece83d44
--- /dev/null
+++ b/contributors/design-proposals/selinux.md
@@ -0,0 +1,317 @@
+## Abstract
+
+A proposal for enabling containers in a pod to share volumes using a pod level SELinux context.
+
+## Motivation
+
+Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin
+authors should not have to explicitly account for SELinux except for volume types that require
+special handling of the SELinux context during setup.
+
+Currently, each container in a pod has an SELinux context. This is not an ideal factoring for
+sharing resources using SELinux.
+
+We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a
+generic way.
+
+Goals of this design:
+
+1. Describe the problems with a container SELinux context
+2. Articulate a design for generic SELinux support for volumes using a pod level SELinux context
+ which is backward compatible with the v1.0.0 API
+
+## Constraints and Assumptions
+
+1. We will not support securing containers within a pod from one another
+2. Volume plugins should not have to handle setting SELinux context on volumes
+3. We will not deal with shared storage
+
+## Current State Overview
+
+### Docker
+
+Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux
+context of a container can be overridden with the `SecurityOpt` api that allows setting the different
+parts of the SELinux context individually.
+
+Docker has functionality to relabel bind-mounts with a usable SELinux context and supports two different
+use-cases:
+
+1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+ SELinux context
+2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+   SELinux context, but remove the MCS labels, making the volume shareable between containers
+
+We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container
+(from an SELinux standpoint) can use the volume.
+
+### rkt
+
+rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
+and allocates a unique MCS label per pod.
+
+### Kubernetes
+
+
+There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the
+EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
+patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the general
+problem of handling SELinux in Kubernetes rather than to merge this PR.
+
+A new `PodSecurityContext` type has been added that carries information about security attributes
+that apply to the entire pod and to all containers in the pod. See:
+
+1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
+1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
+
+## Use Cases
+
+1. As a cluster operator, I want to support securing pods from one another using SELinux when
+ SELinux integration is enabled in the cluster
+2. As a user, I want volume sharing to work correctly amongst containers in pods
+
+#### SELinux context: pod- or container- level?
+
+Currently, SELinux context is specifiable only at the container level. This is an inconvenient
+factoring for sharing volumes and other SELinux-secured resources between containers because there
+is no way in SELinux to share resources between processes with different MCS labels except to
+remove MCS labels from the shared resource. This is a big security risk: _any container_ in the
+system with the same SELinux context can use a resource that has no MCS labels. Since
+we are also not interested in isolating containers in a pod from one another, the SELinux context
+should be shared by all containers in a pod to facilitate isolation from the containers in other
+pods and sharing resources amongst all the containers of a pod.
+
+#### Volumes
+
+Kubernetes volumes can be divided into two broad categories:
+
+1. Unshared storage:
+ 1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
+ downward api. All volumes in this category delegate to `EmptyDir` for their underlying
+ storage.
+ 2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively
+ by a single pod*.
+2. Shared storage:
+ 1. `hostPath` is shared storage because it is necessarily used by a container and the host
+ 2. Network file systems such as NFS, Glusterfs, Cephfs, etc.
+ 3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
+ they may be used simultaneously by multiple pods.
+
+For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon`
+operation on the volume directory after running the volume plugin's `Setup` function. For these
+volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume
+plugin code. Some volume plugins may need to use the SELinux context during a mount operation in
+certain cases. To account for this, our design must have a way for volume plugins to state that
+a particular volume should or should not receive generic label management.
+
+For shared storage, the picture is murkier. Labels for existing shared storage will be managed
+outside Kubernetes and administrators will have to set the SELinux context of pods correctly.
+The problem of solving SELinux label management for new shared storage is outside the scope for
+this proposal.
+
+## Analysis
+
+The system needs to be able to:
+
+1. Model correctly which volumes require SELinux label management
+1. Relabel volumes with the correct SELinux context when required
+
+### Modeling whether a volume requires label management
+
+#### Unshared storage: volumes derived from `EmptyDir`
+
+Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
+that the ownership and SELinux context (when relevant) are set correctly for the volume to be
+usable.
+
+#### Unshared storage: network block devices
+
+Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
+as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
+volumes, permissions and ownership can be managed on the client side by the Kubelet when used
+exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
+`ReadWriteOnce` mode, they are effectively unshared storage.
+
+When used by multiple pods, there are many additional use-cases to analyze before we can be
+confident that we can support SELinux label management robustly with these file systems. The right
+design is one that makes it easy to experiment and develop support for ownership management with
+volume plugins to enable developers and cluster operators to continue exploring these issues.
+
+#### Shared storage: hostPath
+
+The `hostPath` volume should only be used by effective-root users, and the permissions of paths
+exposed into containers via hostPath volumes should always be managed by the cluster operator. If
+the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
+volume could effect changes in the state of arbitrary paths within the host's filesystem. This
+would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
+never perform ownership management for.
+
+#### Shared storage: network
+
+Ownership management of shared storage is a complex topic. SELinux labels for existing shared
+storage will be managed externally from Kubernetes. For this case, our API should make it simple to
+express whether a particular volume should have these concerns managed by Kubernetes.
+
+We will not attempt to address the concerns of new shared storage in this proposal.
+
+When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
+modes, it is shared storage, and thus outside the scope of this proposal.
+
+#### API requirements
+
+From the above, we know that label management must be applied:
+
+1. To some volume types always
+2. To some volume types never
+3. To some volume types *sometimes*
+
+Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it
+is desirable for other container runtime implementations to provide similar functionality.
+
+Relabeling should be an optional aspect of a volume plugin to accommodate:
+
+1. volume types for which generalized relabeling support is not sufficient
+2. testing for each volume plugin individually
+
+## Proposed Design
+
+Our design should minimize the SELinux label-handling code required in the Kubelet and volume
+plugins.
+
+### Deferral: MCS label allocation
+
+Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the
+primitives for higher level composition; making these automatic is a longer-term goal. Allocating
+groups and MCS labels are fairly complex problems in their own right, and so our proposal will not
+encompass either of these topics. There are several problems that the solution for allocation
+depends on:
+
+1. Users and groups in Kubernetes
+2. General auth policy in Kubernetes
+3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893)
+
+### API changes
+
+The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823)
+adds a `pod.Spec.SecurityContext.SELinuxOptions` field. The change to the API in this proposal is
+the addition of the semantics to this field:
+
+* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership
+management in the Kubelet have their SELinuxContext set from this field.
+
+```go
+package api
+
+type PodSecurityContext struct {
+ // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+ // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+ //
+ // This field will be used to set the SELinux context of volumes that support SELinux label
+ // management by the kubelet.
+ SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+The V1 API is extended with the same semantics:
+
+```go
+package v1
+
+type PodSecurityContext struct {
+ // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+ // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+ //
+ // This field will be used to set the SELinux context of volumes that support SELinux label
+ // management by the kubelet.
+ SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+#### API backward compatibility
+
+Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
+SELinux label management for their volumes. This is acceptable since old clients won't know about
+this field and won't have any expectation of their volumes being managed this way.
+
+The existing backward compatibility semantics for SELinux do not change at all with this proposal.
+
+### Kubelet changes
+
+The Kubelet should be modified to perform SELinux label management when required for a volume. The
+criteria to activate the kubelet SELinux label management for volumes are:
+
+1. SELinux integration is enabled in the cluster
+2. SELinux is enabled on the node
+3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set
+4. The volume plugin supports SELinux label management
+
+The `volume.Mounter` interface should have a new method added that indicates whether the plugin
+supports SELinux label management:
+
+```go
+package volume
+
+type Mounter interface {
+ // other methods omitted
+ SupportsSELinux() bool
+}
+```
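+
+For clarity, a minimal sketch combining the four criteria above is shown
+below; the helper and parameter names are hypothetical, not existing kubelet
+APIs:
+
+```go
+package main
+
+import "fmt"
+
+// SELinuxOptions stands in for the API type referenced above.
+type SELinuxOptions struct {
+    User, Role, Type, Level string
+}
+
+// relabelNeeded is a hypothetical helper that combines the four activation
+// criteria listed above: cluster-level SELinux integration, node-level
+// SELinux enablement, a pod-level SELinuxOptions setting, and plugin support.
+func relabelNeeded(clusterSELinuxEnabled, nodeSELinuxEnabled bool, opts *SELinuxOptions, pluginSupportsSELinux bool) bool {
+    return clusterSELinuxEnabled && nodeSELinuxEnabled && opts != nil && pluginSupportsSELinux
+}
+
+func main() {
+    opts := &SELinuxOptions{Level: "s0:c123,c456"}
+    fmt.Println(relabelNeeded(true, true, opts, true)) // true
+}
+```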
+
+Individual volume plugins are responsible for correctly reporting whether they support label
+management in the kubelet. In the first round of work, only `hostPath` and `emptyDir` and its
+derivations will be tested with ownership management support:
+
+| Plugin Name | SupportsOwnershipManagement |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | false |
+| `awsElasticBlockStore` | false |
+| `nfs` | false |
+| `iscsi` | false |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | false |
+| `cinder` | false |
+| `cephfs` | false |
+
+Ultimately, the matrix will theoretically look like:
+
+| Plugin Name | SupportsOwnershipManagement |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | true |
+| `awsElasticBlockStore` | true |
+| `nfs` | false |
+| `iscsi` | true |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | true |
+| `cinder` | false |
+| `cephfs` | false |
+
+In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a
+function of the container runtime implementations. Initially, we will modify the docker runtime
+implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish
+generic label management for docker containers.
+
+Volume types that require SELinux context information at mount time must be injected with, and
+must respect, the label-management enablement setting for that volume type. The proposed
+`VolumeConfig` mechanism will be used to carry information about label-management enablement to
+the volume plugins that have to manage labels individually.
+
+This allows the volume plugins to determine when they do and don't want this type of support from
+the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selinux.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/service-discovery.md b/contributors/design-proposals/service-discovery.md
new file mode 100644
index 00000000..28d1f8d4
--- /dev/null
+++ b/contributors/design-proposals/service-discovery.md
@@ -0,0 +1,69 @@
+# Service Discovery Proposal
+
+## Goal of this document
+
+To consume a service, a developer needs to know the full URL and a description of the API. Kubernetes contains the host and port information of a service, but it lacks the scheme and the path information needed if the service is not bound at the root. In this document we propose some standard Kubernetes service annotations to fill these gaps. It is important that these annotations are standardized to allow for uniform service discovery across Kubernetes implementations. Note that the example largely speaks to consuming web services, but the same concepts apply to other types of services.
+
+## Endpoint URL, Service Type
+
+A URL can accurately describe the location of a Service. A generic URL is of the following form
+
+ scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
+
+however for the purpose of service discovery we can simplify this to the following form
+
+ scheme:[//host[:port]][/]path
+
+If a user and/or password is required then this information can be passed using Kubernetes Secrets. Kubernetes contains the host and port of each service but it lacks the scheme and path.
+
+`Service Path` - Every Service has one or more endpoints. As a rule the endpoint should be located at the root "/" of the location URL, i.e. `http://172.100.1.52/`. There are cases where this is not possible and the actual service endpoint could be located at `http://172.100.1.52/cxfcdi`. The Kubernetes metadata for a service does not capture the path part, making it hard to consume this service.
+
+`Service Scheme` - Services can be deployed using different schemes. Some popular schemes include `http`,`https`,`file`,`ftp` and `jdbc`.
+
+`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service. Some examples of service-level protocols are SOAP and REST (yes, technically REST isn't a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected.
+
+## Service Description
+
+The API of a service is the point of interaction with a service consumer. The description of the API is an essential piece of information when a service consumer is created. It has become common to publish a service definition document at a well-known location on the service itself. This 'well known' location is not very standard, so it is proposed that the service developer provide the service description path and the type of Definition Language (DL) used.
+
+`Service Description Path` - To facilitate consumption of the service by clients, the location of this document is greatly helpful to the service consumer. In some cases the client-side code can be generated from such a document. It is assumed that the service description document is published somewhere on the service endpoint itself.
+
+`Service Description Language` - A number of Definition Languages (DL) have been developed to describe the service. Some examples are `WSDL`, `WADL` and `Swagger`. In order to consume a description document it is good to know the type of DL used.
+
+## Standard Service Annotations
+
+Kubernetes allows the creation of Service Annotations. Here we propose the use of the following standard annotations
+
+* `api.service.kubernetes.io/path` - the path part of the service endpoint url. An example value could be `cxfcdi`,
+* `api.service.kubernetes.io/scheme` - the scheme part of the service endpoint url. Some values could be `http` or `https`.
+* `api.service.kubernetes.io/protocol` - the protocol of the service. Known values are `SOAP`, `XML-RPC` and `REST`,
+* `api.service.kubernetes.io/description-path` - the path part of the service description document’s endpoint. It is a pretty safe assumption that the service self-documents. An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`,
+* `api.service.kubernetes.io/description-language` - the type of Description Language used. Known values are `WSDL`, `WADL`, `SwaggerJSON`, `SwaggerYAML`.
+
+The fragment below is taken from the service section of the kubernetes.json where these annotations are used:
+
+ ...
+ "objects" : [ {
+ "apiVersion" : "v1",
+ "kind" : "Service",
+ "metadata" : {
+ "annotations" : {
+ "api.service.kubernetes.io/protocol" : "REST",
+        "api.service.kubernetes.io/scheme" : "http",
+ "api.service.kubernetes.io/path" : "cxfcdi",
+ "api.service.kubernetes.io/description-path" : "cxfcdi/swagger.json",
+ "api.service.kubernetes.io/description-language" : "SwaggerJSON"
+ },
+ ...
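+
+To illustrate how a consumer might use these annotations, a minimal sketch
+follows that assembles the endpoint URL from the service's host and port plus
+the proposed annotations; the function and the default scheme are illustrative
+assumptions:
+
+```go
+package main
+
+import "fmt"
+
+// endpointURL builds the full endpoint URL for a service from its host and
+// port (already known to Kubernetes) plus the proposed annotations.
+func endpointURL(host string, port int, annotations map[string]string) string {
+    scheme := annotations["api.service.kubernetes.io/scheme"]
+    if scheme == "" {
+        scheme = "http" // assumed default when the annotation is absent
+    }
+    path := annotations["api.service.kubernetes.io/path"]
+    return fmt.Sprintf("%s://%s:%d/%s", scheme, host, port, path)
+}
+
+func main() {
+    ann := map[string]string{
+        "api.service.kubernetes.io/scheme": "http",
+        "api.service.kubernetes.io/path":   "cxfcdi",
+    }
+    fmt.Println(endpointURL("172.100.1.52", 80, ann))
+    // Output: http://172.100.1.52:80/cxfcdi
+}
+```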
+
+## Conclusion
+
+Five service annotations are proposed as a standard way to describe a service endpoint. These five annotations are promoted as a Kubernetes standard, so that services can be discovered and a service catalog can be built to facilitate service consumers.
+
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-discovery.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/service-external-name.md b/contributors/design-proposals/service-external-name.md
new file mode 100644
index 00000000..798da87f
--- /dev/null
+++ b/contributors/design-proposals/service-external-name.md
@@ -0,0 +1,161 @@
+# Service externalName
+
+Author: Tim Hockin (@thockin), Rodrigo Campos (@rata), Rudi C (@therc)
+
+Date: August 2016
+
+Status: Implementation in progress
+
+# Goal
+
+Allow a service to have a CNAME record in the cluster internal DNS service. For
+example, the lookup for a `db` service could return a CNAME that points to the
+RDS resource `something.rds.aws.amazon.com`. No proxying is involved.
+
+# Motivation
+
+There were many related issues, but we'll try to summarize them here. More info
+is on GitHub issues/PRs: #13748, #11838, #13358, #23921
+
+One motivation is to present as native cluster services, services that are
+hosted externally. Some cloud providers, like AWS, hand out hostnames (IPs are
+not static) and the user wants to refer to these services using regular
+Kubernetes tools. This was requested in bugs, at least for AWS, for RedShift,
+RDS, Elasticsearch Service, ELB, etc.
+
+Other users just want to use an external service, for example `oracle`, with DNS
+name `oracle-1.testdev.mycompany.com`, without having to keep DNS in sync, and
+are fine with a CNAME.
+
+Another use case is to "integrate" some services for local development. For
+example, consider a search service running in Kubernetes in staging, let's say
+`search-1.staging.mycompany.com`. It's running on AWS, so it resides behind an
+ELB (which has no static IP, just a hostname). A developer is building an app
+that consumes `search-1`, but doesn't want to run it on their machine (before
+Kubernetes, they didn't, either). They can just create a service that has a
+CNAME to the `search-1` endpoint in staging and be happy as before.
+
+Also, OpenShift needs this for "service refs". Service ref is really just the
+three use cases mentioned above, but in the future a way to automatically inject
+"service ref"s into namespaces via "service catalog"[1] might be considered. And
+service ref is the natural way to integrate an external service, since it takes
+advantage of native DNS capabilities already in wide use.
+
+[1]: https://github.com/kubernetes/kubernetes/pull/17543
+
+# Alternatives considered
+
+In the issues linked above, some alternatives were also considered. A partial
+summary of them follows.
+
+One option is to add the hostname to endpoints, as proposed in
+https://github.com/kubernetes/kubernetes/pull/11838. This is problematic, as
+endpoints are used in many places and users assume the required fields (such as
+IP address) are always present and valid (and check that, too). If the field is
+not required anymore or if there is just a hostname instead of the IP,
+applications could break. Even assuming those cases could be solved, the
+hostname will have to be resolved, which presents further questions and issues:
+the timeout to use, whether the lookup is synchronous or asynchronous, dealing
+with DNS TTL and more. One imperfect approach was to only resolve the hostname
+upon creation, but this was considered not a great idea. A better approach
+would be at a higher level, maybe a service type.
+
+There are more ideas described in #13748, but all raised further issues,
+ranging from using another upstream DNS server to creating a Name object
+associated with DNSs.
+
+# Proposed solution
+
+The proposed solution works at the service layer, by adding a new `externalName`
+type for services. This will create a CNAME record in the internal cluster DNS
+service. No virtual IP or proxying is involved.
+
+Using a CNAME gets rid of unnecessary DNS lookups. There's no need for the
+Kubernetes control plane to issue them, to pick a timeout for them and having to
+refresh them when the TTL for a record expires. It's way simpler to implement,
+while solving the right problem. And addressing it at the service layer avoids
+all the complications mentioned above about doing it at the endpoints layer.
+
+The solution was outlined by Tim Hockin in
+https://github.com/kubernetes/kubernetes/issues/13748#issuecomment-230397975
+
+Currently a ServiceSpec looks like this, with comments edited for clarity:
+
+```
+type ServiceSpec struct {
+ Ports []ServicePort
+
+ // If not specified, the associated Endpoints object is not automatically managed
+ Selector map[string]string
+
+ // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced
+ ClusterIP string
+
+ // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
+ Type ServiceType
+
+ // Only applies if clusterIP != "None"
+ ExternalIPs []string
+ SessionAffinity ServiceAffinity
+
+ // Only applies to type=LoadBalancer
+ LoadBalancerIP string
+ LoadBalancerSourceRanges []string
+}
+```
+
+The proposal is to change it to:
+
+```
+type ServiceSpec struct {
+ Ports []ServicePort
+
+ // If not specified, the associated Endpoints object is not automatically managed
++ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored.
+ Selector map[string]string
+
+ // "", a real IP, or "None". If not specified, this is default allocated. If "None", this Service is not load-balanced.
++ // Only applies if type is ClusterIP, NodePort, or LoadBalancer. If type is ExternalName, this is ignored.
+ ClusterIP string
+
+- // ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
++ // ExternalName, ClusterIP, NodePort, LoadBalancer. Only applies if clusterIP != "None"
+ Type ServiceType
+
++ // Only applies if type is ExternalName
++ ExternalName string
+
+ // Only applies if clusterIP != "None"
+ ExternalIPs []string
+ SessionAffinity ServiceAffinity
+
+ // Only applies to type=LoadBalancer
+ LoadBalancerIP string
+ LoadBalancerSourceRanges []string
+}
+```
+
+For example, it can be used like this:
+
+```
+apiVersion: v1
+kind: Service
+metadata:
+ name: my-rds
+spec:
+ ports:
+ - port: 12345
+  type: ExternalName
+  externalName: myapp.rds.whatever.aws.says
+```
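+
+For example, assuming the default `cluster.local` cluster domain and that the
+service above lives in the `default` namespace, a DNS lookup of
+`my-rds.default.svc.cluster.local` would return a CNAME record pointing to
+`myapp.rds.whatever.aws.says`, which the client then resolves as usual.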
+
+There is one issue to take into account, that no other alternative considered
+fixes, either: TLS. If the service is a CNAME for an endpoint that uses TLS,
+connecting with the Kubernetes name `my-service.my-ns.svc.cluster.local` may
+result in a failure during server certificate validation. This is acknowledged
+and left for future consideration. For the time being, users and administrators
+might need to ensure that the server certificate also mentions the Kubernetes
+name as an alternate host name.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/service-external-name.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/service_accounts.md b/contributors/design-proposals/service_accounts.md
new file mode 100644
index 00000000..89a3771b
--- /dev/null
+++ b/contributors/design-proposals/service_accounts.md
@@ -0,0 +1,210 @@
+# Service Accounts
+
+## Motivation
+
+Processes in Pods may need to call the Kubernetes API. For example:
+ - scheduler
+ - replication controller
+ - node controller
+ - a map-reduce type framework which has a controller that then tries to make a
+dynamically determined number of workers and watch them
+ - continuous build and push system
+ - monitoring system
+
+They also may interact with services other than the Kubernetes API, such as:
+ - an image repository, such as docker -- both when the images are pulled to
+start the containers, and for writing images in the case of pods that generate
+images.
+ - accessing other cloud services, such as blob storage, in the context of a
+large, integrated, cloud offering (hosted or private).
+ - accessing files in an NFS volume attached to the pod
+
+## Design Overview
+
+A service account binds together several things:
+ - a *name*, understood by users, and perhaps by peripheral systems, for an
+identity
+ - a *principal* that can be authenticated and [authorized](../admin/authorization.md)
+ - a [security context](security_context.md), which defines the Linux
+Capabilities, User IDs, Groups IDs, and other capabilities and controls on
+interaction with the file system and OS.
+ - a set of [secrets](secrets.md), which a container may use to access various
+networked resources.
+
+## Design Discussion
+
+A new object Kind is added:
+
+```go
+type ServiceAccount struct {
+ TypeMeta `json:",inline" yaml:",inline"`
+ ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"`
+
+ username string
+ securityContext ObjectReference // (reference to a securityContext object)
+ secrets []ObjectReference // (references to secret objects)
+}
+```
+
+The name ServiceAccount is chosen because it is widely used already (e.g. by
+Kerberos and LDAP) to refer to this type of account. Note that it has no
+relation to Kubernetes Service objects.
+
+The ServiceAccount object does not include any information that could not be
+defined separately:
+ - username can be defined however users are defined.
+ - securityContext and secrets are only referenced and are created using the
+REST API.
+
+The purpose of the serviceAccount object is twofold:
+ - to bind usernames to securityContexts and secrets, so that the username can
+be used to refer succinctly in contexts where explicitly naming securityContexts
+and secrets would be inconvenient
+ - to provide an interface to simplify allocation of new securityContexts and
+secrets.
+
+These features are explained later.
+
+### Names
+
+From the standpoint of the Kubernetes API, a `user` is any principal which can
+authenticate to Kubernetes API. This includes a human running `kubectl` on her
+desktop and a container in a Pod on a Node making API calls.
+
+There is already a notion of a username in Kubernetes, which is populated into a
+request context after authentication. However, there is no API object
+representing a user. While this may evolve, it is expected that in mature
+installations, the canonical storage of user identifiers will be handled by a
+system external to Kubernetes.
+
+Kubernetes does not dictate how to divide up the space of user identifier
+strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or
+may be qualified to allow for federated identity (`alice@example.com` vs.
+`alice@example.org`.) Naming convention may distinguish service accounts from
+user accounts (e.g. `alice@example.com` vs.
+`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but
+Kubernetes does not require this.
+
+Kubernetes also does not require that there be a distinction between human and
+Pod users. It will be possible to setup a cluster where Alice the human talks to
+the Kubernetes API as username `alice` and starts pods that also talk to the API
+as user `alice` and write files to NFS as user `alice`. But, this is not
+recommended.
+
+Instead, it is recommended that Pods and Humans have distinct identities, and
+reference implementations will make this distinction.
+
+The distinction is useful for a number of reasons:
+ - the requirements for humans and automated processes are different:
+ - Humans need a wide range of capabilities to do their daily activities.
+Automated processes often have more narrowly-defined activities.
+ - Humans may better tolerate the exceptional conditions created by
+expiration of a token. Remembering to handle this in a program is more annoying.
+So, either long-lasting credentials or automated rotation of credentials is
+needed.
+ - A Human typically keeps credentials on a machine that is not part of the
+cluster and so not subject to automatic management. A VM with a
+role/service-account can have its credentials automatically managed.
+ - the identity of a Pod cannot in general be mapped to a single human.
+ - If policy allows, it may be created by one human, and then updated by
+another, and another, until its behavior cannot be attributed to a single human.
+
+**TODO**: consider getting rid of separate serviceAccount object and just
+rolling its parts into the SecurityContext or Pod Object.
+
+The `secrets` field is a list of references to /secret objects that a process
+started as that service account should have access to in order to assert that
+role.
+
+The secrets are not inline with the serviceAccount object. This way, most or
+all users can have permission to `GET /serviceAccounts` so they can remind
+themselves what serviceAccounts are available for use.
+
+Nothing will prevent creation of a serviceAccount with two secrets of type
+`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and
+client libraries will have some behavior, TBD, to handle the case of multiple
+secrets of a given type (pick first or provide all and try each in order, etc).
+
+When a serviceAccount and a matching secret exist, then a `User.Info` for the
+serviceAccount and a `BearerToken` from the secret are added to the map of
+tokens used by the authentication process in the apiserver, and similarly for
+other types. (We might have some types that do not do anything on apiserver but
+just get pushed to the kubelet.)
+
+### Pods
+
+The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If
+this is unset, then a default value is chosen. If it is set, then the
+corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account
+Finalizer (see below).
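+
+A rough sketch of the corresponding API addition is shown below; only the new
+field appears, and the JSON tag is an assumption for illustration:
+
+```go
+package api
+
+// PodSpec sketch: only the proposed addition is shown here.
+type PodSpec struct {
+    // ... existing PodSpec fields elided ...
+
+    // ServiceAccountUsername names the service account this pod runs as.
+    // If unset, a default is chosen; if set, the Service Account Finalizer
+    // fills in the corresponding SecurityContext (see below).
+    ServiceAccountUsername string `json:"serviceAccountUsername,omitempty"`
+}
+```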
+
+TBD: how policy limits which users can make pods with which service accounts.
+
+### Authorization
+
+Kubernetes API Authorization Policies refer to users. Pods created with a
+`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to
+authenticate to the Kubernetes APIserver as a particular user. So any policy
+that is desired can be applied to them.
+
+A higher level workflow is needed to coordinate creation of serviceAccounts,
+secrets and relevant policy objects. Users are free to extend Kubernetes to put
+this business logic wherever is convenient for them, though the Service Account
+Finalizer is one place where this can happen (see below).
+
+### Kubelet
+
+The kubelet will treat as "not ready to run" (needing a finalizer to act on it)
+any Pod which has an empty SecurityContext.
+
+The kubelet will set a default, restrictive, security context for any pods
+created from non-Apiserver config sources (http, file).
+
+Kubelet watches apiserver for secrets which are needed by pods bound to it.
+
+**TODO**: how to only let kubelet see secrets it needs to know.
+
+### The service account finalizer
+
+There are several ways to use Pods with SecurityContexts and Secrets.
+
+One way is to explicitly specify the securityContext and all secrets of a Pod
+when the pod is initially created, like this:
+
+**TODO**: example of pod with explicit refs.
+
+Another way is with the *Service Account Finalizer*, a plugin process which is
+optional, and which handles business logic around service accounts.
+
+The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount
+definitions.
+
+First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no
+`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext
+and secrets references for the corresponding `serviceAccount`.
+
+Second, if ServiceAccount definitions change, it may take some actions.
+
+**TODO**: decide what actions it takes when a serviceAccount definition changes.
+Does it stop pods, or just allow someone to list ones that are out of spec? In
+general, people may want to customize this?
+
+Third, if a new namespace is created, it may create a new serviceAccount for
+that namespace. This may include a new username (e.g.
+`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`),
+a new securityContext, a newly generated secret to authenticate that
+serviceAccount to the Kubernetes API, and default policies for that service
+account.
+
+**TODO**: more concrete example. What are typical default permissions for
+default service account (e.g. readonly access to services in the same namespace
+and read-write access to events in that namespace?)
+
+Finally, it may provide an interface to automate creation of new
+serviceAccounts. In that case, the user may want to GET serviceAccounts to see
+what has been created.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/simple-rolling-update.md b/contributors/design-proposals/simple-rolling-update.md
new file mode 100644
index 00000000..c4a5f671
--- /dev/null
+++ b/contributors/design-proposals/simple-rolling-update.md
@@ -0,0 +1,131 @@
+## Simple rolling update
+
+This is a lightweight design document for simple
+[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`.
+
+Complete execution flow can be found [here](#execution-details). See the
+[example of rolling update](../user-guide/update-demo/) for more information.
+
+### Lightweight rollout
+
+Assume that we have a current replication controller named `foo` and it is
+running image `image:v1`
+
+`kubectl rolling-update foo [foo-v2] --image=myimage:v2`
+
+If the user doesn't specify a name for the 'next' replication controller, then
+the 'next' replication controller is renamed to
+the name of the original replication controller.
+
+Obviously there is a race here: if the client is killed between deleting `foo`
+and creating the new version of `foo`, the resulting state may be surprising,
+but that is acceptable. See [Recovery](#recovery) below.
+
+If the user does specify a name for the 'next' replication controller, then the
+'next' replication controller is retained with its existing name, and the old
+'foo' replication controller is deleted. For the purposes of the rollout, we add
+a unique-ifying label `kubernetes.io/deployment` to both the `foo` and
+`foo-next` replication controllers. The value of that label is the hash of the
+complete JSON representation of the `foo-next` or `foo` replication controller.
+The name of this label can be overridden by the user with the
+`--deployment-label-key` flag.
+
+#### Recovery
+
+If a rollout fails or is terminated in the middle, it is important that the user
+be able to resume the roll out. To facilitate recovery in the case of a crash of
+the updating process itself, we add the following annotations to each
+replication controller in the `kubernetes.io/` annotation namespace:
+ * `desired-replicas` The desired number of replicas for this replication
+controller (either N or zero)
+ * `update-partner` A pointer to the replication controller resource that is
+the other half of this update (syntax `<name>` the namespace is assumed to be
+identical to the namespace of this replication controller.)
+
+Recovery is achieved by issuing the same command again:
+
+```sh
+kubectl rolling-update foo [foo-v2] --image=myimage:v2
+```
+
+Whenever the rolling update command executes, the kubectl client looks for
+replication controllers called `foo` and `foo-next`; if both exist, an attempt
+is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is
+created, and the rollout is a new rollout. If `foo` doesn't exist, then it is
+assumed that the rollout is nearly completed, and `foo-next` is renamed to
+`foo`. Details of the execution flow are given below.
+
+
+### Aborting a rollout
+
+Abort is assumed to want to reverse a rollout in progress.
+
+`kubectl rolling-update foo [foo-v2] --rollback`
+
+This is really just semantic sugar for:
+
+`kubectl rolling-update foo-v2 foo`
+
+With the added detail that it moves the `desired-replicas` annotation from
+`foo-v2` to `foo`
+
+
+### Execution Details
+
+For the purposes of this example, assume that we are rolling from `foo` to
+`foo-next` where the only change is an image update from `v1` to `v2`
+
+If the user doesn't specify a `foo-next` name, then it is discovered from
+the `update-partner` annotation on `foo`. If that annotation doesn't exist,
+then `foo-next` is synthesized using the pattern
+`<controller-name>-<hash-of-next-controller-JSON>`
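+
+A minimal sketch of that synthesis follows; the hash algorithm and encoding
+here are illustrative, and the real kubectl implementation may differ:
+
+```go
+package main
+
+import (
+    "crypto/sha256"
+    "encoding/json"
+    "fmt"
+)
+
+// nextControllerName synthesizes the "foo-next" name from the controller
+// name and a hash of the next controller's JSON representation.
+func nextControllerName(name string, nextController interface{}) (string, error) {
+    data, err := json.Marshal(nextController)
+    if err != nil {
+        return "", err
+    }
+    sum := sha256.Sum256(data)
+    // Use a short prefix of the hash to keep the name readable.
+    return fmt.Sprintf("%s-%x", name, sum[:4]), nil
+}
+
+func main() {
+    next := map[string]string{"image": "myimage:v2"}
+    n, _ := nextControllerName("foo", next)
+    fmt.Println(n) // e.g. foo-8a1c2f3d
+}
+```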
+
+#### Initialization
+
+ * If `foo` and `foo-next` do not exist:
+ * Exit, and indicate an error to the user, that the specified controller
+doesn't exist.
+ * If `foo` exists, but `foo-next` does not:
+   * Create `foo-next`, populate it with the `v2` image, set
+`desired-replicas` to `foo.Spec.Replicas`
+ * Goto Rollout
+ * If `foo-next` exists, but `foo` does not:
+ * Assume that we are in the rename phase.
+ * Goto Rename
+ * If both `foo` and `foo-next` exist:
+ * Assume that we are in a partial rollout
+ * If `foo-next` is missing the `desired-replicas` annotation
+ * Populate the `desired-replicas` annotation to `foo-next` using the
+current size of `foo`
+ * Goto Rollout
+
+#### Rollout
+
+ * While size of `foo-next` < `desired-replicas` annotation on `foo-next`
+ * increase size of `foo-next`
+   * if size of `foo` > 0, decrease size of `foo`
+ * Goto Rename
+
+#### Rename
+
+ * delete `foo`
+ * create `foo` that is identical to `foo-next`
+ * delete `foo-next`
+
+#### Abort
+
+ * If `foo-next` doesn't exist
+ * Exit and indicate to the user that they may want to simply do a new
+rollout with the old version
+ * If `foo` doesn't exist
+ * Exit and indicate not found to the user
+ * Otherwise, `foo-next` and `foo` both exist
+ * Set `desired-replicas` annotation on `foo` to match the annotation on
+`foo-next`
+ * Goto Rollout with `foo` and `foo-next` trading places.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/stateful-apps.md b/contributors/design-proposals/stateful-apps.md
new file mode 100644
index 00000000..c5196f2a
--- /dev/null
+++ b/contributors/design-proposals/stateful-apps.md
@@ -0,0 +1,363 @@
+# StatefulSets: Running pods which need strong identity and storage
+
+## Motivation
+
+Many examples of clustered software systems require stronger guarantees per instance than are provided
+by the Replication Controller. Instances of these systems typically require:
+
+1. Data per instance which should not be lost even if the pod is deleted, typically on a persistent volume
+ * Some cluster instances may have tens of TB of stored data - forcing new instances to replicate data
+ from other members over the network is onerous
+2. A stable and unique identity associated with that instance of the storage - such as a unique member id
+3. A consistent network identity that allows other members to locate the instance even if the pod is deleted
+4. A predictable number of instances to ensure that systems can form a quorum
+ * This may be necessary during initialization
+5. Ability to migrate from node to node with stable network identity (DNS name)
+6. The ability to scale up in a controlled fashion, but are very rarely scaled down without human
+ intervention
+
+Kubernetes should expose a pod controller (a StatefulSet) that satisfies these requirements in a flexible
+manner. It should be easy for users to manage and reason about the behavior of this set. An administrator
+with familiarity in a particular cluster system should be able to leverage this controller and its
+supporting documentation to run that clustered system on Kubernetes. It is expected that some adaptation
+is required to support each new cluster.
+
+This resource is **stateful** because it offers an easy way to link a pod's network identity to its storage
+identity and because it is intended to be used to run software that holds state for other
+components. That does not mean that all stateful applications *must* use StatefulSets, but the tradeoffs
+in this resource are intended to facilitate holding state in the cluster.
+
+
+## Use Cases
+
+The software listed below forms the primary use-cases for a StatefulSet on the cluster - problems encountered
+while adapting these for Kubernetes should be addressed in a final design.
+
+* Quorum with Leader Election
+ * MongoDB - in replica set mode forms a quorum with an elected leader, but instances must be preconfigured
+ and have stable network identities.
+ * ZooKeeper - forms a quorum with an elected leader, but is sensitive to cluster membership changes and
+ replacement instances *must* present consistent identities
+ * etcd - forms a quorum with an elected leader, can alter cluster membership in a consistent way, and
+ requires stable network identities
+* Decentralized Quorum
+  * Cassandra - allows flexible consistency and distributes data via innate hash ring sharding; it is also
+    flexible with respect to scaling and more likely to support members that come and go. Scale down may trigger massive
+ rebalances.
+* Active-active
+ * Galera - has multiple active masters which must remain in sync
+* Leader-followers
+ * Spark in standalone mode - A single unilateral leader and a set of workers
+
+
+## Background
+
+Replica sets are designed with a weak guarantee - that there should be N replicas of a particular
+pod template. Each pod instance varies only by name, and the replication controller errs on the side of
+ensuring that N replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful
+deletion, for instance, or by being able to pick arbitrary pods to scale down). In addition, pods by design
+have no stable network identity other than their assigned pod IP, which can change over the lifetime of a pod
+resource. ReplicaSets are best leveraged for stateless, shared-nothing, zero-coordination,
+embarrassingly-parallel, or fungible software.
+
+While it is possible to emulate the guarantees described above by leveraging multiple replication controllers
+(for distinct pod templates and pod identities) and multiple services (for stable network identity), the
+resulting objects are hard to maintain and must be copied manually in order to scale a cluster.
+
+By contrast, a DaemonSet *can* offer some of the guarantees above, by leveraging Nodes as stable, long-lived
+entities. An administrator might choose a set of nodes, label them a particular way, and create a
+DaemonSet that maps pods to each node. The storage of the node itself (which could be network attached
+storage, or a local SAN) is the persistent storage. The network identity of the node is the stable
+identity. However, while there are examples of clustered software that benefit from close association to
+a node, this creates an undue burden on administrators to design their cluster to satisfy these
+constraints, when a goal of Kubernetes is to decouple system administration from application management.
+
+
+## Design Assumptions
+
+* **Specialized Controller** - Rather than increase the complexity of the ReplicaSet to satisfy two distinct
+ use cases, create a new resource that assists users in solving this particular problem.
+* **Safety first** - Running a clustered system on Kubernetes should be no harder
+ than running a clustered system off Kube. Authors should be given tools to guard against common cluster
+  failure modes (split brain, phantom member) to prevent introducing more failure modes. Experienced
+ distributed systems designers can implement more sophisticated solutions than StatefulSet if necessary -
+ new users should not become vulnerable to additional failure modes through an overly flexible design.
+* **Controlled scaling** - While flexible scaling is important for some clusters, other examples of clusters
+ do not change scale without significant external intervention. Human intervention may be required after
+ scaling. Changing scale during cluster operation can lead to split brain in quorum systems. It should be
+ possible to scale, but there may be responsibilities on the set author to correctly manage the scale.
+* **No generic cluster lifecycle** - Rather than design a general purpose lifecycle for clustered software,
+ focus on ensuring the information necessary for the software to function is available. For example,
+ rather than providing a "post-creation" hook invoked when the cluster is complete, provide the necessary
+ information to the "first" (or last) pod to determine the identity of the remaining cluster members and
+ allow it to manage its own initialization.
+
+
+## Proposed Design
+
+Add a new resource to Kubernetes to represent a set of pods that are individually distinct but each
+individual can safely be replaced-- the name **StatefulSet** is chosen to convey that the individual members of
+the set are themselves "stateful" and thus each one is preserved. Each member has an identity, and there will
+always be a member that thinks it is the "first" one.
+
+The StatefulSet is responsible for creating and maintaining a set of **identities** and ensuring that there is
+one pod and zero or more **supporting resources** for each identity. There should never be more than one pod
+or unique supporting resource per identity at any one time. A new pod can be created for an identity only
+if a previous pod has been fully terminated (reached its graceful termination limit or cleanly exited).
+
+A StatefulSet has 0..N **members**, each with a unique **identity** which is a name that is unique within the
+set.
+
+```
+type StatefulSet struct {
+ ObjectMeta
+
+ Spec StatefulSetSpec
+ ...
+}
+
+type StatefulSetSpec struct {
+ // Replicas is the desired number of replicas of the given template.
+ // Each replica is assigned a unique name of the form `name-$replica`
+ // where replica is in the range `0 - (replicas-1)`.
+ Replicas int
+
+ // A label selector that "owns" objects created under this set
+ Selector *LabelSelector
+
+ // Template is the object describing the pod that will be created - each
+ // pod created by this set will match the template, but have a unique identity.
+ Template *PodTemplateSpec
+
+ // VolumeClaimTemplates is a list of claims that members are allowed to reference.
+ // The StatefulSet controller is responsible for mapping network identities to
+ // claims in a way that maintains the identity of a member. Every claim in
+ // this list must have at least one matching (by name) volumeMount in one
+ // container in the template. A claim in this list takes precedence over
+ // any volumes in the template, with the same name.
+ VolumeClaimTemplates []PersistentVolumeClaim
+
+ // ServiceName is the name of the service that governs this StatefulSet.
+ // This service must exist before the StatefulSet, and is responsible for
+ // the network identity of the set. Members get DNS/hostnames that follow the
+ // pattern: member-specific-string.serviceName.default.svc.cluster.local
+ // where "member-specific-string" is managed by the StatefulSet controller.
+ ServiceName string
+}
+```
+
+Like a replication controller, a StatefulSet may be targeted by an autoscaler. The StatefulSet makes no assumptions
+about upgrading or altering the pods in the set for now - instead, the user can trigger graceful deletion
+and the StatefulSet will replace the terminated member with the newer template once it exits. Future proposals
+may offer update capabilities. A StatefulSet requires pods with `RestartPolicy=Always`. The addition of forgiveness may be
+necessary in the future to increase the safety of the controller recreating pods.
+
+
+### How identities are managed
+
+A key question is whether scaling down a StatefulSet and then scaling it back up should reuse identities. If not,
+scaling down becomes a destructive action (an admin cannot recover by scaling back up). Given the safety
+first assumption, identity reuse seems the correct default. This implies that identity assignment should
+be deterministic and not subject to controller races (a controller that has crashed during scale up should
+assign the same identities on restart, and two concurrent controllers should decide on the same outcome
+identities).
+
+The simplest way to manage identities, and easiest to understand for users, is a numeric identity system
+starting at I=0 that ranges up to the current replica count and is contiguous.
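+
+For illustration only, the naming this implies could be computed as below (a sketch, not the controller's
+actual code; `fmt` is assumed to be imported):
+
+```
+// identityNames returns the contiguous, deterministic list of member identities for a
+// set named `name` with `replicas` members: name-0, name-1, ..., name-(replicas-1).
+// Any two controllers computing this list for the same spec agree on the result.
+func identityNames(name string, replicas int) []string {
+    names := make([]string, 0, replicas)
+    for i := 0; i < replicas; i++ {
+        names = append(names, fmt.Sprintf("%s-%d", name, i))
+    }
+    return names
+}
+```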
+
+Future work:
+
+* Cover identity reclamation - cleaning up resources for identities that are no longer in use.
+* Allow more sophisticated identity assignment - instead of `{name}-{0 - replicas-1}`, allow subsets and
+ complex indexing.
+
+### Controller behavior
+
+When a StatefulSet is scaled up, the controller must create both pods and supporting resources for
+each new identity. The controller must create supporting resources for the pod before creating the
+pod. If a supporting resource with the appropriate name already exists, the controller should treat that as
+creation succeeding. If a supporting resource cannot be created, the controller should flag an error to
+status, back-off (like a scheduler or replication controller), and try again later. Each resource created
+by a StatefulSet controller must have a set of labels that match the selector, support orphaning, and have a
+controller back reference annotation identifying the owning StatefulSet by name and UID.
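+
+As an illustrative sketch only (the receiver type and helpers below are assumptions, not part of this
+proposal), the ordering and error handling for a single identity might look like:
+
+```
+// syncIdentity creates the supporting resources for one identity and only then creates
+// the pod. A supporting resource that already exists counts as success; any other
+// failure is recorded on status and retried later with back-off.
+func (c *statefulSetController) syncIdentity(set *StatefulSet, identity string) error {
+    if err := c.ensureSupportingResources(set, identity); err != nil {
+        c.recordFailure(set, identity, err) // surfaced on the StatefulSet status
+        c.queue.AddRateLimited(set)         // back-off, like other controllers
+        return err
+    }
+    // Every created resource carries labels matching the selector and a controller
+    // back reference annotation naming the StatefulSet and its UID.
+    return c.ensurePod(set, identity)
+}
+```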
+
+When a StatefulSet is scaled down, the pod for the removed identity should be deleted. It is less clear what the
+controller should do to supporting resources. If every pod requires a PV, and a user accidentally scales
+up to N=200 and then back down to N=3, leaving 197 PVs lying around may be undesirable (potential for
+abuse). On the other hand, a cluster of 5 that is accidentally scaled down to 3 might irreparably destroy
+the cluster if the PVs for identities 4 and 5 are deleted (may not be recoverable). For the initial proposal,
+leaving the supporting resources is the safest path (safety first) with a potential future policy applied
+to the StatefulSet for how to manage supporting resources (DeleteImmediately, GarbageCollect, Preserve).
+
+The controller should reflect summary counts of resources on the StatefulSet status to enable clients to easily
+understand the current state of the set.
+
+### Parameterizing pod templates and supporting resources
+
+Since each pod needs a unique and distinct identity, and the pod needs to know its own identity, the
+StatefulSet must allow a pod template to be parameterized by the identity assigned to the pod. The pods that
+are created should be easily identified by their cluster membership.
+
+Because that pod needs access to stable storage, the StatefulSet may specify a template for one or more
+**persistent volume claims** that can be used for each distinct pod. The name of the volume claim must
+match a volume mount within the pod template.
+
+Future work:
+
+* In the future other resources may be added that must also be templated - for instance, secrets (unique secret per member), config data (unique config per member), and, further out, arbitrary extension resources.
+* Consider allowing the identity value itself to be passed as an environment variable via the downward API.
+* Consider allowing per-identity values to be specified that are passed to the pod template or volume claim.
+
+
+### Accessing pods by stable network identity
+
+In order to provide stable network identity, given that pods may not assume pod IP is constant over the
+lifetime of a pod, it must be possible to have a resolvable DNS name for the pod that is tied to the
+pod identity. There are two broad classes of clustered services - those that require clients to know
+all members of the cluster (load balancer intolerant) and those that are amenable to load balancing.
+For the former, clients must also be able to easily enumerate the list of DNS names that represent the
+member identities and access them inside the cluster. Within a pod, it must be possible for containers
+to find and access that DNS name in order to identify themselves to the cluster.
+
+Since a pod is expected to be controlled by a single controller at a time, it is reasonable for a pod to
+have a single identity at a time. Therefore, a service can expose a pod by its identity in a unique
+fashion via DNS by leveraging information written to the endpoints by the endpoints controller.
+
+The end result might be DNS resolution as follows:
+
+```
+# service mongodb pointing to pods created by StatefulSet mdb, with identities mdb-1, mdb-2, mdb-3
+
+dig mongodb.namespace.svc.cluster.local +short A
+172.130.16.50
+
+dig mdb-1.mongodb.namespace.svc.cluster.local +short A
+# IP of pod created for mdb-1
+
+dig mdb-2.mongodb.namespace.svc.cluster.local +short A
+# IP of pod created for mdb-2
+
+dig mdb-3.mongodb.namespace.svc.cluster.local +short A
+# IP of pod created for mdb-3
+```
+
+This is currently implemented via an annotation on pods, which is surfaced to endpoints, and finally
+surfaced as DNS on the service that exposes those pods.
+
+```
+// The pods created by this StatefulSet will have the DNS names "mysql-0.db.NAMESPACE.svc.cluster.local"
+// and "mysql-1.db.NAMESPACE.svc.cluster.local"
+kind: StatefulSet
+metadata:
+ name: mysql
+spec:
+ replicas: 2
+ serviceName: db
+ template:
+ spec:
+ containers:
+ - image: mysql:latest
+
+// Example pod created by stateful set
+kind: Pod
+metadata:
+ name: mysql-0
+ annotations:
+ pod.beta.kubernetes.io/hostname: "mysql-0"
+ pod.beta.kubernetes.io/subdomain: db
+spec:
+ ...
+```
+
+
+### Preventing duplicate identities
+
+The StatefulSet controller is expected to execute like other controllers, as a single writer. However, when
+considering designing for safety first, the possibility of the controller running concurrently cannot
+be overlooked, and so it is important to ensure that duplicate pod identities cannot arise.
+
+There are two mechanisms to achieve this at the current time. One is to leverage unique names for pods
+that carry the identity of the pod - this prevents duplication because etcd 2 can guarantee single
+key transactionality. The other is to use the status field of the StatefulSet to coordinate membership
+information. It is possible to leverage both at this time and to encourage users not to assume the pod
+name is significant, but users are likely to take what they can get. A downside of using unique names
+is that it complicates pre-warming of pods and pod migration - on the other hand, those are also
+advanced use cases that might be better solved by another, more specialized controller (a
+MigratableStatefulSet).
+
+
+### Managing lifecycle of members
+
+The most difficult aspect of managing a member set is ensuring that all members see a consistent configuration
+state of the set. Without a strongly consistent view of cluster state, most clustered software is
+vulnerable to split brain. For example, a new set is created with 3 members. If the node containing the
+first member is partitioned from the rest of the cluster, it may not observe the other two members and may
+thus bootstrap its own cluster of size 1. The other two members still see the first member as part of the
+set, so they form a cluster of believed size 3. Each side sees a quorum of the membership it believes in,
+which can lead to data loss if not detected.
+
+StatefulSets should provide basic mechanisms that enable a consistent view of cluster state to be possible,
+and in the future provide more tools to reduce the amount of work necessary to monitor and update that
+state.
+
+The first mechanism is that the StatefulSet controller blocks creation of new pods until all previous pods
+are reporting a healthy status. The StatefulSet controller uses the strong serializability of the underlying
+etcd storage to ensure that it acts on a consistent view of the cluster membership (the pods and their
+status), and serializes the creation of pods based on the health state of other pods. This simplifies
+reasoning about how to initialize a StatefulSet, but is not sufficient to guarantee split brain does not
+occur.
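+
+A sketch of that gating (illustrative only; it assumes the controller can list the set's pods indexed by
+ordinal and has some notion of `isHealthy`):
+
+```
+// nextOrdinalToCreate returns the ordinal of the next member to create, or -1 if
+// creation must wait. Member i is only created once members 0..i-1 all exist and
+// report healthy, which serializes initialization of the set.
+func nextOrdinalToCreate(pods []*Pod, replicas int) int {
+    for i := 0; i < replicas; i++ {
+        if i >= len(pods) || pods[i] == nil {
+            return i
+        }
+        if !isHealthy(pods[i]) {
+            return -1 // block until the earlier member reports healthy
+        }
+    }
+    return -1 // all members exist and are healthy
+}
+```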
+
+The second mechanism is having each "member" use the state of the cluster and transform that into cluster
+configuration or decisions about membership. This is currently implemented using a sidecar container
+that watches the master (via DNS today, although in the future this may be to endpoints directly) to
+receive an ordered history of events, and then applies those safely to the configuration. Note that
+for this to be safe, the history received must be strongly consistent (must be the same order of
+events from all observers) and the config change must be bounded (an old config version may not
+be allowed to exist forever). For now, this is known as a 'babysitter' (working name) and is intended
+to help identify abstractions that can be provided by the StatefulSet controller in the future.
+
+
+## Future Evolution
+
+Criteria for advancing to beta:
+
+* StatefulSets do not accidentally lose data due to cluster design - the pod safety proposal will
+ help ensure StatefulSets can guarantee **at most one** instance of a pod identity is running at
+ any time.
+* A design consensus is reached on StatefulSet upgrades.
+
+Criteria for advancing to GA:
+
+* StatefulSets solve 80% of clustered software configuration with minimal input from users and are safe from common split brain problems
+  * Several representative examples of StatefulSets from the community have been proven/tested to be "correct" for a variety of partition problems (possibly via Jepsen or similar)
+  * Sufficient testing and soak time have occurred (as for Deployments) to ensure the necessary features are in place.
+* StatefulSets are considered easy to use for deploying clustered software for common cases
+
+Requested features:
+
+* Stable IPs per member, both for clustered software like Cassandra that caches resolved DNS addresses and for access from outside the cluster
+ * Individual services can potentially be used to solve this in some cases.
+* Send more / simpler events to each pod from a central spot via the "signal API"
+* Persistent local volumes that can leverage local storage
+* Allow pods within the StatefulSet to identify a "leader" in a way that can direct requests from a service to a particular member.
+* Provide upgrades of a StatefulSet in a controllable way (like Deployments).
+
+
+## Overlap with other proposals
+
+* Jobs can be used to perform a run-once initialization of the cluster
+* Init containers can be used to prime PVs and config with the identity of the pod.
+* Templates and how fields are overridden in the resulting object should have broad alignment
+* DaemonSet defines the core model for how new controllers sit alongside replication controller and
+ how upgrades can be implemented outside of Deployment objects.
+
+
+## History
+
+StatefulSets were formerly known as PetSets and were renamed to be less "cutesy" and more descriptive as a
+prerequisite to moving to beta. No animals were harmed in the making of this proposal.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/stateful-apps.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/synchronous-garbage-collection.md b/contributors/design-proposals/synchronous-garbage-collection.md
new file mode 100644
index 00000000..c5157408
--- /dev/null
+++ b/contributors/design-proposals/synchronous-garbage-collection.md
@@ -0,0 +1,175 @@
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Overview](#overview)
+- [API Design](#api-design)
+ - [Standard Finalizers](#standard-finalizers)
+ - [OwnerReference](#ownerreference)
+ - [DeleteOptions](#deleteoptions)
+- [Components changes](#components-changes)
+ - [API Server](#api-server)
+ - [Garbage Collector](#garbage-collector)
+ - [Controllers](#controllers)
+- [Handling circular dependencies](#handling-circular-dependencies)
+- [Unhandled cases](#unhandled-cases)
+- [Implications to existing clients](#implications-to-existing-clients)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Overview
+
+Users of the server-side garbage collection need to determine if the garbage collection is done. For example:
+* Currently `kubectl delete rc` blocks until all the pods are terminating. To convert to use server-side garbage collection, kubectl has to be able to determine if the garbage collection is done.
+* [#19701](https://github.com/kubernetes/kubernetes/issues/19701#issuecomment-236997077) is a use case where the user needs to wait for all service dependencies to be garbage collected and their names released, before she recreates the dependencies.
+
+We define garbage collection as "done" when all the dependents are deleted from the key-value store, rather than merely in the terminating state. There are two reasons: *i)* for `Pod`s, the most common garbage, it is only once they are deleted from the key-value store that we know the kubelet has released the resources they occupy; *ii)* some users need to recreate objects with the same names, so they must wait for the old objects to be deleted from the key-value store. (This limitation exists because we index objects by their names in the key-value store today.)
+
+Synchronous Garbage Collection is a best-effort (see [unhandled cases](#unhandled-cases)) mechanism that allows users to determine whether the garbage collection is done: after the API server receives a deletion request for an owning object, the object keeps existing in the key-value store until all its dependents are deleted from the key-value store by the garbage collector.
+
+Tracking issue: https://github.com/kubernetes/kubernetes/issues/29891
+
+# API Design
+
+## Standard Finalizers
+
+We will introduce a new standard finalizer:
+
+```go
+const GCFinalizer string = "DeletingDependents"
+```
+
+This finalizer indicates the object is terminating and is waiting for its dependents whose `OwnerReference.BlockOwnerDeletion` is true to be deleted.
+
+## OwnerReference
+
+```go
+OwnerReference {
+ ...
+ // If true, AND if the owner has the "DeletingDependents" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed.
+ // Defaults to false.
+ // To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned.
+ BlockOwnerDeletion *bool
+}
+```
+
+The initial draft of the proposal did not include this field and it had a security loophole: a user who is only authorized to update one resource can set ownerReference to block the synchronous GC of other resources. Requiring users to explicitly set `BlockOwnerDeletion` allows the master to properly authorize the request.
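+
+For illustration, a controller creating a dependent pod might set the reference as follows (the surrounding objects and the exact apiVersion are placeholders, not prescribed by this proposal):
+
+```go
+// Hypothetical owner reference attached by a ReplicaSet controller to a pod it creates.
+// blockOwnerDeletion=true makes a foreground deletion of the ReplicaSet wait until this
+// pod has been removed from the key-value store.
+block := true
+pod.OwnerReferences = append(pod.OwnerReferences, OwnerReference{
+    APIVersion:         "extensions/v1beta1",
+    Kind:               "ReplicaSet",
+    Name:               rs.Name,
+    UID:                rs.UID,
+    BlockOwnerDeletion: &block,
+})
+```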
+
+## DeleteOptions
+
+```go
+DeleteOptions {
+  ...
+ // Whether and how garbage collection will be performed.
+ // Defaults to DeletePropagationDefault
+ // Either this field or OrphanDependents may be set, but not both.
+ PropagationPolicy *DeletePropagationPolicy
+}
+
+type DeletePropagationPolicy string
+
+const (
+ // The default depends on the existing finalizers on the object and the type of the object.
+ DeletePropagationDefault DeletePropagationPolicy = "DeletePropagationDefault"
+ // Orphans the dependents
+ DeletePropagationOrphan DeletePropagationPolicy = "DeletePropagationOrphan"
+ // Deletes the object from the key-value store, the garbage collector will delete the dependents in the background.
+ DeletePropagationBackground DeletePropagationPolicy = "DeletePropagationBackground"
+ // The object exists in the key-value store until the garbage collector deletes all the dependents whose ownerReference.blockOwnerDeletion=true from the key-value store.
+ // API sever will put the "DeletingDependents" finalizer on the object, and sets its deletionTimestamp.
+  // This policy is cascading, i.e., the dependents will themselves be deleted with DeletePropagationForeground.
+ DeletePropagationForeground DeletePropagationPolicy = "DeletePropagationForeground"
+)
+```
+
+The `DeletePropagationForeground` policy represents the synchronous GC mode.
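+
+As a sketch, a caller requesting this mode would send something like the following (the client plumbing is assumed and not defined by this proposal):
+
+```go
+// Request foreground (synchronous) garbage collection of a ReplicaSet: the ReplicaSet
+// stays in the key-value store until all of its dependents with blockOwnerDeletion=true
+// have been deleted from the key-value store.
+policy := DeletePropagationForeground
+options := DeleteOptions{PropagationPolicy: &policy}
+// err := client.ReplicaSets(namespace).Delete("frontend", &options)
+```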
+
+`DeleteOptions.OrphanDependents *bool` will be marked as deprecated and will be removed in 1.7. Validation code will make sure only one of `OrphanDependents` and `PropagationPolicy` may be set. We decided not to add another `DeleteAfterDependentsDeleted *bool`, because together with `OrphanDependents`, it will result in 9 possible combinations and is thus confusing.
+
+The conversion rules are described in the following table:
+
+| 1.5 `DeleteOptions.PropagationPolicy` | pre-1.5 `DeleteOptions.OrphanDependents` |
+|------------------------------------------|--------------------------|
+| DeletePropagationDefault | OrphanDependents==nil |
+| DeletePropagationOrphan | *OrphanDependents==true |
+| DeletePropagationBackground | *OrphanDependents==false |
+| DeletePropagationForeground | N/A |
+
+# Components changes
+
+## API Server
+
+`Delete()` function checks `DeleteOptions.PropagationPolicy`. If the policy is `DeletePropagationForeground`, the API server will update the object instead of deleting it, add the "DeletingDependents" finalizer, remove the "OrphanDependents" finalizer if it's present, and set the `ObjectMeta.DeletionTimestamp`.
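+
+A simplified sketch of that branch (`removeFinalizer`, `currentTime`, and `storage` are illustrative helpers, not real APIs):
+
+```go
+// Foreground deletion turns the DELETE into an update that marks the object as deleting;
+// the actual removal is left to the garbage collector.
+if options.PropagationPolicy != nil && *options.PropagationPolicy == DeletePropagationForeground {
+    obj.Finalizers = removeFinalizer(obj.Finalizers, "OrphanDependents") // drop orphaning if present
+    obj.Finalizers = append(obj.Finalizers, GCFinalizer)                 // "DeletingDependents"
+    now := currentTime()
+    obj.DeletionTimestamp = &now
+    return storage.Update(ctx, obj) // the object stays in the key-value store for now
+}
+// other policies fall through to the normal deletion path
+```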
+
+When validating the ownerReference, the API server needs to query the `Authorizer` to check if the user has "delete" permission on the owner object. It returns 422 if the user does not have that permission but intends to set `OwnerReference.BlockOwnerDeletion` to true.
+
+## Garbage Collector
+
+**Modifications to processEvent()**
+
+Currently `processEvent()` manages GC's internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to:
+
+* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added to the `dirtyQueue` and marked as "GC in progress" in `uidToNode`.
+* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force `processItem()` (described next) to re-check whether all dependents of the owner are deleted.
+
+**Modifications to processItem()**
+
+Currently `processItem()` consumes the `dirtyQueue` and requests the API server to delete an item if none of its owners exist. To support synchronous GC, it has to:
+
+* treat an owner as "not exist" if `owner.DeletionTimestamp != nil && !owner.Finalizers.Has(OrphanFinalizer)`, otherwise synchronous GC will not progress because the owner keeps existing in the key-value store.
+* when deleting dependents, if the owner's finalizers include `DeletingDependents`, it should use `DeletePropagationForeground` as the GC policy.
+* if an object has multiple owners, some owners still exist while other owners are in the synchronous GC stage, then according to the existing logic of GC, the object wouldn't be deleted. To unblock the synchronous GC of owners, `processItem()` has to remove the ownerReferences pointing to them.
+
+In addition, if an object popped from `dirtyQueue` is marked as "GC in progress", `processItem()` treats it specially (a rough sketch follows this list):
+
+* To avoid racing with another controller, it requeues the object if `observedGeneration < Generation`. This is best-effort, see [unhandled cases](#unhandled-cases).
+* Checks if the object has dependents
+ * If not, send a PUT request to remove the `GCFinalizer`;
+  * If so, then add all dependents to the `dirtyQueue`; we need bookkeeping to avoid adding the dependents repeatedly if the owner gets in the `synchronousGC queue` multiple times.
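+
+Here is that sketch (the `node` type and helper methods are illustrative, not real APIs):
+
+```go
+// An owner marked "GC in progress" is released only once it has no remaining blocking
+// dependents; otherwise its blocking dependents are (re-)enqueued.
+func (gc *GarbageCollector) processDeletingDependentsItem(owner *node) error {
+    blocking := owner.blockingDependents() // dependents with blockOwnerDeletion=true
+    if len(blocking) == 0 {
+        // A PUT that drops the "DeletingDependents" finalizer lets the API server
+        // finish deleting the owner.
+        return gc.removeFinalizer(owner, GCFinalizer)
+    }
+    for _, dep := range blocking {
+        gc.dirtyQueue.Add(dep) // a real implementation must avoid re-adding these repeatedly
+    }
+    return nil
+}
+```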
+
+## Controllers
+
+To utilize the synchronous garbage collection feature, controllers (e.g., the replicaset controller) need to set `OwnerReference.BlockOwnerDeletion` when creating dependent objects (e.g. pods).
+
+# Handling circular dependencies
+
+SynchronousGC will enter a deadlock in the presence of circular dependencies. The garbage collector can break the circle by lazily breaking circular dependencies: when `processItem()` processes an object, if it finds the object and all of its owners have the `GCFinalizer`, it removes the `GCFinalizer` from the object.
+
+Note that the approach is not rigorous and can thus have false positives. For example, if a user first sends a SynchronousGC delete request for an object, then sends the delete request for its owner, `processItem()` will be fooled into believing there is a circle. We expect users not to do this. We can make the circle detection more rigorous if needed.
+
+Circular dependencies are regarded as user error. If needed, we can add more guarantees to handle such cases later.
+
+# Unhandled cases
+
+* If the GC observes the owning object with the `GCFinalizer` before it observes the creation of all the dependents, GC will remove the finalizer from the owning object before all dependents are gone. Hence, synchronous GC is best-effort, though we guarantee that the dependents will be deleted eventually. We face a similar case when handling OrphanFinalizer, see [GC known issues](https://github.com/kubernetes/kubernetes/issues/26120).
+
+# Implications to existing clients
+
+Finalizers break an assumption that many Kubernetes components have: a deletion request with `grace period=0` will immediately remove the object from the key-value store. This is not true if an object has pending finalizers: the object will continue to exist, and currently the API server will not return an error in this case.
+
+**Namespace controller** suffered from this [problem](https://github.com/kubernetes/kubernetes/issues/32519) and was fixed in [#32524](https://github.com/kubernetes/kubernetes/pull/32524) by retrying every 15s if there are objects with pending finalizers to be removed from the key-value store. An object with a pending `GCFinalizer` might take an arbitrarily long time to be deleted, so namespace deletion might time out.
+
+**kubelet** deletes the pod from the key-value store after all its containers are terminated ([code](../../pkg/kubelet/status/status_manager.go#L441-L443)). It also assumes that if the API server does not return an error, the pod is removed from the key-value store. Breaking that assumption will not break the `kubelet`, though, because the pod must already be in the terminated phase, so the `kubelet` no longer needs to manage it.
+
+**Node controller** forcefully deletes pod if the pod is scheduled to a node that does not exist ([code](../../pkg/controller/node/nodecontroller.go#L474)). The pod will continue to exist if it has pending finalizers. The node controller will futilely retry the deletion. Also, the `node controller` forcefully deletes pods before deleting the node ([code](../../pkg/controller/node/nodecontroller.go#L592)). If the pods have pending finalizers, the `node controller` will go ahead deleting the node, leaving those pods behind. These pods will be deleted from the key-value store when the pending finalizers are removed.
+
+**Podgc** deletes terminated pods if there are too many of them in the cluster. We need to make sure finalizers on Pods are taken off quickly enough so that the progress of `Podgc` is not affected.
+
+**Deployment controller** adopts existing `ReplicaSet` (RS) if its template matches. If a matching RS has a pending `GCFinalizer`, the deployment should adopt it and take its pods into account, but shouldn't try to mutate it, because the RS controller will ignore an RS that's being deleted. Hence, the `deployment controller` should wait for the RS to be deleted, and then create a new one.
+
+**Replication controller manager**, **Job controller**, and **ReplicaSet controller** ignore pods in terminated phase, so pods with pending finalizers will not block these controllers.
+
+**StatefulSet controller** will be blocked by a pod with pending finalizers, so synchronous GC might slow down its progress.
+
+**kubectl**: synchronous GC can simplify the **kubectl delete** reapers. Let's take the `deployment reaper` as an example, since it's the most complicated one. Currently, the reaper finds all `RS` with matching labels, scales them down, polls until `RS.Status.Replica` reaches 0, deletes the `RS`es, and finally deletes the `deployment`. If using synchronous GC, `kubectl delete deployment` is as easy as sending a synchronous GC delete request for the deployment, and polls until the deployment is deleted from the key-value store.
+
+Note that this **changes the behavior** of `kubectl delete`. The command will be blocked until all pods are deleted from the key-value store, instead of being blocked until pods are in the terminating state. This means `kubectl delete` blocks for a longer time, but it has the benefit that the resources used by the pods are released when `kubectl delete` returns. To allow kubectl users to skip waiting for the cleanup, we will add a `--wait` flag. It defaults to true; if it's set to `false`, `kubectl delete` will send the delete request with `PropagationPolicy=DeletePropagationBackground` and return immediately.
+
+To make the new kubectl compatible with the 1.4 and earlier masters, kubectl needs to switch to use the old reaper logic if it finds synchronous GC is not supported by the master.
+
+1.4 `kubectl delete rc/rs` uses `DeleteOptions.OrphanDependents=true`, which is going to be converted to `DeletePropagationBackground` (see [API Design](#api-changes)) by a 1.5 master, so its behavior keeps the same.
+
+Pre 1.4 `kubectl delete` uses `DeleteOptions.OrphanDependents=nil`, so does the 1.4 `kubectl delete` for resources other than rc and rs. The option is going to be converted to `DeletePropagationDefault` (see [API Design](#api-changes)) by a 1.5 master, so these commands behave the same as when working with a 1.4 master.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/synchronous-garbage-collection.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/taint-toleration-dedicated.md b/contributors/design-proposals/taint-toleration-dedicated.md
new file mode 100644
index 00000000..c523319f
--- /dev/null
+++ b/contributors/design-proposals/taint-toleration-dedicated.md
@@ -0,0 +1,291 @@
+# Taints, Tolerations, and Dedicated Nodes
+
+## Introduction
+
+This document describes *taints* and *tolerations*, which constitute a generic
+mechanism for restricting the set of pods that can use a node. We also describe
+one concrete use case for the mechanism, namely to limit the set of users (or
+more generally, authorization domains) who can access a set of nodes (a feature
+we call *dedicated nodes*). There are many other uses--for example, a set of
+nodes with a particular piece of hardware could be reserved for pods that
+require that hardware, or a node could be marked as unschedulable when it is
+being drained before shutdown, or a node could trigger evictions when it
+experiences hardware or software problems or abnormal node configurations; see
+issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and
+[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion.
+
+## Taints, tolerations, and dedicated nodes
+
+A *taint* is a new type that is part of the `NodeSpec`; when present, it
+prevents pods from scheduling onto the node unless the pod *tolerates* the taint
+(tolerations are listed in the `PodSpec`). Note that there are actually multiple
+flavors of taints: taints that prevent scheduling on a node, taints that cause
+the scheduler to try to avoid scheduling on a node but do not prevent it, taints
+that prevent a pod from starting on Kubelet even if the pod's `NodeName` was
+written directly (i.e. pod did not go through the scheduler), and taints that
+evict already-running pods.
+[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
+has more background on these different scenarios. We will focus on the first
+kind of taint in this doc, since it is the kind required for the "dedicated
+nodes" use case.
+
+Implementing dedicated nodes using taints and tolerations is straightforward: in
+essence, a node that is dedicated to group A gets taint `dedicated=A` and the
+pods belonging to group A get toleration `dedicated=A`. (The exact syntax and
+semantics of taints and tolerations are described later in this doc.) This keeps
+all pods except those belonging to group A off of the nodes. This approach
+easily generalizes to pods that are allowed to schedule into multiple dedicated
+node groups, and nodes that are a member of multiple dedicated node groups.
+
+Note that because tolerations are at the granularity of pods, the mechanism is
+very flexible -- any policy can be used to determine which tolerations should be
+placed on a pod. So the "group A" mentioned above could be all pods from a
+particular namespace or set of namespaces, or all pods with some other arbitrary
+characteristic in common. We expect that any real-world usage of taints and
+tolerations will employ an admission controller to apply the tolerations. For
+example, to give all pods from namespace A access to dedicated node group A, an
+admission controller would add the corresponding toleration to all pods from
+namespace A. Or to give all pods that require GPUs access to GPU nodes, an
+admission controller would add the toleration for GPU taints to pods that
+request the GPU resource.
+
+Everything that can be expressed using taints and tolerations can be expressed
+using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g.
+in the example in the previous paragraph, you could put a label `dedicated=A` on
+the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not*
+belonging to group A. But it is cumbersome to express exclusion policies using
+node affinity because every time you add a new type of restricted node, all pods
+that aren't allowed to use those nodes need to start avoiding those nodes using
+node affinity. This means the node affinity list can get quite long in clusters
+with lots of different groups of special nodes (lots of dedicated node groups,
+lots of different kinds of special hardware, etc.). Moreover, you need to also
+update any Pending pods when you add new types of special nodes. In contrast,
+with taints and tolerations, when you add a new type of special node, "regular"
+pods are unaffected, and you just need to add the necessary toleration to the
+pods you subsequently create that need to use the new type of special nodes. To
+put it another way, with taints and tolerations, only pods that use a set of
+special nodes need to know about those special nodes; with the node affinity
+approach, pods that have no interest in those special nodes need to know about
+all of the groups of special nodes.
+
+One final comment: in practice, it is often desirable to not only keep "regular"
+pods off of special nodes, but also to keep "special" pods off of regular nodes.
+An example in the dedicated nodes case is to not only keep regular users off of
+dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
+nodes. In this case, the "non-dedicated" nodes can be modeled as their own
+dedicated node group (for example, tainted as `dedicated=shared`), and pods that
+are not given access to any dedicated nodes ("regular" pods) would be given a
+toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations
+will be added by an admission controller.) In this case taints/tolerations are
+still better than node affinity because with taints/tolerations each pod only
+needs one special "marking", versus in the node affinity case where every time
+you add a dedicated node group (i.e. a new `dedicated=` value), you need to add
+a new node affinity rule to all pods (including pending pods) except the ones
+allowed to use that new dedicated node group.
+
+## API
+
+```go
+// The node this Taint is attached to has the effect "effect" on
+// any pod that does not tolerate the Taint.
+type Taint struct {
+ Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+ Value string `json:"value,omitempty"`
+ Effect TaintEffect `json:"effect"`
+}
+
+type TaintEffect string
+
+const (
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // but allow all pods submitted to Kubelet without going through the scheduler
+ // to start, and allow all already-running pods to continue running.
+ // Enforced by the scheduler.
+ TaintEffectNoSchedule TaintEffect = "NoSchedule"
+ // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
+ // new pods onto the node, rather than prohibiting new pods from scheduling
+ // onto the node. Enforced by the scheduler.
+ TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // do not allow pods to start on Kubelet unless they tolerate the taint,
+ // but allow all already-running pods to continue running.
+ // Enforced by the scheduler and Kubelet.
+ TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // do not allow pods to start on Kubelet unless they tolerate the taint,
+ // and try to eventually evict any already-running pods that do not tolerate the taint.
+ // Enforced by the scheduler and Kubelet.
+ TaintEffectNoScheduleNoAdmitNoExecute = "NoScheduleNoAdmitNoExecute"
+)
+
+// The pod this Toleration is attached to tolerates any taint that matches
+// the triple <key,value,effect> using the matching operator <operator>.
+type Toleration struct {
+ Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+ // operator represents a key's relationship to the value.
+ // Valid operators are Exists and Equal. Defaults to Equal.
+ // Exists is equivalent to wildcard for value, so that a pod can
+ // tolerate all taints of a particular category.
+ Operator TolerationOperator `json:"operator"`
+ Value string `json:"value,omitempty"`
+ Effect TaintEffect `json:"effect"`
+ // TODO: For forgiveness (#1574), we'd eventually add at least a grace period
+ // here, and possibly an occurrence threshold and period.
+}
+
+// A toleration operator is the set of operators that can be used in a toleration.
+type TolerationOperator string
+
+const (
+ TolerationOpExists TolerationOperator = "Exists"
+ TolerationOpEqual TolerationOperator = "Equal"
+)
+
+```
+
+(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
+to understand the motivation for the various taint effects.)
+
+We will add:
+
+```go
+ // Multiple tolerations with the same key are allowed.
+ Tolerations []Toleration `json:"tolerations,omitempty"`
+```
+
+to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type
+TaintEffectPreferNoSchedule) in order to be able to schedule onto that node.
+
+We will add:
+
+```go
+ // Multiple taints with the same key are not allowed.
+ Taints []Taint `json:"taints,omitempty"`
+```
+
+to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
+of the taints specified by various sources. For now, the only source is
+the `NodeSpec` itself, but in the future one could imagine a node inheriting
+taints from pods (if we were to allow taints to be attached to pods), from
+the node's startup configuration, etc. The scheduler should look at the `Taints`
+in `NodeStatus`, not in `NodeSpec`.
+
+Taints and tolerations are not scoped to namespace.
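+
+Taken together, the scheduling rule above amounts to roughly the following predicate (a simplified
+sketch; the exact matching semantics, e.g. how an empty effect is treated, are not pinned down here):
+
+```go
+// podToleratesNodeTaints: every taint on the node, other than PreferNoSchedule taints,
+// must be tolerated by at least one toleration on the pod.
+func podToleratesNodeTaints(pod *Pod, node *Node) bool {
+    for _, taint := range node.Status.Taints {
+        if taint.Effect == TaintEffectPreferNoSchedule {
+            continue // handled by the priority function, not this predicate
+        }
+        if !anyTolerationMatches(pod.Spec.Tolerations, taint) {
+            return false
+        }
+    }
+    return true
+}
+
+// anyTolerationMatches: a toleration matches a taint when the keys and effects match
+// and either the operator is Exists or the values are equal.
+func anyTolerationMatches(tolerations []Toleration, taint Taint) bool {
+    for _, t := range tolerations {
+        if t.Key != taint.Key || t.Effect != taint.Effect {
+            continue
+        }
+        if t.Operator == TolerationOpExists || t.Value == taint.Value {
+            return true
+        }
+    }
+    return false
+}
+```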
+
+## Implementation plan: taints, tolerations, and dedicated nodes
+
+Using taints and tolerations to implement dedicated nodes requires these steps:
+
+1. Add the API described above
+1. Add a scheduler predicate function that respects taints and tolerations (for
+TaintEffectNoSchedule) and a scheduler priority function that respects taints
+and tolerations (for TaintEffectPreferNoSchedule).
+1. Add to the Kubelet code to implement the "no admit" behavior of
+TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute
+1. Implement code in Kubelet that evicts a pod that no longer satisfies
+TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the
+controllers instead, but since taints might be used to enforce security
+policies, it is better to do in kubelet because kubelet can respond quickly and
+can guarantee the rules will be applied to all pods. Eviction may need to happen
+under a variety of circumstances: when a taint is added, when an existing taint
+is updated, when a toleration is removed from a pod, or when a toleration is
+modified on a pod.
+1. Add a new `kubectl` command that adds/removes taints to/from nodes.
+1. (This is the one step that is specific to dedicated nodes.) Implement an
+admission controller that adds tolerations to pods that are supposed to be
+allowed to use dedicated nodes (for example, based on pod's namespace).
+
+In the future one can imagine a generic policy configuration that configures an
+admission controller to apply the appropriate tolerations to the desired class
+of pods and taints to Nodes upon node creation. It could be used not just for
+policies about dedicated nodes, but also other uses of taints and tolerations,
+e.g. nodes that are restricted due to their hardware configuration.
+
+The `kubectl` command to add and remove taints on nodes will be modeled after
+`kubectl label`. Example usages:
+
+```sh
+# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
+# If a taint with that key already exists, its value and effect are replaced as specified.
+$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute
+
+# Remove from node 'foo' the taint with key 'dedicated' if one exists.
+$ kubectl taint nodes foo dedicated-
+```
+
+## Example: implementing a dedicated nodes policy
+
+Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
+only to pods in a particular namespace `banana`. First the administrator does
+
+```sh
+$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute
+
+```
+
+(assuming they want to evict pods that are already running on those nodes if those
+pods don't already tolerate the new taint)
+
+Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
+a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
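+
+For example, the toleration injected into those pods (most likely by an admission controller, as
+described earlier) might look like this in terms of the API above (the `pod` variable is illustrative):
+
+```go
+// Toleration added to every pod created in namespace "banana"; the values match the
+// taint placed on the dedicated nodes above.
+banana := Toleration{
+    Key:      "dedicated",
+    Operator: TolerationOpEqual,
+    Value:    "banana",
+    Effect:   TaintEffectNoScheduleNoAdmitNoExecute,
+}
+pod.Spec.Tolerations = append(pod.Spec.Tolerations, banana)
+```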
+
+In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
+to enumerate them by name.
+
+## Future work
+
+At present, the Kubernetes security model allows any user to add and remove any
+taints and tolerations. Obviously this makes it impossible to securely enforce
+rules like dedicated nodes. We need some mechanism that prevents regular users
+from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them
+from mutating any fields of `NodeSpec`) and from mutating the `Tolerations`
+field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549)
+is relevant.
+
+Another security vulnerability arises if nodes are added to the cluster before
+receiving their taint. Thus we need to ensure that a new node does not become
+"Ready" until it has been configured with its taints. One way to do this is to
+have an admission controller that adds the taint whenever a Node object is
+created.
+
+A quota policy may want to treat nodes differently based on what taints, if any,
+they have. For example, if a particular namespace is only allowed to access
+dedicated nodes, then it may be convenient to give the namespace unlimited
+quota. (To use finite quota, you'd have to size the namespace's quota to the sum
+of the sizes of the machines in the dedicated node group, and update it when
+nodes are added/removed to/from the group.)
+
+It's conceivable that taints and tolerations could be unified with
+[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
+We have chosen not to do this for the reasons described in the "Future work"
+section of that doc.
+
+## Backward compatibility
+
+Old scheduler versions will ignore taints and tolerations. New scheduler
+versions will respect them.
+
+Users should not start using taints and tolerations until the full
+implementation has been in Kubelet and the master for enough binary versions
+that we feel comfortable that we will not need to roll back either Kubelet or
+master to a version that does not support them. Longer-term we will use a
+programmatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
+
+## Related issues
+
+This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190).
+There are a number of other related issues, all of which are linked to from
+[#17190](https://github.com/kubernetes/kubernetes/issues/17190).
+
+The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574).
+
+The concepts of taints and tolerations were originally developed as part of the
+Omega project at Google.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/templates.md b/contributors/design-proposals/templates.md
new file mode 100644
index 00000000..2d58fbd5
--- /dev/null
+++ b/contributors/design-proposals/templates.md
@@ -0,0 +1,569 @@
+# Templates+Parameterization: Repeatedly instantiating user-customized application topologies.
+
+## Motivation
+
+Addresses https://github.com/kubernetes/kubernetes/issues/11492
+
+There are two main motivators for Template functionality in Kubernetes: Controller Instantiation and Application Definition
+
+### Controller Instantiation
+
+Today the replication controller defines a PodTemplate which allows it to instantiate multiple pods with identical characteristics.
+This is useful but limited. Stateful applications have a need to instantiate multiple instances of a more sophisticated topology
+than just a single pod (e.g. they also need Volume definitions). A Template concept would allow a Controller to stamp out multiple
+instances of a given Template definition. This capability would be immediately useful to the [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016) proposal.
+
+Similarly the [Service Catalog proposal](https://github.com/kubernetes/kubernetes/pull/17543) could leverage template instantiation as a mechanism for claiming service instances.
+
+
+### Application Definition
+
+Kubernetes gives developers a platform on which to run images and many configuration objects to control those images, but
+constructing a cohesive application made up of images and configuration objects is currently difficult. Applications
+require:
+
+* Information sharing between images (e.g. one image provides a DB service, another consumes it)
+* Configuration/tuning settings (memory sizes, queue limits)
+* Unique/customizable identifiers (service names, routes)
+
+Application authors know which values should be tunable and what information must be shared, but there is currently no
+consistent way for an application author to define that set of information so that application consumers can easily deploy
+an application and make appropriate decisions about the tunable parameters the author intended to expose.
+
+Furthermore, even if an application author provides consumers with a set of API object definitions (e.g. a set of yaml files)
+it is difficult to build a UI around those objects that would allow the deployer to modify names in one place without
+potentially breaking assumed linkages to other pieces. There is also no prescriptive way to define which configuration
+values are appropriate for a deployer to tune or what the parameters control.
+
+## Use Cases
+
+### Use cases for templates in general
+
+* Providing a full baked application experience in a single portable object that can be repeatably deployed in different environments.
+ * e.g. Wordpress deployment with separate database pod/replica controller
+ * Complex service/replication controller/volume topologies
+* Bulk object creation
+* Provide a management mechanism for deleting/uninstalling an entire set of components related to a single deployed application
+* Providing a library of predefined application definitions that users can select from
+* Enabling the creation of user interfaces that can guide an application deployer through the deployment process with descriptive help about the configuration value decisions they are making, and useful default values where appropriate
+* Exporting a set of objects in a namespace as a template so the topology can be inspected/visualized or recreated in another environment
+* Controllers that need to instantiate multiple instances of identical objects (e.g. StatefulSets).
+
+
+### Use cases for parameters within templates
+
+* Share passwords between components (parameter value is provided to each component as an environment variable or as a Secret reference, with the Secret value being parameterized or produced by an [initializer](https://github.com/kubernetes/kubernetes/issues/3585))
+* Allow for simple deployment-time customization of “app” configuration via environment values or api objects, e.g. memory
+ tuning parameters to a MySQL image, Docker image registry prefix for image strings, pod resource requests and limits, default
+ scale size.
+* Allow simple, declarative defaulting of parameter values and expose them to end users in an approachable way - a parameter
+  like "MySQL table space" can be parameterized in images as an env var - the template parameters declare the parameter, give
+  it a friendly name, give it a reasonable default, and inform the user what tuning options are available.
+* Customization of component names to avoid collisions and ensure matched labeling (e.g. replica selector value and pod label are
+ user provided and in sync).
+* Customize cross-component references (e.g. user provides the name of a secret that already exists in their namespace, to use in
+ a pod as a TLS cert).
+* Provide guidance to users for parameters such as default values, descriptions, and whether or not a particular parameter value
+ is required or can be left blank.
+* Parameterize the replica count of a deployment or [StatefulSet](https://github.com/kubernetes/kubernetes/pull/18016)
+* Parameterize part of the labels and selector for a DaemonSet
+* Parameterize quota/limit values for a pod
+* Parameterize a secret value so a user can provide a custom password or other secret at deployment time
+
+
+## Design Assumptions
+
+The goal for this proposal is a simple schema which addresses a few basic challenges:
+
+* Allow application authors to expose configuration knobs for application deployers, with suggested defaults and
+descriptions of the purpose of each knob
+* Allow application deployers to easily customize exposed values like object names while maintaining referential integrity
+ between dependent pieces (for example ensuring a pod's labels always match the corresponding selector definition of the service)
+* Support maintaining a library of templates within Kubernetes that can be accessed and instantiated by end users
+* Allow users to quickly and repeatedly deploy instances of well-defined application patterns produced by the community
+* Follow established Kubernetes API patterns by defining new template related APIs which consume+return first class Kubernetes
+ API (and therefore json conformant) objects.
+
+We do not wish to invent a new Turing-complete templating language. There are good options available
+(e.g. https://github.com/mustache/mustache) for developers who want a completely flexible and powerful solution for creating
+arbitrarily complex templates with parameters, and tooling can be built around such schemes.
+
+This desire for simplicity also intentionally excludes template composability/embedding as a supported use case.
+
+Allowing templates to reference other templates presents versioning+consistency challenges along with making the template
+no longer a self-contained portable object. Scenarios necessitating multiple templates can be handled in one of several
+alternate ways:
+
+* Explicitly constructing a new template that merges the existing templates (tooling can easily be constructed to perform this
+ operation since the templates are first class api objects).
+* Manually instantiating each template and utilizing [service linking](https://github.com/kubernetes/kubernetes/pull/17543) to share
+ any necessary configuration data.
+
+This document will also refrain from proposing server APIs or client implementations. This has been a point of debate, and it makes
+more sense to focus on the template/parameter specification/syntax than to worry about the tooling that will process or manage the
+template objects. However since there is a desire to at least be able to support a server side implementation, this proposal
+does assume the specification will be k8s API friendly.
+
+## Desired characteristics
+
+* Fully k8s object json-compliant syntax. This allows server side apis that align with existing k8s apis to be constructed
+ which consume templates and existing k8s tooling to work with them. It also allows for api versioning/migration to be managed by
+ the existing k8s codec scheme rather than having to define/introduce a new syntax evolution mechanism.
+ * (Even if they are not part of the k8s core, it would still be good if a server side template processing+managing api supplied
+ as an ApiGroup consumed the same k8s object schema as the peer k8s apis rather than introducing a new one)
+* Self-contained parameter definitions. This allows a template to be a portable object which includes metadata that describe
+  the inputs it expects, making it easy to wrap a user interface around the parameterization flow.
+* Object field primitive types include string, int, boolean, byte[]. The substitution scheme should support all of those types.
+ * complex types (struct/map/list) can be defined in terms of the available primitives, so it's preferred to avoid the complexity
+ of allowing for full complex-type substitution.
+* Parameter metadata. Parameters should include at a minimum, information describing the purpose of the parameter, whether it is
+ required/optional, and a default/suggested value. Type information could also be required to enable more intelligent client interfaces.
+* Template metadata. Templates should be able to include metadata describing their purpose or links to further documentation and
+ versioning information. Annotations on the Template's metadata field can fulfill this requirement.
+
+
+## Proposed Implementation
+
+### Overview
+
+We began by looking at the List object which allows a user to easily group a set of objects together for easy creation via a
+single CLI invocation. It also provides a portable format which requires only a single file to represent an application.
+
+From that starting point, we propose a Template API object which can encapsulate the definition of all components of an
+application to be created. The application definition is encapsulated in the form of an array of API objects (identical to
+List), plus a parameterization section. Components reference the parameter by name and the value of the parameter is
+substituted during a processing step, prior to submitting each component to the appropriate API endpoint for creation.
+
+The primary capability provided is that parameter values can easily be shared between components, such as a database password
+that is provided by the user once, but then attached as an environment variable to both a database pod and a web frontend pod.
+
+In addition, the template can be repeatedly instantiated for a consistent application deployment experience in different
+namespaces or Kubernetes clusters.
+
+Lastly, we propose the Template API object include a “Labels” section in which the template author can define a set of labels
+to be applied to all objects created from the template. This will give the template deployer an easy way to manage all the
+components created from a given template. These labels will also be applied to selectors defined by Objects within the template,
+allowing a combination of templates and labels to be used to scope resources within a namespace. That is, a given template
+can be instantiated multiple times within the same namespace, as long as a different label value is used for each
+instantiation. The resulting objects will be independent from a replica/load-balancing perspective.
+
+Generation of parameter values for fields such as Secrets will be delegated to an [admission controller/initializer/finalizer](https://github.com/kubernetes/kubernetes/issues/3585) rather than being solved by the template processor. Some discussion about a generation
+service is occurring [here](https://github.com/kubernetes/kubernetes/issues/12732).
+
+Labels to be assigned to all objects could also be generated in addition to, or instead of, allowing labels to be supplied in the
+Template definition.
+
+### API Objects
+
+**Template Object**
+
+```go
+// Template contains the inputs needed to produce a Config.
+type Template struct {
+ unversioned.TypeMeta
+ kapi.ObjectMeta
+
+ // Optional: Parameters is an array of Parameters used during the
+ // Template to Config transformation.
+ Parameters []Parameter
+
+ // Required: A list of resources to create
+ Objects []runtime.Object
+
+ // Optional: ObjectLabels is a set of labels that are applied to every
+ // object during the Template to Config transformation
+ // These labels are also applied to selectors defined by objects in the template
+ ObjectLabels map[string]string
+}
+```
+
+**Parameter Object**
+
+```go
+// Parameter defines a name/value variable that is to be processed during
+// the Template to Config transformation.
+type Parameter struct {
+ // Required: Parameter name must be set and it can be referenced in Template
+ // Items using $(PARAMETER_NAME)
+ Name string
+
+ // Optional: The name that will be shown in UIs instead of the parameter 'Name'
+ DisplayName string
+
+ // Optional: A description of the parameter's purpose
+ Description string
+
+ // Optional: Value holds the Parameter data.
+ // The value replaces all occurrences of the Parameter $(Name) or
+ // $((Name)) expression during the Template to Config transformation.
+ Value string
+
+ // Optional: Indicates the parameter must have a non-empty value either provided by the user or provided by a default. Defaults to false.
+ Required bool
+
+ // Optional: Type-value of the parameter (one of string, int, bool, or base64)
+ // Used by clients to provide validation of user input and guide users.
+ Type ParameterType
+}
+```
+
+As seen above, parameters allow for metadata which can be fed into client implementations to display information about the
+parameter’s purpose and whether a value is required. In lieu of type information, two reference styles are offered: `$(PARAM)`
+and `$((PARAM))`. When the single parens option is used, the result of the substitution will remain quoted. When the double
+parens option is used, the result of the substitution will not be quoted. For example, given a parameter defined with a value
+of "BAR", the following behavior will be observed:
+
+```
+somefield: "$(FOO)" -> somefield: "BAR"
+somefield: "$((FOO))" -> somefield: BAR
+```
+
+For concatenation, the resulting value reflects the type of substitution (quoted or unquoted):
+
+```
+somefield: "prefix_$(FOO)_suffix" -> somefield: "prefix_BAR_suffix"
+somefield: "prefix_$((FOO))_suffix" -> somefield: prefix_BAR_suffix
+```
+
+If both types of substitution exist, quoting is performed:
+
+```
+somefield: "prefix_$((FOO))_$(FOO)_suffix" -> somefield: "prefix_BAR_BAR_suffix"
+```
+
+This mechanism allows for integer/boolean values to be substituted properly.
+
+The value of the parameter can be explicitly defined in the template. This should be considered a default value for the parameter;
+clients which process templates are free to override this value based on user input.
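+
+For illustration only, the following is a minimal sketch of how a processor might apply the two substitution styles to a
+serialized template. The helper name and regular expressions are assumptions made for this sketch, not part of the proposal;
+it also only handles the case where a `$((NAME))` reference is the entire quoted field value.
+
+```go
+package main
+
+import (
+	"fmt"
+	"regexp"
+)
+
+// substitute applies the two reference styles to a serialized JSON document.
+func substitute(doc string, params map[string]string) string {
+	lookup := func(name, fallback string) string {
+		if v, ok := params[name]; ok {
+			return v
+		}
+		return fallback
+	}
+	// $((NAME)) drops the surrounding quotes so ints/booleans stay unquoted.
+	unquoted := regexp.MustCompile(`"\$\(\(([A-Za-z0-9_]+)\)\)"`)
+	doc = unquoted.ReplaceAllStringFunc(doc, func(m string) string {
+		return lookup(unquoted.FindStringSubmatch(m)[1], m)
+	})
+	// $(NAME) substitutes in place, leaving any existing quoting intact.
+	quoted := regexp.MustCompile(`\$\(([A-Za-z0-9_]+)\)`)
+	return quoted.ReplaceAllStringFunc(doc, func(m string) string {
+		return lookup(quoted.FindStringSubmatch(m)[1], m)
+	})
+}
+
+func main() {
+	out := substitute(`{"replicas": "$((REPLICA_COUNT))", "name": "$(DATABASE_SERVICE_NAME)"}`,
+		map[string]string{"REPLICA_COUNT": "1", "DATABASE_SERVICE_NAME": "mongodb"})
+	fmt.Println(out) // {"replicas": 1, "name": "mongodb"}
+}
+```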
+
+
+**Example Template**
+
+Illustration of a template which defines a service and replication controller with parameters to specialize
+the names of the top-level objects, the number of replicas, and several environment variables defined on the
+pod template.
+
+```json
+{
+ "kind": "Template",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "mongodb-ephemeral",
+ "annotations": {
+ "description": "Provides a MongoDB database service"
+ }
+ },
+ "labels": {
+ "template": "mongodb-ephemeral-template"
+ },
+ "objects": [
+ {
+ "kind": "Service",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "$(DATABASE_SERVICE_NAME)"
+ },
+ "spec": {
+ "ports": [
+ {
+ "name": "mongo",
+ "protocol": "TCP",
+ "targetPort": 27017
+ }
+ ],
+ "selector": {
+ "name": "$(DATABASE_SERVICE_NAME)"
+ }
+ }
+ },
+ {
+ "kind": "ReplicationController",
+ "apiVersion": "v1",
+ "metadata": {
+ "name": "$(DATABASE_SERVICE_NAME)"
+ },
+ "spec": {
+ "replicas": "$((REPLICA_COUNT))",
+ "selector": {
+ "name": "$(DATABASE_SERVICE_NAME)"
+ },
+ "template": {
+ "metadata": {
+ "creationTimestamp": null,
+ "labels": {
+ "name": "$(DATABASE_SERVICE_NAME)"
+ }
+ },
+ "spec": {
+ "containers": [
+ {
+ "name": "mongodb",
+ "image": "docker.io/centos/mongodb-26-centos7",
+ "ports": [
+ {
+ "containerPort": 27017,
+ "protocol": "TCP"
+ }
+ ],
+ "env": [
+ {
+ "name": "MONGODB_USER",
+ "value": "$(MONGODB_USER)"
+ },
+ {
+ "name": "MONGODB_PASSWORD",
+ "value": "$(MONGODB_PASSWORD)"
+ },
+ {
+ "name": "MONGODB_DATABASE",
+ "value": "$(MONGODB_DATABASE)"
+ }
+ ]
+ }
+ ]
+ }
+ }
+ }
+ }
+ ],
+ "parameters": [
+ {
+ "name": "DATABASE_SERVICE_NAME",
+ "description": "Database service name",
+ "value": "mongodb",
+ "required": true
+ },
+ {
+ "name": "MONGODB_USER",
+ "description": "Username for MongoDB user that will be used for accessing the database",
+ "value": "username",
+ "required": true
+ },
+ {
+ "name": "MONGODB_PASSWORD",
+ "description": "Password for the MongoDB user",
+ "required": true
+ },
+ {
+ "name": "MONGODB_DATABASE",
+ "description": "Database name",
+ "value": "sampledb",
+ "required": true
+ },
+ {
+ "name": "REPLICA_COUNT",
+ "description": "Number of mongo replicas to run",
+ "value": "1",
+ "required": true
+ }
+ ]
+}
+```
+
+### API Endpoints
+
+* **/processedtemplates** - when a template is POSTed to this endpoint, all parameters in the template are processed and
+substituted into appropriate locations in the object definitions. Validation is performed to ensure required parameters have
+a value supplied. In addition, labels defined in the template are applied to the object definitions. Finally, the customized
+template (still a `Template` object) is returned to the caller. (The possibility of returning a List instead has
+also been discussed and will be considered for implementation).
+
+The client is then responsible for iterating the objects returned and POSTing them to the appropriate resource api endpoint to
+create each object, if that is the desired end goal for the client.
+
+Performing parameter substitution on the server side has the benefit of centralizing the processing so that new clients of
+k8s, such as IDEs, CI systems, Web consoles, etc, do not need to reimplement template processing or embed the k8s binary.
+Instead they can invoke the k8s api directly.
+
+* **/templates** - the REST storage resource for storing and retrieving template objects, scoped within a namespace.
+
+Storing templates within k8s has the benefit of enabling template sharing and securing via the same roles/resources
+that are used to provide access control to other cluster resources. It also enables sophisticated service catalog
+flows in which selecting a service from a catalog results in a new instantiation of that service. (This is not the
+only way to implement such a flow, but it does provide a useful level of integration).
+
+Creating a new template (POST to the /templates api endpoint) simply stores the template definition; it has no side
+effects (no other objects are created).
+
+This resource can also support a subresource "/templates/templatename/processed". This resource would accept just a
+Parameters object and would process the template stored in the cluster as "templatename". The processed result would be
+returned in the same form as `/processedtemplates`.
+
+### Workflow
+
+#### Template Instantiation
+
+Given a well-formed template, a client will
+
+1. Optionally set an explicit `value` for any parameters the user wishes to override
+2. Submit the new template object to the `/processedtemplates` api endpoint
+
+The api endpoint will then:
+
+1. Validate the template, including confirming that “required” parameters have an explicit value.
+2. Walk each api object in the template.
+3. Add all labels defined in the template’s ObjectLabels field to each object.
+4. For each field, check if the value matches a parameter name and if so, set the value of the field to the value of the parameter.
+ * Partial substitutions are accepted, such as `SOME_$(PARAM)` which would be transformed into `SOME_XXXX` where `XXXX` is the value
+ of the `$(PARAM)` parameter.
+ * If a given $(VAL) could be resolved to either a parameter or an environment variable/downward api reference, an error will be
+ returned.
+5. Return the processed template object (or a List, depending on the choice made when this is implemented). A rough sketch of this loop appears below.
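+
+The sketch below is illustrative only: it assumes the `Parameter` type defined above and the `substitute` helper from the
+earlier sketch, operates on generic JSON object maps rather than typed `runtime.Object`s, and uses a made-up function name.
+Applying the labels to selectors is omitted for brevity.
+
+```go
+package template
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// processTemplate roughly follows steps 1-5 above.
+func processTemplate(objects []map[string]interface{}, objectLabels map[string]string,
+	params []Parameter) ([]map[string]interface{}, error) {
+	// Step 1: validate that every required parameter has a value.
+	values := map[string]string{}
+	for _, p := range params {
+		if p.Required && p.Value == "" {
+			return nil, fmt.Errorf("parameter %q requires a value", p.Name)
+		}
+		values[p.Name] = p.Value
+	}
+
+	out := make([]map[string]interface{}, 0, len(objects))
+	// Step 2: walk each api object in the template.
+	for _, obj := range objects {
+		// Step 3: apply the template's ObjectLabels to the object's metadata.
+		meta, _ := obj["metadata"].(map[string]interface{})
+		if meta == nil {
+			meta = map[string]interface{}{}
+			obj["metadata"] = meta
+		}
+		labels, _ := meta["labels"].(map[string]interface{})
+		if labels == nil {
+			labels = map[string]interface{}{}
+			meta["labels"] = labels
+		}
+		for k, v := range objectLabels {
+			labels[k] = v
+		}
+
+		// Step 4: substitute parameter references in every string field by
+		// round-tripping the object through its JSON form.
+		raw, err := json.Marshal(obj)
+		if err != nil {
+			return nil, err
+		}
+		var processed map[string]interface{}
+		if err := json.Unmarshal([]byte(substitute(string(raw), values)), &processed); err != nil {
+			return nil, err
+		}
+		out = append(out, processed)
+	}
+
+	// Step 5: the caller wraps these back into a Template (or List) to return.
+	return out, nil
+}
+```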
+
+The client can now either return the processed template to the user in a desired form (e.g. json or yaml), or directly iterate the
+api objects within the template, invoking the appropriate object creation api endpoint for each element. (If the api returns
+a List, the client would simply iterate the list to create the objects).
+
+The result is a consistently recreatable application configuration, including well-defined labels for grouping objects created by
+the template, with end-user customizations as enabled by the template author.
+
+#### Template Authoring
+
+To aid application authors in the creation of new templates, it should be possible to export existing objects from a project
+in template form. A user should be able to export all or a filtered subset of objects from a namespace, wrapped into a
+Template API object. The user will still need to customize the resulting object to enable parameterization and labeling,
+though sophisticated export logic could attempt to auto-parameterize well understood api fields. Such logic is not considered
+in this proposal.
+
+#### Tooling
+
+As described above, templates can be instantiated by posting them to a template processing endpoint. CLI tools should
+exist which can input parameter values from the user as part of the template instantiation flow.
+
+More sophisticated UI implementations should also guide the user through which parameters the template expects, the descriptions
+of those parameters, and the collection of user-provided values.
+
+In addition, as described above, existing objects in a namespace can be exported in template form, making it easy to recreate a
+set of objects in a new namespace or a new cluster.
+
+
+## Examples
+
+### Example Templates
+
+These examples reflect the current OpenShift template schema, not the exact schema proposed in this document; however, this
+proposal, if accepted, provides sufficient capability to support the examples defined here, with the exception of
+automatic generation of passwords.
+
+* [Jenkins template](https://github.com/openshift/origin/blob/master/examples/jenkins/jenkins-persistent-template.json)
+* [MySQL DB service template](https://github.com/openshift/origin/blob/master/examples/db-templates/mysql-persistent-template.json)
+
+### Examples of OpenShift Parameter Usage
+
+(mapped to use cases described above)
+
+* [Share passwords](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L146-L152)
+* [Simple deployment-time customization of “app” configuration via environment values](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L108-L126) (e.g. memory tuning, resource limits, etc)
+* [Customization of component names with referential integrity](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L199-L207)
+* [Customize cross-component references](https://github.com/jboss-openshift/application-templates/blob/master/eap/eap64-mongodb-s2i.json#L78-L83) (e.g. user provides the name of a secret that already exists in their namespace, to use in a pod as a TLS cert)
+
+## Requirements analysis
+
+There has been some discussion of desired goals for a templating/parameterization solution [here](https://github.com/kubernetes/kubernetes/issues/11492#issuecomment-160853594). This section will attempt to address each of those points.
+
+*The primary goal is that parameterization should facilitate reuse of declarative configuration templates in different environments in
+ a "significant number" of common cases without further expansion, substitution, or other static preprocessing.*
+
+* This solution provides for templates that can be reused as is (assuming parameters are not used or provide sane default values) across
+ different environments; they are a self-contained description of a topology.
+
+*Parameterization should not impede the ability to use kubectl commands with concrete resource specifications.*
+
+* The parameterization proposal here does not extend beyond Template objects. That is both a strength and limitation of this proposal.
+ Parameterizable objects must be wrapped into a Template object, rather than existing on their own.
+
+*Parameterization should work with all kubectl commands that accept --filename, and should work on templates comprised of multiple resources.*
+
+* Same as above.
+
+*The parameterization mechanism should not prevent the ability to wrap kubectl with workflow/orchestration tools, such as Deployment manager.*
+
+* Since this proposal uses standard API objects, a DM or Helm flow could still be constructed around a set of templates, just as those flows are
+ constructed around other API objects today.
+
+*Any parameterization mechanism we add should not preclude the use of a different parameterization mechanism, it should be possible
+to use different mechanisms for different resources, and, ideally, the transformation should be composable with other
+substitution/decoration passes.*
+
+* This templating scheme does not preclude layering an additional templating mechanism over top of it. For example, it would be
+ possible to write a Mustache template which, after Mustache processing, resulted in a Template which could then be instantiated
+ through the normal template instantiating process.
+
+*Parameterization should not compromise reproducibility. For instance, it should be possible to manage template arguments as well as
+templates under version control.*
+
+* Templates are a single file, including default or chosen values for parameters. They can easily be managed under version control.
+
+*It should be possible to specify template arguments (i.e., parameter values) declaratively, in a way that is "self-describing"
+(i.e., naming the parameters and the template to which they correspond). It should be possible to write generic commands to
+process templates.*
+
+* Parameter definitions include metadata which describes the purpose of the parameter. Since parameter definitions are part of the template,
+ there is no need to indicate which template they correspond to.
+
+*It should be possible to validate templates and template parameters, both values and the schema.*
+
+* Template objects are subject to standard api validation.
+
+*It should also be possible to validate and view the output of the substitution process.*
+
+* The `/processedtemplates` api returns the result of the substitution process, which is itself a Template object that can be validated.
+
+*It should be possible to generate forms for parameterized templates, as discussed in #4210 and #6487.*
+
+* Parameter definitions provide metadata that allows for the construction of form-based UIs to gather parameter values from users.
+
+*It shouldn't be inordinately difficult to evolve templates. Thus, strategies such as versioning and encapsulation should be
+encouraged, at least by convention.*
+
+* Templates can be versioned via annotations on the template object.
+
+## Key discussion points
+
+The preceding document is opinionated about each of these topics; however, they have been popular topics of discussion so they are called out explicitly below.
+
+### Where to define parameters
+
+There has been some discussion around where to define the parameters that are injected into a Template:
+
+1. In a separate standalone file
+2. Within the Template itself
+
+This proposal suggests including the parameter definitions within the Template, which provides a self-contained structure that
+can be easily versioned, transported, and instantiated without risk of mismatching content. In addition, a Template can easily
+be validated to confirm that all parameter references are resolvable.
+
+Separating the parameter definitions makes for a more complex process with respect to:
+* Editing a template (if/when first class editing tools are created)
+* Storing/retrieving template objects with a central store
+
+Note that the `/templates/sometemplate/processed` subresource would accept a standalone set of parameters to be applied to `sometemplate`.
+
+### How to define parameters
+
+There has also been debate about how a parameter should be referenced from within a template. This proposal suggests that
+fields to be substituted by a parameter value use the "$(parameter)" syntax which is already used elsewhere within k8s. The
+value of `parameter` should be matched to a parameter with that name, and the value of the matched parameter substituted into
+the field value.
+
+Other suggestions include a path/map approach in which a list of field paths (e.g. json path expressions) and corresponding
+parameter names are provided. The substitution process would walk the map, replacing fields with the appropriate
+parameter value. This approach makes templates more fragile from the perspective of editing/refactoring as field paths
+may change, thus breaking the map. There is of course also risk of breaking references with the previous scheme, but
+renaming parameters seems less likely than changing field paths.
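+
+For illustration only, a hypothetical path/map alternative might look something like the following, with field paths kept
+outside the objects themselves (the field names here are invented for this sketch):
+
+```json
+{
+  "parameterMappings": [
+    {"fieldPath": "objects[0].metadata.name", "parameter": "DATABASE_SERVICE_NAME"},
+    {"fieldPath": "objects[1].spec.replicas", "parameter": "REPLICA_COUNT"}
+  ]
+}
+```
+
+Reordering objects or renaming fields in the template would silently break such a map, which is the fragility argument made above.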
+
+### Storing templates in k8s
+
+OpenShift defines templates as a first-class resource so they can be created, retrieved, etc. via standard tools. This allows client tools to list the templates available in the OpenShift cluster, allows existing resource security controls to be applied to templates, and generally provides a more integrated feel. However, there is no explicit requirement that, in order to adopt templates, k8s must also adopt storing them in the cluster.
+
+### Processing templates (server vs. client)
+
+OpenShift handles template processing via a server endpoint which consumes a template object from the client and returns the list of objects
+produced by processing the template. It is also possible to handle the entire template processing flow via the client, but this was deemed
+undesirable as it would force each client tool to reimplement template processing (e.g. the standard CLI tool, an Eclipse plugin, a plugin for a CI system like Jenkins, etc). The assumption in this proposal is that server-side template processing is the preferred implementation approach for
+this reason.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/templates.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/ubernetes-cluster-state.png b/contributors/design-proposals/ubernetes-cluster-state.png
new file mode 100644
index 00000000..56ec2df8
--- /dev/null
+++ b/contributors/design-proposals/ubernetes-cluster-state.png
Binary files differ
diff --git a/contributors/design-proposals/ubernetes-design.png b/contributors/design-proposals/ubernetes-design.png
new file mode 100644
index 00000000..44924846
--- /dev/null
+++ b/contributors/design-proposals/ubernetes-design.png
Binary files differ
diff --git a/contributors/design-proposals/ubernetes-scheduling.png b/contributors/design-proposals/ubernetes-scheduling.png
new file mode 100644
index 00000000..01774882
--- /dev/null
+++ b/contributors/design-proposals/ubernetes-scheduling.png
Binary files differ
diff --git a/contributors/design-proposals/versioning.md b/contributors/design-proposals/versioning.md
new file mode 100644
index 00000000..ae724b12
--- /dev/null
+++ b/contributors/design-proposals/versioning.md
@@ -0,0 +1,174 @@
+# Kubernetes API and Release Versioning
+
+Reference: [Semantic Versioning](http://semver.org)
+
+Legend:
+
+* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released.
+This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the
+major version, **Y** is the minor version, and **Z** is the patch version.)
+* **API vX[betaY]** refers to the version of the HTTP API.
+
+## Release versioning
+
+### Minor version scheme and timeline
+
+* Kube X.Y.0-alpha.W, W > 0 (Branch: master)
+ * Alpha releases are released roughly every two weeks directly from the master
+branch.
+ * No cherrypick releases. If there is a critical bugfix, a new release from
+master can be created ahead of schedule.
+* Kube X.Y.Z-beta.W (Branch: release-X.Y)
+ * When master is feature-complete for Kube X.Y, we will cut the release-X.Y
+branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential
+to X.Y.
+ * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0.
+ * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases,
+(X.Y.0-beta.W | W > 0) as necessary.
+* Kube X.Y.0 (Branch: release-X.Y)
+ * Final release, cut from the release-X.Y branch cut two weeks prior.
+ * X.Y.1-beta.0 will be tagged at the same commit on the same branch.
+ * X.Y.0 occurs 3 to 4 months after X.(Y-1).0.
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y)
+ * [Patch releases](#patch-releases) are released as we cherrypick commits into
+the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed.
+ * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is
+tagged on the followup commit that updates pkg/version/base.go with the beta
+version.
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z)
+ * These are special and different in that the X.Y.Z tag is branched to isolate
+the emergency/critical fix from all other changes that have landed on the
+release branch since the previous tag.
+ * Cut release-X.Y.Z branch to hold the isolated patch release
+ * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1)
+ * Branched [patch releases](#patch-releases) are rarely needed but used for
+emergency/critical fixes to the latest release
+ * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed
+for this kind of release to be possible.
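+
+As a purely hypothetical illustration of the scheme above, a 1.4 minor release cycle might produce a tag sequence along these lines:
+
+```
+1.4.0-alpha.1, 1.4.0-alpha.2, ...   cut from master roughly every two weeks
+1.4.0-beta.0                        release-1.4 branch cut; master becomes 1.5.0-alpha.0
+1.4.0-beta.1, ...                   further betas only if needed
+1.4.0                               final release; 1.4.1-beta.0 tagged at the same commit
+1.4.1, 1.4.2, ...                   patch releases cut from release-1.4 as needed
+```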
+
+### Major version timeline
+
+There is no mandated timeline for major versions. They only occur when we need
+to start the clock on deprecating features. A given major version should be the
+latest major version for at least one year from its original release date.
+
+### CI and dev version scheme
+
+* Continuous integration versions also exist, and are versioned off of alpha and
+beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an
+additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after
+X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds
+that are built off of a dirty build tree (during development, with things in
+the tree that are not checked in) will have -dirty appended.
+
+### Supported releases and component skew
+
+We expect users to stay reasonably up-to-date with the versions of Kubernetes
+they use in production, but understand that it may take time to upgrade,
+especially for production-critical components.
+
+We expect users to be running approximately the latest patch release of a given
+minor release; we often include critical bug fixes in
+[patch releases](#patch-release), and so encourage users to upgrade as soon as
+possible.
+
+Different components are expected to be compatible across different amounts of
+skew, all relative to the master version. Nodes may lag the master components by
+up to two minor versions but should be at a version no newer than the master; a
+client should be skewed no more than one minor version from the master, but may
+lead the master by up to one minor version. For example, a v1.3 master should
+work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4
+clients.
+
+Furthermore, we expect to "support" three minor releases at a time. "Support"
+means we expect users to be running that version in production, though we may
+not port fixes back before the latest minor version. For example, when v1.3
+comes out, v1.0 will no longer be supported: basically, that means that the
+reasonable response to the question "my v1.0 cluster isn't working," is, "you
+should probably upgrade it, (and probably should have some time ago)". With
+minor releases happening approximately every three months, that means a minor
+release is supported for approximately nine months.
+
+This policy is in line with
+[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade).
+
+## API versioning
+
+### Release versions as related to API versions
+
+Here is an example major release cycle:
+
+* **Kube 1.0 should have API v1 without v1beta\* API versions**
+ * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have
+the stable v1 API. This enables you to migrate all your objects off of the beta
+API versions of the API and allows us to remove those beta API versions in Kube
+1.0 with no effect. There will be tooling to help you detect and migrate any
+v1beta\* data versions or calls to v1 before you do the upgrade.
+* **Kube 1.x may have API v2beta\***
+ * The first incarnation of a new (backwards-incompatible) API in HEAD is
+ v2beta1. By default this will be unregistered in apiserver, so it can change
+ freely. Once it is available by default in apiserver (which may not happen for
+several minor releases), it cannot change ever again because we serialize
+objects in versioned form, and we always need to be able to deserialize any
+objects that are saved in etcd, even between alpha versions. If further changes
+to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x
+versions.
+* **Kube 1.y (where y is the last version of the 1.x series) must have final
+API v2**
+ * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two
+ things: (1) users can upgrade to API v2 when running Kube 1.x and then switch
+ over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can
+ cleanup and remove all API v2beta\* versions because no one should have
+ v2beta\* objects left in their database. As mentioned above, tooling will exist
+ to make sure there are no calls or references to a given API version anywhere
+ inside someone's kube installation before someone upgrades.
+ * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only.
+It *may* include the v1 API as well if the burden is not high - this will be
+determined on a per-major-version basis.
+
+#### Rationale for API v2 being complete before v2.0's release
+
+It may seem a bit strange to complete the v2 API before v2.0 is released,
+but *adding* a v2 API is not a breaking change. *Removing* the v2beta\*
+APIs *is* a breaking change, which is what necessitates the major version bump.
+There are other ways to do this, but having the major release be the fresh start
+of that release's API without the baggage of its beta versions seems most
+intuitive out of the available options.
+
+## Patch releases
+
+Patch releases are intended for critical bug fixes to the latest minor version,
+such as addressing security vulnerabilities, fixes to problems affecting a large
+number of users, severe problems with no workaround, and blockers for products
+based on Kubernetes.
+
+They should not contain miscellaneous feature additions or improvements, and
+especially no incompatibilities should be introduced between patch versions of
+the same minor version (or even major version).
+
+Dependencies, such as Docker or Etcd, should also not be changed unless
+absolutely necessary, and also just to fix critical bugs (so, at most patch
+version changes, not new major nor minor versions).
+
+## Upgrades
+
+* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a
+rolling upgrade across their cluster. (Rolling upgrade means being able to
+upgrade the master first, then one node at a time. See #4855 for details.)
+ * However, we do not recommend upgrading more than two minor releases at a
+time (see [Supported releases](#supported-releases)), and do not recommend
+running non-latest patch releases of a given minor release.
+* No hard breaking changes over version boundaries.
+ * For example, if a user is at Kube 1.x, we may require them to upgrade to
+Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across
+major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as
+graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone
+to go from 1.x to 1.x+y before they go to 2.x.
+
+There is a separate question of how to track the capabilities of a kubelet to
+facilitate rolling upgrades. That is not addressed here.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-hostpath-qualifiers.md b/contributors/design-proposals/volume-hostpath-qualifiers.md
new file mode 100644
index 00000000..cd0902ec
--- /dev/null
+++ b/contributors/design-proposals/volume-hostpath-qualifiers.md
@@ -0,0 +1,150 @@
+# Support HostPath volume existence qualifiers
+
+## Introduction
+
+A Host volume source is probably the simplest volume type to define, needing
+only a single path. However, that simplicity comes with many assumptions and
+caveats.
+
+This proposal describes one of the issues associated with Host volumes &mdash;
+their silent and implicit creation of directories on the host &mdash; and
+proposes a solution.
+
+## Problem
+
+Right now, under Docker, when a bindmount references a hostPath, that path will
+be created as an empty directory, owned by root, if it does not already exist.
+This is rarely what the user actually wants because hostPath volumes are
+typically used to express a dependency on an existing external file or
+directory.
+This concern was raised during the [initial
+implementation](https://github.com/docker/docker/issues/1279#issuecomment-22965058)
+of this behavior in Docker, and it was suggested that orchestration systems
+could better manage volume creation than Docker, but Docker still does so
+anyway.
+
+To fix this problem, I propose allowing a pod to specify whether a given
+hostPath should exist prior to the pod running, whether it should be created,
+and what it should exist as.
+I also propose the inclusion of a default value which matches the current
+behavior to ensure backwards compatibility.
+
+To understand exactly when this behavior will or won't be correct, it's
+important to look at the use-cases of Host Volumes.
+The table below broadly classifies the use-case of Host Volumes and asserts
+whether this change would be of benefit to that use-case.
+
+### HostPath volume Use-cases
+
+| Use-case | Description | Examples | Benefits from this change? | Why? |
+|:---------|:------------|:---------|:--------------------------:|:-----|
+| Accessing an external system, data, or configuration | Data or a unix socket is created by a process on the host, and a pod within kubernetes consumes it | [fluentd-es-addon](https://github.com/kubernetes/kubernetes/blob/74b01041cc3feb2bb731cc243ab0e4515bef9a84/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml#L30), [addon-manager](https://github.com/kubernetes/kubernetes/blob/808f3ecbe673b4127627a457dc77266ede49905d/cluster/gce/coreos/kube-manifests/kube-addon-manager.yaml#L23), [kube-proxy](https://github.com/kubernetes/kubernetes/blob/010c976ce8dd92904a7609483c8e794fd8e94d4e/cluster/saltbase/salt/kube-proxy/kube-proxy.manifest#L65), etc | :white_check_mark: | Fails faster and with more useful messages, and won't run when basic assumptions are false (e.g. that docker is the runtime and the docker.sock exists) |
+| Providing data to external systems | Some pods wish to publish data to the host for other systems to consume, sometimes to a generic directory and sometimes to more component-specific ones | Kubelet core components which bindmount their logs out to `/var/log/*.log` so logrotate and other tools work with them | :white_check_mark: | Sometimes, but not always. It's directory-specific whether its absence is a problem. |
+| Communicating between instances and versions of yourself | A pod can use a hostPath directory as a sort of cache and, as opposed to an emptyDir, persist the directory between versions of itself | [etcd](https://github.com/kubernetes/kubernetes/blob/fac54c9b22eff5c5052a8e3369cf8416a7827d36/cluster/saltbase/salt/etcd/etcd.manifest#L84), caches | :x: | It's pretty much always okay to create them |
+
+
+### Other motivating factors
+
+One additional motivating factor for this change is that under the rkt runtime
+paths are not created when they do not exist. This change moves the management
+of these volumes into the Kubelet to the benefit of the rkt container runtime.
+
+
+## Proposed API Change
+
+### Host Volume
+
+I propose that the
+[`v1.HostPathVolumeSource`](https://github.com/kubernetes/kubernetes/blob/d26b4ca2859aa667ad520fb9518e0db67b74216a/pkg/api/types.go#L447-L451)
+object be changed to include the following additional field:
+
+`Type` - An optional string of `exists|file|device|socket|directory` - If not
+set, it will default to a backwards-compatible default behavior described
+below.
+
+| Value | Behavior |
+|:------|:---------|
+| *unset* | If nothing exists at the given path, an empty directory will be created there. Otherwise, behaves like `exists`. This is the `auto` behavior referenced below. |
+| `exists` | If nothing exists at the given path, the pod will fail to run and provide an informative error message |
+| `file` | If a file does not exist at the given path, the pod will fail to run and provide an informative error message |
+| `device` | If a block or character device does not exist at the given path, the pod will fail to run and provide an informative error message |
+| `socket` | If a socket does not exist at the given path, the pod will fail to run and provide an informative error message |
+| `directory` | If a directory does not exist at the given path, the pod will fail to run and provide an informative error message |
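+
+As a hedged illustration only (the serialized field name and its placement are assumptions for this sketch, not part of the
+proposed API text), a pod depending on an existing log directory might look like:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: log-reader
+spec:
+  containers:
+  - name: reader
+    image: example.com/log-reader:latest   # hypothetical image
+    volumeMounts:
+    - name: varlog
+      mountPath: /var/log
+  volumes:
+  - name: varlog
+    hostPath:
+      path: /var/log
+      type: directory   # fail the pod if /var/log is not an existing directory
+```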
+
+Additional possible values, which are proposed to be excluded:
+
+|Value | Behavior | Reason for exclusion |
+|:-----|:---------|:---------------------|
+| `new-directory` | Like `auto`, but the given path must be a directory if it exists | `auto` mostly fills this use-case |
+| `character-device` | | Granularity beyond `device` shouldn't matter often |
+| `block-device` | | Granularity beyond `device` shouldn't matter often |
+| `new-file` | Like file, but if nothing exist an empty file is created instead | In general, bindmounting the parent directory of the file you intend to create addresses this usecase |
+| `optional` | If a path does not exist, then do not create any container-mount at all | This would better be handled by a new field entirely if this behavior is desirable |
+
+
+### Why not as part of any other volume types?
+
+This feature does not make sense for any of the other volume types simply
+because all of the other types are already fully qualified. For example, NFS
+volumes are known to always be in existence else they will not mount.
+Similarly, EmptyDir volumes will always exist as a directory.
+
+Only the HostVolume and SubPath means of referencing a path have the potential
+to reference arbitrary incorrect or nonexistent things without erroring out.
+
+### Alternatives
+
+One alternative is to augment Host Volumes with a `MustExist` bool and provide
+no further granularity. This would allow toggling between the `auto` and
+`exists` behaviors described above. This would likely cover the "90%" use-case
+and would be a simpler API. It would be sufficient for all of the examples
+linked above in my opinion.
+
+## Kubelet implementation
+
+It's proposed that prior to starting a pod, the Kubelet validates that the
+given path meets the qualifications of its type. Namely, if the type is unset
+(the `auto` behavior), the Kubelet will create an empty directory if none exists there, and for each
+of the others the Kubelet will perform the given validation prior to running
+the pod. This validation might be done by a volume plugin, but further
+technical consideration (out of scope of this proposal) is needed.
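+
+A minimal sketch of that validation, assuming the check is a plain `os.Stat` before the pod's mounts are created (the
+function name and string values are illustrative, not a concrete implementation):
+
+```go
+package hostpath
+
+import (
+	"fmt"
+	"os"
+)
+
+// checkType validates that path satisfies the requested hostPath type. An
+// empty pathType preserves today's behavior: create a directory if nothing
+// exists at the path.
+func checkType(path, pathType string) error {
+	info, err := os.Stat(path)
+	if os.IsNotExist(err) {
+		if pathType == "" {
+			return os.MkdirAll(path, 0755)
+		}
+		return fmt.Errorf("hostPath %q does not exist (type %q)", path, pathType)
+	}
+	if err != nil {
+		return err
+	}
+	mode := info.Mode()
+	switch pathType {
+	case "", "exists":
+		return nil
+	case "directory":
+		if mode.IsDir() {
+			return nil
+		}
+	case "file":
+		if mode.IsRegular() {
+			return nil
+		}
+	case "socket":
+		if mode&os.ModeSocket != 0 {
+			return nil
+		}
+	case "device":
+		// Covers both block and character devices.
+		if mode&(os.ModeDevice|os.ModeCharDevice) != 0 {
+			return nil
+		}
+	default:
+		return fmt.Errorf("unknown hostPath type %q", pathType)
+	}
+	return fmt.Errorf("hostPath %q is not of type %q", path, pathType)
+}
+```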
+
+
+## Possible concerns
+
+### Permissions
+
+This proposal does not attempt to change the state of volume permissions. Currently, a HostPath volume is created with `root` ownership and `755` permissions. This behavior will be retained. An argument for this behavior is given [here](volumes.md#shared-storage-hostpath).
+
+### SELinux
+
+This proposal should not impact SELinux relabeling. Verifying the presence and
+type of a given path will be logically separate from SELinux labeling.
+Similarly, creating the directory when it doesn't exist will happen before any
+SELinux operations and should not impact it.
+
+
+### Containerized Kubelet
+
+A containerized kubelet would have difficulty creating directories. The
+implementation will likely respect the `containerized` flag, or similar,
+allowing it to either break out or be "/rootfs/" aware and thus operate as
+desired.
+
+### Racy Validation
+
+Ideally the validation would be done at the time the bindmounts are created,
+else it's possible for a given path or directory to change in the duration from
+when it's validated and the container runtime attempts to create said mount.
+
+The only way to solve this problem is to integrate these sorts of qualification
+into container runtimes themselves.
+
+I don't think this problem is severe enough that we need to push to solve it;
+rather I think we can simply accept this minor race, and if runtimes eventually
+allow this we can begin to leverage them.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-hostpath-qualifiers.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-ownership-management.md b/contributors/design-proposals/volume-ownership-management.md
new file mode 100644
index 00000000..d08c491c
--- /dev/null
+++ b/contributors/design-proposals/volume-ownership-management.md
@@ -0,0 +1,108 @@
+## Volume plugins and idempotency
+
+Currently, volume plugins have a `SetUp` method which is called in the context of a higher-level
+workflow within the kubelet which has externalized the problem of managing the ownership of volumes.
+This design has a number of drawbacks that can be mitigated by completely internalizing all concerns
+of volume setup behind the volume plugin `SetUp` method.
+
+### Known issues with current externalized design
+
+1. The ownership management is currently repeatedly applied, which breaks packages that require
+ special permissions in order to work correctly
+2. There is a gap between files being mounted/created by volume plugins and when their ownership
+ is set correctly; race conditions exist around this
+3. Solving the correct application of ownership management in an externalized model is difficult
+ and makes it clear that a transaction boundary is being broken by the externalized design
+
+### Additional issues with externalization
+
+Fully externalizing any one concern of volumes is difficult for a number of reasons:
+
+1. Many types of idempotence checks exist, and are used in a variety of combinations and orders
+2. Workflow in the kubelet becomes much more complex to handle:
+ 1. composition of plugins
+ 2. correct timing of application of ownership management
+ 3. callback to volume plugins when we know the whole `SetUp` flow is complete and correct
+ 4. callback to touch sentinel files
+ 5. etc etc
+3. We want to support fully external volume plugins -- would require complex orchestration / chatty
+ remote API
+
+## Proposed implementation
+
+Since all of the ownership information is known in advance of the call to the volume plugin `SetUp`
+method, we can easily internalize these concerns into the volume plugins and pass the ownership
+information to `SetUp`.
+
+The volume `Builder` interface's `SetUp` method changes to accept the group that should own the
+volume. Plugins become responsible for ensuring that the correct group is applied. The volume
+`Attributes` struct can be modified to remove the `SupportsOwnershipManagement` field.
+
+```go
+package volume
+
+type Builder interface {
+ // other methods omitted
+
+ // SetUp prepares and mounts/unpacks the volume to a self-determined
+ // directory path and returns an error. The group ID that should own the volume
+ // is passed as a parameter. Plugins may choose to ignore the group ID directive
+ // in the event that they do not support it (example: NFS). A group ID of -1
+ // indicates that the group ownership of the volume should not be modified by the plugin.
+ //
+ // SetUp will be called multiple times and should be idempotent.
+ SetUp(gid int64) error
+}
+```
+
+Each volume plugin will have to change to support the new `SetUp` signature. The existing
+ownership management code will be refactored into a library that volume plugins can use:
+
+```go
+package volume
+
+func ManageOwnership(path string, fsGroup int64) error {
+ // 1. recursive chown of path
+ // 2. make path +setgid
+}
+```
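+
+A rough sketch of what that library function might do, assuming a straightforward recursive walk (error handling and
+permission-bit handling are simplified; this is not the definitive implementation):
+
+```go
+package volume
+
+import (
+	"os"
+	"path/filepath"
+)
+
+// ManageOwnership recursively sets the group of everything under path to
+// fsGroup and marks directories setgid so newly created files inherit it.
+func ManageOwnership(path string, fsGroup int64) error {
+	return filepath.Walk(path, func(p string, info os.FileInfo, err error) error {
+		if err != nil {
+			return err
+		}
+		// 1. Recursive chown: keep the owner, change only the group.
+		if err := os.Chown(p, -1, int(fsGroup)); err != nil {
+			return err
+		}
+		// 2. Make directories setgid so new files inherit fsGroup.
+		if info.IsDir() {
+			return os.Chmod(p, info.Mode()|os.ModeSetgid)
+		}
+		return nil
+	})
+}
+```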
+
+The workflow from the Kubelet's perspective for handling volume setup and refresh becomes:
+
+```go
+// go-ish pseudocode
+func mountExternalVolumes(pod) error {
+ podVolumes := make(kubecontainer.VolumeMap)
+ for i := range pod.Spec.Volumes {
+ volSpec := &pod.Spec.Volumes[i]
+ var fsGroup int64 = 0
+ if pod.Spec.SecurityContext != nil &&
+ pod.Spec.SecurityContext.FSGroup != nil {
+ fsGroup = *pod.Spec.SecurityContext.FSGroup
+ } else {
+ fsGroup = -1
+ }
+
+ // Try to use a plugin for this volume.
+ plugin := volume.NewSpecFromVolume(volSpec)
+ builder, err := kl.newVolumeBuilderFromPlugins(plugin, pod)
+ if err != nil {
+ return err
+ }
+ if builder == nil {
+ return errUnsupportedVolumeType
+ }
+
+ err = builder.SetUp(fsGroup)
+ if err != nil {
+ return err
+ }
+ }
+
+ return nil
+}
+```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-ownership-management.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-provisioning.md b/contributors/design-proposals/volume-provisioning.md
new file mode 100644
index 00000000..f8202fbe
--- /dev/null
+++ b/contributors/design-proposals/volume-provisioning.md
@@ -0,0 +1,500 @@
+## Abstract
+
+Real Kubernetes clusters have a variety of volumes which differ widely in
+size, iops performance, retention policy, and other characteristics.
+Administrators need a way to dynamically provision volumes of these different
+types to automatically meet user demand.
+
+A new mechanism called 'storage classes' is proposed to provide this
+capability.
+
+## Motivation
+
+In Kubernetes 1.2, an alpha form of limited dynamic provisioning was added
+that allows a single volume type to be provisioned in clouds that offer
+special volume types.
+
+In Kubernetes 1.3, a label selector was added to persistent volume claims to
+allow administrators to create a taxonomy of volumes based on the
+characteristics important to them, and to allow users to make claims on those
+volumes based on those characteristics. This allows flexibility when claiming
+existing volumes; the same flexibility is needed when dynamically provisioning
+volumes.
+
+After gaining experience with dynamic provisioning after the 1.2 release, we
+want to create a more flexible feature that allows configuration of how
+different storage classes are provisioned and supports provisioning multiple
+types of volumes within a single cloud.
+
+### Out-of-tree provisioners
+
+One of our goals is to enable administrators to create out-of-tree
+provisioners, that is, provisioners whose code does not live in the Kubernetes
+project.
+
+## Design
+
+This design represents the minimally viable changes required to provision based on storage class configuration. Additional incremental features may be added as a separate effort.
+
+We propose that:
+
+1. Both for in-tree and out-of-tree storage provisioners, the PV created by the
+ provisioners must match the PVC that led to its creation. If a provisioner
+ is unable to provision such a matching PV, it reports an error to the
+ user.
+
+2. The above point applies also to PVC label selector. If user submits a PVC
+ with a label selector, the provisioner must provision a PV with matching
+ labels. This directly implies that the provisioner understands the meaning
+ behind these labels - if a user submits a claim with a selector that wants
+ a PV with label "region" not in "[east,west]", the provisioner must
+ understand what the "region" label means, know what regions are available, and
+ choose one, e.g. "north".
+
+ In other words, provisioners should either refuse to provision a volume for
+ a PVC that has a selector, or select few labels that are allowed in
+ selectors (such as the "region" example above), implement necessary logic
+ for their parsing, document them and refuse any selector that references
+ unknown labels.
+
+3. An api object will be incubated in storage.k8s.io/v1beta1 to hold the `StorageClass`
+ API resource. Each StorageClass object contains parameters required by the provisioner to provision volumes of that class. These parameters are opaque to the user. (A hypothetical example object is sketched after this list.)
+
+4. `PersistentVolume.Spec.Class` attribute is added to volumes. This attribute
+ is optional and specifies which `StorageClass` instance represents
+ storage characteristics of a particular PV.
+
+ During incubation, `Class` is an annotation and not
+ actual attribute.
+
+5. `PersistentVolume` instances do not require labels by the provisioner.
+
+6. `PersistentVolumeClaim.Spec.Class` attribute is added to claims. This
+ attribute specifies that only a volume with equal
+ `PersistentVolume.Spec.Class` value can satisfy a claim.
+
+ During incubation, `Class` is just an annotation and not
+ actual attribute.
+
+7. The existing provisioner plugin implementations be modified to accept
+ parameters as specified via `StorageClass`.
+
+8. The persistent volume controller modified to invoke provisioners using `StorageClass` configuration and bind claims with `PersistentVolumeClaim.Spec.Class` to volumes with equivalent `PersistentVolume.Spec.Class`
+
+9. The existing alpha dynamic provisioning feature be phased out in the
+ next release.
+
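+For illustration, a hypothetical `StorageClass` object might look like the following; the provisioner name and parameters
+are examples only, and the parameters are opaque to Kubernetes:
+
+```yaml
+apiVersion: storage.k8s.io/v1beta1
+kind: StorageClass
+metadata:
+  name: fast
+provisioner: kubernetes.io/gce-pd
+parameters:
+  # Interpreted only by the named provisioner.
+  type: pd-ssd
+```
+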
+### Controller workflow for provisioning volumes
+
+0. The Kubernetes administrator can configure the name of a default StorageClass. This
+ StorageClass instance is then used when a user requests a dynamically
+ provisioned volume but does not specify a StorageClass. In other words,
+ `claim.Spec.Class == ""`
+ (or annotation `volume.beta.kubernetes.io/storage-class == ""`).
+
+1. When a new claim is submitted, the controller attempts to find an existing
+ volume that will fulfill the claim.
+
+ 1. If the claim has non-empty `claim.Spec.Class`, only PVs with the same
+ `pv.Spec.Class` are considered.
+
+ 2. If the claim has empty `claim.Spec.Class`, only PVs with an unset `pv.Spec.Class` are considered.
+
+ All "considered" volumes are evaluated and the
+ smallest matching volume is bound to the claim.
+
+2. If no volume is found for the claim and `claim.Spec.Class` is not set or is
+ an empty string, dynamic provisioning is disabled.
+
+3. If `claim.Spec.Class` is set, the controller tries to find an instance of StorageClass with this name. If no
+ such StorageClass is found, the controller goes back to step 1 and
+ periodically retries finding a matching volume or storage class until
+ a match is found. The claim is `Pending` during this period.
+
+4. With StorageClass instance, the controller updates the claim:
+ * `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner`
+
+* **In-tree provisioning**
+
+ The controller tries to find an internal volume plugin referenced by
+ `storageClass.Provisioner`. If it is found:
+
+ 5. The internal provisioner implements the interface `ProvisionableVolumePlugin`,
+ which has a method called `NewProvisioner` that returns a new provisioner.
+
+ 6. The controller calls volume plugin `Provision` with Parameters
+ from the `StorageClass` configuration object.
+
+ 7. If `Provision` returns an error, the controller generates an event on the
+ claim and goes back to step 1., i.e. it will retry provisioning
+ periodically.
+
+ 8. If `Provision` returns no error, the controller creates the returned
+ `api.PersistentVolume`, fills its `Class` attribute with `claim.Spec.Class`
+ and creates it already bound to the claim.
+
+ 1. If the create operation for the `api.PersistentVolume` fails, it is
+ retried
+
+ 2. If the create operation does not succeed in reasonable time, the
+ controller attempts to delete the provisioned volume and creates an event
+ on the claim
+
+Existing behavior is unchanged for claims that do not specify
+`claim.Spec.Class`.
+
+* **Out of tree provisioning**
+
+ Following step 4. above, the controller tries to find an internal plugin for the
+ `StorageClass`. If none is found, it does nothing; it just
+ periodically goes back to step 1., i.e. tries to find an available matching PV.
+
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+ "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
+ interpreted as described in RFC 2119.
+
+ External provisioner must have these features:
+
+ * It MUST have a distinct name, following the Kubernetes plugin naming scheme
+ `<vendor name>/<provisioner name>`, e.g. `gluster.org/gluster-volume`.
+
+ * The provisioner SHOULD send events on a claim to report any errors
+ related to provisioning a volume for the claim. This way, users get the same
+ experience as with internal provisioners.
+
+ * The provisioner MUST also implement a deleter. It must be able to delete
+ storage assets it created. It MUST NOT assume that any other internal or
+ external plugin is present.
+
+ The external provisioner runs in a separate process which watches claims, be
+ it an external storage appliance, a daemon or a Kubernetes pod. For every
+ claim creation or update, it implements these steps:
+
+ 1. The provisioner inspects if
+ `claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] == <provisioner name>`.
+ All other claims MUST be ignored.
+
+ 2. The provisioner MUST check that the claim is unbound, i.e. its
+ `claim.Spec.VolumeName` is empty. Bound volumes MUST be ignored.
+
+ *Race condition when the provisioner provisions a new PV for a claim and
+ at the same time Kubernetes binds the same claim to another PV that was
+ just created by admin is discussed below.*
+
+ 3. It tries to find a StorageClass instance referenced by annotation
+ `claim.Annotations["volume.beta.kubernetes.io/storage-class"]`. If not
+ found, it SHOULD report an error (by sending an event to the claim) and it
+ SHOULD retry periodically from step 1.
+
+ 4. The provisioner MUST parse arguments in the `StorageClass` and
+ `claim.Spec.Selector` and provisions appropriate storage asset that matches
+ both the parameters and the selector.
+ When it encounters unknown parameters in `storageClass.Parameters` or
+ `claim.Spec.Selector` or the combination of these parameters is impossible
+ to achieve, it SHOULD report an error and it MUST NOT provision a volume.
+ All errors found during parsing or provisioning SHOULD be sent as events
+ on the claim and the provisioner SHOULD retry periodically from step 1.
+
+ As parsing (and understanding) claim selectors is hard, the sentence
+ "MUST parse ... `claim.Spec.Selector`" will in typical case lead to simple
+ refusal of claims that have any selector:
+
+ ```go
+ if pvc.Spec.Selector != nil {
+ return Error("can't parse PVC selector!")
+ }
+ ```
+
+ 5. When the volume is provisioned, the provisioner MUST create a new PV
+ representing the storage asset and save it in Kubernetes. When this fails,
+ it SHOULD retry creating the PV a few times. If all attempts fail, it
+ MUST delete the storage asset. All errors SHOULD be sent as events to the
+ claim.
+
+ The created PV MUST have these properties:
+
+ * `pv.Spec.ClaimRef` MUST point to the claim that led to its creation
+ (including the claim UID).
+
+ *This way, the PV will be bound to the claim.*
+
+ * `pv.Annotations["pv.kubernetes.io/provisioned-by"]` MUST be set to name
+ of the external provisioner. This provisioner will be used to delete the
+ volume.
+
+ *The provisioner/deleter should not assume there is any other
+ provisioner/deleter available that would delete the volume.*
+
+ * `pv.Annotations["volume.beta.kubernetes.io/storage-class"]` MUST be set
+ to name of the storage class requested by the claim.
+
+ *So the created PV matches the claim.*
+
+ * The provisioner MAY store any other information to the created PV as
+ annotations. It SHOULD save any information that is needed to delete the
+ storage asset there, as appropriate StorageClass instance may not exist
+ when the volume will be deleted. However, references to Secret instance
+ or direct username/password to a remote storage appliance MUST NOT be
+ stored there, see issue #34822.
+
+ * `pv.Labels` MUST be set to match `claim.spec.selector`. The provisioner
+ MAY add additional labels.
+
+ *So the created PV matches the claim.*
+
+ * `pv.Spec` MUST be set to match requirements in `claim.Spec`, especially
+ access mode and PV size. The provisioned volume size MUST NOT be smaller
+ than size requested in the claim, however it MAY be larger.
+
+ *So the created PV matches the claim.*
+
+ * `pv.Spec.PersistentVolumeSource` MUST be set to point to the created
+ storage asset.
+
+ * `pv.Spec.PersistentVolumeReclaimPolicy` SHOULD be set to `Delete` unless
+ user manually configures other reclaim policy.
+
+ * `pv.Name` MUST be unique. Internal provisioners use a name based on
+ `claim.UID` so that a conflict occurs when two provisioners accidentally
+ provision a PV for the same claim; external provisioners can use
+ any mechanism to generate a unique PV name.
+
+ Example of a claim that is to be provisioned by an external provisioner for
+ `foo.org/foo-volume`:
+
+ ```yaml
+ apiVersion: v1
+ kind: PersistentVolumeClaim
+ metadata:
+ annotations:
+ volume.beta.kubernetes.io/storage-class: myClass
+ volume.beta.kubernetes.io/storage-provisioner: foo.org/foo-volume
+ name: fooclaim
+ namespace: default
+ resourceVersion: "53"
+ uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
+ spec:
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 4Gi
+ # volumeName: must be empty!
+ ```
+
+ Example of the created PV:
+
+ ```yaml
+ apiVersion: v1
+ kind: PersistentVolume
+ metadata:
+ annotations:
+ pv.kubernetes.io/provisioned-by: foo.org/foo-volume
+ volume.beta.kubernetes.io/storage-class: myClass
+ foo.org/provisioner: "any other annotations as needed"
+ labels:
+ foo.org/my-label: "any labels as needed"
+ generateName: "foo-volume-"
+ spec:
+ accessModes:
+ - ReadWriteOnce
+ awsElasticBlockStore:
+ fsType: ext4
+ volumeID: aws://us-east-1d/vol-de401a79
+ capacity:
+ storage: 4Gi
+ claimRef:
+ apiVersion: v1
+ kind: PersistentVolumeClaim
+ name: fooclaim
+ namespace: default
+ resourceVersion: "53"
+ uid: 5a294561-7e5b-11e6-a20e-0eb6048532a3
+ persistentVolumeReclaimPolicy: Delete
+ ```
+
+   As a result, Kubernetes has a PV that represents the storage asset and is
+   bound to the claim. When everything goes well, Kubernetes has completed
+   the binding of the claim to the PV.
+
+   Kubernetes was not blocked in any way during the provisioning: it could
+   have bound the claim to another PV created by the user, or the claim may
+   even have been deleted by the user. In both cases, Kubernetes will mark
+   the PV to be deleted using the protocol below.
+
+   The external provisioner MAY save any annotations to the claim that is
+   provisioned; however, the claim may be modified or even deleted by the
+   user at any time.
+
+
+### Controller workflow for deleting volumes
+
+When the controller decides that a volume should be deleted it performs these
+steps:
+
+1. The controller changes `pv.Status.Phase` to `Released`.
+
+2. The controller looks for `pv.Annotations["pv.kubernetes.io/provisioned-by"]`.
+ If found, it uses this provisioner/deleter to delete the volume.
+
+3. If the volume is not annotated by `pv.kubernetes.io/provisioned-by`, the
+ controller inspects `pv.Spec` and finds in-tree deleter for the volume.
+
+4. If the deleter found in step 2 or 3 is internal, the controller calls it
+   and deletes the storage asset together with the PV that represents it.
+
+5. If the deleter is not known to Kubernetes, the controller does nothing.
+
+6. External deleters MUST watch for PV changes. When
+ `pv.Status.Phase == Released && pv.Annotations['pv.kubernetes.io/provisioned-by'] == <deleter name>`,
+ the deleter:
+
+   * It MUST check the reclaim policy of the PV and ignore all PVs whose
+     `Spec.PersistentVolumeReclaimPolicy` is not `Delete`.
+
+ * It MUST delete the storage asset.
+
+   * It MUST delete the PV object in Kubernetes only after the storage
+     asset has been successfully deleted.
+
+   * Any error SHOULD be sent as an event on the PV being deleted, and the
+     deleter SHOULD retry deleting the volume periodically.
+
+   * The deleter SHOULD NOT use any information from the StorageClass
+     instance referenced by the PV. This is different from internal deleters,
+     which need the StorageClass instance to be present at the time of
+     deletion in order to read Secret instances (see the Gluster provisioner
+     for example); however, we would like to phase out this behavior.
+
+   Note that watching `pv.Status` has been frowned upon in the past; however,
+   in this particular case we could use it quite reliably to trigger deletion.
+   It is not trivial to find out whether a PV is no longer needed and should
+   be deleted. *Alternatively, an annotation could be used.* A sketch of the
+   check an external deleter would perform is shown below.
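+
+A minimal sketch of that check, using simplified stand-in types rather than the
+real Kubernetes client objects (an actual external deleter would watch
+PersistentVolumes through the API server):
+
+```go
+package main
+
+import "fmt"
+
+// Simplified stand-ins for the PV fields the deleter needs; a real deleter
+// would use the Kubernetes API types and a watch on PersistentVolumes.
+type PersistentVolume struct {
+	Name          string
+	Phase         string            // pv.Status.Phase
+	ReclaimPolicy string            // pv.Spec.PersistentVolumeReclaimPolicy
+	Annotations   map[string]string // pv.Annotations
+}
+
+const provisionedByAnnotation = "pv.kubernetes.io/provisioned-by"
+
+// shouldDelete reports whether this deleter (identified by deleterName) is
+// responsible for deleting the storage asset behind the given PV.
+func shouldDelete(pv PersistentVolume, deleterName string) bool {
+	if pv.Phase != "Released" {
+		return false // only released volumes are candidates for deletion
+	}
+	if pv.Annotations[provisionedByAnnotation] != deleterName {
+		return false // another provisioner/deleter owns this volume
+	}
+	if pv.ReclaimPolicy != "Delete" {
+		return false // Retain/Recycle volumes MUST be ignored
+	}
+	return true
+}
+
+func main() {
+	pv := PersistentVolume{
+		Name:          "foo-volume-abc",
+		Phase:         "Released",
+		ReclaimPolicy: "Delete",
+		Annotations:   map[string]string{provisionedByAnnotation: "foo.org/foo-volume"},
+	}
+	fmt.Println(shouldDelete(pv, "foo.org/foo-volume")) // true
+}
+```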
+
+### Security considerations
+
+Both internal and external provisioners and deleters may need access to
+credentials (e.g. username+password) of an external storage appliance to
+provision and delete volumes.
+
+* For internal provisioners, a Secret instance in a well secured namespace
+should be used. A pointer to the Secret instance shall be a parameter of the
+StorageClass, and it MUST NOT be copied around the system, e.g. in annotations
+of PVs. See issue #34822.
+
+* External provisioners running in a pod should have the appropriate
+credentials mounted as a Secret inside the pods that run the provisioner. The
+namespace with the pods and the Secret instance should be well secured.
+
+### `StorageClass` API
+
+A new API group should hold the API for storage classes, following the pattern
+of autoscaling, metrics, etc. To allow for future storage-related APIs, we
+should call this new API group `storage.k8s.io` and incubate it in
+`storage.k8s.io/v1beta1`.
+
+Storage classes will be represented by an API object called `StorageClass`:
+
+```go
+package storage
+
+// StorageClass describes the parameters for a class of storage for
+// which PersistentVolumes can be dynamically provisioned.
+//
+// StorageClasses are non-namespaced; the name of the storage class
+// according to etcd is in ObjectMeta.Name.
+type StorageClass struct {
+ unversioned.TypeMeta `json:",inline"`
+ ObjectMeta `json:"metadata,omitempty"`
+
+ // Provisioner indicates the type of the provisioner.
+ Provisioner string `json:"provisioner,omitempty"`
+
+ // Parameters for dynamic volume provisioner.
+ Parameters map[string]string `json:"parameters,omitempty"`
+}
+
+```
+
+`PersistentVolumeClaimSpec` and `PersistentVolumeSpec` both get a `Class` attribute
+(the existing annotation is used during incubation):
+
+```go
+type PersistentVolumeClaimSpec struct {
+ // Name of requested storage class. If non-empty, only PVs with this
+ // pv.Spec.Class will be considered for binding and if no such PV is
+ // available, StorageClass with this name will be used to dynamically
+ // provision the volume.
+ Class string
+...
+}
+
+type PersistentVolumeSpec struct {
+ // Name of StorageClass instance that this volume belongs to.
+ Class string
+...
+}
+```
+
+Storage classes are natural to think of as a global resource, since they:
+
+1. Align with PersistentVolumes, which are a global resource
+2. Are administrator controlled
+
+### Provisioning configuration
+
+With the scheme outlined above, the provisioner creates PVs using the parameters specified in the `StorageClass` object.
+
+### Provisioner interface changes
+
+The `volume.VolumeOptions` struct (containing the parameters handed to a provisioner plugin)
+will be extended to carry `StorageClass.Parameters`.
+
+The existing provisioner implementations will be modified to accept the StorageClass configuration object.
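+
+A rough sketch of the extension, with the pre-existing fields elided (the field
+layout here is illustrative, not the final API; only the new `Parameters` field
+is prescribed by this proposal):
+
+```go
+package volume
+
+// VolumeOptions carries the parameters handed to a provisioner plugin.
+type VolumeOptions struct {
+	// ... existing fields (capacity, access modes, etc.) ...
+
+	// Parameters is copied verbatim from StorageClass.Parameters and is
+	// interpreted by the individual provisioner plugin.
+	Parameters map[string]string
+}
+```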
+
+### PV Controller Changes
+
+The persistent volume controller will be modified to implement the new
+workflow described in this proposal. The changes will be limited to the
+`provisionClaimOperation` method, which is responsible for invoking the
+provisioner, and to favoring existing volumes before provisioning a new one.
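+
+A high-level sketch of the modified flow, again with simplified stand-in types
+(the real method operates on the Kubernetes API objects and the registered
+volume plugins):
+
+```go
+package controller
+
+import "fmt"
+
+// Simplified stand-ins; the real controller works with the Kubernetes API
+// objects and the volume plugin manager.
+type StorageClass struct {
+	Provisioner string
+	Parameters  map[string]string
+}
+
+type Claim struct {
+	ClassName string // taken from the storage-class annotation during incubation
+}
+
+type PV struct{ Name string }
+
+type Provisioner interface {
+	Provision(parameters map[string]string, claim Claim) (PV, error)
+}
+
+// provisionClaimOperation favors existing volumes and only provisions a new
+// one when no suitable PV is found.
+func provisionClaimOperation(
+	claim Claim,
+	findMatchingPV func(Claim) *PV,
+	classes map[string]StorageClass,
+	plugins map[string]Provisioner, // keyed by StorageClass.Provisioner
+) (PV, error) {
+	if pv := findMatchingPV(claim); pv != nil {
+		return *pv, nil // an existing volume satisfies the claim
+	}
+	class, ok := classes[claim.ClassName]
+	if !ok {
+		return PV{}, fmt.Errorf("storage class %q not found", claim.ClassName)
+	}
+	plugin, ok := plugins[class.Provisioner]
+	if !ok {
+		// External provisioners are handled by the annotation protocol
+		// described earlier; internal ones must be registered here.
+		return PV{}, fmt.Errorf("no plugin for provisioner %q", class.Provisioner)
+	}
+	return plugin.Provision(class.Parameters, claim)
+}
+```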
+
+## Examples
+
+### AWS provisioners with distinct QoS
+
+This example shows two storage classes, "aws-fast" and "aws-slow".
+
+```yaml
+apiVersion: v1
+kind: StorageClass
+metadata:
+ name: aws-fast
+provisioner: kubernetes.io/aws-ebs
+parameters:
+ zone: us-east-1b
+ type: ssd
+---
+apiVersion: v1
+kind: StorageClass
+metadata:
+ name: aws-slow
+provisioner: kubernetes.io/aws-ebs
+parameters:
+ zone: us-east-1b
+ type: spinning
+```
+
+## Additional Implementation Details
+
+0. Annotation `volume.alpha.kubernetes.io/storage-class` is used instead of `claim.Spec.Class` and `volume.Spec.Class` during incubation.
+
+1. `claim.Spec.Selector` and `claim.Spec.Class` are mutually exclusive for now (1.4). A user can either match existing volumes with `Selector`, or match existing volumes with `Class` (and get dynamic provisioning by using `Class`), but not both. This simplifies the initial PR and also the provisioners. This limitation may be lifted in future releases. A minimal validation sketch of this restriction follows below.
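+
+A minimal sketch of that mutual-exclusion check, using simplified stand-in
+types (the real check lives in API validation and reads the incubation
+annotation):
+
+```go
+package validation
+
+import "errors"
+
+// Simplified stand-in for the claim fields the check needs.
+type ClaimSpec struct {
+	Selector map[string]string // nil when no selector is set
+	Class    string            // storage class (annotation during incubation)
+}
+
+// validateClassAndSelector enforces the 1.4 restriction that a claim may use
+// either a label selector or a storage class, but not both.
+func validateClassAndSelector(spec ClaimSpec) error {
+	if spec.Selector != nil && spec.Class != "" {
+		return errors.New("claim selector and storage class are mutually exclusive")
+	}
+	return nil
+}
+```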
+
+## Cloud Providers
+
+Since the `volume.alpha.kubernetes.io/storage-class` annotation is in use, a `StorageClass` must be defined to support provisioning. No default is assumed as before.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-provisioning.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-selectors.md b/contributors/design-proposals/volume-selectors.md
new file mode 100644
index 00000000..c1915f99
--- /dev/null
+++ b/contributors/design-proposals/volume-selectors.md
@@ -0,0 +1,268 @@
+## Abstract
+
+Real Kubernetes clusters have a variety of volumes which differ widely in
+size, iops performance, retention policy, and other characteristics. A
+mechanism is needed to enable administrators to describe the taxonomy of these
+volumes, and for users to make claims on these volumes based on their
+attributes within this taxonomy.
+
+A label selector mechanism is proposed to enable flexible selection of volumes
+by persistent volume claims.
+
+## Motivation
+
+Currently, users of persistent volumes have the ability to make claims on
+those volumes based on some criteria such as the access modes the volume
+supports and minimum resources offered by a volume. In an organization, there
+are often more complex requirements for the storage volumes needed by
+different groups of users. A mechanism is needed to model these different
+types of volumes and to allow users to select those different types without
+being intimately familiar with their underlying characteristics.
+
+As an example, many cloud providers offer a range of performance
+characteristics for storage, with higher performing storage being more
+expensive. Cluster administrators want the ability to:
+
+1. Invent a taxonomy of logical storage classes using the attributes
+ important to them
+2. Allow users to make claims on volumes using these attributes
+
+## Constraints and Assumptions
+
+The proposed design should:
+
+1. Deal with manually-created volumes
+2. Not necessarily require users to know or understand the differences between
+   volumes (i.e., Kubernetes should not dictate any particular set of
+   characteristics for administrators to think in terms of)
+
+We will focus **only** on the barest mechanisms to describe and implement
+label selectors in this proposal. We will address the following topics in
+future proposals:
+
+1. An extension resource or third party resource for storage classes
+1. Dynamically provisioning new volumes based on storage class
+
+## Use Cases
+
+1. As a user, I want to be able to make a claim on a persistent volume by
+ specifying a label selector as well as the currently available attributes
+
+### Use Case: Taxonomy of Persistent Volumes
+
+Kubernetes offers volume types for a variety of storage systems. Within each
+of those storage systems, there are numerous ways in which volume instances
+may differ from one another: iops performance, retention policy, etc.
+Administrators of real clusters typically need to manage a variety of
+different volumes with different characteristics for different groups of
+users.
+
+Kubernetes should make it possible for administrators to flexibly model the
+taxonomy of volumes in their clusters and to label volumes with their storage
+class. This capability must be optional and fully backward-compatible with
+the existing API.
+
+Let's look at an example. This example is *purely fictitious* and the
+taxonomies presented here are not a suggestion of any sort. In the case of
+AWS EBS there are four different types of volume (in ascending order of cost):
+
+1. Cold HDD
+2. Throughput optimized HDD
+3. General purpose SSD
+4. Provisioned IOPS SSD
+
+Currently, there is no way to distinguish between a group of 4 PVs where each
+volume is of one of these different types. Administrators need the ability to
+distinguish between instances of these types. An administrator might decide
+to think of these volumes as follows:
+
+1. Cold HDD - `tin`
+2. Throughput optimized HDD - `bronze`
+3. General purpose SSD - `silver`
+4. Provisioned IOPS SSD - `gold`
+
+This is not the only dimension that EBS volumes can differ in. Let's simplify
+things and imagine that AWS has two availability zones, `east` and `west`. Our
+administrators want to differentiate between volumes of the same type in these
+two zones, so they create a taxonomy of volumes like so:
+
+1. `tin-west`
+2. `tin-east`
+3. `bronze-west`
+4. `bronze-east`
+5. `silver-west`
+6. `silver-east`
+7. `gold-west`
+8. `gold-east`
+
+Another administrator of the same cluster might label things differently,
+choosing to focus on the business role of volumes. Say that the data
+warehouse department is the sole consumer of the cold HDD type, and the DB as
+a service offering is the sole consumer of provisioned IOPS volumes. The
+administrator might decide on the following taxonomy of volumes:
+
+1. `warehouse-east`
+2. `warehouse-west`
+3. `dbaas-east`
+4. `dbaas-west`
+
+There are any number of ways an administrator may choose to distinguish
+between volumes. Labels are used in Kubernetes to express the user-defined
+properties of API objects and are a good fit to express this information for
+volumes. In the examples above, administrators might differentiate between
+the classes of volumes using the labels `business-unit`, `volume-type`, or
+`region`.
+
+Label selectors are used through the Kubernetes API to describe relationships
+between API objects using flexible, user-defined criteria. It makes sense to
+use the same mechanism with persistent volumes and storage claims to provide
+the same functionality for these API objects.
+
+## Proposed Design
+
+We propose that:
+
+1. A new field called `Selector` be added to the `PersistentVolumeClaimSpec`
+ type
+2. The persistent volume controller be modified to account for this selector
+ when determining the volume to bind to a claim
+
+### Persistent Volume Selector
+
+Label selectors are used throughout the API to allow users to express
+relationships in a flexible manner. The problem of selecting a volume to
+match a claim fits perfectly within this metaphor. Adding a label selector to
+`PersistentVolumeClaimSpec` will allow users to label their volumes with
+criteria important to them and select volumes based on these criteria.
+
+```go
+// PersistentVolumeClaimSpec describes the common attributes of storage devices
+// and allows a Source for provider-specific attributes
+type PersistentVolumeClaimSpec struct {
+ // Contains the types of access modes required
+ AccessModes []PersistentVolumeAccessMode `json:"accessModes,omitempty"`
+ // Selector is a selector which must be true for the claim to bind to a volume
+	Selector *unversioned.LabelSelector `json:"selector,omitempty"`
+ // Resources represents the minimum resources required
+ Resources ResourceRequirements `json:"resources,omitempty"`
+ // VolumeName is the binding reference to the PersistentVolume backing this claim
+ VolumeName string `json:"volumeName,omitempty"`
+}
+```
+
+### Labeling volumes
+
+Volumes can already be labeled:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: ebs-pv-1
+ labels:
+ ebs-volume-type: iops
+ aws-availability-zone: us-east-1
+spec:
+ capacity:
+ storage: 100Gi
+ accessModes:
+ - ReadWriteMany
+ persistentVolumeReclaimPolicy: Retain
+ awsElasticBlockStore:
+ volumeID: vol-12345
+ fsType: xfs
+```
+
+### Controller Changes
+
+At the time of this writing, the various controllers for persistent volumes
+are in the process of being refactored into a single controller (see
+[kubernetes/24331](https://github.com/kubernetes/kubernetes/pull/24331)).
+
+The resulting controller should be modified to use the new
+`selector` field to match a claim to a volume. In order to
+match a volume, all criteria must be satisfied; i.e., if a label selector is
+specified on a claim, a volume must match both the label selector and any
+specified access modes and resource requirements to be considered a match.
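+
+A sketch of the selector part of that matching, using plain `matchLabels`-style
+maps for brevity (a real implementation would also handle `matchExpressions`
+and reuse the existing access-mode and capacity checks):
+
+```go
+package controller
+
+// selectorMatches reports whether every key/value pair requested by the
+// claim's matchLabels is present on the volume's labels. A claim with no
+// selector matches any volume, subject to the other criteria.
+func selectorMatches(matchLabels, volumeLabels map[string]string) bool {
+	for k, v := range matchLabels {
+		if volumeLabels[k] != v {
+			return false
+		}
+	}
+	return true
+}
+
+// claimMatchesVolume combines the selector with the pre-existing criteria;
+// accessModesOK and capacityOK stand in for the current checks.
+func claimMatchesVolume(matchLabels, volumeLabels map[string]string, accessModesOK, capacityOK bool) bool {
+	return selectorMatches(matchLabels, volumeLabels) && accessModesOK && capacityOK
+}
+```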
+
+## Examples
+
+Let's take a look at a few examples, revisiting the taxonomy of EBS volumes and regions:
+
+Volumes of the different types might be labeled as follows:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: ebs-pv-west
+ labels:
+ ebs-volume-type: iops-ssd
+ aws-availability-zone: us-west-1
+spec:
+ capacity:
+ storage: 150Gi
+ accessModes:
+ - ReadWriteMany
+ persistentVolumeReclaimPolicy: Retain
+ awsElasticBlockStore:
+ volumeID: vol-23456
+ fsType: xfs
+---
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: ebs-pv-east
+ labels:
+ ebs-volume-type: gp-ssd
+ aws-availability-zone: us-east-1
+spec:
+ capacity:
+ storage: 150Gi
+ accessModes:
+ - ReadWriteMany
+ persistentVolumeReclaimPolicy: Retain
+ awsElasticBlockStore:
+ volumeID: vol-34567
+ fsType: xfs
+```
+
+...claims on these volumes would look like:
+
+```yaml
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+ name: ebs-claim-west
+spec:
+ accessModes:
+ - ReadWriteMany
+ resources:
+ requests:
+ storage: 1Gi
+ selector:
+ matchLabels:
+ ebs-volume-type: iops-ssd
+ aws-availability-zone: us-west-1
+---
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+ name: ebs-claim-east
+spec:
+ accessModes:
+ - ReadWriteMany
+ resources:
+ requests:
+ storage: 1Gi
+ selector:
+ matchLabels:
+ ebs-volume-type: gp-ssd
+ aws-availability-zone: us-east-1
+```
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volume-selectors.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-snapshotting.md b/contributors/design-proposals/volume-snapshotting.md
new file mode 100644
index 00000000..e92ed3d1
--- /dev/null
+++ b/contributors/design-proposals/volume-snapshotting.md
@@ -0,0 +1,523 @@
+Kubernetes Snapshotting Proposal
+================================
+
+**Authors:** [Cindy Wang](https://github.com/ciwang)
+
+## Background
+
+Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
+
+Typical existing backup solutions offer on demand or scheduled snapshots.
+
+An application developer using a storage volume may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume.
+
+Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves.
+
+## Objectives
+
+For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals.
+
+* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers.
+
+ * Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods.
+
+* Goal 2: Expose standardized snapshotting operations Create and List in Kubernetes REST API.
+
+ * Nongoal: Support Delete and Restore snapshot operations in API.
+
+* Goal 3: Implement snapshotting interface for GCE PDs.
+
+ * Nongoal: Implement snapshotting interface for non GCE PD volumes.
+
+### Feature Roadmap
+
+Major features, in order of priority (bold features are priorities for v1):
+
+* **On demand snapshots**
+
+ * **API to create new snapshots and list existing snapshots**
+
+ * API to restore a disk from a snapshot and delete old snapshots
+
+* Scheduled snapshots
+
+* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node)
+
+## Requirements
+
+### Performance
+
+* Time SLA from issuing a snapshot to completion:
+
+  * The period we are interested in is the time between the scheduled snapshot time and the time the snapshot finishes uploading to its storage location.
+
+  * This should be on the order of a few minutes.
+
+### Reliability
+
+* Data corruption
+
+ * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons:
+
+ * GCE and Amazon can create snapshots while the application is running.
+
+ * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code.
+
+ * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown.
+
+* Snapshot failure
+
+ * Case: Failure during external process, such as during API call or upload
+
+ * Log error, retry until success (indefinitely)
+
+ * Case: Failure within Kubernetes, such as controller restarts
+
+ * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state.
+
+## Solution Overview
+
+Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects.
+
+* **Create:**
+
+ * Key: create.snapshot.volume.alpha.kubernetes.io
+
+ * Value: [snapshot name]
+
+* **List:**
+
+ * Key: snapshot.volume.alpha.kubernetes.io/[snapshot name]
+
+ * Value: [snapshot timestamp]
+
+A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotation will be populated by the controller and will only identify snapshots created for that PVC by Kubernetes.
+
+The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage).
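+
+For illustration, a sketch of how the controller could recognize a request and
+record a completed snapshot, with a plain map standing in for the PVC's
+annotations (the timestamp format is an assumption; the proposal only says
+"snapshot timestamp"):
+
+```go
+package snapshotcontroller
+
+import "time"
+
+const (
+	createSnapshotAnnotation = "create.snapshot.volume.alpha.kubernetes.io"
+	snapshotListPrefix       = "snapshot.volume.alpha.kubernetes.io/"
+)
+
+// wantsSnapshot reports whether a snapshot has been requested for the PVC;
+// the annotation value carries the requested snapshot name.
+func wantsSnapshot(annotations map[string]string) (name string, ok bool) {
+	name, ok = annotations[createSnapshotAnnotation]
+	return name, ok
+}
+
+// recordSnapshot writes the completed snapshot into the list annotations and
+// removes the request, which is how completion is signalled to the user.
+func recordSnapshot(annotations map[string]string, name string, completed time.Time) {
+	annotations[snapshotListPrefix+name] = completed.UTC().Format(time.RFC3339)
+	delete(annotations, createSnapshotAnnotation)
+}
+```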
+
+## Detailed Design
+
+### API
+
+* Create snapshot
+
+ * Usage:
+
+ * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter
+
+ * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list.
+
+ * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually.
+
+ * PVC control loop in master
+
+ * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV.
+
+ * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV.
+
+* List existing snapshots
+
+ * Only displayed as annotations on PVC object.
+
+ * Only lists unique names and timestamps of snapshots taken using the Kubernetes API.
+
+ * Usage:
+
+ * Get the PVC object
+
+ * Snapshots are listed as key-value pairs within the PVC annotations
+
+### SnapshotController
+
+![Snapshot Controller Diagram](volume-snapshotting.png?raw=true "Snapshot controller diagram")
+
+**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests.
+
+**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully.
+
+**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful.
+
+The controller will have a loop that does the following:
+
+* Fetch State
+
+ * Fetch all PVC objects from the API server.
+
+* Act
+
+ * Trigger snapshot:
+
+ * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation.
+
+* Persist State
+
+ * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC Annotations and delete the create snapshot annotation in the PVC object via the API server.
+
+Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the reconciler should spawn separate threads for these operations via the operation executor.
+
+The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor.
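+
+A sketch of the reconciler pass described above, with simplified stand-ins for
+the cache entries and the operation executor:
+
+```go
+package snapshotcontroller
+
+// request is a simplified entry of the in-memory SnapshotRequests cache.
+type request struct {
+	VolumeID     string
+	SnapshotName string
+}
+
+// executor abstracts the operation executor; Run returns false when an
+// operation for the volume is already in flight.
+type executor interface {
+	Run(volumeID string, op func() error) bool
+}
+
+// reconcile triggers an asynchronous snapshot for every pending request.
+// Requests stay in the cache until the operation completes successfully and
+// the create-snapshot annotation is removed, so failed attempts are retried
+// on the next pass.
+func reconcile(requests []request, exec executor, snapshot func(request) error) {
+	for _, req := range requests {
+		req := req // capture for the closure
+		exec.Run(req.VolumeID, func() error { return snapshot(req) })
+	}
+}
+```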
+
+### Create Snapshot Logic
+
+To create a snapshot:
+
+* Acquire the operation lock for the volume so that no other operations (such as attach, detach, or another snapshot) can be started for that volume (a sketch of this per-volume serialization appears after this list).
+
+ * Abort if there is already a pending operation for the specified volume (main loop will retry, if needed).
+
+* Spawn a new thread:
+
+    * Execute the volume-specific logic to create a snapshot of the persistent volume referenced by the PVC.
+
+ * For any errors, log the error, and terminate the thread (the main controller will retry as needed).
+
+ * Once a snapshot is created successfully:
+
+ * Make a call to the API server to delete the create snapshot annotation in the PVC object.
+
+ * Make a call to the API server to add the new snapshot ID/timestamp to the PVC Annotations.
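+
+A sketch of the per-volume serialization relied on above; a hypothetical
+operation executor that refuses to start a second operation for the same volume
+and runs the snapshot work in its own goroutine (names and structure are
+illustrative):
+
+```go
+package snapshotcontroller
+
+import "sync"
+
+// operationExecutor serializes operations per volume: a second snapshot
+// request for a volume that already has one in flight is rejected and will
+// be retried by the main loop.
+type operationExecutor struct {
+	mu      sync.Mutex
+	pending map[string]bool // unique volume ID -> operation in flight
+}
+
+func newOperationExecutor() *operationExecutor {
+	return &operationExecutor{pending: map[string]bool{}}
+}
+
+// Run starts op for volumeID in its own goroutine unless an operation for
+// that volume is already pending.
+func (e *operationExecutor) Run(volumeID string, op func() error) bool {
+	e.mu.Lock()
+	if e.pending[volumeID] {
+		e.mu.Unlock()
+		return false // already in flight; the caller retries later
+	}
+	e.pending[volumeID] = true
+	e.mu.Unlock()
+
+	go func() {
+		defer func() {
+			e.mu.Lock()
+			delete(e.pending, volumeID)
+			e.mu.Unlock()
+		}()
+		// Errors are only logged by the operation itself; the main loop
+		// retries because the create-snapshot annotation is still present.
+		_ = op()
+	}()
+	return true
+}
+```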
+
+*Brainstorming notes below, read at your own risk!*
+
+* * *
+
+
+Open questions:
+
+* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API?
+
+ * It seems that the API route is a bit more feasible in implementation and can also be fully utilized.
+
+ * Can the API call methods on VolumePlugins? Yeah via controller
+
+ * The scheduler gives users functionality that doesn’t already exist, but required adding an entirely new controller
+
+* Should the list and restore operations be part of v1?
+
+* Do we call them snapshots or backups?
+
+    * From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice it is necessary, but not sufficient, when conducting a backup of a stateful application."
+
+* At what minimum granularity should snapshots be allowed?
+
+* How do we store information about the most recent snapshot in case the controller restarts?
+
+* In case of error, do we err on the side of fewer or more snapshots?
+
+Snapshot Scheduler
+
+1. PVC API Object
+
+A new field, backupSchedule, will be added to the PVC API Object. The value of this field must be a cron expression.
+
+* CRUD operations on snapshot schedules
+
+ * Create: Specify a snapshot within a PVC spec as a [cron expression](http://crontab-generator.org/)
+
+ * The cron expression provides flexibility to decrease the interval between snapshots in future versions
+
+ * Read: Display snapshot schedule to user via kubectl get pvc
+
+ * Update: Do not support changing the snapshot schedule for an existing PVC
+
+ * Delete: Do not support deleting the snapshot schedule for an existing PVC
+
+ * In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name
+
+* Validation
+
+ * Cron expressions must have a 0 in the minutes place and use exact, not interval syntax
+
+ * [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, GCE PD takes at most minutes. Therefore for v1, we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals).
+
+ * If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user
+
+ Objective
+
+Goal: Enable automatic periodic snapshotting (NOTE: A snapshot is a read-only copy of a disk.) for all kubernetes volume plugins.
+
+Goal: Implement snapshotting interface for GCE PDs.
+
+Goal: Protect against data loss by allowing users to restore snapshots of their disks.
+
+Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes.
+
+Nongoal: Use snapshotting to provide additional features such as migration.
+
+ Background
+
+Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of a persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
+
+Currently, no container orchestration software (i.e. Kubernetes and its competitors) provide snapshot scheduling for application storage.
+
+Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes clear competitive advantage for users who want automatic snapshotting on their volumes, and particularly those who want to configure application-specific schedules.
+
+ what is the value case? Who wants this? What do we enable by implementing this?
+
+I think it introduces a lot of complexity, so what is the pay off? That should be clear in the document. Do mesos, or swarm or our competition implement this? AWS? Just curious.
+
+Requirements
+
+Functionality
+
+Should this support PVs, direct volumes, or both?
+
+Should we support deletion?
+
+Should we support restores?
+
+Automated schedule -- times or intervals? Before major event?
+
+Performance
+
+Snapshots are supposed to provide timely state freezing. What is the SLA from issuing one to it completing?
+
+* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting then remounted afterwards.
+
+ * Pending = uploading to GCE
+
+* EBS is the same, but if the volume is the root device the instance should be stopped before snapshotting
+
+Reliability
+
+How do we ascertain that deletions happen when we want them to?
+
+For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes.
+
+We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached certain age. At this point we do not see an immediate need to implement automatic deletion (re:Saad) but may want to revisit this.
+
+What happens when the snapshot fails as these are async operations?
+
+Retry (for some time period? indefinitely?) and log the error
+
+Other
+
+What is the UI for seeing the list of snapshots?
+
+In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (i.e. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots.
+
+Overview
+
+There are several design options for the design of each layer of implementation as follows.
+
+1. **Public API:**
+
+Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically. There are several options for where this specification can happen. In order from most to least invasive:
+
+ 1. New Volume API object
+
+ 1. Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects and its details are lost upon destruction of its enclosing object.
+
+ 2. We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot.
+
+ 3. The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object.
+
+ 4. Pros
+
+ 1. Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule.
+
+ 5. Cons
+
+ 2. Heavyweight codewise; involves changing and touching a lot of existing code.
+
+ 3. Potentially bad UX: How is the Volume API object created?
+
+ 1. By the user independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands.
+
+ 2. When a unique volume is specified in a pod or PV spec.
+
+ 2. Directly in volume definition in the pod/PV object
+
+ 6. When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
+
+ 7. Pros
+
+ 4. Easy for users to implement and understand
+
+ 8. Cons
+
+ 5. The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots. If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used.
+
+ 6. Replica sets have the same pod spec and support needs to be added so that underlying volume used does not create new snapshots for each member of the set.
+
+ 3. Only in PV object
+
+ 9. When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
+
+ 10. Pros
+
+ 7. Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property.
+
+ 11. Cons
+
+ 8. No support for direct volumes
+
+ 9. Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do and the same can be achieved with a simple cron job
+
+ 10. Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules.
+
+ 4. Annotations: key value pairs on API object
+
+ 12. User experience is the same as (b)
+
+ 13. Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], “ssTimes-vol2”: [2,17]} where the values are slices of integer values representing UTC hours.
+
+ 14. Pros
+
+ 11. Less invasive to the codebase than (a-c)
+
+ 15. Cons
+
+ 12. Same problems as (b-c) with non-unique resources. The only difference here is the API object representation.
+
+2. **Business logic:**
+
+ 5. Does this go on the master, node, or both?
+
+ 16. Where the snapshot is stored
+
+ 13. GCE, Amazon: cloud storage
+
+ 14. Others stored on volume itself (gluster) or external drive (iSCSI)
+
+ 17. Requirements for snapshot operation
+
+ 15. Application flush, sync, and fsfreeze before creating snapshot
+
+ 6. Suggestion:
+
+ 18. New SnapshotController on master
+
+ 16. Controller keeps a list of active pods/volumes, schedule for each, last snapshot
+
+ 17. If controller restarts and we miss a snapshot in the process, just skip it
+
+ 3. Alternatively, try creating the snapshot up to the time + retryPeriod (see 5)
+
+ 18. If snapshotting call fails, retry for an amount of time specified in retryPeriod
+
+ 19. Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep list of snapshot times, calculate time until next snapshot, and sleep for that period
+
+ 19. Logic to prepare the disk for snapshotting on node
+
+ 20. Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD)
+
+    7. Alternatives: logic entirely on the node
+
+ 20. Problems:
+
+ 21. If pod moves from one node to another
+
+ 4. A different node is in now in charge of snapshotting
+
+ 5. If the volume plugin requires external memory for snapshots, we need to move the existing data
+
+ 22. If the same pod exists on two different nodes, which node is in charge
+
+3. **Volume plugin interface/internal API:**
+
+ 8. Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin)
+
+ 9. When logic is triggered for a snapshot by the SnapshotController, the SnapshottableVolumePlugin calls out to volume plugin API to create snapshot
+
+ 10. Similar to volume.attach call
+
+4. **Other questions:**
+
+ 11. Snapshot period
+
+ 12. Time or period
+
+ 13. What is our SLO around time accuracy?
+
+ 21. Best effort, but no guarantees (depends on time or period) -- if going with time.
+
+ 14. What if we miss a snapshot?
+
+ 22. We will retry (assuming this means that we failed) -- take at the nearest next opportunity
+
+ 15. Will we know when an operation has failed? How do we report that?
+
+ 23. Get response from volume plugin API, log in kubelet log, generate Kube event in success and failure cases
+
+ 16. Will we be responsible for GCing old snapshots?
+
+ 24. Maybe this can be explicit non-goal, in the future can automate garbage collection
+
+ 17. If the pod dies do we continue creating snapshots?
+
+ 18. How to communicate errors (PD doesn’t support snapshotting, time period unsupported)
+
+ 19. Off schedule snapshotting like before an application upgrade
+
+ 20. We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this?
+
+Options, pros, cons, suggestion/recommendation
+
+Example 1b
+
+During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod’s associated volume.
+
+For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2):
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pd
+spec:
+  containers:
+  - image: gcr.io/google_containers/test-webserver
+    name: test-container
+    volumeMounts:
+    - mountPath: /test-pd
+      name: test-volume
+  volumes:
+  - name: test-volume
+    # This GCE PD must already exist.
+    gcePersistentDisk:
+      pdName: my-data-disk
+      fsType: ext4
+```
+
+Introduce a new field into the volume spec:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pd
+spec:
+  containers:
+  - image: gcr.io/google_containers/test-webserver
+    name: test-container
+    volumeMounts:
+    - mountPath: /test-pd
+      name: test-volume
+  volumes:
+  - name: test-volume
+    # This GCE PD must already exist.
+    gcePersistentDisk:
+      pdName: my-data-disk
+      fsType: ext4
+    ssTimes: [1, 5]    # new field: snapshot at 01:00 and 05:00 UTC
+```
+
+ Caveats
+
+* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because
+
+ * this does not provide value to the user and only adds an extra layer of indirection/complexity.
+
+ * ?
+
+ Dependencies
+
+* Kubernetes
+
+* Persistent volume snapshot support through API
+
+ * POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/volume-snapshotting.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/design-proposals/volume-snapshotting.png b/contributors/design-proposals/volume-snapshotting.png
new file mode 100644
index 00000000..1b1ea748
--- /dev/null
+++ b/contributors/design-proposals/volume-snapshotting.png
Binary files differ
diff --git a/contributors/design-proposals/volumes.md b/contributors/design-proposals/volumes.md
new file mode 100644
index 00000000..874dc2af
--- /dev/null
+++ b/contributors/design-proposals/volumes.md
@@ -0,0 +1,482 @@
+## Abstract
+
+A proposal for sharing volumes between containers in a pod using a special supplemental group.
+
+## Motivation
+
+Kubernetes volumes should be usable regardless of the UID a container runs as. This concern cuts
+across all volume types, so the system should handle it in a generalized way to provide
+uniform functionality across all volume types and lower the barrier to new plugins.
+
+Goals of this design:
+
+1. Enumerate the different use-cases for volume usage in pods
+2. Define the desired goal state for ownership and permission management in Kubernetes
+3. Describe the changes necessary to achieve desired state
+
+## Constraints and Assumptions
+
+1. When writing permissions in this proposal, `D` represents a don't-care value; example: `07D0`
+ represents permissions where the owner has `7` permissions, all has `0` permissions, and group
+ has a don't-care value
+2. Read-write usability of a volume from a container is defined as one of:
+ 1. The volume is owned by the container's effective UID and has permissions `07D0`
+ 2. The volume is owned by the container's effective GID or one of its supplemental groups and
+ has permissions `0D70`
+3. Volume plugins should not have to handle setting permissions on volumes
+4. Preventing two containers within a pod from reading and writing to the same volume (by choosing
+   different container UIDs) is not something we intend to support today
+5. We will not design to support multiple processes running in a single container as different
+   UIDs; use cases that require work by different UIDs should be divided into different pods for
+   each UID
+
+## Current State Overview
+
+### Kubernetes
+
+Kubernetes volumes can be divided into two broad categories:
+
+1. Unshared storage:
+ 1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
+ downward api. All volumes in this category delegate to `EmptyDir` for their underlying
+ storage. These volumes are created with ownership `root:root`.
+ 2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively
+ by a single pod*.
+2. Shared storage:
+ 1. `hostPath` is shared storage because it is necessarily used by a container and the host
+ 2. Network file systems such as NFS, Glusterfs, Cephfs, etc. For these volumes, the ownership
+ is determined by the configuration of the shared storage system.
+ 3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
+ they may be used simultaneously by multiple pods.
+
+The `EmptyDir` volume was recently modified to create the volume directory with `0777` permissions
+instead of `0750`, to support basic usability of that volume by a non-root UID.
+
+### Docker
+
+Docker recently added supplemental group support. This adds the ability to specify additional
+groups that a container should be part of, and will be released with Docker 1.8.
+
+There is a [proposal](https://github.com/docker/docker/pull/14632) to add a bind-mount flag to tell
+Docker to change the ownership of a volume to the effective UID and GID of a container, but this has
+not yet been accepted.
+
+### rkt
+
+rkt
+[image manifests](https://github.com/appc/spec/blob/master/spec/aci.md#image-manifest-schema) can
+specify users and groups, similarly to how a Docker image can. A rkt
+[pod manifest](https://github.com/appc/spec/blob/master/spec/pods.md#pod-manifest-schema) can also
+override the default user and group specified by the image manifest.
+
+rkt does not currently support supplemental groups or changing the owning UID or
+group of a volume, but it has been [requested](https://github.com/coreos/rkt/issues/1309).
+
+## Use Cases
+
+1. As a user, I want the system to set ownership and permissions on volumes correctly to enable
+ reads and writes with the following scenarios:
+ 1. All containers running as root
+ 2. All containers running as the same non-root user
+ 3. Multiple containers running as a mix of root and non-root users
+
+### All containers running as root
+
+For volumes that only need to be used by root, no action needs to be taken to change ownership or
+permissions, but setting the ownership based on the supplemental group shared by all containers in a
+pod will also work. For situations where read-only access to a shared volume is required from one
+or more containers, the `VolumeMount`s in those containers should have the `readOnly` field set.
+
+### All containers running as a single non-root user
+
+In use cases where a volume is used by a single non-root UID, the volume ownership and permissions
+should be set to enable read/write access.
+
+Currently, a non-root UID will not have permissions to write to any but an `EmptyDir` volume.
+Today, users that need this case to work can:
+
+1. Grant the container the necessary capabilities to `chown` and `chmod` the volume:
+ - `CAP_FOWNER`
+ - `CAP_CHOWN`
+ - `CAP_DAC_OVERRIDE`
+2. Run a wrapper script that runs `chown` and `chmod` commands to set the desired ownership and
+ permissions on the volume before starting their main process
+
+This workaround has significant drawbacks:
+
+1. It grants powerful kernel capabilities to the code in the image and is thus not secure,
+   defeating the reason containers are run as non-root users
+2. The user experience is poor; it requires changing the Dockerfile, adding a layer, or modifying the
+   container's command
+
+Some cluster operators manage the ownership of shared storage volumes on the server side.
+In this scenario, the UID of the container using the volume is known in advance. The ownership of
+the volume is set to match the container's UID on the server side.
+
+### Containers running as a mix of root and non-root users
+
+If the list of UIDs that need to use a volume includes both root and non-root users, supplemental
+groups can be applied to enable sharing volumes between containers. The ownership and permissions
+`root:<supplemental group> 2770` will make a volume usable from both containers running as root and
+running as a non-root UID and the supplemental group. The setgid bit is used to ensure that files
+created in the volume will inherit the owning GID of the volume.
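+
+For illustration, a short sketch of what `root:<supplemental group> 2770`
+amounts to on the node; the Kubelet changes described later in this proposal
+perform the equivalent `chgrp`/`chmod` (the path and GID here are arbitrary
+examples):
+
+```go
+package main
+
+import (
+	"log"
+	"os"
+)
+
+// makeSharedVolumeDir sets up a directory as root:<gid> with mode 2770:
+// group rwx plus the setgid bit, so files created inside inherit the group.
+func makeSharedVolumeDir(path string, gid int) error {
+	if err := os.MkdirAll(path, 0770); err != nil {
+		return err
+	}
+	if err := os.Chown(path, 0, gid); err != nil { // root:<supplemental group>
+		return err
+	}
+	// 0770 plus setgid is the "2770" permission set described above.
+	return os.Chmod(path, 0770|os.ModeSetgid)
+}
+
+func main() {
+	if err := makeSharedVolumeDir("/tmp/example-shared-vol", 1001); err != nil {
+		log.Fatal(err)
+	}
+}
+```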
+
+## Community Design Discussion
+
+- [kubernetes/2630](https://github.com/kubernetes/kubernetes/issues/2630)
+- [kubernetes/11319](https://github.com/kubernetes/kubernetes/issues/11319)
+- [kubernetes/9384](https://github.com/kubernetes/kubernetes/pull/9384)
+
+## Analysis
+
+The system needs to be able to:
+
+1. Model correctly which volumes require ownership management
+1. Determine the correct ownership of each volume in a pod if required
+1. Set the ownership and permissions on volumes when required
+
+### Modeling whether a volume requires ownership management
+
+#### Unshared storage: volumes derived from `EmptyDir`
+
+Since Kubernetes creates `EmptyDir` volumes, it should ensure the ownership is set to enable the
+volumes to be usable for all of the above scenarios.
+
+#### Unshared storage: network block devices
+
+Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
+as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
+volumes, permissions and ownership can be managed on the client side by the Kubelet when used
+exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
+`ReadWriteOnce` mode, they are effectively unshared storage.
+
+When used by multiple pods, there are many additional use-cases to analyze before we can be
+confident that we can support ownership management robustly with these file systems. The right
+design is one that makes it easy to experiment and develop support for ownership management with
+volume plugins to enable developers and cluster operators to continue exploring these issues.
+
+#### Shared storage: hostPath
+
+The `hostPath` volume should only be used by effective-root users, and the permissions of paths
+exposed into containers via hostPath volumes should always be managed by the cluster operator. If
+the Kubelet managed the ownership for `hostPath` volumes, a user who could create a `hostPath`
+volume could effect changes in the state of arbitrary paths within the host's filesystem. This
+would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
+never perform ownership management for.
+
+#### Shared storage
+
+Ownership management of shared storage is a complex topic. Ownership for existing shared storage
+will be managed externally from Kubernetes. For this case, our API should make it simple to express
+whether a particular volume should have these concerns managed by Kubernetes.
+
+We will not attempt to address the ownership and permissions concerns of new shared storage
+in this proposal.
+
+When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
+modes, it is shared storage, and thus outside the scope of this proposal.
+
+#### Plugin API requirements
+
+From the above, we know that some volume plugins will 'want' ownership management from the Kubelet
+and others will not. Plugins should be able to opt in to ownership management from the Kubelet. To
+facilitate this, there should be a method added to the `volume.Plugin` interface that the Kubelet
+uses to determine whether to perform ownership management for a volume.
+
+### Determining correct ownership of a volume
+
+Using the approach of a pod-level supplemental group to own volumes solves the problem in any of the
+cases of UID/GID combinations within a pod. Since this is the simplest approach that handles all
+use-cases, our solution will be made in terms of it.
+
+Eventually, Kubernetes should allocate a unique group for each pod so that a pod's volumes are
+usable by that pod's containers, but not by containers of another pod. The supplemental group used
+to share volumes must be unique in a multitenant cluster. If uniqueness is enforced at the host
+level, pods from one host may be able to use shared filesystems meant for pods on another host.
+
+Eventually, Kubernetes should integrate with external identity management systems to populate pod
+specs with the right supplemental groups necessary to use shared volumes. In the interim until the
+identity management story is far enough along to implement this type of integration, we will rely
+on being able to set arbitrary groups. (Note: as of this writing, a PR is being prepared for
+setting arbitrary supplemental groups).
+
+An admission controller could handle allocating groups for each pod and setting the group in the
+pod's security context.
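+
+A hypothetical sketch of such an admission step, assigning a group only when
+the pod did not request one (the allocation backend is left abstract and is an
+assumption, not part of this proposal):
+
+```go
+package admission
+
+// Simplified stand-in for the pod-level security context.
+type PodSecurityContext struct {
+	FSGroup *int64
+}
+
+// admitFSGroup allocates a per-pod supplemental group when none was requested;
+// allocateGroup stands in for whatever identity/allocation backend is used.
+func admitFSGroup(sc *PodSecurityContext, allocateGroup func() int64) {
+	if sc.FSGroup != nil {
+		return // the user (or another controller) already chose a group
+	}
+	gid := allocateGroup()
+	sc.FSGroup = &gid
+}
+```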
+
+#### A note on the root group
+
+Today, by default, all docker containers are run in the root group (GID 0). This is relied on by
+image authors that make images to run with a range of UIDs: they set the group ownership for
+important paths to be the root group, so that containers running as GID 0 *and* an arbitrary UID
+can read and write to those paths normally.
+
+It is important to note that the changes proposed here will not affect the primary GID of
+containers in pods. Setting the `pod.Spec.SecurityContext.FSGroup` field will not
+override the primary GID and should be safe to use in images that expect GID 0.
+
+### Setting ownership and permissions on volumes
+
+For `EmptyDir`-based volumes and unshared storage, `chown` and `chmod` on the node are sufficient to
+set ownership and permissions. Shared storage is different because:
+
+1. Shared storage may not live on the node a pod that uses it runs on
+2. Shared storage may be externally managed
+
+## Proposed design:
+
+Our design should minimize code for handling ownership required in the Kubelet and volume plugins.
+
+### API changes
+
+We should not interfere with images that need to run as a particular UID or primary GID. A pod
+level supplemental group allows us to express a group that all containers in a pod run as in a way
+that is orthogonal to the primary UID and GID of each container process.
+
+```go
+package api
+
+type PodSecurityContext struct {
+ // FSGroup is a supplemental group that all containers in a pod run under. This group will own
+ // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will
+ // not set the group ownership of any volumes.
+ FSGroup *int64 `json:"fsGroup,omitempty"`
+}
+```
+
+The V1 API will be extended with the same field:
+
+```go
+package v1
+
+type PodSecurityContext struct {
+ // FSGroup is a supplemental group that all containers in a pod run under. This group will own
+ // volumes that the Kubelet manages ownership for. If this is not specified, the Kubelet will
+ // not set the group ownership of any volumes.
+ FSGroup *int64 `json:"fsGroup,omitempty"`
+}
+```
+
+The values that can be specified for the `pod.Spec.SecurityContext.FSGroup` field are governed by
+[pod security policy](https://github.com/kubernetes/kubernetes/pull/7893).
+
+#### API backward compatibility
+
+Pods created by old clients will have the `pod.Spec.SecurityContext.FSGroup` field unset;
+these pods will not have their volumes managed by the Kubelet. Old clients will not be able to set
+or read the `pod.Spec.SecurityContext.FSGroup` field.
+
+### Volume changes
+
+The `volume.Mounter` interface should have a new method added that indicates whether the plugin
+supports ownership management:
+
+```go
+package volume
+
+type Mounter interface {
+ // other methods omitted
+
+ // SupportsOwnershipManagement indicates that this volume supports having ownership
+ // and permissions managed by the Kubelet; if true, the caller may manipulate UID
+ // or GID of this volume.
+ SupportsOwnershipManagement() bool
+}
+```
+
+In the first round of work, only `hostPath` and `emptyDir` and its derivations will be tested with
+ownership management support:
+
+| Plugin Name | SupportsOwnershipManagement |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | false |
+| `awsElasticBlockStore` | false |
+| `nfs` | false |
+| `iscsi` | false |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | false |
+| `cinder` | false |
+| `cephfs` | false |
+
+Ultimately, the matrix will theoretically look like:
+
+| Plugin Name | SupportsOwnershipManagement |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | true |
+| `awsElasticBlockStore` | true |
+| `nfs` | false |
+| `iscsi` | true |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | true |
+| `cinder` | false |
+| `cephfs` | false |
+
+### Kubelet changes
+
+The Kubelet should be modified to perform ownership and label management when required for a volume.
+
+For ownership management the criteria are:
+
+1. The `pod.Spec.SecurityContext.FSGroup` field is populated
+2. The volume builder returns `true` from `SupportsOwnershipManagement`
+
+Logic should be added to the `mountExternalVolumes` method that runs a local `chgrp` and `chmod` if
+the pod-level supplemental group is set and the volume supports ownership management:
+
+```go
+package kubelet
+
+type ChgrpRunner interface {
+ Chgrp(path string, gid int) error
+}
+
+type ChmodRunner interface {
+ Chmod(path string, mode os.FileMode) error
+}
+
+type Kubelet struct {
+ chgrpRunner ChgrpRunner
+ chmodRunner ChmodRunner
+}
+
+func (kl *Kubelet) mountExternalVolumes(pod *api.Pod) (kubecontainer.VolumeMap, error) {
+	podFSGroup := int64(0)
+	podFSGroupSet := false
+	if pod.Spec.SecurityContext != nil && pod.Spec.SecurityContext.FSGroup != nil {
+		podFSGroup = *pod.Spec.SecurityContext.FSGroup
+		podFSGroupSet = true
+	}
+
+	podVolumes := make(kubecontainer.VolumeMap)
+
+	for i := range pod.Spec.Volumes {
+		volSpec := &pod.Spec.Volumes[i]
+
+		rootContext, err := kl.getRootDirContext()
+		if err != nil {
+			return nil, err
+		}
+
+		// Try to use a plugin for this volume.
+		internal := volume.NewSpecFromVolume(volSpec)
+		builder, err := kl.newVolumeMounterFromPlugins(internal, pod, volume.VolumeOptions{RootContext: rootContext}, kl.mounter)
+		if err != nil {
+			glog.Errorf("Could not create volume builder for pod %s: %v", pod.UID, err)
+			return nil, err
+		}
+		if builder == nil {
+			return nil, errUnsupportedVolumeType
+		}
+		err = builder.SetUp()
+		if err != nil {
+			return nil, err
+		}
+
+		if builder.SupportsOwnershipManagement() && podFSGroupSet {
+			// Group-own the volume with the pod's fsGroup and make it group-writable.
+			err = kl.chgrpRunner.Chgrp(builder.GetPath(), int(podFSGroup))
+			if err != nil {
+				return nil, err
+			}
+
+			err = kl.chmodRunner.Chmod(builder.GetPath(), os.FileMode(0770))
+			if err != nil {
+				return nil, err
+			}
+		}
+
+		podVolumes[volSpec.Name] = builder
+	}
+
+	return podVolumes, nil
+}
+```
+
+This allows the volume plugins to determine when they do and don't want this type of support from
+the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
+
+The docker runtime will be modified to set the supplemental group of each container based on the
+`pod.Spec.SecurityContext.FSGroup` field. Theoretically, the `rkt` runtime could support this
+feature in a similar way.
+
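+As a sketch only (not part of this proposal's required changes), the docker runtime change could
+look roughly like the following; `applyFSGroup` is a hypothetical helper name, and the engine-api
+import path is an assumption about the current vendoring:
+
+```go
+package dockertools
+
+import (
+	"strconv"
+
+	dockercontainer "github.com/docker/engine-api/types/container"
+
+	"k8s.io/kubernetes/pkg/api"
+)
+
+// applyFSGroup (hypothetical) adds the pod-level fsGroup as a supplemental group
+// on a container's HostConfig so the container process can use volumes that the
+// Kubelet has chgrp'd to that group.
+func applyFSGroup(pod *api.Pod, hc *dockercontainer.HostConfig) {
+	sc := pod.Spec.SecurityContext
+	if sc == nil || sc.FSGroup == nil {
+		return
+	}
+	// Docker accepts supplemental groups as strings in HostConfig.GroupAdd.
+	hc.GroupAdd = append(hc.GroupAdd, strconv.FormatInt(*sc.FSGroup, 10))
+}
+```
+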
+### Examples
+
+#### EmptyDir
+
+For a pod that has two containers sharing an `EmptyDir` volume:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  securityContext:
+    fsGroup: 1001
+  containers:
+  - name: a
+    securityContext:
+      runAsUser: 1009
+    volumeMounts:
+    - mountPath: "/example/hostpath/a"
+      name: empty-vol
+  - name: b
+    securityContext:
+      runAsUser: 1010
+    volumeMounts:
+    - mountPath: "/example/hostpath/b"
+      name: empty-vol
+  volumes:
+  - name: empty-vol
+    emptyDir: {}
+```
+
+When the Kubelet runs this pod, the `empty-vol` volume will have ownership root:1001 and permissions
+`0770`. It will be usable from both containers a and b.
+
+#### HostPath
+
+For a pod that uses a `hostPath` volume with containers running as different UIDs:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  securityContext:
+    fsGroup: 1001
+  containers:
+  - name: a
+    securityContext:
+      runAsUser: 1009
+    volumeMounts:
+    - mountPath: "/example/hostpath/a"
+      name: host-vol
+  - name: b
+    securityContext:
+      runAsUser: 1010
+    volumeMounts:
+    - mountPath: "/example/hostpath/b"
+      name: host-vol
+  volumes:
+  - name: host-vol
+    hostPath:
+      path: "/tmp/example-pod"
+```
+
+The cluster operator would need to manually `chgrp` and `chmod` the `/tmp/example-pod` on the host
+in order for the volume to be usable from the pod.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/volumes.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/README.md b/contributors/devel/README.md
new file mode 100644
index 00000000..cf29f3b4
--- /dev/null
+++ b/contributors/devel/README.md
@@ -0,0 +1,83 @@
+# Kubernetes Developer Guide
+
+The developer guide is for anyone wanting to either write code which directly accesses the
+Kubernetes API, or to contribute directly to the Kubernetes project.
+It assumes some familiarity with concepts in the [User Guide](../user-guide/README.md) and the [Cluster Admin
+Guide](../admin/README.md).
+
+
+## The process of developing and contributing code to the Kubernetes project
+
+* **On Collaborative Development** ([collab.md](collab.md)): Info on pull requests and code reviews.
+
+* **GitHub Issues** ([issues.md](issues.md)): How incoming issues are reviewed and prioritized.
+
+* **Pull Request Process** ([pull-requests.md](pull-requests.md)): When and why pull requests are closed.
+
+* **Kubernetes On-Call Rotations** ([on-call-rotations.md](on-call-rotations.md)): Descriptions of on-call rotations for build and end-user support.
+
+* **Faster PR reviews** ([faster_reviews.md](faster_reviews.md)): How to get faster PR reviews.
+
+* **Getting Recent Builds** ([getting-builds.md](getting-builds.md)): How to get recent builds including the latest builds that pass CI.
+
+* **Automated Tools** ([automation.md](automation.md)): Descriptions of the automation that is running on our GitHub repository.
+
+
+## Setting up your dev environment, coding, and debugging
+
+* **Development Guide** ([development.md](development.md)): Setting up your development environment.
+
+* **Hunting flaky tests** ([flaky-tests.md](flaky-tests.md)): We have a goal of 99.9% flake free tests.
+ Here's how to run your tests many times.
+
+* **Logging Conventions** ([logging.md](logging.md)): Glog levels.
+
+* **Profiling Kubernetes** ([profiling.md](profiling.md)): How to plug in go pprof profiler to Kubernetes.
+
+* **Instrumenting Kubernetes with a new metric**
+ ([instrumentation.md](instrumentation.md)): How to add a new metric to the
+ Kubernetes code base.
+
+* **Coding Conventions** ([coding-conventions.md](coding-conventions.md)):
+ Coding style advice for contributors.
+
+* **Document Conventions** ([how-to-doc.md](how-to-doc.md)):
+ Document style advice for contributors.
+
+* **Running a cluster locally** ([running-locally.md](running-locally.md)):
+ A fast and lightweight local cluster deployment for development.
+
+## Developing against the Kubernetes API
+
+* The [REST API documentation](../api-reference/README.md) explains the REST
+ API exposed by apiserver.
+
+* **Annotations** ([docs/user-guide/annotations.md](../user-guide/annotations.md)): For attaching arbitrary non-identifying metadata to objects.
+ Programs that automate Kubernetes objects may use annotations to store small amounts of their state.
+
+* **API Conventions** ([api-conventions.md](api-conventions.md)):
+ Defining the verbs and resources used in the Kubernetes API.
+
+* **API Client Libraries** ([client-libraries.md](client-libraries.md)):
+ A list of existing client libraries, both supported and user-contributed.
+
+
+## Writing plugins
+
+* **Authentication Plugins** ([docs/admin/authentication.md](../admin/authentication.md)):
+ The current and planned states of authentication tokens.
+
+* **Authorization Plugins** ([docs/admin/authorization.md](../admin/authorization.md)):
+ Authorization applies to all HTTP requests on the main apiserver port.
+ This doc explains the available authorization implementations.
+
+* **Admission Control Plugins** ([admission_control](../design/admission_control.md))
+
+
+## Building releases
+
+See the [kubernetes/release](https://github.com/kubernetes/release) repository for details on creating releases and related tools and helper scripts.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/README.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/adding-an-APIGroup.md b/contributors/devel/adding-an-APIGroup.md
new file mode 100644
index 00000000..5832be23
--- /dev/null
+++ b/contributors/devel/adding-an-APIGroup.md
@@ -0,0 +1,100 @@
+Adding an API Group
+===============
+
+This document includes the steps to add an API group. You may also want to take
+a look at PR [#16621](https://github.com/kubernetes/kubernetes/pull/16621) and
+PR [#13146](https://github.com/kubernetes/kubernetes/pull/13146), which add API
+groups.
+
+Please also read about [API conventions](api-conventions.md) and
+[API changes](api_changes.md) before adding an API group.
+
+### Your core group package:
+
+We plan on improving the way the types are factored in the future; see
+[#16062](https://github.com/kubernetes/kubernetes/pull/16062) for the directions
+in which this might evolve.
+
+1. Create a folder in pkg/apis to hold your group. Create types.go in
+pkg/apis/`<group>`/ and pkg/apis/`<group>`/`<version>`/ to define API objects
+in your group;
+
+2. Create pkg/apis/`<group>`/{register.go, `<version>`/register.go} to register
+this group's API objects to the encoding/decoding scheme (e.g.,
+[pkg/apis/authentication/register.go](../../pkg/apis/authentication/register.go) and
+[pkg/apis/authentication/v1beta1/register.go](../../pkg/apis/authentication/v1beta1/register.go));
+a minimal sketch of a versioned register.go is shown after this list;
+3. Add a pkg/apis/`<group>`/install/install.go, which is responsible for adding
+the group to the `latest` package, so that other packages can access the group's
+meta through `latest.Group`. You probably only need to change the name of the
+group and version in the [example](../../pkg/apis/authentication/install/install.go). You
+need to import this `install` package in {pkg/master,
+pkg/client/unversioned}/import_known_versions.go if you want to make your group
+accessible to other packages in the kube-apiserver binary or in binaries that
+use the client package.
+
+Steps 2 and 3 are mechanical; we plan to autogenerate them using the
+cmd/libs/go2idl/ tools.
+
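+For step 2, a minimal sketch of what a versioned register.go might look like, following the
+scheme-builder pattern of the existing groups; the `Frobber` kinds are placeholders for the types
+you defined in step 1, and exact package paths vary by release:
+
+```go
+package v1beta1
+
+import (
+	"k8s.io/kubernetes/pkg/api/unversioned"
+	"k8s.io/kubernetes/pkg/runtime"
+)
+
+// GroupName is the DNS-suffixed name of this API group.
+const GroupName = "frobbers.k8s.io"
+
+// SchemeGroupVersion is the group/version used to register these objects.
+var SchemeGroupVersion = unversioned.GroupVersion{Group: GroupName, Version: "v1beta1"}
+
+var (
+	SchemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
+	AddToScheme   = SchemeBuilder.AddToScheme
+)
+
+// addKnownTypes registers this group's kinds with the scheme.
+// Frobber and FrobberList are the objects defined in types.go (step 1).
+func addKnownTypes(scheme *runtime.Scheme) error {
+	scheme.AddKnownTypes(SchemeGroupVersion,
+		&Frobber{},
+		&FrobberList{},
+	)
+	return nil
+}
+```
+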
+### Scripts changes and auto-generated code:
+
+1. Generate conversions and deep-copies:
+
+    1. Add your "group/" or "group/version" into
+       cmd/libs/go2idl/conversion-gen/main.go;
+    2. Make sure your pkg/apis/`<group>`/`<version>` directory has a doc.go file
+       with the comment `// +k8s:deepcopy-gen=package,register`, to catch the
+       attention of our generation tools.
+    3. Make sure your `pkg/apis/<group>/<version>` directory has a doc.go file
+       with the comment `// +k8s:conversion-gen=<internal-pkg>`, to catch the
+       attention of our generation tools. For most APIs the only target you
+       need is `k8s.io/kubernetes/pkg/apis/<group>` (your internal API).
+    4. Make sure your `pkg/apis/<group>` and `pkg/apis/<group>/<version>` directories
+       have a doc.go file with the comment `+groupName=<group>.k8s.io`, to correctly
+       generate the DNS-suffixed group name.
+    5. Run hack/update-all.sh.
+
+2. Generate files for Ugorji codec:
+
+    1. Touch types.generated.go in pkg/apis/`<group>`{/, `<version>`};
+    2. Run hack/update-codecgen.sh.
+
+3. Generate protobuf objects:
+
+    1. Add your group to `cmd/libs/go2idl/go-to-protobuf/protobuf/cmd.go` to
+       `New()` in the `Packages` field
+    2. Run hack/update-generated-protobuf.sh
+
+### Client (optional):
+
+We are overhauling pkg/client, so this section might be outdated; see
+[#15730](https://github.com/kubernetes/kubernetes/pull/15730) for how the client
+package might evolve. Currently, to add your group to the client package, you
+need to:
+
+1. Create pkg/client/unversioned/`<group>`.go, define a group client interface
+and implement the client. You can take pkg/client/unversioned/extensions.go as a
+reference.
+
+2. Add the group client interface to the `Interface` in
+pkg/client/unversioned/client.go and add method to fetch the interface. Again,
+you can take how we add the Extensions group there as an example.
+
+3. If you need to support the group in kubectl, you'll also need to modify
+pkg/kubectl/cmd/util/factory.go.
+
+### Make the group/version selectable in unit tests (optional):
+
+1. Add your group in pkg/api/testapi/testapi.go, then you can access the group
+in tests through testapi.`<group>`;
+
+2. Add your "group/version" to `KUBE_TEST_API_VERSIONS` in
+ hack/make-rules/test.sh and hack/make-rules/test-integration.sh
+
+TODO: Add a troubleshooting section.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/adding-an-APIGroup.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/api-conventions.md b/contributors/devel/api-conventions.md
new file mode 100644
index 00000000..0be45182
--- /dev/null
+++ b/contributors/devel/api-conventions.md
@@ -0,0 +1,1350 @@
+API Conventions
+===============
+
+Updated: 4/22/2016
+
+*This document is oriented at users who want a deeper understanding of the
+Kubernetes API structure, and developers wanting to extend the Kubernetes API.
+An introduction to using resources with kubectl can be found in [Working with
+resources](../user-guide/working-with-resources.md).*
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+  - [Types (Kinds)](#types-kinds)
+    - [Resources](#resources)
+    - [Objects](#objects)
+      - [Metadata](#metadata)
+      - [Spec and Status](#spec-and-status)
+        - [Typical status properties](#typical-status-properties)
+      - [References to related objects](#references-to-related-objects)
+      - [Lists of named subobjects preferred over maps](#lists-of-named-subobjects-preferred-over-maps)
+      - [Primitive types](#primitive-types)
+      - [Constants](#constants)
+      - [Unions](#unions)
+    - [Lists and Simple kinds](#lists-and-simple-kinds)
+  - [Differing Representations](#differing-representations)
+  - [Verbs on Resources](#verbs-on-resources)
+    - [PATCH operations](#patch-operations)
+      - [Strategic Merge Patch](#strategic-merge-patch)
+    - [List Operations](#list-operations)
+    - [Map Operations](#map-operations)
+  - [Idempotency](#idempotency)
+  - [Optional vs. Required](#optional-vs-required)
+  - [Defaulting](#defaulting)
+  - [Late Initialization](#late-initialization)
+  - [Concurrency Control and Consistency](#concurrency-control-and-consistency)
+  - [Serialization Format](#serialization-format)
+  - [Units](#units)
+  - [Selecting Fields](#selecting-fields)
+  - [Object references](#object-references)
+  - [HTTP Status codes](#http-status-codes)
+      - [Success codes](#success-codes)
+      - [Error codes](#error-codes)
+  - [Response Status Kind](#response-status-kind)
+  - [Events](#events)
+  - [Naming conventions](#naming-conventions)
+  - [Label, selector, and annotation conventions](#label-selector-and-annotation-conventions)
+  - [WebSockets and SPDY](#websockets-and-spdy)
+  - [Validation](#validation)
+<!-- END MUNGE: GENERATED_TOC -->
+
+The conventions of the [Kubernetes API](../api.md) (and related APIs in the
+ecosystem) are intended to ease client development and ensure that configuration
+mechanisms can be implemented that work across a diverse set of use cases
+consistently.
+
+The general style of the Kubernetes API is RESTful - clients create, update,
+delete, or retrieve a description of an object via the standard HTTP verbs
+(POST, PUT, DELETE, and GET) - and those APIs preferentially accept and return
+JSON. Kubernetes also exposes additional endpoints for non-standard verbs and
+allows alternative content types. All of the JSON accepted and returned by the
+server has a schema, identified by the "kind" and "apiVersion" fields. Where
+relevant HTTP header fields exist, they should mirror the content of JSON
+fields, but the information should not be represented only in the HTTP header.
+
+The following terms are defined:
+
+* **Kind** the name of a particular object schema (e.g. the "Cat" and "Dog"
+kinds would have different attributes and properties)
+* **Resource** a representation of a system entity, sent or retrieved as JSON
+via HTTP to the server. Resources are exposed via:
+ * Collections - a list of resources of the same type, which may be queryable
+ * Elements - an individual resource, addressable via a URL
+
+Each resource typically accepts and returns data of a single kind. A kind may be
+accepted or returned by multiple resources that reflect specific use cases. For
+instance, the kind "Pod" is exposed as a "pods" resource that allows end users
+to create, update, and delete pods, while a separate "pod status" resource (that
+acts on "Pod" kind) allows automated processes to update a subset of the fields
+in that resource.
+
+Resource collections should be all lowercase and plural, whereas kinds are
+CamelCase and singular.
+
+
+## Types (Kinds)
+
+Kinds are grouped into three categories:
+
+1. **Objects** represent a persistent entity in the system.
+
+ Creating an API object is a record of intent - once created, the system will
+work to ensure that the resource exists. All API objects have common metadata.
+
+ An object may have multiple resources that clients can use to perform
+specific actions that create, update, delete, or get.
+
+ Examples: `Pod`, `ReplicationController`, `Service`, `Namespace`, `Node`.
+
+2. **Lists** are collections of **resources** of one (usually) or more
+(occasionally) kinds.
+
+ The name of a list kind must end with "List". Lists have a limited set of
+common metadata. All lists use the required "items" field to contain the array
+of objects they return. Any kind that has the "items" field must be a list kind.
+
+ Most objects defined in the system should have an endpoint that returns the
+full set of resources, as well as zero or more endpoints that return subsets of
+the full list. Some objects may be singletons (the current user, the system
+defaults) and may not have lists.
+
+ In addition, all lists that return objects with labels should support label
+filtering (see [docs/user-guide/labels.md](../user-guide/labels.md)), and most
+lists should support filtering by fields.
+
+ Examples: `PodList`, `ServiceList`, `NodeList`
+
+ TODO: Describe field filtering below or in a separate doc.
+
+3. **Simple** kinds are used for specific actions on objects and for
+non-persistent entities.
+
+ Given their limited scope, they have the same set of limited common metadata
+as lists.
+
+ For instance, the "Status" kind is returned when errors occur and is not
+persisted in the system.
+
+ Many simple resources are "subresources", which are rooted at API paths of
+specific resources. When resources wish to expose alternative actions or views
+that are closely coupled to a single resource, they should do so using new
+sub-resources. Common subresources include:
+
+ * `/binding`: Used to bind a resource representing a user request (e.g., Pod,
+PersistentVolumeClaim) to a cluster infrastructure resource (e.g., Node,
+PersistentVolume).
+ * `/status`: Used to write just the status portion of a resource. For
+example, the `/pods` endpoint only allows updates to `metadata` and `spec`,
+since those reflect end-user intent. An automated process should be able to
+modify status for users to see by sending an updated Pod kind to the server to
+the "/pods/&lt;name&gt;/status" endpoint - the alternate endpoint allows
+different rules to be applied to the update, and access to be appropriately
+restricted.
+ * `/scale`: Used to read and write the count of a resource in a manner that
+is independent of the specific resource schema.
+
+ Two additional subresources, `proxy` and `portforward`, provide access to
+cluster resources as described in
+[docs/user-guide/accessing-the-cluster.md](../user-guide/accessing-the-cluster.md).
+
+The standard REST verbs (defined below) MUST return singular JSON objects. Some
+API endpoints may deviate from the strict REST pattern and return resources that
+are not singular JSON objects, such as streams of JSON objects or unstructured
+text log data.
+
+The term "kind" is reserved for these "top-level" API types. The term "type"
+should be used for distinguishing sub-categories within objects or subobjects.
+
+### Resources
+
+All JSON objects returned by an API MUST have the following fields:
+
+* kind: a string that identifies the schema this object should have
+* apiVersion: a string that identifies the version of the schema the object
+should have
+
+These fields are required for proper decoding of the object. They may be
+populated by the server by default from the specified URL path, but the client
+likely needs to know the values in order to construct the URL path.
+
+### Objects
+
+#### Metadata
+
+Every object kind MUST have the following metadata in a nested object field
+called "metadata":
+
+* namespace: a namespace is a DNS compatible label that objects are subdivided
+into. The default namespace is 'default'. See
+[docs/user-guide/namespaces.md](../user-guide/namespaces.md) for more.
+* name: a string that uniquely identifies this object within the current
+namespace (see [docs/user-guide/identifiers.md](../user-guide/identifiers.md)).
+This value is used in the path when retrieving an individual object.
+* uid: a unique in time and space value (typically an RFC 4122 generated
+identifier, see [docs/user-guide/identifiers.md](../user-guide/identifiers.md))
+used to distinguish between objects with the same name that have been deleted
+and recreated
+
+Every object SHOULD have the following metadata in a nested object field called
+"metadata":
+
+* resourceVersion: a string that identifies the internal version of this object
+that can be used by clients to determine when objects have changed. This value
+MUST be treated as opaque by clients and passed unmodified back to the server.
+Clients should not assume that the resource version has meaning across
+namespaces, different kinds of resources, or different servers. (See
+[concurrency control](#concurrency-control-and-consistency), below, for more
+details.)
+* generation: a sequence number representing a specific generation of the
+desired state. Set by the system and monotonically increasing, per-resource. May
+be compared, such as for read-after-write (RAW) and write-after-write (WAW) consistency.
+* creationTimestamp: a string representing an RFC 3339 date of the date and time
+an object was created
+* deletionTimestamp: a string representing an RFC 3339 date of the date and time
+after which this resource will be deleted. This field is set by the server when
+a graceful deletion is requested by the user, and is not directly settable by a
+client. The resource will be deleted (no longer visible from resource lists, and
+not reachable by name) after the time in this field. Once set, this value may
+not be unset or be set further into the future, although it may be shortened or
+the resource may be deleted prior to this time.
+* labels: a map of string keys and values that can be used to organize and
+categorize objects (see [docs/user-guide/labels.md](../user-guide/labels.md))
+* annotations: a map of string keys and values that can be used by external
+tooling to store and retrieve arbitrary metadata about this object (see
+[docs/user-guide/annotations.md](../user-guide/annotations.md))
+
+Labels are intended for organizational purposes by end users (select the pods
+that match this label query). Annotations enable third-party automation and
+tooling to decorate objects with additional metadata for their own use.
+
+#### Spec and Status
+
+By convention, the Kubernetes API makes a distinction between the specification
+of the desired state of an object (a nested object field called "spec") and the
+status of the object at the current time (a nested object field called
+"status"). The specification is a complete description of the desired state,
+including configuration settings provided by the user,
+[default values](#defaulting) expanded by the system, and properties initialized
+or otherwise changed after creation by other ecosystem components (e.g.,
+schedulers, auto-scalers), and is persisted in stable storage with the API
+object. If the specification is deleted, the object will be purged from the
+system. The status summarizes the current state of the object in the system, and
+is usually persisted with the object by an automated process but may be
+generated on the fly. At some cost and perhaps some temporary degradation in
+behavior, the status could be reconstructed by observation if it were lost.
+
+When a new version of an object is POSTed or PUT, the "spec" is updated and
+available immediately. Over time the system will work to bring the "status" into
+line with the "spec". The system will drive toward the most recent "spec"
+regardless of previous versions of that stanza. In other words, if a value is
+changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system
+is not required to 'touch base' at 5 before changing the "status" to 3. In other
+words, the system's behavior is *level-based* rather than *edge-based*. This
+enables robust behavior in the presence of missed intermediate state changes.
+
+The Kubernetes API also serves as the foundation for the declarative
+configuration schema for the system. In order to facilitate level-based
+operation and expression of declarative configuration, fields in the
+specification should have declarative rather than imperative names and
+semantics -- they represent the desired state, not actions intended to yield the
+desired state.
+
+The PUT and POST verbs on objects MUST ignore the "status" values, to avoid
+accidentally overwriting the status in read-modify-write scenarios. A `/status`
+subresource MUST be provided to enable system components to update statuses of
+resources they manage.
+
+Otherwise, PUT expects the whole object to be specified. Therefore, if a field
+is omitted it is assumed that the client wants to clear that field's value. The
+PUT verb does not accept partial updates. Modification of just part of an object
+may be achieved by GETting the resource, modifying part of the spec, labels, or
+annotations, and then PUTting it back. See
+[concurrency control](#concurrency-control-and-consistency), below, regarding
+read-modify-write consistency when using this pattern. Some objects may expose
+alternative resource representations that allow mutation of the status, or
+performing custom actions on the object.
+
+All objects that represent a physical resource whose state may vary from the
+user's desired intent SHOULD have a "spec" and a "status". Objects whose state
+cannot vary from the user's desired intent MAY have only "spec", and MAY rename
+"spec" to a more appropriate name.
+
+Objects that contain both spec and status should not contain additional
+top-level fields other than the standard metadata fields.
+
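+To make the shape concrete, here is a self-contained sketch of a hypothetical `Widget` kind; the
+standard `TypeMeta`/`ObjectMeta` types are stubbed out with plain fields, so this is illustrative
+rather than a real API definition:
+
+```go
+package example
+
+// ObjectMeta stands in for the standard Kubernetes metadata struct.
+type ObjectMeta struct {
+	Name            string `json:"name,omitempty"`
+	Namespace       string `json:"namespace,omitempty"`
+	ResourceVersion string `json:"resourceVersion,omitempty"`
+	Generation      int64  `json:"generation,omitempty"`
+}
+
+// Widget is a hypothetical kind illustrating the conventional spec/status split.
+type Widget struct {
+	Kind       string     `json:"kind,omitempty"`
+	APIVersion string     `json:"apiVersion,omitempty"`
+	Metadata   ObjectMeta `json:"metadata,omitempty"`
+
+	// Spec is the desired state, written by users and ecosystem components.
+	Spec WidgetSpec `json:"spec,omitempty"`
+	// Status is observed state, written by controllers via the /status
+	// subresource; PUT and POST on the main resource ignore it.
+	Status WidgetStatus `json:"status,omitempty"`
+}
+
+// WidgetSpec holds only declarative, desired-state fields.
+type WidgetSpec struct {
+	Replicas *int32 `json:"replicas,omitempty"`
+}
+
+// WidgetStatus summarizes the current state of the Widget in the system.
+type WidgetStatus struct {
+	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
+	Replicas           int32 `json:"replicas,omitempty"`
+}
+```
+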
+##### Typical status properties
+
+**Conditions** represent the latest available observations of an object's
+current state. Objects may report multiple conditions, and new types of
+conditions may be added in the future. Therefore, conditions are represented
+using a list/slice, where all have similar structure.
+
+The `FooCondition` type for some resource type `Foo` may include a subset of the
+following fields, but must contain at least `type` and `status` fields:
+
+```go
+  Type               FooConditionType `json:"type" description:"type of Foo condition"`
+  Status             ConditionStatus  `json:"status" description:"status of the condition, one of True, False, Unknown"`
+  LastHeartbeatTime  unversioned.Time `json:"lastHeartbeatTime,omitempty" description:"last time we got an update on a given condition"`
+  LastTransitionTime unversioned.Time `json:"lastTransitionTime,omitempty" description:"last time the condition transitioned from one status to another"`
+  Reason             string           `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`
+  Message            string           `json:"message,omitempty" description:"human-readable message indicating details about last transition"`
+```
+
+Additional fields may be added in the future.
+
+Conditions should be added to explicitly convey properties that users and
+components care about rather than requiring those properties to be inferred from
+other observations.
+
+Condition status values may be `True`, `False`, or `Unknown`. The absence of a
+condition should be interpreted the same as `Unknown`.
+
+In general, condition values may change back and forth, but some condition
+transitions may be monotonic, depending on the resource and condition type.
+However, conditions are observations and not, themselves, state machines, nor do
+we define comprehensive state machines for objects, nor behaviors associated
+with state transitions. The system is level-based rather than edge-triggered,
+and should assume an Open World.
+
+A typical oscillating condition type is `Ready`, which indicates the object was
+believed to be fully operational at the time it was last probed. A possible
+monotonic condition could be `Succeeded`. A `False` status for `Succeeded` would
+imply failure. An object that was still active would not have a `Succeeded`
+condition, or its status would be `Unknown`.
+
+Some resources in the v1 API contain fields called **`phase`**, and associated
+`message`, `reason`, and other status fields. The pattern of using `phase` is
+deprecated. Newer API types should use conditions instead. Phase was essentially
+a state-machine enumeration field, that contradicted
+[system-design principles](../design/principles.md#control-logic) and hampered
+evolution, since [adding new enum values breaks backward
+compatibility](api_changes.md). Rather than encouraging clients to infer
+implicit properties from phases, we intend to explicitly expose the conditions
+that clients need to monitor. Conditions also have the benefit that it is
+possible to create some conditions with uniform meaning across all resource
+types, while still exposing others that are unique to specific resource types.
+See [#7856](http://issues.k8s.io/7856) for more details and discussion.
+
+In condition types, and everywhere else they appear in the API, **`Reason`** is
+intended to be a one-word, CamelCase representation of the category of cause of
+the current status, and **`Message`** is intended to be a human-readable phrase
+or sentence, which may contain specific details of the individual occurrence.
+`Reason` is intended to be used in concise output, such as one-line
+`kubectl get` output, and in summarizing occurrences of causes, whereas
+`Message` is intended to be presented to users in detailed status explanations,
+such as `kubectl describe` output.
+
+Historical information status (e.g., last transition time, failure counts) is
+only provided with reasonable effort, and is not guaranteed to not be lost.
+
+Status information that may be large (especially proportional in size to
+collections of other resources, such as lists of references to other objects --
+see below) and/or rapidly changing, such as
+[resource usage](../design/resources.md#usage-data), should be put into separate
+objects, with possibly a reference from the original object. This helps to
+ensure that GETs and watch remain reasonably efficient for the majority of
+clients, which may not need that data.
+
+Some resources report the `observedGeneration`, which is the `generation` most
+recently observed by the component responsible for acting upon changes to the
+desired state of the resource. This can be used, for instance, to ensure that
+the reported status reflects the most recent desired status.
+
+#### References to related objects
+
+References to loosely coupled sets of objects, such as
+[pods](../user-guide/pods.md) overseen by a
+[replication controller](../user-guide/replication-controller.md), are usually
+best referred to using a [label selector](../user-guide/labels.md). In order to
+ensure that GETs of individual objects remain bounded in time and space, these
+sets may be queried via separate API queries, but will not be expanded in the
+referring object's status.
+
+References to specific objects, especially specific resource versions and/or
+specific fields of those objects, are specified using the `ObjectReference` type
+(or other types representing strict subsets of it). Unlike partial URLs, the
+ObjectReference type facilitates flexible defaulting of fields from the
+referring object or other contextual information.
+
+References in the status of the referee to the referrer may be permitted, when
+the references are one-to-one and do not need to be frequently updated,
+particularly in an edge-based manner.
+
+#### Lists of named subobjects preferred over maps
+
+Discussed in [#2004](http://issue.k8s.io/2004) and elsewhere. There are no maps
+of subobjects in any API objects. Instead, the convention is to use a list of
+subobjects containing name fields.
+
+For example:
+
+```yaml
+ports:
+  - name: www
+    containerPort: 80
+```
+
+vs.
+
+```yaml
+ports:
+  www:
+    containerPort: 80
+```
+
+This rule maintains the invariant that all JSON/YAML keys are fields in API
+objects. The only exceptions are pure maps in the API (currently, labels,
+selectors, annotations, data), as opposed to sets of subobjects.
+
+#### Primitive types
+
+* Avoid floating-point values as much as possible, and never use them in spec.
+Floating-point values cannot be reliably round-tripped (encoded and re-decoded)
+without changing, and have varying precision and representations across
+languages and architectures.
+* All numbers (e.g., uint32, int64) are converted to float64 by JavaScript and
+some other languages, so any field which is expected to exceed that either in
+magnitude or in precision (specifically integer values > 53 bits) should be
+serialized and accepted as strings.
+* Do not use unsigned integers, due to inconsistent support across languages and
+libraries. Just validate that the integer is non-negative if that's the case.
+* Do not use enums. Use aliases for string instead (e.g., `NodeConditionType`).
+* Look at similar fields in the API (e.g., ports, durations) and follow the
+conventions of existing fields.
+* All public integer fields MUST use the Go `(u)int32` or Go `(u)int64` types,
+not `(u)int` (which is ambiguous depending on target platform). Internal types
+may use `(u)int`.
+
+#### Constants
+
+Some fields will have a list of allowed values (enumerations). These values will
+be strings, and they will be in CamelCase, with an initial uppercase letter.
+Examples: "ClusterFirst", "Pending", "ClientIP".
+
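+A short, hypothetical fragment that follows the primitive-type and constant guidance above
+(`Widget` names are placeholders, not real API types):
+
+```go
+package example
+
+// WidgetRolloutPolicy is a string "enum"; values are CamelCase constants.
+type WidgetRolloutPolicy string
+
+const (
+	WidgetRolloutPolicyRecreate WidgetRolloutPolicy = "Recreate"
+	WidgetRolloutPolicyRolling  WidgetRolloutPolicy = "RollingUpdate"
+)
+
+// WidgetTuning keeps units in field names, uses sized integer types, avoids
+// unsigned integers and floats, and uses a string alias instead of an enum.
+type WidgetTuning struct {
+	// Optional, so a pointer with omitempty (see Optional vs. Required below).
+	TimeoutSeconds *int32 `json:"timeoutSeconds,omitempty"`
+
+	// Validated as non-negative rather than declared unsigned.
+	MaxRetries int32 `json:"maxRetries,omitempty"`
+
+	RolloutPolicy WidgetRolloutPolicy `json:"rolloutPolicy,omitempty"`
+}
+```
+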
+#### Unions
+
+Sometimes, at most one of a set of fields can be set. For example, the
+`volumes` field of a PodSpec has 17 different volume type-specific fields, such
+as `nfs` and `iscsi`. All fields in the set should be
+[Optional](#optional-vs-required).
+
+Sometimes, when a new type is created, the api designer may anticipate that a
+union will be needed in the future, even if only one field is allowed initially.
+In this case, be sure to make the field [optional](#optional-vs-required). In
+the validation, you may still return an error if the sole field is
+unset. Do not set a default value for that field.
+
+### Lists and Simple kinds
+
+Every list or simple kind SHOULD have the following metadata in a nested object
+field called "metadata":
+
+* resourceVersion: a string that identifies the common version of the objects
+returned in a list. This value MUST be treated as opaque by clients and
+passed unmodified back to the server. A resource version is only valid within a
+single namespace on a single kind of resource.
+
+Every simple kind returned by the server, and any simple kind sent to the server
+that must support idempotency or optimistic concurrency should return this
+value. Since simple resources are often used as input to alternate actions that
+modify objects, the resource version of the simple resource should correspond to
+the resource version of the object.
+
+
+## Differing Representations
+
+An API may represent a single entity in different ways for different clients, or
+transform an object after certain transitions in the system occur. In these
+cases, one request object may have two representations available as different
+resources, or different kinds.
+
+An example is a Service, which represents the intent of the user to group a set
+of pods with common behavior on common ports. When Kubernetes detects a pod
+matches the service selector, the IP address and port of the pod are added to an
+Endpoints resource for that Service. The Endpoints resource exists only if the
+Service exists, but exposes only the IPs and ports of the selected pods. The
+full service is represented by two distinct resources - under the original
+Service resource the user created, as well as in the Endpoints resource.
+
+As another example, a "pod status" resource may accept a PUT with the "pod"
+kind, with different rules about what fields may be changed.
+
+Future versions of Kubernetes may allow alternative encodings of objects beyond
+JSON.
+
+
+## Verbs on Resources
+
+API resources should use the traditional REST pattern:
+
+* GET /&lt;resourceNamePlural&gt; - Retrieve a list of type
+&lt;resourceName&gt;, e.g. GET /pods returns a list of Pods.
+* POST /&lt;resourceNamePlural&gt; - Create a new resource from the JSON object
+provided by the client.
+* GET /&lt;resourceNamePlural&gt;/&lt;name&gt; - Retrieves a single resource
+with the given name, e.g. GET /pods/first returns a Pod named 'first'. Should be
+constant time, and the resource should be bounded in size.
+* DELETE /&lt;resourceNamePlural&gt;/&lt;name&gt; - Delete the single resource
+with the given name. DeleteOptions may specify gracePeriodSeconds, the optional
+duration in seconds before the object should be deleted. Individual kinds may
+declare fields which provide a default grace period, and different kinds may
+have differing kind-wide default grace periods. A user provided grace period
+overrides a default grace period, including the zero grace period ("now").
+* PUT /&lt;resourceNamePlural&gt;/&lt;name&gt; - Update or create the resource
+with the given name with the JSON object provided by the client.
+* PATCH /&lt;resourceNamePlural&gt;/&lt;name&gt; - Selectively modify the
+specified fields of the resource. See more information [below](#patch-operations).
+* GET /&lt;resourceNamePlural&gt;?watch=true - Receive a stream of JSON
+objects corresponding to changes made to any resource of the given kind over
+time (a short sketch of consuming this stream follows the list).
+
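+The watch endpoint above is a streaming response; a minimal sketch of consuming it using only the
+standard library (the server address and the untyped event handling are simplifying assumptions,
+not a prescribed client):
+
+```go
+package example
+
+import (
+	"encoding/json"
+	"fmt"
+	"net/http"
+)
+
+// watchEvent mirrors the shape of one entry in the watch stream: a change
+// type plus the full object that changed.
+type watchEvent struct {
+	Type   string          `json:"type"`   // e.g. ADDED, MODIFIED, DELETED
+	Object json.RawMessage `json:"object"` // the resource, left undecoded here
+}
+
+// watchPods reads change notifications for pods until the server closes the
+// stream or an error occurs.
+func watchPods(server string) error {
+	resp, err := http.Get(server + "/api/v1/pods?watch=true")
+	if err != nil {
+		return err
+	}
+	defer resp.Body.Close()
+
+	dec := json.NewDecoder(resp.Body) // the body is a stream of JSON objects
+	for {
+		var ev watchEvent
+		if err := dec.Decode(&ev); err != nil {
+			return err // io.EOF when the stream ends
+		}
+		fmt.Printf("%s event, %d bytes of object\n", ev.Type, len(ev.Object))
+	}
+}
+```
+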
+### PATCH operations
+
+The API supports three different PATCH operations, determined by their
+corresponding Content-Type header:
+
+* JSON Patch, `Content-Type: application/json-patch+json`
+ * As defined in [RFC6902](https://tools.ietf.org/html/rfc6902), a JSON Patch is
+a sequence of operations that are executed on the resource, e.g. `{"op": "add",
+"path": "/a/b/c", "value": [ "foo", "bar" ]}`. For more details on how to use
+JSON Patch, see the RFC.
+* Merge Patch, `Content-Type: application/merge-patch+json`
+ * As defined in [RFC7386](https://tools.ietf.org/html/rfc7386), a Merge Patch
+is essentially a partial representation of the resource. The submitted JSON is
+"merged" with the current resource to create a new one, then the new one is
+saved. For more details on how to use Merge Patch, see the RFC.
+* Strategic Merge Patch, `Content-Type: application/strategic-merge-patch+json`
+ * Strategic Merge Patch is a custom implementation of Merge Patch. For a
+detailed explanation of how it works and why it needed to be introduced, see
+below.
+
+#### Strategic Merge Patch
+
+In the standard JSON merge patch, JSON objects are always merged but lists are
+always replaced. Often that isn't what we want. Let's say we start with the
+following Pod:
+
+```yaml
+spec:
+  containers:
+    - name: nginx
+      image: nginx-1.0
+```
+
+...and we POST that to the server (as JSON). Then let's say we want to *add* a
+container to this Pod.
+
+```yaml
+PATCH /api/v1/namespaces/default/pods/pod-name
+spec:
+  containers:
+    - name: log-tailer
+      image: log-tailer-1.0
+```
+
+If we were to use standard Merge Patch, the entire container list would be
+replaced with the single log-tailer container. However, our intent is for the
+container lists to merge together based on the `name` field.
+
+To solve this problem, Strategic Merge Patch uses metadata attached to the API
+objects to determine what lists should be merged and which ones should not.
+Currently the metadata is available as struct tags on the API objects
+themselves, but will become available to clients as Swagger annotations in the
+future. In the above example, the `patchStrategy` metadata for the `containers`
+field would be `merge` and the `patchMergeKey` would be `name`.
+
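+For illustration, here is an abridged sketch of how that metadata appears as struct tags on the API
+types (both types are trimmed to the fields relevant here, not the full definitions):
+
+```go
+package example
+
+// Container is abridged; the real type has many more fields.
+type Container struct {
+	Name  string `json:"name"`
+	Image string `json:"image,omitempty"`
+}
+
+// PodSpec is abridged to the one field relevant here. The patchStrategy and
+// patchMergeKey struct tags tell strategic merge patch to merge this list
+// element-by-element, keyed on each container's "name", instead of replacing
+// the whole list.
+type PodSpec struct {
+	Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name"`
+}
+```
+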
+Note: If the patch results in merging two lists of scalars, the scalars are
+first deduplicated and then merged.
+
+Strategic Merge Patch also supports special operations as listed below.
+
+### List Operations
+
+To override the container list to be strictly replaced, regardless of the
+default:
+
+```yaml
+containers:
+  - name: nginx
+    image: nginx-1.0
+  - $patch: replace # any further $patch operations nested in this list will be ignored
+```
+
+To delete an element of a list that should be merged:
+
+```yaml
+containers:
+  - name: nginx
+    image: nginx-1.0
+  - $patch: delete
+    name: log-tailer # merge key and value go here
+```
+
+### Map Operations
+
+To indicate that a map should not be merged and instead should be taken literally:
+
+```yaml
+$patch: replace # recursive and applies to all fields of the map it's in
+containers:
+- name: nginx
+  image: nginx-1.0
+```
+
+To delete a field of a map:
+
+```yaml
+name: nginx
+image: nginx-1.0
+labels:
+  live: null # set the value of the map key to null
+```
+
+
+## Idempotency
+
+All compatible Kubernetes APIs MUST support "name idempotency" and respond with
+an HTTP status code 409 when a request is made to POST an object that has the
+same name as an existing object in the system. See
+[docs/user-guide/identifiers.md](../user-guide/identifiers.md) for details.
+
+Names generated by the system may be requested using `metadata.generateName`.
+GenerateName indicates that the name should be made unique by the server prior
+to persisting it. A non-empty value for the field indicates the name will be
+made unique (and the name returned to the client will be different than the name
+passed). The value of this field will be combined with a unique suffix on the
+server if the Name field has not been provided. The provided value must be valid
+within the rules for Name, and may be truncated by the length of the suffix
+required to make the value unique on the server. If this field is specified, and
+Name is not present, the server will NOT return a 409 if the generated name
+exists - instead, it will either return 201 Created or 504 with Reason
+`ServerTimeout` indicating a unique name could not be found in the time
+allotted, and the client should retry (optionally after the time indicated in
+the Retry-After header).
+
+## Optional vs. Required
+
+Fields must be either optional or required.
+
+Optional fields have the following properties:
+
+- They have the `omitempty` struct tag in Go.
+- They are a pointer type in the Go definition (e.g. `awesomeFlag *bool`) or
+have a built-in `nil` value (e.g. maps and slices).
+- The API server should allow POSTing and PUTing a resource with this field
+unset.
+
+Required fields have the opposite properties, namely:
+
+- They do not have an `omitempty` struct tag.
+- They are not a pointer type in the Go definition (e.g. `otherFlag bool`).
+- The API server should not allow POSTing or PUTing a resource with this field
+unset.
+
+Using the `omitempty` tag causes swagger documentation to reflect that the field
+is optional.
+
+Using a pointer allows distinguishing unset from the zero value for that type.
+There are some cases where, in principle, a pointer is not needed for an
+optional field since the zero value is forbidden, and thus implies unset. There
+are examples of this in the codebase. However:
+
+- it can be difficult for implementors to anticipate all cases where an empty
+value might need to be distinguished from a zero value
+- structs are not omitted from encoder output even where omitempty is specified,
+which is messy;
+- having a pointer consistently imply optional is clearer for users of the Go
+language client, and any other clients that use corresponding types
+
+Therefore, we ask that pointers always be used with optional fields that do not
+have a built-in `nil` value.
+
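+A hypothetical fragment applying these rules (`GadgetSpec` is a placeholder, not a real API type):
+
+```go
+package example
+
+// GadgetSpec illustrates the optional/required conventions above.
+type GadgetSpec struct {
+	// Required: value type, no omitempty; the server rejects requests that
+	// leave it unset.
+	Image string `json:"image"`
+
+	// Optional with a built-in nil value: a slice needs no pointer.
+	Args []string `json:"args,omitempty"`
+
+	// Optional without a built-in nil value: a pointer distinguishes "unset"
+	// from the zero value false.
+	Paused *bool `json:"paused,omitempty"`
+}
+```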
+
+## Defaulting
+
+Default resource values are API version-specific, and they are applied during
+the conversion from API-versioned declarative configuration to internal objects
+representing the desired state (`Spec`) of the resource. Subsequent GETs of the
+resource will include the default values explicitly.
+
+Incorporating the default values into the `Spec` ensures that `Spec` depicts the
+full desired state so that it is easier for the system to determine how to
+achieve the state, and for the user to know what to anticipate.
+
+API version-specific default values are set by the API server.
+
+## Late Initialization
+
+Late initialization is when resource fields are set by a system controller
+after an object is created/updated.
+
+For example, the scheduler sets the `pod.spec.nodeName` field after the pod is
+created.
+
+Late-initializers should only make the following types of modifications:
+ - Setting previously unset fields
+ - Adding keys to maps
+ - Adding values to arrays which have mergeable semantics
+(`patchStrategy:"merge"` attribute in the type definition).
+
+These conventions:
+ 1. allow a user (with sufficient privilege) to override any system-default
+ behaviors by setting the fields that would otherwise have been defaulted.
+ 1. enable updates from users to be merged with changes made during late
+initialization, using strategic merge patch, as opposed to clobbering the
+change.
+ 1. allow the component which does the late-initialization to use strategic
+merge patch, which facilitates composition and concurrency of such components.
+
+Although the apiserver Admission Control stage acts prior to object creation,
+Admission Control plugins should follow the Late Initialization conventions
+too, to allow their implementation to be later moved to a 'controller', or to
+client libraries.
+
+## Concurrency Control and Consistency
+
+Kubernetes leverages the concept of *resource versions* to achieve optimistic
+concurrency. All Kubernetes resources have a "resourceVersion" field as part of
+their metadata. This resourceVersion is a string that identifies the internal
+version of an object that can be used by clients to determine when objects have
+changed. When a record is about to be updated, its version is checked against a
+pre-saved value, and if it doesn't match, the update fails with a StatusConflict
+(HTTP status code 409).
+
+The resourceVersion is changed by the server every time an object is modified.
+If resourceVersion is included with the PUT operation the system will verify
+that there have not been other successful mutations to the resource during a
+read/modify/write cycle, by verifying that the current value of resourceVersion
+matches the specified value.
+
+The resourceVersion is currently backed by [etcd's
+modifiedIndex](https://coreos.com/docs/distributed-configuration/etcd-api/).
+However, it's important to note that the application should *not* rely on the
+implementation details of the versioning system maintained by Kubernetes. We may
+change the implementation of resourceVersion in the future, such as to change it
+to a timestamp or per-object counter.
+
+The only way for a client to know the expected value of resourceVersion is to
+have received it from the server in response to a prior operation, typically a
+GET. This value MUST be treated as opaque by clients and passed unmodified back
+to the server. Clients should not assume that the resource version has meaning
+across namespaces, different kinds of resources, or different servers.
+Currently, the value of resourceVersion is set to match etcd's sequencer. You
+could think of it as a logical clock the API server can use to order requests.
+However, we expect the implementation of resourceVersion to change in the
+future, such as in the case we shard the state by kind and/or namespace, or port
+to another storage system.
+
+In the case of a conflict, the correct client action at this point is to GET the
+resource again, apply the changes afresh, and try submitting again. This
+mechanism can be used to prevent races like the following:
+
+```
+Client #1                    Client #2
+GET Foo                      GET Foo
+Set Foo.Bar = "one"          Set Foo.Baz = "two"
+PUT Foo                      PUT Foo
+```
+
+When these sequences occur in parallel, either the change to Foo.Bar or the
+change to Foo.Baz can be lost.
+
+On the other hand, when specifying the resourceVersion, one of the PUTs will
+fail, since whichever write succeeds changes the resourceVersion for Foo.
+
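+A sketch of that read-modify-write loop using only the standard library; the object shape, the URL
+handling, and the retry limit are illustrative assumptions rather than a prescribed client:
+
+```go
+package example
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"net/http"
+)
+
+// foo is a minimal, hypothetical view of an object: just enough metadata to
+// carry the resourceVersion back to the server.
+type foo struct {
+	Metadata struct {
+		Name            string `json:"name"`
+		ResourceVersion string `json:"resourceVersion"`
+	} `json:"metadata"`
+	Spec map[string]interface{} `json:"spec"`
+}
+
+// updateWithRetry GETs the object, applies mutate, PUTs it back, and retries
+// from a fresh GET whenever the server answers 409 StatusConflict.
+func updateWithRetry(url string, mutate func(*foo)) error {
+	for attempt := 0; attempt < 5; attempt++ {
+		resp, err := http.Get(url)
+		if err != nil {
+			return err
+		}
+		var obj foo
+		err = json.NewDecoder(resp.Body).Decode(&obj)
+		resp.Body.Close()
+		if err != nil {
+			return err
+		}
+
+		mutate(&obj) // apply the caller's change to the fresh copy
+
+		// resourceVersion is sent back unmodified as the precondition.
+		body, err := json.Marshal(obj)
+		if err != nil {
+			return err
+		}
+		req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
+		if err != nil {
+			return err
+		}
+		req.Header.Set("Content-Type", "application/json")
+		resp, err = http.DefaultClient.Do(req)
+		if err != nil {
+			return err
+		}
+		resp.Body.Close()
+
+		switch {
+		case resp.StatusCode < 300:
+			return nil // write accepted
+		case resp.StatusCode == http.StatusConflict:
+			continue // lost the race; re-GET and re-apply
+		default:
+			return fmt.Errorf("unexpected status %d", resp.StatusCode)
+		}
+	}
+	return fmt.Errorf("gave up after repeated conflicts")
+}
+```
+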
+resourceVersion may be used as a precondition for other operations (e.g., GET,
+DELETE) in the future, such as for read-after-write consistency in the presence
+of caching.
+
+"Watch" operations specify resourceVersion using a query parameter. It is used
+to specify the point at which to begin watching the specified resources. This
+may be used to ensure that no mutations are missed between a GET of a resource
+(or list of resources) and a subsequent Watch, even if the current version of
+the resource is more recent. This is currently the main reason that list
+operations (GET on a collection) return resourceVersion.
+
+
+## Serialization Format
+
+APIs may return alternative representations of any resource in response to an
+Accept header or under alternative endpoints, but the default serialization for
+input and output of API responses MUST be JSON.
+
+Protobuf serialization of API objects is currently **EXPERIMENTAL** and will change without notice.
+
+All dates should be serialized as RFC3339 strings.
+
+## Units
+
+Units must either be explicit in the field name (e.g., `timeoutSeconds`), or
+must be specified as part of the value (e.g., `resource.Quantity`). Which
+approach is preferred is TBD, though currently we use the `fooSeconds`
+convention for durations.
+
+
+## Selecting Fields
+
+Some APIs may need to identify which field in a JSON object is invalid, or to
+reference a value to extract from a separate resource. The current
+recommendation is to use standard JavaScript syntax for accessing that field,
+assuming the JSON object was transformed into a JavaScript object, without the
+leading dot, such as `metadata.name`.
+
+Examples:
+
+* Find the field "current" in the object "state" in the second item in the array
+"fields": `fields[1].state.current`
+
+## Object references
+
+Object references should either be called `fooName` if referring to an object of
+kind `Foo` by just the name (within the current namespace, if a namespaced
+resource), or should be called `fooRef`, and should contain a subset of the
+fields of the `ObjectReference` type.
+
+
+TODO: Plugins, extensions, nested kinds, headers
+
+
+## HTTP Status codes
+
+The server will respond with HTTP status codes that match the HTTP spec. See the
+section below for a breakdown of the types of status codes the server will send.
+
+The following HTTP status codes may be returned by the API.
+
+#### Success codes
+
+* `200 StatusOK`
+ * Indicates that the request completed successfully.
+* `201 StatusCreated`
+ * Indicates that the request to create kind completed successfully.
+* `204 StatusNoContent`
+ * Indicates that the request completed successfully, and the response contains
+no body.
+ * Returned in response to HTTP OPTIONS requests.
+
+#### Error codes
+
+* `307 StatusTemporaryRedirect`
+ * Indicates that the address for the requested resource has changed.
+ * Suggested client recovery behavior:
+ * Follow the redirect.
+
+
+* `400 StatusBadRequest`
+ * Indicates that the request is invalid.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `401 StatusUnauthorized`
+ * Indicates that the server can be reached and understood the request, but
+refuses to take any further action, because the client must provide
+authorization. If the client has provided authorization, the server is
+indicating the provided authorization is unsuitable or invalid.
+ * Suggested client recovery behavior:
+ * If the user has not supplied authorization information, prompt them for
+the appropriate credentials. If the user has supplied authorization information,
+inform them their credentials were rejected and optionally prompt them again.
+
+
+* `403 StatusForbidden`
+ * Indicates that the server can be reached and understood the request, but
+refuses to take any further action, because it is configured to deny access for
+some reason to the requested resource by the client.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `404 StatusNotFound`
+ * Indicates that the requested resource does not exist.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `405 StatusMethodNotAllowed`
+ * Indicates that the action the client attempted to perform on the resource
+was not supported by the code.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `409 StatusConflict`
+ * Indicates that either the resource the client attempted to create already
+exists or the requested update operation cannot be completed due to a conflict.
+ * Suggested client recovery behavior:
+   * If creating a new resource:
+     * Either change the identifier and try again, or GET and compare the
+fields in the pre-existing object and issue a PUT/update to modify the existing
+object.
+   * If updating an existing resource:
+     * See `Conflict` from the `status` response section below on how to
+retrieve more information about the nature of the conflict.
+     * GET and compare the fields in the pre-existing object, merge changes (if
+still valid according to preconditions), and retry with the updated request
+(including `ResourceVersion`).
+
+
+* `410 StatusGone`
+ * Indicates that the item is no longer available at the server and no
+forwarding address is known.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `422 StatusUnprocessableEntity`
+ * Indicates that the requested create or update operation cannot be completed
+due to invalid data provided as part of the request.
+ * Suggested client recovery behavior:
+ * Do not retry. Fix the request.
+
+
+* `429 StatusTooManyRequests`
+ * Indicates that either the client rate limit has been exceeded or the
+server has received more requests than it can process.
+ * Suggested client recovery behavior:
+ * Read the `Retry-After` HTTP header from the response, and wait at least
+that long before retrying.
+
+
+* `500 StatusInternalServerError`
+ * Indicates that the server can be reached and understood the request, but
+either an unexpected internal error occurred and the outcome of the call is
+unknown, or the server cannot complete the action in a reasonable time (this may
+be due to temporary server load or a transient communication issue with another
+server).
+ * Suggested client recovery behavior:
+ * Retry with exponential backoff (a short sketch follows this list).
+
+
+* `503 StatusServiceUnavailable`
+ * Indicates that required service is unavailable.
+ * Suggested client recovery behavior:
+ * Retry with exponential backoff.
+
+
+* `504 StatusServerTimeout`
+ * Indicates that the request could not be completed within the given time.
+Clients can get this response ONLY when they specified a timeout param in the
+request.
+ * Suggested client recovery behavior:
+ * Increase the value of the timeout param and retry with exponential
+backoff.
+
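+A minimal sketch of the exponential backoff suggested for the retryable codes above; the retry
+limit and base delay are arbitrary choices, not recommended values:
+
+```go
+package example
+
+import (
+	"fmt"
+	"time"
+)
+
+// retryWithBackoff retries op with exponentially increasing delays, as
+// suggested for 429/500/503/504 responses.
+func retryWithBackoff(op func() error) error {
+	delay := 500 * time.Millisecond
+	var lastErr error
+	for attempt := 0; attempt < 6; attempt++ {
+		if lastErr = op(); lastErr == nil {
+			return nil
+		}
+		time.Sleep(delay)
+		delay *= 2 // 0.5s, 1s, 2s, 4s, ...
+	}
+	return fmt.Errorf("giving up after repeated failures: %v", lastErr)
+}
+```
+
+For `429`, the `Retry-After` header, when present, should take precedence over the computed delay.
+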
+## Response Status Kind
+
+Kubernetes will always return the `Status` kind from any API endpoint when an
+error occurs. Clients SHOULD handle these types of objects when appropriate.
+
+A `Status` kind will be returned by the API in two cases:
+ * When an operation is not successful (i.e. when the server would return a non
+2xx HTTP status code).
+ * When an HTTP `DELETE` call is successful.
+
+The status object is encoded as JSON and provided as the body of the response.
+The status object contains fields for humans and machine consumers of the API to
+get more detailed information for the cause of the failure. The information in
+the status object supplements, but does not override, the HTTP status code's
+meaning. When fields in the status object have the same meaning as generally
+defined HTTP headers and that header is returned with the response, the header
+should be considered as having higher priority.
+
+**Example:**
+
+```console
+$ curl -v -k -H "Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc" https://10.240.122.184:443/api/v1/namespaces/default/pods/grafana
+
+> GET /api/v1/namespaces/default/pods/grafana HTTP/1.1
+> User-Agent: curl/7.26.0
+> Host: 10.240.122.184
+> Accept: */*
+> Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc
+>
+
+< HTTP/1.1 404 Not Found
+< Content-Type: application/json
+< Date: Wed, 20 May 2015 18:10:42 GMT
+< Content-Length: 232
+<
+{
+ "kind": "Status",
+ "apiVersion": "v1",
+ "metadata": {},
+ "status": "Failure",
+ "message": "pods \"grafana\" not found",
+ "reason": "NotFound",
+ "details": {
+ "name": "grafana",
+ "kind": "pods"
+ },
+ "code": 404
+}
+```
+
+The `status` field contains one of two possible values:
+* `Success`
+* `Failure`
+
+`message` may contain a human-readable description of the error.
+
+`reason` may contain a machine-readable, one-word, CamelCase description of why
+this operation is in the `Failure` status. If this value is empty there is no
+information available. The `reason` clarifies an HTTP status code but does not
+override it.
+
+`details` may contain extended data associated with the reason. Each reason may
+define its own extended details. This field is optional and the data returned is
+not guaranteed to conform to any schema except that defined by the reason type.
+
+Possible values for the `reason` and `details` fields:
+* `BadRequest`
+ * Indicates that the request itself was invalid, because the request doesn't
+make any sense, for example, deleting a read-only object.
+ * This is different from the `Invalid` status reason below, which indicates
+that the API call could possibly succeed, but the data was invalid.
+ * API calls that return BadRequest can never succeed.
+ * HTTP status code: `400 StatusBadRequest`
+
+
+* `Unauthorized`
+ * Indicates that the server can be reached and understood the request, but
+refuses to take any further action without the client providing appropriate
+authorization. If the client has provided authorization, this error indicates
+the provided credentials are insufficient or invalid.
+ * Details (optional):
+ * `kind string`
+ * The kind attribute of the unauthorized resource (on some operations may
+differ from the requested resource).
+ * `name string`
+ * The identifier of the unauthorized resource.
+ * HTTP status code: `401 StatusUnauthorized`
+
+
+* `Forbidden`
+ * Indicates that the server can be reached and understood the request, but
+refuses to take any further action, because it is configured to deny access for
+some reason to the requested resource by the client.
+ * Details (optional):
+ * `kind string`
+ * The kind attribute of the forbidden resource (on some operations may
+differ from the requested resource).
+ * `name string`
+ * The identifier of the forbidden resource.
+ * HTTP status code: `403 StatusForbidden`
+
+
+* `NotFound`
+ * Indicates that one or more resources required for this operation could not
+be found.
+ * Details (optional):
+ * `kind string`
+ * The kind attribute of the missing resource (on some operations may
+differ from the requested resource).
+ * `name string`
+ * The identifier of the missing resource.
+ * HTTP status code: `404 StatusNotFound`
+
+
+* `AlreadyExists`
+ * Indicates that the resource you are creating already exists.
+ * Details (optional):
+ * `kind string`
+ * The kind attribute of the conflicting resource.
+ * `name string`
+ * The identifier of the conflicting resource.
+ * HTTP status code: `409 StatusConflict`
+
+* `Conflict`
+ * Indicates that the requested update operation cannot be completed due to a
+conflict. The client may need to alter the request. Each resource may define
+custom details that indicate the nature of the conflict.
+ * HTTP status code: `409 StatusConflict`
+
+
+* `Invalid`
+ * Indicates that the requested create or update operation cannot be completed
+due to invalid data provided as part of the request.
+ * Details (optional):
+ * `kind string`
+ * the kind attribute of the invalid resource
+ * `name string`
+ * the identifier of the invalid resource
+ * `causes`
+ * One or more `StatusCause` entries indicating the data in the provided
+resource that was invalid. The `reason`, `message`, and `field` attributes will
+be set.
+ * HTTP status code: `422 StatusUnprocessableEntity`
+
+
+* `Timeout`
+ * Indicates that the request could not be completed within the given time.
+Clients may receive this response if the server has decided to rate limit the
+client, or if the server is overloaded and cannot process the request at this
+time.
+ * HTTP status code: `429 StatusTooManyRequests`
+ * The server should set the `Retry-After` HTTP header and return
+`retryAfterSeconds` in the details field of the object. A value of `0` is the
+default.
+
+
+* `ServerTimeout`
+ * Indicates that the server can be reached and understood the request, but
+cannot complete the action in a reasonable time. This may be due to temporary
+server load or a transient communication issue with another server.
+ * Details (optional):
+ * `kind string`
+ * The kind attribute of the resource being acted on.
+ * `name string`
+ * The operation that is being attempted.
+ * The server should set the `Retry-After` HTTP header and return
+`retryAfterSeconds` in the details field of the object. A value of `0` is the
+default.
+ * HTTP status code: `504 StatusServerTimeout`
+
+
+* `MethodNotAllowed`
+ * Indicates that the action the client attempted to perform on the resource
+was not supported by the code.
+ * For instance, attempting to delete a resource that can only be created.
+ * API calls that return MethodNotAllowed can never succeed.
+ * HTTP status code: `405 StatusMethodNotAllowed`
+
+
+* `InternalError`
+ * Indicates that an internal error occurred, it is unexpected and the outcome
+of the call is unknown.
+ * Details (optional):
+ * `causes`
+ * The original error.
+ * HTTP status code: `500 StatusInternalServerError`
+
+`code` may contain the suggested HTTP return code for this status.
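+
+As a rough illustration of how a client might act on the `reason` field, the
+sketch below decodes an error body into a trimmed-down stand-in for the
+`Status` kind and decides whether a retry is worthwhile. Real clients should
+use the generated API types; the `status` struct and `shouldRetry` helper here
+are illustrative only.
+
+```go
+// Trimmed-down stand-in for the Status kind; only the fields discussed above.
+// (imports: "encoding/json")
+type status struct {
+ Status  string `json:"status"`
+ Message string `json:"message"`
+ Reason  string `json:"reason"`
+ Code    int    `json:"code"`
+}
+
+// shouldRetry inspects the machine-readable reason of a failed call.
+func shouldRetry(body []byte) (bool, error) {
+ var s status
+ if err := json.Unmarshal(body, &s); err != nil {
+  return false, err
+ }
+ switch s.Reason {
+ case "ServerTimeout", "Timeout", "InternalError":
+  return true, nil // transient; retry with backoff
+ default:
+  return false, nil // e.g. NotFound, Invalid, Forbidden: fix the request instead
+ }
+}
+```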
+
+
+## Events
+
+Events are complementary to status information, since they can provide some
+historical information about status and occurrences in addition to current or
+previous status. Generate events for situations users or administrators should
+be alerted about.
+
+Choose a unique, specific, short, CamelCase reason for each event category. For
+example, `FreeDiskSpaceInvalid` is a good event reason because it is likely to
+refer to just one situation, but `Started` is not a good reason because it
+doesn't sufficiently indicate what started, even when combined with other event
+fields.
+
+`Error creating foo` or `Error creating foo %s` would be appropriate for an
+event message, with the latter being preferable, since it is more informational.
+
+Accumulate repeated events in the client, especially for frequent events, to
+reduce data volume, load on the system, and noise exposed to users.
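+
+As a sketch of these conventions in code, a component might emit an event like
+the following; `EventRecorder` here is a stand-in for the recorder interface
+provided by the client record package, and the reason and message are
+illustrative.
+
+```go
+// EventRecorder is a stand-in for the recorder interface used by components;
+// the real one lives in the client record package.
+type EventRecorder interface {
+ Eventf(object interface{}, eventtype, reason, messageFmt string, args ...interface{})
+}
+
+// reportCreateFailure follows the conventions above: a specific CamelCase
+// reason and an informative message that names the object involved.
+func reportCreateFailure(recorder EventRecorder, obj interface{}, name string, err error) {
+ recorder.Eventf(obj, "Warning", "FailedCreate", "Error creating %s: %v", name, err)
+}
+```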
+
+## Naming conventions
+
+* Go field names must be CamelCase. JSON field names must be camelCase. Other
+than capitalization of the initial letter, the two should almost always match.
+No underscores nor dashes in either.
+* Field and resource names should be declarative, not imperative (DoSomething,
+SomethingDoer, DoneBy, DoneAt).
+* Use `Node` where referring to
+the node resource in the context of the cluster. Use `Host` where referring to
+properties of the individual physical/virtual system, such as `hostname`,
+`hostPath`, `hostNetwork`, etc.
+* `FooController` is a deprecated kind naming convention. Name the kind after
+the thing being controlled instead (e.g., `Job` rather than `JobController`).
+* The name of a field that specifies the time at which `something` occurs should
+be called `somethingTime`. Do not use `stamp` (e.g., `creationTimestamp`).
+* We use the `fooSeconds` convention for durations, as discussed in the [units
+subsection](#units).
+ * `fooPeriodSeconds` is preferred for periodic intervals and other waiting
+periods (e.g., over `fooIntervalSeconds`).
+ * `fooTimeoutSeconds` is preferred for inactivity/unresponsiveness deadlines.
+ * `fooDeadlineSeconds` is preferred for activity completion deadlines.
+* Do not use abbreviations in the API, except where they are extremely commonly
+used, such as "id", "args", or "stdin".
+* Acronyms should similarly only be used when extremely commonly known. All
+letters in the acronym should have the same case, using the appropriate case for
+the situation. For example, at the beginning of a field name, the acronym should
+be all lowercase, such as "httpGet". Where used as a constant, all letters
+should be uppercase, such as "TCP" or "UDP".
+* The name of a field referring to another resource of kind `Foo` by name should
+be called `fooName`. The name of a field referring to another resource of kind
+`Foo` by ObjectReference (or subset thereof) should be called `fooRef`.
+* More generally, include the units and/or type in the field name if they could
+be ambiguous and they are not specified by the value or value type.
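+
+As an illustration of several of the conventions above, a hypothetical (not
+real) type might look like:
+
+```go
+// HypotheticalWidgetSpec is not a real Kubernetes type; it only illustrates
+// the naming conventions above.
+type HypotheticalWidgetSpec struct {
+ // Reference to another resource of kind Node, by name.
+ NodeName string `json:"nodeName"`
+ // Durations carry their units in the field name.
+ StartupTimeoutSeconds int64 `json:"startupTimeoutSeconds"`
+ // Times are named somethingTime, not somethingStamp.
+ ScheduledTime string `json:"scheduledTime"`
+ // Acronyms keep a consistent case: all lowercase at the start of a name.
+ HTTPGetPath string `json:"httpGetPath"`
+}
+```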
+
+## Label, selector, and annotation conventions
+
+Labels are the domain of users. They are intended to facilitate organization and
+management of API resources using attributes that are meaningful to users, as
+opposed to meaningful to the system. Think of them as user-created mp3 or email
+inbox labels, as opposed to the directory structure used by a program to store
+its data. The former enables the user to apply an arbitrary ontology, whereas
+the latter is implementation-centric and inflexible. Users will use labels to
+select resources to operate on, display label values in CLI/UI columns, etc.
+Users should always retain full power and flexibility over the label schemas
+they apply to resources in their namespaces.
+
+However, we should support conveniences for common cases by default. For
+example, what we now do in ReplicationController is automatically set the RC's
+selector and labels to the labels in the pod template by default, if they are
+not already set. That ensures that the selector will match the template, and
+that the RC can be managed using the same labels as the pods it creates. Note
+that once we generalize selectors, it won't necessarily be possible to
+unambiguously generate labels that match an arbitrary selector.
+
+If the user wants to apply additional labels to the pods that they don't select
+upon, such as to facilitate adoption of pods or in the expectation that some
+label values will change, they can set the selector to a subset of the pod
+labels. Similarly, the RC's labels could be initialized to a subset of the pod
+template's labels, or could include additional/different labels.
+
+For disciplined users managing resources within their own namespaces, it's not
+that hard to consistently apply schemas that ensure uniqueness. One just needs
+to ensure that at least one value of some label key in common differs compared
+to all other comparable resources. We could/should provide a verification tool
+to check that. However, development of conventions similar to the examples in
+[Labels](../user-guide/labels.md) makes uniqueness straightforward. Furthermore,
+relatively narrowly used namespaces (e.g., per environment, per application) can
+be used to reduce the set of resources that could potentially cause overlap.
+
+In cases where users could be running miscellaneous examples with inconsistent schemas,
+or where tooling or components need to programmatically generate new objects to
+be selected, there needs to be a straightforward way to generate unique label
+sets. A simple way to ensure uniqueness of the set is to ensure uniqueness of a
+single label value, such as by using a resource name, uid, resource hash, or
+generation number.
+
+Problems with uids and hashes, however, include that they have no semantic
+meaning to the user, are not memorable nor readily recognizable, and are not
+predictable. Lack of predictability obstructs use cases such as creation of a
+replication controller from a pod, as people want to do when exploring the
+system, bootstrapping a self-hosted cluster, or deletion and re-creation of a
+new RC that adopts the pods of the previous one, such as to rename it.
+Generation numbers are more predictable and much clearer, assuming there is a
+logical sequence. Fortunately, for deployments that's the case. For jobs, use of
+creation timestamps is common internally. Users should always be able to turn
+off auto-generation, in order to permit some of the scenarios described above.
+Note that auto-generated labels will also become one more field that needs to be
+stripped out when cloning a resource, within a namespace, in a new namespace, in
+a new cluster, etc., and will need to be ignored or worked around when updating
+a resource via a patch or read-modify-write sequence.
+
+Inclusion of a system prefix in a label key is fairly hostile to UX. A prefix is
+only necessary in the case that the user cannot choose the label key, in order
+to avoid collisions with user-defined labels. However, I firmly believe that the
+user should always be allowed to select the label keys to use on their
+resources, so it should always be possible to override default label keys.
+
+Therefore, resources supporting auto-generation of unique labels should have a
+`uniqueLabelKey` field, so that the user could specify the key if they wanted
+to, but if unspecified, it could be set by default, such as to the resource
+type, like job, deployment, or replicationController. The value would need to be
+at least spatially unique, and perhaps temporally unique in the case of job.
+
+Annotations have very different intended usage from labels. We expect them to be
+primarily generated and consumed by tooling and system extensions. I'm inclined
+to generalize annotations to permit them to directly store arbitrary json. Rigid
+names and name prefixes make sense, since they are analogous to API fields.
+
+In fact, in-development API fields, including those used to represent fields of
+newer alpha/beta API versions in the older stable storage version, may be
+represented as annotations with the form `something.alpha.kubernetes.io/name` or
+`something.beta.kubernetes.io/name` (depending on our confidence in it). For
+example `net.alpha.kubernetes.io/policy` might represent an experimental network
+policy field. The "name" portion of the annotation should follow the below
+conventions for annotations. When an annotation gets promoted to a field, the
+name transformation should then be mechanical: `foo-bar` becomes `fooBar`.
+
+Other advice regarding use of labels, annotations, and other generic map keys by
+Kubernetes components and tools:
+ - Key names should be all lowercase, with words separated by dashes, such as
+`desired-replicas`
+ - Prefix the key with `kubernetes.io/` or `foo.kubernetes.io/`, preferably the
+latter if the label/annotation is specific to `foo`
+ - For instance, prefer `service-account.kubernetes.io/name` over
+`kubernetes.io/service-account.name`
+ - Use annotations to store API extensions that the controller responsible for
+the resource doesn't need to know about, experimental fields that aren't
+intended to be generally used API fields, etc. Beware that annotations aren't
+automatically handled by the API conversion machinery.
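+
+For example (a sketch only, with a made-up tool named `foo` and a made-up
+helper), an annotation following these key conventions might be set like this:
+
+```go
+// annotateDesiredReplicas records a tool-specific annotation using a
+// lowercase, dash-separated name under the tool's prefix.
+// (imports: "strconv")
+func annotateDesiredReplicas(annotations map[string]string, replicas int) {
+ if annotations == nil {
+  return // caller should initialize metadata.annotations first
+ }
+ annotations["foo.kubernetes.io/desired-replicas"] = strconv.Itoa(replicas)
+}
+```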
+
+
+## WebSockets and SPDY
+
+Some of the API operations exposed by Kubernetes involve transfer of binary
+streams between the client and a container, including attach, exec, portforward,
+and logging. The API therefore exposes certain operations over upgradeable HTTP
+connections ([described in RFC 2817](https://tools.ietf.org/html/rfc2817)) via
+the WebSocket and SPDY protocols. These actions are exposed as subresources with
+their associated verbs (exec, log, attach, and portforward) and are requested
+via a GET (to support JavaScript in a browser) and POST (semantically accurate).
+
+There are two primary protocols in use today:
+
+1. Streamed channels
+
+ When dealing with multiple independent binary streams of data such as the
+remote execution of a shell command (writing to STDIN, reading from STDOUT and
+STDERR) or forwarding multiple ports the streams can be multiplexed onto a
+single TCP connection. Kubernetes supports a SPDY based framing protocol that
+leverages SPDY channels and a WebSocket framing protocol that multiplexes
+multiple channels onto the same stream by prefixing each binary chunk with a
+byte indicating its channel. The WebSocket protocol supports an optional
+subprotocol that handles base64-encoded bytes from the client and returns
+base64-encoded bytes from the server and character based channel prefixes ('0',
+'1', '2') for ease of use from JavaScript in a browser.
+
+2. Streaming response
+
+ The default log output for a channel of streaming data is an HTTP Chunked
+Transfer-Encoding, which can return an arbitrary stream of binary data from the
+server. Browser-based JavaScript is limited in its ability to access the raw
+data from a chunked response, especially when very large amounts of logs are
+returned, and in future API calls it may be desirable to transfer large files.
+The streaming API endpoints support an optional WebSocket upgrade that provides
+a unidirectional channel from the server to the client and chunks data as binary
+WebSocket frames. An optional WebSocket subprotocol is exposed that base64
+encodes the stream before returning it to the client.
+
+Clients should use the SPDY protocols if they have native support, or
+WebSockets as a fallback. Note that WebSockets is susceptible to Head-of-Line
+blocking and so clients must read and process each message sequentially. In
+the future, an HTTP/2 implementation will be exposed that deprecates SPDY.
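+
+As a very rough sketch of the channel framing described in (1) above (not the
+actual Kubernetes client code), each frame carries a one-byte channel prefix
+followed by the payload:
+
+```go
+// demuxFrame splits one frame of the channel-prefixed framing: the first byte
+// identifies the stream (e.g. one channel each for STDIN, STDOUT, and STDERR)
+// and the remainder is the payload for that stream.
+// (imports: "errors")
+func demuxFrame(frame []byte) (channel byte, payload []byte, err error) {
+ if len(frame) == 0 {
+  return 0, nil, errors.New("empty frame")
+ }
+ return frame[0], frame[1:], nil
+}
+```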
+
+
+## Validation
+
+API objects are validated upon receipt by the apiserver. Validation errors are
+flagged and returned to the caller in a `Failure` status with `reason` set to
+`Invalid`. In order to facilitate consistent error messages, we ask that
+validation logic adheres to the following guidelines whenever possible (though
+exceptional cases will exist).
+
+* Be as precise as possible.
+* Telling users what they CAN do is more useful than telling them what they
+CANNOT do.
+* When asserting a requirement in the positive, use "must". Examples: "must be
+greater than 0", "must match regex '[a-z]+'". Words like "should" imply that
+the assertion is optional, and must be avoided.
+* When asserting a formatting requirement in the negative, use "must not".
+Example: "must not contain '..'". Words like "should not" imply that the
+assertion is optional, and must be avoided.
+* When asserting a behavioral requirement in the negative, use "may not".
+Examples: "may not be specified when otherField is empty", "only `name` may be
+specified".
+* When referencing a literal string value, indicate the literal in
+single-quotes. Example: "must not contain '..'".
+* When referencing another field name, indicate the name in back-quotes.
+Example: "must be greater than `request`".
+* When specifying inequalities, use words rather than symbols. Examples: "must
+be less than 256", "must be greater than or equal to 0". Do not use words
+like "larger than", "bigger than", "more than", "higher than", etc.
+* When specifying numeric ranges, use inclusive ranges when possible.
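+
+For example, a validation function written to these guidelines might produce
+messages like the following (a sketch; the function shape is illustrative, not
+the actual Kubernetes validation helpers):
+
+```go
+// validateWidget returns error messages worded according to the guidelines
+// above. (imports: "strings")
+func validateWidget(replicas int, path string) []string {
+ var errs []string
+ if replicas < 0 {
+  errs = append(errs, "replicas: must be greater than or equal to 0")
+ }
+ if strings.Contains(path, "..") {
+  errs = append(errs, "path: must not contain '..'")
+ }
+ return errs
+}
+```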
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/api-conventions.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/api_changes.md b/contributors/devel/api_changes.md
new file mode 100755
index 00000000..963deb7c
--- /dev/null
+++ b/contributors/devel/api_changes.md
@@ -0,0 +1,732 @@
+*This document is oriented at developers who want to change existing APIs.
+A set of API conventions, which applies to new APIs and to changes, can be
+found at [API Conventions](api-conventions.md).*
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [So you want to change the API?](#so-you-want-to-change-the-api)
+ - [Operational overview](#operational-overview)
+ - [On compatibility](#on-compatibility)
+ - [Incompatible API changes](#incompatible-api-changes)
+ - [Changing versioned APIs](#changing-versioned-apis)
+ - [Edit types.go](#edit-typesgo)
+ - [Edit defaults.go](#edit-defaultsgo)
+ - [Edit conversion.go](#edit-conversiongo)
+ - [Changing the internal structures](#changing-the-internal-structures)
+ - [Edit types.go](#edit-typesgo-1)
+ - [Edit validation.go](#edit-validationgo)
+ - [Edit version conversions](#edit-version-conversions)
+ - [Generate protobuf objects](#generate-protobuf-objects)
+ - [Edit json (un)marshaling code](#edit-json-unmarshaling-code)
+ - [Making a new API Group](#making-a-new-api-group)
+ - [Update the fuzzer](#update-the-fuzzer)
+ - [Update the semantic comparisons](#update-the-semantic-comparisons)
+ - [Implement your change](#implement-your-change)
+ - [Write end-to-end tests](#write-end-to-end-tests)
+ - [Examples and docs](#examples-and-docs)
+ - [Alpha, Beta, and Stable Versions](#alpha-beta-and-stable-versions)
+ - [Adding Unstable Features to Stable Versions](#adding-unstable-features-to-stable-versions)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# So you want to change the API?
+
+Before attempting a change to the API, you should familiarize yourself with a
+number of existing API types and with the [API conventions](api-conventions.md).
+If creating a new API type/resource, we also recommend that you first send a PR
+containing just a proposal for the new API types, and that you initially target
+the extensions API (pkg/apis/extensions).
+
+The Kubernetes API has two major components - the internal structures and
+the versioned APIs. The versioned APIs are intended to be stable, while the
+internal structures are implemented to best reflect the needs of the Kubernetes
+code itself.
+
+What this means for API changes is that you have to be somewhat thoughtful in
+how you approach changes, and that you have to touch a number of pieces to make
+a complete change. This document aims to guide you through the process, though
+not all API changes will need all of these steps.
+
+## Operational overview
+
+It is important to have a high level understanding of the API system used in
+Kubernetes in order to navigate the rest of this document.
+
+As mentioned above, the internal representation of an API object is decoupled
+from any one API version. This provides a lot of freedom to evolve the code,
+but it requires robust infrastructure to convert between representations. There
+are multiple steps in processing an API operation - even something as simple as
+a GET involves a great deal of machinery.
+
+The conversion process is logically a "star" with the internal form at the
+center. Every versioned API can be converted to the internal form (and
+vice-versa), but versioned APIs do not convert to other versioned APIs directly.
+This sounds like a heavy process, but in reality we do not intend to keep more
+than a small number of versions alive at once. While all of the Kubernetes code
+operates on the internal structures, they are always converted to a versioned
+form before being written to storage (disk or etcd) or being sent over a wire.
+Clients should consume and operate on the versioned APIs exclusively.
+
+To demonstrate the general process, here is a (hypothetical) example:
+
+ 1. A user POSTs a `Pod` object to `/api/v7beta1/...`
+ 2. The JSON is unmarshalled into a `v7beta1.Pod` structure
+ 3. Default values are applied to the `v7beta1.Pod`
+ 4. The `v7beta1.Pod` is converted to an `api.Pod` structure
+ 5. The `api.Pod` is validated, and any errors are returned to the user
+ 6. The `api.Pod` is converted to a `v6.Pod` (because v6 is the latest stable
+version)
+ 7. The `v6.Pod` is marshalled into JSON and written to etcd
+
+Now that we have the `Pod` object stored, a user can GET that object in any
+supported api version. For example:
+
+ 1. A user GETs the `Pod` from `/api/v5/...`
+ 2. The JSON is read from etcd and unmarshalled into a `v6.Pod` structure
+ 3. Default values are applied to the `v6.Pod`
+ 4. The `v6.Pod` is converted to an `api.Pod` structure
+ 5. The `api.Pod` is converted to a `v5.Pod` structure
+ 6. The `v5.Pod` is marshalled into JSON and sent to the user
+
+The implication of this process is that API changes must be done carefully and
+backward-compatibly.
+
+## On compatibility
+
+Before talking about how to make API changes, it is worthwhile to clarify what
+we mean by API compatibility. Kubernetes considers forwards and backwards
+compatibility of its APIs a top priority.
+
+An API change is considered forward and backward-compatible if it:
+
+ * adds new functionality that is not required for correct behavior (e.g.,
+does not add a new required field)
+ * does not change existing semantics, including:
+ * default values and behavior
+ * interpretation of existing API types, fields, and values
+ * which fields are required and which are not
+
+Put another way:
+
+1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before
+your change must work the same after your change.
+2. Any API call that uses your change must not cause problems (e.g. crash or
+degrade behavior) when issued against servers that do not include your change.
+3. It must be possible to round-trip your change (convert to different API
+versions and back) with no loss of information.
+4. Existing clients need not be aware of your change in order for them to
+continue to function as they did previously, even when your change is utilized.
+
+If your change does not meet these criteria, it is not considered strictly
+compatible, and may break older clients, or result in newer clients causing
+undefined behavior.
+
+Let's consider some examples. In a hypothetical API (assume we're at version
+v6), the `Frobber` struct looks something like this:
+
+```go
+// API v6.
+type Frobber struct {
+ Height int `json:"height"`
+ Param string `json:"param"`
+}
+```
+
+You want to add a new `Width` field. It is generally safe to add new fields
+without changing the API version, so you can simply change it to:
+
+```go
+// Still API v6.
+type Frobber struct {
+ Height int `json:"height"`
+ Width int `json:"width"`
+ Param string `json:"param"`
+}
+```
+
+The onus is on you to define a sane default value for `Width` such that rule #1
+above is true - API calls and stored objects that used to work must continue to
+work.
+
+For your next change you want to allow multiple `Param` values. You can not
+simply change `Param string` to `Params []string` (without creating a whole new
+API version) - that fails rules #1 and #2. You can instead do something like:
+
+```go
+// Still API v6, but kind of clumsy.
+type Frobber struct {
+ Height int `json:"height"`
+ Width int `json:"width"`
+ Param string `json:"param"` // the first param
+ ExtraParams []string `json:"extraParams"` // additional params
+}
+```
+
+Now you can satisfy the rules: API calls that provide the old style `Param`
+will still work, while servers that don't understand `ExtraParams` can ignore
+it. This is somewhat unsatisfying as an API, but it is strictly compatible.
+
+Part of the reason for versioning APIs and for using internal structs that are
+distinct from any one version is to handle growth like this. The internal
+representation can be implemented as:
+
+```go
+// Internal, soon to be v7beta1.
+type Frobber struct {
+ Height int
+ Width int
+ Params []string
+}
+```
+
+The code that converts to/from versioned APIs can decode this into the somewhat
+uglier (but compatible!) structures. Eventually, a new API version, let's call
+it v7beta1, will be forked and it can use the clean internal structure.
+
+We've seen how to satisfy rules #1 and #2. Rule #3 means that you can not
+extend one versioned API without also extending the others. For example, an
+API call might POST an object in API v7beta1 format, which uses the cleaner
+`Params` field, but the API server might store that object in trusty old v6
+form (since v7beta1 is "beta"). When the user reads the object back in the
+v7beta1 API it would be unacceptable to have lost all but `Params[0]`. This
+means that, even though it is ugly, a compatible change must be made to the v6
+API.
+
+However, this is very challenging to do correctly. It often requires multiple
+representations of the same information in the same API resource, which need to
+be kept in sync in the event that either is changed. For example, let's say you
+decide to rename a field within the same API version. In this case, you add
+units to `height` and `width`. You implement this by adding duplicate fields:
+
+```go
+type Frobber struct {
+ Height *int `json:"height"`
+ Width *int `json:"width"`
+ HeightInInches *int `json:"heightInInches"`
+ WidthInInches *int `json:"widthInInches"`
+}
+```
+
+You convert all of the fields to pointers in order to distinguish between unset
+and set to 0, and then set each corresponding field from the other in the
+defaulting pass (e.g., `heightInInches` from `height`, and vice versa), which
+runs just prior to conversion. That works fine when the user creates a resource
+from a hand-written configuration -- clients can write either field and read
+either field, but what about creation or update from the output of GET, or
+update via PATCH (see
+[In-place updates](../user-guide/managing-deployments.md#in-place-updates-of-resources))?
+In this case, the two fields will conflict, because only one field would be
+updated in the case of an old client that was only aware of the old field (e.g.,
+`height`).
+
+Say the client creates:
+
+```json
+{
+ "height": 10,
+ "width": 5
+}
+```
+
+and GETs:
+
+```json
+{
+ "height": 10,
+ "heightInInches": 10,
+ "width": 5,
+ "widthInInches": 5
+}
+```
+
+then PUTs back:
+
+```json
+{
+ "height": 13,
+ "heightInInches": 10,
+ "width": 5,
+ "widthInInches": 5
+}
+```
+
+The update should not fail, because it would have worked before `heightInInches`
+was added.
+
+Therefore, when there are duplicate fields, the old field MUST take precedence
+over the new, and the new field should be set to match by the server upon write.
+A new client would be aware of the old field as well as the new, and so can
+ensure that the old field is either unset or is set consistently with the new
+field. However, older clients would be unaware of the new field. Please avoid
+introducing duplicate fields due to the complexity they incur in the API.
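+
+A sketch of the defaulting logic this implies for the `Frobber` shown above
+(illustrative only): the old field wins when both are set, and whichever field
+is set gets mirrored into the other.
+
+```go
+// defaultFrobberHeight keeps the duplicated height fields consistent,
+// giving precedence to the older field.
+func defaultFrobberHeight(f *Frobber) {
+ switch {
+ case f.Height != nil:
+  v := *f.Height
+  f.HeightInInches = &v // old field takes precedence; mirror it into the new one
+ case f.HeightInInches != nil:
+  v := *f.HeightInInches
+  f.Height = &v // only the new field was set; keep the old one consistent
+ }
+}
+```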
+
+A new representation, even in a new API version, that is more expressive than an
+old one breaks backward compatibility, since clients that only understood the
+old representation would not be aware of the new representation nor its
+semantics. Examples of proposals that have run into this challenge include
+[generalized label selectors](http://issues.k8s.io/341) and [pod-level security
+context](http://prs.k8s.io/12823).
+
+As another interesting example, enumerated values cause similar challenges.
+Adding a new value to an enumerated set is *not* a compatible change. Clients
+which assume they know how to handle all possible values of a given field will
+not be able to handle the new values. However, removing value from an enumerated
+set *can* be a compatible change, if handled properly (treat the removed value
+as deprecated but allowed). This is actually a special case of a new
+representation, discussed above.
+
+For [Unions](api-conventions.md#unions), sets of fields where at most one should
+be set, it is acceptable to add a new option to the union if the [appropriate
+conventions](api-conventions.md#objects) were followed in the original object.
+Removing an option requires following the deprecation process.
+
+## Incompatible API changes
+
+There are times when an incompatible change might be OK, but mostly we want
+changes that meet the compatibility definition above. If you think you need to
+break compatibility, you should talk to the
+Kubernetes team first.
+
+Breaking compatibility of a beta or stable API version, such as v1, is
+unacceptable. Compatibility for experimental or alpha APIs is not strictly
+required, but breaking compatibility should not be done lightly, as it disrupts
+all users of the feature. Experimental APIs may be removed. Alpha and beta API
+versions may be deprecated and eventually removed wholesale, as described in the
+[versioning document](../design/versioning.md). Document incompatible changes
+across API versions under the appropriate
+[v? conversion tips tag in the api.md doc](../api.md).
+
+If your change is going to be backward incompatible or might be a breaking
+change for API consumers, please send an announcement to
+`kubernetes-dev@googlegroups.com` before the change gets in. If you are unsure,
+ask. Also make sure that the change gets documented in the release notes for the
+next release by labeling the PR with the "release-note" github label.
+
+If you found that your change accidentally broke clients, it should be reverted.
+
+In short, the expected API evolution is as follows:
+
+* `extensions/v1alpha1` ->
+* `newapigroup/v1alpha1` -> ... -> `newapigroup/v1alphaN` ->
+* `newapigroup/v1beta1` -> ... -> `newapigroup/v1betaN` ->
+* `newapigroup/v1` ->
+* `newapigroup/v2alpha1` -> ...
+
+While in extensions we have no obligation to move forward with the API at all
+and may delete or break it at any time.
+
+While in alpha we expect to move forward with it, but may break it.
+
+Once in beta we will preserve forward compatibility, but may introduce new
+versions and delete old ones.
+
+v1 must be backward-compatible for an extended length of time.
+
+## Changing versioned APIs
+
+For most changes, you will probably find it easiest to change the versioned
+APIs first. This forces you to think about how to make your change in a
+compatible way. Rather than doing each step in every version, it's usually
+easier to do each versioned API one at a time, or to do all of one version
+before starting "all the rest".
+
+### Edit types.go
+
+The struct definitions for each API are in `pkg/api/<version>/types.go`. Edit
+those files to reflect the change you want to make. Note that all types and
+non-inline fields in versioned APIs must be preceded by descriptive comments -
+these are used to generate documentation. Comments for types should not contain
+the type name; API documentation is generated from these comments and end-users
+should not be exposed to golang type names.
+
+Optional fields should have the `,omitempty` json tag; fields are interpreted as
+being required otherwise.
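+
+For instance (a sketch, not a real type), a new optional field in a versioned
+`types.go` might look like:
+
+```go
+// FrobberSpec is a hypothetical versioned type.
+type FrobberSpec struct {
+ // Param is the primary frobbing parameter.
+ Param string `json:"param"`
+ // Width of the frobber. Optional, so it carries ",omitempty"; a pointer lets
+ // the server distinguish "unset" from an explicit zero.
+ Width *int32 `json:"width,omitempty"`
+}
+```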
+
+### Edit defaults.go
+
+If your change includes new fields for which you will need default values, you
+need to add cases to `pkg/api/<version>/defaults.go`. Of course, since you
+have added code, you have to add a test: `pkg/api/<version>/defaults_test.go`.
+
+Do use pointers to scalars when you need to distinguish between an unset value
+and an automatic zero value. For example,
+`PodSpec.TerminationGracePeriodSeconds` is defined as `*int64` in the Go type
+definition. A zero value means 0 seconds, and a nil value asks the system to
+pick a default.
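+
+A defaulting function in that spirit might look like the following sketch (the
+type and the default value are illustrative):
+
+```go
+// PodSpecLike stands in for the real versioned type.
+type PodSpecLike struct {
+ TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
+}
+
+// setDefaultTerminationGracePeriod applies a default only when the field is
+// unset (nil); an explicit 0 is preserved.
+func setDefaultTerminationGracePeriod(spec *PodSpecLike) {
+ if spec.TerminationGracePeriodSeconds == nil {
+  d := int64(30) // illustrative default value
+  spec.TerminationGracePeriodSeconds = &d
+ }
+}
+```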
+
+Don't forget to run the tests!
+
+### Edit conversion.go
+
+Given that you have not yet changed the internal structs, this might feel
+premature, and that's because it is. You don't yet have anything to convert to
+or from. We will revisit this in the "internal" section. If you're doing this
+all in a different order (i.e. you started with the internal structs), then you
+should jump to that topic below. In the very rare case that you are making an
+incompatible change you might or might not want to do this now, but you will
+have to do more later. The files you want are
+`pkg/api/<version>/conversion.go` and `pkg/api/<version>/conversion_test.go`.
+
+Note that the conversion machinery doesn't generically handle conversion of
+values, such as various kinds of field references and API constants. [The client
+library](../../pkg/client/restclient/request.go) has custom conversion code for
+field references. You also need to add a call to
+`api.Scheme.AddFieldLabelConversionFunc` with a mapping function that understands
+supported translations.
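+
+A registration might look roughly like the sketch below; check the exact
+signature against the scheme package in your tree, and note that the kind and
+supported labels here are only examples.
+
+```go
+// Sketch: map externally visible field selectors for a kind to their internal
+// equivalents, rejecting everything else. (imports: "fmt", the api package)
+func addFieldLabelConversions() error {
+ return api.Scheme.AddFieldLabelConversionFunc("v1", "Frobber",
+  func(label, value string) (string, string, error) {
+   switch label {
+   case "metadata.name", "spec.param":
+    return label, value, nil // supported selectors pass through unchanged
+   default:
+    return "", "", fmt.Errorf("field label not supported: %s", label)
+   }
+  })
+}
+```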
+
+## Changing the internal structures
+
+Now it is time to change the internal structs so your versioned changes can be
+used.
+
+### Edit types.go
+
+Similar to the versioned APIs, the definitions for the internal structs are in
+`pkg/api/types.go`. Edit those files to reflect the change you want to make.
+Keep in mind that the internal structs must be able to express *all* of the
+versioned APIs.
+
+### Edit validation.go
+
+Most changes made to the internal structs need some form of input validation.
+Validation is currently done on internal objects in
+`pkg/api/validation/validation.go`. This validation is the one of the first
+opportunities we have to make a great user experience - good error messages and
+thorough validation help ensure that users are giving you what you expect and,
+when they don't, that they know why and how to fix it. Think hard about the
+contents of `string` fields, the bounds of `int` fields and the
+requiredness/optionalness of fields.
+
+Of course, code needs tests - `pkg/api/validation/validation_test.go`.
+
+### Edit version conversions
+
+At this point you have both the versioned API changes and the internal
+structure changes done. If there are any notable differences - field names,
+types, structural change in particular - you must add some logic to convert
+versioned APIs to and from the internal representation. If you see errors from
+the `serialization_test`, it may indicate the need for explicit conversions.
+
+The performance of conversions very heavily influences the performance of the apiserver.
+Thus, we are auto-generating conversion functions that are much more efficient
+than the generic ones (which are based on reflection and thus are highly
+inefficient).
+
+The conversion code resides with each versioned API; there are two files per versioned API:
+
+ - `pkg/api/<version>/conversion.go` containing manually written conversion
+functions
+ - `pkg/api/<version>/conversion_generated.go` containing auto-generated
+conversion functions
+ - `pkg/apis/extensions/<version>/conversion.go` containing manually written
+conversion functions
+ - `pkg/apis/extensions/<version>/conversion_generated.go` containing
+auto-generated conversion functions
+
+Since the auto-generated conversion functions call the manually written ones,
+the manually written functions need to follow a defined naming convention: a
+function converting type X in pkg a to type Y in pkg b should be named
+`convert_a_X_To_b_Y`.
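+
+Continuing the hypothetical `Frobber` example from the compatibility section, a
+manually written conversion following this convention might look like the
+sketch below (packages and fields are illustrative):
+
+```go
+// convert_v6_Frobber_To_api_Frobber folds the clumsy-but-compatible v6
+// representation into the cleaner internal one.
+func convert_v6_Frobber_To_api_Frobber(in *v6.Frobber, out *api.Frobber, s conversion.Scope) error {
+ out.Height = in.Height
+ out.Width = in.Width
+ out.Params = append([]string{in.Param}, in.ExtraParams...)
+ return nil
+}
+```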
+
+Also note that you can (and for efficiency reasons should) use auto-generated
+conversion functions when writing your conversion functions.
+
+Once all the necessary manually written conversions are added, you need to
+regenerate auto-generated ones. To regenerate them run:
+
+```sh
+hack/update-codegen.sh
+```
+
+As part of the build, kubernetes will also generate code to handle deep copy of
+your versioned api objects. The deep copy code resides with each versioned API:
+ - `<path_to_versioned_api>/zz_generated.deepcopy.go` containing auto-generated copy functions
+
+If regeneration is somehow not possible due to compile errors, the easiest
+workaround is to comment out the code causing errors and let the script
+regenerate it. If the auto-generated conversion methods are not used by the
+manually written ones, it's fine to just remove the whole file and let the
+generator create it from scratch.
+
+Unsurprisingly, adding manually written conversion also requires you to add
+tests to `pkg/api/<version>/conversion_test.go`.
+
+
+## Generate protobuf objects
+
+For any core API object, we also need to generate the Protobuf IDL and marshallers.
+That generation is done with
+
+```sh
+hack/update-generated-protobuf.sh
+```
+
+The vast majority of objects will not need any consideration when converting
+to protobuf, but be aware that if you depend on a Golang type in the standard
+library there may be additional work required, although in practice we typically
+use our own equivalents for JSON serialization. The `pkg/api/serialization_test.go`
+will verify that your protobuf serialization preserves all fields - be sure to
+run it several times to ensure there are no incompletely calculated fields.
+
+## Edit json (un)marshaling code
+
+We are auto-generating code for marshaling and unmarshaling json representation
+of api objects - this is to improve the overall system performance.
+
+The auto-generated code resides with each versioned API:
+
+ - `pkg/api/<version>/types.generated.go`
+ - `pkg/apis/extensions/<version>/types.generated.go`
+
+To regenerate them run:
+
+```sh
+hack/update-codecgen.sh
+```
+
+## Making a new API Group
+
+This section is under construction, as we make the tooling completely generic.
+
+At the moment, you'll have to make a new directory under `pkg/apis/`; copy the
+directory structure from `pkg/apis/authentication`. Add the new group/version to all
+of the `hack/{verify,update}-generated-{deep-copy,conversions,swagger}.sh` files
+in the appropriate places--it should just require adding your new group/version
+to a bash array. See [docs on adding an API group](adding-an-APIGroup.md) for
+more.
+
+Adding API groups outside of the `pkg/apis/` directory is not currently
+supported, but is clearly desirable. The deep copy & conversion generators need
+to work by parsing go files instead of by reflection; then they will be easy to
+point at arbitrary directories: see issue [#13775](http://issue.k8s.io/13775).
+
+## Update the fuzzer
+
+Part of our testing regimen for APIs is to "fuzz" (fill with random values) API
+objects and then convert them to and from the different API versions. This is
+a great way of exposing places where you lost information or made bad
+assumptions. If you have added any fields which need very careful formatting
+(the test does not run validation) or if you have made assumptions such as
+"this slice will always have at least 1 element", you may get an error or even
+a panic from the `serialization_test`. If so, look at the diff it produces (or
+the backtrace in case of a panic) and figure out what you forgot. Encode that
+into the fuzzer's custom fuzz functions. Hint: if you added defaults for a
+field, that field will need to have a custom fuzz function that ensures that the
+field is fuzzed to a non-empty value.
+
+The fuzzer can be found in `pkg/api/testing/fuzzer.go`.
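+
+A custom fuzz function for such a defaulted field might look like this sketch,
+written as an entry in the fuzzer's custom function list and assuming the
+gofuzz package used there; `FrobberSpec` and `Param` are hypothetical.
+
+```go
+// Ensure a defaulted field is never fuzzed to the empty value, so that
+// defaulting does not break round-trip comparisons.
+func(f *FrobberSpec, c fuzz.Continue) {
+ c.FuzzNoCustom(f) // fuzz all other fields without re-entering this function
+ if f.Param == "" {
+  f.Param = "fuzzed-" + c.RandString()
+ }
+},
+```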
+
+## Update the semantic comparisons
+
+VERY VERY rarely is this needed, but when it hits, it hurts. In some rare cases
+we end up with objects (e.g. resource quantities) that have morally equivalent
+values with different bitwise representations (e.g. value 10 with a base-2
+formatter is the same as value 10 with a base-10 formatter). The only way Go
+knows how to do deep-equality is through field-by-field bitwise comparisons.
+This is a problem for us.
+
+The first thing you should do is try not to do that. If you really can't avoid
+this, I'd like to introduce you to our `semantic DeepEqual` routine. It supports
+custom overrides for specific types - you can find that in `pkg/api/helpers.go`.
+
+There's one other time when you might have to touch this: `unexported fields`.
+You see, while Go's `reflect` package is allowed to touch `unexported fields`,
+us mere mortals are not - this includes `semantic DeepEqual`. Fortunately, most
+of our API objects are "dumb structs" all the way down - all fields are exported
+(start with a capital letter) and there are no unexported fields. But sometimes
+you want to include an object in our API that does have unexported fields
+somewhere in it (for example, `time.Time` has unexported fields). If this hits
+you, you may have to touch the `semantic DeepEqual` customization functions.
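+
+A custom equality override of the kind registered there looks roughly like this
+sketch; the resource quantity example mirrors the discussion above, and the
+exact helper names should be checked against `pkg/api/helpers.go`.
+
+```go
+// Sketch: register type-specific equality overrides so that semantically
+// equal values compare equal regardless of their internal representation.
+var Semantic = conversion.EqualitiesOrDie(
+ func(a, b resource.Quantity) bool {
+  // The same numeric value with different formatters is still "equal".
+  return a.Cmp(b) == 0
+ },
+)
+```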
+
+## Implement your change
+
+Now you have the API all changed - go implement whatever it is that you're
+doing!
+
+## Write end-to-end tests
+
+Check out the [E2E docs](e2e-tests.md) for detailed information about how to
+write end-to-end tests for your feature.
+
+## Examples and docs
+
+At last, your change is done, all unit tests pass, e2e passes, you're done,
+right? Actually, no. You just changed the API. If you are touching an existing
+facet of the API, you have to try *really* hard to make sure that *all* the
+examples and docs are updated. There's no easy way to do this, due in part to
+JSON and YAML silently dropping unknown fields. You're clever - you'll figure it
+out. Put `grep` or `ack` to good use.
+
+If you added functionality, you should consider documenting it and/or writing
+an example to illustrate your change.
+
+Make sure you update the swagger and OpenAPI spec by running:
+
+```sh
+hack/update-swagger-spec.sh
+hack/update-openapi-spec.sh
+```
+
+The API spec changes should be in a commit separate from your other changes.
+
+## Alpha, Beta, and Stable Versions
+
+New feature development proceeds through a series of stages of increasing
+maturity:
+
+- Development level
+ - Object Versioning: no convention
+ - Availability: not committed to main kubernetes repo, and thus not available
+in official releases
+ - Audience: other developers closely collaborating on a feature or
+proof-of-concept
+ - Upgradeability, Reliability, Completeness, and Support: no requirements or
+guarantees
+- Alpha level
+ - Object Versioning: API version name contains `alpha` (e.g. `v1alpha1`)
+ - Availability: committed to main kubernetes repo; appears in an official
+release; feature is disabled by default, but may be enabled by flag
+ - Audience: developers and expert users interested in giving early feedback on
+features
+ - Completeness: some API operations, CLI commands, or UI support may not be
+implemented; the API need not have had an *API review* (an intensive and
+targeted review of the API, on top of a normal code review)
+ - Upgradeability: the object schema and semantics may change in a later
+software release, without any provision for preserving objects in an existing
+cluster; removing the upgradability concern allows developers to make rapid
+progress; in particular, API versions can increment faster than the minor
+release cadence and the developer need not maintain multiple versions;
+developers should still increment the API version when object schema or
+semantics change in an [incompatible way](#on-compatibility)
+ - Cluster Reliability: because the feature is relatively new, and may lack
+complete end-to-end tests, enabling the feature via a flag might expose bugs
+which destabilize the cluster (e.g. a bug in a control loop might rapidly create
+excessive numbers of objects, exhausting API storage).
+ - Support: there is *no commitment* from the project to complete the feature;
+the feature may be dropped entirely in a later software release
+ - Recommended Use Cases: only in short-lived testing clusters, due to the
+lack of upgradeability and lack of long-term support.
+- Beta level:
+ - Object Versioning: API version name contains `beta` (e.g. `v2beta3`)
+ - Availability: in official Kubernetes releases, and enabled by default
+ - Audience: users interested in providing feedback on features
+ - Completeness: all API operations, CLI commands, and UI support should be
+implemented; end-to-end tests complete; the API has had a thorough API review
+and is thought to be complete, though use during beta may frequently turn up API
+issues not thought of during review
+ - Upgradeability: the object schema and semantics may change in a later
+software release; when this happens, an upgrade path will be documented; in some
+cases, objects will be automatically converted to the new version; in other
+cases, a manual upgrade may be necessary; a manual upgrade may require downtime
+for anything relying on the new feature, and may require manual conversion of
+objects to the new version; when manual conversion is necessary, the project
+will provide documentation on the process (for an example, see [v1 conversion
+tips](../api.md#v1-conversion-tips))
+ - Cluster Reliability: since the feature has e2e tests, enabling the feature
+via a flag should not create new bugs in unrelated features; because the feature
+is new, it may have minor bugs
+ - Support: the project commits to complete the feature, in some form, in a
+subsequent Stable version; typically this will happen within 3 months, but
+sometimes longer; releases should simultaneously support two consecutive
+versions (e.g. `v1beta1` and `v1beta2`; or `v1beta2` and `v1`) for at least one
+minor release cycle (typically 3 months) so that users have enough time to
+upgrade and migrate objects
+ - Recommended Use Cases: in short-lived testing clusters; in production
+clusters as part of a short-lived evaluation of the feature in order to provide
+feedback
+- Stable level:
+ - Object Versioning: API version `vX` where `X` is an integer (e.g. `v1`)
+ - Availability: in official Kubernetes releases, and enabled by default
+ - Audience: all users
+ - Completeness: same as beta
+ - Upgradeability: only [strictly compatible](#on-compatibility) changes
+allowed in subsequent software releases
+ - Cluster Reliability: high
+ - Support: API version will continue to be present for many subsequent
+software releases;
+ - Recommended Use Cases: any
+
+### Adding Unstable Features to Stable Versions
+
+When adding a feature to an object which is already Stable, the new fields and
+new behaviors need to meet the Stable level requirements. If these cannot be
+met, then the new field cannot be added to the object.
+
+For example, consider the following object:
+
+```go
+// API v6.
+type Frobber struct {
+ Height int `json:"height"`
+ Param string `json:"param"`
+}
+```
+
+A developer is considering adding a new `Width` parameter, like this:
+
+```go
+// API v6.
+type Frobber struct {
+ Height int `json:"height"`
+ Width int `json:"width"`
+ Param string `json:"param"`
+}
+```
+
+However, the new feature is not stable enough to be used in a stable version
+(`v6`). Some reasons for this might include:
+
+- the final representation is undecided (e.g. should it be called `Width` or
+`Breadth`?)
+- the implementation is not stable enough for general use (e.g. the `Area()`
+routine sometimes overflows.)
+
+The developer cannot add the new field until stability is met. However,
+sometimes stability cannot be met until some users try the new feature, and some
+users are only able or willing to accept a released version of Kubernetes. In
+that case, the developer has a few options, both of which require staging work
+over several releases.
+
+
+A preferred option is to first make a release where the new value (`Width` in
+this example) is specified via an annotation, like this:
+
+```yaml
+kind: frobber
+version: v6
+metadata:
+ name: myfrobber
+ annotations:
+ frobbing.alpha.kubernetes.io/width: 2
+height: 4
+param: "green and blue"
+```
+
+This format allows users to specify the new field, but makes it clear that they
+are using an Alpha feature when they do, since the word `alpha` is in the
+annotation key.
+
+Another option is to introduce a new type with a new `alpha` or `beta` version
+designator, like this:
+
+```go
+// API v6alpha2
+type Frobber struct {
+ Height int `json:"height"`
+ Width int `json:"width"`
+ Param string `json:"param"`
+}
+```
+
+The latter requires that all objects in the same API group as `Frobber` be
+replicated in the new version, `v6alpha2`. This also requires the user to use a new
+client which uses the other version. Therefore, this is not a preferred option.
+
+A related issue is how a cluster manager can roll back from a new version
+with a new feature that is already being used by users. See
+https://github.com/kubernetes/kubernetes/issues/4855.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/api_changes.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/automation.md b/contributors/devel/automation.md
new file mode 100644
index 00000000..3a9f1754
--- /dev/null
+++ b/contributors/devel/automation.md
@@ -0,0 +1,116 @@
+# Kubernetes Development Automation
+
+## Overview
+
+Kubernetes uses a variety of automated tools in an attempt to relieve developers
+of repetitive, low brain power work. This document attempts to describe these
+processes.
+
+
+## Submit Queue
+
+In an effort to
+ * reduce load on core developers
+ * maintain e2e stability
+ * load test github's label feature
+
+we have added an automated
+[submit-queue](https://github.com/kubernetes/contrib/blob/master/mungegithub/mungers/submit-queue.go)
+to the
+[github "munger"](https://github.com/kubernetes/contrib/tree/master/mungegithub)
+for kubernetes.
+
+The submit-queue does the following:
+
+```go
+for _, pr := range readyToMergePRs() {
+ if testsAreStable() {
+ if retestPR(pr) == success {
+ mergePR(pr)
+ }
+ }
+}
+```
+
+The status of the submit-queue is [online.](http://submit-queue.k8s.io/)
+
+### Ready to merge status
+
+The submit-queue lists the requirements it believes are in force on the [merge requirements tab](http://submit-queue.k8s.io/#/info) of the info page. That list may be more up to date than the one below.
+
+A PR is considered "ready for merging" if it matches the following:
+ * The PR must have the label "cla: yes" or "cla: human-approved"
+ * The PR must be mergeable, i.e. it cannot need a rebase
+ * All of the following github statuses must be green
+ * Jenkins GCE Node e2e
+ * Jenkins GCE e2e
+ * Jenkins unit/integration
+ * The PR cannot have any prohibited future milestones (such as a v1.5 milestone during v1.4 code freeze)
+ * The PR must have the "lgtm" label. The "lgtm" label is automatically applied
+ following a review comment consisting of only "LGTM" (case-insensitive)
+ * The PR must not have been updated since the "lgtm" label was applied
+ * The PR must not have the "do-not-merge" label
+
+### Merge process
+
+Merges _only_ occur when the [critical builds](http://submit-queue.k8s.io/#/e2e)
+are passing. We're open to including more builds here, let us know...
+
+Merges are serialized, so only a single PR is merged at a time, to ensure
+against races.
+
+If the PR has the `retest-not-required` label, it is simply merged. If the PR does
+not have this label, the e2e, unit/integration, and node tests are re-run. If these
+tests pass a second time, the PR will be merged as long as the `critical builds` are
+green when this PR finishes retesting.
+
+## Github Munger
+
+We run [github "mungers"](https://github.com/kubernetes/contrib/tree/master/mungegithub).
+
+This runs repeatedly over github pulls and issues and runs modular "mungers"
+similar to "mungedocs." The mungers include the 'submit-queue' referenced above along
+with numerous other functions. See the README in the link above.
+
+Please feel free to unleash your creativity on this tool, send us new mungers
+that you think will help support the Kubernetes development process.
+
+### Closing stale pull-requests
+
+Github Munger will close pull-requests that don't have human activity in the
+last 90 days. It will warn about this process 60 days before closing the
+pull-request, and warn again 30 days later. One way to prevent this from
+happening is to add the "keep-open" label on the pull-request.
+
+Feel free to re-open and maybe add the "keep-open" label if this happens to a
+valid pull-request. It may also be a good opportunity to get more attention by
+verifying that it is properly assigned and/or mention people that might be
+interested. Commenting on the pull-request will also keep it open for another 90
+days.
+
+## PR builder
+
+We also run a robotic PR builder that attempts to run tests for each PR.
+
+Before a PR from an unknown user is run, the PR builder bot (`k8s-bot`) asks for
+a message from a contributor confirming that the PR is "ok to test"; the
+contributor replies with that message. ("please" is optional, but remember to treat your robots with
+kindness...)
+
+## FAQ:
+
+#### How can I ask my PR to be tested again for Jenkins failures?
+
+PRs should only need to be manually re-tested if you believe there was a flake
+during the original test. All flakes should be filed as an
+[issue](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake).
+Once you find or file a flake, a contributor (this may be you!) should request
+a retest with "@k8s-bot test this issue: #NNNNN", where NNNNN is replaced with
+the issue number you found or filed.
+
+Any pushes of new code to the PR will automatically trigger a new test. No human
+interaction is required.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/automation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/bazel.md b/contributors/devel/bazel.md
new file mode 100644
index 00000000..e6a4e9c5
--- /dev/null
+++ b/contributors/devel/bazel.md
@@ -0,0 +1,44 @@
+# Build with Bazel
+
+Building with bazel is currently experimental. Automanaged BUILD rules have the
+tag "automanaged" and are maintained by
+[gazel](https://github.com/mikedanese/gazel). Instructions for installing bazel
+can be found [here](https://www.bazel.io/versions/master/docs/install.html).
+
+To build docker images for the components, run:
+
+```
+$ bazel build //build-tools/...
+```
+
+To run many of the unit tests, run:
+
+```
+$ bazel test //cmd/... //build-tools/... //pkg/... //federation/... //plugin/...
+```
+
+To update automanaged build files, run:
+
+```
+$ ./hack/update-bazel.sh
+```
+
+**NOTES**: `update-bazel.sh` only works if the Kubernetes checkout directory is `$GOPATH/src/k8s.io/kubernetes`.
+
+To update a single build file, run:
+
+```
+$ # get gazel
+$ go get -u github.com/mikedanese/gazel
+$ # e.g. update ./pkg/kubectl/BUILD
+$ gazel -root="${YOUR_KUBE_ROOT_PATH}" ./pkg/kubectl
+```
+
+Updating the BUILD file for a package will be required when:
+* Files are added to or removed from a package
+* Import dependencies change for a package
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/bazel.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/cherry-picks.md b/contributors/devel/cherry-picks.md
new file mode 100644
index 00000000..ad8df62d
--- /dev/null
+++ b/contributors/devel/cherry-picks.md
@@ -0,0 +1,64 @@
+# Overview
+
+This document explains how cherry picks are managed on release branches within the
+Kubernetes project. Patches are either applied in batches or individually
+depending on the point in the release cycle.
+
+## Propose a Cherry Pick
+
+1. Cherrypicks are [managed with labels and milestones](pull-requests.md#release-notes)
+1. To get a PR merged to the release branch, first ensure the following labels
+ are on the original **master** branch PR:
+ * An appropriate milestone (e.g. v1.3)
+ * The `cherrypick-candidate` label
+1. If `release-note-none` is set on the master PR, the cherrypick PR will need
+ to set the same label to confirm that no release note is needed.
+1. `release-note` labeled PRs generate a release note using the PR title by
+ default OR the release-note block in the PR template if filled in.
+ * See the [PR template](../../.github/PULL_REQUEST_TEMPLATE.md) for more
+ details.
+ * PR titles and body comments are mutable and can be modified at any time
+ prior to the release to reflect a release note friendly message.
+
+### How do cherrypick-candidates make it to the release branch?
+
+1. **BATCHING:** After a branch is first created and before the X.Y.0 release
+ * Branch owners review the list of `cherrypick-candidate` labeled PRs.
+ * PRs batched up and merged to the release branch get a `cherrypick-approved`
+label and lose the `cherrypick-candidate` label.
+ * PRs that won't be merged to the release branch lose the
+`cherrypick-candidate` label.
+
+1. **INDIVIDUAL CHERRYPICKS:** After the first X.Y.0 on a branch
+ * Run the cherry pick script. This example applies a master branch PR #98765
+to the remote branch `upstream/release-3.14`:
+`hack/cherry_pick_pull.sh upstream/release-3.14 98765`
+ * Your cherrypick PR (targeted to the branch) will immediately get the
+`do-not-merge` label. The branch owner will triage PRs targeted to
+the branch and label the ones to be merged by applying the `lgtm`
+label.
+
+There is an [issue](https://github.com/kubernetes/kubernetes/issues/23347) open
+tracking the tool to automate the batching procedure.
+
+## Cherry Pick Review
+
+Cherry pick pull requests are reviewed differently than normal pull requests. In
+particular, they may be self-merged by the release branch owner without fanfare,
+in the case that the release branch owner knows the cherry pick was already
+requested. This should not be the norm, but it may happen.
+
+## Searching for Cherry Picks
+
+See the [cherrypick queue dashboard](http://cherrypick.k8s.io/#/queue) for
+status of PRs labeled as `cherrypick-candidate`.
+
+[Contributor License Agreements](http://releases.k8s.io/HEAD/CONTRIBUTING.md) are
+considered implicit for all code within cherry-pick pull requests, ***unless
+there is a large conflict***.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/cherry-picks.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/cli-roadmap.md b/contributors/devel/cli-roadmap.md
new file mode 100644
index 00000000..cd21da08
--- /dev/null
+++ b/contributors/devel/cli-roadmap.md
@@ -0,0 +1,11 @@
+# Kubernetes CLI/Configuration Roadmap
+
+See github issues with the following labels:
+* [area/app-config-deployment](https://github.com/kubernetes/kubernetes/labels/area/app-config-deployment)
+* [component/kubectl](https://github.com/kubernetes/kubernetes/labels/component/kubectl)
+* [component/clientlib](https://github.com/kubernetes/kubernetes/labels/component/clientlib)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/cli-roadmap.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/client-libraries.md b/contributors/devel/client-libraries.md
new file mode 100644
index 00000000..d38f9fd7
--- /dev/null
+++ b/contributors/devel/client-libraries.md
@@ -0,0 +1,27 @@
+## Kubernetes API client libraries
+
+### Supported
+
+ * [Go](https://github.com/kubernetes/client-go)
+
+### User Contributed
+
+*Note: Libraries provided by outside parties are supported by their authors, not
+the core Kubernetes team*
+
+ * [Clojure](https://github.com/yanatan16/clj-kubernetes-api)
+ * [Java (OSGi)](https://bitbucket.org/amdatulabs/amdatu-kubernetes)
+ * [Java (Fabric8, OSGi)](https://github.com/fabric8io/kubernetes-client)
+ * [Node.js](https://github.com/tenxcloud/node-kubernetes-client)
+ * [Node.js](https://github.com/godaddy/kubernetes-client)
+ * [Perl](https://metacpan.org/pod/Net::Kubernetes)
+ * [PHP](https://github.com/devstub/kubernetes-api-php-client)
+ * [PHP](https://github.com/maclof/kubernetes-client)
+ * [Python](https://github.com/eldarion-gondor/pykube)
+ * [Ruby](https://github.com/Ch00k/kuber)
+ * [Ruby](https://github.com/abonas/kubeclient)
+ * [Scala](https://github.com/doriordan/skuber)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/client-libraries.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/coding-conventions.md b/contributors/devel/coding-conventions.md
new file mode 100644
index 00000000..bcfab41d
--- /dev/null
+++ b/contributors/devel/coding-conventions.md
@@ -0,0 +1,147 @@
+# Coding Conventions
+
+Updated: 5/3/2016
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Coding Conventions](#coding-conventions)
+ - [Code conventions](#code-conventions)
+ - [Testing conventions](#testing-conventions)
+ - [Directory and file conventions](#directory-and-file-conventions)
+ - [Coding advice](#coding-advice)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## Code conventions
+
+ - Bash
+
+ - https://google.github.io/styleguide/shell.xml
+
+ - Ensure that build, release, test, and cluster-management scripts run on
+OS X
+
+ - Go
+
+ - Ensure your code passes the [presubmit checks](development.md#hooks)
+
+ - [Go Code Review
+Comments](https://github.com/golang/go/wiki/CodeReviewComments)
+
+ - [Effective Go](https://golang.org/doc/effective_go.html)
+
+ - Comment your code.
+ - [Go's commenting
+conventions](http://blog.golang.org/godoc-documenting-go-code)
+ - If reviewers ask questions about why the code is the way it is, that's a
+sign that comments might be helpful.
+
+
+ - Command-line flags should use dashes, not underscores
+
+
+ - Naming
+ - Please consider package name when selecting an interface name, and avoid
+redundancy.
+
+ - e.g.: `storage.Interface` is better than `storage.StorageInterface`.
+
+ - Do not use uppercase characters, underscores, or dashes in package
+names.
+ - Please consider parent directory name when choosing a package name.
+
+ - so pkg/controllers/autoscaler/foo.go should say `package autoscaler`
+not `package autoscalercontroller`.
+ - Unless there's a good reason, the `package foo` line should match
+the name of the directory in which the .go file exists.
+ - Importers can use a different name if they need to disambiguate.
+
+ - Locks should be called `lock` and should never be embedded (always `lock
+sync.Mutex`). When multiple locks are present, give each lock a distinct name
+following Go conventions - `stateLock`, `mapLock` etc. (see the short sketch at
+the end of this section).
+
+ - [API changes](api_changes.md)
+
+ - [API conventions](api-conventions.md)
+
+ - [Kubectl conventions](kubectl-conventions.md)
+
+ - [Logging conventions](logging.md)
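+
+To make the flag and lock conventions above concrete, here is a minimal,
+hypothetical Go sketch; the `autoscaler` package, the `reconcile-interval`
+flag, and the field names are illustrative only and are not taken from the
+Kubernetes tree:
+
+```go
+// Package autoscaler is a hypothetical example that only illustrates the
+// flag-naming and lock-naming conventions above.
+package autoscaler
+
+import (
+	"flag"
+	"sync"
+	"time"
+)
+
+// Command-line flags use dashes, not underscores.
+var reconcileInterval = flag.Duration("reconcile-interval", 30*time.Second,
+	"how often to reconcile the desired and actual state")
+
+// Controller shows the locking convention: locks are named, never embedded,
+// and each lock gets a distinct name when there is more than one.
+type Controller struct {
+	// stateLock guards desiredReplicas.
+	stateLock       sync.Mutex
+	desiredReplicas int
+
+	// mapLock guards targets.
+	mapLock sync.Mutex
+	targets map[string]int
+}
+```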
+
+## Testing conventions
+
+ - All new packages and most new significant functionality must come with unit
+tests
+
+ - Table-driven tests are preferred for testing multiple scenarios/inputs; for
+example, see [TestNamespaceAuthorization](../../test/integration/auth/auth_test.go)
+(a minimal sketch also appears at the end of this section)
+
+ - Significant features should come with integration (test/integration) and/or
+[end-to-end (test/e2e) tests](e2e-tests.md)
+ - Including new kubectl commands and major features of existing commands
+
+ - Unit tests must pass on OS X and Windows platforms - if you use Linux-specific
+features, your test case must either be skipped on Windows or compiled
+out (skipped is better when running Linux-specific commands, compiled out is
+required when your code does not compile on Windows).
+
+ - Avoid relying on Docker hub (e.g. pull from Docker hub). Use gcr.io instead.
+
+ - Avoid waiting for a short amount of time (or not waiting at all) and expecting
+an asynchronous event to have happened (e.g. waiting 1 second and expecting a Pod
+to be running). Wait and retry instead.
+
+ - See the [testing guide](testing.md) for additional testing advice.
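+
+For illustration, here is a minimal, self-contained sketch of a table-driven
+test; the `Add` function and the test cases are hypothetical and are not taken
+from the Kubernetes tree:
+
+```go
+package util
+
+import "testing"
+
+// Add is a trivial, hypothetical function used only to show the test shape.
+func Add(a, b int) int { return a + b }
+
+// TestAdd is a table-driven test: each scenario is one row in the table, and
+// a single loop exercises all of them.
+func TestAdd(t *testing.T) {
+	testCases := []struct {
+		name     string
+		a, b     int
+		expected int
+	}{
+		{name: "zero values", a: 0, b: 0, expected: 0},
+		{name: "positive values", a: 2, b: 3, expected: 5},
+		{name: "mixed signs", a: -2, b: 3, expected: 1},
+	}
+
+	for _, tc := range testCases {
+		if got := Add(tc.a, tc.b); got != tc.expected {
+			t.Errorf("%s: Add(%d, %d) = %d, expected %d", tc.name, tc.a, tc.b, got, tc.expected)
+		}
+	}
+}
+```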
+
+## Directory and file conventions
+
+ - Avoid package sprawl. Find an appropriate subdirectory for new packages.
+(See [#4851](http://issues.k8s.io/4851) for discussion.)
+ - Libraries with no more appropriate home belong in new package
+subdirectories of pkg/util
+
+ - Avoid general utility packages. Packages called "util" are suspect. Instead,
+derive a name that describes your desired function. For example, the utility
+functions dealing with waiting for operations are in the "wait" package and
+include functionality like Poll, so the full name is `wait.Poll` (a short usage
+sketch appears at the end of this section).
+
+ - All filenames should be lowercase
+
+ - Go source files and directories use underscores, not dashes
+ - Package directories should generally avoid using separators as much as
+possible (when packages are multiple words, they usually should be in nested
+subdirectories).
+
+ - Document directories and filenames should use dashes rather than underscores
+
+ - Contrived examples that illustrate system features belong in
+/docs/user-guide or /docs/admin, depending on whether it is a feature primarily
+intended for users that deploy applications or cluster administrators,
+respectively. Actual application examples belong in /examples.
+ - Examples should also illustrate [best practices for configuration and
+using the system](../user-guide/config-best-practices.md)
+
+ - Third-party code
+
+ - Go code for normal third-party dependencies is managed using
+[Godeps](https://github.com/tools/godep)
+
+ - Other third-party code belongs in `/third_party`
+ - forked third party Go code goes in `/third_party/forked`
+ - forked _golang stdlib_ code goes in `/third_party/golang`
+
+ - Third-party code must include licenses
+
+ - This includes modified third-party code and excerpts as well.
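+
+Since the "wait" package is the running example above, here is a small sketch
+of what calling `wait.Poll` looks like; the import path reflects the package
+layout at the time of writing, and the `podIsRunning` callback is a
+hypothetical placeholder:
+
+```go
+package example
+
+import (
+	"time"
+
+	"k8s.io/kubernetes/pkg/util/wait"
+)
+
+// waitForPodRunning polls the hypothetical podIsRunning condition every two
+// seconds for up to one minute, instead of sleeping once and hoping the
+// asynchronous work has already finished.
+func waitForPodRunning(podIsRunning func() (done bool, err error)) error {
+	return wait.Poll(2*time.Second, 1*time.Minute, podIsRunning)
+}
+```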
+
+## Coding advice
+
+ - Go
+
+ - [Go landmines](https://gist.github.com/lavalamp/4bd23295a9f32706a48f)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/coding-conventions.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/collab.md b/contributors/devel/collab.md
new file mode 100644
index 00000000..b4a6281d
--- /dev/null
+++ b/contributors/devel/collab.md
@@ -0,0 +1,87 @@
+# On Collaborative Development
+
+Kubernetes is open source, but many of the people working on it do so as their
+day job. In order to avoid forcing people to be "at work" effectively 24/7, we
+want to establish some semi-formal protocols around development. Hopefully these
+rules make things go more smoothly. If you find that this is not the case,
+please complain loudly.
+
+## Patches welcome
+
+First and foremost: as a potential contributor, your changes and ideas are
+welcome at any hour of the day or night, weekdays, weekends, and holidays.
+Please do not ever hesitate to ask a question or send a PR.
+
+## Code reviews
+
+All changes must be code reviewed. For non-maintainers this is obvious, since
+you can't commit anyway. But even for maintainers, we want all changes to get at
+least one review, preferably (for non-trivial changes obligatorily) from someone
+who knows the areas the change touches. For non-trivial changes we may want two
+reviewers. The primary reviewer will make this decision and nominate a second
+reviewer, if needed. Except for trivial changes, PRs should not be committed
+until relevant parties (e.g. owners of the subsystem affected by the PR) have
+had a reasonable chance to look at the PR in their local business hours.
+
+Most PRs will find reviewers organically. If a maintainer intends to be the
+primary reviewer of a PR they should set themselves as the assignee on GitHub
+and say so in a reply to the PR. Only the primary reviewer of a change should
+actually do the merge, except in rare cases (e.g. they are unavailable in a
+reasonable timeframe).
+
+If a PR has gone 2 work days without an owner emerging, please poke the PR
+thread and ask for a reviewer to be assigned.
+
+Except for rare cases, such as trivial changes (e.g. typos, comments) or
+emergencies (e.g. broken builds), maintainers should not merge their own
+changes.
+
+Expect reviewers to request that you avoid [common go style
+mistakes](https://github.com/golang/go/wiki/CodeReviewComments) in your PRs.
+
+## Assigned reviews
+
+Maintainers can assign reviews to other maintainers, when appropriate. The
+assignee becomes the shepherd for that PR and is responsible for merging the PR
+once they are satisfied with it or else closing it. The assignee might request
+reviews from non-maintainers.
+
+## Merge hours
+
+Maintainers will do merges of appropriately reviewed-and-approved changes during
+their local "business hours" (typically 7:00 am Monday to 5:00 pm (17:00h)
+Friday). PRs that arrive over the weekend or on holidays will only be merged if
+there is a very good reason for it and if the code review requirements have been
+met. Concretely this means that nobody should merge changes immediately before
+going to bed for the night.
+
+There may be discussion and even approvals granted outside of the above hours,
+but merges will generally be deferred.
+
+If a PR is considered complex or controversial, the merge of that PR should be
+delayed to give all interested parties in all timezones the opportunity to
+provide feedback. Concretely, this means that such PRs should be held for 24
+hours before merging. Of course "complex" and "controversial" are left to the
+judgment of the people involved, but we trust that part of being a committer is
+the judgment required to evaluate such things honestly, and not be motivated by
+your desire (or your cube-mate's desire) to get their code merged. Also see
+"Holds" below: any reviewer can issue a "hold" to indicate that the PR is in
+fact complicated or complex and deserves further review.
+
+PRs that are incorrectly judged to be merge-able may be reverted and subjected
+to re-review if subsequent reviewers believe that they are in fact controversial
+or complex.
+
+
+## Holds
+
+Any maintainer or core contributor who wants to review a PR but does not have
+time immediately may put a hold on a PR simply by saying so on the PR discussion
+and offering an ETA measured in single-digit days at most. Any PR that has a
+hold shall not be merged until the person who requested the hold acks the
+review, withdraws their hold, or is overruled by a preponderance of maintainers.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/collab.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/community-expectations.md b/contributors/devel/community-expectations.md
new file mode 100644
index 00000000..ff2487fd
--- /dev/null
+++ b/contributors/devel/community-expectations.md
@@ -0,0 +1,87 @@
+## Community Expectations
+
+Kubernetes is a community project. Consequently, it is wholly dependent on
+its community to provide a productive, friendly and collaborative environment.
+
+The first and foremost goal of the Kubernetes community is to develop orchestration
+technology that radically simplifies the process of creating reliable
+distributed systems. However, a second, equally important goal is the creation
+of a community that fosters easy, agile development of such orchestration
+systems.
+
+We therefore describe the expectations for
+members of the Kubernetes community. This document is intended to be a living one
+that evolves as the community evolves via the same PR and code review process
+that shapes the rest of the project. It currently covers the expectations
+of conduct that govern all members of the community as well as the expectations
+around code review that govern all active contributors to Kubernetes.
+
+### Code of Conduct
+
+The most important expectation of the Kubernetes community is that all members
+abide by the Kubernetes [community code of conduct](../../code-of-conduct.md).
+Only by respecting each other can we develop a productive, collaborative
+community.
+
+### Code review
+
+As a community we believe in the [value of code review for all contributions](collab.md).
+Code review increases both the quality and readability of our codebase, which
+in turn produces high quality software.
+
+However, the code review process can also introduce latency for contributors
+and additional work for reviewers that can frustrate both parties.
+
+Consequently, as a community we expect that all active participants in the
+community will also be active reviewers.
+
+We ask that active contributors to the project participate in the code review process
+in areas where that contributor has expertise. Active
+contributors are considered to be anyone who meets any of the following criteria:
+ * Sent more than two pull requests (PRs) in the previous month, or more
+ than 20 PRs in the previous year.
+ * Filed more than three issues in the previous month, or more than 30 issues in
+ the previous 12 months.
+ * Commented on more than pull requests in the previous month, or
+ more than 50 pull requests in the previous 12 months.
+ * Marked any PR as LGTM in the previous month.
+ * Have *collaborator* permissions in the Kubernetes github project.
+
+In addition to these community expectations, any community member who wants to
+be an active reviewer can also add their name to an *active reviewer* file
+(location tbd) which will make them an active reviewer for as long as they
+are included in the file.
+
+#### Expectations of reviewers: Review comments
+
+Because reviewers are often the first points of contact between new members of
+the community and can significantly impact the first impression of the
+Kubernetes community, reviewers are especially important in shaping the
+Kubernetes community. Reviewers are highly encouraged to review the
+[code of conduct](../../code-of-conduct.md) and are strongly encouraged to go above
+and beyond the code of conduct to promote a collaborative, respectful
+Kubernetes community.
+
+#### Expectations of reviewers: Review latency
+
+Reviewers are expected to respond in a timely fashion to PRs that are assigned
+to them. Reviewers are expected to respond to *active* PRs with reasonable
+latency, and if reviewers fail to respond, those PRs may be assigned to other
+reviewers.
+
+*Active* PRs are considered those which have a proper CLA (`cla:yes`) label
+and do not need a rebase to be merged. PRs that do not have a proper CLA, or
+that require a rebase, are not considered active PRs.
+
+## Thanks
+
+Many thanks in advance to everyone who contributes their time and effort to
+making Kubernetes both a successful system and a successful community.
+The strength of our software shines in the strengths of each individual
+community member. Thanks!
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/community-expectations.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/container-runtime-interface.md b/contributors/devel/container-runtime-interface.md
new file mode 100644
index 00000000..7ab085f7
--- /dev/null
+++ b/contributors/devel/container-runtime-interface.md
@@ -0,0 +1,127 @@
+# CRI: the Container Runtime Interface
+
+## What is CRI?
+
+CRI (_Container Runtime Interface_) consists of a
+[protobuf API](../../pkg/kubelet/api/v1alpha1/runtime/api.proto),
+specifications/requirements (to-be-added),
+and [libraries](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/server/streaming)
+for container runtimes to integrate with kubelet on a node. CRI is currently in Alpha.
+
+In the future, we plan to add more developer tools such as the CRI validation
+tests.
+
+## Why develop CRI?
+
+Prior to the existence of CRI, container runtimes (e.g., `docker`, `rkt`) were
+integrated with kubelet through implementing an internal, high-level interface
+in kubelet. The barrier to entry for runtimes was high because the integration
+required understanding the internals of kubelet and contributing to the main
+Kubernetes repository. More importantly, this would not scale because every new
+addition incurs a significant maintenance overhead in the main kubernetes
+repository.
+
+Kubernetes aims to be extensible. CRI is one small, yet important step to enable
+pluggable container runtimes and build a healthier ecosystem.
+
+## How to use CRI?
+
+1. Start the image and runtime services on your node. You can have a single
+ service acting as both image and runtime services.
+2. Set the kubelet flags
+ - Pass kubelet the unix socket(s) on which your services listen, using
+ `--container-runtime-endpoint` and `--image-service-endpoint`.
+ - Enable CRI in kubelet with `--experimental-cri=true`.
+ - Use the "remote" runtime with `--container-runtime=remote`.
+
+Please see the [Status Update](#status-update) section for known issues for
+each release.
+
+Note that CRI is still in its early stages. We are actively incorporating
+feedback from early developers to improve the API. Developers should expect
+occasional API breaking changes.
+
+## Does Kubelet use CRI today?
+
+No, but we are working on it.
+
+The first step is to switch kubelet to integrate with Docker via CRI by
+default. The current [Docker CRI implementation](https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/dockershim)
+already passes most end-to-end tests, and has mandatory PR builders to prevent
+regressions. While we are expanding the test coverage gradually, it is
+difficult to test on all combinations of OS distributions, platforms, and
+plugins. There are also many experimental or even undocumented features relied
+upon by some users. We would like to **encourage the community to help test
+this Docker-CRI integration and report bugs and/or missing features** to
+smooth the transition in the near future. Please file a Github issue and
+include @kubernetes/sig-node for any CRI problem.
+
+### How to test the new Docker CRI integration?
+
+Start kubelet with the following flags:
+ - Use the Docker container runtime with `--container-runtime=docker` (the default).
+ - Enable CRI in kubelet with `--experimental-cri=true`.
+
+Please also see the [known issues](#docker-cri-1.5-known-issues) before trying
+it out.
+
+## Design docs and proposals
+
+We plan to add CRI specifications/requirements in the near future. For now,
+these proposals and design docs are the best sources to understand CRI
+besides discussions on Github issues.
+
+ - [Original proposal](https://github.com/kubernetes/kubernetes/blob/release-1.5/docs/proposals/container-runtime-interface-v1.md)
+ - [Exec/attach/port-forward streaming requests](https://docs.google.com/document/d/1OE_QoInPlVCK9rMAx9aybRmgFiVjHpJCHI9LrfdNM_s/edit?usp=sharing)
+ - [Container stdout/stderr logs](https://github.com/kubernetes/kubernetes/blob/release-1.5/docs/proposals/kubelet-cri-logging.md)
+ - Networking: The CRI runtime handles network plugins and the
+ setup/teardown of the pod sandbox.
+
+## Work-In-Progress CRI runtimes
+
+ - [cri-o](https://github.com/kubernetes-incubator/cri-o)
+ - [rktlet](https://github.com/kubernetes-incubator/rktlet)
+ - [frakti](https://github.com/kubernetes/frakti)
+
+## [Status update](#status-update)
+
+### Kubernetes v1.5 release (CRI v1alpha1)
+
+ - [v1alpha1 version](https://github.com/kubernetes/kubernetes/blob/release-1.5/pkg/kubelet/api/v1alpha1/runtime/api.proto) of CRI is released.
+
+#### [CRI known issues](#cri-1.5-known-issues):
+
+ - [#27097](https://github.com/kubernetes/kubernetes/issues/27097): Container
+ metrics are not yet defined in CRI.
+ - [#36401](https://github.com/kubernetes/kubernetes/issues/36401): The new
+ container log path/format is not yet supported by the logging pipeline
+ (e.g., fluentd, GCL).
+ - CRI may not be compatible with other experimental features (e.g., Seccomp).
+ - Streaming server needs to be hardened.
+ - [#36666](https://github.com/kubernetes/kubernetes/issues/36666):
+ Authentication.
+ - [#36187](https://github.com/kubernetes/kubernetes/issues/36187): Avoid
+ including user data in the redirect URL.
+
+#### [Docker CRI integration known issues](#docker-cri-1.5-known-issues)
+
+ - Docker compatibility: Support only Docker v1.11 and v1.12.
+ - Network:
+ - [#35457](https://github.com/kubernetes/kubernetes/issues/35457): Does
+ not support host ports.
+ - [#37315](https://github.com/kubernetes/kubernetes/issues/37315): Does
+ not support bandwidth shaping.
+ - Exec/attach/port-forward (streaming requests):
+ - [#35747](https://github.com/kubernetes/kubernetes/issues/35747): Does
+ not support `nsenter` as the exec handler (`--exec-handler=nsenter`).
+ - Also see the [CRI known issues](#cri-1.5-known-issues) for limitations on CRI streaming.
+
+## Contacts
+
+ - Email: sig-node (kubernetes-sig-node@googlegroups.com)
+ - Slack: https://kubernetes.slack.com/messages/sig-node
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/container-runtime-interface.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/controllers.md b/contributors/devel/controllers.md
new file mode 100644
index 00000000..daedc236
--- /dev/null
+++ b/contributors/devel/controllers.md
@@ -0,0 +1,186 @@
+# Writing Controllers
+
+A Kubernetes controller is an active reconciliation process. That is, it watches some object for the world's desired
+state, and it watches the world's actual state, too. Then, it sends instructions to try and make the world's current
+state be more like the desired state.
+
+The simplest implementation of this is a loop:
+
+```go
+for {
+ desired := getDesiredState()
+ current := getCurrentState()
+ makeChanges(desired, current)
+}
+```
+
+Watches, etc, are all merely optimizations of this logic.
+
+## Guidelines
+
+When you’re writing controllers, there are a few guidelines that will help make sure you get the results and performance
+you’re looking for.
+
+1. Operate on one item at a time. If you use a `workqueue.Interface`, you’ll be able to queue changes for a
+ particular resource and later pop them in multiple “worker” gofuncs with a guarantee that no two gofuncs will
+ work on the same item at the same time.
+
+ Many controllers must trigger off multiple resources (I need to "check X if Y changes"), but nearly all controllers
+ can collapse those into a queue of “check this X” based on relationships. For instance, a ReplicaSetController needs
+ to react to a pod being deleted, but it does that by finding the related ReplicaSets and queuing those.
+
+
+1. Random ordering between resources. When controllers queue off multiple types of resources, there is no guarantee
+ of ordering amongst those resources.
+
+ Distinct watches are updated independently. Even with an objective ordering of “created resourceA/X” and “created
+ resourceB/Y”, your controller could observe “created resourceB/Y” and “created resourceA/X”.
+
+
+1. Level driven, not edge driven. Just like having a shell script that isn’t running all the time, your controller
+ may be off for an indeterminate amount of time before running again.
+
+ If an API object appears with a marker value of `true`, you can’t count on having seen it turn from `false` to `true`,
+ only that you now observe it being `true`. Even an API watch suffers from this problem, so be sure that you’re not
+ counting on seeing a change unless your controller is also marking the information it last made the decision on in
+ the object's status.
+
+
+1. Use `SharedInformers`. `SharedInformers` provide hooks to receive notifications of adds, updates, and deletes for
+ a particular resource. They also provide convenience functions for accessing shared caches and determining when a
+ cache is primed.
+
+ Use the factory methods down in https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/framework/informers/factory.go
+ to ensure that you are sharing the same instance of the cache as everyone else.
+
+ This saves us connections against the API server, duplicate serialization costs server-side, duplicate deserialization
+ costs controller-side, and duplicate caching costs controller-side.
+
+ You may see other mechanisms like reflectors and deltafifos driving controllers. Those were older mechanisms that we
+ later used to build the `SharedInformers`. You should avoid using them in new controllers.
+
+
+1. Never mutate original objects! Caches are shared across controllers; this means that if you mutate your "copy"
+ (actually a reference or shallow copy) of an object, you’ll mess up other controllers (not just your own).
+
+ The most common point of failure is making a shallow copy, then mutating a map, like `Annotations`. Use
+ `api.Scheme.Copy` to make a deep copy.
+
+
+1. Wait for your secondary caches. Many controllers have primary and secondary resources. Primary resources are the
+ resources that you’ll be updating `Status` for. Secondary resources are resources that you’ll be managing
+ (creating/deleting) or using for lookups.
+
+ Use the `framework.WaitForCacheSync` function to wait for your secondary caches before starting your primary sync
+ functions. This will make sure that things like the Pod count for a ReplicaSet aren’t computed from known-out-of-date
+ information, which results in thrashing.
+
+
+1. There are other actors in the system. Just because you haven't changed an object doesn't mean that somebody else
+ hasn't.
+
+ Don't forget that the current state may change at any moment--it's not sufficient to just watch the desired state.
+ If you use the absence of objects in the desired state to indicate that things in the current state should be deleted,
+ make sure you don't have a bug in your observation code (e.g., act before your cache has filled).
+
+
+1. Percolate errors to the top level for consistent re-queuing. We have a `workqueue.RateLimitingInterface` to allow
+ simple requeuing with reasonable backoffs.
+
+ Your main controller func should return an error when requeuing is necessary. When it isn’t, it should use
+ `utilruntime.HandleError` and return nil instead. This makes it very easy for reviewers to inspect error handling
+ cases and to be confident that your controller doesn’t accidentally lose things it should retry for.
+
+
+1. Watches and Informers will “sync”. Periodically, they will deliver every matching object in the cluster to your
+ `Update` method. This is good for cases where you may need to take additional action on the object, but sometimes you
+ know there won’t be more work to do.
+
+ In cases where you are *certain* that you don't need to requeue items when there are no new changes, you can compare the
+ resource version of the old and new objects. If they are the same, you skip requeuing the work. Be careful when you
+ do this. If you ever skip requeuing your item on failures, you could fail, not requeue, and then never retry that
+ item again. A short schematic sketch of this comparison is shown after this list.
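+
+For illustration, here is a schematic, self-contained sketch of that resource-version comparison. The `object` type and
+`enqueue` helper below are hypothetical stand-ins for your real API type and work queue; real objects expose
+`ResourceVersion` through their `ObjectMeta`:
+
+```go
+package main
+
+import "fmt"
+
+// object is a hypothetical stand-in for an API object.
+type object struct {
+	Name            string
+	ResourceVersion string
+}
+
+// enqueue is a hypothetical stand-in for adding a key to your work queue.
+func enqueue(key string) { fmt.Println("queued", key) }
+
+// onUpdate is the kind of update handler you would register with your informer. During periodic syncs the informer
+// re-delivers every object; comparing resource versions lets you skip requeuing when nothing actually changed.
+// Only do this if you are certain you never skip requeuing after a failure.
+func onUpdate(oldObj, newObj object) {
+	if oldObj.ResourceVersion == newObj.ResourceVersion {
+		// Periodic resync with no real change: skip requeuing.
+		return
+	}
+	enqueue(newObj.Name)
+}
+
+func main() {
+	onUpdate(object{Name: "rs/foo", ResourceVersion: "42"}, object{Name: "rs/foo", ResourceVersion: "42"}) // skipped
+	onUpdate(object{Name: "rs/foo", ResourceVersion: "42"}, object{Name: "rs/foo", ResourceVersion: "43"}) // queued
+}
+```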
+
+
+## Rough Structure
+
+Overall, your controller should look something like this:
+
+```go
+type Controller struct {
+ // podLister is secondary cache of pods which is used for object lookups
+ podLister cache.StoreToPodLister
+
+ // queue is where incoming work is placed to de-dup and to allow "easy" rate limited requeues on errors
+ queue workqueue.RateLimitingInterface
+}
+
+func (c *Controller) Run(threadiness int, stopCh chan struct{}) {
+ // don't let panics crash the process
+ defer utilruntime.HandleCrash()
+ // make sure the work queue is shutdown which will trigger workers to end
+ defer c.queue.ShutDown()
+
+ glog.Infof("Starting <NAME> controller")
+
+ // wait for your secondary caches to fill before starting your work
+ if !framework.WaitForCacheSync(stopCh, c.podStoreSynced) {
+ return
+ }
+
+ // start up your worker threads based on threadiness. Some controllers have multiple kinds of workers
+ for i := 0; i < threadiness; i++ {
+ // runWorker will loop until "something bad" happens. The .Until will then rekick the worker
+ // after one second
+ go wait.Until(c.runWorker, time.Second, stopCh)
+ }
+
+ // wait until we're told to stop
+ <-stopCh
+ glog.Infof("Shutting down <NAME> controller")
+}
+
+func (c *Controller) runWorker() {
+ // hot loop until we're told to stop. processNextWorkItem will automatically wait until there's work
+ // available, so we don't worry about secondary waits
+ for c.processNextWorkItem() {
+ }
+}
+
+// processNextWorkItem deals with one key off the queue. It returns false when it's time to quit.
+func (c *Controller) processNextWorkItem() bool {
+ // pull the next work item from queue. It should be a key we use to lookup something in a cache
+ key, quit := c.queue.Get()
+ if quit {
+ return false
+ }
+ // you always have to indicate to the queue that you've completed a piece of work
+ defer c.queue.Done(key)
+
+ // do your work on the key. This method will contain your "do stuff" logic
+ err := c.syncHandler(key.(string))
+ if err == nil {
+ // if you had no error, tell the queue to stop tracking history for your key. This will
+ // reset things like failure counts for per-item rate limiting
+ c.queue.Forget(key)
+ return true
+ }
+
+ // there was a failure so be sure to report it. This method allows for pluggable error handling
+ // which can be used for things like cluster-monitoring
+ utilruntime.HandleError(fmt.Errorf("%v failed with: %v", key, err))
+ // since we failed, we should requeue the item to work on later. This method will add a backoff
+ // to avoid hotlooping on particular items (they're probably still not going to work right away)
+ // and to provide overall controller protection (everything I've done is broken, this controller
+ // needs to calm down or it can starve other useful work).
+ c.queue.AddRateLimited(key)
+
+ return true
+}
+
+```
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/controllers.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/developer-guides/vagrant.md b/contributors/devel/developer-guides/vagrant.md
new file mode 100755
index 00000000..b53b0002
--- /dev/null
+++ b/contributors/devel/developer-guides/vagrant.md
@@ -0,0 +1,432 @@
+## Getting started with Vagrant
+
+Running Kubernetes with Vagrant is an easy way to run/test/develop on your
+local machine in an environment using the same setup procedures as the
+GCE or AWS cloud providers. This provider is not tested on a per-PR basis; if
+you experience bugs when testing from HEAD, please open an issue.
+
+### Prerequisites
+
+1. Install the latest version (>= 1.8.1) of Vagrant from
+http://www.vagrantup.com/downloads.html
+
+2. Install a virtual machine host. Examples:
+ 1. [Virtual Box](https://www.virtualbox.org/wiki/Downloads)
+ 2. [VMWare Fusion](https://www.vmware.com/products/fusion/) plus
+[Vagrant VMWare Fusion provider](https://www.vagrantup.com/vmware)
+ 3. [Parallels Desktop](https://www.parallels.com/products/desktop/)
+plus
+[Vagrant Parallels provider](https://parallels.github.io/vagrant-parallels/)
+
+3. Get or build a
+[binary release](../../../docs/getting-started-guides/binary_release.md)
+
+### Setup
+
+Setting up a cluster is as simple as running:
+
+```shell
+export KUBERNETES_PROVIDER=vagrant
+curl -sS https://get.k8s.io | bash
+```
+
+Alternatively, you can download
+[Kubernetes release](https://github.com/kubernetes/kubernetes/releases) and
+extract the archive. To start your local cluster, open a shell and run:
+
+```shell
+cd kubernetes
+
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+The `KUBERNETES_PROVIDER` environment variable tells all of the various cluster
+management scripts which variant to use. If you forget to set this, the
+assumption is you are running on Google Compute Engine.
+
+By default, the Vagrant setup will create a single master VM (called
+kubernetes-master) and one node (called kubernetes-node-1). Each VM will take 1
+GB, so make sure you have at least 2GB to 4GB of free memory (plus appropriate
+free disk space).
+
+Vagrant will provision each machine in the cluster with all the necessary
+components to run Kubernetes. The initial setup can take a few minutes to
+complete on each machine.
+
+If you installed more than one Vagrant provider, Kubernetes will usually pick
+the appropriate one. However, you can override which one Kubernetes will use by
+setting the
+[`VAGRANT_DEFAULT_PROVIDER`](https://docs.vagrantup.com/v2/providers/default.html)
+environment variable:
+
+```shell
+export VAGRANT_DEFAULT_PROVIDER=parallels
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+By default, each VM in the cluster is running Fedora.
+
+To access the master or any node:
+
+```shell
+vagrant ssh master
+vagrant ssh node-1
+```
+
+If you are running more than one node, you can access the others by:
+
+```shell
+vagrant ssh node-2
+vagrant ssh node-3
+```
+
+Each node in the cluster installs the docker daemon and the kubelet.
+
+The master node instantiates the Kubernetes master components as pods on the
+machine.
+
+To view the service status and/or logs on the kubernetes-master:
+
+```shell
+[vagrant@kubernetes-master ~] $ vagrant ssh master
+[vagrant@kubernetes-master ~] $ sudo su
+
+[root@kubernetes-master ~] $ systemctl status kubelet
+[root@kubernetes-master ~] $ journalctl -ru kubelet
+
+[root@kubernetes-master ~] $ systemctl status docker
+[root@kubernetes-master ~] $ journalctl -ru docker
+
+[root@kubernetes-master ~] $ tail -f /var/log/kube-apiserver.log
+[root@kubernetes-master ~] $ tail -f /var/log/kube-controller-manager.log
+[root@kubernetes-master ~] $ tail -f /var/log/kube-scheduler.log
+```
+
+To view the services on any of the nodes:
+
+```shell
+[vagrant@kubernetes-master ~] $ vagrant ssh node-1
+[vagrant@kubernetes-master ~] $ sudo su
+
+[root@kubernetes-master ~] $ systemctl status kubelet
+[root@kubernetes-master ~] $ journalctl -ru kubelet
+
+[root@kubernetes-master ~] $ systemctl status docker
+[root@kubernetes-master ~] $ journalctl -ru docker
+```
+
+### Interacting with your Kubernetes cluster with Vagrant.
+
+With your Kubernetes cluster up, you can manage the nodes in your cluster with
+the regular Vagrant commands.
+
+To push updates to new Kubernetes code after making source changes:
+
+```shell
+./cluster/kube-push.sh
+```
+
+To stop and then restart the cluster:
+
+```shell
+vagrant halt
+./cluster/kube-up.sh
+```
+
+To destroy the cluster:
+
+```shell
+vagrant destroy
+```
+
+Once your Vagrant machines are up and provisioned, the first thing to do is to
+check that you can use the `kubectl.sh` script.
+
+You may need to build the binaries first; you can do this with `make`.
+
+```shell
+$ ./cluster/kubectl.sh get nodes
+```
+
+### Authenticating with your master
+
+When using the vagrant provider in Kubernetes, the `cluster/kubectl.sh` script
+will cache your credentials in a `~/.kubernetes_vagrant_auth` file so you will
+not be prompted for them in the future.
+
+```shell
+cat ~/.kubernetes_vagrant_auth
+```
+
+```json
+{ "User": "vagrant",
+ "Password": "vagrant",
+ "CAFile": "/home/k8s_user/.kubernetes.vagrant.ca.crt",
+ "CertFile": "/home/k8s_user/.kubecfg.vagrant.crt",
+ "KeyFile": "/home/k8s_user/.kubecfg.vagrant.key"
+}
+```
+
+You should now be set to use the `cluster/kubectl.sh` script. For example try to
+list the nodes that you have started with:
+
+```shell
+./cluster/kubectl.sh get nodes
+```
+
+### Running containers
+
+You can use `cluster/kube-*.sh` commands to interact with your VM machines:
+
+```shell
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+
+$ ./cluster/kubectl.sh get services
+NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
+
+$ ./cluster/kubectl.sh get deployments
+CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
+```
+
+To start a container running nginx with a Deployment and three replicas:
+
+```shell
+$ ./cluster/kubectl.sh run my-nginx --image=nginx --replicas=3 --port=80
+```
+
+When listing the pods, you will see that three containers have been started and
+are in Waiting state:
+
+```shell
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-3800858182-4e6pe 0/1 ContainerCreating 0 3s
+my-nginx-3800858182-8ko0s 1/1 Running 0 3s
+my-nginx-3800858182-seu3u 0/1 ContainerCreating 0 3s
+```
+
+When the provisioning is complete:
+
+```shell
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-3800858182-4e6pe 1/1 Running 0 40s
+my-nginx-3800858182-8ko0s 1/1 Running 0 40s
+my-nginx-3800858182-seu3u 1/1 Running 0 40s
+
+$ ./cluster/kubectl.sh get services
+NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
+
+$ ./cluster/kubectl.sh get deployments
+NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
+my-nginx 3 3 3 3 1m
+```
+
+We did not start any Services, hence there are none listed. But we see three
+replicas displayed properly. Check the
+[guestbook](https://github.com/kubernetes/kubernetes/tree/%7B%7Bpage.githubbranch%7D%7D/examples/guestbook)
+application to learn how to create a Service. You can already play with scaling
+the replicas with:
+
+```shell
+$ ./cluster/kubectl.sh scale deployments my-nginx --replicas=2
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-3800858182-4e6pe 1/1 Running 0 2m
+my-nginx-3800858182-8ko0s 1/1 Running 0 2m
+```
+
+Congratulations!
+
+### Testing
+
+The following will run all of the end-to-end testing scenarios assuming you set
+your environment:
+
+```shell
+NUM_NODES=3 go run hack/e2e.go -v --build --up --test --down
+```
+
+### Troubleshooting
+
+#### I keep downloading the same (large) box all the time!
+
+By default the Vagrantfile will download the box from S3. You can change this
+(and cache the box locally) by providing a name and an alternate URL when
+calling `kube-up.sh`
+
+```shell
+export KUBERNETES_BOX_NAME=choose_your_own_name_for_your_kuber_box
+export KUBERNETES_BOX_URL=path_of_your_kuber_box
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+#### I am getting timeouts when trying to curl the master from my host!
+
+During provision of the cluster, you may see the following message:
+
+```shell
+Validating node-1
+.............
+Waiting for each node to be registered with cloud provider
+error: couldn't read version from server: Get https://10.245.1.2/api: dial tcp 10.245.1.2:443: i/o timeout
+```
+
+Some users have reported VPNs may prevent traffic from being routed to the host
+machine into the virtual machine network.
+
+To debug, first verify that the master is binding to the proper IP address:
+
+```
+$ vagrant ssh master
+$ ifconfig | grep eth1 -C 2
+eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.245.1.2 netmask
+ 255.255.255.0 broadcast 10.245.1.255
+```
+
+Then verify that your host machine has a network connection to a bridge that can
+serve that address:
+
+```shell
+$ ifconfig | grep 10.245.1 -C 2
+
+vboxnet5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
+ inet 10.245.1.1 netmask 255.255.255.0 broadcast 10.245.1.255
+ inet6 fe80::800:27ff:fe00:5 prefixlen 64 scopeid 0x20<link>
+ ether 0a:00:27:00:00:05 txqueuelen 1000 (Ethernet)
+```
+
+If you do not see a response on your host machine, you will most likely need to
+connect your host to the virtual network created by the virtualization provider.
+
+If you do see a network, but are still unable to ping the machine, check if your
+VPN is blocking the request.
+
+#### I just created the cluster, but I am getting authorization errors!
+
+You probably have an incorrect ~/.kubernetes_vagrant_auth file for the cluster
+you are attempting to contact.
+
+```shell
+rm ~/.kubernetes_vagrant_auth
+```
+
+After using kubectl.sh make sure that the correct credentials are set:
+
+```shell
+cat ~/.kubernetes_vagrant_auth
+```
+
+```json
+{
+ "User": "vagrant",
+ "Password": "vagrant"
+}
+```
+
+#### I just created the cluster, but I do not see my container running!
+
+If this is your first time creating the cluster, the kubelet on each node
+schedules a number of docker pull requests to fetch prerequisite images. This
+can take some time and as a result may delay your initial pod getting
+provisioned.
+
+#### I have Vagrant up but the nodes won't validate!
+
+Log on to one of the nodes (`vagrant ssh node-1`) and inspect the salt node
+log (`sudo cat /var/log/salt/node`).
+
+#### I want to change the number of nodes!
+
+You can control the number of nodes that are instantiated via the environment
+variable `NUM_NODES` on your host machine. If you plan to work with replicas, we
+strongly encourage you to work with enough nodes to satisfy your largest
+intended replica size. If you do not plan to work with replicas, you can save
+some system resources by running with a single node. You do this by setting
+`NUM_NODES` to 1 like so:
+
+```shell
+export NUM_NODES=1
+```
+
+#### I want my VMs to have more memory!
+
+You can control the memory allotted to virtual machines with the
+`KUBERNETES_MEMORY` environment variable. Just set it to the number of megabytes
+you would like the machines to have. For example:
+
+```shell
+export KUBERNETES_MEMORY=2048
+```
+
+If you need more granular control, you can set the amount of memory for the
+master and nodes independently. For example:
+
+```shell
+export KUBERNETES_MASTER_MEMORY=1536
+export KUBERNETES_NODE_MEMORY=2048
+```
+
+#### I want to set proxy settings for my Kubernetes cluster bootstrapping!
+
+If you are behind a proxy, you need to install the Vagrant proxy plugin and set
+the proxy settings:
+
+```shell
+vagrant plugin install vagrant-proxyconf
+export KUBERNETES_HTTP_PROXY=http://username:password@proxyaddr:proxyport
+export KUBERNETES_HTTPS_PROXY=https://username:password@proxyaddr:proxyport
+```
+
+You can also specify addresses that bypass the proxy, for example:
+
+```shell
+export KUBERNETES_NO_PROXY=127.0.0.1
+```
+
+If you are using sudo to build Kubernetes, use the `-E` flag to pass in the
+environment variables. For example, if running `make quick-release`, use:
+
+```shell
+sudo -E make quick-release
+```
+
+#### I have repository access errors during VM provisioning!
+
+Sometimes VM provisioning may fail with errors that look like this:
+
+```
+Timeout was reached for https://mirrors.fedoraproject.org/metalink?repo=fedora-23&arch=x86_64 [Connection timed out after 120002 milliseconds]
+```
+
+You may use a custom Fedora repository URL to fix this:
+
+```shell
+export CUSTOM_FEDORA_REPOSITORY_URL=https://download.fedoraproject.org/pub/fedora/
+```
+
+#### I ran vagrant suspend and nothing works!
+
+`vagrant suspend` seems to mess up the network. It's not supported at this time.
+
+#### I want vagrant to sync folders via nfs!
+
+You can ensure that vagrant uses nfs to sync folders with virtual machines by
+setting the KUBERNETES_VAGRANT_USE_NFS environment variable to 'true'. NFS is
+faster than VirtualBox's or VMware's 'shared folders' and does not require guest
+additions. See the
+[vagrant docs](http://docs.vagrantup.com/v2/synced-folders/nfs.html) for details
+on configuring nfs on the host. This setting will have no effect on the libvirt
+provider, which uses nfs by default. For example:
+
+```shell
+export KUBERNETES_VAGRANT_USE_NFS=true
+```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/developer-guides/vagrant.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/development.md b/contributors/devel/development.md
new file mode 100644
index 00000000..1349e003
--- /dev/null
+++ b/contributors/devel/development.md
@@ -0,0 +1,251 @@
+# Development Guide
+
+This document is intended to be the canonical source of truth for things like
+supported toolchain versions for building Kubernetes. If you find a
+requirement that this doc does not capture, please
+[submit an issue](https://github.com/kubernetes/kubernetes/issues) on github. If
+you find other docs with references to requirements that are not simply links to
+this doc, please [submit an issue](https://github.com/kubernetes/kubernetes/issues).
+
+This document is intended to be relative to the branch in which it is found.
+It is guaranteed that requirements will change over time for the development
+branch, but release branches of Kubernetes should not change.
+
+## Building Kubernetes with Docker
+
+Official releases are built using Docker containers. To build Kubernetes using
+Docker please follow
+[these instructions](http://releases.k8s.io/HEAD/build-tools/README.md).
+
+## Building Kubernetes on a local OS/shell environment
+
+Many of the Kubernetes development helper scripts rely on a fairly up-to-date
+GNU tools environment, so most recent Linux distros should work just fine
+out-of-the-box. Note that Mac OS X ships with somewhat outdated BSD-based tools,
+some of which may be incompatible in subtle ways, so we recommend
+[replacing those with modern GNU tools](https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/).
+
+### Go development environment
+
+Kubernetes is written in the [Go](http://golang.org) programming language.
+To build Kubernetes without using Docker containers, you'll need a Go
+development environment. Builds for Kubernetes 1.0 - 1.2 require Go version
+1.4.2. Builds for Kubernetes 1.3 and higher require Go version 1.6.0. If you
+haven't set up a Go development environment, please follow [these
+instructions](http://golang.org/doc/code.html) to install the go tools.
+
+Set up your GOPATH and add a path entry for go binaries to your PATH. Typically
+added to your ~/.profile:
+
+```sh
+export GOPATH=$HOME/go
+export PATH=$PATH:$GOPATH/bin
+```
+
+### Godep dependency management
+
+Kubernetes build and test scripts use [godep](https://github.com/tools/godep) to
+manage dependencies.
+
+#### Install godep
+
+Ensure that [mercurial](http://mercurial.selenic.com/wiki/Download) is
+installed on your system. (Some of godep's dependencies use the Mercurial
+source control system.) Use `apt-get install mercurial` or `yum install
+mercurial` on Linux, or [brew.sh](http://brew.sh) on OS X, or download directly
+from mercurial.
+
+Install godep and go-bindata (may require sudo):
+
+```sh
+go get -u github.com/tools/godep
+go get -u github.com/jteeuwen/go-bindata/go-bindata
+```
+
+Note:
+At this time, godep version >= v63 is known to work in the Kubernetes project.
+
+To check your version of godep:
+
+```sh
+$ godep version
+godep v74 (linux/amd64/go1.6.2)
+```
+
+Developers planning to manage dependencies in the `vendor/` tree may want to
+explore alternative environment setups. See
+[using godep to manage dependencies](godep.md).
+
+### Local build using make
+
+To build Kubernetes using your local Go development environment (generate linux
+binaries):
+
+```sh
+ make
+```
+
+You may pass build options and packages to the script as necessary. For example,
+to build with optimizations disabled for enabling use of source debug tools:
+
+```sh
+ make GOGCFLAGS="-N -l"
+```
+
+To build binaries for all platforms:
+
+```sh
+ make cross
+```
+
+### How to update the Go version used to test & build k8s
+
+The kubernetes project tries to stay on the latest version of Go so it can
+benefit from the improvements to the language over time and can easily
+bump to a minor release version for security updates.
+
+Since kubernetes is mostly built and tested in containers, there are a few
+unique places you need to update the go version.
+
+- The image for cross compiling in [build-tools/build-image/cross/](../../build-tools/build-image/cross/). The `VERSION` file and `Dockerfile`.
+- Update [dockerized-e2e-runner.sh](https://github.com/kubernetes/test-infra/blob/master/jenkins/dockerized-e2e-runner.sh) to run a kubekins-e2e with the desired go version, which requires pushing [e2e-image](https://github.com/kubernetes/test-infra/tree/master/jenkins/e2e-image) and [test-image](https://github.com/kubernetes/test-infra/tree/master/jenkins/test-image) images that are `FROM` the desired go version.
+- The docker image being run in [gotest-dockerized.sh](https://github.com/kubernetes/test-infra/tree/master/jenkins/gotest-dockerized.sh).
+- The cross tag `KUBE_BUILD_IMAGE_CROSS_TAG` in [build-tools/common.sh](../../build-tools/common.sh)
+
+## Workflow
+
+Below, we outline one of the more common git workflows that core developers use.
+Other git workflows are also valid.
+
+### Visual overview
+
+![Git workflow](git_workflow.png)
+
+### Fork the main repository
+
+1. Go to https://github.com/kubernetes/kubernetes
+2. Click the "Fork" button (at the top right)
+
+### Clone your fork
+
+The commands below require that you have $GOPATH set ([$GOPATH
+docs](https://golang.org/doc/code.html#GOPATH)). We highly recommend you put
+Kubernetes' code into your GOPATH. Note: the commands below will not work if
+there is more than one directory in your `$GOPATH`.
+
+```sh
+mkdir -p $GOPATH/src/k8s.io
+cd $GOPATH/src/k8s.io
+# Replace "$YOUR_GITHUB_USERNAME" below with your github username
+git clone https://github.com/$YOUR_GITHUB_USERNAME/kubernetes.git
+cd kubernetes
+git remote add upstream 'https://github.com/kubernetes/kubernetes.git'
+```
+
+### Create a branch and make changes
+
+```sh
+git checkout -b my-feature
+# Make your code changes
+```
+
+### Keeping your development fork in sync
+
+```sh
+git fetch upstream
+git rebase upstream/master
+```
+
+Note: If you have write access to the main repository at
+github.com/kubernetes/kubernetes, you should modify your git configuration so
+that you can't accidentally push to upstream:
+
+```sh
+git remote set-url --push upstream no_push
+```
+
+### Committing changes to your fork
+
+Before committing any changes, please link/copy the pre-commit hook into your
+.git directory. This will keep you from accidentally committing non-gofmt'd Go
+code. This hook will also do a build and test whether documentation generation
+scripts need to be executed.
+
+The hook requires both Godep and etcd on your `PATH`.
+
+```sh
+cd kubernetes/.git/hooks/
+ln -s ../../hooks/pre-commit .
+```
+
+Then you can commit your changes and push them to your fork:
+
+```sh
+git commit
+git push -f origin my-feature
+```
+
+### Creating a pull request
+
+1. Visit https://github.com/$YOUR_GITHUB_USERNAME/kubernetes
+2. Click the "Compare & pull request" button next to your "my-feature" branch.
+3. Check out the pull request [process](pull-requests.md) for more details
+
+**Note:** If you have write access, please refrain from using the GitHub UI for creating PRs, because GitHub will create the PR branch inside the main repository rather than inside your fork.
+
+### Getting a code review
+
+Once your pull request has been opened it will be assigned to one or more
+reviewers. Those reviewers will do a thorough code review, looking for
+correctness, bugs, opportunities for improvement, documentation and comments,
+and style.
+
+Very small PRs are easy to review. Very large PRs are very difficult to
+review. Github has a built-in code review tool, which is what most people use.
+At the assigned reviewer's discretion, a PR may be switched to use
+[Reviewable](https://reviewable.k8s.io) instead. Once a PR is switched to
+Reviewable, please ONLY send or reply to comments through reviewable. Mixing
+code review tools can be very confusing.
+
+See [Faster Reviews](faster_reviews.md) for some thoughts on how to streamline
+the review process.
+
+### When to retain commits and when to squash
+
+Upon merge, all git commits should represent meaningful milestones or units of
+work. Use commits to add clarity to the development and review process.
+
+Before merging a PR, squash any "fix review feedback", "typo", and "rebased"
+sorts of commits. It is not imperative that every commit in a PR compile and
+pass tests independently, but it is worth striving for. For mass automated
+fixups (e.g. automated doc formatting), use one or more commits for the
+changes to tooling and a final commit to apply the fixup en masse. This makes
+reviews much easier.
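+
+A common way to do this squashing is an interactive rebase onto the upstream
+branch, a sketch assuming the `upstream` remote and `my-feature` branch from the
+workflow above:
+
+```sh
+# Replay my-feature on top of upstream/master, choosing which commits to keep
+git fetch upstream
+git rebase -i upstream/master
+# In the editor, mark "fix review feedback"-style commits as "squash" or "fixup",
+# keep meaningful milestones as "pick", then save and exit.
+git push -f origin my-feature
+```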
+
+## Testing
+
+Three basic commands let you run unit, integration and/or e2e tests:
+
+```sh
+cd kubernetes
+make test # Run every unit test
+make test WHAT=pkg/util/cache GOFLAGS=-v # Run tests of a package verbosely
+make test-integration # Run integration tests, requires etcd
+make test-e2e # Run e2e tests
+```
+
+See the [testing guide](testing.md) and [end-to-end tests](e2e-tests.md) for additional information and scenarios.
+
+## Regenerating the CLI documentation
+
+```sh
+hack/update-generated-docs.sh
+```
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/development.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/e2e-node-tests.md b/contributors/devel/e2e-node-tests.md
new file mode 100644
index 00000000..5e5f5b49
--- /dev/null
+++ b/contributors/devel/e2e-node-tests.md
@@ -0,0 +1,231 @@
+# Node End-To-End tests
+
+Node e2e tests are component tests meant for testing the Kubelet code on a custom host environment.
+
+Tests can be run either locally or against a host running on GCE.
+
+Node e2e tests are run as both pre- and post-submit tests by the Kubernetes project.
+
+*Note: Linux only. Mac and Windows unsupported.*
+
+*Note: There is no scheduler running. The e2e tests have to do manual scheduling, e.g. by using `framework.PodClient`.*
+
+# Running tests
+
+## Locally
+
+Why run tests *Locally*? It is much faster than running them Remotely.
+
+Prerequisites:
+- [Install etcd](https://github.com/coreos/etcd/releases) on your PATH
+ - Verify etcd is installed correctly by running `which etcd`
+  - Or make the etcd binary available and executable at `/tmp/etcd` (see the download sketch after this list)
+- [Install ginkgo](https://github.com/onsi/ginkgo) on your PATH
+ - Verify ginkgo is installed correctly by running `which ginkgo`
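+
+If you go the `/tmp/etcd` route, here is a minimal sketch for a Linux amd64 host; the
+etcd version below is illustrative, so use whatever release your branch expects:
+
+```sh
+# Download an etcd release and place the binary at /tmp/etcd (version is illustrative)
+ETCD_VERSION=v3.0.15
+curl -L "https://github.com/coreos/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz" -o /tmp/etcd.tar.gz
+tar xzf /tmp/etcd.tar.gz -C /tmp --strip-components=1 "etcd-${ETCD_VERSION}-linux-amd64/etcd"
+chmod +x /tmp/etcd
+```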
+
+From the Kubernetes base directory, run:
+
+```sh
+make test-e2e-node
+```
+
+This will run the *ginkgo* binary against the subdirectory *test/e2e_node*, which will in turn:
+- Ask for sudo access (needed for running some of the processes)
+- Build the Kubernetes source code
+- Pre-pull docker images used by the tests
+- Start a local instance of *etcd*
+- Start a local instance of *kube-apiserver*
+- Start a local instance of *kubelet*
+- Run the test using the locally started processes
+- Output the test results to STDOUT
+- Stop *kubelet*, *kube-apiserver*, and *etcd*
+
+## Remotely
+
+Why run tests *Remotely*? Tests run in a customized, pristine environment that closely
+mimics the pre- and post-submit testing performed by the project.
+
+Prerequisites:
+- [join the googlegroup](https://groups.google.com/forum/#!forum/kubernetes-dev)
+`kubernetes-dev@googlegroups.com`
+ - *This provides read access to the node test images.*
+- Setup a [Google Cloud Platform](https://cloud.google.com/) account and project with Google Compute Engine enabled
+- Install and setup the [gcloud sdk](https://cloud.google.com/sdk/downloads)
+ - Verify the sdk is setup correctly by running `gcloud compute instances list` and `gcloud compute images list --project kubernetes-node-e2e-images`
+
+Run:
+
+```sh
+make test-e2e-node REMOTE=true
+```
+
+This will:
+- Build the Kubernetes source code
+- Create a new GCE instance using the default test image
+ - Instance will be called **test-e2e-node-containervm-v20160321-image**
+- Lookup the instance public ip address
+- Copy a compressed archive file to the host containing the following binaries:
+ - ginkgo
+ - kubelet
+ - kube-apiserver
+ - e2e_node.test (this binary contains the actual tests to be run)
+- Unzip the archive to a directory under **/tmp/gcloud**
+- Run the tests using the `ginkgo` command
+ - Starts etcd, kube-apiserver, kubelet
+ - The ginkgo command is used because this supports more features than running the test binary directly
+- Output the remote test results to STDOUT
+- `scp` the log files back to the local host under /tmp/_artifacts/e2e-node-containervm-v20160321-image
+- Stop the processes on the remote host
+- **Leave the GCE instance running**
+
+**Note: Subsequent tests run using the same image will *reuse the existing host* instead of deleting it and
+provisioning a new one. To delete the GCE instance after each test see
+*[DELETE_INSTANCE](#delete-instance-after-tests-run)*.**
+
+
+# Additional Remote Options
+
+## Run tests using different images
+
+This is useful if you want to run tests against a host using a different OS distro or container runtime than
+provided by the default image.
+
+List the available test images using gcloud.
+
+```sh
+make test-e2e-node LIST_IMAGES=true
+```
+
+This will output a list of the available images for the default image project.
+
+Then run:
+
+```sh
+make test-e2e-node REMOTE=true IMAGES="<comma-separated-list-images>"
+```
+
+## Run tests against a running GCE instance (not an image)
+
+This is useful if you already have a host instance running and want to run the tests there instead of on a new instance.
+
+```sh
+make test-e2e-node REMOTE=true HOSTS="<comma-separated-list-of-hostnames>"
+```
+
+## Delete instance after tests run
+
+This is useful if you want to recreate the instance for each test run in order to trigger flakes related to starting the instance.
+
+```sh
+make test-e2e-node REMOTE=true DELETE_INSTANCES=true
+```
+
+## Keep instance, test binaries, and *processes* around after tests run
+
+This is useful if you want to manually inspect or debug the kubelet process run as part of the tests.
+
+```sh
+make test-e2e-node REMOTE=true CLEANUP=false
+```
+
+## Run tests using an image in another project
+
+This is useful if you want to create your own host image in another project and use it for testing.
+
+```sh
+make test-e2e-node REMOTE=true IMAGE_PROJECT="<name-of-project-with-images>" IMAGES="<image-name>"
+```
+
+Setting up your own host image may require additional steps such as installing etcd or docker. See
+[setup_host.sh](../../test/e2e_node/environment/setup_host.sh) for common steps to setup hosts to run node tests.
+
+## Create instances using a different instance name prefix
+
+This is useful if you want to create instances using a different name so that you can run multiple copies of the
+test in parallel against different instances of the same image.
+
+```sh
+make test-e2e-node REMOTE=true INSTANCE_PREFIX="my-prefix"
+```
+
+# Additional Test Options for both Remote and Local execution
+
+## Only run a subset of the tests
+
+To run tests matching a regex:
+
+```sh
+make test-e2e-node REMOTE=true FOCUS="<regex-to-match>"
+```
+
+To run tests NOT matching a regex:
+
+```sh
+make test-e2e-node REMOTE=true SKIP="<regex-to-match>"
+```
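+
+For example (the test name patterns below are illustrative; any Ginkgo test-name
+regex works):
+
+```sh
+# Run only MirrorPod-related specs on the remote host, skipping anything marked [Flaky]
+make test-e2e-node REMOTE=true FOCUS="MirrorPod" SKIP="\[Flaky\]"
+```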
+
+## Run tests continually until they fail
+
+This is useful if you are trying to debug a flaky test failure. This will cause ginkgo to continually
+run the tests until they fail. **Note: this will only perform test setup once (e.g. creating the instance) and is
+less useful for catching flakes related to creating the instance from an image.**
+
+```sh
+make test-e2e-node REMOTE=true RUN_UNTIL_FAILURE=true
+```
+
+## Run tests in parallel
+
+Running tests in parallel usually shortens the test duration. By default, the node
+e2e tests run with `--nodes=8` (see the ginkgo flag
+[--nodes](https://onsi.github.io/ginkgo/#parallel-specs)). You can use the
+`PARALLELISM` option to change the parallelism.
+
+```sh
+make test-e2e-node PARALLELISM=4 # run test with 4 parallel nodes
+make test-e2e-node PARALLELISM=1 # run test sequentially
+```
+
+## Run tests with kubenet network plugin
+
+[kubenet](http://kubernetes.io/docs/admin/network-plugins/#kubenet) is
+the default network plugin used by kubelet since Kubernetes 1.3. The
+plugin requires [CNI](https://github.com/containernetworking/cni) and
+[nsenter](http://man7.org/linux/man-pages/man1/nsenter.1.html).
+
+Currently, kubenet is enabled by default for Remote execution (`REMOTE=true`)
+but disabled for Local execution. **Note: kubenet is not currently supported for
+Local execution. This may cause network-related test results to differ between
+Local and Remote execution, so Remote execution is recommended for
+network-related tests.**
+
+To enable/disable kubenet:
+
+```sh
+make test-e2e-node TEST_ARGS="--disable-kubenet=false" # enable kubenet
+make test-e2e-node TEST_ARGS="--disable-kubenet=true"  # disable kubenet
+```
+
+## Additional QoS Cgroups Hierarchy level testing
+
+For testing with the QoS Cgroup Hierarchy enabled, you can pass the `--experimental-cgroups-per-qos` flag as an argument into Ginkgo using `TEST_ARGS`:
+
+```sh
+make test-e2e-node TEST_ARGS="--experimental-cgroups-per-qos=true"
+```
+
+# Notes on tests run by the Kubernetes project during pre- and post-submit
+
+The node e2e tests are run by the PR builder for each Pull Request and the results published at
+the bottom of the comments section. To re-run just the node e2e tests from the PR builder add the comment
+`@k8s-bot node e2e test this issue: #<Flake-Issue-Number or IGNORE>` and **include a link to the test
+failure logs if caused by a flake.**
+
+The PR builder runs tests against the images listed in [jenkins-pull.properties](../../test/e2e_node/jenkins/jenkins-pull.properties)
+
+The post submit tests run against the images listed in [jenkins-ci.properties](../../test/e2e_node/jenkins/jenkins-ci.properties)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/e2e-node-tests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/e2e-tests.md b/contributors/devel/e2e-tests.md
new file mode 100644
index 00000000..fc8f1995
--- /dev/null
+++ b/contributors/devel/e2e-tests.md
@@ -0,0 +1,719 @@
+# End-to-End Testing in Kubernetes
+
+Updated: 5/3/2016
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [End-to-End Testing in Kubernetes](#end-to-end-testing-in-kubernetes)
+ - [Overview](#overview)
+ - [Building and Running the Tests](#building-and-running-the-tests)
+ - [Cleaning up](#cleaning-up)
+ - [Advanced testing](#advanced-testing)
+ - [Bringing up a cluster for testing](#bringing-up-a-cluster-for-testing)
+ - [Federation e2e tests](#federation-e2e-tests)
+ - [Configuring federation e2e tests](#configuring-federation-e2e-tests)
+ - [Image Push Repository](#image-push-repository)
+ - [Build](#build)
+ - [Deploy federation control plane](#deploy-federation-control-plane)
+ - [Run the Tests](#run-the-tests)
+ - [Teardown](#teardown)
+ - [Shortcuts for test developers](#shortcuts-for-test-developers)
+ - [Debugging clusters](#debugging-clusters)
+ - [Local clusters](#local-clusters)
+ - [Testing against local clusters](#testing-against-local-clusters)
+ - [Version-skewed and upgrade testing](#version-skewed-and-upgrade-testing)
+ - [Kinds of tests](#kinds-of-tests)
+  - [Viper configuration and hierarchical test parameters.](#viper-configuration-and-hierarchical-test-parameters)
+ - [Conformance tests](#conformance-tests)
+ - [Defining Conformance Subset](#defining-conformance-subset)
+ - [Continuous Integration](#continuous-integration)
+ - [What is CI?](#what-is-ci)
+ - [What runs in CI?](#what-runs-in-ci)
+ - [Non-default tests](#non-default-tests)
+ - [The PR-builder](#the-pr-builder)
+ - [Adding a test to CI](#adding-a-test-to-ci)
+ - [Moving a test out of CI](#moving-a-test-out-of-ci)
+ - [Performance Evaluation](#performance-evaluation)
+ - [One More Thing](#one-more-thing)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## Overview
+
+End-to-end (e2e) tests for Kubernetes provide a mechanism to test end-to-end
+behavior of the system, and are the last signal to ensure end user operations
+match developer specifications. Although unit and integration tests provide a
+good signal, in a distributed system like Kubernetes it is not uncommon that a
+minor change may pass all unit and integration tests, but cause unforeseen
+changes at the system level.
+
+The primary objectives of the e2e tests are to ensure a consistent and reliable
+behavior of the kubernetes code base, and to catch hard-to-test bugs before
+users do, when unit and integration tests are insufficient.
+
+The e2e tests in Kubernetes are built on top of
+[Ginkgo](http://onsi.github.io/ginkgo/) and
+[Gomega](http://onsi.github.io/gomega/). There are a host of features that this
+Behavior-Driven Development (BDD) testing framework provides, and it is
+recommended that the developer read the documentation prior to diving into the
+tests.
+
+The purpose of *this* document is to serve as a primer for developers who are
+looking to execute or add tests using a local development environment.
+
+Before writing new tests or making substantive changes to existing tests, you
+should also read [Writing Good e2e Tests](writing-good-e2e-tests.md)
+
+## Building and Running the Tests
+
+There are a variety of ways to run e2e tests, but we aim to decrease the number
+of ways to run e2e tests to a canonical way: `hack/e2e.go`.
+
+You can run an end-to-end test which will bring up a master and nodes, perform
+some tests, and then tear everything down. Make sure you have followed the
+getting started steps for your chosen cloud platform (which might involve
+changing the `KUBERNETES_PROVIDER` environment variable to something other than
+"gce").
+
+To build Kubernetes, bring up a cluster, run tests, and tear everything down, use:
+
+```sh
+go run hack/e2e.go -v --build --up --test --down
+```
+
+If you'd like to just perform one of these steps, here are some examples:
+
+```sh
+# Build binaries for testing
+go run hack/e2e.go -v --build
+
+# Create a fresh cluster. Deletes a cluster first, if it exists
+go run hack/e2e.go -v --up
+
+# Run all tests
+go run hack/e2e.go -v --test
+
+# Run tests matching the regex "\[Feature:Performance\]"
+go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\[Feature:Performance\]"
+
+# Conversely, exclude tests that match the regex "Pods.*env"
+go run hack/e2e.go -v --test --test_args="--ginkgo.skip=Pods.*env"
+
+# Run tests in parallel, skip any that must be run serially
+GINKGO_PARALLEL=y go run hack/e2e.go --v --test --test_args="--ginkgo.skip=\[Serial\]"
+
+# Run tests in parallel, skip any that must be run serially and keep the test namespace if test failed
+GINKGO_PARALLEL=y go run hack/e2e.go --v --test --test_args="--ginkgo.skip=\[Serial\] --delete-namespace-on-failure=false"
+
+# Flags can be combined, and their actions will take place in this order:
+# --build, --up, --test, --down
+#
+# You can also specify an alternative provider, such as 'aws'
+#
+# e.g.:
+KUBERNETES_PROVIDER=aws go run hack/e2e.go -v --build --up --test --down
+
+# -ctl can be used to quickly call kubectl against your e2e cluster. Useful for
+# cleaning up after a failed test or viewing logs. Use -v to avoid suppressing
+# kubectl output.
+go run hack/e2e.go -v -ctl='get events'
+go run hack/e2e.go -v -ctl='delete pod foobar'
+```
+
+The tests are built into a single binary which can be used to deploy a
+Kubernetes system or to run tests against an already-deployed Kubernetes system.
+See `go run hack/e2e.go --help` (or the flag definitions in `hack/e2e.go`) for
+more options, such as reusing an existing cluster.
+
+### Cleaning up
+
+During a run, pressing `control-C` should result in an orderly shutdown, but if
+something goes wrong and you still have some VMs running you can force a cleanup
+with this command:
+
+```sh
+go run hack/e2e.go -v --down
+```
+
+## Advanced testing
+
+### Bringing up a cluster for testing
+
+If you want, you may bring up a cluster in some other manner and run tests
+against it. To do so, or to do other non-standard test things, you can pass
+arguments into Ginkgo using `--test_args` (e.g. see above). For the purposes of
+brevity, we will look at a subset of the options, which are listed below:
+
+```
+--ginkgo.dryRun=false: If set, ginkgo will walk the test hierarchy without
+actually running anything. Best paired with -v.
+
+--ginkgo.failFast=false: If set, ginkgo will stop running a test suite after a
+failure occurs.
+
+--ginkgo.failOnPending=false: If set, ginkgo will mark the test suite as failed
+if any specs are pending.
+
+--ginkgo.focus="": If set, ginkgo will only run specs that match this regular
+expression.
+
+--ginkgo.skip="": If set, ginkgo will only run specs that do not match this
+regular expression.
+
+--ginkgo.trace=false: If set, default reporter prints out the full stack trace
+when a failure occurs
+
+--ginkgo.v=false: If set, default reporter prints out all specs as they begin.
+
+--host="": The host, or api-server, to connect to
+
+--kubeconfig="": Path to kubeconfig containing embedded authinfo.
+
+--prom-push-gateway="": The URL to prometheus gateway, so that metrics can be
+pushed during e2es and scraped by prometheus. Typically something like
+127.0.0.1:9091.
+
+--provider="": The name of the Kubernetes provider (gce, gke, local, vagrant,
+etc.)
+
+--repo-root="../../": Root directory of kubernetes repository, for finding test
+files.
+```
+
+Prior to running the tests, you may want to first create a simple auth file in
+your home directory, e.g. `$HOME/.kube/config`, with the following:
+
+```
+{
+ "User": "root",
+ "Password": ""
+}
+```
+
+As mentioned earlier there are a host of other options that are available, but
+they are left to the developer.
+
+**NOTE:** If you are running tests on a local cluster repeatedly, you may need
+to periodically perform some manual cleanup:
+
+ - `rm -rf /var/run/kubernetes`: clears kube-generated credentials; sometimes
+stale permissions can cause problems.
+
+ - `sudo iptables -F`: clears iptables rules left by kube-proxy.
+
+### Federation e2e tests
+
+By default, `e2e.go` provisions a single Kubernetes cluster, and any `Feature:Federation` ginkgo tests will be skipped.
+
+Federation e2e testing involves bringing up multiple "underlying" Kubernetes clusters,
+and deploying the federation control plane as a Kubernetes application on the underlying clusters.
+
+The federation e2e tests are still managed via `e2e.go`, but require some extra configuration items.
+
+#### Configuring federation e2e tests
+
+The following environment variables will enable federation e2e building, provisioning and testing.
+
+```sh
+$ export FEDERATION=true
+$ export E2E_ZONES="us-central1-a us-central1-b us-central1-f"
+```
+
+A Kubernetes cluster will be provisioned in each zone listed in `E2E_ZONES`. A zone can only appear once in the `E2E_ZONES` list.
+
+#### Image Push Repository
+
+Next, specify the docker repository where your ci images will be pushed.
+
+* **If `KUBERNETES_PROVIDER=gce` or `KUBERNETES_PROVIDER=gke`**:
+
+  If you use the same GCP project where you run the e2e tests as the container image repository, the
+  `FEDERATION_PUSH_REPO_BASE` environment variable defaults to "gcr.io/${DEFAULT_GCP_PROJECT_NAME}",
+  and you can skip ahead to the **Build** section.
+
+ You can simply set your push repo base based on your project name, and the necessary repositories will be
+ auto-created when you first push your container images.
+
+ ```sh
+ $ export FEDERATION_PUSH_REPO_BASE="gcr.io/${GCE_PROJECT_NAME}"
+ ```
+
+ Skip ahead to the **Build** section.
+
+* **For all other providers**:
+
+ You'll be responsible for creating and managing access to the repositories manually.
+
+ ```sh
+ $ export FEDERATION_PUSH_REPO_BASE="quay.io/colin_hom"
+ ```
+
+ Given this example, the `federation-apiserver` container image will be pushed to the repository
+ `quay.io/colin_hom/federation-apiserver`.
+
+ The docker client on the machine running `e2e.go` must have push access for the following pre-existing repositories:
+
+ * `${FEDERATION_PUSH_REPO_BASE}/federation-apiserver`
+ * `${FEDERATION_PUSH_REPO_BASE}/federation-controller-manager`
+
+ These repositories must allow public read access, as the e2e node docker daemons will not have any credentials. If you're using
+ GCE/GKE as your provider, the repositories will have read-access by default.
+
+#### Build
+
+* Compile the binaries and build container images:
+
+ ```sh
+ $ KUBE_RELEASE_RUN_TESTS=n KUBE_FASTBUILD=true go run hack/e2e.go -v -build
+ ```
+
+* Push the federation container images
+
+ ```sh
+ $ build-tools/push-federation-images.sh
+ ```
+
+#### Deploy federation control plane
+
+The following command will create the underlying Kubernetes clusters in each of `E2E_ZONES`, and then provision the
+federation control plane in the cluster occupying the last zone in the `E2E_ZONES` list.
+
+```sh
+$ go run hack/e2e.go -v --up
+```
+
+#### Run the Tests
+
+This will run only the `Feature:Federation` e2e tests. You can omit the `ginkgo.focus` argument to run the entire e2e suite.
+
+```sh
+$ go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\[Feature:Federation\]"
+```
+
+#### Teardown
+
+```sh
+$ go run hack/e2e.go -v --down
+```
+
+#### Shortcuts for test developers
+
+* To speed up `e2e.go -up`, provision a single-node kubernetes cluster in a single e2e zone:
+
+ `NUM_NODES=1 E2E_ZONES="us-central1-f"`
+
+ Keep in mind that some tests may require multiple underlying clusters and/or minimum compute resource availability.
+
+* You can quickly recompile the e2e testing framework via `go install ./test/e2e`. This will not do anything besides
+ allow you to verify that the go code compiles.
+
+* If you want to run your e2e testing framework without re-provisioning the e2e setup, you can do so via
+ `make WHAT=test/e2e/e2e.test` and then re-running the ginkgo tests.
+
+* If you're hacking around with the federation control plane deployment itself,
+ you can quickly re-deploy the federation control plane Kubernetes manifests without tearing any resources down.
+ To re-deploy the federation control plane after running `-up` for the first time:
+
+ ```sh
+ $ federation/cluster/federation-up.sh
+ ```
+
+### Debugging clusters
+
+If a cluster fails to initialize, or you'd like to better understand cluster
+state to debug a failed e2e test, you can use the `cluster/log-dump.sh` script
+to gather logs.
+
+This script requires that the cluster provider supports ssh. Assuming it does,
+running:
+
+```
+cluster/log-dump.sh <directory>
+```
+
+will ssh to the master and all nodes and download a variety of useful logs to
+the provided directory (which should already exist).
+
+The Google-run Jenkins builds automatically collect these logs for every
+build, saving them in the `artifacts` directory uploaded to GCS.
+
+### Local clusters
+
+It can be much faster to iterate on a local cluster instead of a cloud-based
+one. To start a local cluster, you can run:
+
+```sh
+# The PATH construction is needed because PATH is one of the special-cased
+# environment variables not passed by sudo -E
+sudo PATH=$PATH hack/local-up-cluster.sh
+```
+
+This will start a single-node Kubernetes cluster that runs pods using the local
+docker daemon. Press Control-C to stop the cluster.
+
+You can generate a valid kubeconfig file by following the instructions printed at
+the end of the aforementioned script.
+
+#### Testing against local clusters
+
+In order to run an E2E test against a locally running cluster, point the tests
+at a custom host directly:
+
+```sh
+export KUBECONFIG=/path/to/kubeconfig
+export KUBE_MASTER_IP="http://127.0.0.1:<PORT>"
+export KUBE_MASTER=local
+go run hack/e2e.go -v --test
+```
+
+To control the tests that are run:
+
+```sh
+go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\"Secrets\""
+```
+
+### Version-skewed and upgrade testing
+
+We run version-skewed tests to check that newer versions of Kubernetes work
+similarly enough to older versions. The general strategy is to cover the following cases:
+
+1. One version of `kubectl` with another version of the cluster and tests (e.g.
+   that v1.2 and v1.4 `kubectl` don't break v1.3 tests running against a v1.3
+ cluster).
+1. A newer version of the Kubernetes master with older nodes and tests (e.g.
+ that upgrading a master to v1.3 with nodes at v1.2 still passes v1.2 tests).
+1. A newer version of the whole cluster with older tests (e.g. that a cluster
+ upgraded---master and nodes---to v1.3 still passes v1.2 tests).
+1. That an upgraded cluster functions the same as a brand-new cluster of the
+ same version (e.g. a cluster upgraded to v1.3 passes the same v1.3 tests as
+ a newly-created v1.3 cluster).
+
+[hack/jenkins/e2e-runner.sh](http://releases.k8s.io/HEAD/hack/jenkins/e2e-runner.sh) is
+the authoritative source on how to run version-skewed tests, but below is a
+quick-and-dirty tutorial.
+
+```sh
+# Assume you have two copies of the Kubernetes repository checked out, at
+# ./kubernetes and ./kubernetes_old
+
+# If using GKE:
+export KUBERNETES_PROVIDER=gke
+export CLUSTER_API_VERSION=${OLD_VERSION}
+
+# Deploy a cluster at the old version; see above for more details
+cd ./kubernetes_old
+go run ./hack/e2e.go -v --up
+
+# Upgrade the cluster to the new version
+#
+# If using GKE, add --upgrade-target=${NEW_VERSION}
+#
+# You can target Feature:MasterUpgrade or Feature:ClusterUpgrade
+cd ../kubernetes
+go run ./hack/e2e.go -v --test --check_version_skew=false --test_args="--ginkgo.focus=\[Feature:MasterUpgrade\]"
+
+# Run old tests with new kubectl
+cd ../kubernetes_old
+go run ./hack/e2e.go -v --test --test_args="--kubectl-path=$(pwd)/../kubernetes/cluster/kubectl.sh"
+```
+
+If you are just testing version-skew, you may want to just deploy at one
+version and then test at another version, instead of going through the whole
+upgrade process:
+
+```sh
+# With the same setup as above
+
+# Deploy a cluster at the new version
+cd ./kubernetes
+go run ./hack/e2e.go -v --up
+
+# Run new tests with old kubectl
+go run ./hack/e2e.go -v --test --test_args="--kubectl-path=$(pwd)/../kubernetes_old/cluster/kubectl.sh"
+
+# Run old tests with new kubectl
+cd ../kubernetes_old
+go run ./hack/e2e.go -v --test --test_args="--kubectl-path=$(pwd)/../kubernetes/cluster/kubectl.sh"
+```
+
+## Kinds of tests
+
+We are working on implementing clearer partitioning of our e2e tests to make
+running a known set of tests easier (#10548). Tests can be labeled with any of
+the following labels, in order of increasing precedence (that is, each label
+listed below supersedes the previous ones):
+
+ - If a test has no labels, it is expected to run fast (under five minutes), be
+able to be run in parallel, and be consistent.
+
+ - `[Slow]`: If a test takes more than five minutes to run (by itself or in
+parallel with many other tests), it is labeled `[Slow]`. This partition allows
+us to run almost all of our tests quickly in parallel, without waiting for the
+stragglers to finish.
+
+ - `[Serial]`: If a test cannot be run in parallel with other tests (e.g. it
+takes too many resources or restarts nodes), it is labeled `[Serial]`, and
+should be run in serial as part of a separate suite.
+
+ - `[Disruptive]`: If a test restarts components that might cause other tests
+to fail or break the cluster completely, it is labeled `[Disruptive]`. Any
+`[Disruptive]` test is also assumed to qualify for the `[Serial]` label, but
+need not be labeled as both. These tests are not run against soak clusters to
+avoid restarting components.
+
+ - `[Flaky]`: If a test is found to be flaky and we have decided that it's too
+hard to fix in the short term (e.g. it's going to take a full engineer-week), it
+receives the `[Flaky]` label until it is fixed. The `[Flaky]` label should be
+used very sparingly, and should be accompanied with a reference to the issue for
+de-flaking the test, because while a test remains labeled `[Flaky]`, it is not
+monitored closely in CI. `[Flaky]` tests are by default not run, unless a
+`focus` or `skip` argument is explicitly given.
+
+ - `[Feature:.+]`: If a test has non-default requirements to run or targets
+some non-core functionality, and thus should not be run as part of the standard
+suite, it receives a `[Feature:.+]` label, e.g. `[Feature:Performance]` or
+`[Feature:Ingress]`. `[Feature:.+]` tests are not run in our core suites,
+instead running in custom suites. If a feature is experimental or alpha and is
+not enabled by default due to being incomplete or potentially subject to
+breaking changes, it does *not* block the merge-queue, and thus should run in
+some separate test suites owned by the feature owner(s)
+(see [Continuous Integration](#continuous-integration) below).
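+
+In practice, these labels are substrings of the Ginkgo test names, so they are
+selected with the `--ginkgo.focus` and `--ginkgo.skip` regexes shown earlier. A
+sketch (combine patterns to suit the suite you want to run):
+
+```sh
+# Run only [Slow] tests, excluding anything [Serial] or [Disruptive]
+go run hack/e2e.go -v --test \
+  --test_args="--ginkgo.focus=\[Slow\] --ginkgo.skip=\[Serial\]|\[Disruptive\]"
+```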
+
+### Viper configuration and hierarchical test parameters.
+
+The future of e2e test configuration idioms will be increasingly defined using viper, and decreasingly via flags.
+
+Flags in general fall apart once tests become sufficiently complicated. So, even if we could use another flag library, it wouldn't be ideal.
+
+To use viper, rather than flags, to configure your tests:
+
+- Just add an `e2e.json` file to the directory you run the tests from, and define parameters in it, e.g. `"kubeconfig":"/tmp/x"` (see the sketch below).
+
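+A minimal sketch of creating such a file from the shell; the values are examples
+only, taken from the parameter shown above:
+
+```sh
+# Create a minimal e2e.json in the directory you run the tests from
+cat > e2e.json <<'EOF'
+{
+  "kubeconfig": "/tmp/x"
+}
+EOF
+```
+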
+Note that advanced testing parameters, and hierarchically defined parameters, are only defined in viper. To see what they are, you can dive into [TestContextType](../../test/e2e/framework/test_context.go).
+
+In time, it is our intent to add or autogenerate a sample viper configuration that includes all e2e parameters, to ship with kubernetes.
+
+### Conformance tests
+
+Finally, `[Conformance]` tests represent a subset of the e2e-tests we expect to
+pass on **any** Kubernetes cluster. The `[Conformance]` label does not supersede
+any other labels.
+
+As each new release of Kubernetes provides new functionality, the subset of
+tests necessary to demonstrate conformance grows with each release. Conformance
+is thus considered versioned, with the same backwards compatibility guarantees
+as laid out in [our versioning policy](../design/versioning.md#supported-releases).
+Conformance tests for a given version should be run off of the release branch
+that corresponds to that version. Thus `v1.2` conformance tests would be run
+from the head of the `release-1.2` branch. For example:
+
+ - A v1.3 development cluster should pass v1.1, v1.2 conformance tests
+
+ - A v1.2 cluster should pass v1.1, v1.2 conformance tests
+
+ - A v1.1 cluster should pass v1.0, v1.1 conformance tests, and fail v1.2
+conformance tests
+
+Conformance tests are designed to be run with no cloud provider configured.
+Conformance tests can be run against clusters that have not been created with
+`hack/e2e.go`, just provide a kubeconfig with the appropriate endpoint and
+credentials.
+
+```sh
+# setup for conformance tests
+export KUBECONFIG=/path/to/kubeconfig
+export KUBERNETES_CONFORMANCE_TEST=y
+export KUBERNETES_PROVIDER=skeleton
+
+# run all conformance tests
+go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\[Conformance\]"
+
+# run all parallel-safe conformance tests in parallel
+GINKGO_PARALLEL=y go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\[Conformance\] --ginkgo.skip=\[Serial\]"
+
+# ... and finish up with remaining tests in serial
+go run hack/e2e.go -v --test --test_args="--ginkgo.focus=\[Serial\].*\[Conformance\]"
+```
+
+### Defining Conformance Subset
+
+It is impossible to define the entire space of Conformance tests without knowing
+the future, so instead, we define the complement of conformance tests, below
+(`Please update this with companion PRs as necessary`):
+
+ - A conformance test cannot test cloud provider specific features (e.g. GCE
+monitoring, S3 Bucketing, ...)
+
+ - A conformance test cannot rely on any particular non-standard file system
+permissions granted to containers or users (e.g. sharing writable host /tmp with
+a container)
+
+ - A conformance test cannot rely on any binaries that are not required for the
+linux kernel or for a kubelet to run (e.g. git)
+
+ - A conformance test cannot test a feature which obviously cannot be supported
+on a broad range of platforms (e.g. testing of multiple disk mounts, GPUs, high
+density)
+
+## Continuous Integration
+
+A quick overview of how we run e2e CI on Kubernetes.
+
+### What is CI?
+
+We run a battery of `e2e` tests against `HEAD` of the master branch on a
+continuous basis, and block merges via the [submit
+queue](http://submit-queue.k8s.io/) on a subset of those tests if they fail (the
+subset is defined in the
+[munger config](https://github.com/kubernetes/contrib/blob/master/mungegithub/mungers/submit-queue.go)
+via the `jenkins-jobs` flag; note we also block on `kubernetes-build` and
+`kubernetes-test-go` jobs for build and unit and integration tests).
+
+CI results can be found at [ci-test.k8s.io](http://ci-test.k8s.io), e.g.
+[ci-test.k8s.io/kubernetes-e2e-gce/10594](http://ci-test.k8s.io/kubernetes-e2e-gce/10594).
+
+### What runs in CI?
+
+We run all default tests (those that aren't marked `[Flaky]` or `[Feature:.+]`)
+against GCE and GKE. To minimize the time from regression-to-green-run, we
+partition tests across different jobs:
+
+ - `kubernetes-e2e-<provider>` runs all non-`[Slow]`, non-`[Serial]`,
+non-`[Disruptive]`, non-`[Flaky]`, non-`[Feature:.+]` tests in parallel.
+
+ - `kubernetes-e2e-<provider>-slow` runs all `[Slow]`, non-`[Serial]`,
+non-`[Disruptive]`, non-`[Flaky]`, non-`[Feature:.+]` tests in parallel.
+
+ - `kubernetes-e2e-<provider>-serial` runs all `[Serial]` and `[Disruptive]`,
+non-`[Flaky]`, non-`[Feature:.+]` tests in serial.
+
+We also run non-default tests if the tests exercise general-availability ("GA")
+features that require a special environment to run in, e.g.
+`kubernetes-e2e-gce-scalability` and `kubernetes-kubemark-gce`, which test for
+Kubernetes performance.
+
+#### Non-default tests
+
+Many `[Feature:.+]` tests we don't run in CI. These tests are for features that
+are experimental (often in the `experimental` API), and aren't enabled by
+default.
+
+### The PR-builder
+
+We also run a battery of tests against every PR before we merge it. These tests
+are equivalent to `kubernetes-gce`: it runs all non-`[Slow]`, non-`[Serial]`,
+non-`[Disruptive]`, non-`[Flaky]`, non-`[Feature:.+]` tests in parallel. These
+tests are considered "smoke tests" to give a decent signal that the PR doesn't
+break most functionality. Results for your PR can be found at
+[pr-test.k8s.io](http://pr-test.k8s.io), e.g.
+[pr-test.k8s.io/20354](http://pr-test.k8s.io/20354) for #20354.
+
+### Adding a test to CI
+
+As mentioned above, prior to adding a new test, it is a good idea to perform a
+dry run with `--ginkgo.dryRun=true`, in order to see if a behavior is already
+being tested, or to determine if it may be possible to augment an existing set
+of tests for a specific use case.
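+
+A sketch of such a dry run, narrowed to an area of interest (the focus pattern
+here is illustrative):
+
+```sh
+# Walk the test hierarchy without running anything, to see what coverage already exists
+go run hack/e2e.go -v --test \
+  --test_args="--ginkgo.dryRun=true --ginkgo.focus=Secrets"
+```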
+
+If a behavior does not currently have coverage and a developer wishes to add a
+new e2e test, navigate to the ./test/e2e directory and create a new test using
+the existing suite as a guide.
+
+TODO(#20357): Create a self-documented example which has been disabled, but can
+be copied to create new tests and outlines the capabilities and libraries used.
+
+When writing a test, consult the [kinds of tests](#kinds-of-tests) section above to determine how your test
+should be marked (e.g. `[Slow]`, `[Serial]`; remember, by default we assume a
+test can run in parallel with other tests!).
+
+When first adding a test it should *not* go straight into CI, because failures
+block ordinary development. A test should only be added to CI after it has been
+running in some non-CI suite long enough to establish a track record showing
+that the test does not fail when run against *working* software. Note also that
+tests running in CI are generally running on a well-loaded cluster, so must
+contend for resources; see above about [kinds of tests](#kinds-of-tests).
+
+Generally, a feature starts as `experimental`, and will be run in some suite
+owned by the team developing the feature. If a feature is in beta or GA, it
+*should* block the merge-queue. In moving from experimental to beta or GA, tests
+that are expected to pass by default should simply remove the `[Feature:.+]`
+label, and will be incorporated into our core suites. If tests are not expected
+to pass by default, (e.g. they require a special environment such as added
+quota,) they should remain with the `[Feature:.+]` label, and the suites that
+run them should be incorporated into the
+[munger config](https://github.com/kubernetes/contrib/blob/master/mungegithub/mungers/submit-queue.go)
+via the `jenkins-jobs` flag.
+
+Occasionally, we'll want to add tests to better exercise features that are
+already GA. These tests also shouldn't go straight to CI. They should begin by
+being marked as `[Flaky]` to be run outside of CI, and once a track-record for
+them is established, they may be promoted out of `[Flaky]`.
+
+### Moving a test out of CI
+
+If we have determined that a test is known-flaky and cannot be fixed in the
+short-term, we may move it out of CI indefinitely. This move should be used
+sparingly, as it effectively means that we have no coverage of that test. When a
+test is demoted, it should be marked `[Flaky]` with a comment accompanying the
+label with a reference to an issue opened to fix the test.
+
+## Performance Evaluation
+
+Another benefit of the e2e tests is the ability to create reproducible loads on
+the system, which can then be used to determine the responsiveness, or analyze
+other characteristics of the system. For example, the density tests load the
+system to 30, 50, and 100 pods per node and measure different characteristics of
+the system, such as throughput, api-latency, etc.
+
+For a good overview of how we analyze performance data, please read the
+following [post](http://blog.kubernetes.io/2015/09/kubernetes-performance-measurements-and.html)
+
+For developers who are interested in doing their own performance analysis, we
+recommend setting up [prometheus](http://prometheus.io/) for data collection,
+and using [promdash](http://prometheus.io/docs/visualization/promdash/) to
+visualize the data. There also exists the option of pushing your own metrics in
+from the tests using a
+[prom-push-gateway](http://prometheus.io/docs/instrumenting/pushing/).
+Containers for all of these components can be found
+[here](https://hub.docker.com/u/prom/).
+
+For more accurate measurements, you may wish to set up prometheus external to
+kubernetes in an environment where it can access the major system components
+(api-server, controller-manager, scheduler). This is especially useful when
+attempting to gather metrics in a load-balanced api-server environment, because
+all api-servers can be analyzed independently as well as collectively. On
+startup, a configuration file is passed to prometheus that specifies the endpoints
+that prometheus will scrape, as well as the sampling interval.
+
+```
+#prometheus.conf
+job: {
+ name: "kubernetes"
+ scrape_interval: "1s"
+ target_group: {
+ # apiserver(s)
+ target: "http://localhost:8080/metrics"
+ # scheduler
+ target: "http://localhost:10251/metrics"
+ # controller-manager
+ target: "http://localhost:10252/metrics"
+ }
+}
+```
+
+Once prometheus is scraping the kubernetes endpoints, that data can then be
+plotted using promdash, and alerts can be created against the assortment of
+metrics that kubernetes provides.
+
+## One More Thing
+
+You should also know the [testing conventions](coding-conventions.md#testing-conventions).
+
+**HAPPY TESTING!**
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/e2e-tests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/faster_reviews.md b/contributors/devel/faster_reviews.md
new file mode 100644
index 00000000..85568d3f
--- /dev/null
+++ b/contributors/devel/faster_reviews.md
@@ -0,0 +1,218 @@
+# How to get faster PR reviews
+
+Most of what is written here is not at all specific to Kubernetes, but it bears
+being written down in the hope that it will occasionally remind people of "best
+practices" around code reviews.
+
+You've just had a brilliant idea on how to make Kubernetes better. Let's call
+that idea "Feature-X". Feature-X is not even that complicated. You have a pretty
+good idea of how to implement it. You jump in and implement it, fixing a bunch
+of stuff along the way. You send your PR - this is awesome! And it sits. And
+sits. A week goes by and nobody reviews it. Finally someone offers a few
+comments, which you fix up and wait for more review. And you wait. Another
+week or two goes by. This is horrible.
+
+What went wrong? One particular problem that comes up frequently is this - your
+PR is too big to review. You've touched 39 files and have 8657 insertions. When
+your would-be reviewers pull up the diffs they run away - this PR is going to
+take 4 hours to review and they don't have 4 hours right now. They'll get to it
+later, just as soon as they have more free time (ha!).
+
+Let's talk about how to avoid this.
+
+## 0. Familiarize yourself with project conventions
+
+* [Development guide](development.md)
+* [Coding conventions](coding-conventions.md)
+* [API conventions](api-conventions.md)
+* [Kubectl conventions](kubectl-conventions.md)
+
+## 1. Don't build a cathedral in one PR
+
+Are you sure Feature-X is something the Kubernetes team wants or will accept, or
+that it is implemented to fit with other changes in flight? Are you willing to
+bet a few days or weeks of work on it? If you have any doubt at all about the
+usefulness of your feature or the design - make a proposal doc (in
+docs/proposals; for example [the QoS proposal](http://prs.k8s.io/11713)) or a
+sketch PR (e.g., just the API or Go interface) or both. Write or code up just
+enough to express the idea and the design and why you made those choices, then
+get feedback on this. Be clear about what type of feedback you are asking for.
+Now, if we ask you to change a bunch of facets of the design, you won't have to
+re-write it all.
+
+## 2. Smaller diffs are exponentially better
+
+Small PRs get reviewed faster and are more likely to be correct than big ones.
+Let's face it - attention wanes over time. If your PR takes 60 minutes to
+review, I almost guarantee that the reviewer's eye for detail is not as keen in
+the last 30 minutes as it was in the first. This leads to multiple rounds of
+review when one might have sufficed. In some cases the review is delayed in its
+entirety by the need for a large contiguous block of time to sit and read your
+code.
+
+Whenever possible, break up your PRs into multiple commits. Making a series of
+discrete commits is a powerful way to express the evolution of an idea or the
+different ideas that make up a single feature. There's a balance to be struck,
+obviously. If your commits are too small they become more cumbersome to deal
+with. Strive to group logically distinct ideas into separate commits.
+
+For example, if you found that Feature-X needed some "prefactoring" to fit in,
+make a commit that JUST does that prefactoring. Then make a new commit for
+Feature-X. Don't lump unrelated things together just because you didn't think
+about prefactoring. If you need to, fork a new branch, do the prefactoring
+there and send a PR for that. If you can explain why you are doing seemingly
+no-op work ("it makes the Feature-X change easier, I promise") we'll probably be
+OK with it.
+
+Obviously, a PR with 25 commits is still very cumbersome to review, so use
+common sense.
+
+## 3. Multiple small PRs are often better than multiple commits
+
+If you can extract whole ideas from your PR and send those as PRs of their own,
+you can avoid the painful problem of continually rebasing. Kubernetes is a
+fast-moving codebase - lock in your changes ASAP, and make merges be someone
+else's problem.
+
+Obviously, we want every PR to be useful on its own, so you'll have to use
+common sense in deciding what can be a PR vs. what should be a commit in a larger
+PR. Rule of thumb - if this commit or set of commits is directly related to
+Feature-X and nothing else, it should probably be part of the Feature-X PR. If
+you can plausibly imagine someone finding value in this commit outside of
+Feature-X, try it as a PR.
+
+Don't worry about flooding us with PRs. We'd rather have 100 small, obvious PRs
+than 10 unreviewable monoliths.
+
+## 4. Don't rename, reformat, comment, etc in the same PR
+
+Often, as you are implementing Feature-X, you find things that are just wrong.
+Bad comments, poorly named functions, bad structure, weak type-safety. You
+should absolutely fix those things (or at least file issues, please) - but not
+in this PR. See the above points - break unrelated changes out into different
+PRs or commits. Otherwise your diff will have WAY too many changes, and your
+reviewer won't see the forest because of all the trees.
+
+## 5. Comments matter
+
+Read up on GoDoc - follow those general rules. If you're writing code and you
+think there is any possible chance that someone might not understand why you did
+something (or that you won't remember what you yourself did), comment it. If
+you think there's something pretty obvious that we could follow up on, add a
+TODO. Many code-review comments are about this exact issue.
+
+## 6. Tests are almost always required
+
+Nothing is more frustrating than doing a review, only to find that the tests are
+inadequate or even entirely absent. Very few PRs can touch code and NOT touch
+tests. If you don't know how to test Feature-X - ask! We'll be happy to help
+you design things for easy testing or to suggest appropriate test cases.
+
+## 7. Look for opportunities to generify
+
+If you find yourself writing something that touches a lot of modules, think hard
+about the dependencies you are introducing between packages. Can some of what
+you're doing be made more generic and moved up and out of the Feature-X package?
+Do you need to use a function or type from an otherwise unrelated package? If
+so, promote! We have places specifically for hosting more generic code.
+
+Likewise if Feature-X is similar in form to Feature-W which was checked in last
+month and it happens to exactly duplicate some tricky stuff from Feature-W,
+consider prefactoring core logic out and using it in both Feature-W and
+Feature-X. But do that in a different commit or PR, please.
+
+## 8. Fix feedback in a new commit
+
+Your reviewer has finally sent you some feedback on Feature-X. You make a bunch
+of changes and ... what? You could patch those into your commits with git
+"squash" or "fixup" logic. But that makes your changes hard to verify. Unless
+your whole PR is pretty trivial, you should instead put your fixups into a new
+commit and re-push. Your reviewer can then look at that commit on its own - so
+much faster to review than starting over.
+
+We might still ask you to clean up your commits at the very end, for the sake
+of a more readable history, but don't do this until asked, typically at the
+point where the PR would otherwise be tagged LGTM.
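+
+A sketch of this flow, assuming the `my-feature` branch and `upstream` remote
+naming used in the [development guide](development.md):
+
+```sh
+# Address review feedback as its own commit so the reviewer can diff just that
+git add -A
+git commit -m "Address review feedback on Feature-X"
+git push origin my-feature
+
+# Only when asked to clean up history, squash the fixups and force-push
+git rebase -i upstream/master
+git push -f origin my-feature
+```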
+
+General squashing guidelines:
+
+* Sausage => squash
+
+ When there are several commits to fix bugs in the original commit(s), address
+reviewer feedback, etc. Really we only want to see the end state and commit
+message for the whole PR.
+
+* Layers => don't squash
+
+ When there are independent changes layered upon each other to achieve a single
+goal. For instance, writing a code munger could be one commit, applying it could
+be another, and adding a precommit check could be a third. One could argue they
+should be separate PRs, but there's really no way to test/review the munger
+without seeing it applied, and there needs to be a precommit check to ensure the
+munged output doesn't immediately get out of date.
+
+A commit, as much as possible, should be a single logical change. Each commit
+should always have a good title line (<70 characters) and include an additional
+description paragraph describing in more detail the change intended. Do not link
+pull requests by `#` in a commit description, because GitHub creates lots of
+spam. Instead, reference other PRs via the PR your commit is in.
+
+## 9. KISS, YAGNI, MVP, etc
+
+Sometimes we need to remind each other of core tenets of software design - Keep
+It Simple, You Aren't Gonna Need It, Minimum Viable Product, and so on. Adding
+features "because we might need it later" is antithetical to software that
+ships. Add the things you need NOW and (ideally) leave room for things you
+might need later - but don't implement them now.
+
+## 10. Push back
+
+We understand that it is hard to imagine, but sometimes we make mistakes. It's
+OK to push back on changes requested during a review. If you have a good reason
+for doing something a certain way, you are absolutely allowed to debate the
+merits of a requested change. You might be overruled, but you might also
+prevail. We're mostly pretty reasonable people. Mostly.
+
+## 11. I'm still getting stalled - help?!
+
+So, you've done all that and you still aren't getting any PR love? Here's some
+things you can do that might help kick a stalled process along:
+
+ * Make sure that your PR has an assigned reviewer (assignee in GitHub). If
+this is not the case, reply to the PR comment stream asking for one to be
+assigned.
+
+ * Ping the assignee (@username) on the PR comment stream asking for an
+estimate of when they can get to it.
+
+ * Ping the assignee by email (many of us have email addresses that are well
+published or are the same as our GitHub handle @google.com or @redhat.com).
+
+ * Ping the [team](https://github.com/orgs/kubernetes/teams) (via @team-name)
+that works in the area you're submitting code.
+
+If you think you have fixed all the issues in a round of review, and you haven't
+heard back, you should ping the reviewer (assignee) on the comment stream with a
+"please take another look" (PTAL) or similar comment indicating you are done and
+you think it is ready for re-review. In fact, this is probably a good habit for
+all PRs.
+
+One phenomenon of open-source projects (where anyone can comment on any issue)
+is the dog-pile - your PR gets so many comments from so many people it becomes
+hard to follow. In this situation you can ask the primary reviewer (assignee)
+whether they want you to fork a new PR to clear out all the comments. Remember:
+you don't HAVE to fix every issue raised by every person who feels like
+commenting, but you should at least answer reasonable comments with an
+explanation.
+
+## Final: Use common sense
+
+Obviously, none of these points are hard rules. There is no document that can
+take the place of common sense and good taste. Use your best judgment, but put
+a bit of thought into how your work can be made easier to review. If you do
+these things your PRs will flow much more easily.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/faster_reviews.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/flaky-tests.md b/contributors/devel/flaky-tests.md
new file mode 100644
index 00000000..9656bd5f
--- /dev/null
+++ b/contributors/devel/flaky-tests.md
@@ -0,0 +1,194 @@
+# Flaky tests
+
+Any test that fails occasionally is "flaky". Since our merges only proceed when
+all tests are green, and we have a number of different CI systems running the
+tests in various combinations, even a small percentage of flakes results in a
+lot of pain for people waiting for their PRs to merge.
+
+Therefore, it's very important that we write tests defensively. Situations that
+"almost never happen" happen with some regularity when run thousands of times in
+resource-constrained environments. Since flakes can often be quite hard to
+reproduce while still being common enough to block merges occasionally, it's
+additionally important that the test logs be useful for narrowing down exactly
+what caused the failure.
+
+Note that flakes can occur in unit tests, integration tests, or end-to-end
+tests, but probably occur most commonly in end-to-end tests.
+
+## Filing issues for flaky tests
+
+Because flakes may be rare, it's very important that all relevant logs be
+discoverable from the issue.
+
+1. Search for the test name. If you find an open issue and you're 90% sure the
+ flake is exactly the same, add a comment instead of making a new issue.
+2. If you make a new issue, you should title it with the test name, prefixed by
+ "e2e/unit/integration flake:" (whichever is appropriate)
+3. Reference any old issues you found in step one. Also, make a comment in the
+ old issue referencing your new issue, because people monitoring only their
+   email do not see the backlinks GitHub adds. Alternatively, tag the person or
+ people who most recently worked on it.
+4. Paste, in block quotes, the entire log of the individual failing test, not
+ just the failure line.
+5. Link to durable storage with the rest of the logs. This means (for all the
+ tests that Google runs) the GCS link is mandatory! The Jenkins test result
+ link is nice but strictly optional: not only does it expire more quickly,
+ it's not accessible to non-Googlers.
+
+## Finding filed flaky test cases
+
+Find flaky tests issues on GitHub under the [kind/flake issue label][flake].
+There are significant numbers of flaky tests reported on a regular basis and P2
+flakes are under-investigated. Fixing flakes is a quick way to gain expertise
+and community goodwill.
+
+[flake]: https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
+
+## Expectations when a flaky test is assigned to you
+
+Note that we won't randomly assign these issues to you unless you've opted in or
+you're part of a group that has opted in. We are more than happy to accept help
+from anyone in fixing these, but due to the severity of the problem when merges
+are blocked, we need reasonably quick turn-around time on test flakes. Therefore
+we have the following guidelines:
+
+1. If a flaky test is assigned to you, it's more important than anything else
+ you're doing unless you can get a special dispensation (in which case it will
+ be reassigned). If you have too many flaky tests assigned to you, or you
+ have such a dispensation, then it's *still* your responsibility to find new
+ owners (this may just mean giving stuff back to the relevant Team or SIG Lead).
+2. You should make a reasonable effort to reproduce it. Somewhere between an
+ hour and half a day of concentrated effort is "reasonable". It is perfectly
+ reasonable to ask for help!
+3. If you can reproduce it (or it's obvious from the logs what happened), you
+ should then be able to fix it, or in the case where someone is clearly more
+ qualified to fix it, reassign it with very clear instructions.
+4. PRs that fix or help debug flakes may have the P0 priority set to get them
+ through the merge queue as fast as possible.
+5. Once you have made a change that you believe fixes a flake, it is conservative
+ to keep the issue for the flake open and see if it manifests again after the
+ change is merged.
+6. If you can't reproduce a flake: __don't just close it!__ Every time a flake comes
+ back, at least 2 hours of merge time is wasted. So we need to make monotonic
+ progress towards narrowing it down every time a flake occurs. If you can't
+   figure it out from the logs, add log messages that would have helped you figure
+ it out. If you make changes to make a flake more reproducible, please link
+ your pull request to the flake you're working on.
+7. If a flake has been open, could not be reproduced, and has not manifested in
+ 3 months, it is reasonable to close the flake issue with a note saying
+ why.
+
+# Reproducing unit test flakes
+
+Try the [stress command](https://godoc.org/golang.org/x/tools/cmd/stress).
+
+Just
+
+```
+$ go install golang.org/x/tools/cmd/stress
+```
+
+Then build your test binary
+
+```
+$ go test -c -race
+```
+
+Then run it under stress
+
+```
+$ stress ./package.test -test.run=FlakyTest
+```
+
+It runs the command and writes output to `/tmp/gostress-*` files when it fails.
+It periodically reports with run counts. Be careful with tests that use the
+`net/http/httptest` package; they could exhaust the available ports on your
+system!
+
+# Hunting flaky unit tests in Kubernetes
+
+Sometimes unit tests are flaky. This means that due to (usually) race
+conditions, they will occasionally fail, even though most of the time they pass.
+
+We have a goal of 99.9% flake free tests. This means that there is only one
+flake in one thousand runs of a test.
+
+Running a test 1000 times on your own machine can be tedious and time consuming.
+Fortunately, there is a better way to achieve this using Kubernetes.
+
+_Note: these instructions are mildly hacky for now; as we get run-once
+semantics and better logging, they will improve._
+
+There is a testing image `brendanburns/flake` on Docker Hub. We will use
+this image to test our fix.
+
+Create a replication controller with the following config:
+
+```yaml
+apiVersion: v1
+kind: ReplicationController
+metadata:
+ name: flakecontroller
+spec:
+ replicas: 24
+ template:
+ metadata:
+ labels:
+ name: flake
+ spec:
+ containers:
+ - name: flake
+ image: brendanburns/flake
+ env:
+ - name: TEST_PACKAGE
+ value: pkg/tools
+ - name: REPO_SPEC
+ value: https://github.com/kubernetes/kubernetes
+```
+
+Note that we omit the labels and the selector fields of the replication
+controller, because they will be populated from the labels field of the pod
+template by default.
+
+```sh
+kubectl create -f ./controller.yaml
+```
+
+This will spin up 24 instances of the test. They will run to completion, then
+exit, and the kubelet will restart them, accumulating more and more runs of the
+test.
+
+You can examine the recent runs of the test by calling `docker ps -a` and
+looking for tasks that exited with non-zero exit codes. Unfortunately,
+`docker ps -a` only keeps around the exit status of the last 15-20 containers
+with the same image, so you have to check them frequently.
+
+You can use this script to automate checking for failures, assuming your cluster
+is running on GCE and has four nodes:
+
+```sh
+echo "" > output.txt
+for i in {1..4}; do
+ echo "Checking kubernetes-node-${i}"
+ echo "kubernetes-node-${i}:" >> output.txt
+ gcloud compute ssh "kubernetes-node-${i}" --command="sudo docker ps -a" >> output.txt
+done
+grep "Exited ([^0])" output.txt
+```
+
+Eventually you will have sufficient runs for your purposes. At that point you
+can delete the replication controller by running:
+
+```sh
+kubectl delete replicationcontroller flakecontroller
+```
+
+If you do a final check for flakes with `docker ps -a`, ignore tasks that
+exited -1, since that's what happens when you stop the replication controller.
+
+Happy flake hunting!
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/flaky-tests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/generating-clientset.md b/contributors/devel/generating-clientset.md
new file mode 100644
index 00000000..cbb6141c
--- /dev/null
+++ b/contributors/devel/generating-clientset.md
@@ -0,0 +1,41 @@
+# Generation and release cycle of clientset
+
+Client-gen is an automatic tool that generates [clientset](../../docs/proposals/client-package-structure.md#high-level-client-sets) based on API types. This doc introduces the use of client-gen and the release cycle of the generated clientsets.
+
+## Using client-gen
+
+The workflow includes three steps:
+
+1. Marking API types with tags: in `pkg/apis/${GROUP}/${VERSION}/types.go`, mark the types (e.g., Pods) that you want to generate clients for with the `// +genclient=true` tag. If the resource associated with the type is not namespace scoped (e.g., PersistentVolume), you need to append the `nonNamespaced=true` tag as well (see the sketch after this list).
+
+2. Running client-gen:
+
+   - a. If you are developing in the k8s.io/kubernetes repository, you just need to run `hack/update-codegen.sh`.
+
+   - b. If you are running client-gen outside of k8s.io/kubernetes, you need to use the command line argument `--input` to specify the groups and versions of the APIs you want to generate clients for; client-gen will then look into `pkg/apis/${GROUP}/${VERSION}/types.go` and generate clients for the types you have marked with the `genclient` tags. For example, to generate a clientset named "my_release" including clients for api/v1 objects and extensions/v1beta1 objects, you need to run:
+
+```
+$ client-gen --input="api/v1,extensions/v1beta1" --clientset-name="my_release"
+```
+
+3. ***Adding expansion methods***: client-gen only generates the common methods, such as CRUD. You can manually add additional methods through the expansion interface. For example, this [file](../../pkg/client/clientset_generated/release_1_5/typed/core/v1/pod_expansion.go) adds additional methods to Pod's client. As a convention, we put the expansion interface and its methods in file ${TYPE}_expansion.go. In most cases, you don't want to remove existing expansion files. So to make life easier, instead of creating a new clientset from scratch, ***you can copy and rename an existing clientset (so that all the expansion files are copied)***, and then run client-gen.
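+
+Returning to step 1, here is a rough sketch of how the tags might look in a
+`types.go` file (type bodies elided; this is illustrative only):
+
+```go
+// +genclient=true
+
+// Pod is namespace scoped; a typed client will be generated for it.
+type Pod struct {
+	// ... fields elided ...
+}
+
+// +genclient=true
+// +nonNamespaced=true
+
+// PersistentVolume is not namespace scoped, so the nonNamespaced tag is added.
+type PersistentVolume struct {
+	// ... fields elided ...
+}
+```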
+
+## Output of client-gen
+
+- clientset: the clientset will be generated at `pkg/client/clientset_generated/` by default, and you can change the path via the `--clientset-path` command line argument.
+
+- Individual typed clients and the client for a group: they will be generated at `pkg/client/clientset_generated/${clientset_name}/typed/generated/${GROUP}/${VERSION}/`
+
+## Released clientsets
+
+If you are contributing code to k8s.io/kubernetes, try to use the release_X_Y clientset in this [directory](../../pkg/client/clientset_generated/).
+
+If you need a stable Go client to build your own project, please refer to the [client-go repository](https://github.com/kubernetes/client-go).
+
+We are migrating k8s.io/kubernetes to use client-go as well, see issue [#35159](https://github.com/kubernetes/kubernetes/issues/35159).
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/generating-clientset.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/getting-builds.md b/contributors/devel/getting-builds.md
new file mode 100644
index 00000000..86563390
--- /dev/null
+++ b/contributors/devel/getting-builds.md
@@ -0,0 +1,52 @@
+# Getting Kubernetes Builds
+
+You can use [hack/get-build.sh](http://releases.k8s.io/HEAD/hack/get-build.sh)
+to get a build or to use as a reference on how to get the most recent builds
+with curl. With `get-build.sh` you can grab the most recent stable build, the
+most recent release candidate, or the most recent build to pass our CI and GCE
+e2e tests (essentially a nightly build).
+
+Run `./hack/get-build.sh -h` for its usage.
+
+To get a build at a specific version (v1.1.1) use:
+
+```console
+./hack/get-build.sh v1.1.1
+```
+
+To get the latest stable release:
+
+```console
+./hack/get-build.sh release/stable
+```
+
+Use the "-v" option to print the version number of a build without retrieving
+it. For example, the following prints the version number for the latest CI
+build:
+
+```console
+./hack/get-build.sh -v ci/latest
+```
+
+You can also use the gsutil tool to explore the Google Cloud Storage release
+buckets. Here are some examples:
+
+```sh
+gsutil cat gs://kubernetes-release-dev/ci/latest.txt # output the latest ci version number
+gsutil cat gs://kubernetes-release-dev/ci/latest-green.txt # output the latest ci version number that passed gce e2e
+gsutil ls gs://kubernetes-release-dev/ci/v0.20.0-29-g29a55cc/ # list the contents of a ci release
+gsutil ls gs://kubernetes-release/release # list all official releases and rcs
+```
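+
+Once you know a version, you can fetch the corresponding release artifacts with
+`gsutil cp`; for example (the object layout shown here is illustrative):
+
+```sh
+VERSION=$(gsutil cat gs://kubernetes-release/release/stable.txt)
+gsutil cp "gs://kubernetes-release/release/${VERSION}/kubernetes.tar.gz" .
+```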
+
+## Install `gsutil`
+
+Example installation:
+
+```console
+$ curl -sSL https://storage.googleapis.com/pub/gsutil.tar.gz | sudo tar -xz -C /usr/local/src
+$ sudo ln -s /usr/local/src/gsutil/gsutil /usr/bin/gsutil
+```
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/getting-builds.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/git_workflow.png b/contributors/devel/git_workflow.png
new file mode 100644
index 00000000..80a66248
--- /dev/null
+++ b/contributors/devel/git_workflow.png
Binary files differ
diff --git a/contributors/devel/go-code.md b/contributors/devel/go-code.md
new file mode 100644
index 00000000..2af055f4
--- /dev/null
+++ b/contributors/devel/go-code.md
@@ -0,0 +1,32 @@
+# Kubernetes Go Tools and Tips
+
+Kubernetes is one of the largest open source Go projects, so good tooling and a
+solid understanding of Go are critical to Kubernetes development. This document
+provides a collection of resources, tools and tips that our developers have
+found useful.
+
+## Recommended Reading
+
+- [Kubernetes Go development environment](development.md#go-development-environment)
+- [The Go Spec](https://golang.org/ref/spec) - The Go Programming Language
+ Specification.
+- [Go Tour](https://tour.golang.org/welcome/2) - Official Go tutorial.
+- [Effective Go](https://golang.org/doc/effective_go.html) - A good collection of Go advice.
+- [Kubernetes Code conventions](coding-conventions.md) - Style guide for Kubernetes code.
+- [Three Go Landmines](https://gist.github.com/lavalamp/4bd23295a9f32706a48f) - Surprising behavior in the Go language. These have caused real bugs!
+
+## Recommended Tools
+
+- [godep](https://github.com/tools/godep) - Used for Kubernetes dependency management. See also [Kubernetes godep and dependency management](development.md#godep-and-dependency-management)
+- [Go Version Manager](https://github.com/moovweb/gvm) - A handy tool for managing Go versions.
+- [godepq](https://github.com/google/godepq) - A tool for analyzing go import trees.
+
+## Go Tips
+
+- [Godoc bookmarklet](https://gist.github.com/timstclair/c891fb8aeb24d663026371d91dcdb3fc) - navigate from a github page to the corresponding godoc page.
+- Consider making a separate Go tree for each project, which can make overlapping dependency management much easier. Remember to set the `$GOPATH` correctly! Consider [scripting](https://gist.github.com/timstclair/17ca792a20e0d83b06dddef7d77b1ea0) this; a minimal sketch follows below.
+- Emacs users - setup [go-mode](https://github.com/dominikh/go-mode.el)
+
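+As a rough illustration of the per-project Go tree tip above (a sketch, not the
+linked script), assuming a hypothetical `~/projects/myproject` layout:
+
+```sh
+# One GOPATH per project; run (or source) this from the project's shell.
+export GOPATH="$HOME/projects/myproject"
+export PATH="$PATH:$GOPATH/bin"
+mkdir -p "$GOPATH/src"
+cd "$GOPATH/src"
+```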
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/go-code.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/godep.md b/contributors/devel/godep.md
new file mode 100644
index 00000000..ddd6c5b1
--- /dev/null
+++ b/contributors/devel/godep.md
@@ -0,0 +1,123 @@
+# Using godep to manage dependencies
+
+This document shows one way of managing `vendor/` tree dependencies in
+Kubernetes. If you are not planning on managing `vendor` dependencies, see
+[Godep dependency management](development.md#godep-dependency-management) instead.
+
+## Alternate GOPATH for installing and using godep
+
+There are many ways to build and host Go binaries. Here is one way to get
+utilities like `godep` installed:
+
+Create a new GOPATH just for your go tools and install godep:
+
+```sh
+export GOPATH=$HOME/go-tools
+mkdir -p $GOPATH
+go get -u github.com/tools/godep
+```
+
+Add `$GOPATH/bin` to your path. Typically you'd add this to your `~/.profile`:
+
+```sh
+export GOPATH=$HOME/go-tools
+export PATH=$PATH:$GOPATH/bin
+```
+
+## Using godep
+
+Here's a quick walkthrough of one way to use godep to add or update a
+Kubernetes dependency into `vendor/`. For more details, please see the
+instructions in [godep's documentation](https://github.com/tools/godep).
+
+1) Devote a directory to this endeavor:
+
+_Devoting a separate directory is not strictly required, but it is helpful to
+separate dependency updates from other changes._
+
+```sh
+export KPATH=$HOME/code/kubernetes
+mkdir -p $KPATH/src/k8s.io
+cd $KPATH/src/k8s.io
+git clone https://github.com/$YOUR_GITHUB_USERNAME/kubernetes.git # assumes your fork is 'kubernetes'
+# Or copy your existing local repo here. IMPORTANT: making a symlink doesn't work.
+```
+
+2) Set up your GOPATH.
+
+```sh
+# This will *not* let your local builds see packages that exist elsewhere on your system.
+export GOPATH=$KPATH
+```
+
+3) Populate your new GOPATH.
+
+```sh
+cd $KPATH/src/k8s.io/kubernetes
+godep restore
+```
+
+4) Next, you can either add a new dependency or update an existing one.
+
+To add a new dependency is simple (if a bit slow):
+
+```sh
+cd $KPATH/src/k8s.io/kubernetes
+DEP=example.com/path/to/dependency
+godep get $DEP/...
+# Now change code in Kubernetes to use the dependency.
+./hack/godep-save.sh
+```
+
+To update an existing dependency is a bit more complicated. Godep has an
+`update` command, but none of us can figure out how to actually make it work.
+Instead, this procedure seems to work reliably:
+
+```sh
+cd $KPATH/src/k8s.io/kubernetes
+DEP=example.com/path/to/dependency
+# NB: For the next step, $DEP is assumed to be the repo root. If it is actually a
+# subdir of the repo, use the repo root here. This is required to keep godep
+# from getting angry because `godep restore` left the tree in a "detached head"
+# state.
+rm -rf $KPATH/src/$DEP # repo root
+godep get $DEP/...
+# Change code in Kubernetes, if necessary.
+rm -rf Godeps
+rm -rf vendor
+./hack/godep-save.sh
+git checkout -- $(git status -s | grep "^ D" | awk '{print $2}' | grep ^Godeps)
+```
+
+_If `go get -u path/to/dependency` fails with compilation errors, instead try
+`go get -d -u path/to/dependency` to fetch the dependencies without compiling
+them. This is unusual, but has been observed._
+
+After all of this is done, `git status` should show you what files have been
+modified and added/removed. Make sure to `git add` and `git rm` them. It is
+commonly advised to make one `git commit` which includes just the dependency
+update and Godeps files, and another `git commit` that includes changes to
+Kubernetes code to use the new/updated dependency. These commits can go into a
+single pull request.
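+
+One possible way to split the two commits (paths and messages are illustrative):
+
+```sh
+# Commit 1: the dependency update and Godeps files only.
+git add Godeps vendor
+git commit -m "Update example.com/path/to/dependency"
+
+# Commit 2: the Kubernetes code that uses the new/updated dependency.
+git add pkg cmd
+git commit -m "Adapt to updated example.com/path/to/dependency"
+```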
+
+5) Before sending your PR, it's a good idea to sanity check that your
+`Godeps/Godeps.json` file and the contents of `vendor/` are ok by running `hack/verify-godeps.sh`.
+
+_If `hack/verify-godeps.sh` fails after a `godep update`, it is possible that a
+transitive dependency was added or removed but not updated by godep. It then
+may be necessary to run `hack/godep-save.sh` to pick up the transitive
+dependency changes._
+
+It is sometimes expedient to manually fix the `/Godeps/Godeps.json` file to
+minimize the changes. However, without great care this can lead to failures
+with `hack/verify-godeps.sh`, which must pass for every PR.
+
+6) If you updated the Godeps, please also update `Godeps/LICENSES` by running
+`hack/update-godep-licenses.sh`.
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/godep.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/gubernator-images/filterpage.png b/contributors/devel/gubernator-images/filterpage.png
new file mode 100644
index 00000000..2d08bd8e
--- /dev/null
+++ b/contributors/devel/gubernator-images/filterpage.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/filterpage1.png b/contributors/devel/gubernator-images/filterpage1.png
new file mode 100644
index 00000000..838cb0fa
--- /dev/null
+++ b/contributors/devel/gubernator-images/filterpage1.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/filterpage2.png b/contributors/devel/gubernator-images/filterpage2.png
new file mode 100644
index 00000000..63da782e
--- /dev/null
+++ b/contributors/devel/gubernator-images/filterpage2.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/filterpage3.png b/contributors/devel/gubernator-images/filterpage3.png
new file mode 100644
index 00000000..33066d78
--- /dev/null
+++ b/contributors/devel/gubernator-images/filterpage3.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/skipping1.png b/contributors/devel/gubernator-images/skipping1.png
new file mode 100644
index 00000000..a5dea440
--- /dev/null
+++ b/contributors/devel/gubernator-images/skipping1.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/skipping2.png b/contributors/devel/gubernator-images/skipping2.png
new file mode 100644
index 00000000..b133347e
--- /dev/null
+++ b/contributors/devel/gubernator-images/skipping2.png
Binary files differ
diff --git a/contributors/devel/gubernator-images/testfailures.png b/contributors/devel/gubernator-images/testfailures.png
new file mode 100644
index 00000000..1b331248
--- /dev/null
+++ b/contributors/devel/gubernator-images/testfailures.png
Binary files differ
diff --git a/contributors/devel/gubernator.md b/contributors/devel/gubernator.md
new file mode 100644
index 00000000..3fd2e445
--- /dev/null
+++ b/contributors/devel/gubernator.md
@@ -0,0 +1,142 @@
+# Gubernator
+
+*This document is oriented at developers who want to use Gubernator to debug while developing for Kubernetes.*
+
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Gubernator](#gubernator)
+ - [What is Gubernator?](#what-is-gubernator)
+ - [Gubernator Features](#gubernator-features)
+ - [Test Failures list](#test-failures-list)
+ - [Log Filtering](#log-filtering)
+ - [Gubernator for Local Tests](#gubernator-for-local-tests)
+ - [Future Work](#future-work)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## What is Gubernator?
+
+[Gubernator](https://k8s-gubernator.appspot.com/) is a webpage for viewing and filtering Kubernetes
+test results.
+
+Gubernator simplifies the debugging process and makes it easier to track down failures by automating many
+steps commonly taken in searching through logs, and by offering tools to filter through logs to find relevant lines.
+Gubernator automates the steps of finding the failed tests, displaying relevant logs, and determining the
+failed pods and the corresponding pod UID, namespace, and container ID.
+It also allows for filtering of the log files to display relevant lines based on selected keywords, and
+allows for multiple logs to be woven together by timestamp.
+
+Gubernator runs on Google App Engine and fetches logs stored on Google Cloud Storage.
+
+## Gubernator Features
+
+### Test Failures list
+
+Issues filed by k8s-merge-robot will include a link to a page listing the failed tests.
+Each failed test comes with the corresponding error log from a junit file and a link
+to filter logs for that test.
+
+Based on the message logged in the junit file, the pod name may be displayed.
+
+![alt text](gubernator-images/testfailures.png)
+
+[Test Failures List Example](https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke/11721)
+
+### Log Filtering
+
+The log filtering page comes with checkboxes and textboxes to aid in filtering. Filtered keywords will be bolded
+and lines including keywords will be highlighted. Up to four lines around the line of interest will also be displayed.
+
+![alt text](gubernator-images/filterpage.png)
+
+If fewer than 100 lines are skipped, the "... skipping xx lines ..." message can be clicked to expand and show
+the hidden lines.
+
+Before expansion:
+![alt text](gubernator-images/skipping1.png)
+After expansion:
+![alt text](gubernator-images/skipping2.png)
+
+If the pod name was displayed in the Test Failures list, it will automatically be included in the filters.
+If it is not found in the error message, it can be manually entered into the textbox. Once a pod name
+is entered, the Pod UID, Namespace, and ContainerID may be automatically filled in as well. These can
+also be edited manually. To apply a filter, check off the options corresponding to it.
+
+![alt text](gubernator-images/filterpage1.png)
+
+To add a filter, type the term to be filtered into the textbox labeled "Add filter:" and press enter.
+Additional filters will be displayed as checkboxes under the textbox.
+
+![alt text](gubernator-images/filterpage3.png)
+
+To choose which logs to view, check off the checkboxes corresponding to the logs of interest. If multiple logs are
+included, the "Weave by timestamp" option can weave the selected logs together based on the timestamp in each line.
+
+![alt text](gubernator-images/filterpage2.png)
+
+[Log Filtering Example 1](https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/5535/nodelog?pod=pod-configmaps-b5b876cb-3e1e-11e6-8956-42010af0001d&junit=junit_03.xml&wrap=on&logfiles=%2Fkubernetes-jenkins%2Flogs%2Fkubelet-gce-e2e-ci%2F5535%2Fartifacts%2Ftmp-node-e2e-7a5a3b40-e2e-node-coreos-stable20160622-image%2Fkube-apiserver.log&logfiles=%2Fkubernetes-jenkins%2Flogs%2Fkubelet-gce-e2e-ci%2F5535%2Fartifacts%2Ftmp-node-e2e-7a5a3b40-e2e-node-coreos-stable20160622-image%2Fkubelet.log&UID=on&poduid=b5b8a59e-3e1e-11e6-b358-42010af0001d&ns=e2e-tests-configmap-oi12h&cID=tmp-node-e2e-7a5a3b40-e2e-node-coreos-stable20160622-image)
+
+[Log Filtering Example 2](https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke/11721/nodelog?pod=client-containers-a53f813c-503e-11e6-88dd-0242ac110003&junit=junit_19.xml&wrap=on)
+
+
+### Gubernator for Local Tests
+
+*Currently Gubernator can only be used with remote node e2e tests.*
+
+**NOTE: Using Gubernator with local tests will publicly upload your test logs to Google Cloud Storage**
+
+To use Gubernator to view logs from local test runs, set the GUBERNATOR tag to true.
+A URL link to view the test results will be printed to the console.
+Please note that running with the Gubernator tag will bypass the user confirmation for uploading to GCS.
+
+```console
+
+$ make test-e2e-node REMOTE=true GUBERNATOR=true
+...
+================================================================
+Running gubernator.sh
+
+Gubernator linked below:
+k8s-gubernator.appspot.com/build/yourusername-g8r-logs/logs/e2e-node/timestamp
+```
+
+The gubernator.sh script can be run after running a remote node e2e test for the same effect.
+
+```console
+$ ./test/e2e_node/gubernator.sh
+Do you want to run gubernator.sh and upload logs publicly to GCS? [y/n]y
+...
+Gubernator linked below:
+k8s-gubernator.appspot.com/build/yourusername-g8r-logs/logs/e2e-node/timestamp
+```
+
+## Future Work
+
+Gubernator provides a framework for debugging failures and introduces useful features.
+There is still a lot of room for more features and growth to make the debugging process more efficient.
+
+To contribute, see the [Gubernator README](https://github.com/kubernetes/test-infra/blob/master/gubernator/README.md).
+
+* Extend GUBERNATOR flag to all local tests
+
+* More accurate identification of pod name, container ID, etc.
+ * Change content of logged strings for failures to include more information
+ * Better regex in Gubernator
+
+* Automate discovery of more keywords
+ * Volume Name
+ * Disk Name
+ * Pod IP
+
+* Clickable API objects in the displayed lines in order to add them as filters
+
+* Construct story of pod's lifetime
+ * Have concise view of what a pod went through from when pod was started to failure
+
+* Improve UI
+ * Have separate folders of logs in rows instead of in one long column
+ * Improve interface for adding additional features (maybe instead of textbox and checkbox, have chips)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/gubernator.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/how-to-doc.md b/contributors/devel/how-to-doc.md
new file mode 100644
index 00000000..891969d7
--- /dev/null
+++ b/contributors/devel/how-to-doc.md
@@ -0,0 +1,205 @@
+# Document Conventions
+
+Updated: 11/3/2015
+
+*This document is oriented at users and developers who want to write documents
+for Kubernetes.*
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Document Conventions](#document-conventions)
+ - [General Concepts](#general-concepts)
+ - [How to Get a Table of Contents](#how-to-get-a-table-of-contents)
+ - [How to Write Links](#how-to-write-links)
+ - [How to Include an Example](#how-to-include-an-example)
+ - [Misc.](#misc)
+ - [Code formatting](#code-formatting)
+ - [Syntax Highlighting](#syntax-highlighting)
+ - [Headings](#headings)
+ - [What Are Mungers?](#what-are-mungers)
+ - [Auto-added Mungers](#auto-added-mungers)
+ - [Generate Analytics](#generate-analytics)
+- [Generated documentation](#generated-documentation)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## General Concepts
+
+Each document needs to be munged to ensure its format is correct, links are
+valid, etc. To munge a document, simply run `hack/update-munge-docs.sh`. We
+verify that all documents have been munged using `hack/verify-munge-docs.sh`.
+The scripts for munging documents are called mungers, see the
+[mungers section](#what-are-mungers) below if you're curious about how mungers
+are implemented or if you want to write one.
+
+## How to Get a Table of Contents
+
+Instead of writing a table of contents by hand, insert the following code in
+your md file:
+
+```
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+<!-- END MUNGE: GENERATED_TOC -->
+```
+
+After running `hack/update-munge-docs.sh`, you'll see a table of contents
+generated for you, layered based on the headings.
+
+## How to Write Links
+
+It's important to follow these rules when writing links, as they help us
+correctly version documents for each release.
+
+Use inline links instead of urls at all times. When you add internal links to
+`docs/` or `examples/`, use relative links; otherwise, use
+`http://releases.k8s.io/HEAD/<path/to/link>`. For example, avoid using:
+
+```
+[GCE](https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/gce.md) # note that it's under docs/
+[Kubernetes package](../../pkg/) # note that it's under pkg/
+http://kubernetes.io/ # external link
+```
+
+Instead, use:
+
+```
+[GCE](../getting-started-guides/gce.md) # note that it's under docs/
+[Kubernetes package](http://releases.k8s.io/HEAD/pkg/) # note that it's under pkg/
+[Kubernetes](http://kubernetes.io/) # external link
+```
+
+The above example generates the following links:
+[GCE](../getting-started-guides/gce.md),
+[Kubernetes package](http://releases.k8s.io/HEAD/pkg/), and
+[Kubernetes](http://kubernetes.io/).
+
+## How to Include an Example
+
+While writing examples, you may want to show the content of certain example
+files (e.g. [pod.yaml](../../test/fixtures/doc-yaml/user-guide/pod.yaml)). In this case, insert the
+following code in the md file:
+
+```
+<!-- BEGIN MUNGE: EXAMPLE path/to/file -->
+<!-- END MUNGE: EXAMPLE path/to/file -->
+```
+
+Note that you should replace `path/to/file` with the relative path to the
+example file. Then `hack/update-munge-docs.sh` will generate a code block with
+the content of the specified file, and a link to download it. This way, you save
+the time of copying and pasting; better still, the content won't become
+out-of-date every time you update the example file.
+
+For example, the following:
+
+```
+<!-- BEGIN MUNGE: EXAMPLE ../../test/fixtures/doc-yaml/user-guide/pod.yaml -->
+<!-- END MUNGE: EXAMPLE ../../test/fixtures/doc-yaml/user-guide/pod.yaml -->
+```
+
+generates the following after `hack/update-munge-docs.sh`:
+
+<!-- BEGIN MUNGE: EXAMPLE ../../test/fixtures/doc-yaml/user-guide/pod.yaml -->
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: nginx
+ labels:
+ app: nginx
+spec:
+ containers:
+ - name: nginx
+ image: nginx
+ ports:
+ - containerPort: 80
+```
+
+[Download example](../../test/fixtures/doc-yaml/user-guide/pod.yaml?raw=true)
+<!-- END MUNGE: EXAMPLE ../../test/fixtures/doc-yaml/user-guide/pod.yaml -->
+
+## Misc.
+
+### Code formatting
+
+Wrap a span of code with single backticks (`` ` ``). To format multiple lines of
+code as its own code block, use triple backticks (```` ``` ````).
+
+### Syntax Highlighting
+
+Adding syntax highlighting to code blocks improves readability. To do so, in
+your fenced block, add an optional language identifier. Some useful identifiers
+include `yaml`, `console` (for console output), and `sh` (for shell
+commands). Note that in console output, put `$ ` at the beginning of each
+command and nothing at the beginning of the output. Here's an example of a
+console code block:
+
+```
+```console
+
+$ kubectl create -f test/fixtures/doc-yaml/user-guide/pod.yaml
+pod "foo" created
+
+``` 
+```
+
+which renders as:
+
+```console
+$ kubectl create -f test/fixtures/doc-yaml/user-guide/pod.yaml
+pod "foo" created
+```
+
+### Headings
+
+Add a single `#` before the document title to create a title heading, and add
+`##` to the next level of section title, and so on. Note that the number of `#`
+will determine the size of the heading.
+
+## What Are Mungers?
+
+Mungers are like gofmt for md docs; we use them to format documents. To use
+one, simply place
+
+```
+<!-- BEGIN MUNGE: xxxx -->
+<!-- END MUNGE: xxxx -->
+```
+
+in your md files. Note that xxxx is the placeholder for a specific munger.
+Appropriate content will be generated and inserted between the two comment tags
+after you run `hack/update-munge-docs.sh`. See
+[munger document](http://releases.k8s.io/HEAD/cmd/mungedocs/) for more details.
+
+## Auto-added Mungers
+
+After running `hack/update-munge-docs.sh`, you may see some code / mungers in
+your md file that are auto-added. You don't have to add them manually. It's
+recommended to treat this section as a reference rather than editing the
+following mungers by hand.
+
+### Generate Analytics
+
+The ANALYTICS munger inserts a Google Analytics link for this page.
+
+```
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+<!-- END MUNGE: GENERATED_ANALYTICS -->
+```
+
+# Generated documentation
+
+Some documents can be generated automatically. Run `hack/generate-docs.sh` to
+populate your repository with these generated documents, and a list of the files
+it generates is placed in `.generated_docs`. To reduce merge conflicts, we do
+not want to check these documents in; however, to make the link checker in the
+munger happy, we check in a placeholder. `hack/update-generated-docs.sh` puts a
+placeholder in the location where each generated document would go, and
+`hack/verify-generated-docs.sh` verifies that the placeholder is in place.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/how-to-doc.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/instrumentation.md b/contributors/devel/instrumentation.md
new file mode 100644
index 00000000..b73221a9
--- /dev/null
+++ b/contributors/devel/instrumentation.md
@@ -0,0 +1,52 @@
+## Instrumenting Kubernetes with a new metric
+
+The following is a step-by-step guide for adding a new metric to the Kubernetes
+code base.
+
+We use the Prometheus monitoring system's golang client library for
+instrumenting our code. Once you've picked out a file that you want to add a
+metric to, you should:
+
+1. Import "github.com/prometheus/client_golang/prometheus".
+
+2. Create a top-level var to define the metric. For this, you have to:
+
+ 1. Pick the type of metric. Use a Gauge for things you want to set to a
+particular value, a Counter for things you want to increment, or a Histogram or
+Summary for histograms/distributions of values (typically for latency).
+Histograms are better if you're going to aggregate the values across jobs, while
+summaries are better if you just want the job to give you a useful summary of
+the values.
+ 2. Give the metric a name and description.
+ 3. Pick whether you want to distinguish different categories of things using
+labels on the metric. If so, add "Vec" to the name of the type of metric you
+want and add a slice of the label names to the definition.
+
+ https://github.com/kubernetes/kubernetes/blob/cd3299307d44665564e1a5c77d0daa0286603ff5/pkg/apiserver/apiserver.go#L53
+ https://github.com/kubernetes/kubernetes/blob/cd3299307d44665564e1a5c77d0daa0286603ff5/pkg/kubelet/metrics/metrics.go#L31
+
+3. Register the metric so that prometheus will know to export it.
+
+ https://github.com/kubernetes/kubernetes/blob/cd3299307d44665564e1a5c77d0daa0286603ff5/pkg/kubelet/metrics/metrics.go#L74
+ https://github.com/kubernetes/kubernetes/blob/cd3299307d44665564e1a5c77d0daa0286603ff5/pkg/apiserver/apiserver.go#L78
+
+4. Use the metric by calling the appropriate method for your metric type (Set,
+Inc/Add, or Observe, respectively for Gauge, Counter, or Histogram/Summary),
+first calling WithLabelValues if your metric has any labels. A short sketch
+combining these steps appears below.
+
+ https://github.com/kubernetes/kubernetes/blob/3ce7fe8310ff081dbbd3d95490193e1d5250d2c9/pkg/kubelet/kubelet.go#L1384
+ https://github.com/kubernetes/kubernetes/blob/cd3299307d44665564e1a5c77d0daa0286603ff5/pkg/apiserver/apiserver.go#L87
+
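+Putting the steps together, a minimal sketch might look like the following
+(the metric name, label, and package here are illustrative, not an existing
+Kubernetes metric):
+
+```go
+package metrics
+
+import "github.com/prometheus/client_golang/prometheus"
+
+// requestLatency is a HistogramVec with a single "verb" label (steps 1 and 2).
+var requestLatency = prometheus.NewHistogramVec(
+	prometheus.HistogramOpts{
+		Name: "example_request_latency_microseconds",
+		Help: "Latency of example requests, partitioned by verb.",
+	},
+	[]string{"verb"},
+)
+
+func init() {
+	// Step 3: register the metric so Prometheus knows to export it.
+	prometheus.MustRegister(requestLatency)
+}
+
+// observeRequest records one latency observation (step 4).
+func observeRequest(verb string, latencyMicroseconds float64) {
+	requestLatency.WithLabelValues(verb).Observe(latencyMicroseconds)
+}
+```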
+
+These are the metric type definitions if you're curious to learn about them or
+need more information:
+
+https://github.com/prometheus/client_golang/blob/master/prometheus/gauge.go
+https://github.com/prometheus/client_golang/blob/master/prometheus/counter.go
+https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go
+https://github.com/prometheus/client_golang/blob/master/prometheus/summary.go
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/instrumentation.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/issues.md b/contributors/devel/issues.md
new file mode 100644
index 00000000..fe9e94d9
--- /dev/null
+++ b/contributors/devel/issues.md
@@ -0,0 +1,59 @@
+## GitHub Issues for the Kubernetes Project
+
+A quick overview of how we will review and prioritize incoming issues at
+https://github.com/kubernetes/kubernetes/issues
+
+### Priorities
+
+We use GitHub issue labels for prioritization. The absence of a priority label
+means the bug has not been reviewed and prioritized yet.
+
+We try to apply these priority labels consistently across the entire project,
+but if you notice an issue that you believe to be incorrectly prioritized,
+please do let us know and we will evaluate your counter-proposal.
+
+- **priority/P0**: Must be actively worked on as someone's top priority right
+now. Stuff is burning. If it's not being actively worked on, someone is expected
+to drop what they're doing immediately to work on it. Team leaders are
+responsible for making sure that all P0's in their area are being actively
+worked on. Examples include user-visible bugs in core features, broken builds or
+tests and critical security issues.
+
+- **priority/P1**: Must be staffed and worked on either currently, or very soon,
+ideally in time for the next release.
+
+- **priority/P2**: There appears to be general agreement that this would be good
+to have, but we may not have anyone available to work on it right now or in the
+immediate future. Community contributions would be most welcome in the mean time
+(although it might take a while to get them reviewed if reviewers are fully
+occupied with higher priority issues, for example immediately before a release).
+
+- **priority/P3**: Possibly useful, but not yet enough support to actually get
+it done. These are mostly place-holders for potentially good ideas, so that they
+don't get completely forgotten, and can be referenced/deduped every time they
+come up.
+
+### Milestones
+
+We additionally use milestones, based on minor version, for determining if a bug
+should be fixed for the next release. These milestones will be especially
+scrutinized as we get to the weeks just before a release. We can release a new
+version of Kubernetes once they are empty. We will have two milestones per minor
+release.
+
+- **vX.Y**: The list of bugs that will be merged for that milestone once ready.
+
+- **vX.Y-candidate**: The list of bugs that we might merge for that milestone. A
+bug shouldn't be in this milestone for more than a day or two towards the end of
+a milestone. It should be triaged either into vX.Y, or moved out of the release
+milestones.
+
+The above priority scheme still applies. P0 and P1 issues are work we feel must
+get done before release. P2 and P3 issues are work we would merge into the
+release if it gets done, but we wouldn't block the release on it. A few days
+before release, we will probably move all P2 and P3 bugs out of that milestone
+in bulk.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/issues.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/kubectl-conventions.md b/contributors/devel/kubectl-conventions.md
new file mode 100644
index 00000000..1e94b3ba
--- /dev/null
+++ b/contributors/devel/kubectl-conventions.md
@@ -0,0 +1,411 @@
+# Kubectl Conventions
+
+Updated: 8/27/2015
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Kubectl Conventions](#kubectl-conventions)
+ - [Principles](#principles)
+ - [Command conventions](#command-conventions)
+ - [Create commands](#create-commands)
+ - [Rules for extending special resource alias - "all"](#rules-for-extending-special-resource-alias---all)
+ - [Flag conventions](#flag-conventions)
+ - [Output conventions](#output-conventions)
+ - [Documentation conventions](#documentation-conventions)
+ - [Command implementation conventions](#command-implementation-conventions)
+ - [Generators](#generators)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+## Principles
+
+* Strive for consistency across commands
+
+* Explicit should always override implicit
+
+ * Environment variables should override default values
+
+ * Command-line flags should override default values and environment variables
+
+ * `--namespace` should also override the value specified in a specified
+resource
+
+## Command conventions
+
+* Command names are all lowercase, and hyphenated if multiple words.
+
+* kubectl VERB NOUNs for commands that apply to multiple resource types.
+
+* Command itself should not have built-in aliases.
+
+* NOUNs may be specified as `TYPE name1 name2` or `TYPE/name1 TYPE/name2` or
+`TYPE1,TYPE2,TYPE3/name1`; TYPE is omitted when only a single type is expected.
+
+* Resource types are all lowercase, with no hyphens; both singular and plural
+forms are accepted.
+
+* NOUNs may also be specified by one or more file arguments: `-f file1 -f file2
+...`
+
+* Resource types may have 2- or 3-letter aliases.
+
+* Business logic should be decoupled from the command framework, so that it can
+be reused independently of kubectl, cobra, etc.
+ * Ideally, commonly needed functionality would be implemented server-side in
+order to avoid problems typical of "fat" clients and to make it readily
+available to non-Go clients.
+
+* Commands that generate resources, such as `run` or `expose`, should obey
+specific conventions, see [generators](#generators).
+
+* A command group (e.g., `kubectl config`) may be used to group related
+non-standard commands, such as custom generators, mutations, and computations.
+
+
+### Create commands
+
+`kubectl create <resource>` commands fill the gap between "I want to try
+Kubernetes, but I don't know or care what gets created" (`kubectl run`) and "I
+want to create exactly this" (author yaml and run `kubectl create -f`). They
+provide an easy way to create a valid object without having to know the vagaries
+of particular kinds, nested fields, and object key typos that are ignored by the
+yaml/json parser. Because editing an already created object is easier than
+authoring one from scratch, these commands only need to have enough parameters
+to create a valid object and set common immutable fields. It should default as
+much as is reasonably possible. Once that valid object is created, it can be
+further manipulated using `kubectl edit` or the eventual `kubectl set` commands.
+
+`kubectl create <resource> <special-case>` commands help in cases where you need
+to perform non-trivial configuration generation/transformation tailored for a
+common use case. `kubectl create secret` is a good example: there's a `generic`
+flavor with keys mapping to files, then there's a `docker-registry` flavor that
+is tailored for creating an image pull secret, and there's a `tls` flavor for
+creating tls secrets. You create these as separate commands to get distinct
+flags and separate help that is tailored for the particular usage.
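+
+For instance (names and file paths below are illustrative):
+
+```console
+$ kubectl create secret generic my-secret --from-file=./username.txt --from-file=./password.txt
+$ kubectl create secret docker-registry my-pull-secret --docker-server=registry.example.com \
+    --docker-username=user --docker-password=pass --docker-email=user@example.com
+$ kubectl create secret tls my-tls-secret --cert=./tls.crt --key=./tls.key
+```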
+
+
+### Rules for extending special resource alias - "all"
+
+Here are the rules to add a new resource to the `kubectl get all` output.
+
+* No cluster scoped resources
+
+* No namespace admin level resources (limits, quota, policy, authorization
+rules)
+
+* No resources that are potentially unrecoverable (secrets and pvc)
+
+* Resources that are considered "similar" to #3 should be grouped
+the same (configmaps)
+
+
+## Flag conventions
+
+* Flags are all lowercase, with words separated by hyphens
+
+* Flag names and single-character aliases should have the same meaning across
+all commands
+
+* Flag descriptions should start with an uppercase letter and not have a
+period at the end of a sentence
+
+* Command-line flags corresponding to API fields should accept API enums
+exactly (e.g., `--restart=Always`)
+
+* Do not reuse flags for different semantic purposes, and do not use different
+flag names for the same semantic purpose -- grep for `"Flags()"` before adding a
+new flag
+
+* Use short flags sparingly, only for the most frequently used options, prefer
+lowercase over uppercase for the most common cases, try to stick to well known
+conventions for UNIX commands and/or Docker, where they exist, and update this
+list when adding new short flags
+
+ * `-f`: Resource file
+ * also used for `--follow` in `logs`, but should be deprecated in favor of `-F`
+ * `-n`: Namespace scope
+ * `-l`: Label selector
+ * also used for `--labels` in `expose`, but should be deprecated
+ * `-L`: Label columns
+ * `-c`: Container
+ * also used for `--client` in `version`, but should be deprecated
+ * `-i`: Attach stdin
+ * `-t`: Allocate TTY
+ * `-w`: Watch (currently also used for `--www` in `proxy`, but should be deprecated)
+ * `-p`: Previous
+ * also used for `--pod` in `exec`, but deprecated
+ * also used for `--patch` in `patch`, but should be deprecated
+ * also used for `--port` in `proxy`, but should be deprecated
+ * `-P`: Static file prefix in `proxy`, but should be deprecated
+ * `-r`: Replicas
+ * `-u`: Unix socket
+ * `-v`: Verbose logging level
+
+
+* `--dry-run`: Don't modify the live state; simulate the mutation and display
+the output. All mutations should support it.
+
+* `--local`: Don't contact the server; just do local read, transformation,
+generation, etc., and display the output
+
+* `--output-version=...`: Convert the output to a different API group/version
+
+* `--short`: Output a compact summary of normal output; the format is subject
+to change and is optimized for reading, not parsing.
+
+* `--validate`: Validate the resource schema
+
+## Output conventions
+
+* By default, output is intended for humans rather than programs
+ * However, affordances are made for simple parsing of `get` output
+
+* Only errors should be directed to stderr
+
+* `get` commands should output one row per resource, and one resource per row
+
+ * Column titles and values should not contain spaces in order to facilitate
+commands that break lines into fields: cut, awk, etc. Instead, use `-` as the
+word separator.
+
+ * By default, `get` output should fit within about 80 columns
+
+ * Eventually we could perhaps auto-detect width
+ * `-o wide` may be used to display additional columns
+
+
+ * The first column should be the resource name, titled `NAME` (may change this
+to an abbreviation of resource type)
+
+ * NAMESPACE should be displayed as the first column when --all-namespaces is
+specified
+
+ * The last default column should be time since creation, titled `AGE`
+
+ * `-Lkey` should append a column containing the value of label with key `key`,
+with `<none>` if not present
+
+ * json, yaml, Go template, and jsonpath template formats should be supported
+and encouraged for subsequent processing
+
+ * Users should use --api-version or --output-version to ensure the output
+uses the version they expect
+
+
+* `describe` commands may output on multiple lines and may include information
+from related resources, such as events. Describe should add additional
+information from related resources that a normal user may need to know - if a
+user would always run "describe resource1" and then immediately want to run a
+"get type2" or "describe resource2", consider including that info. Examples,
+persistent volume claims for pods that reference claims, events for most
+resources, nodes and the pods scheduled on them. When fetching related
+resources, a targeted field selector should be used in favor of client side
+filtering of related resources.
+
+* For fields that can be explicitly unset (booleans, integers, structs), the
+output should say `<unset>`. Likewise, for arrays `<none>` should be used; for
+external IP, `<nodes>` should be used; for load balancer, `<pending>` should be
+used. Lastly, `<unknown>` should be used where an unrecognized field type was
+specified.
+
+* Mutations should output TYPE/name verbed by default, where TYPE is singular;
+`-o name` may be used to just display TYPE/name, which may be used to specify
+resources in other commands
+
+## Documentation conventions
+
+* Commands are documented using Cobra; docs are then auto-generated by
+`hack/update-generated-docs.sh`.
+
+ * Use should contain a short usage string for the most common use case(s), not
+an exhaustive specification
+
+ * Short should contain a one-line explanation of what the command does
+   * Short descriptions should start with an uppercase letter and not
+ have a period at the end of a sentence
+ * Short descriptions should (if possible) start with a first person
+ (singular present tense) verb
+
+ * Long may contain multiple lines, including additional information about
+input, output, commonly used flags, etc.
+ * Long descriptions should use proper grammar, start with an uppercase
+ letter and have a period at the end of a sentence
+
+
+ * Example should contain examples
+ * Start commands with `$`
+ * A comment should precede each example command, and should begin with `#`
+
+
+* Use "FILENAME" for filenames
+
+* Use "TYPE" for the particular flavor of resource type accepted by kubectl,
+rather than "RESOURCE" or "KIND"
+
+* Use "NAME" for resource names
+
+## Command implementation conventions
+
+For every command there should be a `NewCmd<CommandName>` function that creates
+the command and returns a pointer to a `cobra.Command`, which can later be added
+to other parent commands to compose the structure tree. There should also be a
+`<CommandName>Config` struct with a variable to every flag and argument declared
+by the command (and any other variable required for the command to run). This
+makes tests and mocking easier. The struct ideally exposes three methods:
+
+* `Complete`: Completes the struct fields with values that may or may not be
+directly provided by the user, for example, by flags pointers, by the `args`
+slice, by using the Factory, etc.
+
+* `Validate`: performs validation on the struct fields and returns appropriate
+errors.
+
+* `Run<CommandName>`: runs the actual logic of the command, taking as assumption
+that the struct is complete with all required values to run, and they are valid.
+
+Sample command skeleton:
+
+```go
+// MineRecommendedName is the recommended command name for kubectl mine.
+const MineRecommendedName = "mine"
+
+// Long command description and examples.
+var (
+ mineLong = templates.LongDesc(`
+ mine which is described here
+ with lots of details.`)
+
+ mineExample = templates.Examples(`
+ # Run my command's first action
+ kubectl mine first_action
+
+ # Run my command's second action on latest stuff
+ kubectl mine second_action --flag`)
+)
+
+// MineConfig contains all the options for running the mine cli command.
+type MineConfig struct {
+ mineLatest bool
+}
+
+// NewCmdMine implements the kubectl mine command.
+func NewCmdMine(parent, name string, f *cmdutil.Factory, out io.Writer) *cobra.Command {
+ opts := &MineConfig{}
+
+ cmd := &cobra.Command{
+ Use: fmt.Sprintf("%s [--latest]", name),
+ Short: "Run my command",
+ Long: mineLong,
+ Example: fmt.Sprintf(mineExample, parent+" "+name),
+ Run: func(cmd *cobra.Command, args []string) {
+ if err := opts.Complete(f, cmd, args, out); err != nil {
+ cmdutil.CheckErr(err)
+ }
+ if err := opts.Validate(); err != nil {
+ cmdutil.CheckErr(cmdutil.UsageError(cmd, err.Error()))
+ }
+ if err := opts.RunMine(); err != nil {
+ cmdutil.CheckErr(err)
+ }
+ },
+ }
+
+	cmd.Flags().BoolVar(&opts.mineLatest, "latest", false, "Use latest stuff")
+ return cmd
+}
+
+// Complete completes all the required options for mine.
+func (o *MineConfig) Complete(f *cmdutil.Factory, cmd *cobra.Command, args []string, out io.Writer) error {
+ return nil
+}
+
+// Validate validates all the required options for mine.
+func (o MineConfig) Validate() error {
+ return nil
+}
+
+// RunMine implements all the necessary functionality for mine.
+func (o MineConfig) RunMine() error {
+ return nil
+}
+```
+
+The `Run<CommandName>` method should contain the business logic of the command
+and as noted in [command conventions](#command-conventions), ideally that logic
+should exist server-side so any client could take advantage of it. Notice that
+this is not a mandatory structure and not every command is implemented this way,
+but this is a nice convention so try to be compliant with it. As an example,
+have a look at how [kubectl logs](../../pkg/kubectl/cmd/logs.go) is implemented.
+
+## Generators
+
+Generators are kubectl commands that generate resources based on a set of inputs
+(other resources, flags, or a combination of both).
+
+The point of generators is:
+
+* to enable users using kubectl in a scripted fashion to pin to a particular
+behavior which may change in the future. Explicit use of a generator will always
+guarantee that the expected behavior stays the same.
+
+* to enable potential expansion of the generated resources for scenarios other
+than just creation, similar to how -f is supported for most general-purpose
+commands.
+
+Generator commands should obey the following conventions:
+
+* A `--generator` flag should be defined. Users then can choose between
+different generators, if the command supports them (for example, `kubectl run`
+currently supports generators for pods, jobs, replication controllers, and
+deployments), or between different versions of a generator so that users
+depending on a specific behavior may pin to that version (for example, `kubectl
+expose` currently supports two different versions of a service generator).
+
+* Generation should be decoupled from creation. A generator should implement the
+`kubectl.StructuredGenerator` interface and have no dependencies on cobra or the
+Factory. See, for example, how the first version of the namespace generator is
+defined:
+
+```go
+// NamespaceGeneratorV1 supports stable generation of a namespace
+type NamespaceGeneratorV1 struct {
+ // Name of namespace
+ Name string
+}
+
+// Ensure it supports the generator pattern that uses parameters specified during construction
+var _ StructuredGenerator = &NamespaceGeneratorV1{}
+
+// StructuredGenerate outputs a namespace object using the configured fields
+func (g *NamespaceGeneratorV1) StructuredGenerate() (runtime.Object, error) {
+ if err := g.validate(); err != nil {
+ return nil, err
+ }
+ namespace := &api.Namespace{}
+ namespace.Name = g.Name
+ return namespace, nil
+}
+
+// validate validates required fields are set to support structured generation
+func (g *NamespaceGeneratorV1) validate() error {
+ if len(g.Name) == 0 {
+ return fmt.Errorf("name must be specified")
+ }
+ return nil
+}
+```
+
+The generator struct (`NamespaceGeneratorV1`) holds the necessary fields for
+namespace generation. It also satisfies the `kubectl.StructuredGenerator`
+interface by implementing the `StructuredGenerate() (runtime.Object, error)`
+method which configures the generated namespace that callers of the generator
+(`kubectl create namespace` in our case) need to create.
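+
+For illustration, a caller such as `kubectl create namespace` might drive the
+generator roughly like this (a sketch reusing the types above; error handling
+and wiring into cobra are omitted):
+
+```go
+// createNamespaceObject builds the namespace object that the command would
+// then send to the server (or print when --dry-run is set).
+func createNamespaceObject(name string) (runtime.Object, error) {
+	var gen StructuredGenerator = &NamespaceGeneratorV1{Name: name}
+	return gen.StructuredGenerate()
+}
+```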
+
+* `--dry-run` should output the resource that would be created, without
+creating it.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/kubectl-conventions.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/kubemark-guide.md b/contributors/devel/kubemark-guide.md
new file mode 100755
index 00000000..e914226d
--- /dev/null
+++ b/contributors/devel/kubemark-guide.md
@@ -0,0 +1,212 @@
+# Kubemark User Guide
+
+## Introduction
+
+Kubemark is a performance testing tool which allows users to run experiments on
+simulated clusters. The primary use case is scalability testing, as simulated
+clusters can be much bigger than the real ones. The objective is to expose
+problems with the master components (API server, controller manager or
+scheduler) that appear only on bigger clusters (e.g. small memory leaks).
+
+This document serves as a primer to understand what Kubemark is, what it is not,
+and how to use it.
+
+## Architecture
+
+At a very high level, a Kubemark cluster consists of two parts: real master
+components and a set of “Hollow” Nodes. The prefix “Hollow” means an
+implementation/instantiation of a component with all “moving” parts mocked out.
+The best example is HollowKubelet, which pretends to be an ordinary Kubelet, but
+does not start anything, nor mount any volumes - it just pretends that it does.
+More detailed design and implementation details are at the end of this document.
+
+Currently the master components run on a dedicated machine (or machines), and
+HollowNodes run on an ‘external’ Kubernetes cluster. Compared to running the
+master components on the external cluster, this design has the slight advantage
+of completely isolating master resources from everything else.
+
+## Requirements
+
+To run Kubemark you need a Kubernetes cluster (called `external cluster`)
+for running all your HollowNodes and a dedicated machine for a master.
+The master machine has to be directly routable from the HollowNodes. You also
+need access to a Docker repository.
+
+Currently the scripts are written to be easily usable on GCE, but it should be
+relatively straightforward to port them to different providers or bare metal.
+
+## Common use cases and helper scripts
+
+Common workflow for Kubemark is:
+- starting a Kubemark cluster (on GCE)
+- running e2e tests on Kubemark cluster
+- monitoring test execution and debugging problems
+- turning down Kubemark cluster
+
+The descriptions below include comments that are helpful for anyone who wants
+to port Kubemark to different providers.
+
+### Starting a Kubemark cluster
+
+To start a Kubemark cluster on GCE you need to create an external kubernetes
+cluster (it can be GCE, GKE or anything else) by yourself, make sure that kubeconfig
+points to it by default, build a kubernetes release (e.g. by running
+`make quick-release`) and run `test/kubemark/start-kubemark.sh` script.
+This script will create a VM for master components, Pods for HollowNodes
+and do all the setup necessary to let them talk to each other. It will use the
+configuration stored in `cluster/kubemark/config-default.sh` - you can tweak it
+however you want, but note that some features may not be implemented yet, as
+implementation of Hollow components/mocks will probably be lagging behind ‘real’
+one. For performance tests interesting variables are `NUM_NODES` and
+`MASTER_SIZE`. After the start-kubemark script finishes, you'll have a ready
+Kubemark cluster; a kubeconfig file for talking to the Kubemark cluster is
+stored in `test/kubemark/kubeconfig.kubemark`.
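+
+In command form, the workflow above is roughly (assuming the kubeconfig for
+your external cluster is already the default):
+
+```sh
+# Build a release and start the Kubemark cluster on GCE.
+make quick-release
+test/kubemark/start-kubemark.sh
+
+# Talk to the Kubemark cluster via the generated kubeconfig.
+kubectl --kubeconfig=test/kubemark/kubeconfig.kubemark get nodes
+```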
+
+Currently we're running each HollowNode with a limit of 0.05 of a CPU core and
+~60MB of memory which, taking into account default cluster addons and fluentD
+running on an 'external' cluster, allows running ~17.5 HollowNodes per core.
+
+#### Behind the scenes details:
+
+Start-kubemark script does quite a lot of things:
+
+- Creates a master machine called hollow-cluster-master and PD for it (*uses
+gcloud, should be easy to do outside of GCE*)
+
+- Creates a firewall rule which opens port 443\* on the master machine (*uses
+gcloud, should be easy to do outside of GCE*)
+
+- Builds a Docker image for HollowNode from the current repository and pushes it
+to the Docker repository (*GCR for us, using scripts from
+`cluster/gce/util.sh` - it may get tricky outside of GCE*)
+
+- Generates certificates and kubeconfig files, writes a kubeconfig locally to
+`test/kubemark/kubeconfig.kubemark` and creates a Secret which stores the
+kubeconfig for HollowKubelet/HollowProxy use (*uses gcloud to transfer files to
+the master, should be easy to do outside of GCE*).
+
+- Creates a ReplicationController for the HollowNodes and starts them up (*will
+work exactly the same everywhere as long as MASTER_IP is populated correctly,
+but you'll need to update the Docker image address if you're not using GCR and
+the default image name*)
+
+- Waits until all HollowNodes are in the Running phase (*will work exactly the
+same everywhere*)
+
+<sub>\* Port 443 is a secured port on the master machine which is used for all
+external communication with the API server. Here *external* means all traffic
+coming from other machines, including all the Nodes, not only traffic from
+outside the cluster. Currently the local components, i.e. the ControllerManager
+and the Scheduler, talk to the API server over the insecure port 8080.</sub>
+
+### Running e2e tests on Kubemark cluster
+
+To run a standard e2e test on the Kubemark cluster created in the previous step,
+execute the `test/kubemark/run-e2e-tests.sh` script. It configures ginkgo to use
+the Kubemark cluster instead of a real one and starts an e2e test. This script
+should not need any changes to work on other cloud providers.
+
+By default (if no arguments are passed to it) the script runs a default Density
+test. If you want to run a different e2e test you just need to provide the flags
+you want passed to the `hack/ginkgo-e2e.sh` script, e.g. `--ginkgo.focus="Load"`
+to run the Load test.
+
+By default, at the end of each test the framework deletes the test namespaces
+and everything in them (e.g. events, replication controllers) on the Kubemark
+master, which takes a lot of time. In most cases this work isn't needed, for
+example if you delete the Kubemark cluster right after running
+`run-e2e-tests.sh`, or if you don't care about namespace-deletion performance
+(which mostly concerns etcd). The `--delete-namespace=false` flag skips
+namespace deletion; with it set you should see the log line `Found
+DeleteNamespace=false, skipping namespace deletion!`
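+
+For example, a sketch of running the Load test without the namespace cleanup,
+assuming `run-e2e-tests.sh` simply forwards these flags to `hack/ginkgo-e2e.sh`:
+
+```shell
+test/kubemark/run-e2e-tests.sh --ginkgo.focus="Load" --delete-namespace=false
+```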
+
+### Monitoring test execution and debugging problems
+
+Run-e2e-tests prints the same output on a Kubemark cluster as on an ordinary e2e
+cluster, but if you need to dig deeper you'll need to know how to debug
+HollowNodes and how the master machine (currently) differs from an ordinary one.
+
+If you need to debug the master machine you can do the same kinds of things you
+would do on an ordinary master. The difference between the Kubemark setup and an
+ordinary setup is that in Kubemark etcd runs as a plain Docker container and all
+master components run as normal processes; there's no Kubelet overseeing them.
+Logs are stored in exactly the same place, i.e. the `/var/logs/` directory.
+Because the binaries are not supervised by anything, they won't be restarted if
+they crash.
+
+To help with debugging from inside the cluster, the startup script puts a
+`~/configure-kubectl.sh` script on the master. It downloads the `gcloud` and
+`kubectl` tools and configures kubectl to use the unsecured master port (useful
+if there are problems with security). After the script has run you can use the
+kubectl command from the master machine to play with the cluster.
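+
+A hypothetical session on the Kubemark master, assuming the startup script has
+already placed `configure-kubectl.sh` in your home directory:
+
+```shell
+./configure-kubectl.sh
+kubectl get nodes
+kubectl get pods --all-namespaces
+```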
+
+Debugging HollowNodes is a bit trickier: if you hit a problem on one of them you
+first need to figure out which hollow-node pod corresponds to the HollowNode
+known by the master. During self-registration HollowNodes provide their cluster
+IPs as their Names, so to find a HollowNode named `10.2.4.5` you just need to
+find the Pod in the external cluster with that cluster IP. There's a helper
+script, `test/kubemark/get-real-pod-for-hollow-node.sh`, that does this for you.
+
+Once you have the Pod name you can use `kubectl logs` on the external cluster to
+get its logs, or use `kubectl describe pod` to find the external Node on which
+this particular HollowNode is running so you can ssh to it.
+
+For example, suppose you want to see the logs of the HollowKubelet on which pod
+`my-pod` is running. To do so you can execute:
+
+```
+$ kubectl --kubeconfig=kubernetes/test/kubemark/kubeconfig.kubemark describe pod my-pod
+```
+
+which outputs the pod description, including a line like:
+
+```
+Node: 1.2.3.4/1.2.3.4
+```
+
+To find the `hollow-node` pod corresponding to node `1.2.3.4` you use the
+aforementioned script:
+
+```
+$ kubernetes/test/kubemark/get-real-pod-for-hollow-node.sh 1.2.3.4
+```
+
+which will output the line:
+
+```
+hollow-node-1234
+```
+
+Now you just use an ordinary kubectl command to get the logs:
+
+```
+kubectl --namespace=kubemark logs hollow-node-1234
+```
+
+All those things should work exactly the same on all cloud providers.
+
+### Turning down Kubemark cluster
+
+On GCE you just need to execute the `test/kubemark/stop-kubemark.sh` script,
+which will delete the HollowNode ReplicationController and all the other
+resources for you. On other providers you'll need to delete everything yourself.
+
+## Some current implementation details
+
+The Kubemark master uses exactly the same binaries as ordinary Kubernetes, which
+means it will never be out of date. On the other hand, HollowNodes use the
+existing fake Kubelet (called SimpleKubelet), which mocks its runtime manager
+with `pkg/kubelet/dockertools/fake_manager.go`, where most of the logic sits.
+Because there's no easy way of mocking other managers (e.g. the VolumeManager),
+they are not supported in Kubemark (e.g. we can't yet schedule Pods that use
+volumes).
+
+As time passes more fakes will probably be plugged into HollowNodes, but it's
+crucial to keep them as simple as possible so that a large number of Hollows can
+run on a single core.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/kubemark-guide.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/local-cluster/docker.md b/contributors/devel/local-cluster/docker.md
new file mode 100644
index 00000000..78768f80
--- /dev/null
+++ b/contributors/devel/local-cluster/docker.md
@@ -0,0 +1,269 @@
+**Stop. This guide has been superseded by [Minikube](https://github.com/kubernetes/minikube) which is the recommended method of running Kubernetes on your local machine.**
+
+
+The following instructions show you how to set up a simple, single node Kubernetes cluster using Docker.
+
+Here's a diagram of what the final result will look like:
+
+![Kubernetes Single Node on Docker](k8s-singlenode-docker.png)
+
+## Prerequisites
+
+**Note: These steps have not been tested with the [Docker For Mac or Docker For Windows beta programs](https://blog.docker.com/2016/03/docker-for-mac-windows-beta/).**
+
+1. You need to have Docker version >= "1.10" installed on the machine.
+2. Enable mount propagation. Hyperkube runs in a container and has to mount volumes for other containers, for example in the case of persistent storage. The required steps depend on the init system.
+
+
+ In case of **systemd**, change MountFlags in the Docker unit file to shared.
+
+ ```shell
+ DOCKER_CONF=$(systemctl cat docker | head -1 | awk '{print $2}')
+ sed -i.bak 's/^\(MountFlags=\).*/\1shared/' $DOCKER_CONF
+ systemctl daemon-reload
+ systemctl restart docker
+ ```
+
+ **Otherwise**, manually set the mount point used by Hyperkube to be shared:
+
+ ```shell
+ mkdir -p /var/lib/kubelet
+ mount --bind /var/lib/kubelet /var/lib/kubelet
+ mount --make-shared /var/lib/kubelet
+ ```
+
+
+### Run it
+
+1. Decide which Kubernetes version to use. Set the `${K8S_VERSION}` variable to a version of Kubernetes >= "v1.2.0".
+
+
+ If you'd like to use the current **stable** version of Kubernetes, run the following:
+
+ ```sh
+ export K8S_VERSION=$(curl -sS https://storage.googleapis.com/kubernetes-release/release/stable.txt)
+ ```
+
+ and for the **latest** available version (including unstable releases):
+
+ ```sh
+ export K8S_VERSION=$(curl -sS https://storage.googleapis.com/kubernetes-release/release/latest.txt)
+ ```
+
+2. Start Hyperkube
+
+ ```shell
+ export ARCH=amd64
+ docker run -d \
+ --volume=/sys:/sys:rw \
+ --volume=/var/lib/docker/:/var/lib/docker:rw \
+ --volume=/var/lib/kubelet/:/var/lib/kubelet:rw,shared \
+ --volume=/var/run:/var/run:rw \
+ --net=host \
+ --pid=host \
+ --privileged \
+ --name=kubelet \
+ gcr.io/google_containers/hyperkube-${ARCH}:${K8S_VERSION} \
+ /hyperkube kubelet \
+ --hostname-override=127.0.0.1 \
+ --api-servers=http://localhost:8080 \
+ --config=/etc/kubernetes/manifests \
+ --cluster-dns=10.0.0.10 \
+ --cluster-domain=cluster.local \
+ --allow-privileged --v=2
+ ```
+
+ > Note that `--cluster-dns` and `--cluster-domain` are used to deploy DNS; feel free to drop them if DNS is not needed.
+
+ > If you would like to mount an external device as a volume, add `--volume=/dev:/dev` to the command above. It may, however, cause some problems described in [#18230](https://github.com/kubernetes/kubernetes/issues/18230)
+
+ > Architectures other than `amd64` are experimental and sometimes unstable, but feel free to try them out! Valid values: `arm`, `arm64` and `ppc64le`. ARM is available with Kubernetes version `v1.3.0-alpha.2` and higher. ARM 64-bit and PowerPC 64 little-endian are available with `v1.3.0-alpha.3` and higher. Track progress on multi-arch support [here](https://github.com/kubernetes/kubernetes/issues/17981)
+
+ > If you are behind a proxy, you need to pass the proxy settings to curl inside the containers so they can pull the certificates. Create a .curlrc under the /root folder (because the containers run as root) with the following line:
+
+ ```
+ proxy = <your_proxy_server>:<port>
+ ```
+
+ This actually runs the kubelet, which in turn runs a [pod](http://kubernetes.io/docs/user-guide/pods/) that contains the other master components.
+
+ ** **SECURITY WARNING** ** services exposed via Kubernetes using Hyperkube are available on the host node's public network interface / IP address. Because of this, this guide is not suitable for any host node/server that is directly internet accessible. Refer to [#21735](https://github.com/kubernetes/kubernetes/issues/21735) for additional info.
+
+### Download `kubectl`
+
+At this point you should have a running Kubernetes cluster. You can test it out
+by downloading the kubectl binary for `${K8S_VERSION}` (in this example: `{{page.version}}.0`).
+
+
+Downloads:
+
+ - `linux/amd64`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/amd64/kubectl
+ - `linux/386`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/386/kubectl
+ - `linux/arm`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/arm/kubectl
+ - `linux/arm64`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/arm64/kubectl
+ - `linux/ppc64le`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/ppc64le/kubectl
+ - `OS X/amd64`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/darwin/amd64/kubectl
+ - `OS X/386`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/darwin/386/kubectl
+ - `windows/amd64`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/windows/amd64/kubectl.exe
+ - `windows/386`: http://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/windows/386/kubectl.exe
+
+The generic download path is:
+
+```
+http://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/${GOOS}/${GOARCH}/${K8S_BINARY}
+```
+
+An example install with `linux/amd64`:
+
+```
+curl -sSL "https://storage.googleapis.com/kubernetes-release/release/{{page.version}}.0/bin/linux/amd64/kubectl" > /usr/bin/kubectl
+chmod +x /usr/bin/kubectl
+```
+
+On OS X, to make the API server accessible locally, set up an ssh tunnel.
+
+```shell
+docker-machine ssh `docker-machine active` -N -L 8080:localhost:8080
+```
+
+Setting up an ssh tunnel works for remote docker hosts as well.
+
+(Optional) Create kubernetes cluster configuration:
+
+```shell
+kubectl config set-cluster test-doc --server=http://localhost:8080
+kubectl config set-context test-doc --cluster=test-doc
+kubectl config use-context test-doc
+```
+
+### Test it out
+
+List the nodes in your cluster by running:
+
+```shell
+kubectl get nodes
+```
+
+This should print:
+
+```shell
+NAME STATUS AGE
+127.0.0.1 Ready 1h
+```
+
+### Run an application
+
+```shell
+kubectl run nginx --image=nginx --port=80
+```
+
+Now run `docker ps`; you should see nginx running. You may need to wait a few minutes for the image to be pulled.
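+
+Optionally, you can also watch the pod from the Kubernetes side until it reaches the `Running` state:
+
+```shell
+kubectl get pods --watch
+```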
+
+### Expose it as a service
+
+```shell
+kubectl expose deployment nginx --port=80
+```
+
+Run the following command to obtain the cluster-local IP of the service we just created:
+
+```shell{% raw %}
+ip=$(kubectl get svc nginx --template={{.spec.clusterIP}})
+echo $ip
+{% endraw %}```
+
+Hit the webserver with this IP:
+
+```shell{% raw %}
+
+curl $ip
+{% endraw %}```
+
+On OS X, since docker is running inside a VM, run the following command instead:
+
+```shell
+docker-machine ssh `docker-machine active` curl $ip
+```
+
+## Deploy a DNS
+
+Read [documentation for manually deploying a DNS](http://kubernetes.io/docs/getting-started-guides/docker-multinode/#deploy-dns-manually-for-v12x) for instructions.
+
+### Turning down your cluster
+
+1. Delete the nginx service and deployment:
+
+If you plan on re-creating your nginx deployment and service you will need to clean it up.
+
+```shell
+kubectl delete service,deployments nginx
+```
+
+2. Delete all the containers including the kubelet:
+
+```shell
+docker rm -f kubelet
+docker rm -f `docker ps | grep k8s | awk '{print $1}'`
+```
+
+3. Cleanup the filesystem:
+
+On OS X, first ssh into the docker VM:
+
+```shell
+docker-machine ssh `docker-machine active`
+```
+
+```shell
+grep /var/lib/kubelet /proc/mounts | awk '{print $2}' | sudo xargs -n1 umount
+sudo rm -rf /var/lib/kubelet
+```
+
+### Troubleshooting
+
+#### Node is in `NotReady` state
+
+If you see your node as `NotReady` it's possible that your OS does not have memcg enabled.
+
+1. Your kernel should support memory accounting. Ensure that the
+following configs are turned on in your linux kernel:
+
+```shell
+CONFIG_RESOURCE_COUNTERS=y
+CONFIG_MEMCG=y
+```
+
+2. Enable the memory accounting in the kernel, at boot, as command line
+parameters as follows:
+
+```shell
+GRUB_CMDLINE_LINUX="cgroup_enable=memory=1"
+```
+
+NOTE: The above is specifically for GRUB2.
+You can check the command line parameters passed to your kernel by looking at the
+output of /proc/cmdline:
+
+```shell
+$ cat /proc/cmdline
+BOOT_IMAGE=/boot/vmlinuz-3.18.4-aufs root=/dev/sda5 ro cgroup_enable=memory=1
+```
+
+## Support Level
+
+
+IaaS Provider | Config. Mgmt | OS | Networking | Conforms | Support Level
+-------------------- | ------------ | ------ | ---------- | ---------| ----------------------------
+Docker Single Node | custom | N/A | local | | Project ([@brendandburns](https://github.com/brendandburns))
+
+
+
+## Further reading
+
+Please see the [Kubernetes docs](http://kubernetes.io/docs) for more details on administering
+and using a Kubernetes cluster.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/local-cluster/docker.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/local-cluster/k8s-singlenode-docker.png b/contributors/devel/local-cluster/k8s-singlenode-docker.png
new file mode 100644
index 00000000..5ebf8126
--- /dev/null
+++ b/contributors/devel/local-cluster/k8s-singlenode-docker.png
Binary files differ
diff --git a/contributors/devel/local-cluster/local.md b/contributors/devel/local-cluster/local.md
new file mode 100644
index 00000000..60bd5a8f
--- /dev/null
+++ b/contributors/devel/local-cluster/local.md
@@ -0,0 +1,125 @@
+**Stop. This guide has been superseded by [Minikube](https://github.com/kubernetes/minikube) which is the recommended method of running Kubernetes on your local machine.**
+
+### Requirements
+
+#### Linux
+
+Not running Linux? Consider running Linux in a local virtual machine with [vagrant](https://www.vagrantup.com/), or on a cloud provider like Google Compute Engine.
+
+#### Docker
+
+You need [Docker](https://docs.docker.com/installation/#installation) version
+1.8.3 or later. Ensure the Docker daemon is running and can be contacted (try
+`docker ps`). Some of the Kubernetes components need to run as root, which
+normally works fine with Docker.
+
+#### etcd
+
+You need [etcd](https://github.com/coreos/etcd/releases) installed and available in your ``$PATH``.
+
+#### go
+
+You need [go](https://golang.org/doc/install) version 1.4 or later installed and available in your ``$PATH``.
+
+### Starting the cluster
+
+First, you need to [download Kubernetes](http://kubernetes.io/docs/getting-started-guides/binary_release/). Then open a separate tab of your terminal
+and run the following (since one needs sudo access to start/stop Kubernetes daemons, it is easier to run the entire script as root):
+
+```shell
+cd kubernetes
+hack/local-up-cluster.sh
+```
+
+This will build and start a lightweight local cluster, consisting of a master
+and a single node. Type Control-C to shut it down.
+
+You can use the cluster/kubectl.sh script to interact with the local cluster. hack/local-up-cluster.sh will
+print the commands to run to point kubectl at the local cluster.
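+
+The exact commands are printed by `hack/local-up-cluster.sh` itself; a rough
+sketch of what they look like (assuming the default insecure API port 8080) is:
+
+```shell
+cluster/kubectl.sh config set-cluster local --server=http://127.0.0.1:8080 --insecure-skip-tls-verify=true
+cluster/kubectl.sh config set-context local --cluster=local
+cluster/kubectl.sh config use-context local
+```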
+
+
+### Running a container
+
+Your cluster is running, and you want to start running containers!
+
+You can now use any of the cluster/kubectl.sh commands to interact with your local setup.
+
+```shell
+export KUBERNETES_PROVIDER=local
+cluster/kubectl.sh get pods
+cluster/kubectl.sh get services
+cluster/kubectl.sh get deployments
+cluster/kubectl.sh run my-nginx --image=nginx --replicas=2 --port=80
+
+## begin wait for provision to complete, you can monitor the docker pull by opening a new terminal
+ sudo docker images
+ ## you should see it pulling the nginx image, once the above command returns it
+ sudo docker ps
+ ## you should see your container running!
+ exit
+## end wait
+
+## create a service for nginx, which serves on port 80
+cluster/kubectl.sh expose deployment my-nginx --port=80 --name=my-nginx
+
+## introspect Kubernetes!
+cluster/kubectl.sh get pods
+cluster/kubectl.sh get services
+cluster/kubectl.sh get deployments
+
+## Test the nginx service with the IP/port from "get services" command
+curl http://10.X.X.X:80/
+```
+
+### Running a user defined pod
+
+Note the difference between a [container](http://kubernetes.io/docs/user-guide/containers/)
+and a [pod](http://kubernetes.io/docs/user-guide/pods/). Since you only asked for the former, Kubernetes will create a wrapper pod for you.
+However, you cannot view the nginx start page on localhost. To verify that nginx is running you need to run `curl` within the docker container (try `docker exec`).
+
+You can control the specifications of a pod via a user defined manifest, and reach nginx through your browser on the port specified therein:
+
+```shell
+cluster/kubectl.sh create -f test/fixtures/doc-yaml/user-guide/pod.yaml
+```
+
+Congratulations!
+
+### FAQs
+
+#### I cannot reach service IPs on the network.
+
+Some firewall software that uses iptables may not interact well with
+Kubernetes. If you have trouble around networking, try disabling any
+firewall or other iptables-using systems first. Also, you can check
+whether SELinux is blocking anything by running a command such as `journalctl --since yesterday | grep avc`.
+
+By default the IP range for service cluster IPs is 10.0.*.* - depending on your
+Docker installation, this may conflict with IPs assigned to containers. If you
+find containers running with IPs in this range, edit hack/local-up-cluster.sh
+and change the service-cluster-ip-range flag to something else.
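+
+A quick (hypothetical) way to check whether any running containers are already
+using addresses in that range:
+
+```shell
+docker ps -q | xargs docker inspect | grep '"IPAddress"' | grep '"10\.0\.'
+```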
+
+#### I changed Kubernetes code, how do I run it?
+
+```shell
+cd kubernetes
+hack/build-go.sh
+hack/local-up-cluster.sh
+```
+
+#### kubectl claims to start a container but `get pods` and `docker ps` don't show it.
+
+One or more of the Kubernetes daemons might've crashed. Tail the [logs](http://kubernetes.io/docs/admin/cluster-troubleshooting/#looking-at-logs) of each in /tmp.
+
+```shell
+$ ls /tmp/kube*.log
+$ tail -f /tmp/kube-apiserver.log
+```
+
+#### The pods fail to connect to the services by host names
+
+The local-up-cluster.sh script doesn't start a DNS service. A similar situation can be found [here](http://issue.k8s.io/6667). You can start one manually.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/local-cluster/local.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/local-cluster/vagrant.md b/contributors/devel/local-cluster/vagrant.md
new file mode 100644
index 00000000..0f0fe91c
--- /dev/null
+++ b/contributors/devel/local-cluster/vagrant.md
@@ -0,0 +1,397 @@
+Running Kubernetes with Vagrant (and VirtualBox) is an easy way to run/test/develop on your local machine (Linux, Mac OS X).
+
+### Prerequisites
+
+1. Install latest version >= 1.7.4 of [Vagrant](http://www.vagrantup.com/downloads.html)
+2. Install one of:
+ 1. The latest version of [Virtual Box](https://www.virtualbox.org/wiki/Downloads)
+ 2. [VMWare Fusion](https://www.vmware.com/products/fusion/) version 5 or greater as well as the appropriate [Vagrant VMWare Fusion provider](https://www.vagrantup.com/vmware)
+ 3. [VMWare Workstation](https://www.vmware.com/products/workstation/) version 9 or greater as well as the [Vagrant VMWare Workstation provider](https://www.vagrantup.com/vmware)
+ 4. [Parallels Desktop](https://www.parallels.com/products/desktop/) version 9 or greater as well as the [Vagrant Parallels provider](https://parallels.github.io/vagrant-parallels/)
+  5. libvirt with KVM and hardware virtualisation support enabled, plus the [vagrant-libvirt](https://github.com/pradels/vagrant-libvirt) plugin. Fedora provides an official rpm, so you can use `yum install vagrant-libvirt`
+
+### Setup
+
+Setting up a cluster is as simple as running:
+
+```sh
+export KUBERNETES_PROVIDER=vagrant
+curl -sS https://get.k8s.io | bash
+```
+
+Alternatively, you can download [Kubernetes release](https://github.com/kubernetes/kubernetes/releases) and extract the archive. To start your local cluster, open a shell and run:
+
+```sh
+cd kubernetes
+
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+The `KUBERNETES_PROVIDER` environment variable tells all of the various cluster management scripts which variant to use. If you forget to set this, the assumption is you are running on Google Compute Engine.
+
+By default, the Vagrant setup will create a single master VM (called kubernetes-master) and one node (called kubernetes-node-1). Each VM will take 1 GB, so make sure you have at least 2GB to 4GB of free memory (plus appropriate free disk space).
+
+If you'd like more than one node, set the `NUM_NODES` environment variable to the number you want:
+
+```sh
+export NUM_NODES=3
+```
+
+Vagrant will provision each machine in the cluster with all the necessary components to run Kubernetes. The initial setup can take a few minutes to complete on each machine.
+
+If you installed more than one Vagrant provider, Kubernetes will usually pick the appropriate one. However, you can override which one Kubernetes will use by setting the [`VAGRANT_DEFAULT_PROVIDER`](https://docs.vagrantup.com/v2/providers/default.html) environment variable:
+
+```sh
+export VAGRANT_DEFAULT_PROVIDER=parallels
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+By default, each VM in the cluster is running Fedora.
+
+To access the master or any node:
+
+```sh
+vagrant ssh master
+vagrant ssh node-1
+```
+
+If you are running more than one node, you can access the others by:
+
+```sh
+vagrant ssh node-2
+vagrant ssh node-3
+```
+
+Each node in the cluster installs the docker daemon and the kubelet.
+
+The master node instantiates the Kubernetes master components as pods on the machine.
+
+To view the service status and/or logs on the kubernetes-master:
+
+```console
+[vagrant@kubernetes-master ~] $ vagrant ssh master
+[vagrant@kubernetes-master ~] $ sudo su
+
+[root@kubernetes-master ~] $ systemctl status kubelet
+[root@kubernetes-master ~] $ journalctl -ru kubelet
+
+[root@kubernetes-master ~] $ systemctl status docker
+[root@kubernetes-master ~] $ journalctl -ru docker
+
+[root@kubernetes-master ~] $ tail -f /var/log/kube-apiserver.log
+[root@kubernetes-master ~] $ tail -f /var/log/kube-controller-manager.log
+[root@kubernetes-master ~] $ tail -f /var/log/kube-scheduler.log
+```
+
+To view the services on any of the nodes:
+
+```console
+[vagrant@kubernetes-master ~] $ vagrant ssh node-1
+[vagrant@kubernetes-master ~] $ sudo su
+
+[root@kubernetes-master ~] $ systemctl status kubelet
+[root@kubernetes-master ~] $ journalctl -ru kubelet
+
+[root@kubernetes-master ~] $ systemctl status docker
+[root@kubernetes-master ~] $ journalctl -ru docker
+```
+
+### Interacting with your Kubernetes cluster with Vagrant
+
+With your Kubernetes cluster up, you can manage the nodes in your cluster with the regular Vagrant commands.
+
+To push updates to new Kubernetes code after making source changes:
+
+```sh
+./cluster/kube-push.sh
+```
+
+To stop and then restart the cluster:
+
+```sh
+vagrant halt
+./cluster/kube-up.sh
+```
+
+To destroy the cluster:
+
+```sh
+vagrant destroy
+```
+
+Once your Vagrant machines are up and provisioned, the first thing to do is to check that you can use the `kubectl.sh` script.
+
+You may need to build the binaries first; you can do this with `make`
+
+```console
+$ ./cluster/kubectl.sh get nodes
+
+NAME LABELS
+10.245.1.4 <none>
+10.245.1.5 <none>
+10.245.1.3 <none>
+```
+
+### Authenticating with your master
+
+When using the vagrant provider in Kubernetes, the `cluster/kubectl.sh` script will cache your credentials in a `~/.kubernetes_vagrant_auth` file so you will not be prompted for them in the future.
+
+```sh
+cat ~/.kubernetes_vagrant_auth
+```
+
+```json
+{ "User": "vagrant",
+ "Password": "vagrant",
+ "CAFile": "/home/k8s_user/.kubernetes.vagrant.ca.crt",
+ "CertFile": "/home/k8s_user/.kubecfg.vagrant.crt",
+ "KeyFile": "/home/k8s_user/.kubecfg.vagrant.key"
+}
+```
+
+You should now be set to use the `cluster/kubectl.sh` script. For example, try listing the nodes that you have started:
+
+```sh
+./cluster/kubectl.sh get nodes
+```
+
+### Running containers
+
+Your cluster is running; you can list the nodes in it:
+
+```sh
+$ ./cluster/kubectl.sh get nodes
+
+NAME LABELS
+10.245.2.4 <none>
+10.245.2.3 <none>
+10.245.2.2 <none>
+```
+
+Now start running some containers!
+
+You can now use any of the `cluster/kube-*.sh` commands to interact with your VMs.
+Before you start any containers there will be no pods, services or replication controllers.
+
+```sh
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+
+$ ./cluster/kubectl.sh get services
+NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
+
+$ ./cluster/kubectl.sh get replicationcontrollers
+CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
+```
+
+Start a container running nginx with a replication controller and three replicas
+
+```sh
+$ ./cluster/kubectl.sh run my-nginx --image=nginx --replicas=3 --port=80
+```
+
+When listing the pods, you will see that three pods have been created and are in the Pending state:
+
+```sh
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-5kq0g 0/1 Pending 0 10s
+my-nginx-gr3hh 0/1 Pending 0 10s
+my-nginx-xql4j 0/1 Pending 0 10s
+```
+
+You need to wait for the provisioning to complete; you can monitor the nodes by doing:
+
+```sh
+$ vagrant ssh node-1 -c 'sudo docker images'
+kubernetes-node-1:
+ REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
+ <none> <none> 96864a7d2df3 26 hours ago 204.4 MB
+ google/cadvisor latest e0575e677c50 13 days ago 12.64 MB
+ kubernetes/pause latest 6c4579af347b 8 weeks ago 239.8 kB
+```
+
+Once the docker image for nginx has been downloaded, the container will start and you can list it:
+
+```sh
+$ vagrant ssh node-1 -c 'sudo docker ps'
+kubernetes-node-1:
+ CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+ dbe79bf6e25b nginx:latest "nginx" 21 seconds ago Up 19 seconds k8s--mynginx.8c5b8a3a--7813c8bd_-_3ffe_-_11e4_-_9036_-_0800279696e1.etcd--7813c8bd_-_3ffe_-_11e4_-_9036_-_0800279696e1--fcfa837f
+ fa0e29c94501 kubernetes/pause:latest "/pause" 8 minutes ago Up 8 minutes 0.0.0.0:8080->80/tcp k8s--net.a90e7ce4--7813c8bd_-_3ffe_-_11e4_-_9036_-_0800279696e1.etcd--7813c8bd_-_3ffe_-_11e4_-_9036_-_0800279696e1--baf5b21b
+ aa2ee3ed844a google/cadvisor:latest "/usr/bin/cadvisor" 38 minutes ago Up 38 minutes k8s--cadvisor.9e90d182--cadvisor_-_agent.file--4626b3a2
+ 65a3a926f357 kubernetes/pause:latest "/pause" 39 minutes ago Up 39 minutes 0.0.0.0:4194->8080/tcp k8s--net.c5ba7f0e--cadvisor_-_agent.file--342fd561
+```
+
+Going back to listing the pods, services and replicationcontrollers, you now have:
+
+```sh
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-5kq0g 1/1 Running 0 1m
+my-nginx-gr3hh 1/1 Running 0 1m
+my-nginx-xql4j 1/1 Running 0 1m
+
+$ ./cluster/kubectl.sh get services
+NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
+
+$ ./cluster/kubectl.sh get replicationcontrollers
+CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS AGE
+my-nginx my-nginx nginx run=my-nginx 3 1m
+```
+
+We did not start any services, hence there are none listed. But we see three replicas displayed properly.
+
+See [running your first containers](http://kubernetes.io/docs/user-guide/simple-nginx/) to learn how to create a service.
+
+You can already play with scaling the replicas with:
+
+```sh
+$ ./cluster/kubectl.sh scale rc my-nginx --replicas=2
+$ ./cluster/kubectl.sh get pods
+NAME READY STATUS RESTARTS AGE
+my-nginx-5kq0g 1/1 Running 0 2m
+my-nginx-gr3hh 1/1 Running 0 2m
+```
+
+Congratulations!
+
+## Troubleshooting
+
+#### I keep downloading the same (large) box all the time!
+
+By default the Vagrantfile will download the box from S3. You can change this (and cache the box locally) by providing a name and an alternate URL when calling `kube-up.sh`
+
+```sh
+export KUBERNETES_BOX_NAME=choose_your_own_name_for_your_kuber_box
+export KUBERNETES_BOX_URL=path_of_your_kuber_box
+export KUBERNETES_PROVIDER=vagrant
+./cluster/kube-up.sh
+```
+
+#### I am getting timeouts when trying to curl the master from my host!
+
+During provision of the cluster, you may see the following message:
+
+```sh
+Validating node-1
+.............
+Waiting for each node to be registered with cloud provider
+error: couldn't read version from server: Get https://10.245.1.2/api: dial tcp 10.245.1.2:443: i/o timeout
+```
+
+Some users have reported that VPNs may prevent traffic from being routed from the host machine into the virtual machine network.
+
+To debug, first verify that the master is binding to the proper IP address:
+
+```sh
+$ vagrant ssh master
+$ ifconfig | grep eth1 -C 2
+eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.245.1.2 netmask
+ 255.255.255.0 broadcast 10.245.1.255
+```
+
+Then verify that your host machine has a network connection to a bridge that can serve that address:
+
+```sh
+$ ifconfig | grep 10.245.1 -C 2
+
+vboxnet5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
+ inet 10.245.1.1 netmask 255.255.255.0 broadcast 10.245.1.255
+ inet6 fe80::800:27ff:fe00:5 prefixlen 64 scopeid 0x20<link>
+ ether 0a:00:27:00:00:05 txqueuelen 1000 (Ethernet)
+```
+
+If you do not see a response on your host machine, you will most likely need to connect your host to the virtual network created by the virtualization provider.
+
+If you do see a network, but are still unable to ping the machine, check if your VPN is blocking the request.
+
+#### I just created the cluster, but I am getting authorization errors!
+
+You probably have an incorrect ~/.kubernetes_vagrant_auth file for the cluster you are attempting to contact.
+
+```sh
+rm ~/.kubernetes_vagrant_auth
+```
+
+After using kubectl.sh make sure that the correct credentials are set:
+
+```sh
+cat ~/.kubernetes_vagrant_auth
+```
+
+```json
+{
+ "User": "vagrant",
+ "Password": "vagrant"
+}
+```
+
+#### I just created the cluster, but I do not see my container running!
+
+If this is your first time creating the cluster, the kubelet on each node schedules a number of docker pull requests to fetch prerequisite images. This can take some time and as a result may delay your initial pod getting provisioned.
+
+#### I have brought Vagrant up but the nodes cannot validate!
+
+Log on to one of the nodes (`vagrant ssh node-1`) and inspect the salt minion log (`sudo cat /var/log/salt/minion`).
+
+#### I want to change the number of nodes!
+
+You can control the number of nodes that are instantiated via the environment variable `NUM_NODES` on your host machine. If you plan to work with replicas, we strongly encourage you to work with enough nodes to satisfy your largest intended replica size. If you do not plan to work with replicas, you can save some system resources by running with a single node. You do this by setting `NUM_NODES` to 1, like so:
+
+```sh
+export NUM_NODES=1
+```
+
+#### I want my VMs to have more memory!
+
+You can control the memory allotted to virtual machines with the `KUBERNETES_MEMORY` environment variable.
+Just set it to the number of megabytes you would like the machines to have. For example:
+
+```sh
+export KUBERNETES_MEMORY=2048
+```
+
+If you need more granular control, you can set the amount of memory for the master and nodes independently. For example:
+
+```sh
+export KUBERNETES_MASTER_MEMORY=1536
+export KUBERNETES_NODE_MEMORY=2048
+```
+
+#### I want to set proxy settings for my Kubernetes cluster boot strapping!
+
+If you are behind a proxy, you need to install vagrant proxy plugin and set the proxy settings by
+
+```sh
+vagrant plugin install vagrant-proxyconf
+export VAGRANT_HTTP_PROXY=http://username:password@proxyaddr:proxyport
+export VAGRANT_HTTPS_PROXY=https://username:password@proxyaddr:proxyport
+```
+
+Optionally you can specify addresses to not proxy, for example
+
+```sh
+export VAGRANT_NO_PROXY=127.0.0.1
+```
+
+If you are using sudo to build Kubernetes (for example `make quick-release`), you need to run `sudo -E make quick-release` to pass the environment variables through.
+
+#### I ran vagrant suspend and nothing works!
+
+`vagrant suspend` seems to mess up the network. This is not supported at this time.
+
+#### I want vagrant to sync folders via nfs!
+
+You can ensure that vagrant uses nfs to sync folders with virtual machines by setting the KUBERNETES_VAGRANT_USE_NFS environment variable to 'true'. NFS is faster than virtualbox or vmware's 'shared folders' and does not require guest additions. See the [vagrant docs](http://docs.vagrantup.com/v2/synced-folders/nfs.html) for details on configuring nfs on the host. This setting will have no effect on the libvirt provider, which uses nfs by default. For example:
+
+```sh
+export KUBERNETES_VAGRANT_USE_NFS=true
+```
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/local-cluster/vagrant.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/logging.md b/contributors/devel/logging.md
new file mode 100644
index 00000000..1241ee7f
--- /dev/null
+++ b/contributors/devel/logging.md
@@ -0,0 +1,36 @@
+## Logging Conventions
+
+The following are conventions for which glog levels to use.
+[glog](http://godoc.org/github.com/golang/glog) is globally preferred to
+[log](http://golang.org/pkg/log/) for better runtime control.
+
+* glog.Errorf() - Always an error
+
+* glog.Warningf() - Something unexpected, but probably not an error
+
+* glog.Infof() has multiple levels:
+ * glog.V(0) - Generally useful for this to ALWAYS be visible to an operator
+ * Programmer errors
+ * Logging extra info about a panic
+ * CLI argument handling
+ * glog.V(1) - A reasonable default log level if you don't want verbosity.
+ * Information about config (listening on X, watching Y)
+ * Errors that repeat frequently that relate to conditions that can be corrected (pod detected as unhealthy)
+ * glog.V(2) - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
+ * Logging HTTP requests and their exit code
+ * System state changing (killing pod)
+ * Controller state change events (starting pods)
+ * Scheduler log messages
+ * glog.V(3) - Extended information about changes
+ * More info about system state changes
+ * glog.V(4) - Debug level verbosity (for now)
+ * Logging in particularly thorny parts of code where you may want to come back later and check it
+
+As per the comments, the practical default level is V(2). Developers and QE
+environments may wish to run at V(3) or V(4). If you wish to change the log
+level, you can pass in `-v=X` where X is the desired maximum level to log.
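+
+For a quick client-side illustration, kubectl accepts the same flag; a higher
+level (e.g. `--v=6`) surfaces the API requests it issues:
+
+```shell
+kubectl get pods --v=6
+```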
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/logging.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/mesos-style.md b/contributors/devel/mesos-style.md
new file mode 100644
index 00000000..81554ce8
--- /dev/null
+++ b/contributors/devel/mesos-style.md
@@ -0,0 +1,218 @@
+# Building Mesos/Omega-style frameworks on Kubernetes
+
+## Introduction
+
+We have observed two different cluster management architectures, which can be
+categorized as "Borg-style" and "Mesos/Omega-style." In the remainder of this
+document, we will abbreviate the latter as "Mesos-style." Although Kubernetes
+uses a Borg-style architecture out of the box, it can also be configured in a
+Mesos-style architecture, and in fact can support both styles at the same time.
+This document describes the two approaches and describes how to deploy a
+Mesos-style architecture on Kubernetes.
+
+As an aside, the converse is also true: one can deploy a Borg/Kubernetes-style
+architecture on Mesos.
+
+This document is NOT intended to provide a comprehensive comparison of Borg and
+Mesos. For example, we omit discussion of the tradeoffs between scheduling with
+full knowledge of cluster state vs. scheduling using the "offer" model. That
+issue is discussed in some detail in the Omega paper.
+(See [references](#references) below.)
+
+
+## What is a Borg-style architecture?
+
+A Borg-style architecture is characterized by:
+
+* a single logical API endpoint for clients, where some amount of processing is
+done on requests, such as admission control and applying defaults
+
+* generic (non-application-specific) collection abstractions described
+declaratively,
+
+* generic controllers/state machines that manage the lifecycle of the collection
+abstractions and the containers spawned from them
+
+* a generic scheduler
+
+For example, Borg's primary collection abstraction is a Job, and every
+application that runs on Borg--whether it's a user-facing service like the GMail
+front-end, a batch job like a MapReduce, or an infrastructure service like
+GFS--must represent itself as a Job. Borg has corresponding state machine logic
+for managing Jobs and their instances, and a scheduler that's responsible for
+assigning the instances to machines.
+
+The flow of a request in Borg is:
+
+1. Client submits a collection object to the Borgmaster API endpoint
+
+1. Admission control, quota, applying defaults, etc. run on the collection
+
+1. If the collection is admitted, it is persisted, and the collection state
+machine creates the underlying instances
+
+1. The scheduler assigns a hostname to the instance, and tells the Borglet to
+start the instance's container(s)
+
+1. Borglet starts the container(s)
+
+1. The instance state machine manages the instances and the collection state
+machine manages the collection during their lifetimes
+
+Out-of-the-box Kubernetes has *workload-specific* abstractions (ReplicaSet, Job,
+DaemonSet, etc.) and corresponding controllers, and in the future may have
+[workload-specific schedulers](../../docs/proposals/multiple-schedulers.md),
+e.g. different schedulers for long-running services vs. short-running batch. But
+these abstractions, controllers, and schedulers are not *application-specific*.
+
+The usual request flow in Kubernetes is very similar, namely
+
+1. Client submits a collection object (e.g. ReplicaSet, Job, ...) to the API
+server
+
+1. Admission control, quota, applying defaults, etc. run on the collection
+
+1. If the collection is admitted, it is persisted, and the corresponding
+collection controller creates the underlying pods
+
+1. Admission control, quota, applying defaults, etc. runs on each pod; if there
+are multiple schedulers, one of the admission controllers will write the
+scheduler name as an annotation based on a policy
+
+1. If a pod is admitted, it is persisted
+
+1. The appropriate scheduler assigns a nodeName to the instance, which triggers
+the Kubelet to start the pod's container(s)
+
+1. Kubelet starts the container(s)
+
+1. The controller corresponding to the collection manages the pod and the
+collection during their lifetime
+
+In the Borg model, application-level scheduling and cluster-level scheduling are
+handled by separate components. For example, a MapReduce master might request
+Borg to create a job with a certain number of instances with a particular
+resource shape, where each instance corresponds to a MapReduce worker; the
+MapReduce master would then schedule individual units of work onto those
+workers.
+
+## What is a Mesos-style architecture?
+
+Mesos is fundamentally designed to support multiple application-specific
+"frameworks." A framework is composed of a "framework scheduler" and a
+"framework executor." We will abbreviate "framework scheduler" as "framework"
+since "scheduler" means something very different in Kubernetes (something that
+just assigns pods to nodes).
+
+Unlike Borg and Kubernetes, where there is a single logical endpoint that
+receives all API requests (the Borgmaster and API server, respectively), in
+Mesos every framework is a separate API endpoint. Mesos does not have any
+standard set of collection abstractions, controllers/state machines, or
+schedulers; the logic for all of these things is contained in each
+[application-specific framework](http://mesos.apache.org/documentation/latest/frameworks/)
+individually. (Note that the notion of application-specific does sometimes blur
+into the realm of workload-specific, for example
+[Chronos](https://github.com/mesos/chronos) is a generic framework for batch
+jobs. However, regardless of what set of Mesos frameworks you are using, the key
+properties remain: each framework is its own API endpoint with its own
+client-facing and internal abstractions, state machines, and scheduler).
+
+A Mesos framework can integrate application-level scheduling and cluster-level
+scheduling into a single component.
+
+Note: Although Mesos frameworks expose their own API endpoints to clients, they
+consume a common infrastructure via a common API endpoint for controlling tasks
+(launching, detecting failure, etc.) and learning about available cluster
+resources. More details
+[here](http://mesos.apache.org/documentation/latest/scheduler-http-api/).
+
+## Building a Mesos-style framework on Kubernetes
+
+Implementing the Mesos model on Kubernetes boils down to enabling
+application-specific collection abstractions, controllers/state machines, and
+scheduling. There are just three steps:
+
+* Use API plugins to create API resources for your new application-specific
+collection abstraction(s)
+
+* Implement controllers for the new abstractions (and for managing the lifecycle
+of the pods the controllers generate)
+
+* Implement a scheduler with the application-specific scheduling logic
+
+Note that the last two can be combined: a Kubernetes controller can do the
+scheduling for the pods it creates, by writing node name to the pods when it
+creates them.
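+
+As a minimal illustration of that last point (the node name here is
+hypothetical), a pod created with `spec.nodeName` already set bypasses the
+scheduler entirely; the kubelet on that node simply starts it:
+
+```shell
+cat <<EOF | kubectl create -f -
+apiVersion: v1
+kind: Pod
+metadata:
+  name: prescheduled-example
+spec:
+  nodeName: node-1   # the "scheduling" decision is made by whoever creates the pod
+  containers:
+  - name: nginx
+    image: nginx
+EOF
+```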
+
+Once you've done this, you end up with an architecture that is extremely similar
+to the Mesos-style--the Kubernetes controller is effectively a Mesos framework.
+The remaining differences are:
+
+* In Kubernetes, all API operations go through a single logical endpoint, the
+API server (we say logical because the API server can be replicated). In
+contrast, in Mesos, API operations go to a particular framework. However, the
+Kubernetes API plugin model makes this difference fairly small.
+
+* In Kubernetes, application-specific admission control, quota, defaulting, etc.
+rules can be implemented in the API server rather than in the controller. Of
+course you can choose to make these operations be no-ops for your
+application-specific collection abstractions, and handle them in your controller.
+
+* On the node level, Mesos allows application-specific executors, whereas
+Kubernetes only has executors for Docker and rkt containers.
+
+The end-to-end flow is:
+
+1. Client submits an application-specific collection object to the API server
+
+2. The API server plugin for that collection object forwards the request to the
+API server that handles that collection type
+
+3. Admission control, quota, applying defaults, etc. runs on the collection
+object
+
+4. If the collection is admitted, it is persisted
+
+5. The collection controller sees the collection object and in response creates
+the underlying pods and chooses which nodes they will run on by setting node
+name
+
+6. Kubelet sees the pods with node name set and starts the container(s)
+
+7. The collection controller manages the pods and the collection during their
+lifetimes
+
+*Note: if the controller and scheduler are separated, then step 5 breaks
+down into multiple steps:*
+
+(5a) collection controller creates pods with empty node name.
+
+(5b) API server admission control, quota, defaulting, etc. runs on the
+pods; one of the admission controller steps writes the scheduler name as an
+annotation on each pod (see pull request `#18262` for more details).
+
+(5c) The corresponding application-specific scheduler chooses a node and
+writes node name, which triggers the Kubelet to start the pod's container(s).
+
+As a final note, the Kubernetes model allows multiple levels of iterative
+refinement of runtime abstractions, as long as the lowest level is the pod. For
+example, clients of application Foo might create a `FooSet` which is picked up
+by the FooController which in turn creates `BatchFooSet` and `ServiceFooSet`
+objects, which are picked up by the BatchFoo controller and ServiceFoo
+controller respectively, which in turn create pods. In between each of these
+steps there is an opportunity for object-specific admission control, quota, and
+defaulting to run in the API server, though these can instead be handled by the
+controllers.
+
+## References
+
+Mesos is described [here](https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf).
+Omega is described [here](http://research.google.com/pubs/pub41684.html).
+Borg is described [here](http://research.google.com/pubs/pub43438.html).
+
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/mesos-style.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/node-performance-testing.md b/contributors/devel/node-performance-testing.md
new file mode 100644
index 00000000..d6bb657f
--- /dev/null
+++ b/contributors/devel/node-performance-testing.md
@@ -0,0 +1,127 @@
+# Measuring Node Performance
+
+This document outlines the issues and pitfalls of measuring Node performance, as
+well as the tools available.
+
+## Cluster Set-up
+
+There are lots of factors which can affect node performance numbers, so care
+must be taken in setting up the cluster to make the intended measurements. In
+addition to taking the following steps into consideration, it is important to
+document precisely which setup was used. For example, performance can vary
+wildly from commit-to-commit, so it is very important to **document which commit
+or version** of Kubernetes was used, which Docker version was used, etc.
+
+### Addon pods
+
+Be aware of which addon pods are running on which nodes. By default Kubernetes
+runs 8 addon pods, plus another 2 per node (`fluentd-elasticsearch` and
+`kube-proxy`) in the `kube-system` namespace. The addon pods can be disabled for
+more consistent results, but doing so can also have performance implications.
+
+For example, Heapster polls each node regularly to collect stats data. Disabling
+Heapster will hide the performance cost of serving those stats in the Kubelet.
+
+#### Disabling Add-ons
+
+Disabling addons is simple. Just ssh into the Kubernetes master and move the
+addon from `/etc/kubernetes/addons/` to a backup location. More details
+[here](../../cluster/addons/).
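+
+A hypothetical example (the addon directory names vary between cluster setups):
+
+```shell
+# on the master: move an addon's manifests out of the watched directory to disable it
+sudo mkdir -p /tmp/addons-backup
+sudo mv /etc/kubernetes/addons/cluster-monitoring /tmp/addons-backup/
+```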
+
+### Which / how many pods?
+
+Performance will vary a lot between a node with 0 pods and a node with 100 pods.
+In many cases you'll want to take measurements with several different numbers of
+pods. On a single-node cluster, scaling a replication controller makes this
+easy; just make sure the system reaches a steady state before starting the
+measurement, e.g. `kubectl scale replicationcontroller pause --replicas=100`.
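+
+A hypothetical way to set that up (the pause image name and tag are assumptions
+and may differ in your cluster):
+
+```shell
+# create a replication controller of pause pods, then scale it to the target size
+cat <<EOF | kubectl create -f -
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: pause
+spec:
+  replicas: 1
+  template:
+    metadata:
+      labels:
+        app: pause
+    spec:
+      containers:
+      - name: pause
+        image: gcr.io/google_containers/pause:2.0
+EOF
+kubectl scale replicationcontroller pause --replicas=100
+kubectl get pods | grep -c Running   # repeat until this settles at the target count
+```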
+
+In most cases pause pods will yield the most consistent measurements since the
+system will not be affected by pod load. However, in some special cases
+Kubernetes has been tuned to optimize pods that are not doing anything, such as
+the cAdvisor housekeeping (stats gathering). In these cases, performing a very
+light task (such as a simple network ping) can make a difference.
+
+Finally, you should also consider which features your pods should be using. For
+example, if you want to measure performance with probing, you should obviously
+use pods with liveness or readiness probes configured. Likewise for volumes,
+number of containers, etc.
+
+### Other Tips
+
+**Number of nodes** - On the one hand, it can be easier to manage logs, pods,
+environment etc. with a single node to worry about. On the other hand, having
+multiple nodes will let you gather more data in parallel for more robust
+sampling.
+
+## E2E Performance Test
+
+There is an end-to-end test for collecting overall resource usage of node
+components: [kubelet_perf.go](../../test/e2e/kubelet_perf.go). To
+run the test, simply make sure you have an e2e cluster running (`go run
+hack/e2e.go -up`) and [set up](#cluster-set-up) correctly.
+
+Run the test with `go run hack/e2e.go -v -test
+--test_args="--ginkgo.focus=resource\susage\stracking"`. You may also wish to
+customise the number of pods or other parameters of the test (remember to rerun
+`make WHAT=test/e2e/e2e.test` after you do).
+
+## Profiling
+
+Kubelet installs the [go pprof handlers]
+(https://golang.org/pkg/net/http/pprof/), which can be queried for CPU profiles:
+
+```console
+$ kubectl proxy &
+Starting to serve on 127.0.0.1:8001
+$ curl -G "http://localhost:8001/api/v1/proxy/nodes/${NODE}:10250/debug/pprof/profile?seconds=${DURATION_SECONDS}" > $OUTPUT
+$ KUBELET_BIN=_output/dockerized/bin/linux/amd64/kubelet
+$ go tool pprof -web $KUBELET_BIN $OUTPUT
+```
+
+`pprof` can also provide heap usage, from the `/debug/pprof/heap` endpoint
+(e.g. `http://localhost:8001/api/v1/proxy/nodes/${NODE}:10250/debug/pprof/heap`).
+
+More information on go profiling can be found
+[here](http://blog.golang.org/profiling-go-programs).
+
+## Benchmarks
+
+Before jumping through all the hoops to measure a live Kubernetes node in a real
+cluster, it is worth considering whether the data you need can be gathered
+through a Benchmark test. Go provides a really simple benchmarking mechanism,
+just add a unit test of the form:
+
+```go
+// In foo_test.go
+func BenchmarkFoo(b *testing.B) {
+ b.StopTimer()
+ setupFoo() // Perform any global setup
+ b.StartTimer()
+ for i := 0; i < b.N; i++ {
+ foo() // Functionality to measure
+ }
+}
+```
+
+Then:
+
+```console
+$ go test -bench=. -benchtime=${SECONDS}s foo_test.go
+```
+
+More details on benchmarking [here](https://golang.org/pkg/testing/).
+
+## TODO
+
+- (taotao) Measuring docker performance
+- Expand cluster set-up section
+- (vishh) Measuring disk usage
+- (yujuhong) Measuring memory usage
+- Add section on monitoring kubelet metrics (e.g. with prometheus)
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/node-performance-testing.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/on-call-build-cop.md b/contributors/devel/on-call-build-cop.md
new file mode 100644
index 00000000..15c71e5d
--- /dev/null
+++ b/contributors/devel/on-call-build-cop.md
@@ -0,0 +1,151 @@
+## Kubernetes "Github and Build-cop" Rotation
+
+### Prerequisites
+
+* Ensure you have [write access to http://github.com/kubernetes/kubernetes](https://github.com/orgs/kubernetes/teams/kubernetes-maintainers)
+ * Test your admin access by e.g. adding a label to an issue.
+
+### Traffic sources and responsibilities
+
+* GitHub Kubernetes [issues](https://github.com/kubernetes/kubernetes/issues)
+and [pulls](https://github.com/kubernetes/kubernetes/pulls): Your job is to be
+the first responder to all new issues and PRs. If you are not equipped to do
+this (which is fine!), it is your job to seek guidance!
+
+ * Support issues should be closed and redirected to Stackoverflow (see example
+response below).
+
+ * All incoming issues should be tagged with a team label
+(team/{api,ux,control-plane,node,cluster,csi,redhat,mesosphere,gke,release-infra,test-infra,none});
+for issues that overlap teams, you can use multiple team labels
+
+ * There is a related concept of "Github teams" which allow you to @ mention
+a set of people; feel free to @ mention a Github team if you wish, but this is
+not a substitute for adding a team/* label, which is required
+
+ * [Google teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=goog-)
+ * [Redhat teams](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=rh-)
+ * [SIGs](https://github.com/orgs/kubernetes/teams?utf8=%E2%9C%93&query=sig-)
+
+ * If the issue is reporting broken builds, broken e2e tests, or other
+obvious P0 issues, label the issue with priority/P0 and assign it to someone.
+This is the only situation in which you should add a priority/* label
+ * non-P0 issues do not need a reviewer assigned initially
+
+ * Assign any issues related to Vagrant to @derekwaynecarr (and @mention him
+in the issue)
+
+ * All incoming PRs should be assigned a reviewer.
+
+ * unless it is a WIP (Work in Progress), RFC (Request for Comments), or design proposal.
+ * An auto-assigner [should do this for you] (https://github.com/kubernetes/kubernetes/pull/12365/files)
+ * When in doubt, choose a TL or team maintainer of the most relevant team; they can delegate
+
+ * Keep in mind that you can @ mention people in an issue/PR to bring it to
+their attention without assigning it to them. You can also @ mention github
+teams, such as @kubernetes/goog-ux or @kubernetes/kubectl
+
+ * If you need help triaging an issue or PR, consult with (or assign it to)
+@brendandburns, @thockin, @bgrant0607, @quinton-hoole, @davidopp, @dchen1107,
+@lavalamp (all U.S. Pacific Time) or @fgrzadkowski (Central European Time).
+
+ * At the beginning of your shift, please add team/* labels to any issues that
+have fallen through the cracks and don't have one. Likewise, be fair to the next
+person in rotation: try to ensure that every issue that gets filed while you are
+on duty is handled. The Github query to find issues with no team/* label is:
+[here](https://github.com/kubernetes/kubernetes/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+-label%3Ateam%2Fcontrol-plane+-label%3Ateam%2Fmesosphere+-label%3Ateam%2Fredhat+-label%3Ateam%2Frelease-infra+-label%3Ateam%2Fnone+-label%3Ateam%2Fnode+-label%3Ateam%2Fcluster+-label%3Ateam%2Fux+-label%3Ateam%2Fapi+-label%3Ateam%2Ftest-infra+-label%3Ateam%2Fgke+-label%3A"team%2FCSI-API+Machinery+SIG"+-label%3Ateam%2Fhuawei+-label%3Ateam%2Fsig-aws).
+
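+If you prefer to script this part of triage, labels can also be added through
+the GitHub issues API. A minimal sketch (the issue number, label, and
+`$GITHUB_TOKEN` are illustrative placeholders, not values from this document):
+
+```sh
+# Illustrative only: add a team label to issue 12345 using a personal access token.
+curl -s -X POST \
+     -H "Authorization: token $GITHUB_TOKEN" \
+     -d '["team/node"]' \
+     "https://api.github.com/repos/kubernetes/kubernetes/issues/12345/labels"
+```
+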
+Example response for support issues:
+
+```code
+Please re-post your question to [stackoverflow]
+(http://stackoverflow.com/questions/tagged/kubernetes).
+
+We are trying to consolidate the channels to which questions for help/support
+are posted so that we can improve our efficiency in responding to your requests,
+and to make it easier for you to find answers to frequently asked questions and
+how to address common use cases.
+
+We regularly see messages posted in multiple forums, with the full response
+thread only in one place or, worse, spread across multiple forums. Also, the
+large volume of support issues on github is making it difficult for us to use
+issues to identify real bugs.
+
+The Kubernetes team scans stackoverflow on a regular basis, and will try to
+ensure your questions don't go unanswered.
+
+Before posting a new question, please search stackoverflow for answers to
+similar questions, and also familiarize yourself with:
+
+ * [user guide](http://kubernetes.io/docs/user-guide/)
+ * [troubleshooting guide](http://kubernetes.io/docs/admin/cluster-troubleshooting/)
+
+Again, thanks for using Kubernetes.
+
+The Kubernetes Team
+```
+
+### Build-copping
+
+* The [merge-bot submit queue](http://submit-queue.k8s.io/)
+([source](https://github.com/kubernetes/contrib/tree/master/mungegithub/mungers/submit-queue.go))
+should auto-merge all eligible PRs for you once they've passed all the relevant
+checks mentioned below and all
+[critical e2e tests](https://goto.google.com/k8s-test/view/Critical%20Builds/)
+are passing. If the merge-bot has been disabled for some reason, or tests are
+failing, you might need to do some manual merging to get things back on track.
+
+* Once a day or so, look at the
+[flaky test builds](https://goto.google.com/k8s-test/view/Flaky/); if they are timing out, clusters
+are failing to start, or tests are consistently failing (instead of just
+flaking), file an issue to get things back on track.
+
+* Jobs that are not in [critical e2e tests](https://goto.google.com/k8s-test/view/Critical%20Builds/)
+or [flaky test builds](https://goto.google.com/k8s-test/view/Flaky/) are not
+your responsibility to monitor. The `Test owner:` in the job description will be
+automatically emailed if the job is failing.
+
+* If you are oncall, ensure that PRs conforming to the following
+prerequisites are being merged at a reasonable rate:
+
+ * [Have been LGTMd](https://github.com/kubernetes/kubernetes/labels/lgtm)
+ * Pass Travis and Jenkins per-PR tests.
+ * Author has signed CLA if applicable.
+
+
+* Although the shift schedule shows you as being scheduled Monday to Monday,
+ working on the weekend is neither expected nor encouraged. Enjoy your time
+ off.
+
+* When the build is broken, roll back the responsible PRs ASAP.
+
+* When E2E tests are unstable, a "merge freeze" may be instituted. During a
+merge freeze:
+
+ * Oncall should slowly merge LGTMd changes throughout the day while monitoring
+E2E to ensure stability.
+
+ * Ideally the E2E run should be green, but some tests are flaky and can fail
+randomly (not as a result of a particular change).
+ * If a large number of tests fail, or tests that normally pass fail, that
+is an indication that one or more of the PR(s) in that build might be
+problematic (and should be reverted).
+ * Use the Test Results Analyzer to see individual test history over time.
+
+
+* Flake mitigation
+
+ * Tests that flake (fail a small percentage of the time) need an issue filed
+against them. Please read [this](flaky-tests.md#filing-issues-for-flaky-tests);
+the build cop is expected to file issues for any flaky tests they encounter.
+
+ * It's reasonable to manually merge PRs that fix a flake or otherwise mitigate it.
+
+### Contact information
+
+[@k8s-oncall](https://github.com/k8s-oncall) will reach the current person on
+call.
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-build-cop.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/on-call-rotations.md b/contributors/devel/on-call-rotations.md
new file mode 100644
index 00000000..a6535e82
--- /dev/null
+++ b/contributors/devel/on-call-rotations.md
@@ -0,0 +1,43 @@
+## Kubernetes On-Call Rotations
+
+### Kubernetes "first responder" rotations
+
+Kubernetes has generated a lot of public traffic: email, pull-requests, bugs,
+etc. So much traffic that it's becoming impossible to keep up with it all! This
+is a fantastic problem to have. In order to be sure that SOMEONE, but not
+EVERYONE on the team is paying attention to public traffic, we have instituted
+two "first responder" rotations, listed below. Please read this page before
+proceeding to the pages linked below, which are specific to each rotation.
+
+Please also read our [notes on OSS collaboration](collab.md), particularly the
+bits about hours. Specifically, each rotation is expected to be active primarily
+during work hours, less so off hours.
+
+During regular workday work hours of your shift, your primary responsibility is
+to monitor the traffic sources specific to your rotation. You can check traffic
+in the evenings if you feel so inclined, but it is not expected to be as highly
+focused as work hours. For weekends, you should check traffic very occasionally
+(e.g. once or twice a day). Again, it is not expected to be as highly focused as
+workdays. It is assumed that over time, everyone will get weekday and weekend
+shifts, so the workload will balance out.
+
+If you cannot serve your shift, and you know this ahead of time, it is your
+responsibility to find someone to cover it and to change the rotation. If you have
+an emergency, your responsibilities fall on the primary of the other rotation,
+who acts as your secondary. If you need help covering all of the tasks, reach out
+to partners with on-call rotations (e.g.,
+[Redhat](https://github.com/orgs/kubernetes/teams/rh-oncall)).
+
+If you are not on duty you DO NOT need to do these things. You are free to focus
+on "real work".
+
+Note that Kubernetes will occasionally enter code slush/freeze, prior to
+milestones. When it does, there might be changes in the instructions (assigning
+milestones, for instance).
+
+* [Github and Build Cop Rotation](on-call-build-cop.md)
+* [User Support Rotation](on-call-user-support.md)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-rotations.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/on-call-user-support.md b/contributors/devel/on-call-user-support.md
new file mode 100644
index 00000000..a111c6fe
--- /dev/null
+++ b/contributors/devel/on-call-user-support.md
@@ -0,0 +1,89 @@
+## Kubernetes "User Support" Rotation
+
+### Traffic sources and responsibilities
+
+* [StackOverflow](http://stackoverflow.com/questions/tagged/kubernetes) and
+[ServerFault](http://serverfault.com/questions/tagged/google-kubernetes):
+Respond to any thread that has no responses and is more than 6 hours old (over
+time we will lengthen this timeout to allow community responses). If you are not
+equipped to respond, it is your job to redirect to someone who can.
+
+ * [Query for unanswered Kubernetes StackOverflow questions](http://stackoverflow.com/search?q=%5Bkubernetes%5D+answers%3A0)
+ * [Query for unanswered Kubernetes ServerFault questions](http://serverfault.com/questions/tagged/google-kubernetes?sort=unanswered&pageSize=15)
+ * Direct poorly formulated questions to [stackoverflow's tips about how to ask](http://stackoverflow.com/help/how-to-ask)
+ * Direct off-topic questions to [stackoverflow's policy](http://stackoverflow.com/help/on-topic)
+
+* [Slack](https://kubernetes.slack.com) ([registration](http://slack.k8s.io)):
+Your job is to be on Slack, watching for questions and answering or redirecting
+as needed. Also check out the [Slack Archive](http://kubernetes.slackarchive.io/).
+
+* [Email/Groups](https://groups.google.com/forum/#!forum/google-containers):
+Respond to any thread that has no responses and is more than 6 hours old (over
+time we will lengthen this timeout to allow community responses). If you are not
+equipped to respond, it is your job to redirect to someone who can.
+
+* [Legacy] [IRC](irc://irc.freenode.net/#google-containers)
+(irc.freenode.net #google-containers): watch IRC for questions and try to
+redirect users to Slack. Also check out the
+[IRC logs](https://botbot.me/freenode/google-containers/).
+
+In general, try to direct support questions to:
+
+1. Documentation, such as the [user guide](../user-guide/README.md) and
+[troubleshooting guide](http://kubernetes.io/docs/troubleshooting/)
+
+2. Stackoverflow
+
+If you see questions on a forum other than Stackoverflow, try to redirect them
+to Stackoverflow. Example response:
+
+```code
+Please re-post your question to [stackoverflow]
+(http://stackoverflow.com/questions/tagged/kubernetes).
+
+We are trying to consolidate the channels to which questions for help/support
+are posted so that we can improve our efficiency in responding to your requests,
+and to make it easier for you to find answers to frequently asked questions and
+how to address common use cases.
+
+We regularly see messages posted in multiple forums, with the full response
+thread only in one place or, worse, spread across multiple forums. Also, the
+large volume of support issues on github is making it difficult for us to use
+issues to identify real bugs.
+
+The Kubernetes team scans stackoverflow on a regular basis, and will try to
+ensure your questions don't go unanswered.
+
+Before posting a new question, please search stackoverflow for answers to
+similar questions, and also familiarize yourself with:
+
+ * [user guide](http://kubernetes.io/docs/user-guide/)
+ * [troubleshooting guide](http://kubernetes.io/docs/troubleshooting/)
+
+Again, thanks for using Kubernetes.
+
+The Kubernetes Team
+```
+
+If you answer a question (in any of the above forums) that you think might be
+useful for someone else in the future, *please add it to one of the FAQs in the
+wiki*:
+
+* [User FAQ](https://github.com/kubernetes/kubernetes/wiki/User-FAQ)
+* [Developer FAQ](https://github.com/kubernetes/kubernetes/wiki/Developer-FAQ)
+* [Debugging FAQ](https://github.com/kubernetes/kubernetes/wiki/Debugging-FAQ).
+
+Getting it into the FAQ is more important than polish. Please indicate the date
+it was added, so people can judge the likelihood that it is out-of-date (and
+please correct any FAQ entries that you see contain out-of-date information).
+
+### Contact information
+
+[@k8s-support-oncall](https://github.com/k8s-support-oncall) will reach the
+current person on call.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/on-call-user-support.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/owners.md b/contributors/devel/owners.md
new file mode 100644
index 00000000..217585ce
--- /dev/null
+++ b/contributors/devel/owners.md
@@ -0,0 +1,100 @@
+# Owners files
+
+_Note_: This is a design for a feature that is not yet implemented. See the [contrib PR](https://github.com/kubernetes/contrib/issues/1389) for the current progress.
+
+## Overview
+
+We want to establish owners for different parts of the code in the Kubernetes codebase. These owners
+will serve as the approvers for code to be submitted to these parts of the repository. Notably, owners
+are not necessarily expected to do the first code review for all commits to these areas, but they are
+required to approve changes before they can be merged.
+
+**Note** The Kubernetes project has a hiatus on adding new approvers to OWNERS files. At this time we are [adding more reviewers](https://github.com/kubernetes/kubernetes/pulls?utf8=%E2%9C%93&q=is%3Apr%20%22Curating%20owners%3A%22%20) to take the load off of the current set of approvers and once we have had a chance to flush this out for a release we will begin adding new approvers again. Adding new approvers is planned for after the Kubernetes 1.6.0 release.
+
+## High Level flow
+
+### Step One: A PR is submitted
+
+After a PR is submitted, the automated kubernetes PR robot will append a message to the PR indicating the owners
+that are required for the PR to be submitted.
+
+Subsequently, a user can also request the approval message from the robot by writing:
+
+```
+@k8s-bot approvers
+```
+
+into a comment.
+
+In either case, the automation replies with an annotation that indicates
+the owners required to approve. The annotation is a comment that is applied to the PR.
+This comment will say:
+
+```
+Approval is required from <owner-a> OR <owner-b>, AND <owner-c> OR <owner-d>, AND ...
+```
+
+The set of required owners is drawn from the OWNERS files in the repository (see below). Each file
+should have multiple different OWNERS; these owners are listed in the `OR` clause(s). Because
+a PR may cover different directories with disjoint sets of OWNERS, it may require
+approval from more than one person; this is where the `AND` clauses come from.
+
+`<owner-a>` should be the github user id of the owner _without_ a leading `@` symbol to prevent the owner
+from being cc'd into the PR by email.
+
+### Step Two: A PR is LGTM'd
+
+Once a PR is reviewed and LGTM'd it is eligible for submission. However, for it to be submitted,
+an owner of each of the files changed in the PR has to 'approve' the PR. A user is an owner for a
+file if they are included in the OWNERS hierarchy (see below) for that file.
+
+Owner approval comes in two forms:
+
+ * An owner adds a comment to the PR saying "I approve" or "approved"
+ * An owner is the original author of the PR
+
+In the case of a comment based approval, the same rules as for the 'lgtm' label apply. If the PR is
+changed by pushing new commits to the PR, the previous approval is invalidated, and the owner(s) must
+approve again. Because of this, it is recommended that PR authors squash their PRs prior to getting approval
+from owners.
+
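+A sketch of squashing before requesting approval (this assumes a remote named
+`upstream` pointing at kubernetes/kubernetes and a feature branch pushed to your
+fork as `origin`; the branch name is illustrative):
+
+```sh
+git fetch upstream
+git rebase -i upstream/master   # mark all but the first commit as "squash" or "fixup"
+git push --force-with-lease origin my-feature-branch
+```
+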
+### Step Three: A PR is merged
+
+Once a PR is LGTM'd and all required owners have approved, it is eligible for merge. The merge bot takes care of
+the actual merging.
+
+## Design details
+
+We need to build new features into the existing github munger in order to accomplish this. Additionally
+we need to add owners files to the repository.
+
+### Approval Munger
+
+We need to add a munger that adds comments to PRs indicating whose approval they require. This munger will
+look for PRs that do not have approvers already present in the comments, or where approvers have been
+requested, and add an appropriate comment to the PR.
+
+
+### Status Munger
+
+GitHub has a [status api](https://developer.github.com/v3/repos/statuses/); we will add a status munger that pushes an approval status onto each PR. The status will only show as successful once the relevant
+approvers have approved the PR.
+
+### Requiring approval status
+
+Github has the ability to [require status checks prior to merging](https://help.github.com/articles/enabling-required-status-checks/).
+
+Once we have the status check munger described above implemented, we will add this required status check
+to our main branch as well as any release branches.
+
+### Adding owners files
+
+In each directory in the repository we may add an OWNERS file. This file will contain the github OWNERS
+for that directory. Ownership is hierarchical, so if a directory does not contain an OWNERS file, its
+parent's OWNERS file is used instead. There will be a top-level OWNERS file to back-stop the system.
+
+Obviously changing the OWNERS file requires OWNERS permission.
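+
+Since this design does not fix a file format, the following is purely a
+hypothetical illustration of creating an OWNERS file (the path and user ids are
+made up):
+
+```sh
+# Hypothetical: an OWNERS file listing github user ids (no leading '@'), one per line.
+cat > pkg/kubelet/OWNERS <<'EOF'
+owner-a
+owner-b
+EOF
+```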
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/owners.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/pr_workflow.dia b/contributors/devel/pr_workflow.dia
new file mode 100644
index 00000000..753a284b
--- /dev/null
+++ b/contributors/devel/pr_workflow.dia
Binary files differ
diff --git a/contributors/devel/pr_workflow.png b/contributors/devel/pr_workflow.png
new file mode 100644
index 00000000..0e2bd5d6
--- /dev/null
+++ b/contributors/devel/pr_workflow.png
Binary files differ
diff --git a/contributors/devel/profiling.md b/contributors/devel/profiling.md
new file mode 100644
index 00000000..f50537f1
--- /dev/null
+++ b/contributors/devel/profiling.md
@@ -0,0 +1,46 @@
+# Profiling Kubernetes
+
+This document explains how to plug in the profiler and how to profile Kubernetes services.
+
+## Profiling library
+
+Go comes with the built-in 'net/http/pprof' profiling library and profiling web service. The service works by binding the debug/pprof/ subtree on a running webserver to the profiler. Reading from subpages of debug/pprof returns pprof-formatted profiles of the running binary. The output can be processed offline by a tool of your choice, or used as input to the handy 'go tool pprof', which can represent the result graphically.
+
+## Adding profiling to the APIserver
+
+TL;DR: Add lines:
+
+```go
+m.mux.HandleFunc("/debug/pprof/", pprof.Index)
+m.mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
+m.mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
+```
+
+to the init(c *Config) method in 'pkg/master/master.go' and import the 'net/http/pprof' package.
+
+In most use cases, to use the profiler service it's enough to do 'import _ net/http/pprof', which automatically registers a handler in the default http.Server. A slight inconvenience is that the APIserver uses the default server for intra-cluster communication, so plugging the profiler into it is not really useful. In 'pkg/kubelet/server/server.go' more servers are created and started as separate goroutines. The one that usually serves external traffic is secureServer. The handler for this traffic is defined in 'pkg/master/master.go' and stored in the Handler variable. It is created from an HTTP multiplexer, so the only thing that needs to be done is to add the profiler handler functions to this multiplexer. This is exactly what the lines after the TL;DR do.
+
+## Connecting to the profiler
+
+Even with the profiler running, I found it not entirely straightforward to use 'go tool pprof' with it. The problem is that, at least for dev purposes, the certificates generated for the APIserver are not signed by anyone trusted, and because secureServer serves only secure traffic it isn't straightforward to connect to the service. The best workaround I found is to create an ssh tunnel from the open unsecured port on the kubernetes master to some external machine, and use that machine as a proxy. To save everyone a search for the correct ssh flags, it is done by running:
+
+```sh
+ssh kubernetes_master -L<local_port>:localhost:8080
+```
+
+or an analogous one for your cloud provider. Afterwards you can e.g. run
+
+```sh
+go tool pprof http://localhost:<local_port>/debug/pprof/profile
+```
+
+to get a 30-second CPU profile.
+
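+The same tunnel can be used for the other endpoints registered by `pprof.Index`;
+for example, to fetch a heap profile (same placeholder port as above):
+
+```sh
+go tool pprof http://localhost:<local_port>/debug/pprof/heap
+```
+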
+## Contention profiling
+
+To enable contention profiling you need to add the line `rt.SetBlockProfileRate(1)` in addition to the `m.mux.HandleFunc(...)` lines added before (`rt` stands for `runtime` in `master.go`). This enables the 'debug/pprof/block' subpage, which can be used as an input to `go tool pprof`.
+
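+With block profiling enabled, the profile can be fetched through the same tunnel
+(same placeholder port as above):
+
+```sh
+go tool pprof http://localhost:<local_port>/debug/pprof/block
+```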
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/profiling.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/pull-requests.md b/contributors/devel/pull-requests.md
new file mode 100644
index 00000000..888d7320
--- /dev/null
+++ b/contributors/devel/pull-requests.md
@@ -0,0 +1,105 @@
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Pull Request Process](#pull-request-process)
+- [Life of a Pull Request](#life-of-a-pull-request)
+ - [Before sending a pull request](#before-sending-a-pull-request)
+ - [Release Notes](#release-notes)
+ - [Reviewing pre-release notes](#reviewing-pre-release-notes)
+ - [Visual overview](#visual-overview)
+- [Other notes](#other-notes)
+- [Automation](#automation)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Pull Request Process
+
+An overview of how pull requests are managed for kubernetes. This document
+assumes the reader has already followed the [development guide](development.md)
+to set up their environment.
+
+# Life of a Pull Request
+
+Unless in the last few weeks of a milestone when we need to reduce churn and stabilize, we aim to be always accepting pull requests.
+
+Merging PRs is managed either manually by the [on call](on-call-rotations.md) or automatically by the [github "munger"](https://github.com/kubernetes/contrib/tree/master/mungegithub) submit-queue plugin.
+
+There are several requirements for the submit-queue to work:
+* Author must have signed CLA ("cla: yes" label added to PR)
+* No changes can be made since last lgtm label was applied
+* k8s-bot must have reported the GCE E2E build and test steps passed (Jenkins unit/integration, Jenkins e2e)
+
+Additionally, for infrequent or new contributors, we require the on call to apply the "ok-to-merge" label manually. This is gated by the [whitelist](https://github.com/kubernetes/contrib/blob/master/mungegithub/whitelist.txt).
+
+## Before sending a pull request
+
+The following will save time for both you and your reviewer:
+
+* Enable [pre-commit hooks](development.md#committing-changes-to-your-fork) and verify they pass.
+* Verify `make verify` passes.
+* Verify `make test` passes.
+* Verify `make test-integration` passes.
+
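+To run the three make targets from the list above in one go, a simple sketch:
+
+```sh
+make verify && make test && make test-integration
+```
+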
+## Release Notes
+
+This section applies only to pull requests on the master branch.
+For cherry-pick PRs, see the [Cherrypick instructions](cherry-picks.md).
+
+1. All pull requests are initiated with a `release-note-label-needed` label.
+1. For a PR to be ready to merge, the `release-note-label-needed` label must be removed and one of the other `release-note-*` labels must be added.
+1. `release-note-none` is a valid option if the PR does not need to be mentioned
+ at release time.
+1. `release-note` labeled PRs generate a release note using the PR title by
+ default OR the release-note block in the PR template if filled in.
+ * See the [PR template](../../.github/PULL_REQUEST_TEMPLATE.md) for more
+ details.
+ * PR titles and body comments are mutable and can be modified at any time
+ prior to the release to reflect a release note friendly message.
+
+The only exception to these rules is when a PR is not a cherry-pick and is
+targeted directly to the non-master branch. In this case, a `release-note-*`
+label is required for that non-master PR.
+
+### Reviewing pre-release notes
+
+At any time, you can see what the release notes will look like on any branch.
+(NOTE: This only works on Linux for now)
+
+```
+$ git pull https://github.com/kubernetes/release
+$ RELNOTES=$PWD/release/relnotes
+$ cd /to/your/kubernetes/repo
+$ $RELNOTES -man # for details on how to use the tool
+# Show release notes from the last release on a branch to HEAD
+$ $RELNOTES --branch=master
+```
+
+## Visual overview
+
+![PR workflow](pr_workflow.png)
+
+# Other notes
+
+Pull requests that are purely support questions will be closed and
+redirected to [stackoverflow](http://stackoverflow.com/questions/tagged/kubernetes).
+We do this to consolidate help/support questions into a single channel,
+improve efficiency in responding to requests and make FAQs easier
+to find.
+
+Pull requests older than 2 weeks will be closed. Exceptions can be made
+for PRs that have active review comments, or that are awaiting other dependent PRs.
+Closed pull requests are easy to recreate, and little work is lost by closing a pull
+request that subsequently needs to be reopened. We want to limit the total number of PRs in flight to:
+* Maintain a clean project
+* Remove old PRs that would be difficult to rebase as the underlying code has changed over time
+* Encourage code velocity
+
+
+# Automation
+
+We use a variety of automation to manage pull requests. This automation is described in detail
+[elsewhere](automation.md).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/pull-requests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/running-locally.md b/contributors/devel/running-locally.md
new file mode 100644
index 00000000..327d685e
--- /dev/null
+++ b/contributors/devel/running-locally.md
@@ -0,0 +1,170 @@
+Getting started locally
+-----------------------
+
+**Table of Contents**
+
+- [Requirements](#requirements)
+ - [Linux](#linux)
+ - [Docker](#docker)
+ - [etcd](#etcd)
+ - [go](#go)
+ - [OpenSSL](#openssl)
+- [Clone the repository](#clone-the-repository)
+- [Starting the cluster](#starting-the-cluster)
+- [Running a container](#running-a-container)
+- [Running a user defined pod](#running-a-user-defined-pod)
+- [Troubleshooting](#troubleshooting)
+ - [I cannot reach service IPs on the network.](#i-cannot-reach-service-ips-on-the-network)
+ - [I cannot create a replication controller with replica size greater than 1! What gives?](#i-cannot-create-a-replication-controller-with-replica-size-greater-than-1--what-gives)
+ - [I changed Kubernetes code, how do I run it?](#i-changed-kubernetes-code-how-do-i-run-it)
+ - [kubectl claims to start a container but `get pods` and `docker ps` don't show it.](#kubectl-claims-to-start-a-container-but-get-pods-and-docker-ps-dont-show-it)
+ - [The pods fail to connect to the services by host names](#the-pods-fail-to-connect-to-the-services-by-host-names)
+
+### Requirements
+
+#### Linux
+
+Not running Linux? Consider running [Minikube](http://kubernetes.io/docs/getting-started-guides/minikube/), or on a cloud provider like [Google Compute Engine](../getting-started-guides/gce.md).
+
+#### Docker
+
+At least [Docker](https://docs.docker.com/installation/#installation)
+1.3+. Ensure the Docker daemon is running and can be contacted (try `docker
+ps`). Some of the Kubernetes components need to run as root, which normally
+works fine with docker.
+
+#### etcd
+
+You need [etcd](https://github.com/coreos/etcd/releases) installed and available in your ``$PATH``.
+
+#### go
+
+You need [go](https://golang.org/doc/install) installed and available in your ``$PATH`` (see [here](development.md#go-versions) for supported versions).
+
+#### OpenSSL
+
+You need [OpenSSL](https://www.openssl.org/) installed. If you do not have the `openssl` command available, you may see the following error in `/tmp/kube-apiserver.log`:
+
+```
+server.go:333] Invalid Authentication Config: open /tmp/kube-serviceaccount.key: no such file or directory
+```
+
+### Clone the repository
+
+In order to run kubernetes you must have the kubernetes code on the local machine. Cloning this repository is sufficient.
+
+```sh
+git clone --depth=1 https://github.com/kubernetes/kubernetes.git
+```
+
+The `--depth=1` parameter is optional and will ensure a smaller download.
+
+### Starting the cluster
+
+In a separate tab of your terminal, run the following (since one needs sudo access to start/stop Kubernetes daemons, it is easier to run the entire script as root):
+
+```sh
+cd kubernetes
+hack/local-up-cluster.sh
+```
+
+This will build and start a lightweight local cluster, consisting of a master
+and a single node. Type Control-C to shut it down.
+
+If you've already compiled the Kubernetes components, then you can avoid rebuilding them with this script by using the `-O` flag.
+
+```sh
+./hack/local-up-cluster.sh -O
+```
+
+You can use the cluster/kubectl.sh script to interact with the local cluster. hack/local-up-cluster.sh will
+print the commands to run to point kubectl at the local cluster.
+
+
+### Running a container
+
+Your cluster is running, and you want to start running containers!
+
+You can now use any of the cluster/kubectl.sh commands to interact with your local setup.
+
+```sh
+cluster/kubectl.sh get pods
+cluster/kubectl.sh get services
+cluster/kubectl.sh get replicationcontrollers
+cluster/kubectl.sh run my-nginx --image=nginx --replicas=2 --port=80
+
+
+## begin wait: while the provision completes, you can monitor the docker pull from a new terminal
+ sudo docker images
+ ## you should see docker pulling the nginx image; once the pull completes, run
+ sudo docker ps
+ ## you should see your container running!
+ exit
+## end wait
+
+## introspect Kubernetes!
+cluster/kubectl.sh get pods
+cluster/kubectl.sh get services
+cluster/kubectl.sh get replicationcontrollers
+```
+
+
+### Running a user defined pod
+
+Note the difference between a [container](../user-guide/containers.md)
+and a [pod](../user-guide/pods.md). Since you only asked for the former, Kubernetes will create a wrapper pod for you.
+However, you cannot view the nginx start page on localhost. To verify that nginx is running, you need to run `curl` inside the docker container (try `docker exec`).
+
+You can control the specifications of a pod via a user defined manifest, and reach nginx through your browser on the port specified therein:
+
+```sh
+cluster/kubectl.sh create -f test/fixtures/doc-yaml/user-guide/pod.yaml
+```
+
+Congratulations!
+
+### Troubleshooting
+
+#### I cannot reach service IPs on the network.
+
+Some firewall software that uses iptables may not interact well with
+kubernetes. If you have trouble around networking, try disabling any
+firewall or other iptables-using systems, first. Also, you can check
+if SELinux is blocking anything by running a command such as `journalctl --since yesterday | grep avc`.
+
+By default the IP range for service cluster IPs is 10.0.*.*; depending on your
+docker installation, this may conflict with IPs for containers. If you find
+containers running with IPs in this range, edit hack/local-up-cluster.sh and
+change the service-cluster-ip-range flag to something else.
+
+#### I cannot create a replication controller with replica size greater than 1! What gives?
+
+You are running a single node setup. This has the limitation of only supporting a single replica of a given pod. If you are interested in running with larger replica sizes, we encourage you to try the local vagrant setup or one of the cloud providers.
+
+#### I changed Kubernetes code, how do I run it?
+
+```sh
+cd kubernetes
+make
+hack/local-up-cluster.sh
+```
+
+#### kubectl claims to start a container but `get pods` and `docker ps` don't show it.
+
+One or more of the Kubernetes daemons might've crashed. Tail the logs of each in /tmp.
+
+#### The pods fail to connect to the services by host names
+
+To start the DNS service, you need to set the following variables:
+
+```sh
+KUBE_ENABLE_CLUSTER_DNS=true
+KUBE_DNS_SERVER_IP="10.0.0.10"
+KUBE_DNS_DOMAIN="cluster.local"
+KUBE_DNS_REPLICAS=1
+```
+
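+One way to apply them is to set them in the environment when starting the local
+cluster (a sketch; this assumes `hack/local-up-cluster.sh` picks these variables
+up from its environment):
+
+```sh
+KUBE_ENABLE_CLUSTER_DNS=true \
+KUBE_DNS_SERVER_IP="10.0.0.10" \
+KUBE_DNS_DOMAIN="cluster.local" \
+KUBE_DNS_REPLICAS=1 \
+hack/local-up-cluster.sh
+```
+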
+To learn more about the DNS service, see [here](http://issue.k8s.io/6667). Related documents can be found [here](../../build-tools/kube-dns/#how-do-i-configure-it).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/running-locally.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/scheduler.md b/contributors/devel/scheduler.md
new file mode 100755
index 00000000..b1cfea7a
--- /dev/null
+++ b/contributors/devel/scheduler.md
@@ -0,0 +1,72 @@
+# The Kubernetes Scheduler
+
+The Kubernetes scheduler runs as a process alongside the other master
+components such as the API server. Its interface to the API server is to watch
+for Pods with an empty PodSpec.NodeName, and for each Pod, it posts a Binding
+indicating where the Pod should be scheduled.
+
+## The scheduling process
+
+```
+ +-------+
+ +---------------+ node 1|
+ | +-------+
+ |
+ +----> | Apply pred. filters
+ | |
+ | | +-------+
+ | +----+---------->+node 2 |
+ | | +--+----+
+ | watch | |
+ | | | +------+
+ | +---------------------->+node 3|
++--+---------------+ | +--+---+
+| Pods in apiserver| | |
++------------------+ | |
+ | |
+ | |
+ +------------V------v--------+
+ | Priority function |
+ +-------------+--------------+
+ |
+ | node 1: p=2
+ | node 2: p=5
+ v
+ select max{node priority} = node 2
+
+```
+
+The Scheduler tries to find a node for each Pod, one at a time.
+- First it applies a set of "predicates" to filter out inappropriate nodes. For example, if the PodSpec specifies resource requests, then the scheduler will filter out nodes that don't have at least that many resources available (computed as the capacity of the node minus the sum of the resource requests of the containers already running on the node).
+- Second, it applies a set of "priority functions"
+that rank the nodes that weren't filtered out by the predicate check. For example, it tries to spread Pods across nodes and zones while at the same time favoring the least (theoretically) loaded nodes (where "load" - in theory - is measured as the sum of the resource requests of the containers running on the node, divided by the node's capacity).
+- Finally, the node with the highest priority is chosen (or, if there are multiple such nodes, then one of them is chosen at random). The code for this main scheduling loop is in the function `Schedule()` in [plugin/pkg/scheduler/generic_scheduler.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/generic_scheduler.go).
+
+## Scheduler extensibility
+
+The scheduler is extensible: the cluster administrator can choose which of the pre-defined
+scheduling policies to apply, and can add new ones.
+
+### Policies (Predicates and Priorities)
+
+The built-in predicates and priorities are
+defined in [plugin/pkg/scheduler/algorithm/predicates/predicates.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/predicates/predicates.go) and
+[plugin/pkg/scheduler/algorithm/priorities/priorities.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/priorities/priorities.go), respectively.
+
+### Modifying policies
+
+The policies that are applied when scheduling can be chosen in one of two ways. Normally,
+the policies used are selected by the functions `defaultPredicates()` and `defaultPriorities()` in
+[plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go).
+However, the choice of policies can be overridden by passing the command-line flag `--policy-config-file` to the scheduler, pointing to a JSON file specifying which scheduling policies to use. See [examples/scheduler-policy-config.json](../../examples/scheduler-policy-config.json) for an example
+config file. (Note that the config file format is versioned; the API is defined in [plugin/pkg/scheduler/api](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/api/)).
+Thus to add a new scheduling policy, you should modify predicates.go or priorities.go, and either register the policy in `defaultPredicates()` or `defaultPriorities()`, or use a policy config file.
+
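+For example, starting the scheduler with a custom policy file might look like
+this (a sketch; the `--master` address assumes a locally reachable apiserver):
+
+```sh
+kube-scheduler \
+  --master=http://localhost:8080 \
+  --policy-config-file=examples/scheduler-policy-config.json
+```
+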
+## Exploring the code
+
+If you want to get a global picture of how the scheduler works, you can start in
+[plugin/cmd/kube-scheduler/app/server.go](http://releases.k8s.io/HEAD/plugin/cmd/kube-scheduler/app/server.go)
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/scheduler.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/scheduler_algorithm.md b/contributors/devel/scheduler_algorithm.md
new file mode 100755
index 00000000..28c6c2bc
--- /dev/null
+++ b/contributors/devel/scheduler_algorithm.md
@@ -0,0 +1,44 @@
+# Scheduler Algorithm in Kubernetes
+
+For each unscheduled Pod, the Kubernetes scheduler tries to find a node across the cluster according to a set of rules. A general introduction to the Kubernetes scheduler can be found at [scheduler.md](scheduler.md). In this document, the algorithm of how to select a node for the Pod is explained. There are two steps before a destination node of a Pod is chosen. The first step is filtering all the nodes and the second is ranking the remaining nodes to find a best fit for the Pod.
+
+## Filtering the nodes
+
+The purpose of filtering the nodes is to filter out the nodes that do not meet certain requirements of the Pod. For example, if the free resource on a node (measured by the capacity minus the sum of the resource requests of all the Pods that already run on the node) is less than the Pod's required resource, the node should not be considered in the ranking phase so it is filtered out. Currently, there are several "predicates" implementing different filtering policies, including:
+
+- `NoDiskConflict`: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.
+- `NoVolumeZoneConflict`: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.
+- `PodFitsResources`: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check [QoS proposal](../design/resource-qos.md).
+- `PodFitsHostPorts`: Check if any HostPort required by the Pod is already occupied on the node.
+- `HostName`: Filter out all nodes except the one specified in the PodSpec's NodeName field.
+- `MatchNodeSelector`: Check if the labels of the node match the labels specified in the Pod's `nodeSelector` field and, as of Kubernetes v1.2, also match the `scheduler.alpha.kubernetes.io/affinity` pod annotation if present. See [here](../user-guide/node-selection/) for more details on both.
+- `MaxEBSVolumeCount`: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume -- see [Amazon's documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html#linux-specific-volume-limits)). The maximum value can be controlled by setting the `KUBE_MAX_PD_VOLS` environment variable.
+- `MaxGCEPDVolumeCount`: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows -- see [GCE's documentation](https://cloud.google.com/compute/docs/disks/persistent-disks#limits_for_predefined_machine_types)). The maximum value can be controlled by setting the `KUBE_MAX_PD_VOLS` environment variable.
+- `CheckNodeMemoryPressure`: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no ``BestEffort`` should be placed on a node under memory pressure as it gets automatically evicted by kubelet.
+- `CheckNodeDiskPressure`: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure as it gets automatically evicted by kubelet.
+
+The details of the above predicates can be found in [plugin/pkg/scheduler/algorithm/predicates/predicates.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/predicates/predicates.go). All predicates mentioned above can be used in combination to perform a sophisticated filtering policy. Kubernetes uses some, but not all, of these predicates by default. You can see which ones are used by default in [plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go).
+
+## Ranking the nodes
+
+The filtered nodes are considered suitable to host the Pod, and often more than one node remains. Kubernetes prioritizes the remaining nodes to find the "best" one for the Pod. The prioritization is performed by a set of priority functions. For each remaining node, a priority function gives a score from 0 to 10, with 10 representing "most preferred" and 0 "least preferred". Each priority function is weighted by a positive number, and the final score of each node is calculated by adding up all the weighted scores. For example, suppose there are two priority functions, `priorityFunc1` and `priorityFunc2`, with weighting factors `weight1` and `weight2` respectively; then the final score of some NodeA is:
+
+ finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
+
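+For example, with weight1 = 1, weight2 = 2, and NodeA scoring 5 from priorityFunc1 and 8 from priorityFunc2 (illustrative numbers):
+
+    finalScoreNodeA = (1 * 5) + (2 * 8) = 21
+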
+After the scores of all nodes are calculated, the node with the highest score is chosen as the host of the Pod. If more than one node ties for the highest score, one of them is chosen at random.
+
+Currently, the Kubernetes scheduler provides some practical priority functions, including:
+
+- `LeastRequestedPriority`: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
+- `BalancedResourceAllocation`: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
+- `SelectorSpreadPriority`: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
+- `CalculateAntiAffinityPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
+- `ImageLocalityPriority`: Nodes are prioritized based on the locality of the images requested by a pod. Nodes that already have a larger total size of the pod's required images present are preferred over nodes that have few or none of those images.
+- `NodeAffinityPriority`: (Kubernetes v1.2) Implements `preferredDuringSchedulingIgnoredDuringExecution` node affinity; see [here](../user-guide/node-selection/) for more details.
+
+The details of the above priority functions can be found in [plugin/pkg/scheduler/algorithm/priorities](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/priorities/). Kubernetes uses some, but not all, of these priority functions by default. You can see which ones are used by default in [plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go). As with predicates, you can combine the above priority functions and assign weight factors (positive numbers) to them as you wish (check [scheduler.md](scheduler.md) for how to customize).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/scheduler_algorithm.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/testing.md b/contributors/devel/testing.md
new file mode 100644
index 00000000..45848f3b
--- /dev/null
+++ b/contributors/devel/testing.md
@@ -0,0 +1,230 @@
+# Testing guide
+
+Updated: 5/21/2016
+
+**Table of Contents**
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Testing guide](#testing-guide)
+ - [Unit tests](#unit-tests)
+ - [Run all unit tests](#run-all-unit-tests)
+ - [Set go flags during unit tests](#set-go-flags-during-unit-tests)
+ - [Run unit tests from certain packages](#run-unit-tests-from-certain-packages)
+ - [Run specific unit test cases in a package](#run-specific-unit-test-cases-in-a-package)
+ - [Stress running unit tests](#stress-running-unit-tests)
+ - [Unit test coverage](#unit-test-coverage)
+ - [Benchmark unit tests](#benchmark-unit-tests)
+ - [Integration tests](#integration-tests)
+ - [Install etcd dependency](#install-etcd-dependency)
+ - [Etcd test data](#etcd-test-data)
+ - [Run integration tests](#run-integration-tests)
+ - [Run a specific integration test](#run-a-specific-integration-test)
+ - [End-to-End tests](#end-to-end-tests)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+This assumes you already read the [development guide](development.md) to
+install go, godeps, and configure your git client. All command examples are
+relative to the `kubernetes` root directory.
+
+Before sending pull requests you should at least make sure your changes have
+passed both unit and integration tests.
+
+Kubernetes only merges pull requests when unit, integration, and e2e tests are
+passing, so it is often a good idea to make sure the e2e tests work as well.
+
+## Unit tests
+
+* Unit tests should be fully hermetic
+ - Only access resources in the test binary.
+* All packages and any significant files require unit tests.
+* The preferred method of testing multiple scenarios or input is
+ [table driven testing](https://github.com/golang/go/wiki/TableDrivenTests)
+ - Example: [TestNamespaceAuthorization](../../test/integration/auth/auth_test.go)
+* Unit tests must pass on OS X and Windows platforms.
+ - Tests using linux-specific features must be skipped or compiled out.
+  - Skipping is preferred; compiling out is required when the code won't compile on those platforms.
+* Concurrent unit test runs must pass.
+* See [coding conventions](coding-conventions.md).
+
+### Run all unit tests
+
+`make test` is the entrypoint for running the unit tests that ensures that
+`GOPATH` is set up correctly. If you have `GOPATH` set up correctly, you can
+also just use `go test` directly.
+
+```sh
+cd kubernetes
+make test # Run all unit tests.
+```
+
+### Set go flags during unit tests
+
+You can set [go flags](https://golang.org/cmd/go/) by setting the
+`KUBE_GOFLAGS` environment variable.
+
+### Run unit tests from certain packages
+
+`make test` accepts packages as arguments; the `k8s.io/kubernetes` prefix is
+added automatically to these:
+
+```sh
+make test WHAT=pkg/api # run tests for pkg/api
+```
+
+To run multiple targets you need quotes:
+
+```sh
+make test WHAT="pkg/api pkg/kubelet" # run tests for pkg/api and pkg/kubelet
+```
+
+In a shell, it's often handy to use brace expansion:
+
+```sh
+make test WHAT=pkg/{api,kubelet} # run tests for pkg/api and pkg/kubelet
+```
+
+### Run specific unit test cases in a package
+
+You can set the test args using the `KUBE_TEST_ARGS` environment variable.
+You can use this to pass the `-run` argument to `go test`, which accepts a
+regular expression for the name of the test that should be run.
+
+```sh
+# Runs TestValidatePod in pkg/api/validation with the verbose flag set
+make test WHAT=pkg/api/validation KUBE_GOFLAGS="-v" KUBE_TEST_ARGS='-run ^TestValidatePod$'
+
+# Runs tests that match the regex ValidatePod|ValidateConfigMap in pkg/api/validation
+make test WHAT=pkg/api/validation KUBE_GOFLAGS="-v" KUBE_TEST_ARGS="-run ValidatePod\|ValidateConfigMap$"
+```
+
+For other supported test flags, see the [golang
+documentation](https://golang.org/cmd/go/#hdr-Description_of_testing_flags).
+
+### Stress running unit tests
+
+Running the same tests repeatedly is one way to root out flakes.
+You can do this efficiently.
+
+```sh
+# Have 2 workers run all tests 5 times each (10 total iterations).
+make test PARALLEL=2 ITERATION=5
+```
+
+For more advanced ideas please see [flaky-tests.md](flaky-tests.md).
+
+### Unit test coverage
+
+Currently, collecting coverage is only supported for the Go unit tests.
+
+To run all unit tests and generate an HTML coverage report, run the following:
+
+```sh
+make test KUBE_COVER=y
+```
+
+At the end of the run, an HTML report will be generated with the path
+printed to stdout.
+
+To run tests and collect coverage in only one package, pass its relative path
+under the `kubernetes` directory as an argument, for example:
+
+```sh
+make test WHAT=pkg/kubectl KUBE_COVER=y
+```
+
+Multiple arguments can be passed, in which case the coverage results will be
+combined for all tests run.
+
+### Benchmark unit tests
+
+To run benchmark tests, you'll typically use something like:
+
+```sh
+go test ./pkg/apiserver -benchmem -run=XXX -bench=BenchmarkWatch
+```
+
+This will do the following:
+
+1. `-run=XXX` is a regular expression filter on the name of test cases to run
+2. `-bench=BenchmarkWatch` will run test methods with BenchmarkWatch in the name
+ * See `grep -nr BenchmarkWatch .` for examples
+3. `-benchmem` enables memory allocation stats
+
+See `go help test` and `go help testflag` for additional info.
+
+## Integration tests
+
+* Integration tests should only access other resources on the local machine
+ - Most commonly etcd or a service listening on localhost.
+* All significant features require integration tests.
+ - This includes kubectl commands
+* The preferred method of testing multiple scenarios or inputs
+is [table driven testing](https://github.com/golang/go/wiki/TableDrivenTests)
+ - Example: [TestNamespaceAuthorization](../../test/integration/auth/auth_test.go)
+* Each test should create its own master, httpserver and config.
+ - Example: [TestPodUpdateActiveDeadlineSeconds](../../test/integration/pods/pods_test.go)
+* See [coding conventions](coding-conventions.md).
+
+### Install etcd dependency
+
+Kubernetes integration tests require your `PATH` to include an
+[etcd](https://github.com/coreos/etcd/releases) installation. Kubernetes
+includes a script to help install etcd on your machine.
+
+```sh
+# Install etcd and add to PATH
+
+# Option a) install inside kubernetes root
+hack/install-etcd.sh # Installs in ./third_party/etcd
+echo export PATH="\$PATH:$(pwd)/third_party/etcd" >> ~/.profile # Add to PATH
+
+# Option b) install manually
+grep -E "image.*etcd" cluster/saltbase/etcd/etcd.manifest # Find version
+# Install that version using yum/apt-get/etc
+echo export PATH="\$PATH:<LOCATION>" >> ~/.profile # Add to PATH
+```
+
+### Etcd test data
+
+Many tests start an etcd server internally, storing test data in the operating system's temporary directory.
+
+If you see test failures because the temporary directory does not have sufficient space,
+or is on a volume with unpredictable write latency, you can override the test data directory
+for those internal etcd instances with the `TEST_ETCD_DIR` environment variable.
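+
+For example (the directory below is only an illustration; pick any location with
+enough space and predictable write latency):
+
+```sh
+export TEST_ETCD_DIR=/var/tmp/etcd-test-data
+mkdir -p "$TEST_ETCD_DIR"
+make test-integration
+```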
+
+### Run integration tests
+
+The integration tests are run using `make test-integration`.
+The Kubernetes integration tests are written using the normal golang testing
+package but expect to have a running etcd instance to connect to. The
+`test-integration.sh` script wraps `make test` and sets up an etcd instance
+for the integration tests to use.
+
+```sh
+make test-integration # Run all integration tests.
+```
+
+This script runs the golang tests in package
+[`test/integration`](../../test/integration/).
+
+### Run a specific integration test
+
+You can also use the `KUBE_TEST_ARGS` environment variable with the
+`hack/test-integration.sh` script to run a specific integration test case:
+
+```sh
+# Run integration test TestPodUpdateActiveDeadlineSeconds with the verbose flag set.
+make test-integration KUBE_GOFLAGS="-v" KUBE_TEST_ARGS="-run ^TestPodUpdateActiveDeadlineSeconds$"
+```
+
+If you set `KUBE_TEST_ARGS`, the test case will be run with only the `v1` API
+version and the watch cache test is skipped.
+
+## End-to-End tests
+
+Please refer to [End-to-End Testing in Kubernetes](e2e-tests.md).
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/testing.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/update-release-docs.md b/contributors/devel/update-release-docs.md
new file mode 100644
index 00000000..1e0988db
--- /dev/null
+++ b/contributors/devel/update-release-docs.md
@@ -0,0 +1,115 @@
+# Table of Contents
+
+<!-- BEGIN MUNGE: GENERATED_TOC -->
+
+- [Table of Contents](#table-of-contents)
+- [Overview](#overview)
+- [Adding a new docs collection for a release](#adding-a-new-docs-collection-for-a-release)
+- [Updating docs in an existing collection](#updating-docs-in-an-existing-collection)
+ - [Updating docs on HEAD](#updating-docs-on-head)
+ - [Updating docs in release branch](#updating-docs-in-release-branch)
+ - [Updating docs in gh-pages branch](#updating-docs-in-gh-pages-branch)
+
+<!-- END MUNGE: GENERATED_TOC -->
+
+# Overview
+
+This document explains how to update kubernetes release docs hosted at http://kubernetes.io/docs/.
+
+http://kubernetes.io is served using the [gh-pages
+branch](https://github.com/kubernetes/kubernetes/tree/gh-pages) of kubernetes repo on github.
+Updating docs in that branch will update http://kubernetes.io
+
+There are 2 scenarios which require updating docs:
+* Adding a new docs collection for a release.
+* Updating docs in an existing collection.
+
+# Adding a new docs collection for a release
+
+Whenever a new release series (`release-X.Y`) is cut from `master`, we push the
+corresponding set of docs to `http://kubernetes.io/vX.Y/docs`. The steps are as follows:
+
+* Create a `_vX.Y` folder in `gh-pages` branch.
+* Add `vX.Y` as a valid collection in [_config.yml](https://github.com/kubernetes/kubernetes/blob/gh-pages/_config.yml)
+* Create a new `_includes/nav_vX.Y.html` file with the navigation menu. This can
+ be a copy of `_includes/nav_vX.Y-1.html` with links to new docs added and links
+  to deleted docs removed. Update
+  [_layouts/docwithnav.html](https://github.com/kubernetes/kubernetes/blob/gh-pages/_layouts/docwithnav.html)
+ to include this new navigation html file. Example PR: [#16143](https://github.com/kubernetes/kubernetes/pull/16143).
+* [Pull docs from release branch](#updating-docs-in-gh-pages-branch) in `_vX.Y`
+ folder.
+
+Once these changes have been submitted, you should be able to reach the docs at
+`http://kubernetes.io/vX.Y/docs/` where you can test them.
+
+To make `X.Y` the default version of docs:
+
+* Update [_config.yml](https://github.com/kubernetes/kubernetes/blob/gh-pages/_config.yml)
+ and [/kubernetes/kubernetes/blob/gh-pages/_docs/index.md](https://github.com/kubernetes/kubernetes/blob/gh-pages/_docs/index.md)
+ to point to the new version. Example PR: [#16416](https://github.com/kubernetes/kubernetes/pull/16416).
+* Update [_includes/docversionselector.html](https://github.com/kubernetes/kubernetes/blob/gh-pages/_includes/docversionselector.html)
+ to make `vX.Y` the default version.
+* Add "Disallow: /vX.Y-1/" to existing [robots.txt](https://github.com/kubernetes/kubernetes/blob/gh-pages/robots.txt)
+ file to hide old content from web crawlers and focus SEO on new docs. Example PR:
+ [#16388](https://github.com/kubernetes/kubernetes/pull/16388).
+* Regenerate [sitemaps.xml](https://github.com/kubernetes/kubernetes/blob/gh-pages/sitemap.xml)
+ so that it now contains `vX.Y` links. Sitemap can be regenerated using
+ https://www.xml-sitemaps.com. Example PR: [#17126](https://github.com/kubernetes/kubernetes/pull/17126).
+* Resubmit the updated sitemaps file to [Google
+ webmasters](https://www.google.com/webmasters/tools/sitemap-list?siteUrl=http://kubernetes.io/) for google to index the new links.
+* Update [_layouts/docwithnav.html](https://github.com/kubernetes/kubernetes/blob/gh-pages/_layouts/docwithnav.html)
+ to include [_includes/archivedocnotice.html](https://github.com/kubernetes/kubernetes/blob/gh-pages/_includes/archivedocnotice.html)
+ for `vX.Y-1` docs which need to be archived.
+* Ping @thockin to update docs.k8s.io to redirect to `http://kubernetes.io/vX.Y/`. [#18788](https://github.com/kubernetes/kubernetes/issues/18788).
+
+http://kubernetes.io/docs/ should now be redirecting to `http://kubernetes.io/vX.Y/`.
+
+# Updating docs in an existing collection
+
+The high level steps to update docs in an existing collection are:
+
+1. Update docs on `HEAD` (master branch)
+2. Cherry-pick the change into the relevant release branch.
+3. Update docs on `gh-pages`.
+
+## Updating docs on HEAD
+
+[Development guide](development.md) provides general instructions on how to contribute to kubernetes github repo.
+[Docs how to guide](how-to-doc.md) provides conventions to follow while writing docs.
+
+## Updating docs in release branch
+
+Once docs have been updated in the master branch, the changes need to be
+cherrypicked in the latest release branch.
+[Cherrypick guide](cherry-picks.md) has more details on how to cherrypick your change.
+
+## Updating docs in gh-pages branch
+
+Once the release branch has all the relevant changes, we can pull the latest docs
+into the `gh-pages` branch.
+Run the following command in the `gh-pages` branch to update the docs for release `X.Y`:
+
+```
+_tools/import_docs vX.Y _vX.Y release-X.Y release-X.Y
+```
+
+For example, to pull in the docs for release 1.1, run:
+
+```
+_tools/import_docs v1.1 _v1.1 release-1.1 release-1.1
+```
+
+Apart from copying over the docs, `_tools/import_docs` also does some post-processing
+(like updating doc links to point to http://kubernetes.io/docs/ instead of the GitHub repo).
+Note that we always pull in the docs from the release branch and not from master (pulling docs
+from master requires some extra processing, like versioning the links and removing unversioned warnings).
+
+We delete all existing docs before pulling in new ones to ensure that deleted
+docs go away.
+
+If the change added or deleted a doc, then update the corresponding `_includes/nav_vX.Y.html` file as well.
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/update-release-docs.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/updating-docs-for-feature-changes.md b/contributors/devel/updating-docs-for-feature-changes.md
new file mode 100644
index 00000000..309b809d
--- /dev/null
+++ b/contributors/devel/updating-docs-for-feature-changes.md
@@ -0,0 +1,76 @@
+# How to update docs for new Kubernetes features
+
+This document describes things to consider when updating Kubernetes docs for new features or changes to existing features (including removing features).
+
+## Who should read this doc?
+
+Anyone making user-facing changes to Kubernetes. This is especially important for API changes or anything impacting the getting started experience.
+
+## What docs changes are needed when adding or updating a feature in Kubernetes?
+
+### When making API changes
+
+*e.g. adding Deployments*
+* Always make sure docs for downstream effects are updated *(StatefulSet -> PVC, Deployment -> ReplicationController)*
+* Add or update the corresponding *[Glossary](http://kubernetes.io/docs/reference/)* item
+* Verify the guides / walkthroughs do not require any changes:
+ * **If your change will be recommended over the approaches shown in these guides, then they must be updated to reflect your change**
+ * [Hello Node](http://kubernetes.io/docs/hellonode/)
+ * [K8s101](http://kubernetes.io/docs/user-guide/walkthrough/)
+ * [K8S201](http://kubernetes.io/docs/user-guide/walkthrough/k8s201/)
+ * [Guest-book](https://github.com/kubernetes/kubernetes/tree/release-1.2/examples/guestbook)
+ * [Thorough-walkthrough](http://kubernetes.io/docs/user-guide/)
+* Verify the [landing page examples](http://kubernetes.io/docs/samples/) do not require any changes (those under "Recently updated samples")
+ * **If your change will be recommended over the approaches shown in the "Updated" examples, then they must be updated to reflect your change**
+ * If you are aware that your change will be recommended over the approaches shown in non-"Updated" examples, create an Issue
+* Verify the collection of docs under the "Guides" section does not require updates (you may need to use grep for this until our docs are more organized)
+
+### When making Tools changes
+
+*e.g. updating kube-dash or kubectl*
+* If changing kubectl, verify the guides / walkthroughs do not require any changes:
+ * **If your change will be recommended over the approaches shown in these guides, then they must be updated to reflect your change**
+ * [Hello Node](http://kubernetes.io/docs/hellonode/)
+ * [K8s101](http://kubernetes.io/docs/user-guide/walkthrough/)
+ * [K8S201](http://kubernetes.io/docs/user-guide/walkthrough/k8s201/)
+ * [Guest-book](https://github.com/kubernetes/kubernetes/tree/release-1.2/examples/guestbook)
+ * [Thorough-walkthrough](http://kubernetes.io/docs/user-guide/)
+* If updating an existing tool
+ * Search for any docs about the tool and update them
+* If adding a new tool for end users
+ * Add a new page under [Guides](http://kubernetes.io/docs/)
+* **If removing a tool (kube-ui), make sure documentation that references it is updated appropriately!**
+
+### When making cluster setup changes
+
+*e.g. adding Multi-AZ support*
+* Update the relevant [Administering Clusters](http://kubernetes.io/docs/) pages
+
+### When making Kubernetes binary changes
+
+*e.g. adding a flag, changing Pod GC behavior, etc*
+* Add or update a page under [Configuring Kubernetes](http://kubernetes.io/docs/)
+
+## Where do the docs live?
+
+1. Most external user-facing docs live in the [kubernetes/docs](https://github.com/kubernetes/kubernetes.github.io) repo
+ * Also see the *[general instructions](http://kubernetes.io/editdocs/)* for making changes to the docs website
+2. Internal design and development docs live in the [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes) repo
+
+## Who should help review docs changes?
+
+* cc *@kubernetes/docs*
+* Changes to [kubernetes/docs](https://github.com/kubernetes/kubernetes.github.io) repo must have both a Technical Review and a Docs Review
+
+## Tips for writing new docs
+
+* Try to keep new docs small and focused
+* Document prerequisites (if they exist)
+* Document what concepts will be covered in the document
+* Include screenshots or pictures in documents for GUIs
+* *TODO once we have a standard widget set we are happy with* - include diagrams to help describe complex ideas (not required yet)
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/updating-docs-for-feature-changes.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/writing-a-getting-started-guide.md b/contributors/devel/writing-a-getting-started-guide.md
new file mode 100644
index 00000000..b1d65d60
--- /dev/null
+++ b/contributors/devel/writing-a-getting-started-guide.md
@@ -0,0 +1,101 @@
+# Writing a Getting Started Guide
+
+This page gives some advice for anyone planning to write or update a Getting Started Guide for Kubernetes.
+It also gives some guidelines which reviewers should follow when reviewing a pull request for a
+guide.
+
+A Getting Started Guide gives instructions for creating a Kubernetes cluster on top of a particular
+type of infrastructure. Infrastructure includes: the IaaS provider for VMs;
+the node OS; inter-node networking; and the node Configuration Management system.
+A guide refers to scripts, Configuration Management files, and/or binary assets such as RPMs. We call
+the combination of all these things needed to run on a particular type of infrastructure a
+**distro**.
+
+[The Matrix](../../docs/getting-started-guides/README.md) lists the distros. If there is already a guide
+which is similar to the one you have planned, consider improving that one.
+
+
+Distros fall into two categories:
+ - **versioned distros** are tested to work with a particular binary release of Kubernetes. These
+ come in a wide variety, reflecting a wide range of ideas and preferences in how to run a cluster.
+ - **development distros** are tested to work with the latest Kubernetes source code. But there are
+ relatively few of these, and the bar is much higher for creating one. They must support
+ fully automated cluster creation, deletion, and upgrade.
+
+There are different guidelines for each.
+
+## Versioned Distro Guidelines
+
+These guidelines say *what* to do. See the Rationale section for *why*.
+ - Send us a PR.
+ - Put the instructions in `docs/getting-started-guides/...`. Scripts go there too. This helps devs easily
+ search for uses of flags by guides.
+ - We may ask that you host binary assets or large amounts of code in our `contrib` directory or on your
+ own repo.
+ - Add or update a row in [The Matrix](../../docs/getting-started-guides/README.md).
+ - Clearly state the binary version of Kubernetes that you tested in your guide.
+ - Set up a cluster, run the [conformance tests](e2e-tests.md#conformance-tests) against it, and report the
+ results in your PR.
+ - Versioned distros should typically not modify or add code in `cluster/`. Those are just scripts for development
+ distros.
+ - When a new major or minor release of Kubernetes comes out, we may also release a new
+ conformance test, and require a new conformance test run to earn a conformance checkmark.
+
+If you have a cluster partially working, but doing all the above steps seems like too much work,
+we still want to hear from you. We suggest you write a blog post or a Gist, and we will link to it on our wiki page.
+Just file an issue or chat with us on [Slack](http://slack.kubernetes.io), and one of the committers will link to it from the wiki.
+
+## Development Distro Guidelines
+
+These guidelines say *what* to do. See the Rationale section for *why*.
+ - The main reason to add a new development distro is to support a new IaaS provider (VM and
+ network management). This means implementing a new `pkg/cloudprovider/providers/$IAAS_NAME`.
+ - Development distros should use Saltstack for Configuration Management.
+ - Development distros need to support automated cluster creation, deletion, upgrading, etc.
+ This means writing scripts in `cluster/$IAAS_NAME`.
+ - All commits to the tip of this repo must not break any of the development distros.
+ - The author of a change is responsible for making the necessary changes on all the cloud providers if the
+ change affects any of them, and for reverting the change if it breaks any of the CIs.
+ - A development distro needs to have an organization which owns it. This organization needs to:
+ - Set up and maintain Continuous Integration that runs e2e frequently (multiple times per day) against the
+ distro at head, and which notifies all devs of breakage.
+ - Be reasonably available for questions and assist with
+ refactoring and feature additions that affect code for their IaaS.
+
+## Rationale
+
+ - We want people to create Kubernetes clusters with whatever IaaS, Node OS,
+ configuration management tools, and so on, which they are familiar with. The
+ guidelines for **versioned distros** are designed for flexibility.
+ - We want developers to be able to work without understanding all the permutations of
+ IaaS, NodeOS, and configuration management. The guidelines for **developer distros** are designed
+ for consistency.
+ - We want users to have a uniform experience with Kubernetes whenever they follow instructions anywhere
+ in our GitHub repository. So, we ask that versioned distros pass a **conformance test** to make sure
+ they really work.
+ - We want to **limit the number of development distros** for several reasons. Developers should
+ only have to change a limited number of places to add a new feature. Also, since we will
+ gate commits on passing CI for all distros, and since end-to-end tests are typically somewhat
+ flaky, it would be highly likely for there to be false positives and CI backlogs with many CI pipelines.
+ - We do not require versioned distros to do **CI** for several reasons. It is a steep
+ learning curve to understand our automated testing scripts. And it is considerable effort
+ to fully automate setup and teardown of a cluster, which is needed for CI. And, not everyone
+ has the time and money to run CI. We do not want to
+ discourage people from writing and sharing guides because of this.
+ - Versioned distro authors are free to run their own CI and let us know if there is breakage, but we
+ will not include them as commit hooks -- there cannot be so many commit checks that it is impossible
+ to pass them all.
+ - We prefer a single Configuration Management tool for development distros. If there were more
+ than one, the core developers would have to learn multiple tools and update config in multiple
+ places. **Saltstack** happens to be the one we picked when we started the project. We
+ welcome versioned distros that use any tool; there are already examples of
+ CoreOS Fleet, Ansible, and others.
+ - You can still run code from head or your own branch
+ if you use another Configuration Management tool -- you just have to do some manual steps
+ during testing and deployment.
+
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/writing-a-getting-started-guide.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/contributors/devel/writing-good-e2e-tests.md b/contributors/devel/writing-good-e2e-tests.md
new file mode 100644
index 00000000..ab13aff2
--- /dev/null
+++ b/contributors/devel/writing-good-e2e-tests.md
@@ -0,0 +1,235 @@
+# Writing good e2e tests for Kubernetes #
+
+## Patterns and Anti-Patterns ##
+
+### Goals of e2e tests ###
+
+Beyond the obvious goal of providing end-to-end system test coverage,
+there are a few less obvious goals that you should bear in mind when
+designing, writing and debugging your end-to-end tests. In
+particular, "flaky" tests, which pass most of the time but fail
+intermittently for difficult-to-diagnose reasons, are extremely costly
+in terms of blurring our regression signals and slowing down our
+automated merge queue. Up-front time and effort designing your test
+to be reliable is very well spent. Bear in mind that we have hundreds
+of tests, each running in dozens of different environments, and if any
+test in any test environment fails, we have to assume that we
+potentially have some sort of regression. So if a significant number
+of tests fail even just 1% of the time, basic statistics dictates that
+we will almost never have a "green" regression indicator (for example,
+if 300 tests each pass 99% of the time, a full run is green only about
+0.99^300, i.e. roughly 5%, of the time). Stated another way, writing a
+test that is only 99% reliable is just about useless in the harsh
+reality of a CI environment. In fact it's worse than useless, because
+not only does it not provide a reliable regression indicator, it also
+costs a lot of subsequent debugging time and delays merges.
+
+#### Debuggability ####
+
+If your test fails, it should report the reasons for the failure in
+its output in as much detail as possible. "Timeout" is not a useful error
+message. "Timed out after 60 seconds waiting for pod xxx to enter
+running state, still in pending state" is much more useful to someone
+trying to figure out why your test failed and what to do about it.
+Specifically,
+[assertion](https://onsi.github.io/gomega/#making-assertions) code
+like the following generates rather useless errors:
+
+```
+Expect(err).NotTo(HaveOccurred())
+```
+
+Rather,
+[annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:
+
+```
+Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
+```
+
+On the other hand, overly verbose logging, particularly of non-error conditions, can make
+it unnecessarily difficult to figure out whether a test failed and, if
+so, why. So don't log lots of irrelevant stuff either.
+
+#### Ability to run in non-dedicated test clusters ####
+
+To reduce end-to-end delay and improve resource utilization when
+running e2e tests, we try, where possible, to run large numbers of
+tests in parallel against the same test cluster. This means that:
+
+1. You should avoid making any assumption (implicit or explicit) that
+your test is the only thing running against the cluster. For example,
+making the assumption that your test can run a pod on every node in a
+cluster is not a safe assumption, as some other tests, running at the
+same time as yours, might have saturated one or more nodes in the
+cluster. Similarly, running a pod in the system namespace, and
+assuming that that will increase the count of pods in the system
+namespace by one is not safe, as some other test might be creating or
+deleting pods in the system namespace at the same time as your test.
+If you do legitimately need to write a test like that, make sure to
+label it ["\[Serial\]"](e2e-tests.md#kinds_of_tests) so that it's easy
+to identify, and not run in parallel with any other tests.
+1. You should avoid doing things to the cluster that make it difficult
+for other tests running at the same time to reliably do what they're
+trying to do. For example, rebooting nodes, disconnecting network interfaces,
+or upgrading cluster software as part of your test is likely to
+violate the assumptions that other tests might have made about a
+reasonably stable cluster environment. If you need to write such
+tests, please label them as
+["\[Disruptive\]"](e2e-tests.md#kinds_of_tests) so that it's easy to
+identify them, and not run them in parallel with other tests (a short
+sketch after this list shows how these labels appear in test names).
+1. You should avoid making assumptions about the Kubernetes API that
+are not part of the API specification, as your tests will break as
+soon as these assumptions become invalid. For example, relying on
+specific Events, Event reasons or Event messages will make your tests
+very brittle.
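+
+As a minimal illustration of the `[Serial]` and `[Disruptive]` labels mentioned
+above (the package name, resources and test bodies are hypothetical and shown
+only for shape), the labels are nothing more than substrings of the test's
+textual description:
+
+```
+package e2e
+
+import . "github.com/onsi/ginkgo"
+
+// "[Serial]" and "[Disruptive]" are plain text in the Describe/It strings;
+// the test runner matches on them to decide what may run in parallel.
+var _ = Describe("Cluster-wide pod density [Serial]", func() {
+    It("should schedule a pod on every schedulable node", func() {
+        // body assumes exclusive use of the cluster, hence [Serial]
+    })
+})
+
+var _ = Describe("Node restarts [Disruptive]", func() {
+    It("should reschedule pods after a node is rebooted", func() {
+        // body reboots a node, hence [Disruptive]
+    })
+})
+```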
+
+#### Speed of execution ####
+
+We have hundreds of e2e tests, some of which we run in serial, one
+after the other, in some cases. If each test takes just a few minutes
+to run, that very quickly adds up to many, many hours of total
+execution time. We try to keep such total execution time down to a
+few tens of minutes at most. Therefore, try (very hard) to keep the
+execution time of your individual tests below 2 minutes, ideally
+shorter than that. Concretely, adding inappropriately long 'sleep'
+statements or other gratuitous waits to tests is a killer. If under
+normal circumstances your pod enters the running state within 10
+seconds, and 99.9% of the time within 30 seconds, it would be
+gratuitous to wait 5 minutes for this to happen. Rather just fail
+after 30 seconds, with a clear error message as to why your test
+failed ("e.g. Pod x failed to become ready after 30 seconds, it
+usually takes 10 seconds"). If you do have a truly legitimate reason
+for waiting longer than that, or writing a test which takes longer
+than 2 minutes to run, comment very clearly in the code why this is
+necessary, and label the test as
+["\[Slow\]"](e2e-tests.md#kinds_of_tests), so that it's easy to
+identify and avoid in test runs that are required to complete
+in a timely manner (for example those that are run against every code
+submission before it is allowed to be merged).
+
+Note that completing within, say, 2 minutes only when the test
+passes is not generally good enough. Your test should also fail in a
+reasonable time. We have seen tests that, for example, wait up to 10
+minutes for each of several pods to become ready. Under good
+conditions these tests might pass within a few seconds, but if the
+pods never become ready (e.g. due to a system regression) they take a
+very long time to fail and typically cause the entire test run to time
+out, so that no results are produced. Again, this is a lot less
+useful than a test that fails reliably within a minute or two when the
+system is not working correctly.
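+
+As a rough sketch of this advice (assuming the usual Ginkgo/Gomega dot-imports
+and `"time"`; `podIsRunning` is a hypothetical helper, and the names and timings
+are illustrative only), prefer a bounded poll with an informative failure
+message over a long sleep:
+
+```
+// Poll for at most 30s (it normally takes ~10s) rather than sleeping for
+// minutes, and say exactly what was expected if the wait fails.
+It("should start the foobar pod promptly", func() {
+    Eventually(func() bool {
+        return podIsRunning("foobar") // hypothetical helper for this sketch
+    }, 30*time.Second, 1*time.Second).Should(BeTrue(),
+        "Pod foobar did not reach Running within 30s; it usually takes ~10s")
+})
+```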
+
+#### Resilience to relatively rare, temporary infrastructure glitches or delays ####
+
+Remember that your test will be run many thousands of
+times, at different times of day and night, probably on different
+cloud providers, under different load conditions. And often the
+underlying state of these systems is stored in eventually consistent
+data stores. So, for example, if a resource creation request is
+theoretically asynchronous, even if you observe it to be practically
+synchronous most of the time, write your test to assume that it's
+asynchronous (e.g. make the "create" call, and poll or watch the
+resource until it's in the correct state before proceeding).
+Similarly, don't assume that API endpoints are 100% available.
+They're not. Under high load conditions, API calls might temporarily
+fail or time-out. In such cases it's appropriate to back off and retry
+a few times before failing your test completely (in which case make
+the error message very clear about what happened, e.g. "Retried
+http://... 3 times - all failed with xxx"). Use the standard
+retry mechanisms provided in the libraries detailed below.
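+
+As a rough, illustrative sketch (assuming the usual Ginkgo/Gomega dot-imports
+and `"time"`; `listNodes` is a hypothetical helper, and in real tests you
+should prefer the retry/wait helpers in the libraries described below):
+
+```
+// Tolerate transient API failures by retrying a few times with a short
+// backoff, and make the final error say exactly what was retried.
+It("should list nodes even when the API is briefly flaky", func() {
+    var nodes []string
+    var err error
+    for attempt := 1; attempt <= 3; attempt++ {
+        if nodes, err = listNodes(); err == nil { // hypothetical helper
+            break
+        }
+        time.Sleep(time.Duration(attempt*5) * time.Second) // simple linear backoff
+    }
+    Expect(err).NotTo(HaveOccurred(), "Retried listing nodes 3 times - all attempts failed")
+    Expect(nodes).NotTo(BeEmpty())
+})
+```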
+
+### Some concrete tools at your disposal ###
+
+Obviously most of the above goals apply to many tests, not just yours.
+So we've developed a set of reusable test infrastructure, libraries
+and best practices to help you to do the right thing, or at least do
+the same thing as other tests, so that if that turns out to be the
+wrong thing, it can be fixed in one place, not hundreds.
+
+Here are a few pointers:
+
++ [E2e Framework](../../test/e2e/framework/framework.go):
+ Familiarise yourself with this test framework and how to use it.
+ Amongst other things, it automatically creates uniquely named namespaces
+ within which your tests can run to avoid name clashes, and reliably
+ automates cleaning up the mess after your test has completed (it
+ just deletes everything in the namespace). This helps to ensure
+ that tests do not leak resources. Note that deleting a namespace
+ (and by implication everything in it) is currently an expensive
+ operation. So the fewer resources you create, the less cleaning up
+ the framework needs to do, and the faster your test (and other
+ tests running concurrently with yours) will complete. Your tests
+ should always use this framework (a short sketch at the end of this
+ section shows the typical structure). Trying other home-grown
+ approaches to avoiding name clashes and resource leaks has proven
+ to be a very bad idea.
++ [E2e utils library](../../test/e2e/framework/util.go):
+ This handy library provides tons of reusable code for a host of
+ commonly needed test functionality, including waiting for resources
+ to enter specified states, safely and consistently retrying failed
+ operations, usefully reporting errors, and much more. Make sure
+ that you're familiar with what's available there, and use it.
+ Likewise, if you come across a generally useful mechanism that's
+ not yet implemented there, add it so that others can benefit from
+ your brilliance. In particular pay attention to the variety of
+ timeout- and retry-related constants at the top of that file. Always
+ try to reuse these constants rather than dreaming up your own
+ values. Even if the values there are not precisely what you would
+ like to use (timeout periods, retry counts etc), the benefit of
+ having them be consistent and centrally configurable across our
+ entire test suite typically outweighs your personal preferences.
++ **Follow the examples of stable, well-written tests:** Some of our
+ existing end-to-end tests are better written and more reliable than
+ others. A few examples of well-written tests include:
+ [Replication Controllers](../../test/e2e/rc.go),
+ [Services](../../test/e2e/service.go),
+ [Reboot](../../test/e2e/reboot.go).
++ [Ginkgo Test Framework](https://github.com/onsi/ginkgo): This is the
+ test library and runner upon which our e2e tests are built. Before
+ you write or refactor a test, read the docs and make sure that you
+ understand how it works. In particular be aware that every test is
+ uniquely identified and described (e.g. in test reports) by the
+ concatenation of its `Describe` clause and nested `It` clauses.
+ So for example `Describe("Pods", ...)... It("should be scheduled
+ with cpu and memory limits")` produces a sane test identifier and
+ descriptor `Pods should be scheduled with cpu and memory limits`,
+ which makes it clear what's being tested, and hence what's not
+ working if it fails. Other good examples include:
+
+```
+ CAdvisor should be healthy on every node
+```
+
+and
+
+```
+ Daemon set should run and stop complex daemon
+```
+
+By contrast (and these are real examples), the following are less
+useful test descriptors:
+
+```
+ KubeProxy should test kube-proxy
+```
+
+and
+
+```
+Nodes [Disruptive] Network when a node becomes unreachable
+[replication controller] recreates pods scheduled on the
+unreachable node AND allows scheduling of pods on a node after
+it rejoins the cluster
+```
+
+An improvement might be
+
+```
+Unreachable nodes are evacuated and then repopulated upon rejoining [Disruptive]
+```
+
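+A rough skeleton that puts the framework and the naming advice together might
+look like the following (a sketch only: the "foobar" resource is hypothetical,
+it assumes the usual imports of the e2e framework package and Ginkgo, and the
+exact framework field and helper names may differ between releases):
+
+```
+// The framework gives each test a uniquely named namespace and cleans it up
+// afterwards; the Describe/It strings concatenate into a clear, searchable
+// test descriptor.
+var _ = framework.KubeDescribe("Foobar", func() {
+    f := framework.NewDefaultFramework("foobar")
+
+    It("should create foobars that become ready within 2 minutes", func() {
+        ns := f.Namespace.Name
+        // ... create resources in ns with the framework's client, poll with a
+        // bounded timeout until they are ready, and assert with annotated
+        // Expect(...) calls ...
+        _ = ns
+    })
+})
+```
+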
+Note that opening issues for specific better tooling is welcome, and
+code implementing that tooling is even more welcome :-).
+
+
+<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/devel/writing-good-e2e-tests.md?pixel)]()
+<!-- END MUNGE: GENERATED_ANALYTICS -->