From b8c71ec88501b94c4f98d35304a9eefd582c3767 Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Thu, 16 Oct 2014 14:45:16 -0700
Subject: Separated user, dev, and design docs.

Renamed: logging.md -> devel/logging.m
Renamed: access.md -> design/access.md
Renamed: identifiers.md -> design/identifiers.md
Renamed: labels.md -> design/labels.md
Renamed: namespaces.md -> design/namespaces.md
Renamed: security.md -> design/security.md
Renamed: networking.md -> design/networking.md
Added abbreviated user-focused documents in place of most moved docs.
Added docs/README.md, which explains how the docs are organized.
Added short, user-oriented documentation on labels.
Added a glossary.
Fixed up some links.
---
 access.md      | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 identifiers.md |  90 +++++++++++++++++++++
 labels.md      |  68 ++++++++++++++++
 namespaces.md  | 193 ++++++++++++++++++++++++++++++++++++++++++++
 networking.md  | 107 +++++++++++++++++++++++++
 security.md    |  26 ++++++
 6 files changed, 732 insertions(+)
 create mode 100644 access.md
 create mode 100644 identifiers.md
 create mode 100644 labels.md
 create mode 100644 namespaces.md
 create mode 100644 networking.md
 create mode 100644 security.md

diff --git a/access.md b/access.md
new file mode 100644
index 00000000..7af64ac9
--- /dev/null
+++ b/access.md
@@ -0,0 +1,248 @@
+# K8s Identity and Access Management Sketch
+
+This document suggests a direction for identity and access management in the Kubernetes system.
+
+
+## Background
+
+High level goals are:
+ - Have a plan for how identity, authentication, and authorization will fit into the API.
+ - Have a plan for partitioning resources within a cluster between independent organizational units.
+ - Ease integration with existing enterprise and hosted scenarios.
+
+### Actors
+Each of these can act as normal users or attackers.
+ - External Users: People who are accessing applications running on K8s (e.g. a web site served by a webserver running in a container on K8s), but who do not have K8s API access.
+ - K8s Users: People who access the K8s API (e.g. create K8s API objects like Pods)
+ - K8s Project Admins: People who manage access for some K8s Users
+ - K8s Cluster Admins: People who control the machines, networks, or binaries that comprise a K8s cluster.
+ - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
+
+### Threats
+Both intentional attacks and accidental use of privilege are concerns.
+
+For both cases it may be useful to think about these categories differently:
+ - Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s.
+ - K8s API Path - attack by sending network messages to any K8s API endpoint.
+ - Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. Software errors in K8s system components and administrator error are some types of threat in this category.
+
+This document is primarily concerned with K8s API paths, and secondarily with Insider paths. The Application path also needs to be secure, but is not the focus of this document.
+
+### Assets to protect
+
+External User assets:
+ - Personal information like private messages, or images uploaded by External Users
+ - web server logs
+
+K8s User assets:
+ - External User assets of each K8s User
+ - things private to the K8s app, like:
+   - credentials for accessing other services (docker private repos, storage services, facebook, etc)
+   - SSL certificates for web servers
+   - proprietary data and code
+
+K8s Cluster assets:
+ - Assets of each K8s User
+ - Machine Certificates or secrets.
+ - The value of K8s cluster computing resources (cpu, memory, etc).
+
+This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins.
+
+### Usage environments
+Cluster in Small organization:
+ - K8s Admins may be the same people as K8s Users.
+ - few K8s Admins.
+ - prefer ease of use to fine-grained access control/precise accounting, etc.
+ - Product requirement that it be easy for a potential K8s Cluster Admin to try out setting up a simple cluster.
+
+Cluster in Large organization:
+ - K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles.
+ - K8s Users need to be protected from each other.
+ - Auditing of K8s User and K8s Admin actions important.
+ - flexible accurate usage accounting and resource controls important.
+ - Lots of automated access to APIs.
+ - Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure.
+
+Org-run cluster:
+ - organization that runs K8s master components is the same as the org that runs apps on K8s.
+ - Minions may be on-premises VMs or physical machines; Cloud VMs; or a mix.
+
+Hosted cluster:
+ - Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
+ - May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure.
+ - May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case.)
+ - Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded).
+
+K8s ecosystem services:
+ - There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case.
+
+Pod configs should be largely portable between Org-run and hosted configurations.
+
+
+# Design
+Related discussion:
+- https://github.com/GoogleCloudPlatform/kubernetes/issues/442
+- https://github.com/GoogleCloudPlatform/kubernetes/issues/443
+
+This doc describes two security profiles:
+ - Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring of accounts and policies. Protects from unauthorized users, but does not partition authorized users.
+ - Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure.
+
+The K8s distribution should include templates of config, and documentation, for the simple and enterprise profiles. The system should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two profiles, not a matrix.
+
+Features in this doc are divided into "Initial Features" and "Improvements". Initial Features would be candidates for version 1.00.
+
+## Identity
+### userAccount
+K8s will have a `userAccount` API object.
+- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs.
+- `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field.
+- `userAccount` is not related to the unix username of processes in Pods created by that userAccount.
+- `userAccount` API objects can have labels.
+
+The system may associate one or more Authentication Methods with a
+`userAccount` (but they are not formally part of the userAccount object.)
+In a simple deployment, the authentication method for a user might be an authentication token which is verified by a K8s server. In a more complex deployment, the authentication might be delegated to another system which is trusted by the K8s API to authenticate users, but where the authentication details are unknown to K8s.
+
+Initial Features:
+- there is no superuser `userAccount`
+- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this.
+- `userAccount` can have a default `namespace`. If an API call does not specify a `namespace`, the default `namespace` for that caller is assumed.
+- `userAccount` is global. A single human with access to multiple namespaces is recommended to only have one userAccount.
+
+Improvements:
+- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.
+
+Simple Profile:
+ - single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.
+
+Enterprise Profile:
+ - every human user has their own `userAccount`.
+ - `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
+ - each service using the API has its own `userAccount` too (e.g. `scheduler`, `repcontroller`).
+ - automated jobs to denormalize the LDAP group info into the local K8s userAccount file.
+
+### Unix accounts
+A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.
+
+Initially:
+- The unix accounts available in a container, and used by the processes running in a container, are those that are provided by the combination of the base operating system and the Docker manifest.
+- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts.
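One of the improvement ideas discussed here is per-container uid isolation. A minimal sketch, in Go, of how disjoint root-namespace uid blocks could be handed out per container; all constants and names are illustrative assumptions, not kubelet code:

```go
package main

import "fmt"

// Illustrative constants (assumptions, not kubelet behavior): reserve a
// fixed-size, non-overlapping block of root-namespace uids per container.
const (
	uidBase      = 100000 // first uid reserved for container use
	uidBlockSize = 65536  // one 16-bit uid range per container
)

// uidBlockFor returns the inclusive [first, last] root-namespace uid range
// for the nth container on a node. Blocks are disjoint by construction, so
// uids used inside one container never overlap another's.
func uidBlockFor(n int) (first, last int) {
	first = uidBase + n*uidBlockSize
	return first, first + uidBlockSize - 1
}

func main() {
	for n := 0; n < 3; n++ {
		first, last := uidBlockFor(n)
		fmt.Printf("container %d: uids %d-%d\n", n, first, last)
	}
}
```

Because the blocks never overlap, a process that escapes one container cannot collide with uids owned by another, which is the defense-in-depth property the improvement below is after.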
+
+Improvements:
+- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
+- requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids.
+- any features that help users avoid use of privileged containers (https://github.com/GoogleCloudPlatform/kubernetes/issues/391)
+
+### Namespaces
+K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies.
+
+Namespaces are described in [namespaces.md](namespaces.md).
+
+In the Enterprise Profile:
+ - a `userAccount` may have permission to access several `namespace`s.
+
+In the Simple Profile:
+ - There is a single `namespace` used by the single user.
+
+Namespaces vs. userAccounts vs. Labels:
+- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
+- `labels` (see [docs/labels.md](labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
+- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.
+
+
+## Authentication
+
+Goals for K8s authentication:
+- Include a built-in authentication system with no configuration required to use in single-user mode, little configuration required to add several user accounts, and no https proxy required.
+- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems.
The kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users.
+ - For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
+ - So the proxy can terminate SSL, and isolate the CA-signed certificate from the less trusted, higher-touch APIserver.
+ - For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
+- Avoid mixing authentication and authorization, so that authorization policies can be centrally managed, and to allow changes in authentication methods without affecting authorization code.
+
+Initially:
+- Tokens used to authenticate a user.
+- Long-lived tokens identify a particular `userAccount`.
+- Administrator utility generates tokens at cluster setup.
+- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
+- No scopes for tokens. Authorization happens in the API server.
+- Tokens dynamically generated by apiserver to identify pods which are making API calls.
+- Tokens checked in a module of the APIserver.
+- Authentication in apiserver can be disabled by flag, to allow testing without authorization enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller.
+
+Improvements:
+- Refresh of tokens.
+- SSH keys to access inside containers.
+
+To be considered for subsequent versions:
+- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
+- Scoped tokens.
+- Tokens that are bound to the channel between the client and the api server
+ - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
+ - http://www.browserauth.net
+
+
+## Authorization
+
+K8s authorization should:
+- Allow for a range of maturity levels, from single-user for those test driving the system, to integration with existing enterprise authorization systems.
+- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service).
+- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults.
+- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Controllers, Services, and the identities and policies for those Pods and Controllers.
+- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies.
+
+K8s will implement a relatively simple
+[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
+The model will be described in more detail in a forthcoming document. The model will:
+- Be less complex than XACML
+- Be easily recognizable to those familiar with Amazon IAM Policies.
+- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control.
+
+Authorization policy is set by creating a set of Policy objects.
+
+The API Server will be the Enforcement Point for Policy.
For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do to that resource, etc) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance).
+
+Policy objects may be applicable only to a single namespace; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.
+
+
+## Accounting
+
+The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources.md](resources.md)).
+
+Initially:
+- a `quota` object is immutable.
+- for hosted K8s systems that do billing, a Project is the recommended level for billing accounts.
+- Every object that consumes resources should have a `namespace` so that Resource usage stats can be rolled up to the `namespace` level.
+- K8s Cluster Admin sets quota objects by writing a config file.
+
+Improvements:
+- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
+- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let the "webserver" namespace use 100 cores, but, to prevent accidents, don't allow the "webserver" namespace and "instance=test" to use more than 10 cores).
+- tools to help write consistent quota config files based on number of minions, historical namespace usages, QoS needs, etc.
+- way for K8s Cluster Admin to incrementally adjust Quota objects.
+
+Simple profile:
+ - a single `namespace` with infinite resource limits.
+
+Enterprise profile:
+ - multiple namespaces, each with their own limits.
+
+Issues:
+- need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations.
+
+
+## Audit Logging
+
+API actions can be logged.
+
+Initial implementation:
+- All API calls logged to nginx logs.
+
+Improvements:
+- API server does logging instead.
+- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions.

diff --git a/identifiers.md b/identifiers.md
new file mode 100644
index 00000000..1c0660c6
--- /dev/null
+++ b/identifiers.md
@@ -0,0 +1,90 @@
+# Identifiers and Names in Kubernetes
+
+A summary of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199).
+
+
+## Definitions
+
+UID
+: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.
+
+Name
+: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
+: An alphanumeric (a-z, A-Z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
+: One or more rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters
+
+[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
+: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination
+
+
+## Objectives for names and UIDs
+
+1. Uniquely identify (via a UID) an object across space and time
+
+2. Uniquely name (via a name) an object across space
+
+3. Provide human-friendly names in API operations and/or configuration files
+
+4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects
+
+5. Allow DNS names to be automatically generated for some objects
+
+
+## General design
+
+1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency.
+   * Examples: "guestbook.user", "backend-x4eb1"
+
+2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random).
+   * Example: "api.k8s.example.com"
+
+3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time.
+   * Example: "01234567-89ab-cdef-0123-456789abcdef"
+
+
+## Case study: Scheduling a pod
+
+Pods can be placed onto a particular node in a number of ways. This case
+study demonstrates how the above design can be applied to satisfy the
+objectives.
+
+### A pod scheduled by a user through the apiserver
+
+1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
+
+2. The apiserver validates the input.
+   1. A default Namespace is assigned.
+   2. The pod name must be space-unique within the Namespace.
+   3. Each container within the pod has a name which must be space-unique within the pod.
+
+3. The pod is accepted.
+   1. A new UID is assigned.
+
+4. The pod is bound to a node.
+   1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
+
+5. Kubelet validates the input.
+
+6. Kubelet runs the pod.
+   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
+   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
+      * This may correspond to Docker's container ID.
+
+### A pod placed by a config file on the node
+
+1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".
+
+2. Kubelet validates the input.
+   1. Since UID is not provided, kubelet generates one.
+   2. Since Namespace is not provided, kubelet generates one.
+      1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
+         * E.g. Namespace="file-f4231812554558a718a01ca942782d81"
+
+3. Kubelet runs the pod.
+   1. Each container is started up with enough metadata to distinguish the pod from whence it came.
+   2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
+      1. This may correspond to Docker's container ID.

diff --git a/labels.md b/labels.md
new file mode 100644
index 00000000..ff923931
--- /dev/null
+++ b/labels.md
@@ -0,0 +1,68 @@
+# Labels
+
+_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes.
+
+Each object can have a set of key/value labels set on it, with at most one label with a particular key.
+```
+"labels": {
+  "key1" : "value1",
+  "key2" : "value2"
+}
+```
+
+Unlike [names and UIDs](identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).
+
+Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
+
+Label selectors may also be used to associate policies with sets of objects.
+
+We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md).
+
+[Namespacing of label keys](https://github.com/GoogleCloudPlatform/kubernetes/issues/1491) is under discussion.
+
+Valid labels follow a slightly modified RFC952 format: 24 characters or less, all lowercase, begins with alpha, dashes (-) are allowed, and ends with alphanumeric.
+
+## Motivation
+
+Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier).
Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users. Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
+
+## Label selectors
+
+Label selectors permit very simple filtering by label keys and values. The simplicity of label selectors is deliberate. It is intended to facilitate transparency for humans, easy set overlap detection, efficient indexing, and reverse-indexing (i.e., finding all label selectors matching an object's labels - https://github.com/GoogleCloudPlatform/kubernetes/issues/1348).
+
+Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well.
+
+We are in the process of extending the label selection specification (see [selector.go](../blob/master/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms:
+```
+key1 in (value11, value12, ...)
+key1 not in (value11, value12, ...)
+key1 exists
+```
+
+LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future.
+
+Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
+- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
+- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.
+
+The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.
+
+For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.
+
+In the future, label selectors will be used to identify other types of distributed service workers, such as worker pool members or peers in a distributed application.
+
+Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods or containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.
+
+Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might target all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.
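The exact-match selection used in the canary scenario above can be sketched in a few lines of Go. This is a toy model of map-subset matching, not the actual Kubernetes selector implementation in `pkg/labels`:

```go
package main

import "fmt"

// Labels is a toy stand-in for an object's label map.
type Labels map[string]string

// Matches reports whether an object's labels satisfy a selector: every
// key/value pair in the selector must be present, extra labels are ignored.
func Matches(selector, labels Labels) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// The service's selector covers both stable and canary pods, because
	// both carry tier=frontend and environment=prod; track is not selected on.
	service := Labels{"tier": "frontend", "environment": "prod"}
	stable := Labels{"tier": "frontend", "environment": "prod", "track": "stable"}
	canary := Labels{"tier": "frontend", "environment": "prod", "track": "canary"}
	backend := Labels{"tier": "backend", "environment": "prod"}

	fmt.Println(Matches(service, stable))  // true
	fmt.Println(Matches(service, canary))  // true
	fmt.Println(Matches(service, backend)) // false
}
```

A `replicationController` selector that additionally requires `track=stable` would then exclude the canary pod, which is exactly how the 9-replica and 1-replica controllers partition the same service's pods.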
+ +Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on. + +Pods (and other objects) may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions. + +Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc. + +Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users and for clients and tools to determine what sets they belong to (i.e., they are reversible). OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to. + +## Labels vs. 
annotations + +We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](annotations.md). diff --git a/namespaces.md b/namespaces.md new file mode 100644 index 00000000..b80c6825 --- /dev/null +++ b/namespaces.md @@ -0,0 +1,193 @@ +# Kubernetes Proposal - Namespaces + +**Related PR:** + +| Topic | Link | +| ---- | ---- | +| Identifiers.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/1216 | +| Access.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/891 | +| Indexing | https://github.com/GoogleCloudPlatform/kubernetes/pull/1183 | +| Cluster Subdivision | https://github.com/GoogleCloudPlatform/kubernetes/issues/442 | + +## Background + +High level goals: + +* Enable an easy-to-use mechanism to logically scope Kubernetes resources +* Ensure extension resources to Kubernetes can share the same logical scope as core Kubernetes resources +* Ensure it aligns with access control proposal +* Ensure system has log n scale with increasing numbers of scopes + +## Use cases + +Actors: + +1. k8s admin - administers a kubernetes cluster +2. k8s service - k8s daemon operates on behalf of another user (i.e. controller-manager) +2. k8s policy manager - enforces policies imposed on k8s cluster +3. k8s user - uses a kubernetes cluster to schedule pods + +User stories: + +1. Ability to set immutable namespace to k8s resources +2. Ability to list k8s resource scoped to a namespace +3. Restrict a namespace identifier to a DNS-compatible string to support compound naming conventions +4. Ability for a k8s policy manager to enforce a k8s user's access to a set of namespaces +5. Ability to set/unset a default namespace for use by kubecfg client +6. Ability for a k8s service to monitor resource changes across namespaces +7. 
Ability for a k8s service to list resources across namespaces + +## Proposed Design + +### Model Changes + +Introduce a new attribute *Namespace* for each resource that must be scoped in a Kubernetes cluster. + +A *Namespace* is a DNS compatible subdomain. + +``` +// TypeMeta is shared by all objects sent to, or returned from the client +type TypeMeta struct { + Kind string `json:"kind,omitempty" yaml:"kind,omitempty"` + Uid string `json:"uid,omitempty" yaml:"uid,omitempty"` + CreationTimestamp util.Time `json:"creationTimestamp,omitempty" yaml:"creationTimestamp,omitempty"` + SelfLink string `json:"selfLink,omitempty" yaml:"selfLink,omitempty"` + ResourceVersion uint64 `json:"resourceVersion,omitempty" yaml:"resourceVersion,omitempty"` + APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"` + Namespace string `json:"namespace,omitempty" yaml:"namespace,omitempty"` + Name string `json:"name,omitempty" yaml:"name,omitempty"` +} +``` + +An identifier, *UID*, is unique across time and space, intended to distinguish between historical occurrences of similar entities. + +A *Name* is unique within a given *Namespace* at a particular time, used in resource URLs; provided by clients at creation time +and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish +distinct entities, and reference particular entities across operations. + +As of this writing, the following resources MUST have a *Namespace* and *Name* + +* pod +* service +* replicationController +* endpoint + +A *policy* MAY be associated with a *Namespace*. + +If a *policy* has an associated *Namespace*, the resource paths it enforces are scoped to a particular *Namespace*. + +## k8s API server + +In support of namespace isolation, the Kubernetes API server will address resources by the following conventions: + +The typical actors for the following requests are the k8s user or the k8s service. 
+ +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /api/{version}/ns/{ns}/{resourceType}/ | Create instance of {resourceType} in namespace {ns} | +| GET | GET | /api/{version}/ns/{ns}/{resourceType}/{name} | Get instance of {resourceType} in namespace {ns} with {name} | +| UPDATE | PUT | /api/{version}/ns/{ns}/{resourceType}/{name} | Update instance of {resourceType} in namespace {ns} with {name} | +| DELETE | DELETE | /api/{version}/ns/{ns}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {ns} with {name} | +| LIST | GET | /api/{version}/ns/{ns}/{resourceType} | List instances of {resourceType} in namespace {ns} | +| WATCH | GET | /api/{version}/watch/ns/{ns}/{resourceType} | Watch for changes to a {resourceType} in namespace {ns} | + +The typical actors for the following requests are the k8s service or the k8s admin, as enforced by k8s Policy. + +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces | +| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces | + +The legacy API patterns for k8s are aliases for interacting with the *default* namespace, as follows. + +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /api/{version}/{resourceType}/ | Create instance of {resourceType} in namespace *default* | +| GET | GET | /api/{version}/{resourceType}/{name} | Get instance of {resourceType} in namespace *default* | +| UPDATE | PUT | /api/{version}/{resourceType}/{name} | Update instance of {resourceType} in namespace *default* | +| DELETE | DELETE | /api/{version}/{resourceType}/{name} | Delete instance of {resourceType} in namespace *default* | + +The k8s API server verifies the *Namespace* on resource creation matches the *{ns}* on the path. 
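The path conventions in the tables above can be sketched as small helpers in Go (illustrative only: the `v1beta1` version string and the `resourcePath`/`legacyPath` names are examples, not part of the proposal, and the trailing slash on the CREATE collection path is omitted for simplicity):

```go
package main

import "fmt"

// resourcePath builds a namespaced resource path following the conventions
// above. An empty name yields the collection path used by CREATE and LIST.
func resourcePath(version, ns, resourceType, name string) string {
	p := fmt.Sprintf("/api/%s/ns/%s/%s", version, ns, resourceType)
	if name != "" {
		p += "/" + name
	}
	return p
}

// legacyPath builds the pre-namespace path, which aliases the *default* namespace.
func legacyPath(version, resourceType, name string) string {
	return fmt.Sprintf("/api/%s/%s/%s", version, resourceType, name)
}

func main() {
	fmt.Println(resourcePath("v1beta1", "ns1", "pods", "nginx")) // /api/v1beta1/ns/ns1/pods/nginx
	fmt.Println(legacyPath("v1beta1", "pods", "nginx"))          // /api/v1beta1/pods/nginx
}
```

Keeping both builders side by side makes the aliasing explicit: `legacyPath` addresses the same resource as `resourcePath` with `{ns}` fixed to *default*.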
+ +The k8s API server will enable efficient mechanisms to filter model resources based on the *Namespace*. This may require +the creation of an index on *Namespace* that could support query by namespace with optional label selectors. + +If the end-user has not populated the *Namespace* of a resource, the k8s API server will associate it with the *Namespace* +context of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request, +then the k8s API server will reject the request. + +TODO: Update to discuss k8s api server proxy patterns + +## k8s storage + +A namespace provides a unique identifier space and therefore must be in the storage path of a resource. + +In etcd, we want to continue to support efficient WATCH across namespaces. + +Resources that persist content in etcd will have storage paths as follows: + +/registry/{resourceType}/{resource.Namespace}/{resource.Name} + +This enables a k8s service to WATCH /registry/{resourceType} for changes to a particular {resourceType} across all namespaces. + +Upon scheduling a pod to a particular host, the pod's namespace must be in the key path as follows: + +/host/{host}/pod/{pod.Namespace}/{pod.Name} + +## k8s Authorization service + +This design assumes the existence of an authorization service that filters incoming requests to the k8s API Server in order +to enforce user authorization to a particular k8s resource. It performs this action by associating the *subject* of a request, +and the HTTP path and verb it targets, with a *policy*. This design encodes the *namespace* in the resource path in order to enable +external policy servers to function by resource path alone. If a request is made by an identity that is not allowed by +policy to the resource, the request is terminated. Otherwise, it is forwarded to the apiserver. 
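The storage-path scheme from the `k8s storage` section above can be sketched the same way (the function names here are illustrative, not part of the proposal):

```go
package main

import "fmt"

// registryKey builds the etcd key for a resource. Placing the namespace
// *after* the resource type is what keeps a WATCH on /registry/{resourceType}
// efficient across all namespaces.
func registryKey(resourceType, namespace, name string) string {
	return fmt.Sprintf("/registry/%s/%s/%s", resourceType, namespace, name)
}

// boundPodKey builds the per-host key used once a pod is scheduled.
func boundPodKey(host, namespace, name string) string {
	return fmt.Sprintf("/host/%s/pod/%s/%s", host, namespace, name)
}

func main() {
	fmt.Println(registryKey("pods", "ns1", "nginx"))     // /registry/pods/ns1/nginx
	fmt.Println(boundPodKey("minion-1", "ns1", "nginx")) // /host/minion-1/pod/ns1/nginx
}
```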
+ +## k8s controller-manager + +The controller-manager will provision pods in the same namespace as the associated replicationController. + +## k8s Kubelet + +There is no major change to the kubelet introduced by this proposal. + +### kubecfg client + +kubecfg supports the following: + +``` +kubecfg [OPTIONS] ns {namespace} +``` + +To set a namespace to use across multiple operations: + +``` +$ kubecfg ns ns1 +``` + +To view the current namespace: + +``` +$ kubecfg ns +Using namespace ns1 +``` + +To reset to the default namespace: + +``` +$ kubecfg ns default +``` + +In addition, each kubecfg request may explicitly specify a namespace for the operation via the following OPTION: + +--ns + +When loading resource files specified by the -c OPTION, the kubecfg client will ensure the namespace is set in the +message body to match the client-specified default. + +If no default namespace is applied, the client will assume the following default namespace: + +* default + +The kubecfg client would store default namespace information in the same manner it caches authentication information today, +as a file on the user's file system. + diff --git a/networking.md b/networking.md new file mode 100644 index 00000000..167b7382 --- /dev/null +++ b/networking.md @@ -0,0 +1,107 @@ +# Networking + +## Model and motivation + +Kubernetes deviates from the default Docker networking model. The goal is for each pod to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration. 
+ +OTOH, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts and reuse and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems. + +With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports are published to the host interface in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses. + +In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them. 
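The localhost behavior described above can be illustrated with a sketch in which a single Go process stands in for two containers of one pod -- one listening on the loopback interface, one dialing it. In a real pod these would be separate processes sharing a network namespace; this is purely illustrative:

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// pingLocalhost stands in for two containers in one pod: a listener
// ("container A") and a dialer ("container B") sharing the loopback
// interface of a single network namespace.
func pingLocalhost() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		io.WriteString(conn, "hello from container A")
		conn.Close()
	}()
	// In the shared namespace, localhost is the same interface for both sides,
	// so the server is reachable with no address translation at all.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	msg, err := io.ReadAll(conn)
	return string(msg), err
}

func main() {
	msg, err := pingLocalhost()
	if err != nil {
		panic(err)
	}
	fmt.Println(msg) // hello from container A
}
```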
+ +The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods, and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host. + +When any container calls SIOCGIFADDR, it sees the IP that any peer container would see it coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC. + +This is different from the standard Docker model. In that model, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connect coming from a different IP than the container itself knows. In short - you can never self-register anything from a container, because a container cannot be reached on its private IP. + +An alternative we considered was an additional layer of addressing: pod-centric IP per container. 
Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms. + +## Current implementation + +For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network. + +Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15). + +We start Docker with: + DOCKER_OPTS="--bridge cbr0 --iptables=false" + +We set up this bridge on each node with SaltStack, in [container_bridge.py](cluster/saltbase/salt/_states/container_bridge.py). + + cbr0: + container_bridge.ensure: + - cidr: {{ grains['cbr-cidr'] }} + ... + grains: + roles: + - kubernetes-pool + cbr-cidr: $MINION_IP_RANGE + +We make these addresses routable in GCE: + + gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \ + --norespect_terminal_width \ + --project ${PROJECT} \ + --network ${NETWORK} \ + --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} & + +The minion IP ranges are /24s in the 10-dot space. + +GCE itself does not know anything about these IPs, though. 
+ +These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. An external IP set up to forward to the VM will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).) + +We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container. + +Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode. + +1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name. + - creates a new network namespace (netns) and loopback device + - creates a new pair of veth devices and binds them to the netns + - auto-assigns an IP from docker’s IP range + +2. Create the user containers and specify the name of the network container as their “net” argument. Docker finds the PID of the command running in the network container and attaches to the netns of that PID. + +### Other networking implementation examples +With the primary aim of providing the IP-per-pod model, other implementations exist to serve the same purpose outside of GCE. + - [OpenVSwitch with GRE/VxLAN](../ovs-networking.md) + - [Flannel](https://github.com/coreos/flannel#flannel) + +## Challenges and future work + +### Docker API + +Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow. 
+ +### External IP assignment + +We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). + +### Naming, discovery, and load balancing + +In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to set up DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically. + +[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. 
We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol. + +We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier. + +### External routability + +We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers). 
+ +We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP. + +So we end up with 3 cases: + +1. Container -> Container or Container <-> VM. These should use 10. addresses directly and there should be no NAT. + +2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There are actually two layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with iptables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1. + +3. Internet -> Container. This also has to go through the primary host IP and ideally also has two levels of NAT. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port. + +Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address. + +### IPv6 + +IPv6 would also be a nice option, but we can't depend on it yet. 
Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-) diff --git a/security.md b/security.md new file mode 100644 index 00000000..22034bdf --- /dev/null +++ b/security.md @@ -0,0 +1,26 @@ +# Security in Kubernetes + +General design principles and guidelines related to security of containers, APIs, and infrastructure in Kubernetes. + + +## Objectives + +1. Ensure a clear isolation between a container and the underlying host it runs on +2. Limit the ability of the container to negatively impact the infrastructure or other containers +3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components +4. Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components + + +## Design Points + +### Isolate the data store from the minions and supporting infrastructure + +Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the cluster. 
+ +As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply appropriate authorization and authentication of change requests. In the future, etcd may offer granular access control, but that granularity will require an administrator to understand the schema of the data to properly apply security. An administrator must be able to properly secure Kubernetes at a policy level, rather than at an implementation level, and schema changes over time should not risk unintended security leaks. + +Both the Kubelet and Kube Proxy need information related to their specific roles - for the Kubelet, the set of pods it should be running, and for the Proxy, the set of services and endpoints to load balance. The Kubelet also needs to provide information about running pods and historical termination data. The access pattern for both Kubelet and Proxy to load their configuration is an efficient "wait for changes" request over HTTP. It should be possible to limit the Kubelet and Proxy to only access the information they need to perform their roles and no more. + +The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pod's attached volumes. + +The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. 
At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). \ No newline at end of file -- cgit v1.2.3 From d5bbcd262cf01b76475426aa0100f012f7471cc0 Mon Sep 17 00:00:00 2001 From: Meir Fischer Date: Sun, 9 Nov 2014 22:46:07 -0500 Subject: Fix bad selector file link --- labels.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/labels.md b/labels.md index ff923931..df904d3a 100644 --- a/labels.md +++ b/labels.md @@ -32,7 +32,7 @@ Label selectors permit very simple filtering by label keys and values. The simpl Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well. -We are in the process of extending the label selection specification (see [selector.go](../blob/master/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms: +We are in the process of extending the label selection specification (see [selector.go](/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms: ``` key1 in (value11, value12, ...) key1 not in (value11, value12, ...) 
-- cgit v1.2.3 From cc78c66a925dd4d35a683ffb50348403f5c2de06 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Tue, 25 Nov 2014 10:32:27 -0800 Subject: Convert gcutil to gcloud compute --- networking.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/networking.md b/networking.md index 167b7382..3f52d388 100644 --- a/networking.md +++ b/networking.md @@ -40,11 +40,12 @@ We set up this bridge on each node with SaltStack, in [container_bridge.py](clus We make these addresses routable in GCE: - gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \ - --norespect_terminal_width \ - --project ${PROJECT} \ - --network ${NETWORK} \ - --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} & + gcloud compute routes add "${MINION_NAMES[$i]}" \ + --project "${PROJECT}" \ + --destination-range "${MINION_IP_RANGES[$i]}" \ + --network "${NETWORK}" \ + --next-hop-instance "${MINION_NAMES[$i]}" \ + --next-hop-instance-zone "${ZONE}" & The minion IP ranges are /24s in the 10-dot space. -- cgit v1.2.3 From 3a3112c0e24b348c045adcfaba08ac57051f9d15 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 20 Nov 2014 14:27:11 +0800 Subject: Loosen DNS 952 for labels --- labels.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/labels.md b/labels.md index df904d3a..8415376d 100644 --- a/labels.md +++ b/labels.md @@ -18,9 +18,17 @@ Label selectors may also be used to associate policies with sets of objects. We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md). -[Namespacing of label keys](https://github.com/GoogleCloudPlatform/kubernetes/issues/1491) is under discussion. - -Valid labels follow a slightly modified RFC952 format: 24 characters or less, all lowercase, begins with alpha, dashes (-) are allowed, and ends with alphanumeric. 
+Valid label keys are composed of two segments - prefix and name - separated +by a slash (`/`). The name segment is required and must be a DNS label: 63 +characters or less, all lowercase, beginning and ending with an alphanumeric +character (`[a-z0-9]`), with dashes (`-`) and alphanumerics between. The +prefix and slash are optional. If specified, the prefix must be a DNS +subdomain (a series of DNS labels separated by dots (`.`), not longer than 253 +characters in total). + +If the prefix is omitted, the label key is presumed to be private to the user. +System components which use labels must specify a prefix. The `kubernetes.io` +prefix is reserved for kubernetes core components. ## Motivation -- cgit v1.2.3 From 5de98eeb18e6714216edc35bb6fb9fe220e7878b Mon Sep 17 00:00:00 2001 From: Sam Ghods Date: Sun, 30 Nov 2014 21:31:52 -0800 Subject: Remove unused YAML tags and GetYAML/SetYAML methods Unneeded after move to ghodss/yaml. --- namespaces.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/namespaces.md b/namespaces.md index b80c6825..761daa1a 100644 --- a/namespaces.md +++ b/namespaces.md @@ -48,14 +48,14 @@ A *Namespace* is a DNS compatible subdomain. 
``` // TypeMeta is shared by all objects sent to, or returned from the client type TypeMeta struct { - Kind string `json:"kind,omitempty" yaml:"kind,omitempty"` - Uid string `json:"uid,omitempty" yaml:"uid,omitempty"` - CreationTimestamp util.Time `json:"creationTimestamp,omitempty" yaml:"creationTimestamp,omitempty"` - SelfLink string `json:"selfLink,omitempty" yaml:"selfLink,omitempty"` - ResourceVersion uint64 `json:"resourceVersion,omitempty" yaml:"resourceVersion,omitempty"` - APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"` - Namespace string `json:"namespace,omitempty" yaml:"namespace,omitempty"` - Name string `json:"name,omitempty" yaml:"name,omitempty"` + Kind string `json:"kind,omitempty"` + Uid string `json:"uid,omitempty"` + CreationTimestamp util.Time `json:"creationTimestamp,omitempty"` + SelfLink string `json:"selfLink,omitempty"` + ResourceVersion uint64 `json:"resourceVersion,omitempty"` + APIVersion string `json:"apiVersion,omitempty"` + Namespace string `json:"namespace,omitempty"` + Name string `json:"name,omitempty"` } ``` -- cgit v1.2.3 From ada3dfce7d8fc274fb12958d3b5f36036e203b80 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Wed, 19 Nov 2014 10:17:12 -0500 Subject: Admission control proposal --- admission_control.md | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 145 insertions(+) create mode 100644 admission_control.md diff --git a/admission_control.md b/admission_control.md new file mode 100644 index 00000000..60e10198 --- /dev/null +++ b/admission_control.md @@ -0,0 +1,145 @@ +# Kubernetes Proposal - Admission Control + +**Related PR:** + +| Topic | Link | +| ----- | ---- | + +## Background + +High level goals: + +* Enable an easy-to-use mechanism to provide admission control to cluster +* Enable a provider to support multiple admission control strategies or author their own +* Ensure any rejected request can propagate errors back to the caller with why the request failed +* 
Enable usage of cluster resources to satisfy admission control criteria +* Enable admission controller criteria to change without requiring restart of kube-apiserver + +Policy is focused on answering whether a user is authorized to perform an action. + +Admission Control is focused on whether the system will accept an authorized action. + +The Kubernetes cluster may choose to dismiss an authorized action based on any number of admission control strategies its administrators choose to author and deploy: + +1. Quota enforcement of allocated desired usage +2. Pod black-lister to restrict running specific images on the cluster +3. Privileged container checker +4. Host port reservation +5. Volume validation - e.g. may or may not use hostDir, etc. +6. Min/max constraint checker for pod requested resources +7. ... + +This proposal therefore attempts to enumerate the basic design, and describe how any number of admission controllers could be injected. + +## kube-apiserver + +The kube-apiserver takes the following OPTIONAL arguments to enable admission control: + +| Option | Behavior | +| ------ | -------- | +| admission_controllers | List of addresses (ip:port, dns name) to invoke for admission control | +| admission_controller_service | Service label selector to resolve for admission control (namespace/labelKey/labelValue) | + +If the list of addresses to invoke for admission control is provided as a label selector, the kube-apiserver will update the list +of admission control services at a regular interval. + +Upon an incoming request, the kube-apiserver performs the following basic flow: + +1. Authorize the request; if authorized, continue +2. Invoke the Admission Control REST API for each defined address; if all return true, continue +3. RESTStorage processes request +4. Data is persisted in store + +If there is no configured admission control address, then by default, all requests are admitted. + +Admission control is enforced on POST/PUT operations, but is ignored on GET/DELETE operations. 
+ +## Admission Control REST API + +An admission controller satisfies a stable REST API invoked by the kube-apiserver to satisfy requests. + +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /admissionController | Send a request for admission to evaluate for admittance or denial | + +The message body to the admissionController includes the following: + +1. requesting user identity +2. action to perform +3. proposed resource to create/modify (if any) + +If the request for admission is satisfied, return a HTTP 200. + +If the request for admission is denied, return a HTTP 403, the response must include a reason for why the response failed. + +## System Design + +The following demonstrates potential cluster setups using an external list of admission control endpoints. + + Request + + + | + | + +---------------|----------+ + | API Server | | + |---------------|----------| + | v | + | +--------+ | + | | Policy | | +---------------------+---+ + | ++-------+ | |Endpoints | + +---------+ | | | |-------------------------| + |Scheduler|<---+| v | |E1. Quota Enforcer | + +---------+ | +----------------------+ | |E2. Capacity Planner | + | | Admission Controller +-------->|E3. White-lister | + | +----+-----------------+ | |... | + | | | +-------------------------+ + | +v-------------+ | + | | REST Storage | | + | +--------------+ | + +----------------+---------+ + | + v + +--------------+ + | Data Store | + |--------------| + | | + | | + | | + +--------------+ + +The following demonstrates potential cluster setup that uses services to fulfill admission control. + +In this context, the cluster itself is used to provide HA admission control, and pods may choose to +invoke the API Server to determine if a request is or is not admissible. + + + Request +--------+ + + |Pods... 
| + | |--------| + | | <-------+ + +---------------|----------+ | | | + | API Server | | | | | + |---------------|----------| +---+----+ | + | v | | | + | +--------+ |<---------------+ | + | | Policy | | +-------------------------------------+ + | ++-------+ | |Service (ns=infra, labelKey=admitter)| + +---------+ | | | |-------------------------------------| + |Scheduler|<---+| v | |Service1 | + +---------+ | +----------------------+ | |Service2 | + | | Admission Controller +-------->|Service3 | + | +----+-----------------+ | |... | + | | | +-------------------------------------+ + | +v-------------+ | + | | REST Storage | | + | +--------------+ | + +----------------+---------+ + | + v + +--------------+ + | Data Store | + |--------------| + | | + | | + | | + +--------------+ \ No newline at end of file -- cgit v1.2.3 From 14464583f81e60a36ef50082822aebf9fabe5ca3 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Mon, 15 Dec 2014 14:32:32 -0500 Subject: Version 2.0 of proposal --- admission_control.md | 171 ++++++++++++++++----------------------------------- 1 file changed, 53 insertions(+), 118 deletions(-) diff --git a/admission_control.md b/admission_control.md index 60e10198..e3f04894 100644 --- a/admission_control.md +++ b/admission_control.md @@ -4,6 +4,7 @@ | Topic | Link | | ----- | ---- | +| Separate validation from RESTStorage | https://github.com/GoogleCloudPlatform/kubernetes/issues/2977 | ## Background @@ -12,24 +13,16 @@ High level goals: * Enable an easy-to-use mechanism to provide admission control to cluster * Enable a provider to support multiple admission control strategies or author their own * Ensure any rejected request can propagate errors back to the caller with why the request failed -* Enable usage of cluster resources to satisfy admission control criteria -* Enable admission controller criteria to change without requiring restart of kube-apiserver -Policy is focused on answering if a user is authorized to perform an action. 
+Authorization via policy is focused on answering if a user is authorized to perform an action. Admission Control is focused on if the system will accept an authorized action. -The Kubernetes cluster may choose to dismiss an authorized action based on any number of admission control strategies they choose to author and deploy: +Kubernetes may choose to dismiss an authorized action based on any number of admission control strategies. -1. Quota enforcement of allocated desired usage -2. Pod black-lister to restrict running specific images on the cluster -3. Privileged container checker -4. Host port reservation -5. Volume validation - e.g. may or may not use hostDir, etc. -6. Min/max constraint checker for pod requested resources -7. ... +This proposal documents the basic design, and describes how any number of admission control plug-ins could be injected. -This proposal therefore attempts to enumerate the basic design, and describe how any number of admission controllers could be injected. +Implementation of specific admission control strategies are handled in separate documents. ## kube-apiserver @@ -37,109 +30,51 @@ The kube-apiserver takes the following OPTIONAL arguments to enable admission co | Option | Behavior | | ------ | -------- | -| admission_controllers | List of addresses (ip:port, dns name) to invoke for admission control | -| admission_controller_service | Service label selector to resolve for admission control (namespace/labelKey/labelValue) | - -If the list of addresses to invoke for admission control are provided as a label selector, the kube-apiserver will update the list -of admission control services at a regular interval. - -Upon an incoming request, the kube-apiserver performs the following basic flow: - -1. Authorize the request, if authorized, continue -2. Invoke the Admission Control REST API for each defined address, if all return true, continue -3. RESTStorage processes request -4. 
Data is persisted in store - -If there is no configured admission control address, then by default, all requests are admitted. - -Admission control is enforced on POST/PUT operations, but is ignored on GET/DELETE operations. - -## Admission Control REST API - -An admission controller satisfies a stable REST API invoked by the kube-apiserver to satisfy requests. - -| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /admissionController | Send a request for admission to evaluate for admittance or denial | - -The message body to the admissionController includes the following: - -1. requesting user identity -2. action to perform -3. proposed resource to create/modify (if any) - -If the request for admission is satisfied, return a HTTP 200. - -If the request for admission is denied, return a HTTP 403, the response must include a reason for why the response failed. - -## System Design - -The following demonstrates potential cluster setups using an external list of admission control endpoints. - - Request - + - | - | - +---------------|----------+ - | API Server | | - |---------------|----------| - | v | - | +--------+ | - | | Policy | | +---------------------+---+ - | ++-------+ | |Endpoints | - +---------+ | | | |-------------------------| - |Scheduler|<---+| v | |E1. Quota Enforcer | - +---------+ | +----------------------+ | |E2. Capacity Planner | - | | Admission Controller +-------->|E3. White-lister | - | +----+-----------------+ | |... | - | | | +-------------------------+ - | +v-------------+ | - | | REST Storage | | - | +--------------+ | - +----------------+---------+ - | - v - +--------------+ - | Data Store | - |--------------| - | | - | | - | | - +--------------+ - -The following demonstrates potential cluster setup that uses services to fulfill admission control. 
- -In this context, the cluster itself is used to provide HA admission control, and pods may choose to -invoke the API Server to determine if a request is or is not admissible. - - - Request +--------+ - + |Pods... | - | |--------| - | | <-------+ - +---------------|----------+ | | | - | API Server | | | | | - |---------------|----------| +---+----+ | - | v | | | - | +--------+ |<---------------+ | - | | Policy | | +-------------------------------------+ - | ++-------+ | |Service (ns=infra, labelKey=admitter)| - +---------+ | | | |-------------------------------------| - |Scheduler|<---+| v | |Service1 | - +---------+ | +----------------------+ | |Service2 | - | | Admission Controller +-------->|Service3 | - | +----+-----------------+ | |... | - | | | +-------------------------------------+ - | +v-------------+ | - | | REST Storage | | - | +--------------+ | - +----------------+---------+ - | - v - +--------------+ - | Data Store | - |--------------| - | | - | | - | | - +--------------+ \ No newline at end of file +| admission_control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. | +| admission_control_config_file | File with admission control configuration parameters to boot-strap plug-in. | + +An **AdmissionControl** plug-in is an implementation of the following interface: + +``` +package admission + +// Attributes is an interface used by a plug-in to make an admission decision on a individual request. +type Attributes interface { + GetClient() client.Interface + GetNamespace() string + GetKind() string + GetOperation() string + GetObject() runtime.Object +} + +// Interface is an abstract, pluggable interface for Admission Control decisions. +type Interface interface { + // Admit makes an admission decision based on the request attributes + // An error is returned if it denies the request. 
+ Admit(a Attributes) (err error) +} +``` + +A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name, and implementation +of admission.Interface. + +``` +func init() { + admission.RegisterPlugin("AlwaysDeny", func(config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) +} +``` + +Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations. + +This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow +will ensure the following: + +1. Incoming request +2. Authenticate user +3. Authorize user +4. If operation=create|update, then validate(object) +5. If operation=create|update|delete, then admissionControl.AdmissionControl(requestAttributes) + a. invoke each admission.Interface object in sequence +6. Object is persisted + +If at any step, there is an error, the request is canceled. -- cgit v1.2.3 From 606dcf108b265a6a886ac025b076789b9fbf27ff Mon Sep 17 00:00:00 2001 From: Clayton Coleman Date: Mon, 11 Aug 2014 21:23:37 -0400 Subject: Proposal: Isolate kubelet from etcd Discusses the current security risks posed by the kubelet->etcd pattern and discusses some options. Triggered by #846 and referenced in #859 --- isolation_between_nodes_and_master.md | 48 +++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 isolation_between_nodes_and_master.md diff --git a/isolation_between_nodes_and_master.md b/isolation_between_nodes_and_master.md new file mode 100644 index 00000000..a91927d8 --- /dev/null +++ b/isolation_between_nodes_and_master.md @@ -0,0 +1,48 @@ +# Design: Limit direct access to etcd from within Kubernetes + +All nodes have effective access of "root" on the entire Kubernetes cluster today because they have access to etcd, the central data store. 
The kubelet, the service proxy, and the nodes themselves have a connection to etcd that can be used to read or write any data in the system. In a cluster with many hosts, any container or user that gains the ability to write to the network device that can reach etcd, on any host, also gains that access.
+
+* The Kubelet and Kube Proxy currently rely on an efficient "wait for changes over HTTP" interface to get their current state and avoid missing changes
+  * This interface is implemented by etcd as the "watch" operation on a given key containing useful data
+
+
+## Options:
+
+1. Do nothing
+2. Introduce an HTTP proxy that limits the ability of nodes to access etcd
+  1. Prevent writes of data from the kubelet
+  2. Prevent reading data not associated with the client responsibilities
+  3. Introduce a security token granting access
+3. Introduce an API on the apiserver that returns the data a node Kubelet and Kube Proxy needs
+  1. Remove the ability of nodes to access etcd via network configuration
+  2. Provide an alternate implementation for the event writing code in the Kubelet
+  3. Implement efficient "watch for changes over HTTP" to offer comparable function with etcd
+  4. Ensure that the apiserver can scale at or above the capacity of the etcd system.
+  5. Implement authorization scoping for the nodes that limits the data they can view
+4. Implement granular access control in etcd
+  1. Authenticate HTTP clients with client certificates, tokens, or BASIC auth and authorize them for read only access
+  2. Allow read access of certain subpaths based on what the requestor's tokens are
+
+
+## Evaluation:
+
+Option 1 would be considered unacceptable for deployment in a multi-tenant or security conscious environment. It would be acceptable in a low security deployment where all software is trusted. It would be acceptable in proof of concept environments on a single machine.
+ +Option 2 would require implementing an http proxy that for 2-1 could block POST/PUT/DELETE requests (and potentially HTTP method tunneling parameters accepted by etcd). 2-2 would be more complicated and would require filtering operations based on deep understanding of the etcd API *and* the underlying schema. It would be possible, but involve extra software. + +Option 3 would involve extending the existing apiserver to return pods associated with a given node over an HTTP "watch for changes" mechanism, which is already implemented. Proper security would involve checking that the caller is authorized to access that data - one imagines a per node token, key, or SSL certificate that could be used to authenticate and then authorize access to only the data belonging to that node. The current event publishing mechanism from the kubelet would also need to be replaced with a secure API endpoint or a change to a polling model. The apiserver would also need to be able to function in a horizontally scalable mode by changing or fixing the "operations" queue to work in a stateless, scalable model. In practice, the amount of traffic even a large Kubernetes deployment would drive towards an apiserver would be tens of requests per second (500 hosts, 1 request per host every minute) which is negligible if well implemented. Implementing this would also decouple the data store schema from the nodes, allowing a different data store technology to be added in the future without affecting existing nodes. This would also expose that data to other consumers for their own purposes (monitoring, implementing service discovery). + +Option 4 would involve extending etcd to [support access control](https://github.com/coreos/etcd/issues/91). Administrators would need to authorize nodes to connect to etcd, and expose network routability directly to etcd. 
The mechanism for handling this authentication and authorization would be different from the authorization used by Kubernetes controllers and API clients. It would not be possible to completely replace etcd as a data store without also implementing a new Kubelet config endpoint.
+
+
+## Preferred solution:
+
+Implement the first parts of option 3 - an efficient watch API for the pod, service, and endpoints data for the Kubelet and Kube Proxy. Authorization and authentication are planned in the future - when a solution is available, implement a custom authorization scope that allows API access to be restricted to only the data about a single node or the service endpoint data.
+
+In general, option 4 is desirable in addition to option 3 as a mechanism to further secure the store to infrastructure components that must access it.
+
+
+## Caveats
+
+In all four options, compromise of a host will allow an attacker to imitate that host. For attack vectors that are reproducible from inside containers (privilege escalation), an attacker can distribute himself to other hosts by requesting new containers be spun up. In scenario 1, the cluster is totally compromised immediately. In 2-1, the attacker can view all information about the cluster including keys or authorization data defined with pods. In 2-2 and 3, the attacker must still distribute himself in order to get access to a large subset of information, and cannot see other data that is potentially located in etcd like side storage or system configuration. For attack vectors that are not exploits, but instead allow network access to etcd, an attacker in 2-2 has no ability to spread his influence, and is instead restricted to the subset of information on the host. For 3-5, they can do nothing they could not do already (request access to the nodes / services endpoint) because the token is not visible to them on the host.
+ -- cgit v1.2.3 From 84569936d964a0e2ad2ddeaf76001979f635d9b7 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Wed, 7 Jan 2015 12:35:38 -0800 Subject: Design doc for clustering. This is related to #2303 and steals from #2435. --- clustering.md | 56 +++++++++++++++++++++++++++++++++++++++++++++ clustering/.gitignore | 1 + clustering/Makefile | 16 +++++++++++++ clustering/README.md | 9 ++++++++ clustering/dynamic.png | Bin 0 -> 87530 bytes clustering/dynamic.seqdiag | 24 +++++++++++++++++++ clustering/static.png | Bin 0 -> 45845 bytes clustering/static.seqdiag | 16 +++++++++++++ 8 files changed, 122 insertions(+) create mode 100644 clustering.md create mode 100644 clustering/.gitignore create mode 100644 clustering/Makefile create mode 100644 clustering/README.md create mode 100644 clustering/dynamic.png create mode 100644 clustering/dynamic.seqdiag create mode 100644 clustering/static.png create mode 100644 clustering/static.seqdiag diff --git a/clustering.md b/clustering.md new file mode 100644 index 00000000..659bed7d --- /dev/null +++ b/clustering.md @@ -0,0 +1,56 @@ +# Clustering in Kubernetes + + +## Overview +The term "clustering" refers to the process of having all members of the kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address. + +Once a cluster is established, the following is true: + +1. **Master -> Node** The master needs to know which nodes can take work and what their current status is wrt capacity. + 1. **Location** The master knows the name and location of all of the nodes in the cluster. + 2. **Target AuthN** A way to securely talk to the kubelet on that node. Currently we call out to the kubelet over HTTP. This should be over HTTPS and the master should know what CA to trust for that node. + 3. 
**Caller AuthN/Z** Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though. +2. **Node -> Master** The nodes currently talk to the master to know which pods have been assigned to them and to publish events. + 1. **Location** The nodes must know where the master is at. + 2. **Target AuthN** Since the master is assigning work to the nodes, it is critical that they verify whom they are talking to. + 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to the master. Ideally this authentication is specific to each node so that authorization can be narrowly scoped. The details of the work to run (including things like environment variables) might be considered sensitive and should be locked down also. + +## Current Implementation + +A central authority (generally the master) is responsible for determining the set of machines which are members of the cluster. Calls to create and remove worker nodes in the cluster are restricted to this single authority, and any other requests to add or remove worker nodes are rejected. (1.i). + +Communication from the master to nodes is currently over HTTP and is not secured or authenticated in any way. (1.ii, 1.iii). + +The location of the master is communicated out of band to the nodes. For GCE, this is done via Salt. Other cluster instructions/scripts use other methods. (2.i) + +Currently most communication from the node to the master is over HTTP. When it is done over HTTPS there is currently no verification of the cert of the master (2.ii). + +Currently, the node/kubelet is authenticated to the master via a token shared across all nodes. This token is distributed out of band (using Salt for GCE) and is optional. If it is not present then the kubelet is unable to publish events to the master. (2.iii) + +Our current mix of out of band communication doesn't meet all of our needs from a security point of view and is difficult to set up and configure. 
+
+## Proposed Solution
+
+The proposed solution will provide a range of options for setting up and maintaining a secure Kubernetes cluster. We want to allow both for centrally controlled systems (leveraging pre-existing trust and configuration systems) and for more ad-hoc automagic systems that are incredibly easy to set up.
+
+The building blocks of an easier solution:
+
+* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will work to explicitly distributing and trusting the CAs that should be trusted for each link. We will also use client certificates for all AuthN.
+* [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate.
+  * **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors.
+* **Scoped Kubelet Accounts** These accounts are per-minion and (optionally) give a minion permission to register itself.
+* [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master.
+
+### Static Clustering
+
+In this sequence diagram there is an out of band admin entity that is creating all certificates and distributing them. It is also making sure that the kubelets know where to find the master. This provides for a lot of control but is more difficult to set up as lots of information must be communicated outside of Kubernetes.
+
+![Static Sequence Diagram](clustering/static.png)
+
+### Dynamic Clustering
+
+This diagram shows dynamic clustering using the bootstrap API endpoint. That API endpoint is used to both find the location of the master and communicate the root CA for the master.
+
+This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above. This manual intervention could be replaced by code that can verify the signing requests via other means.
+
+![Dynamic Sequence Diagram](clustering/dynamic.png)
diff --git a/clustering/.gitignore b/clustering/.gitignore
new file mode 100644
index 00000000..67bcd6cb
--- /dev/null
+++ b/clustering/.gitignore
@@ -0,0 +1 @@
+DroidSansMono.ttf
diff --git a/clustering/Makefile b/clustering/Makefile
new file mode 100644
index 00000000..3f95bc07
--- /dev/null
+++ b/clustering/Makefile
@@ -0,0 +1,16 @@
+FONT := DroidSansMono.ttf
+
+PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag))
+
+.PHONY: all
+all: $(PNGS)
+
+.PHONY: watch
+watch:
+	fswatch *.seqdiag | xargs -n 1 sh -c "make || true"
+
+$(FONT):
+	curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf
+
+%.png: %.seqdiag $(FONT)
+	seqdiag -a -f '$(FONT)' $<
diff --git a/clustering/README.md b/clustering/README.md
new file mode 100644
index 00000000..04abb1bc
--- /dev/null
+++ b/clustering/README.md
@@ -0,0 +1,9 @@
+This directory contains diagrams for the clustering design doc.
+
+This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with
+
+```bash
+pip install seqdiag
+```
+
+Just call `make` to regenerate the diagrams.
diff --git a/clustering/dynamic.png b/clustering/dynamic.png new file mode 100644 index 00000000..9f2ff9db Binary files /dev/null and b/clustering/dynamic.png differ diff --git a/clustering/dynamic.seqdiag b/clustering/dynamic.seqdiag new file mode 100644 index 00000000..95bb395e --- /dev/null +++ b/clustering/dynamic.seqdiag @@ -0,0 +1,24 @@ +seqdiag { + activation = none; + + + user[label = "Admin User"]; + bootstrap[label = "Bootstrap API\nEndpoint"]; + master; + kubelet[stacked]; + + user -> bootstrap [label="createCluster", return="cluster ID"]; + user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"]; + + user ->> master [label="start\n- bootstrap-cluster-uri"]; + master => bootstrap [label="setMaster\n- master-location\n- master-ca"]; + + user ->> kubelet [label="start\n- bootstrap-cluster-uri"]; + kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"]; + kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="retuns\n- kubelet-cert"]; + user => master [label="getSignRequests"]; + user => master [label="approveSignRequests"]; + kubelet <<-- master [label="returns\n- kubelet-cert"]; + + kubelet => master [label="register\n- kubelet-location"] +} diff --git a/clustering/static.png b/clustering/static.png new file mode 100644 index 00000000..a01ebbe8 Binary files /dev/null and b/clustering/static.png differ diff --git a/clustering/static.seqdiag b/clustering/static.seqdiag new file mode 100644 index 00000000..bdc54b76 --- /dev/null +++ b/clustering/static.seqdiag @@ -0,0 +1,16 @@ +seqdiag { + activation = none; + + admin[label = "Manual Admin"]; + ca[label = "Manual CA"] + master; + kubelet[stacked]; + + admin => ca [label="create\n- master-cert"]; + admin ->> master [label="start\n- ca-root\n- master-cert"]; + + admin => ca [label="create\n- kubelet-cert"]; + admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"]; + + kubelet => master [label="register\n- kubelet-location"]; +} 
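The pluggable CA approval policy described in the clustering proposal above (with stock policies `always-reject`, `queue`, and `insecure-always-approve`) could be modeled with a small interface. The Go types below are one possible sketch of that idea, not a real Kubernetes API:

```go
package main

import (
	"errors"
	"fmt"
)

// SigningRequest is a hypothetical certificate signing request from a kubelet.
type SigningRequest struct {
	Node string
	CSR  []byte
}

// ApprovalPolicy decides what to do with an incoming signing request.
// approved=true mints a cert immediately; queued=true parks the request
// for later review through an approval API; otherwise it is rejected.
type ApprovalPolicy interface {
	Evaluate(req SigningRequest) (approved bool, queued bool, err error)
}

type alwaysReject struct{}

func (alwaysReject) Evaluate(SigningRequest) (bool, bool, error) {
	return false, false, errors.New("signing requests are rejected by policy")
}

type insecureAlwaysApprove struct{}

func (insecureAlwaysApprove) Evaluate(SigningRequest) (bool, bool, error) {
	return true, false, nil
}

// queue parks every request; an admin (or an automated verifier, as the
// proposal suggests for cloud providers) approves or rejects it later.
type queue struct{ pending []SigningRequest }

func (q *queue) Evaluate(req SigningRequest) (bool, bool, error) {
	q.pending = append(q.pending, req)
	return false, true, nil
}

func main() {
	req := SigningRequest{Node: "node-1"}
	q := &queue{}
	for name, p := range map[string]ApprovalPolicy{
		"always-reject":           alwaysReject{},
		"insecure-always-approve": insecureAlwaysApprove{},
		"queue":                   q,
	} {
		approved, queued, err := p.Evaluate(req)
		fmt.Printf("%s: approved=%v queued=%v err=%v\n", name, approved, queued, err)
	}
	fmt.Println(len(q.pending)) // 1
}
```

Keeping the policy behind one interface is what makes the dynamic flow above pluggable: the manual-approval step in the sequence diagram corresponds to the `queue` implementation, and swapping in a cloud-provider verifier changes no other part of the CA.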
-- cgit v1.2.3 From 5c7bc51c532fe12fb14a4838b02c23998a69802c Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Thu, 8 Jan 2015 11:15:40 -0500 Subject: Update design doc with final PR merge --- admission_control.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/admission_control.md b/admission_control.md index e3f04894..88afda73 100644 --- a/admission_control.md +++ b/admission_control.md @@ -40,7 +40,6 @@ package admission // Attributes is an interface used by a plug-in to make an admission decision on a individual request. type Attributes interface { - GetClient() client.Interface GetNamespace() string GetKind() string GetOperation() string @@ -60,7 +59,7 @@ of admission.Interface. ``` func init() { - admission.RegisterPlugin("AlwaysDeny", func(config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) + admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) } ``` @@ -73,7 +72,7 @@ will ensure the following: 2. Authenticate user 3. Authorize user 4. If operation=create|update, then validate(object) -5. If operation=create|update|delete, then admissionControl.AdmissionControl(requestAttributes) +5. If operation=create|update|delete, then admission.Admit(requestAttributes) a. invoke each admission.Interface object in sequence 6. Object is persisted -- cgit v1.2.3 From 59e0bba24631462700ad9db6b41fecc730a807e7 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Fri, 9 Jan 2015 09:11:26 -0800 Subject: Tweaks based on comments --- clustering.md | 8 ++++++-- clustering/Makefile | 2 +- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/clustering.md b/clustering.md index 659bed7d..f447ef10 100644 --- a/clustering.md +++ b/clustering.md @@ -8,13 +8,16 @@ Once a cluster is established, the following is true: 1. **Master -> Node** The master needs to know which nodes can take work and what their current status is wrt capacity. 1. 
**Location** The master knows the name and location of all of the nodes in the cluster. + * For the purposes of this doc, location and name should be enough information so that the master can open a TCP connection to the Node. Most probably we will make this either an IP address or a DNS name. It is going to be important to be consistent here (master must be able to reach kubelet on that DNS name) so that we can verify certificates appropriately. 2. **Target AuthN** A way to securely talk to the kubelet on that node. Currently we call out to the kubelet over HTTP. This should be over HTTPS and the master should know what CA to trust for that node. - 3. **Caller AuthN/Z** Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though. + 3. **Caller AuthN/Z** This would be the master verifying itself (and permissions) when calling the node. Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though. 2. **Node -> Master** The nodes currently talk to the master to know which pods have been assigned to them and to publish events. 1. **Location** The nodes must know where the master is at. 2. **Target AuthN** Since the master is assigning work to the nodes, it is critical that they verify whom they are talking to. 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to the master. Ideally this authentication is specific to each node so that authorization can be narrowly scoped. The details of the work to run (including things like environment variables) might be considered sensitive and should be locked down also. +**Note:** While the description here refers to a singular Master, in the future we should enable multiple Masters operating in an HA mode. 
While the "Master" is currently the combination of the API Server, Scheduler and Controller Manager, we will restrict ourselves to thinking about the main API and policy engine -- the API Server.
+
 ## Current Implementation
 
 A central authority (generally the master) is responsible for determining the set of machines which are members of the cluster. Calls to create and remove worker nodes in the cluster are restricted to this single authority, and any other requests to add or remove worker nodes are rejected. (1.i).
@@ -35,10 +38,11 @@ The proposed solution will provide a range of options for setting up and maintai
 The building blocks of an easier solution:
 
-* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will work to explicitly distributing and trusting the CAs that should be trusted for each link. We will also use client certificates for all AuthN.
+* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN.
 * [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate.
   * **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors.
 * **Scoped Kubelet Accounts** These accounts are per-minion and (optionally) give a minion permission to register itself.
+ * To start with, we'd have the kubelets generate a cert/account in the form of `kubelet:`. To start we would then hard code policy such that we give that particular account appropriate permissions. Over time, we can make the policy engine more generic. * [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master. ### Static Clustering diff --git a/clustering/Makefile b/clustering/Makefile index 3f95bc07..c4095421 100644 --- a/clustering/Makefile +++ b/clustering/Makefile @@ -10,7 +10,7 @@ watch: fswatch *.seqdiag | xargs -n 1 sh -c "make || true" $(FONT): - curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf + curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT).ttf %.png: %.seqdiag $(FONT) seqdiag -a -f '$(FONT)' $< -- cgit v1.2.3 From bab87d954eded80b96f38ba9f38c4d3a32fd15d7 Mon Sep 17 00:00:00 2001 From: Clayton Coleman Date: Tue, 20 Jan 2015 13:55:17 -0500 Subject: Clarify name must be lowercase in docs, to match code We restrict DNS_SUBDOMAIN to lowercase for sanity. --- identifiers.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/identifiers.md b/identifiers.md index 1c0660c6..260c237a 100644 --- a/identifiers.md +++ b/identifiers.md @@ -12,10 +12,10 @@ Name : A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations. 
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL) -: An alphanumeric (a-z, A-Z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name +: An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name [rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN) -: One or more rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters +: One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters [rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID) : A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination -- cgit v1.2.3 From d0eebeeb6c173c6c2ad74d84cc2468b20a4d3d1f Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Thu, 18 Dec 2014 13:58:23 -0500 Subject: Resource controller proposal --- resource_controller.md | 231 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 231 insertions(+) create mode 100644 resource_controller.md diff --git a/resource_controller.md b/resource_controller.md new file mode 100644 index 00000000..2150f6dc --- /dev/null +++ b/resource_controller.md @@ -0,0 +1,231 @@ +# Kubernetes Proposal: ResourceController + +**Related PR:** + +| Topic | Link | +| ----- | ---- | +| Admission Control Proposal | https://github.com/GoogleCloudPlatform/kubernetes/pull/2501 | +| Separate validation from RESTStorage | https://github.com/GoogleCloudPlatform/kubernetes/issues/2977 | + +## Background + +This document proposes a system for enforcing resource limits as part of admission 
control. + +## Model Changes + +A new resource, **ResourceController**, is introduced to enumerate resource usage constraints scoped to a Kubernetes namespace. + +Authorized users are able to set the **ResourceController.Spec** fields to enumerate desired constraints. + +``` +// ResourceController is an enumerated set of resource constraints enforced as part of the admission control plug-in +type ResourceController struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + // Spec represents the imposed constraints for allowed resources + Spec ResourceControllerSpec `json:"spec,omitempty"` + // Status represents the observed allocated resources to inform constraints + Status ResourceControllerStatus `json:"status,omitempty"` +} + +type ResourceControllerSpec struct { + // Allowed represents the available resources allowed in a quota + Allowed ResourceList `json:"allowed,omitempty"` +} + +type ResourceControllerStatus struct { + // Allowed represents the available resources allowed in a quota + Allowed ResourceList `json:"allowed,omitempty"` + // Allocated represents the allocated resources leveraged against your quota + Allocated ResourceList `json:"allocated,omitempty"` +} + +// ResourceControllerList is a collection of resource controllers. +type ResourceControllerList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty"` + Items []ResourceController `json:"items"` +} +``` + +Authorized users are able to provide a **ResourceObservation** to control a **ResourceController.Status**.
+ +``` +// ResourceObservation is written by a resource-controller to update ResourceController.Status +type ResourceObservation struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Status represents the observed allocated resources to inform constraints + Status ResourceControllerStatus `json:"status,omitempty"` +} +``` + +## AdmissionControl plugin: ResourceLimits + +The **ResourceLimits** plug-in introspects all incoming admission requests. + +It makes decisions by introspecting the incoming object, current status, and enumerated constraints on **ResourceController**. + +The following constraints are proposed as enforceable: + +| Key | Type | Description | +| ------ | -------- | -------- | +| kubernetes.io/namespace/pods | int | Maximum number of pods per namespace | +| kubernetes.io/namespace/replicationControllers | int | Maximum number of replicationControllers per namespace | +| kubernetes.io/namespace/services | int | Maximum number of services per namespace | +| kubernetes.io/pods/containers | int | Maximum number of containers per pod | +| kubernetes.io/pods/containers/memory/max | int | Maximum amount of memory per container in a pod | +| kubernetes.io/pods/containers/memory/min | int | Minimum amount of memory per container in a pod | +| kubernetes.io/pods/containers/cpu/max | int | Maximum amount of CPU per container in a pod | +| kubernetes.io/pods/containers/cpu/min | int | Minimum amount of CPU per container in a pod | +| kubernetes.io/pods/cpu/max | int | Maximum CPU usage across all containers per pod | +| kubernetes.io/pods/cpu/min | int | Minimum CPU usage across all containers per pod | +| kubernetes.io/pods/memory/max | int | Maximum memory usage across all containers in pod | +| kubernetes.io/pods/memory/min | int | Minimum memory usage across all containers in pod | +| kubernetes.io/replicationController/replicas | int | Maximum number of replicas per replication controller | + +If the incoming resource would 
cause a violation of the enumerated constraints, the request is denied with a set of +messages explaining what constraints were the source of the denial. + +If a constraint is not enumerated by a **ResourceController** it is not tracked. + +If a constraint spans resources (for example, it tracks the total number of some **kind** in a **namespace**), +the plug-in will post a **ResourceObservation** with the new incremented **Allocated** usage for that constraint +using a compare-and-swap to ensure transactional integrity. It is possible that the allocated usage will be persisted +on a create operation, but the create can fail later in the request flow for some other unknown reason. For this scenario, +the allocated usage will appear greater than the actual usage; the **kube-resource-controller** is responsible for +synchronizing the observed allocated usage with actual usage. For delete requests, we will not decrement usage right away, +and will always rely on the **kube-resource-controller** to bring the observed value in line. This is needed until +etcd supports atomic transactions across multiple resources. + +## kube-apiserver + +The server is updated to be aware of **ResourceController** and **ResourceObservation** objects. + +The constraints are only enforced if the kube-apiserver is started as follows: + +``` +$ kube-apiserver -admission_control=ResourceLimits +``` + +## kube-resource-controller + +This is a new daemon that observes **ResourceController** objects in the cluster, and updates their status with current cluster state. + +The daemon runs a synchronization loop to do the following: + +For each resource controller, perform the following steps: + + 1. Reconcile **ResourceController.Status.Allowed** with **ResourceController.Spec.Allowed** + 2. Reconcile **ResourceController.Status.Allocated** with constraints enumerated in **ResourceController.Status.Allowed** + 3.
If there was a change, atomically update **ResourceObservation** to force an update to **ResourceController.Status** + +At step 1, the **kube-resource-controller** may support an administrator-supplied override to ensure that the +desired set of constraints does not conflict with any configured global constraints. For example, do not let +a **kubernetes.io/pods/memory/max** for any pod in any namespace exceed 8GB. These global constraints could be supplied +via an alternate location in **etcd**, for example, a **ResourceController** in an **infra** namespace that is populated on +bootstrap. + +At step 2, for fields that track total number of {kind} in a namespace, we query the cluster to ensure that the observed status +is in-line with the actual tracked status. This is a stop-gap until etcd supports transactions across resource updates, ensuring +that when a resource is deleted, we can update the observed status. + +## kubectl + +kubectl is modified to support the **ResourceController** resource. + +```kubectl describe``` provides a human-readable output of current constraints and usage in the namespace. + +For example, + +``` +$ kubectl namespace myspace +$ kubectl create -f examples/resource-controller/resource-controller.json +$ kubectl get resourceControllers +NAME LABELS +limits +$ kubectl describe resourceController limits +Name: limits +Key Enforced Allocated +---- ----- ---- +Max pods 15 13 +Max replication controllers 2 2 +Max services 5 0 +Max containers per pod 2 0 +Max replica size 10 0 +... +``` + +## Scenario: How this works in practice + +Admin user wants to impose resource constraints in namespace ```dev``` to enforce the following: + +1. A pod cannot use more than 8GB of RAM +2. The namespace cannot run more than 100 pods at a time.
+ +To enforce this constraint, the Admin does the following: + +``` +$ cat resource-controller.json +{ + "id": "limits", + "kind": "ResourceController", + "apiVersion": "v1beta1", + "spec": { + "allowed": { + "kubernetes.io/namespace/pods": 100, + "kubernetes.io/pods/memory/max": 8000 + } + }, + "labels": {} +} +$ kubectl namespace dev +$ kubectl create -f resource-controller.json +``` + +The **kube-resource-controller** sees that a new **ResourceController** resource was created, and updates its +status with the current observations in the namespace. + +The Admin describes the resource controller to see the current status: + +``` +$ kubectl describe resourceController limits +Name: limits +Key Enforced Allocated +---- ----- ---- +Max pods 100 50 +Max memory per pod 8000 4000 +``` + +The Admin sees that the current ```dev``` namespace is using 50 pods, and the largest pod consumes 4GB of RAM. + +The Developer who uses this namespace uses the system until he discovers he has exceeded his limits: + +``` +$ kubectl namespace dev +$ kubectl create -f pod.json +Unable to create pod. You have exceeded your max pods in the namespace of 100. +``` + +or + +``` +$ kubectl namespace dev +$ kubectl create -f pod.json +Unable to create pod. It exceeds the max memory usage per pod of 8000 MB. +``` + +The Developer can observe his constraints as appropriate: +``` +$ kubectl describe resourceController limits +Name: limits +Key Enforced Allocated +---- ----- ---- +Max pods 100 100 +Max memory per pod 8000 4000 +``` + +As a consequence, he can reduce his current number of running pods, or the memory requirements of the pod, to proceed. +Or he could contact the Admin for his namespace to allocate him more resources.
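The admission check exercised in this scenario can be sketched as follows. This is a hedged illustration: `admit`, `Constraints`, and `Usage` are invented names standing in for the plug-in logic and the **Allowed**/**Allocated** lists, not the proposed API.

```go
package main

import "fmt"

// Constraints and Usage are illustrative stand-ins for
// ResourceController.Spec.Allowed and ResourceController.Status.Allocated.
type Constraints map[string]int64
type Usage map[string]int64

// admit sketches the check the ResourceLimits plug-in would perform: reject a
// request when it would push allocated usage past an enumerated max constraint.
func admit(allowed Constraints, allocated Usage, key string, delta int64) error {
	max, tracked := allowed[key]
	if !tracked {
		return nil // constraints not enumerated by a ResourceController are not tracked
	}
	if allocated[key]+delta > max {
		return fmt.Errorf("%s limit of %d exceeded", key, max)
	}
	return nil
}

func main() {
	allowed := Constraints{"kubernetes.io/namespace/pods": 100}
	allocated := Usage{"kubernetes.io/namespace/pods": 100}

	// The namespace is already at its pod limit, so one more pod is denied.
	fmt.Println(admit(allowed, allocated, "kubernetes.io/namespace/pods", 1))
	// An untracked constraint is not enforced.
	fmt.Println(admit(allowed, allocated, "kubernetes.io/pods/containers", 1))
}
```

On denial, the real plug-in would collect such errors into the set of messages returned to the client.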
+ -- cgit v1.2.3 From 1203b0e6e4d4b83672ea999f158ef77cb9d7ad6b Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Thu, 22 Jan 2015 22:31:28 -0500 Subject: Design document for LimitRange --- admission_control_limit_range.md | 122 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 admission_control_limit_range.md diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md new file mode 100644 index 00000000..69fe144b --- /dev/null +++ b/admission_control_limit_range.md @@ -0,0 +1,122 @@ +# Admission control plugin: LimitRanger + +## Background + +This document proposes a system for enforcing min/max limits per resource as part of admission control. + +## Model Changes + +A new resource, **LimitRange**, is introduced to enumerate min/max limits for a resource type scoped to a +Kubernetes namespace. + +``` +const ( + // Limit that applies to all pods in a namespace + LimitTypePod string = "Pod" + // Limit that applies to all containers in a namespace + LimitTypeContainer string = "Container" +) + +// LimitRangeItem defines a min/max usage limit for any resource that matches on kind +type LimitRangeItem struct { + // Type of resource that this limit applies to + Type string `json:"type,omitempty"` + // Max usage constraints on this kind by resource name + Max ResourceList `json:"max,omitempty"` + // Min usage constraints on this kind by resource name + Min ResourceList `json:"min,omitempty"` +} + +// LimitRangeSpec defines a min/max usage limit for resources that match on kind +type LimitRangeSpec struct { + // Limits is the list of LimitRangeItem objects that are enforced + Limits []LimitRangeItem `json:"limits"` +} + +// LimitRange sets resource usage limits for each kind of resource in a Namespace +type LimitRange struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Spec defines the limits enforced + Spec LimitRangeSpec `json:"spec,omitempty"` +} + +// LimitRangeList is a 
list of LimitRange items. +type LimitRangeList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty"` + + // Items is a list of LimitRange objects + Items []LimitRange `json:"items"` +} +``` + +## AdmissionControl plugin: LimitRanger + +The **LimitRanger** plug-in introspects all incoming admission requests. + +It makes decisions by evaluating the incoming object against all defined **LimitRange** objects in the request context namespace. + +The following min/max limits are imposed: + +**Type: Container** + +| ResourceName | Description | +| ------------ | ----------- | +| cpu | Min/Max amount of cpu per container | +| memory | Min/Max amount of memory per container | + +**Type: Pod** + +| ResourceName | Description | +| ------------ | ----------- | +| cpu | Min/Max amount of cpu per pod | +| memory | Min/Max amount of memory per pod | + +If the incoming object would cause a violation of the enumerated constraints, the request is denied with a set of +messages explaining what constraints were the source of the denial. + +If a constraint is not enumerated by a **LimitRange** it is not tracked. + +## kube-apiserver + +The server is updated to be aware of **LimitRange** objects. + +The constraints are only enforced if the kube-apiserver is started as follows: + +``` +$ kube-apiserver -admission_control=LimitRanger +``` + +## kubectl + +kubectl is modified to support the **LimitRange** resource. + +```kubectl describe``` provides a human-readable output of limits. + +For example, + +``` +$ kubectl namespace myspace +$ kubectl create -f examples/limitrange/limit-range.json +$ kubectl get limits +NAME +limits +$ kubectl describe limits limits +Name: limits +Type Resource Min Max +---- -------- --- --- +Pod memory 1Mi 1Gi +Pod cpu 250m 2 +Container cpu 250m 2 +Container memory 1Mi 1Gi +``` + +## Future Enhancements: Define limits for a particular pod or container. 
+ +In the current proposal, the **LimitRangeItem** matches purely on **LimitRangeItem.Type**. + +It is expected we will want to define limits for particular pods or containers by name/uid and label/field selector. + +To make a **LimitRangeItem** more restrictive, we intend to add these additional restrictions at a future point in time. -- cgit v1.2.3 From a44f8f8aaa9177f8f1cdf7e37e74437695fc36fd Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Fri, 23 Jan 2015 12:38:59 -0500 Subject: ResourceQuota proposal --- admission_control_resource_quota.md | 146 ++++++++++++++++++++++++++++++++++++ 1 file changed, 146 insertions(+) create mode 100644 admission_control_resource_quota.md diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md new file mode 100644 index 00000000..c5cc60c4 --- /dev/null +++ b/admission_control_resource_quota.md @@ -0,0 +1,146 @@ +# Admission control plugin: ResourceQuota + +## Background + +This document proposes a system for enforcing hard resource usage limits per namespace as part of admission control. + +## Model Changes + +A new resource, **ResourceQuota**, is introduced to enumerate hard resource limits in a Kubernetes namespace. + +A new resource, **ResourceQuotaUsage**, is introduced to support atomic updates of a **ResourceQuota** status.
+ +``` +// The following identify resource constants for Kubernetes object types +const ( + // Pods, number + ResourcePods ResourceName = "pods" + // Services, number + ResourceServices ResourceName = "services" + // ReplicationControllers, number + ResourceReplicationControllers ResourceName = "replicationcontrollers" + // ResourceQuotas, number + ResourceQuotas ResourceName = "resourcequotas" +) + +// ResourceQuotaSpec defines the desired hard limits to enforce for Quota +type ResourceQuotaSpec struct { + // Hard is the set of desired hard limits for each named resource + Hard ResourceList `json:"hard,omitempty"` +} + +// ResourceQuotaStatus defines the enforced hard limits and observed use +type ResourceQuotaStatus struct { + // Hard is the set of enforced hard limits for each named resource + Hard ResourceList `json:"hard,omitempty"` + // Used is the current observed total usage of the resource in the namespace + Used ResourceList `json:"used,omitempty"` +} + +// ResourceQuota sets aggregate quota restrictions enforced per namespace +type ResourceQuota struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Spec defines the desired quota + Spec ResourceQuotaSpec `json:"spec,omitempty"` + + // Status defines the actual enforced quota and its current usage + Status ResourceQuotaStatus `json:"status,omitempty"` +} + +// ResourceQuotaUsage captures system observed quota status per namespace +// It is used to enforce atomic updates of a backing ResourceQuota.Status field in storage +type ResourceQuotaUsage struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Status defines the actual enforced quota and its current usage + Status ResourceQuotaStatus `json:"status,omitempty"` +} + +// ResourceQuotaList is a list of ResourceQuota items +type ResourceQuotaList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty"` + + // Items is a list of ResourceQuota objects + Items []ResourceQuota 
`json:"items"` +} + +``` + +## AdmissionControl plugin: ResourceQuota + +The **ResourceQuota** plug-in introspects all incoming admission requests. + +It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request +namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. + +The following resource limits are imposed as part of core Kubernetes: + +| ResourceName | Description | +| ------------ | ----------- | +| cpu | Total cpu usage | +| memory | Total memory usage | +| pods | Total number of pods | +| services | Total number of services | +| replicationcontrollers | Total number of replication controllers | +| resourcequotas | Total number of resource quotas | + +Any resource that is not part of core Kubernetes must follow the resource naming convention prescribed by Kubernetes. + +This means the resource must have a fully-qualified name (i.e. mycompany.org/shinynewresource) + +If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a +**ResourceQuotaUsage** document to the server to atomically update the observed usage based on the previously read +**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) +into the system. + +## kube-apiserver + +The server is updated to be aware of **ResourceQuota** objects. + +The quota is only enforced if the kube-apiserver is started as follows: + +``` +$ kube-apiserver -admission_control=ResourceQuota +``` + +## kube-controller-manager + +A new controller is defined that runs a synch loop to run usage stats across the namespace. + +If the observed usage is different than the recorded usage, the controller sends a **ResourceQuotaUsage** resource +to the server to atomically update. 
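The version-checked atomic update described above can be sketched with an in-memory stand-in for the etcd-backed record. `quotaStore` and its fields are illustrative assumptions, not the real storage API; the point is only the optimistic-concurrency pattern built on **ResourceQuota.ResourceVersion**.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaStore is an in-memory stand-in for the etcd-backed ResourceQuota
// record; the real system would round-trip a ResourceQuotaUsage document.
type quotaStore struct {
	resourceVersion int64
	used            int64
}

var errConflict = errors.New("resource version conflict")

// compareAndSwap applies an update only if the caller's previously read
// resource version is still current; otherwise the caller must re-read,
// recompute usage, and retry.
func (s *quotaStore) compareAndSwap(readVersion, newUsed int64) error {
	if s.resourceVersion != readVersion {
		return errConflict
	}
	s.used = newUsed
	s.resourceVersion++
	return nil
}

func main() {
	s := &quotaStore{resourceVersion: 1, used: 5}

	// Writer A read version 1; its increment is accepted and bumps the version.
	fmt.Println(s.compareAndSwap(1, 6))
	// Writer B also read version 1; its stale update is rejected and must retry.
	fmt.Println(s.compareAndSwap(1, 7))
}
```

This is the intentional bottleneck the text mentions: concurrent admissions serialize on the single version counter.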
+ +The synchronization loop frequency will control how quickly DELETE actions are recorded in the system and usage is ticked down. + +To optimize the synchronization loop, this controller will WATCH on Pod resources to track DELETE events, and in response, recalculate +usage. This is because a Pod deletion will have the most impact on observed cpu and memory usage in the system, and we anticipate +this being the resource most closely running at the prescribed quota limits. + +## kubectl + +kubectl is modified to support the **ResourceQuota** resource. + +```kubectl describe``` provides a human-readable output of quota. + +For example, + +``` +$ kubectl namespace myspace +$ kubectl create -f examples/resourcequota/resource-quota.json +$ kubectl get quota +NAME +myquota +$ kubectl describe quota myquota +Name: myquota +Resource Used Hard +-------- ---- ---- +cpu 100m 20 +memory 0 1.5Gb +pods 1 10 +replicationControllers 1 10 +services 2 3 +``` \ No newline at end of file -- cgit v1.2.3 From 24f580084eb3c2acf9a9ec7e83e6129cdc5065dc Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Fri, 23 Jan 2015 12:39:53 -0500 Subject: Remove resource_controller proposal --- resource_controller.md | 231 ------------------------------------------------- 1 file changed, 231 deletions(-) delete mode 100644 resource_controller.md diff --git a/resource_controller.md b/resource_controller.md deleted file mode 100644 index 2150f6dc..00000000 --- a/resource_controller.md +++ /dev/null @@ -1,231 +0,0 @@ -# Kubernetes Proposal: ResourceController - -**Related PR:** - -| Topic | Link | -| ----- | ---- | -| Admission Control Proposal | https://github.com/GoogleCloudPlatform/kubernetes/pull/2501 | -| Separate validation from RESTStorage | https://github.com/GoogleCloudPlatform/kubernetes/issues/2977 | - -## Background - -This document proposes a system for enforcing resource limits as part of admission control. 
- -## Model Changes - -A new resource, **ResourceController**, is introduced to enumerate resource usage constraints scoped to a Kubernetes namespace. - -Authorized users are able to set the **ResourceController.Spec** fields to enumerate desired constraints. - -``` -// ResourceController is an enumerated set of resources constraints enforced as part admission control plug-in -type ResourceController struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - // Spec represents the imposed constraints for allowed resources - Spec ResourceControllerSpec `json:"spec,omitempty"` - // Status represents the observed allocated resources to inform constraints - Status ResourceControllerStatus `json:"status,omitempty"` -} - -type ResourceControllerSpec struct { - // Allowed represents the available resources allowed in a quota - Allowed ResourceList `json:"allowed,omitempty"` -} - -type ResourceControllerStatus struct { - // Allowed represents the available resources allowed in a quota - Allowed ResourceList `json:"allowed,omitempty"` - // Allocated represents the allocated resources leveraged against your quota - Allocated ResourceList `json:"allocated,omitempty"` -} - -// ResourceControllerList is a collection of resource controllers. -type ResourceControllerList struct { - TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty"` - Items []ResourceController `json:"items"` -} -``` - -Authorized users are able to provide a **ResourceObservation** to control a **ResourceController.Status**. 
- -``` -// ResourceObservation is written by a resource-controller to update ResourceController.Status -type ResourceObservation struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - // Status represents the observed allocated resources to inform constraints - Status ResourceControllerStatus `json:"status,omitempty"` -} -``` - -## AdmissionControl plugin: ResourceLimits - -The **ResourceLimits** plug-in introspects all incoming admission requests. - -It makes decisions by introspecting the incoming object, current status, and enumerated constraints on **ResourceController**. - -The following constraints are proposed as enforceable: - -| Key | Type | Description | -| ------ | -------- | -------- | -| kubernetes.io/namespace/pods | int | Maximum number of pods per namespace | -| kubernetes.io/namespace/replicationControllers | int | Maximum number of replicationControllers per namespace | -| kubernetes.io/namespace/services | int | Maximum number of services per namespace | -| kubernetes.io/pods/containers | int | Maximum number of containers per pod | -| kubernetes.io/pods/containers/memory/max | int | Maximum amount of memory per container in a pod | -| kubernetes.io/pods/containers/memory/min | int | Minimum amount of memory per container in a pod | -| kubernetes.io/pods/containers/cpu/max | int | Maximum amount of CPU per container in a pod | -| kubernetes.io/pods/containers/cpu/min | int | Minimum amount of CPU per container in a pod | -| kubernetes.io/pods/cpu/max | int | Maximum CPU usage across all containers per pod | -| kubernetes.io/pods/cpu/min | int | Minimum CPU usage across all containers per pod | -| kubernetes.io/pods/memory/max | int | Maximum memory usage across all containers in pod | -| kubernetes.io/pods/memory/min | int | Minimum memory usage across all containers in pod | -| kubernetes.io/replicationController/replicas | int | Maximum number of replicas per replication controller | - -If the incoming resource would 
cause a violation of the enumerated constraints, the request is denied with a set of -messages explaining what constraints were the source of the denial. - -If a constraint is not enumerated by a **ResourceController** it is not tracked. - -If a constraint spans resources, for example, it tracks the total number of some **kind** in a **namespace**, -the plug-in will post a **ResourceObservation** with the new incremented **Allocated*** usage for that constraint -using a compare-and-swap to ensure transactional integrity. It is possible that the allocated usage will be persisted -on a create operation, but the create can fail later in the request flow for some other unknown reason. For this scenario, -the allocated usage will appear greater than the actual usage, the **kube-resource-controller** is responsible for -synchronizing the observed allocated usage with actual usage. For delete requests, we will not decrement usage right away, -and will always rely on the **kube-resource-controller** to bring the observed value in line. This is needed until -etcd supports atomic transactions across multiple resources. - -## kube-apiserver - -The server is updated to be aware of **ResourceController** and **ResourceObservation** objects. - -The constraints are only enforced if the kube-apiserver is started as follows: - -``` -$ kube-apiserver -admission_control=ResourceLimits -``` - -## kube-resource-controller - -This is a new daemon that observes **ResourceController** objects in the cluster, and updates their status with current cluster state. - -The daemon runs a synchronization loop to do the following: - -For each resource controller, perform the following steps: - - 1. Reconcile **ResourceController.Status.Allowed** with **ResourceController.Spec.Allowed** - 2. Reconcile **ResourceController.Status.Allocated** with constraints enumerated in **ResourceController.Status.Allowed** - 3. 
If there was a change, atomically update **ResourceObservation** to force an update to **ResourceController.Status** - -At step 1, allow the **kube-resource-controller** to support an administrator supplied override to enforce that what the -set of constraints desired to not conflict with any configured global constraints. For example, do not let -a **kubernetes.io/pods/memory/max** for any pod in any namespace exceed 8GB. These global constraints could be supplied -via an alternate location in **etcd**, for example, a **ResourceController** in an **infra** namespace that is populated on -bootstrap. - -At step 2, for fields that track total number of {kind} in a namespace, we query the cluster to ensure that the observed status -is in-line with the actual tracked status. This is a stop-gap to etcd supporting transactions across resource updates to ensure -that when a resource is deleted, we can update the observed status. - -## kubectl - -kubectl is modified to support the **ResourceController** resource. - -```kubectl describe``` provides a human-readable output of current constraints and usage in the namespace. - -For example, - -``` -$ kubectl namespace myspace -$ kubectl create -f examples/resource-controller/resource-controller.json -$ kubectl get resourceControllers -NAME LABELS -limits -$ kubectl describe resourceController limits -Name: limits -Key Enforced Allocated ----- ----- ---- -Max pods 15 13 -Max replication controllers 2 2 -Max services 5 0 -Max containers per pod 2 0 -Max replica size 10 0 -... -``` - -## Scenario: How this works in practice - -Admin user wants to impose resource constraints in namespace ```dev``` to enforce the following: - -1. A pod cannot use more than 8GB of RAM -2. The namespace cannot run more than 100 pods at a time. 
- -To enforce this constraint, the Admin does the following: - -``` -$ cat resource-controller.json -{ - "id": "limits", - "kind": "ResourceController", - "apiVersion": "v1beta1", - "spec": { - "allowed": { - "kubernetes.io/namespace/pods": 100, - "kubernetes.io/pods/memory/max": 8000, - } - }, - "labels": {} -} -$ kubectl namespace dev -$ kubectl create -f resource-controller.json -``` - -The **kube-resource-controller** sees that a new **ResourceController** resource was created, and updates its -status with the current observations in the namespace. - -The Admin describes the resource controller to see the current status: - -``` -$ kubectl describe resourceController limits -Name: limits -Key Enforced Allocated ----- ----- ---- -Max pods 100 50 -Max memory per pod 8000 4000 -```` - -The Admin sees that the current ```dev``` namespace is using 50 pods, and the largest pod consumes 4GB of RAM. - -The Developer that uses this namespace uses the system until he discovers he has exceeded his limits: - -``` -$ kubectl namespace dev -$ kubectl create -f pod.json -Unable to create pod. You have exceeded your max pods in the namespace of 100. -``` - -or - -``` -$ kubectl namespace dev -$ kubectl create -f pod.json -Unable to create pod. It exceeds the max memory usage per pod of 8000 MB. -``` - -The Developer can observe his constraints as appropriate: -``` -$ kubectl describe resourceController limits -Name: limits -Key Enforced Allocated ----- ----- ---- -Max pods 100 100 -Max memory per pod 8000 4000 -```` - -And as a consequence reduce his current number of running pods, or memory requirements of the pod to proceed. -Or he could contact the Admin for his namespace to allocate him more resources. 
- -- cgit v1.2.3 From 89f9224cc11190711f35230398dac3c30f590a06 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Fri, 23 Jan 2015 12:41:44 -0500 Subject: Doc tweaks --- admission_control_resource_quota.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index c5cc60c4..08bc6bec 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -77,7 +77,7 @@ The **ResourceQuota** plug-in introspects all incoming admission requests. It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. -The following resource limits are imposed as part of core Kubernetes: +The following resource limits are imposed as part of core Kubernetes at the namespace level: | ResourceName | Description | | ------------ | ----------- | @@ -97,6 +97,10 @@ If the incoming request does not cause the total usage to exceed any of the enum **ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) into the system. +To optimize system performance, it is encouraged that all resource quotas are tracked on the same **ResourceQuota** document. As a result, +it's encouraged to cap the total number of individual quotas tracked in the **Namespace** at 1 by explicitly +capping it in the **ResourceQuota** document. + ## kube-apiserver The server is updated to be aware of **ResourceQuota** objects. @@ -109,7 +113,9 @@ $ kube-apiserver -admission_control=ResourceQuota ## kube-controller-manager -A new controller is defined that runs a synch loop to run usage stats across the namespace. 
+A new controller is defined that runs a synch loop to calculate quota usage across the namespace. + +**ResourceQuota** usage is only calculated if a namespace has a **ResourceQuota** object. If the observed usage is different than the recorded usage, the controller sends a **ResourceQuotaUsage** resource to the server to atomically update. -- cgit v1.2.3 From f7b6bd0a26a9fae8d3b90378d02ca6a3f7b2c548 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Mon, 26 Jan 2015 10:34:44 -0800 Subject: Small tweaks to sequence diagram generation. Fix up name of font download and no transparency so it is easier to iterate. --- clustering/Makefile | 4 ++-- clustering/dynamic.png | Bin 87530 -> 72373 bytes clustering/static.png | Bin 45845 -> 36583 bytes 3 files changed, 2 insertions(+), 2 deletions(-) diff --git a/clustering/Makefile b/clustering/Makefile index c4095421..298479f1 100644 --- a/clustering/Makefile +++ b/clustering/Makefile @@ -10,7 +10,7 @@ watch: fswatch *.seqdiag | xargs -n 1 sh -c "make || true" $(FONT): - curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT).ttf + curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT) %.png: %.seqdiag $(FONT) - seqdiag -a -f '$(FONT)' $< + seqdiag --no-transparency -a -f '$(FONT)' $< diff --git a/clustering/dynamic.png b/clustering/dynamic.png index 9f2ff9db..92b40fee 100644 Binary files a/clustering/dynamic.png and b/clustering/dynamic.png differ diff --git a/clustering/static.png b/clustering/static.png index a01ebbe8..bcdeca7e 100644 Binary files a/clustering/static.png and b/clustering/static.png differ -- cgit v1.2.3 From 050db5a2f886f39b18cfe36ea768976bb91fdf55 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Mon, 26 Jan 2015 13:50:26 -0800 Subject: Add Dockerfile for sequence diagram generation --- clustering/Dockerfile | 12 ++++++++++++ clustering/Makefile | 13 +++++++++++++ clustering/README.md | 17 +++++++++++++++++ 3 files changed, 42 insertions(+) create 
mode 100644 clustering/Dockerfile diff --git a/clustering/Dockerfile b/clustering/Dockerfile new file mode 100644 index 00000000..3353419d --- /dev/null +++ b/clustering/Dockerfile @@ -0,0 +1,12 @@ +FROM debian:jessie + +RUN apt-get update +RUN apt-get -qy install python-seqdiag make curl + +WORKDIR /diagrams + +RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf + +ADD . /diagrams + +CMD bash -c 'make >/dev/stderr && tar cf - *.png' \ No newline at end of file diff --git a/clustering/Makefile b/clustering/Makefile index 298479f1..f6aa53ed 100644 --- a/clustering/Makefile +++ b/clustering/Makefile @@ -14,3 +14,16 @@ $(FONT): %.png: %.seqdiag $(FONT) seqdiag --no-transparency -a -f '$(FONT)' $< + +# Build the stuff via a docker image +.PHONY: docker +docker: + docker build -t clustering-seqdiag . + docker run --rm clustering-seqdiag | tar xvf - + +docker-clean: + docker rmi clustering-seqdiag || true + docker images -q --filter "dangling=true" | xargs docker rmi + +fix-clock-skew: + boot2docker ssh sudo date -u -D "%Y%m%d%H%M.%S" --set "$(shell date -u +%Y%m%d%H%M.%S)" diff --git a/clustering/README.md b/clustering/README.md index 04abb1bc..7e9d79c8 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -7,3 +7,20 @@ pip install seqdiag ``` Just call `make` to regenerate the diagrams. + +## Building with Docker +If you are on a Mac or your pip install is messed up, you can easily build with docker. + +``` +make docker +``` + +The first run will be slow but things should be fast after that. + +To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`. + +If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`. 
+ +## Automatically rebuild on file changes + +If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`. \ No newline at end of file -- cgit v1.2.3 From c1937164730775dbadad1542ed4119a1f56e0494 Mon Sep 17 00:00:00 2001 From: Mrunal Patel Date: Wed, 28 Jan 2015 15:03:06 -0800 Subject: Replace "net" by "pod infra" in docs and format strings. --- networking.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/networking.md b/networking.md index 3f52d388..d90f56b1 100644 --- a/networking.md +++ b/networking.md @@ -62,7 +62,7 @@ Docker allocates IP addresses from a bridge we create on each node, using its - creates a new pair of veth devices and binds them to the netns - auto-assigns an IP from docker’s IP range -2. Create the user containers and specify the name of the network container as their “net” argument. Docker finds the PID of the command running in the network container and attaches to the netns of that PID. +2. Create the user containers and specify the name of the pod infra container as their “POD” argument. Docker finds the PID of the command running in the pod infra container and attaches to the netns and ipcns of that PID. ### Other networking implementation examples With the primary aim of providing IP-per-pod-model, other implementations exist to serve the purpose outside of GCE. @@ -77,7 +77,7 @@ Right now, docker inspect doesn't show the networking configuration of the conta ### External IP assignment -We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. 
Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). +We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across pod infra container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the pod infra container dies, all the user containers must be stopped and restarted because the netns of the pod infra container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). ### Naming, discovery, and load balancing -- cgit v1.2.3 From ab574621c1398afe67a4a58018bd46a4b1908a27 Mon Sep 17 00:00:00 2001 From: csrwng Date: Thu, 22 Jan 2015 09:32:30 -0500 Subject: [Proposal] Security Contexts --- security_context.md | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 security_context.md diff --git a/security_context.md b/security_context.md new file mode 100644 index 00000000..87d67aa7 --- /dev/null +++ b/security_context.md @@ -0,0 +1,158 @@ +# Security Contexts +## Abstract +A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)): + +1. 
Ensure a clear isolation between container and the underlying host it runs on +2. Limit the ability of the container to negatively impact the infrastructure or other containers + +## Background + +The problem of securing containers in Kubernetes has come up [before](https://github.com/GoogleCloudPlatform/kubernetes/issues/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface. + +## Motivation + +### Container isolation + +In order to improve container isolation from host and other containers running on the host, containers should only be +granted the access they need to perform their work. To this end it should be possible to take advantage of Docker +features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) +to the container process. + +Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers. + +### External integration with shared storage +In order to support external integration with shared storage, processes running in a Kubernetes cluster +should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established. +Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks. 
+ +## Constraints and Assumptions +* It is out of the scope of this document to prescribe a specific set + of constraints to isolate containers from their host. Different use cases need different + settings. +* The concept of a security context should not be tied to a particular security mechanism or platform + (e.g. SELinux, AppArmor) +* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for + [service accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). + +## Use Cases + +In order of increasing complexity, the following are example use cases that would +be addressed with security contexts: + +1. Kubernetes is used to run a single cloud application. In order to protect + nodes from containers: + * All containers run as a single non-root user + * Privileged containers are disabled + * All containers run with a particular MCS label + * Kernel capabilities like CHOWN and MKNOD are removed from containers + +2. Just like case #1, except that I have more than one application running on + the Kubernetes cluster. + * Each application is run in its own namespace to avoid name collisions + * For each application a different uid and MCS label is used + +3. Kubernetes is used as the base for a PaaS with + multiple projects, each project represented by a namespace. + * Each namespace is associated with a range of uids/gids on the node that + are mapped to uids/gids on containers using Linux user namespaces. + * Certain pods in each namespace have special privileges to perform system + actions such as talking back to the server for deployment, running docker + builds, etc. + * External NFS storage is assigned to each namespace and permissions set + using the range of uids/gids assigned to that namespace. + +## Proposed Design + +### Overview +A *security context* consists of a set of constraints that determine how a container +is secured before getting created and run. 
It has a 1:1 correspondence to a +[service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). A *security context provider* is passed to the Kubelet so it can have a chance +to mutate Docker API calls in order to apply the security context. + +It is recommended that this design be implemented in two phases: + +1. Implement the security context provider extension point in the Kubelet + so that a default security context can be applied on container run and creation. +2. Implement a security context structure that is part of a service account. The + default context provider can then be used to apply a security context based + on the service account associated with the pod. + +### Security Context Provider + +The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container: + +```go +type SecurityContextProvider interface { + // ModifyContainerConfig is called before the Docker createContainer call. + // The security context provider can make changes to the Config with which + // the container is created. + // An error is returned if it's not possible to secure the container as + // requested with a security context. + ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error + + // ModifyHostConfig is called before the Docker runContainer call. + // The security context provider can make changes to the HostConfig, affecting + // security options, whether the container is privileged, volume binds, etc. + // An error is returned if it's not possible to secure the container as requested + // with a security context. + ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig) +} +``` +If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today. 
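As a rough illustration of this extension point, a minimal provider might apply a fixed non-root uid, disable privileged mode, and drop capabilities. In the sketch below, `Config` and `HostConfig` are simplified stand-ins for the Docker client types, and `DefaultProvider` is a hypothetical name — the real interface also receives the pod and container:

```go
package main

import "fmt"

// Config is a simplified stand-in for the Docker client's container Config.
type Config struct {
	User string
}

// HostConfig is a simplified stand-in for the Docker client's HostConfig.
type HostConfig struct {
	Privileged bool
	CapDrop    []string
}

// DefaultProvider applies a fixed default security context: run as a
// non-root uid, never privileged, and drop selected kernel capabilities.
type DefaultProvider struct {
	UID     int
	CapDrop []string
}

// ModifyContainerConfig sets the container to run as the default uid.
func (p DefaultProvider) ModifyContainerConfig(config *Config) error {
	config.User = fmt.Sprintf("%d", p.UID)
	return nil
}

// ModifyHostConfig forces privileged mode off and drops capabilities.
func (p DefaultProvider) ModifyHostConfig(hostConfig *HostConfig) {
	hostConfig.Privileged = false
	hostConfig.CapDrop = append(hostConfig.CapDrop, p.CapDrop...)
}

func main() {
	p := DefaultProvider{UID: 1001, CapDrop: []string{"CHOWN", "MKNOD"}}
	config := &Config{}
	hostConfig := &HostConfig{Privileged: true}
	p.ModifyContainerConfig(config)
	p.ModifyHostConfig(hostConfig)
	fmt.Println(config.User, hostConfig.Privileged, hostConfig.CapDrop)
	// prints: 1001 false [CHOWN MKNOD]
}
```

This corresponds to phase 1 of the plan: a default context applied uniformly, before any per-service-account context exists.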
+ +### Security Context + +A security context has a 1:1 correspondence to a service account and it can be included as +part of the service account resource. Following is an example of an initial implementation: + +```go +type SecurityContext struct { + // user is the uid to use when running the container + User int + + // allowPrivileged indicates whether this context allows privileged mode containers + AllowPrivileged bool + + // allowedVolumeTypes lists the types of volumes that a container can bind + AllowedVolumeTypes []string + + // addCapabilities is the list of Linux kernel capabilities to add + AddCapabilities []string + + // removeCapabilities is the list of Linux kernel capabilities to remove + RemoveCapabilities []string + + // SELinux specific settings (optional) + SELinux *SELinuxContext + + // AppArmor specific settings (optional) + AppArmor *AppArmorContext + + // FUTURE: + // With Linux user namespace support, it should be possible to map + // a range of container uids/gids to arbitrary host uids/gids + // UserMappings []IDMapping + // GroupMappings []IDMapping +} + +type SELinuxContext struct { + // MCS label/SELinux level to run the container under + Level string + + // SELinux type label for container processes + Type string + + // FUTURE: + // LabelVolumeMountsExclusive []Volume + // LabelVolumeMountsShared []Volume +} + +type AppArmorContext struct { + // AppArmor profile + Profile string +} +``` + +#### Security Context Lifecycle + +The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). 
If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use. \ No newline at end of file -- cgit v1.2.3 From 4c9e6d37b6276e38b1d24d5299545ee1c7ca0472 Mon Sep 17 00:00:00 2001 From: Alex Robinson Date: Tue, 3 Feb 2015 22:38:01 +0000 Subject: Fix the broken links in the labels and access design docs. --- access.md | 4 ++-- labels.md | 12 ++++++------ 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/access.md b/access.md index 7af64ac9..8a2f1edd 100644 --- a/access.md +++ b/access.md @@ -151,7 +151,7 @@ In the Simple Profile: Namespaces versus userAccount vs Labels: - `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s. -- `labels` (see [docs/labels.md](labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities. +- `labels` (see [docs/labels.md](/docs/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities. - `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people. @@ -212,7 +212,7 @@ Policy objects may be applicable only to a single namespace or to all namespaces ## Accounting -The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources.md](resources.md)). 
+The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources.md](/docs/resources.md)). Initially: - a `quota` object is immutable. diff --git a/labels.md b/labels.md index 8415376d..bc151f7c 100644 --- a/labels.md +++ b/labels.md @@ -1,6 +1,6 @@ # Labels -_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes. +_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](/docs/api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes. Each object can have a set of key/value labels set on it, with at most one label with a particular key. ``` @@ -10,13 +10,13 @@ Each object can have a set of key/value labels set on it, with at most one label } ``` -Unlike [names and UIDs](identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s). +Unlike [names and UIDs](/docs/identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s). Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes. Label selectors may also be used to associate policies with sets of objects. -We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md). 
+We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](/docs/container-environment.md). Valid label keys are comprised of two segments - prefix and name - separated by a slash (`/`). The name segment is required and must be a DNS label: 63 @@ -50,8 +50,8 @@ key1 exists LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future. Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s: -- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods. -- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more. +- `service`: A [service](/docs/services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods. +- `replicationController`: A [replication controller](/docs/replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more. The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector. @@ -73,4 +73,4 @@ Since labels can be set at pod creation time, no separate set add/remove operati ## Labels vs. annotations -We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. 
We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](annotations.md). +We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](/docs/annotations.md). -- cgit v1.2.3 From 3b687b605b7bd6abe51b91e4ec18678e05849814 Mon Sep 17 00:00:00 2001 From: csrwng Date: Mon, 9 Feb 2015 14:17:51 -0500 Subject: Specify intent for container isolation and add details for id mapping --- security_context.md | 86 ++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 59 insertions(+), 27 deletions(-) diff --git a/security_context.md b/security_context.md index 87d67aa7..400d30e9 100644 --- a/security_context.md +++ b/security_context.md @@ -98,6 +98,7 @@ type SecurityContextProvider interface { ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig) } ``` + If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today. ### Security Context @@ -106,53 +107,84 @@ A security context has a 1:1 correspondence to a service account and it can be i part of the service account resource. 
Following is an example of an initial implementation: ```go + +// SecurityContext specifies the security constraints associated with a service account type SecurityContext struct { // user is the uid to use when running the container User int - // allowPrivileged indicates whether this context allows privileged mode containers + // AllowPrivileged indicates whether this context allows privileged mode containers AllowPrivileged bool - // allowedVolumeTypes lists the types of volumes that a container can bind + // AllowedVolumeTypes lists the types of volumes that a container can bind AllowedVolumeTypes []string - // addCapabilities is the list of Linux kernel capabilities to add + // AddCapabilities is the list of Linux kernel capabilities to add AddCapabilities []string - // removeCapabilities is the list of Linux kernel capabilities to remove + // RemoveCapabilities is the list of Linux kernel capabilities to remove RemoveCapabilities []string - // SELinux specific settings (optional) - SELinux *SELinuxContext - - // AppArmor specific settings (optional) - AppArmor *AppArmorContext + // Isolation specifies the type of isolation required for containers + // in this security context + Isolation ContainerIsolationSpec +} + +// ContainerIsolationSpec indicates intent for container isolation +type ContainerIsolationSpec struct { + // Type is the container isolation type (None, Private) + Type ContainerIsolationType - // FUTURE: - // With Linux user namespace support, it should be possible to map - // a range of container uids/gids to arbitrary host uids/gids - // UserMappings []IDMapping - // GroupMappings []IDMapping + // FUTURE: IDMapping specifies how users and groups from the host will be mapped + IDMapping *IDMapping } -type SELinuxContext struct { - // MCS label/SELinux level to run the container under - Level string - - // SELinux type label for container processes - Type string - - // FUTURE: - // LabelVolumeMountsExclusive []Volume - // LabelVolumeMountsShared 
[]Volume +} + +// ContainerIsolationType is the type of container isolation for a security context +type ContainerIsolationType string + +const ( + // ContainerIsolationNone means that no additional constraints are added to + // containers to isolate them from their host + ContainerIsolationNone ContainerIsolationType = "None" + + // ContainerIsolationPrivate means that containers are isolated in process + // and storage from their host and other containers. + ContainerIsolationPrivate ContainerIsolationType = "Private" +) + +// IDMapping specifies the requested user and group mappings for containers +// associated with a specific security context +type IDMapping struct { + // SharedUsers is the set of user ranges that must be unique to the entire cluster + SharedUsers []IDMappingRange + + // SharedGroups is the set of group ranges that must be unique to the entire cluster + SharedGroups []IDMappingRange + + // PrivateUsers are mapped to users on the host node, but are not necessarily + // unique to the entire cluster + PrivateUsers []IDMappingRange + + // PrivateGroups are mapped to groups on the host node, but are not necessarily + // unique to the entire cluster + PrivateGroups []IDMappingRange } -type AppArmorContext struct { - // AppArmor profile - Profile string +// IDMappingRange specifies a mapping between container IDs and node IDs +type IDMappingRange struct { + // ContainerID is the starting container ID + ContainerID int + + // HostID is the starting host ID + HostID int + + // Length is the length of the ID range + Length int } + ``` + #### Security Context Lifecycle The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). 
If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use. \ No newline at end of file -- cgit v1.2.3 From 4df971f078f655daad6e51dabd2fc05da8811e33 Mon Sep 17 00:00:00 2001 From: Saad Ali Date: Wed, 11 Feb 2015 18:04:30 -0800 Subject: Documentation for Event Compression --- event_compression.md | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 event_compression.md diff --git a/event_compression.md b/event_compression.md new file mode 100644 index 00000000..ab33a509 --- /dev/null +++ b/event_compression.md @@ -0,0 +1,79 @@ +# Kubernetes Event Compression + +This document captures the design of event compression. + + +## Background + +Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate ```image_not_existing``` and ```container_is_waiting``` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](https://github.com/GoogleCloudPlatform/kubernetes/issues/3853)). + +## Proposal +Each binary that generates events (for example, ```kubelet```) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. + +Event compression should be best effort (not guaranteed). Meaning, in the worst case, ```n``` identical (minus timestamp) events may still result in ```n``` event entries. 
+ +## Design +Instead of a single Timestamp, each event object [contains](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/types.go#L1111) the following fields: + * ```FirstTimestamp util.Time``` + * The date/time of the first occurrence of the event. + * ```LastTimestamp util.Time``` + * The date/time of the most recent occurrence of the event. + * On first occurrence, this is equal to the FirstTimestamp. + * ```Count int``` + * The number of occurrences of this event between FirstTimestamp and LastTimestamp. + * On first occurrence, this is 1. + +Each binary that generates events will: + * Maintain a new global hash table to keep track of previously generated events (see ```pkg/client/record/events_cache.go```). + * The code that “records/writes” events (see ```StartRecording``` in ```pkg/client/record/event.go```) uses the global hash table to check if any new event has been seen previously. + * The key for the hash table is generated from the event object minus timestamps/count/transient fields (see ```pkg/client/record/events_cache.go```); specifically, the following event fields are used to construct a unique key for an event: + * ```event.Source.Component``` + * ```event.Source.Host``` + * ```event.InvolvedObject.Kind``` + * ```event.InvolvedObject.Namespace``` + * ```event.InvolvedObject.Name``` + * ```event.InvolvedObject.UID``` + * ```event.InvolvedObject.APIVersion``` + * ```event.Reason``` + * ```event.Message``` + * If the key for a new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate: + * Instead of the usual POST/create event API, the new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. 
+ * The event is also updated in the global hash table with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). + * If the key for a new event does not match the key for any previously generated event (meaning none of the above fields match between the new event and any previously generated events), then the event is considered to be new/unique: + * The usual POST/create event API is called to create a new event entry in etcd. + * An entry for the event is also added to the global hash table. + +## Issues/Risks + * Hash table clean up + * If the component (e.g. kubelet) runs for a long period of time and generates a ton of unique events, the hash table could grow very large in memory. + * *Future consideration:* remove entries from the hash table that are older than some specified time. + * Event history is not preserved across application restarts + * Each component keeps track of event history in memory, a restart causes event history to be cleared. + * That means that compression will not occur across component restarts. + * Similarly, if in the future events are aged out of the hash table, then events will only be compressed until they age out of the hash table, at which point any new instance of the event will cause a new entry to be created in etcd. + +## Example +Sample kubectl output +``` +FIRSTTIME LASTTIME COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE +Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet. 
+Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-2.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal + +``` + +This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries. 
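The create-vs-update flow in the Design section can be sketched as follows. This is illustrative Go, not the actual code in `pkg/client/record`: `aggregateKey` and `record` are hypothetical names, and the flat struct fields stand in for the nested `event.Source.*` and `event.InvolvedObject.*` fields listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// Event carries only the fields that feed the cache key, plus Count.
type Event struct {
	SourceComponent, SourceHost            string
	Kind, Namespace, Name, UID, APIVersion string
	Reason, Message                        string
	Count                                  int
}

// aggregateKey joins the stable fields from the design doc; timestamps
// and Count are deliberately excluded so recurrences of the same event
// map to the same cache entry.
func aggregateKey(e Event) string {
	return strings.Join([]string{
		e.SourceComponent, e.SourceHost,
		e.Kind, e.Namespace, e.Name, e.UID, e.APIVersion,
		e.Reason, e.Message,
	}, "|")
}

// record returns "PUT" for a duplicate (update the existing etcd entry)
// and "POST" for a new/unique event (create a fresh entry).
func record(cache map[string]*Event, e Event) string {
	k := aggregateKey(e)
	if prev, ok := cache[k]; ok {
		prev.Count++
		return "PUT"
	}
	e.Count = 1
	cache[k] = &e
	return "POST"
}

func main() {
	cache := map[string]*Event{}
	ev := Event{SourceComponent: "scheduler", Kind: "Pod",
		Name: "skydns-ls6k1", Reason: "failedScheduling",
		Message: "Error scheduling: no minions available to schedule pods"}
	fmt.Println(record(cache, ev)) // first occurrence: POST
	fmt.Println(record(cache, ev)) // recurrence: PUT
	fmt.Println(cache[aggregateKey(ev)].Count)
}
```

In the real recorder the cached entry also tracks the event's name and resource version so the PUT can target the existing etcd object.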
+ +## Related Pull Requests/Issues + * Issue [#4073](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress duplicate events + * PR [#4157](https://github.com/GoogleCloudPlatform/kubernetes/issues/4157): Add "Update Event" to Kubernetes API + * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event + * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress recurring events in to a single event to optimize etcd storage -- cgit v1.2.3 From cbbd382b3f5bfb7207f7a83aea99b1967067f002 Mon Sep 17 00:00:00 2001 From: Clayton Coleman Date: Mon, 2 Feb 2015 14:35:33 -0500 Subject: Kubernetes pod and namespace security model This proposed update to docs/design/security.md includes proposals on how to ensure containers have consistent Linux security behavior across nodes, how containers authenticate and authorize to the master and other components, and how secret data could be distributed to pods to allow that authentication. References concepts from #3910, #2030, and #2297 as well as upstream issues around the Docker vault and Docker secrets. --- security.md | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 93 insertions(+), 4 deletions(-) diff --git a/security.md b/security.md index 22034bdf..27f07cd6 100644 --- a/security.md +++ b/security.md @@ -1,17 +1,106 @@ # Security in Kubernetes -General design principles and guidelines related to security of containers, APIs, and infrastructure in Kubernetes. +Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other, from the cluster infrastructure, and which preserves important boundaries between those who manage the cluster, and those who use the cluster. 
+While Kubernetes today is not primarily a multi-tenant system, the long term evolution of Kubernetes will increasingly rely on proper boundaries between users and administrators. The code running on the cluster must be appropriately isolated and secured to prevent malicious parties from affecting the entire cluster. -## Objectives -1. Ensure a clear isolation between container and the underlying host it runs on +## High Level Goals + +1. Ensure a clear isolation between the container and the underlying host it runs on 2. Limit the ability of the container to negatively impact the infrastructure or other containers 3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components 4. Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components +5. Allow users of the system to be cleanly separated from administrators +6. Allow administrative functions to be delegated to users where necessary +7. Allow applications to be run on the cluster that have "secret" data (keys, certs, passwords) which is properly abstracted from "public" data. + + +## Use cases + +### Roles: + +We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories: + +1. k8s admin - administers a kubernetes cluster and has access to the underlying components of the system +2. k8s project administrator - administers the security of a small subset of the cluster +3. k8s developer - launches pods on a kubernetes cluster and consumes cluster resources + +Automated process users fall into the following categories: + +1. 
k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources independent of the human users attached to a project +2. k8s infrastructure user - the user that kubernetes infrastructure components use to perform cluster functions with clearly defined roles + + +### Description of roles: + +* Developers: + * write pod specs. + * make some of their own images, and use some "community" docker images + * know which pods need to talk to which other pods + * decide which pods should share files with other pods, and which should not. + * reason about application level security, such as containing the effects of a local-file-read exploit in a webserver pod. + * do not often reason about operating system or organizational security. + * are not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. + +* Project Admins: + * allocate identity and roles within a namespace + * reason about organizational security within a namespace + * don't give a developer permissions that are not needed for their role. + * protect files on shared storage from unnecessary cross-team access + * are less focused on application security + +* Administrators: + * are less focused on application security. Focused on operating system security. + * protect the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers. + * are comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. + * decide who can use which Linux Capabilities, run privileged containers, use hostDir, etc. + * e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not. 
+ + +## Proposed Design + +A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to are limited by that *service account*. + + +1. The API should authenticate and authorize user actions [authn and authz](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/access.md) +2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. +3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) +4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) + 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption + 2. If the users who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk + 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action +5. When container processes run on the cluster, they should run in a [security context](https://github.com/GoogleCloudPlatform/kubernetes/pull/3910) that isolates those processes via Linux user security, user namespaces, and permissions. + 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID + 2. 
Administrators should be able to ensure that container processes within the same namespace are all assigned the same Unix UID + 3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions + 4. Project administrators should be able to run pods within a namespace under different security contexts, and developers must be able to specify which of the available security contexts they may use + 5. Developers should be able to run their own images or images from the community and expect those images to run correctly + 6. Developers may need to ensure their images work within higher security requirements specified by administrators + 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. + 8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes +6. Developers should be able to define [secrets](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) that are automatically added to the containers when pods are run + 1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples: + 1. An SSH private key for git cloning remote data + 2. A client certificate for accessing a remote system + 3. A private key and certificate for a web server + 4. A .kubeconfig file with embedded cert / token data for accessing the Kubernetes master + 5. A .dockercfg file for pulling images from a protected registry + 2. Developers should be able to define the pod spec so that a secret lands in a specific location + 3. Project administrators should be able to limit developers within a namespace from viewing or modifying secrets (anyone who can launch an arbitrary pod can view secrets) + 4. 
Secrets are generally not copied from one namespace to another when a developer's application definitions are copied + + +### Related design discussion + +* Authorization and authentication https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/access.md +* Secret distribution via files https://github.com/GoogleCloudPlatform/kubernetes/pull/2030 +* Docker secrets https://github.com/docker/docker/pull/6697 +* Docker vault https://github.com/docker/docker/issues/10310 +## Specific Design Points -## Design Points +### TODO: authorization, authentication ### Isolate the data store from the minions and supporting infrastructure -- cgit v1.2.3 From ec77204e813546253b7af509aad039e989b53ad4 Mon Sep 17 00:00:00 2001 From: Saad Ali Date: Tue, 17 Feb 2015 16:36:08 -0800 Subject: Update Event Compression Design Doc with LRU Cache --- event_compression.md | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/event_compression.md b/event_compression.md index ab33a509..99dda143 100644 --- a/event_compression.md +++ b/event_compression.md @@ -23,10 +23,10 @@ Instead of a single Timestamp, each event object [contains](https://github.com/G * The number of occurrences of this event between FirstTimestamp and LastTimestamp * On first occurrence, this is 1. -Each binary that generates events will: - * Maintain a new global hash table to keep track of previously generated events (see ```pkg/client/record/events_cache.go```). - * The code that “records/writes” events (see ```StartRecording``` in ```pkg/client/record/event.go```), uses the global hash table to check if any new event has been seen previously. 
- * The key for the hash table is generated from the event object minus timestamps/count/transient fields (see ```pkg/client/record/events_cache.go```), specifically the following events fields are used to construct a unique key for an event: +Each binary that generates events: + * Maintains a historical record of previously generated events: + * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/pkg/client/record/events_cache.go). + * The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following event fields are used to construct a unique key for an event: * ```event.Source.Component``` * ```event.Source.Host``` * ```event.InvolvedObject.Kind``` @@ -36,26 +36,24 @@ Each binary that generates events will: * ```event.InvolvedObject.APIVersion``` * ```event.Reason``` * ```event.Message``` - * If the key for a new event matches the key for a previously generated events (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate: - * Instead of the usual POST/create event API, the new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. - * The event is also updated in the global hash table with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). - * If the key for a new event does not match the key for any previously generated event (meaning none of the above fields match between the new event and any previously generated events), then the event is considered to be new/unique: + * The LRU cache is capped at 4096 events. That means if a component (e.g. 
kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. + * When an event is generated, the previously generated events cache is checked (see [```pkg/client/record/event.go```](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/client/record/event.go)). + * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: + * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. + * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). + * If the key for the new event does not match the key for any previously generated event (meaning none of the above fields match between the new event and any previously generated events), then the event is considered to be new/unique and a new event entry is created in etcd: * The usual POST/create event API is called to create a new event entry in etcd. - * An entry for the event is also added to the global hash table. + * An entry for the event is also added to the previously generated events cache. ## Issues/Risks - * Hash table clean up - * If the component (e.g. kubelet) runs for a long period of time and generates a ton of unique events, the hash table could grow very large in memory. - * *Future consideration:* remove entries from the hash table that are older than some specified time. 
- * Event history is not preserved across application restarts - * Each component keeps track of event history in memory, a restart causes event history to be cleared. - * That means that compression will not occur across component restarts. - * Similarly, if in the future events are aged out of the hash table, then events will only be compressed until they age out of the hash table, at which point any new instance of the event will cause a new entry to be created in etcd. + * Compression is not guaranteed, because each component keeps track of event history in memory + * An application restart causes event history to be cleared, meaning event history is not preserved across application restarts and compression will not occur across component restarts. + * Because an LRU cache is used to keep track of previously generated events, if too many unique events are generated, old events will be evicted from the cache, so events will only be compressed until they age out of the events cache, at which point any new instance of the event will cause a new entry to be created in etcd. ## Example Sample kubectl output ``` -FIRSTTIME LASTTIME COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE +FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet. Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet. Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet. 
@@ -77,3 +75,4 @@ This demonstrates what would have been 20 separate entries (indicating schedulin * PR [#4157](https://github.com/GoogleCloudPlatform/kubernetes/issues/4157): Add "Update Event" to Kubernetes API * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress recurring events in to a single event to optimize etcd storage + * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map -- cgit v1.2.3 From 35402355a74c93d81df5052148e071c1c2bd6feb Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Tue, 17 Feb 2015 20:18:38 -0500 Subject: Secrets proposal --- secrets.md | 547 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 547 insertions(+) create mode 100644 secrets.md diff --git a/secrets.md b/secrets.md new file mode 100644 index 00000000..6d561eec --- /dev/null +++ b/secrets.md @@ -0,0 +1,547 @@ +# Secret Distribution + +## Abstract + +A proposal for the distribution of secrets (passwords, keys, etc) to the Kubelet and to +containers inside Kubernetes using a custom volume type. + +## Motivation + +Secrets are needed in containers to access internal resources like the Kubernetes master or +external resources such as git repositories, databases, etc. Users may also want behaviors in the +kubelet that depend on secret data (credentials for image pull from a docker registry) associated +with pods. + +Goals of this design: + +1. Describe a secret resource +2. Define the various challenges attendant to managing secrets on the node +3. 
Define a mechanism for consuming secrets in containers without modification + +## Constraints and Assumptions + +* This design does not prescribe a method for storing secrets; storage of secrets should be + pluggable to accommodate different use-cases +* Encryption of secret data and node security are orthogonal concerns +* It is assumed that node and master are secure and that compromising their security could also + compromise secrets: + * If a node is compromised, the only secrets that could potentially be exposed should be the + secrets belonging to containers scheduled onto it + * If the master is compromised, all secrets in the cluster may be exposed +* Secret rotation is an orthogonal concern, but it should be facilitated by this proposal + +## Use Cases + +1. As a user, I want to store secret artifacts for my applications and consume them securely in + containers, so that I can keep the configuration for my applications separate from the images + that use them: + 1. As a cluster operator, I want to allow a pod to access the Kubernetes master using a custom + `.kubeconfig` file, so that I can securely reach the master + 2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials + from a `.dockercfg` file, so that containers can push images + 3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys, + so that I can push and fetch to and from the repository +2. As a user, I want to allow containers to consume supplemental information about services such + as username and password which should be kept secret, so that I can share secrets about a + service amongst the containers in my application securely +3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a secret and have + the kubelet implement some reserved behaviors based on the types of secrets the service account + consumes: + 1. Use credentials for a docker registry to pull the pod's docker image + 2. 
Present kubernetes auth token to the pod or transparently decorate traffic between the pod + and master service +4. As a user, I want to be able to indicate that a secret expires and for that secret's value to + be rotated once it expires, so that the system can help me follow good practices + +### Use-Case: Configuration artifacts + +Many configuration files contain secrets intermixed with other configuration information. For +example, a user's application may contain a properties file that contains database credentials, +SaaS API tokens, etc. Users should be able to consume configuration artifacts in their containers +and be able to control the path on the container's filesystems where the artifact will be +presented. + +### Use-Case: Metadata about services + +Most pieces of information about how to use a service are secrets. For example, a service that +provides a MySQL database needs to provide the username, password, and database name to consumers +so that they can authenticate and use the correct database. Containers in pods consuming the MySQL +service would also consume the secrets associated with the MySQL service. + +### Use-Case: Secrets associated with service accounts + +[Service Accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) are proposed as a +mechanism to decouple capabilities and security contexts from individual human users. A +`ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is +associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and +other system components to take action based on the secret's type. + +#### Example: service account consumes auth token secret + +As an example, the service account proposal discusses service accounts consuming secrets which +contain kubernetes auth tokens. When a Kubelet starts a pod associated with a service account +which consumes this type of secret, the Kubelet may take a number of actions: + +1. 
Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's + file system +2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the + `kubernetes-master` service with the auth token, e.g. by adding a header to the request + (see the [LOAS Daemon](https://github.com/GoogleCloudPlatform/kubernetes/issues/2209) proposal) + +#### Example: service account consumes docker registry credentials + +Another example use case is where a pod is associated with a secret containing docker registry +credentials. The Kubelet could use these credentials for the docker pull to retrieve the image. + +### Use-Case: Secret expiry and rotation + +Rotation is considered a good practice for many types of secret data. It should be possible to +express that a secret has an expiry date; this would make it possible to implement a system +component that could regenerate expired secrets. As an example, consider a component that rotates +expired secrets. The rotator could periodically regenerate the values for expired secrets of +common types and update their expiry dates. + +## Deferral: Consuming secrets as environment variables + +Some images will expect to receive configuration items as environment variables instead of files. +We should consider what the best way to allow this is; there are a few different options: + +1. Force the user to adapt files into environment variables. Users can store secrets that need to + be presented as environment variables in a format that is easy to consume from a shell: + + $ cat /etc/secrets/my-secret.txt + export MY_SECRET_ENV=MY_SECRET_VALUE + + The user could `source` the file at `/etc/secrets/my-secret` prior to executing the command for + the image either inline in the command or in an init script, + +2. Give secrets an attribute that allows users to express the intent that the platform should + generate the above syntax in the file used to present a secret. 
The user could consume these + files in the same manner as the above option. + +3. Give secrets attributes that allow the user to express that the secret should be presented to + the container as an environment variable. The container's environment would contain the + desired values and the software in the container could use them without changes to the + command or setup script. + +For our initial work, we will treat all secrets as files to narrow the problem space. There will +be a future proposal that handles exposing secrets as environment variables. + +## Flow analysis of secret data with respect to the API server + +There are two fundamentally different use-cases for access to secrets: + +1. CRUD operations on secrets by their owners +2. Read-only access to the secrets needed for a particular node by the kubelet + +### Use-Case: CRUD operations by owners + +In use cases for CRUD operations, the user experience for secrets should be no different than for +other API resources. + +#### Data store backing the REST API + +The data store backing the REST API should be pluggable because different cluster operators will +have different preferences for the central store of secret data. Some possibilities for storage: + +1. An etcd collection alongside the storage for other API resources +2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module) +3. An external datastore such as an external etcd, RDBMS, etc. + +#### Size limit for secrets + +There should be a size limit for secrets in order to: + +1. Prevent DoS attacks against the API server +2. Allow kubelet implementations that prevent secret data from touching the node's filesystem + +The size limit should satisfy the following conditions: + +1. Large enough to store common artifact types (encryption keypairs, certificates, small + configuration files) +2. 
Small enough to avoid large impact on node resource consumption (storage, RAM for tmpfs, etc) + +To begin discussion, we propose an initial value for this size limit of **1MB**. + +#### Other limitations on secrets + +Defining a policy for limitations on how a secret may be referenced by another API resource and how +constraints should be applied throughout the cluster is tricky due to the number of variables +involved: + +1. Should there be a maximum number of secrets a pod can reference via a volume? +2. Should there be a maximum number of secrets a service account can reference? +3. Should there be a total maximum number of secrets a pod can reference via its own spec and its + associated service account? +4. Should there be a total size limit on the amount of secret data consumed by a pod? +5. How will cluster operators want to be able to configure these limits? +6. How will these limits impact API server validations? +7. How will these limits affect scheduling? + +For now, we will not implement validations around these limits. Cluster operators will decide how +much node storage is allocated to secrets. It will be the operator's responsibility to ensure that +the allocated storage is sufficient for the workload scheduled onto a node. + +### Use-Case: Kubelet read of secrets for node + +The use-case where the kubelet reads secrets has several additional requirements: + +1. Kubelets should only be able to receive secret data which is required by pods scheduled onto + the kubelet's node +2. Kubelets should have read-only access to secret data +3. Secret data should not be transmitted over the wire insecurely +4. Kubelets must ensure pods do not have access to each other's secrets + +#### Read of secret data by the Kubelet + +The Kubelet should only be allowed to read secrets which are consumed by pods scheduled onto that +Kubelet's node and their associated service accounts. 
Authorization of the Kubelet to read this
+data would be delegated to an authorization plugin and associated policy rule.
+
+#### Secret data on the node: data at rest
+
+Consideration must be given to whether secret data should be allowed to be at rest on the node:
+
+1. If secret data is not allowed to be at rest, the size of secret data becomes another draw on
+   the node's RAM - should it affect scheduling?
+2. If secret data is allowed to be at rest, should it be encrypted?
+   1. If so, how should this be done?
+   2. If not, what threats exist? What types of secret are appropriate to store this way?
+
+For the sake of limiting complexity, we propose that initially secret data should not be allowed
+to be at rest on a node; secret data should be stored on a node-level tmpfs filesystem. This
+filesystem can be subdivided into directories for use by the kubelet and by the volume plugin.
+
+#### Secret data on the node: resource consumption
+
+The Kubelet will be responsible for creating the per-node tmpfs file system for secret storage.
+It is hard to make a prescriptive declaration about how much storage is appropriate to reserve for
+secrets because different installations will vary widely in available resources, desired pod to
+node density, overcommit policy, and other operational dimensions. That being the case, we propose
+for simplicity that the amount of secret storage be controlled by a new parameter to the kubelet
+with a default value of **64MB**. It is the cluster operator's responsibility to handle choosing
+the right storage size for their installation and configuring their Kubelets correctly.
+
+Configuring each Kubelet is not the ideal story for operator experience; it is more intuitive that
+the cluster-wide storage size be readable from a central configuration store like the one proposed
+in [#1553](https://github.com/GoogleCloudPlatform/kubernetes/issues/1553).
When such a store +exists, the Kubelet could be modified to read this configuration item from the store. + +When the Kubelet is modified to advertise node resources (as proposed in +[#4441](https://github.com/GoogleCloudPlatform/kubernetes/issues/4441)), the capacity calculation +for available memory should factor in the potential size of the node-level tmpfs in order to avoid +memory overcommit on the node. + +#### Secret data on the node: isolation + +Every pod will have a [security context](https://github.com/GoogleCloudPlatform/kubernetes/pull/3910). +Secret data on the node should be isolated according to the security context of the container. The +Kubelet volume plugin API will be changed so that a volume plugin receives the security context of +a volume along with the volume spec. This will allow volume plugins to implement setting the +security context of volumes they manage. + +## Community work: + +Several proposals / upstream patches are notable as background for this proposal: + +1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) +2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) +3. [Kubernetes service account proposal](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) +4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) +5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) + +## Proposed Design + +We propose a new `Secret` resource which is mounted into containers with a new volume type. Secret +volumes will be handled by a volume plugin that does the actual work of fetching the secret and +storing it. Secrets contain multiple pieces of data that are presented as different files within +the secret volume (example: SSH key pair). 
+ +In order to remove the burden from the end user in specifying every file that a secret consists of, +it should be possible to mount all files provided by a secret with a single ```VolumeMount``` entry +in the container specification. + +### Secret API Resource + +A new resource for secrets will be added to the API: + +```go +type Secret struct { + TypeMeta + ObjectMeta + + // Keys in this map are the paths relative to the volume + // presented to a container for this secret data. + Data map[string][]byte + Type SecretType +} + +type SecretType string + +const ( + SecretTypeOpaque SecretType = "opaque" // Opaque (arbitrary data; default) + SecretTypeKubernetesAuthToken SecretType = "kubernetes-auth" // Kubernetes auth token + SecretTypeDockerRegistryAuth SecretType = "docker-reg-auth" // Docker registry auth + // FUTURE: other type values +) + +const MaxSecretSize = 1 * 1024 * 1024 +``` + +A Secret can declare a type in order to provide type information to system components that work +with secrets. The default type is `opaque`, which represents arbitrary user-owned data. + +Secrets are validated against `MaxSecretSize`. + +A new REST API and registry interface will be added to accompany the `Secret` resource. The +default implementation of the registry will store `Secret` information in etcd. Future registry +implementations could store the `TypeMeta` and `ObjectMeta` fields in etcd and store the secret +data in another data store entirely, or store the whole object in another data store. + +#### Other validations related to secrets + +Initially there will be no validations for the number of secrets a pod references, or the number of +secrets that can be associated with a service account. These may be added in the future as the +finer points of secrets and resource allocation are fleshed out. 
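The `MaxSecretSize` check described above can be sketched in Go. This is an illustrative sketch only: the trimmed-down `secret` struct and the `validateSize` helper are invented for the example, and counting key names toward the limit is an assumption rather than something the proposal specifies.

```go
package main

import "fmt"

// maxSecretSize mirrors the proposed MaxSecretSize constant of 1MB.
const maxSecretSize = 1 * 1024 * 1024

// secret is a trimmed stand-in for the proposed Secret resource; the real
// object also carries TypeMeta, ObjectMeta, and a Type field.
type secret struct {
	Data map[string][]byte
}

// validateSize rejects secrets whose total payload exceeds the proposed
// limit; key names are counted here as well, which is an assumption.
func validateSize(s secret) error {
	total := 0
	for k, v := range s.Data {
		total += len(k) + len(v)
	}
	if total > maxSecretSize {
		return fmt.Errorf("secret data too large: %d bytes (limit %d)", total, maxSecretSize)
	}
	return nil
}

func main() {
	// A typical SSH keypair fits comfortably under the limit.
	keypair := secret{Data: map[string][]byte{
		"id_rsa":     make([]byte, 1679),
		"id_rsa.pub": make([]byte, 381),
	}}
	fmt.Println(validateSize(keypair)) // <nil>

	// A 2MB blob exceeds the proposed 1MB limit.
	blob := secret{Data: map[string][]byte{"blob": make([]byte, 2*1024*1024)}}
	fmt.Println(validateSize(blob) != nil) // true
}
```

Whether validation lives in the REST layer or the registry is one of the details the proposal leaves open.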
+
+### Secret Volume Source
+
+A new `SecretSource` type of volume source will be added to the ```VolumeSource``` struct in the
+API:
+
+```go
+type VolumeSource struct {
+    // Other fields omitted
+
+    // SecretSource represents a secret that should be presented in a volume
+    SecretSource *SecretSource `json:"secret"`
+}
+
+type SecretSource struct {
+    Target ObjectReference
+}
+```
+
+Secret volume sources are validated to ensure that the specified object reference actually points
+to an object of type `Secret`.
+
+### Secret Volume Plugin
+
+A new Kubelet volume plugin will be added to handle volumes with a secret source. This plugin will
+require access to the API server to retrieve secret data and therefore the volume `Host` interface
+will have to change to expose a client interface:
+
+```go
+type Host interface {
+    // Other methods omitted
+
+    // GetKubeClient returns a client interface
+    GetKubeClient() client.Interface
+}
+```
+
+The secret volume plugin will be responsible for:
+
+1. Returning a `volume.Builder` implementation from `NewBuilder` that:
+   1. Retrieves the secret data for the volume from the API server
+   2. Places the secret data onto the container's filesystem
+   3. Sets the correct security attributes for the volume based on the pod's `SecurityContext`
+2. Returning a `volume.Cleaner` implementation from `NewCleaner` that cleans the volume from the
+   container's filesystem
+
+### Kubelet: Node-level secret storage
+
+The Kubelet must be modified to accept a new parameter for the secret storage size and to create
+a tmpfs file system of that size to store secret data. Rough accounting of specific changes:
+
+1. The Kubelet should have a new field added called `secretStorageSize`; units are megabytes
+2. `NewMainKubelet` should accept a value for secret storage size
+3. The Kubelet server should have a new flag added for secret storage size
+4.
The Kubelet's `setupDataDirs` method should be changed to create the secret storage
+
+### Kubelet: New behaviors for secrets associated with service accounts
+
+For use-cases where the Kubelet's behavior is affected by the secrets associated with a pod's
+`ServiceAccount`, the Kubelet will need to be changed. For example, if secrets of type
+`docker-reg-auth` affect how the pod's images are pulled, the Kubelet will need to be changed
+to accommodate this. Subsequent proposals can address this on a type-by-type basis.
+
+## Examples
+
+For clarity, let's examine some detailed examples of some common use-cases in terms of the
+suggested changes. All of these examples are assumed to be created in a namespace called
+`example`.
+
+### Use-Case: Pod with ssh keys
+
+To create a pod that uses an ssh key stored as a secret, we first need to create a secret:
+
+```json
+{
+  "apiVersion": "v1beta2",
+  "kind": "Secret",
+  "id": "ssh-key-secret",
+  "data": {
+    "id_rsa.pub": "dmFsdWUtMQ0K",
+    "id_rsa": "dmFsdWUtMg0KDQo="
+  }
+}
+```
+
+**Note:** The values of secret data are encoded as base64-encoded strings.
+
+Now we can create a pod which references the secret with the ssh key and consumes it in a volume:
+
+```json
+{
+  "id": "secret-test-pod",
+  "kind": "Pod",
+  "apiVersion":"v1beta2",
+  "labels": {
+    "name": "secret-test"
+  },
+  "desiredState": {
+    "manifest": {
+      "version": "v1beta1",
+      "id": "secret-test-pod",
+      "containers": [{
+        "name": "ssh-test-container",
+        "image": "mySshImage",
+        "volumeMounts": [{
+          "name": "secret-volume",
+          "mountPath": "/etc/secret-volume",
+          "readOnly": true
+        }]
+      }],
+      "volumes": [{
+        "name": "secret-volume",
+        "source": {
+          "secret": {
+            "target": {
+              "kind": "Secret",
+              "namespace": "example",
+              "name": "ssh-key-secret"
+            }
+          }
+        }
+      }]
+    }
+  }
+}
+```
+
+When the container's command runs, the pieces of the key will be available in:
+
+    /etc/secret-volume/id_rsa.pub
+    /etc/secret-volume/id_rsa
+
+The container is then free to use the secret data to establish an ssh connection.
+
+### Use-Case: Pods with prod / test credentials
+
+Let's compare examples where a pod consumes a secret containing prod credentials and another pod
+consumes a secret with test environment credentials.
+ +The secrets: + +```json +[{ + "apiVersion": "v1beta2", + "kind": "Secret", + "id": "prod-db-secret", + "data": { + "username": "dmFsdWUtMQ0K", + "password": "dmFsdWUtMg0KDQo=" + } +}, +{ + "apiVersion": "v1beta2", + "kind": "Secret", + "id": "test-db-secret", + "data": { + "username": "dmFsdWUtMQ0K", + "password": "dmFsdWUtMg0KDQo=" + } +}] +``` + +The pods: + +```json +[{ + "id": "prod-db-client-pod", + "kind": "Pod", + "apiVersion":"v1beta2", + "labels": { + "name": "prod-db-client" + }, + "desiredState": { + "manifest": { + "version": "v1beta1", + "id": "prod-db-pod", + "containers": [{ + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [{ + "name": "secret-volume", + "mountPath": "/etc/secret-volume", + "readOnly": true + }] + }], + "volumes": [{ + "name": "secret-volume", + "source": { + "secret": { + "target": { + "kind": "Secret", + "namespace": "example", + "name": "prod-db-secret" + } + } + } + }] + } + } +}, +{ + "id": "test-db-client-pod", + "kind": "Pod", + "apiVersion":"v1beta2", + "labels": { + "name": "test-db-client" + }, + "desiredState": { + "manifest": { + "version": "v1beta1", + "id": "test-db-pod", + "containers": [{ + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [{ + "name": "secret-volume", + "mountPath": "/etc/secret-volume", + "readOnly": true + }] + }], + "volumes": [{ + "name": "secret-volume", + "source": { + "secret": { + "target": { + "kind": "Secret", + "namespace": "example", + "name": "test-db-secret" + } + } + } + }] + } + } +}] +``` + +The specs for the two pods differ only in the value of the object referred to by the secret volume +source. 
Both containers will have the following files present on their filesystems:
+
+    /etc/secret-volume/username
+    /etc/secret-volume/password
--
cgit v1.2.3


From e6e17729be57537bc49aa8734c2bfdb202ebdbd4 Mon Sep 17 00:00:00 2001
From: Paul Morie
Date: Thu, 19 Feb 2015 10:25:13 -0500
Subject: Minor addendums to secrets proposal

---
 secrets.md | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/secrets.md b/secrets.md
index 6d561eec..ce02f930 100644
--- a/secrets.md
+++ b/secrets.md
@@ -29,6 +29,8 @@ Goals of this design:
   secrets belonging to containers scheduled onto it
 * If the master is compromised, all secrets in the cluster may be exposed
 * Secret rotation is an orthogonal concern, but it should be facilitated by this proposal
+* A user who can consume a secret in a container can know the value of the secret; secrets must
+  be provisioned judiciously

 ## Use Cases

@@ -270,10 +272,12 @@ type Secret struct {
 	TypeMeta
 	ObjectMeta

-	// Keys in this map are the paths relative to the volume
-	// presented to a container for this secret data.
-	Data map[string][]byte
-	Type SecretType
+	// Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN.
+	// The serialized form of the secret data is a base64 encoded string.
+	Data map[string][]byte `json:"data,omitempty"`
+
+	// Used to facilitate programmatic handling of secret data.
+	Type SecretType `json:"type,omitempty"`
 }

 type SecretType string
@@ -291,7 +295,8 @@ const MaxSecretSize = 1 * 1024 * 1024
 A Secret can declare a type in order to provide type information to system components that work
 with secrets. The default type is `opaque`, which represents arbitrary user-owned data.

-Secrets are validated against `MaxSecretSize`.
+Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must be valid DNS
+subdomains.

 A new REST API and registry interface will be added to accompany the `Secret` resource.
The default implementation of the registry will store `Secret` information in etcd. Future registry @@ -325,6 +330,11 @@ type SecretSource struct { Secret volume sources are validated to ensure that the specified object reference actually points to an object of type `Secret`. +In the future, the `SecretSource` will be extended to allow: + +1. Fine-grained control over which pieces of secret data are exposed in the volume +2. The paths and filenames for how secret data are exposed + ### Secret Volume Plugin A new Kubelet volume plugin will be added to handle volumes with a secret source. This plugin will @@ -382,13 +392,14 @@ To create a pod that uses an ssh key stored as a secret, we first need to create "kind": "Secret", "id": "ssh-key-secret", "data": { - "id_rsa.pub": "dmFsdWUtMQ0K", - "id_rsa": "dmFsdWUtMg0KDQo=" + "id-rsa.pub": "dmFsdWUtMQ0K", + "id-rsa": "dmFsdWUtMg0KDQo=" } } ``` -**Note:** The values of secret data are encoded as base64-encoded strings. +**Note:** The values of secret data are encoded as base64-encoded strings. Newlines are not +valid within these strings and must be omitted. Now we can create a pod which references the secret with the ssh key and consumes it in a volume: @@ -432,8 +443,8 @@ Now we can create a pod which references the secret with the ssh key and consume When the container's command runs, the pieces of the key will be available in: - /etc/secret-volume/id_rsa.pub - /etc/secret-volume/id_rsa + /etc/secret-volume/id-rsa.pub + /etc/secret-volume/id-rsa The container is then free to use the secret data to establish an ssh connection. -- cgit v1.2.3 From d1ed142faf5e511f742cab9a77e21ee1ca58edc2 Mon Sep 17 00:00:00 2001 From: Andy Goldstein Date: Thu, 8 Jan 2015 15:41:38 -0500 Subject: Add streaming command execution & port forwarding Add streaming command execution & port forwarding via HTTP connection upgrades (currently using SPDY). 
--- command_execution_port_forwarding.md | 144 +++++++++++++++++++++++++++++++++++ 1 file changed, 144 insertions(+) create mode 100644 command_execution_port_forwarding.md diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md new file mode 100644 index 00000000..3b9aeec7 --- /dev/null +++ b/command_execution_port_forwarding.md @@ -0,0 +1,144 @@ +# Container Command Execution & Port Forwarding in Kubernetes + +## Abstract + +This describes an approach for providing support for: + +- executing commands in containers, with stdin/stdout/stderr streams attached +- port forwarding to containers + +## Background + +There are several related issues/PRs: + +- [Support attach](https://github.com/GoogleCloudPlatform/kubernetes/issues/1521) +- [Real container ssh](https://github.com/GoogleCloudPlatform/kubernetes/issues/1513) +- [Provide easy debug network access to services](https://github.com/GoogleCloudPlatform/kubernetes/issues/1863) +- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576) + +## Motivation + +Users and administrators are accustomed to being able to access their systems +via SSH to run remote commands, get shell access, and do port forwarding. + +Supporting SSH to containers in Kubernetes is a difficult task. You must +specify a "user" and a hostname to make an SSH connection, and `sshd` requires +real users (resolvable by NSS and PAM). Because a container belongs to a pod, +and the pod belongs to a namespace, you need to specify namespace/pod/container +to uniquely identify the target container. Unfortunately, a +namespace/pod/container is not a real user as far as SSH is concerned. Also, +most Linux systems limit user names to 32 characters, which is unlikely to be +large enough to contain namespace/pod/container. We could devise some scheme to +map each namespace/pod/container to a 32-character user name, adding entries to +`/etc/passwd` (or LDAP, etc.) 
and keeping those entries fully in sync all the
+time. Alternatively, we could write custom NSS and PAM modules that allow the
+host to resolve a namespace/pod/container to a user without needing to keep
+files or LDAP in sync.
+
+As an alternative to SSH, we are using a multiplexed streaming protocol that
+runs on top of HTTP. There are no requirements about users being real users,
+nor is there any limitation on user name length, as the protocol is under our
+control. The only downside is that standard tooling that expects to use SSH
+won't be able to work with this mechanism, unless adapters can be written.
+
+## Constraints and Assumptions
+
+- SSH support is not currently in scope
+- CGroup confinement is ultimately desired, but implementing that support is not currently in scope
+- SELinux confinement is ultimately desired, but implementing that support is not currently in scope
+
+## Use Cases
+
+- As a user of a Kubernetes cluster, I want to run arbitrary commands in a container, attaching my local stdin/stdout/stderr to the container
+- As a user of a Kubernetes cluster, I want to be able to connect to local ports on my computer and have them forwarded to ports in the container
+
+## Process Flow
+
+### Remote Command Execution Flow
+1. The client connects to the Kubernetes Master to initiate a remote command execution
+request
+2. The Master proxies the request to the Kubelet where the container lives
+3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container
+
+### Port Forwarding Flow
+1. The client connects to the Kubernetes Master to initiate a port forwarding
+request
+2. The Master proxies the request to the Kubelet where the container lives
+3. The client listens on each specified local port, awaiting local connections
+4. The client connects to one of the local listening ports
+5. The client notifies the Kubelet of the new connection
+6.
The Kubelet executes nsenter + socat and streams data back and forth between the client and the port in the container + + +## Design Considerations + +### Streaming Protocol + +The current multiplexed streaming protocol used is SPDY. This is not the +long-term desire, however. As soon as there is viable support for HTTP/2 in Go, +we will switch to that. + +### Master as First Level Proxy + +Clients should not be allowed to communicate directly with the Kubelet for +security reasons. Therefore, the Master is currently the only suggested entry +point to be used for remote command execution and port forwarding. This is not +necessarily desirable, as it means that all remote command execution and port +forwarding traffic must travel through the Master, potentially impacting other +API requests. + +In the future, it might make more sense to retrieve an authorization token from +the Master, and then use that token to initiate a remote command execution or +port forwarding request with a load balanced proxy service dedicated to this +functionality. This would keep the streaming traffic out of the Master. + +### Kubelet as Backend Proxy + +The kubelet is currently responsible for handling remote command execution and +port forwarding requests. Just like with the Master described above, this means +that all remote command execution and port forwarding streaming traffic must +travel through the Kubelet, which could result in a degraded ability to service +other requests. + +In the future, it might make more sense to use a separate service on the node. + +Alternatively, we could possibly inject a process into the container that only +listens for a single request, expose that process's listening port on the node, +and then issue a redirect to the client such that it would connect to the first +level proxy, which would then proxy directly to the injected process's exposed +port. This would minimize the amount of proxying that takes place. 
+
+### Scalability
+
+There are at least 2 different ways to execute a command in a container:
+`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
+more obvious choice, it has some drawbacks.
+
+#### `docker exec`
+
+We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
+on the node), but this would require proxying from the edge and securing the
+Docker API. `docker exec` calls go through the Docker daemon, meaning that all
+stdin/stdout/stderr traffic is proxied through the daemon, adding an extra hop.
+Additionally, you can't isolate 1 malicious `docker exec` call from normal
+usage, meaning an attacker could initiate a denial of service or other attack
+and take down the Docker daemon, or the node itself.
+
+We expect remote command execution and port forwarding requests to be long
+running and/or high bandwidth operations, and routing all the streaming data
+through the Docker daemon feels like a bottleneck we can avoid.
+
+#### `nsenter`
+
+The implementation currently uses `nsenter` to run commands in containers,
+joining the appropriate container namespaces. `nsenter` runs directly on the
+node and is not proxied through any single daemon process.
+
+### Security
+
+Authentication and authorization have not been specifically tested yet with this
+functionality. We need to make sure that users are not allowed to execute
+remote commands or do port forwarding to containers they aren't allowed to
+access.
+
+Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts.
\ No newline at end of file -- cgit v1.2.3 From 684bb8868e54d701edab9b6eeaa3500fb6a931e1 Mon Sep 17 00:00:00 2001 From: Deyuan Deng Date: Fri, 20 Feb 2015 10:44:02 -0500 Subject: Admission doc cleanup --- admission_control.md | 6 +++--- admission_control_limit_range.md | 20 ++++++++++---------- admission_control_resource_quota.md | 25 +++++++++++++------------ 3 files changed, 26 insertions(+), 25 deletions(-) diff --git a/admission_control.md b/admission_control.md index 88afda73..1e1c1e53 100644 --- a/admission_control.md +++ b/admission_control.md @@ -1,6 +1,6 @@ # Kubernetes Proposal - Admission Control -**Related PR:** +**Related PR:** | Topic | Link | | ----- | ---- | @@ -35,7 +35,7 @@ The kube-apiserver takes the following OPTIONAL arguments to enable admission co An **AdmissionControl** plug-in is an implementation of the following interface: -``` +```go package admission // Attributes is an interface used by a plug-in to make an admission decision on a individual request. @@ -57,7 +57,7 @@ type Interface interface { A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name, and implementation of admission.Interface. -``` +```go func init() { admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) } diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 69fe144b..e3a56c87 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -9,7 +9,7 @@ This document proposes a system for enforcing min/max limits per resource as par A new resource, **LimitRange**, is introduced to enumerate min/max limits for a resource type scoped to a Kubernetes namespace. 
-``` +```go const ( // Limit that applies to all pods in a namespace LimitTypePod string = "Pod" @@ -54,7 +54,7 @@ type LimitRangeList struct { ## AdmissionControl plugin: LimitRanger -The **LimitRanger** plug-in introspects all incoming admission requests. +The **LimitRanger** plug-in introspects all incoming admission requests. It makes decisions by evaluating the incoming object against all defined **LimitRange** objects in the request context namespace. @@ -97,20 +97,20 @@ kubectl is modified to support the **LimitRange** resource. For example, -``` +```shell $ kubectl namespace myspace $ kubectl create -f examples/limitrange/limit-range.json $ kubectl get limits NAME limits $ kubectl describe limits limits -Name: limits -Type Resource Min Max ----- -------- --- --- -Pod memory 1Mi 1Gi -Pod cpu 250m 2 -Container cpu 250m 2 -Container memory 1Mi 1Gi +Name: limits +Type Resource Min Max +---- -------- --- --- +Pod memory 1Mi 1Gi +Pod cpu 250m 2 +Container memory 1Mi 1Gi +Container cpu 250m 2 ``` ## Future Enhancements: Define limits for a particular pod or container. diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 08bc6bec..ebad0728 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -10,7 +10,7 @@ A new resource, **ResourceQuota**, is introduced to enumerate hard resource limi A new resource, **ResourceQuotaUsage**, is introduced to support atomic updates of a **ResourceQuota** status. 
-``` +```go // The following identify resource constants for Kubernetes object types const ( // Pods, number @@ -139,14 +139,15 @@ $ kubectl namespace myspace $ kubectl create -f examples/resourcequota/resource-quota.json $ kubectl get quota NAME -myquota -$ kubectl describe quota myquota -Name: myquota -Resource Used Hard --------- ---- ---- -cpu 100m 20 -memory 0 1.5Gb -pods 1 10 -replicationControllers 1 10 -services 2 3 -``` \ No newline at end of file +quota +$ kubectl describe quota quota +Name: quota +Resource Used Hard +-------- ---- ---- +cpu 0m 20 +memory 0 1Gi +pods 5 10 +replicationcontrollers 5 20 +resourcequotas 1 1 +services 3 5 +``` -- cgit v1.2.3 From 9b36d8d8ed359be5e43215c7304adf6a53e78409 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Tue, 11 Nov 2014 10:52:31 -0800 Subject: Service account proposal. COMMIT_BLOCKED_ON_GENDOCS --- security.md | 4 +- security_context.md | 8 +-- service_accounts.md | 164 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 171 insertions(+), 5 deletions(-) create mode 100644 service_accounts.md diff --git a/security.md b/security.md index 27f07cd6..ba699739 100644 --- a/security.md +++ b/security.md @@ -97,6 +97,8 @@ A pod runs in a *security context* under a *service account* that is defined by * Secret distribution via files https://github.com/GoogleCloudPlatform/kubernetes/pull/2030 * Docker secrets https://github.com/docker/docker/pull/6697 * Docker vault https://github.com/docker/docker/issues/10310 +* Service Accounts: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md +* Secret volumes https://github.com/GoogleCloudPlatform/kubernetes/4126 ## Specific Design Points @@ -112,4 +114,4 @@ Both the Kubelet and Kube Proxy need information related to their specific roles The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. 
Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pod's attached volumes.
-The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time).
\ No newline at end of file
+The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings.
It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). diff --git a/security_context.md b/security_context.md index 400d30e9..7dc10e69 100644 --- a/security_context.md +++ b/security_context.md @@ -172,13 +172,13 @@ type IDMapping struct { // IDMappingRange specifies a mapping between container IDs and node IDs type IDMappingRange struct { - // ContainerID is the starting container ID + // ContainerID is the starting container UID or GID ContainerID int - // HostID is the starting host ID + // HostID is the starting host UID or GID HostID int - // Length is the length of the ID range + // Length is the length of the UID/GID range Length int } @@ -187,4 +187,4 @@ type IDMappingRange struct { #### Security Context Lifecycle -The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use. \ No newline at end of file +The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). 
If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use.
diff --git a/service_accounts.md b/service_accounts.md
new file mode 100644
index 00000000..5d86f244
--- /dev/null
+++ b/service_accounts.md
@@ -0,0 +1,164 @@
+# Service Accounts
+
+## Motivation
+
+Processes in Pods may need to call the Kubernetes API. For example:
+ - scheduler
+ - replication controller
+ - minion controller
+ - a map-reduce type framework which has a controller that then tries to make a dynamically determined number of workers and watch them
+ - continuous build and push system
+ - monitoring system
+
+They also may interact with services other than the Kubernetes API, such as:
+ - an image repository, such as docker -- both when the images are pulled to start the containers, and for writing
+   images in the case of pods that generate images.
+ - accessing other cloud services, such as blob storage, in the context of a large, integrated cloud offering (hosted
+   or private).
+ - accessing files in an NFS volume attached to the pod
+
+## Design Overview
+A service account binds together several things:
+ - a *name*, understood by users, and perhaps by peripheral systems, for an identity
+ - a *principal* that can be authenticated and [authorized](../authorization.md)
+ - a [security context](./security_contexts.md), which defines the Linux Capabilities, User IDs, Group IDs, and other
+   capabilities and controls on interaction with the file system and OS.
+ - a set of [secrets](./secrets.md), which a container may use to
+   access various networked resources.
+ +## Design Discussion + +A new object Kind is added: +```go +type ServiceAccount struct { + TypeMeta `json:",inline" yaml:",inline"` + ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"` + + username string + securityContext ObjectReference // (reference to a securityContext object) + secrets []ObjectReference // (references to secret objects) +} +``` + +The name ServiceAccount is chosen because it is widely used already (e.g. by Kerberos and LDAP) +to refer to this type of account. Note that it has no relation to kubernetes Service objects. + +The ServiceAccount object does not include any information that could not be defined separately: + - username can be defined however users are defined. + - securityContext and secrets are only referenced and are created using the REST API. + +The purpose of the serviceAccount object is twofold: + - to bind usernames to securityContexts and secrets, so that the username can be used to refer to them succinctly + in contexts where explicitly naming securityContexts and secrets would be inconvenient + - to provide an interface to simplify allocation of new securityContexts and secrets. +These features are explained later. + +### Names + +From the standpoint of the Kubernetes API, a `user` is any principal which can authenticate to the kubernetes API. +This includes a human running `kubectl` on her desktop and a container in a Pod on a Node making API calls. + +There is already a notion of a username in kubernetes, which is populated into a request context after authentication. +However, there is no API object representing a user. While this may evolve, it is expected that in mature installations, +the canonical storage of user identifiers will be handled by a system external to kubernetes. + +Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be +simple Unix-style short usernames (e.g. 
`alice`), or may be qualified to allow for federated identity ( +`alice@example.com` vs `alice@example.org`). Naming convention may distinguish service accounts from user +accounts (e.g. `alice@example.com` vs `build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), +but Kubernetes does not require this. + +Kubernetes also does not require that there be a distinction between human and Pod users. It will be possible +to set up a cluster where Alice the human talks to the kubernetes API as username `alice` and starts pods that +also talk to the API as user `alice` and write files to NFS as user `alice`. But this is not recommended. + +Instead, it is recommended that Pods and Humans have distinct identities, and reference implementations will +make this distinction. + +The distinction is useful for a number of reasons: + - the requirements for humans and automated processes are different: + - Humans need a wide range of capabilities to do their daily activities. Automated processes often have more narrowly-defined activities. + - Humans may better tolerate the exceptional conditions created by expiration of a token. Remembering to handle + this in a program is more annoying. So, either long-lasting credentials or automated rotation of credentials is + needed. + - A Human typically keeps credentials on a machine that is not part of the cluster and so not subject to automatic + management. A VM with a role/service-account can have its credentials automatically managed. + - the identity of a Pod cannot in general be mapped to a single human. + - If policy allows, it may be created by one human, and then updated by another, and another, until its behavior cannot be attributed to a single human. + +**TODO**: consider getting rid of the separate serviceAccount object and just rolling its parts into the SecurityContext or +Pod Object. 
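The naming convention above is optional, but it can be checked mechanically. As a minimal sketch (the `service-accounts` domain suffix here is a hypothetical example from this document, not something Kubernetes mandates), a cluster adopting the convention could classify identities like this:

```go
package main

import (
	"fmt"
	"strings"
)

// isServiceAccountName reports whether a username follows the optional
// convention sketched above, where service-account identities live under a
// dedicated DNS suffix. The suffix is a hypothetical example.
func isServiceAccountName(username string) bool {
	i := strings.LastIndex(username, "@")
	if i < 0 {
		return false // unqualified short name, e.g. "alice"
	}
	domain := username[i+1:]
	return strings.HasSuffix(domain, "service-accounts.example.com")
}

func main() {
	fmt.Println(isServiceAccountName("alice@example.com"))
	fmt.Println(isServiceAccountName("build-service-account-a3b7f0@foo-namespace.service-accounts.example.com"))
}
```

Nothing in the API would depend on such a check; it only illustrates how a qualified-name convention lets tooling tell human and service-account identities apart.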
+ +The `secrets` field is a list of references to /secret objects that a process started as that service account should +have access to in order to assert that role. + +The secrets are not inline with the serviceAccount object. This way, most or all users can have permission to `GET /serviceAccounts` so they can remind themselves +what serviceAccounts are available for use. + +Nothing will prevent creation of a serviceAccount with two secrets of type `SecretTypeKubernetesAuth`, or secrets of two +different types. Kubelet and client libraries will have some behavior, TBD, to handle the case of multiple secrets of a +given type (pick first or provide all and try each in order, etc). + +When a serviceAccount and a matching secret exist, then a `User.Info` for the serviceAccount and a `BearerToken` from the secret +are added to the map of tokens used by the authentication process in the apiserver, and similarly for other types. (We +might have some types that do not do anything on apiserver but just get pushed to the kubelet.) + +### Pods +The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If this is unset, then a +default value is chosen. If it is set, then the corresponding value of `Pods.Spec.SecurityContext` is set by the +Service Account Finalizer (see below). + +TBD: how policy limits which users can make pods with which service accounts. + +### Authorization +Kubernetes API Authorization Policies refer to users. Pods created with a `Pods.Spec.ServiceAccountUsername` typically +get a `Secret` which allows them to authenticate to the Kubernetes APIserver as a particular user. So any +policy that is desired can be applied to them. + +A higher level workflow is needed to coordinate creation of serviceAccounts, secrets and relevant policy objects. +Users are free to extend kubernetes to put this business logic wherever is convenient for them, though the +Service Account Finalizer is one place where this can happen (see below). 
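The token map described above (a `User.Info` registered against each serviceAccount's `BearerToken`) can be sketched roughly as follows. This is not the actual apiserver code; the types and names are simplified stand-ins invented for illustration:

```go
package main

import "fmt"

// UserInfo is a hypothetical stand-in for the User.Info that the apiserver
// associates with an authenticated request.
type UserInfo struct {
	Name string
}

// tokenAuthenticator sketches the map of tokens used by the authentication
// process: when a serviceAccount and a matching secret exist, the secret's
// bearer token is registered against the account's UserInfo.
type tokenAuthenticator struct {
	tokens map[string]UserInfo
}

// Register records a bearer token for a service-account identity.
func (a *tokenAuthenticator) Register(bearerToken string, u UserInfo) {
	a.tokens[bearerToken] = u
}

// AuthenticateToken resolves a presented bearer token to a user, or reports
// failure so that other authenticators can be tried.
func (a *tokenAuthenticator) AuthenticateToken(bearerToken string) (UserInfo, bool) {
	u, ok := a.tokens[bearerToken]
	return u, ok
}

func main() {
	auth := &tokenAuthenticator{tokens: map[string]UserInfo{}}
	auth.Register("s3cr3t-token", UserInfo{Name: "default-service-account@serviceaccounts.example.com"})
	if u, ok := auth.AuthenticateToken("s3cr3t-token"); ok {
		fmt.Println("authenticated as", u.Name)
	}
}
```

Any policy that refers to users then applies to pods authenticating with such tokens, exactly as for human users.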
+ +### Kubelet + +The kubelet will treat as "not ready to run" (needing a finalizer to act on it) any Pod which has an empty +SecurityContext. + +The kubelet will set a default, restrictive security context for any pods created from non-Apiserver config +sources (http, file). + +Kubelet watches apiserver for secrets which are needed by pods bound to it. + +**TODO**: how to only let kubelet see secrets it needs to know about. + +### The service account finalizer + +There are several ways to use Pods with SecurityContexts and Secrets. + +One way is to explicitly specify the securityContext and all secrets of a Pod when the pod is initially created, +like this: + +**TODO**: example of pod with explicit refs. + +Another way is with the *Service Account Finalizer*, a plugin process which is optional, and which handles +business logic around service accounts. + +The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount definitions. + +First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no `Pod.Spec.SecurityContext` set, +then it copies in the referenced securityContext and secrets references for the corresponding `serviceAccount`. + +Second, if ServiceAccount definitions change, it may take some actions. +**TODO**: decide what actions it takes when a serviceAccount definition changes. Does it stop pods, or just +allow someone to list ones that are out of spec? In general, people may want to customize this. + +Third, if a new namespace is created, it may create a new serviceAccount for that namespace. This may include +a new username (e.g. `NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), a new +securityContext, a newly generated secret to authenticate that serviceAccount to the Kubernetes API, and default +policies for that service account. +**TODO**: more concrete example. What are typical default permissions for default service account (e.g. 
readonly access +to services in the same namespace and read-write access to events in that namespace?) + +Finally, it may provide an interface to automate creation of new serviceAccounts. In that case, the user may want +to GET serviceAccounts to see what has been created. + -- cgit v1.2.3 From 0d339383f4c17664472df0eb03289161062bda11 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 19 Feb 2015 22:03:36 -0800 Subject: minor fixups as I review secrets --- secrets.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/secrets.md b/secrets.md index ce02f930..ac8776bd 100644 --- a/secrets.md +++ b/secrets.md @@ -283,9 +283,9 @@ type Secret struct { type SecretType string const ( - SecretTypeOpaque SecretType = "opaque" // Opaque (arbitrary data; default) - SecretTypeKubernetesAuthToken SecretType = "kubernetes-auth" // Kubernetes auth token - SecretTypeDockerRegistryAuth SecretType = "docker-reg-auth" // Docker registry auth + SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default) + SecretTypeKubernetesAuthToken SecretType = "KubernetesAuth" // Kubernetes auth token + SecretTypeDockerRegistryAuth SecretType = "DockerRegistryAuth" // Docker registry auth // FUTURE: other type values ) -- cgit v1.2.3 From dee17e393e2937729a784b34b96fd4418e046f24 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Mon, 23 Feb 2015 10:57:51 -0800 Subject: comments on base64-ness of secrets --- secrets.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/secrets.md b/secrets.md index ac8776bd..dc596183 100644 --- a/secrets.md +++ b/secrets.md @@ -273,7 +273,8 @@ type Secret struct { ObjectMeta // Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN. - // The serialized form of the secret data is a base64 encoded string. + // The serialized form of the secret data is a base64 encoded string, + // representing the arbitrary (possibly non-string) data value here. 
Data map[string][]byte `json:"data,omitempty"` // Used to facilitate programmatic handling of secret data. @@ -398,8 +399,9 @@ To create a pod that uses an ssh key stored as a secret, we first need to create } ``` -**Note:** The values of secret data are encoded as base64-encoded strings. Newlines are not -valid within these strings and must be omitted. +**Note:** The serialized JSON and YAML values of secret data are encoded as +base64 strings. Newlines are not valid within these strings and must be +omitted. Now we can create a pod which references the secret with the ssh key and consumes it in a volume: -- cgit v1.2.3 From 26159771f22b3e58a0f34be13f1d9ac54e942acf Mon Sep 17 00:00:00 2001 From: Ben McCann Date: Mon, 23 Feb 2015 13:55:02 -0800 Subject: Update links to security contexts and service accounts to point to actual docs instead of pull requests now that those proposals have been merged --- secrets.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/secrets.md b/secrets.md index ce02f930..60e825f2 100644 --- a/secrets.md +++ b/secrets.md @@ -72,7 +72,7 @@ service would also consume the secrets associated with the MySQL service. ### Use-Case: Secrets associated with service accounts -[Service Accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) are proposed as a +[Service Accounts](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) are proposed as a mechanism to decouple capabilities and security contexts from individual human users. A `ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and @@ -236,7 +236,7 @@ memory overcommit on the node. #### Secret data on the node: isolation -Every pod will have a [security context](https://github.com/GoogleCloudPlatform/kubernetes/pull/3910). 
+Every pod will have a [security context](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/security_context.md). Secret data on the node should be isolated according to the security context of the container. The Kubelet volume plugin API will be changed so that a volume plugin receives the security context of a volume along with the volume spec. This will allow volume plugins to implement setting the @@ -248,7 +248,7 @@ Several proposals / upstream patches are notable as background for this proposal 1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) 2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. [Kubernetes service account proposal](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) +3. [Kubernetes service account proposal](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) 4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) 5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) -- cgit v1.2.3 From db1eb9d48c41fe19eca08bb0b0d0e34f37f4f925 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Tue, 24 Feb 2015 22:05:24 -0500 Subject: Fix nits in security proposal --- security.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/security.md b/security.md index ba699739..7bdca440 100644 --- a/security.md +++ b/security.md @@ -38,7 +38,7 @@ Automated process users fall into the following categories: * write pod specs. * making some of their own images, and using some "community" docker images * know which pods need to talk to which other pods - * decide which pods should be share files with other pods, and which should not. + * decide which pods should share files with other pods, and which should not. 
* reason about application level security, such as containing the effects of a local-file-read exploit in a webserver pod. * do not often reason about operating system or organizational security. * are not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. @@ -66,11 +66,11 @@ A pod runs in a *security context* under a *service account* that is defined by 1. The API should authenticate and authorize user actions [authn and authz](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/access.md) 2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. 3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) -4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) +4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption 2. If the user who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk - 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated to the container's action -5. 
When container processes runs on the cluster, they should run in a [security context](https://github.com/GoogleCloudPlatform/kubernetes/pull/3910) that isolates those processes via Linux user security, user namespaces, and permissions. + 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action +5. When container processes run on the cluster, they should run in a [security context](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID 2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID 3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions @@ -79,7 +79,7 @@ A pod runs in a *security context* under a *service account* that is defined by 6. Developers may need to ensure their images work within higher security requirements specified by administrators 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. 8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes -6. Developers should be able to define [secrets](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297) that are automatically added to the containers when pods are run +6. Developers should be able to define [secrets](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/secrets.md) that are automatically added to the containers when pods are run 1. 
Secrets are files injected into the container whose values should not be displayed within a pod. Examples: 1. An SSH private key for git cloning remote data 2. A client certificate for accessing a remote system @@ -87,7 +87,7 @@ A pod runs in a *security context* under a *service account* that is defined by 4. A .kubeconfig file with embedded cert / token data for accessing the Kubernetes master 5. A .dockercfg file for pulling images from a protected registry 2. Developers should be able to define the pod spec so that a secret lands in a specific location - 3. Project administrators should be able to limit developers within a namespace from viewing or modify secrets (anyone who can launch an arbitrary pod can view secrets) + 3. Project administrators should be able to limit developers within a namespace from viewing or modifying secrets (anyone who can launch an arbitrary pod can view secrets) 4. Secrets are generally not copied from one namespace to another when a developer's application definitions are copied -- cgit v1.2.3 From 4cc000f3f4bcb0fa92dd35adb5a59a1b650967cc Mon Sep 17 00:00:00 2001 From: markturansky Date: Tue, 3 Mar 2015 15:06:18 -0500 Subject: Persistent storage proposal --- persistent-storage.md | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) create mode 100644 persistent-storage.md diff --git a/persistent-storage.md b/persistent-storage.md new file mode 100644 index 00000000..c29319aa --- /dev/null +++ b/persistent-storage.md @@ -0,0 +1,85 @@ +# PersistentVolume + +This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. + +### tl;dr + +Two new API kinds: + +A `PersistentVolume` is created by a cluster admin and is a piece of persistent storage exposed as a volume. It is analogous to a node. + +A `PersistentVolumeClaim` is a user's request for a persistent volume to use in a pod. It is analogous to a pod. 
+ +One new system component: + +`PersistentVolumeManager` watches for new volumes to manage in the system, analogous to the node controller. The volume manager also watches for claims by users and binds them to available volumes. This +component is a singleton that manages all persistent volumes in the cluster. + +Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider. + +### Goals + +* Allow administrators to describe available storage +* Allow pod authors to discover and request persistent volumes to use with pods +* Enforce security through access control lists and securing storage to the same namespace as the pod volume +* Enforce quotas through admission control +* Enforce scheduler rules by resource counting +* Ensure developers can rely on storage being available without being closely bound to a particular disk, server, network, or storage device. + + +#### Describe available storage + +Cluster adminstrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. +All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available, matching volume. + +Many means of dynamic provisioning will eventually be implemented for various storage types. 
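The bind step performed by the volume manager — matching a claim's access modes and requested size against available volumes, as described in this proposal — can be sketched as follows. The types here are simplified, hypothetical stand-ins, not the real API objects:

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the PV/PVC fields the matching
// step would consult.
type PersistentVolume struct {
	Name        string
	AccessModes []string
	Capacity    int64 // bytes
	BoundClaim  string
}

type Claim struct {
	Name        string
	AccessModes []string
	RequestSize int64 // bytes
}

// hasModes reports whether the volume's access modes cover every mode the
// claim asks for.
func hasModes(volume, wanted []string) bool {
	set := map[string]bool{}
	for _, m := range volume {
		set[m] = true
	}
	for _, m := range wanted {
		if !set[m] {
			return false
		}
	}
	return true
}

// match sketches the manager's bind step: find an unbound volume whose access
// modes cover the claim's and whose capacity satisfies the request, then bind
// by putting a reference to the claim on the PV. A request can go unfulfilled,
// in which case nil is returned.
func match(volumes []*PersistentVolume, c Claim) *PersistentVolume {
	for _, v := range volumes {
		if v.BoundClaim == "" && v.Capacity >= c.RequestSize && hasModes(v.AccessModes, c.AccessModes) {
			v.BoundClaim = c.Name
			return v
		}
	}
	return nil
}

func main() {
	pvs := []*PersistentVolume{{Name: "pv0001", AccessModes: []string{"ReadWriteOnce"}, Capacity: 10 << 30}}
	if pv := match(pvs, Claim{Name: "myclaim-1", AccessModes: []string{"ReadWriteOnce"}, RequestSize: 3 << 30}); pv != nil {
		fmt.Println(pv.Name, "bound to", pv.BoundClaim)
	}
}
```

A real manager would also have to handle races, retries, and unbinding on claim deletion; this sketch only shows the matching decision itself.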
+ +``` + + $ cluster/kubectl.sh get pv + +``` + +##### API Implementation: + +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume in system namespace | +| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume in system namespace with {name} | +| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume in system namespace with {name} | +| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume in system namespace with {name} | +| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume in system namespace | +| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume in system namespace | + + + +#### Request Storage + + +Kubernetes users request a persistent volume for their pod by creating a *PersistentVolumeClaim*. Their request for storage is described by their requirements for resource and mount capabilities. + +Requests for volumes are bound to available volumes by the volume manager, if a suitable match is found. Requests for resources can go unfulfilled. + +Users attach their claim to their pod using a new *PersistentVolumeClaimVolumeSource* volume source. + + +##### Users require a full API to manage their claims. 
+ + +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /api/{version}/ns/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} | +| GET | GET | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} | +| UPDATE | PUT | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} | +| DELETE | DELETE | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} | +| LIST | GET | /api/{version}/ns/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} | +| WATCH | GET | /api/{version}/watch/ns/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} | + + + +#### Scheduling constraints + +Scheduling constraints are to be handled similarly to pod resource constraints. Pods will need to be annotated or decorated with the number of resources they require on a node. Similarly, a node will need to list how many it has used or available. + +TBD + -- cgit v1.2.3 From c0c7a57db64ed2496ff4f0aa066e69b029fe423f Mon Sep 17 00:00:00 2001 From: markturansky Date: Thu, 5 Mar 2015 11:51:52 -0500 Subject: Added more detail and explained workflow/lifecycle of a persistent volume using examples --- persistent-storage.md | 156 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 138 insertions(+), 18 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index c29319aa..bafdb343 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -1,4 +1,4 @@ -# PersistentVolume +# Persistent Storage This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. 
@@ -6,14 +6,17 @@ This document proposes a model for managing persistent, cluster-scoped storage f Two new API kinds: -A `PersistentVolume` is created by a cluster admin and is a piece of persistent storage exposed as a volume. It is analogous to a node. +A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. -A `PersistentVolumeClaim` is a user's request for a persistent volume to use in a pod. It is analogous to a pod. +A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod. One new system component: -`PersistentVolumeManager` watches for new volumes to manage in the system, analogous to the node controller. The volume manager also watches for claims by users and binds them to available volumes. This -component is a singleton that manages all persistent volumes in the cluster. +`PersistentVolumeManager` is a singleton running in master that manages all PVs in the system, analogous to the node controller. The volume manager watches the API for newly created volumes to manage. The manager also watches for claims by users and binds them to available volumes. + +One new volume: + +`PersistentVolumeClaimVolumeSource` references the user's PVC in the same namespace. This volume finds the bound PV and mounts that volume for the pod. A `PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another type of volume that is owned by someone else (the system). Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider. @@ -29,18 +32,12 @@ Kubernetes makes no guarantees at runtime that the underlying storage exists or #### Describe available storage -Cluster adminstrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. 
-All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available, matching volume. +Cluster adminstrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. Many means of dynamic provisioning will be eventually be implemented for various storage types. -``` - - $ cluster/kubectl.sh get pv -``` - -##### API Implementation: +##### PersistentVolume API | Action | HTTP Verb | Path | Description | | ---- | ---- | ---- | ---- | @@ -52,18 +49,16 @@ Many means of dynamic provisioning will be eventually be implemented for various | WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume in system namespace | - #### Request Storage - -Kubernetes users request a persistent volume for their pod by creating a *PersistentVolumeClaim*. Their request for storage is described by their requirements for resource and mount capabilities. +Kubernetes users request persistent storage for their pod by creating a ```PersistentVolumeClaim```. Their request for storage is described by their requirements for resources and mount capabilities. Requests for volumes are bound to available volumes by the volume manager, if a suitable match is found. Requests for resources can go unfulfilled. -Users attach their claim to their pod using a new *PersistentVolumeClaimVolumeSource* volume source. +Users attach their claim to their pod using a new ```PersistentVolumeClaimVolumeSource``` volume source. -##### Users require a full API to manage their claims. 
+##### PersistentVolumeClaim API + + | Action | HTTP Verb | Path | Description | @@ -83,3 +78,128 @@ Scheduling constraints are to be handled similar to pod resource constraints. P TBD + +### Example + +#### Admin provisions storage + +An administrator provisions storage by posting PVs to the API. Various ways to automate this task can be scripted. Dynamic provisioning is a future feature that can maintain levels of PVs. + +``` +POST: + +kind: PersistentVolume +apiVersion: v1beta3 +metadata: + name: pv0001 +spec: + capacity: + storage: 10 + persistentDisk: + pdName: "abc123" + fsType: "ext4" + +-------------------------------------------------- + +cluster/kubectl.sh get pv + +NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM +pv0001 map[] 10737418240 RWO Pending + + +``` + +#### Users request storage + +A user requests storage by posting a PVC to the API. Their request contains the AccessModes they wish their volume to have and the minimum size needed. + +The user must be within a namespace to create PVCs. + +``` + +POST: +kind: PersistentVolumeClaim +apiVersion: v1beta3 +metadata: + name: myclaim-1 +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 3 + +-------------------------------------------------- + +cluster/kubectl.sh get pvc + + +NAME LABELS STATUS VOLUME +myclaim-1 map[] pending + +``` + + +#### Matching and binding + + The ```PersistentVolumeManager``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. Requests can go unfulfilled if a suitable match is not found. + +``` + +cluster/kubectl.sh get pv + +NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM +pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e + + +cluster/kubectl.sh get pvc + +NAME LABELS STATUS VOLUME +myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e + + +``` + +#### Claim usage + +The claim holder can use their claim as a volume. 
The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim and mount its volume for a pod. + +The claim holder owns the claim and its data for as long as the claim exists. The pod using the claim can be deleted, but the claim remains in the user's namespace. It can be used again and again by many pods. + +``` +POST: + +kind: Pod +apiVersion: v1beta3 +metadata: + name: mypod +spec: + containers: + - image: dockerfile/nginx + name: myfrontend + volumeMounts: + - mountPath: "/var/www/html" + name: mypd + volumes: + - name: mypd + source: + persistentVolumeClaim: + accessMode: ReadWriteOnce + claimRef: + name: myclaim-1 + +``` + +#### Releasing a claim and Recycling a volume + +When a claim holder is finished with their data, they can delete their claim. + +``` + +cluster/kubectl.sh delete pvc myclaim-1 + +``` + +The ```PersistentVolumeManager``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'. + +Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. \ No newline at end of file -- cgit v1.2.3 From b1152d31d161c30d0f48448996018e3cf3e59440 Mon Sep 17 00:00:00 2001 From: Young Date: Sun, 8 Mar 2015 15:38:21 +0000 Subject: simple typo --- persistent-storage.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index bafdb343..d9824c2a 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -32,7 +32,7 @@ Kubernetes makes no guarantees at runtime that the underlying storage exists or #### Describe available storage -Cluster adminstrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. All persistent volumes are managed and made available by the volume manager. 
The manager also watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. +Cluster administrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. Many means of dynamic provisioning will be eventually be implemented for various storage types. @@ -202,4 +202,4 @@ cluster/kubectl.sh delete pvc myclaim-1 The ```PersistentVolumeManager``` will reconcile this by removing the claim reference from the PV and change the PVs status to 'Released'. -Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. \ No newline at end of file +Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. -- cgit v1.2.3 From f09c5510822ab57c0d08d1196fa7c4d24f0a0c37 Mon Sep 17 00:00:00 2001 From: markturansky Date: Mon, 9 Mar 2015 12:21:54 -0400 Subject: Edited to reflect that PVs have no namespace --- persistent-storage.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index bafdb343..a4c1c9ce 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -34,6 +34,8 @@ Kubernetes makes no guarantees at runtime that the underlying storage exists or Cluster adminstrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. 
All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. +PVs are system objects and, thus, have no namespace. + Many means of dynamic provisioning will be eventually be implemented for various storage types. @@ -41,12 +43,12 @@ Many means of dynamic provisioning will be eventually be implemented for various | Action | HTTP Verb | Path | Description | | ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume in system namespace | -| GET | GET | /api/{version}persistentvolumes/{name} | Get instance of PersistentVolume in system namespace with {name} | -| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume in system namespace with {name} | -| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume in system namespace with {name} | -| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume in system namespace | -| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume in system namespace | +| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume | +| GET | GET | /api/{version}persistentvolumes/{name} | Get instance of PersistentVolume with {name} | +| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} | +| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} | +| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume | +| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume | #### Request Storage -- cgit v1.2.3 From 
9249c265badd7e73f415fa8a539c4f6b9ed07b5a Mon Sep 17 00:00:00 2001 From: markturansky Date: Tue, 10 Mar 2015 10:18:24 -0400 Subject: Added verbiage about events --- persistent-storage.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/persistent-storage.md b/persistent-storage.md index a4c1c9ce..586f75bf 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -81,6 +81,13 @@ Scheduling constraints are to be handled similar to pod resource constraints. P TBD +#### Events + +The implementation of persistent storage will not require events to communicate to the user the state of their claim. The CLI for bound claims contains a reference to the backing persistent volume. This is always present in the API and CLI, making an event to communicate the same unnecessary. + +Events that communicate the state of a mounted volume are left to the volume plugins. + + ### Example #### Admin provisions storage -- cgit v1.2.3 From aa00e8e7f157bbf96f06ea4c92f1e69b866eca34 Mon Sep 17 00:00:00 2001 From: Salvatore Dario Minonne Date: Mon, 9 Mar 2015 18:44:31 +0100 Subject: updating labels.md and design/labels.md --- labels.md | 76 --------------------------------------------------------------- 1 file changed, 76 deletions(-) delete mode 100644 labels.md diff --git a/labels.md b/labels.md deleted file mode 100644 index bc151f7c..00000000 --- a/labels.md +++ /dev/null @@ -1,76 +0,0 @@ -# Labels - -_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](/docs/api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes. - -Each object can have a set of key/value labels set on it, with at most one label with a particular key. 
-``` -"labels": { - "key1" : "value1", - "key2" : "value2" -} -``` - -Unlike [names and UIDs](/docs/identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s). - -Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes. - -Label selectors may also be used to associate policies with sets of objects. - -We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](/docs/container-environment.md). - -Valid label keys are comprised of two segments - prefix and name - separated -by a slash (`/`). The name segment is required and must be a DNS label: 63 -characters or less, all lowercase, beginning and ending with an alphanumeric -character (`[a-z0-9]`), with dashes (`-`) and alphanumerics between. The -prefix and slash are optional. If specified, the prefix must be a DNS -subdomain (a series of DNS labels separated by dots (`.`), not longer than 253 -characters in total. - -If the prefix is omitted, the label key is presumed to be private to the user. -System components which use labels must specify a prefix. The `kubernetes.io` -prefix is reserved for kubernetes core components. - -## Motivation - -Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users. Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings. - -## Label selectors - -Label selectors permit very simple filtering by label keys and values. 
The simplicity of label selectors is deliberate. It is intended to facilitate transparency for humans, easy set overlap detection, efficient indexing, and reverse-indexing (i.e., finding all label selectors matching an object's labels - https://github.com/GoogleCloudPlatform/kubernetes/issues/1348). - -Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well. - -We are in the process of extending the label selection specification (see [selector.go](/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms: -``` -key1 in (value11, value12, ...) -key1 not in (value11, value12, ...) -key1 exists -``` - -LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future. - -Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s: -- `service`: A [service](/docs/services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods. -- `replicationController`: A [replication controller](/docs/replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more. - -The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector. 
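The exact-match semantics described above can be sketched in a few lines of Go (an illustrative helper only, not the actual `selector.go` implementation; the function name and label values are invented for this example):

```go
package main

import "fmt"

// matchesSelector reports whether an object's labels satisfy an
// exact-match selector: the object must carry every key/value pair in
// the selector, but may carry additional labels as well.
func matchesSelector(selector, objectLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := objectLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	podLabels := map[string]string{
		"tier":        "frontend",
		"environment": "prod",
		"track":       "canary",
	}
	// A service selector matches despite the extra "track" label.
	fmt.Println(matchesSelector(map[string]string{"tier": "frontend", "environment": "prod"}, podLabels)) // true
	// A selector with a non-matching value does not.
	fmt.Println(matchesSelector(map[string]string{"tier": "backend"}, podLabels)) // false
}
```

This is the same membership test a `service` or `replicationController` conceptually applies to decide which pods it targets.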
- -For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common. - -In the future, label selectors will be used to identify other types of distributed service workers, such as worker pool members or peers in a distributed application. - -Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods of containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions. - -Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might target all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc. - -Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on. 
- -Pods (and other objects) may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions. - -Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc. - -Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users and for clients and tools to determine what sets they belong to (i.e., they are reversible). OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to. - -## Labels vs. annotations - -We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](/docs/annotations.md). 
-- cgit v1.2.3 From 59748dcb90e9ce05eca6608fbe8c3c32898edd87 Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Mon, 16 Mar 2015 13:20:03 +0100 Subject: Remove BoundPod structure --- security_context.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/security_context.md b/security_context.md index 7dc10e69..cd10202e 100644 --- a/security_context.md +++ b/security_context.md @@ -83,19 +83,19 @@ The Kubelet will have an interface that points to a `SecurityContextProvider`. T ```go type SecurityContextProvider interface { - // ModifyContainerConfig is called before the Docker createContainer call. - // The security context provider can make changes to the Config with which - // the container is created. - // An error is returned if it's not possible to secure the container as - // requested with a security context. - ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error + // ModifyContainerConfig is called before the Docker createContainer call. + // The security context provider can make changes to the Config with which + // the container is created. + // An error is returned if it's not possible to secure the container as + // requested with a security context. + ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) error // ModifyHostConfig is called before the Docker runContainer call. // The security context provider can make changes to the HostConfig, affecting // security options, whether the container is privileged, volume binds, etc. // An error is returned if it's not possible to secure the container as requested - // with a security context. - ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig) + // with a security context. 
+ ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) } ``` -- cgit v1.2.3 From 6a22c4b38d1a80440fa462f1d11bd5f24be087e4 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Mon, 9 Mar 2015 14:34:12 -0400 Subject: Update namespaces design --- namespaces.md | 386 +++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 264 insertions(+), 122 deletions(-) diff --git a/namespaces.md b/namespaces.md index 761daa1a..0e89bf56 100644 --- a/namespaces.md +++ b/namespaces.md @@ -1,193 +1,335 @@ -# Kubernetes Proposal - Namespaces +# Namespaces -**Related PR:** +## Abstract -| Topic | Link | -| ---- | ---- | -| Identifiers.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/1216 | -| Access.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/891 | -| Indexing | https://github.com/GoogleCloudPlatform/kubernetes/pull/1183 | -| Cluster Subdivision | https://github.com/GoogleCloudPlatform/kubernetes/issues/442 | +A Namespace is a mechanism to partition resources created by users into +a logically named group. -## Background +## Motivation -High level goals: +A single cluster should be able to satisfy the needs of multiple user communities. -* Enable an easy-to-use mechanism to logically scope Kubernetes resources -* Ensure extension resources to Kubernetes can share the same logical scope as core Kubernetes resources -* Ensure it aligns with access control proposal -* Ensure system has log n scale with increasing numbers of scopes +Each user community wants to be able to work in isolation from other communities. + +Each user community has its own: + +1. resources (pods, services, replication controllers, etc.) +2. policies (who can or cannot perform actions in their community) +3. constraints (this community is allowed this much quota, etc.) + +A cluster operator may create a Namespace for each unique user community. + +The Namespace provides a unique scope for: + +1. 
named resources (to avoid basic naming collisions) +2. delegated management authority to trusted users +3. ability to limit community resource consumption ## Use cases -Actors: +1. As a cluster operator, I want to support multiple user communities on a single cluster. +2. As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users + in those communities. +3. As a cluster operator, I want to limit the amount of resources each community can consume in order + to limit the impact to other communities using the cluster. +4. As a cluster user, I want to interact with resources that are pertinent to my user community in + isolation of what other user communities are doing on the cluster. + +## Design + +### Data Model + +A *Namespace* defines a logically named group for multiple *Kind*s of resources. + +``` +type Namespace struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + Spec NamespaceSpec `json:"spec,omitempty"` + Status NamespaceStatus `json:"status,omitempty"` +} +``` + +A *Namespace* name is a DNS compatible subdomain. + +A *Namespace* must exist prior to associating content with it. + +A *Namespace* must not be deleted if there is content associated with it. -1. k8s admin - administers a kubernetes cluster -2. k8s service - k8s daemon operates on behalf of another user (i.e. controller-manager) -2. k8s policy manager - enforces policies imposed on k8s cluster -3. k8s user - uses a kubernetes cluster to schedule pods +To associate a resource with a *Namespace* the following conditions must be satisfied: -User stories: +1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with the server +2. The resource's *TypeMeta.Namespace* field must have a value that references an existing *Namespace* -1. Ability to set immutable namespace to k8s resources -2. Ability to list k8s resource scoped to a namespace -3. 
Restrict a namespace identifier to a DNS-compatible string to support compound naming conventions -4. Ability for a k8s policy manager to enforce a k8s user's access to a set of namespaces -5. Ability to set/unset a default namespace for use by kubecfg client -6. Ability for a k8s service to monitor resource changes across namespaces -7. Ability for a k8s service to list resources across namespaces +The *Name* of a resource associated with a *Namespace* is unique to that *Kind* in that *Namespace*. -## Proposed Design +It is intended to be used in resource URLs; provided by clients at creation time, and encouraged to be +human friendly; intended to facilitate idempotent creation, space-uniqueness of singleton objects, +distinguish distinct entities, and reference particular entities across operations. -### Model Changes +### Authorization -Introduce a new attribute *Namespace* for each resource that must be scoped in a Kubernetes cluster. +A *Namespace* provides an authorization scope for accessing content associated with the *Namespace*. -A *Namespace* is a DNS compatible subdomain. +See [Authorization plugins](../authorization.md) + +### Limit Resource Consumption + +A *Namespace* provides a scope to limit resource consumption. + +A *LimitRange* defines min/max constraints on the amount of resources a single entity can consume in +a *Namespace*. + +See [Admission control: Limit Range](admission_control_limit_range.md) + +A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and allows cluster operators +to define *Hard* resource usage limits that a *Namespace* may consume. + +See [Admission control: Resource Quota](admission_control_resource_quota.md) + +### Finalizers + +Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* objects. 
``` -// TypeMeta is shared by all objects sent to, or returned from the client -type TypeMeta struct { - Kind string `json:"kind,omitempty"` - Uid string `json:"uid,omitempty"` - CreationTimestamp util.Time `json:"creationTimestamp,omitempty"` - SelfLink string `json:"selfLink,omitempty"` - ResourceVersion uint64 `json:"resourceVersion,omitempty"` - APIVersion string `json:"apiVersion,omitempty"` - Namespace string `json:"namespace,omitempty"` - Name string `json:"name,omitempty"` +type FinalizerName string + +// These are internal finalizers to Kubernetes, must be qualified name unless defined here +const ( + FinalizerKubernetes FinalizerName = "kubernetes" +) + +// NamespaceSpec describes the attributes on a Namespace +type NamespaceSpec struct { + // Finalizers is an opaque list of values that must be empty to permanently remove object from storage + Finalizers []FinalizerName } ``` -An identifier, *UID*, is unique across time and space intended to distinguish between historical occurences of similar entities. +A *FinalizerName* is a qualified name. -A *Name* is unique within a given *Namespace* at a particular time, used in resource URLs; provided by clients at creation time -and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish -distinct entities, and reference particular entities across operations. +The API Server enforces that a *Namespace* can only be deleted from storage if and only if +it's *Namespace.Spec.Finalizers* is empty. -As of this writing, the following resources MUST have a *Namespace* and *Name* +A *finalize* operation is the only mechanism to modify the *Namespace.Spec.Finalizers* field post creation. -* pod -* service -* replicationController -* endpoint +Each *Namespace* created has *kubernetes* as an item in its list of initial *Namespace.Spec.Finalizers* +set by default. -A *policy* MAY be associated with a *Namespace*. 
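The deletion gate these finalizers impose can be sketched as follows. This is a hypothetical standalone helper, not API server code; only the `FinalizerName` and `NamespaceSpec` shapes are taken from the snippet above:

```go
package main

import "fmt"

type FinalizerName string

const FinalizerKubernetes FinalizerName = "kubernetes"

type NamespaceSpec struct {
	Finalizers []FinalizerName
}

// finalize removes one finalizer name from the spec; this models the
// effect of a *finalize* operation executed by a controller that has
// purged all content it owns in the namespace.
func finalize(spec NamespaceSpec, name FinalizerName) NamespaceSpec {
	kept := make([]FinalizerName, 0, len(spec.Finalizers))
	for _, f := range spec.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	spec.Finalizers = kept
	return spec
}

// deletable models the API server rule: a Namespace may be removed from
// storage only once its finalizer list is empty.
func deletable(spec NamespaceSpec) bool {
	return len(spec.Finalizers) == 0
}

func main() {
	spec := NamespaceSpec{Finalizers: []FinalizerName{FinalizerKubernetes}}
	fmt.Println(deletable(spec)) // false
	spec = finalize(spec, FinalizerKubernetes)
	fmt.Println(deletable(spec)) // true
}
```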
+### Phases

-If a *policy* has an associated *Namespace*, the resource paths it enforces are scoped to a particular *Namespace*. +A *Namespace* may exist in the following phases.

-## k8s API server +``` +type NamespacePhase string +const( + NamespaceActive NamespacePhase = "Active" + NamespaceTerminating NamespacePhase = "Terminating" +) + +type NamespaceStatus struct { + ... + Phase NamespacePhase +} ```

-In support of namespace isolation, the Kubernetes API server will address resources by the following conventions: +A *Namespace* is in the **Active** phase if it does not have an *ObjectMeta.DeletionTimestamp*.

-The typical actors for the following requests are the k8s user or the k8s service. +A *Namespace* is in the **Terminating** phase if it has an *ObjectMeta.DeletionTimestamp*.

-| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/ns/{ns}/{resourceType}/ | Create instance of {resourceType} in namespace {ns} | -| GET | GET | /api/{version}/ns/{ns}/{resourceType}/{name} | Get instance of {resourceType} in namespace {ns} with {name} | -| UPDATE | PUT | /api/{version}/ns/{ns}/{resourceType}/{name} | Update instance of {resourceType} in namespace {ns} with {name} | -| DELETE | DELETE | /api/{version}/ns/{ns}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {ns} with {name} | -| LIST | GET | /api/{version}/ns/{ns}/{resourceType} | List instances of {resourceType} in namespace {ns} | -| WATCH | GET | /api/{version}/watch/ns/{ns}/{resourceType} | Watch for changes to a {resourceType} in namespace {ns} |

+**Active**

-The typical actor for the following requests are the k8s service or k8s admin as enforced by k8s Policy. +Upon creation, a *Namespace* enters the *Active* phase. This means that content may be associated with +a namespace, and all normal interactions with the namespace are allowed to occur in the cluster.
-| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces | -| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces | +If a DELETE request occurs for a *Namespace*, the *Namespace.ObjectMeta.DeletionTimestamp* is set +to the current server time. A *namespace controller* observes the change, and sets the *Namespace.Status.Phase* +to *Terminating*. -The legacy API patterns for k8s are an alias to interacting with the *default* namespace as follows. +**Terminating** -| Action | HTTP Verb | Path | Description | -| ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/{resourceType}/ | Create instance of {resourceType} in namespace *default* | -| GET | GET | /api/{version}/{resourceType}/{name} | Get instance of {resourceType} in namespace *default* | -| UPDATE | PUT | /api/{version}/{resourceType}/{name} | Update instance of {resourceType} in namespace *default* | -| DELETE | DELETE | /api/{version}/{resourceType}/{name} | Delete instance of {resourceType} in namespace *default* | +A *namespace controller* watches for *Namespace* objects that have a *Namespace.ObjectMeta.DeletionTimestamp* +value set in order to know when to initiate graceful termination of the *Namespace* associated content that +are known to the cluster. -The k8s API server verifies the *Namespace* on resource creation matches the *{ns}* on the path. +The *namespace controller* enumerates each known resource type in that namespace and deletes it one by one. -The k8s API server will enable efficient mechanisms to filter model resources based on the *Namespace*. This may require -the creation of an index on *Namespace* that could support query by namespace with optional label selectors. 
+Admission control blocks creation of new resources in that namespace in order to prevent a race-condition +where the controller could believe all of a given resource type had been deleted from the namespace, +when in fact some other rogue client agent had created new objects. Using admission control in this +scenario allows each of registry implementations for the individual objects to not need to take into account Namespace life-cycle. -The k8s API server will associate a resource with a *Namespace* if not populated by the end-user based on the *Namespace* context -of the incoming request. If the *Namespace* of the resource being created, or updated does not match the *Namespace* on the request, -then the k8s API server will reject the request. +Once all objects known to the *namespace controller* have been deleted, the *namespace controller* +executes a *finalize* operation on the namespace that removes the *kubernetes* value from +the *Namespace.Spec.Finalizers* list. -TODO: Update to discuss k8s api server proxy patterns +If the *namespace controller* sees a *Namespace* whose *ObjectMeta.DeletionTimestamp* is set, and +whose *Namespace.Spec.Finalizers* list is empty, it will signal the server to permanently remove +the *Namespace* from storage by sending a final DELETE action to the API server. -## k8s storage +### REST API -A namespace provides a unique identifier space and therefore must be in the storage path of a resource. +To interact with the Namespace API: -In etcd, we want to continue to still support efficient WATCH across namespaces. 
+| Action | HTTP Verb | Path | Description | +| ------ | --------- | ---- | ----------- | +| CREATE | POST | /api/{version}/namespaces | Create a namespace | +| LIST | GET | /api/{version}/namespaces | List all namespaces | +| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} | +| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} | +| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} | +| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces | -Resources that persist content in etcd will have storage paths as follows: +This specification reserves the name *finalize* as a sub-resource to namespace. -/registry/{resourceType}/{resource.Namespace}/{resource.Name} +As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*. -This enables k8s service to WATCH /registry/{resourceType} for changes across namespace of a particular {resourceType}. 
+To interact with content associated with a Namespace: -Upon scheduling a pod to a particular host, the pod's namespace must be in the key path as follows: +| Action | HTTP Verb | Path | Description | +| ---- | ---- | ---- | ---- | +| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} | +| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} | +| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} | +| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} | +| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} | +| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} | +| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces | +| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces | -/host/{host}/pod/{pod.Namespace}/{pod.Name} +The API server verifies the *Namespace* on resource creation matches the *{namespace}* on the path. -## k8s Authorization service +The API server will associate a resource with a *Namespace* if not populated by the end-user based on the *Namespace* context +of the incoming request. If the *Namespace* of the resource being created, or updated does not match the *Namespace* on the request, +then the API server will reject the request. -This design assumes the existence of an authorization service that filters incoming requests to the k8s API Server in order -to enforce user authorization to a particular k8s resource. 
It performs this action by associating the *subject* of a request -with a *policy* to an associated HTTP path and verb. This design encodes the *namespace* in the resource path in order to enable -external policy servers to function by resource path alone. If a request is made by an identity that is not allowed by -policy to the resource, the request is terminated. Otherwise, it is forwarded to the apiserver. +### Storage

-## k8s controller-manager +A namespace provides a unique identifier space and therefore must be in the storage path of a resource.

-The controller-manager will provision pods in the same namespace as the associated replicationController. +In etcd, we want to continue to support efficient WATCH across namespaces.

-## k8s Kubelet +Resources that persist content in etcd will have storage paths as follows:

-There is no major change to the kubelet introduced by this proposal. +/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}

-### kubecfg client +This enables consumers to WATCH /registry/{resourceType} for changes across namespaces for a particular {resourceType}.

-kubecfg supports following: +### Kubelet

-``` -kubecfg [OPTIONS] ns {namespace} -``` +The kubelet will register pods it sources from a file or HTTP source with a namespace associated with the +*cluster-id*.

-To set a namespace to use across multiple operations: +### Example: OpenShift Origin managing a Kubernetes Namespace

-``` -$ kubecfg ns ns1 -``` +In this example, we demonstrate how the design allows for agents built on top of +Kubernetes that manage their own set of resource types associated with a *Namespace* +to take part in Namespace termination.
-To view the current namespace:

+OpenShift creates a Namespace in Kubernetes

```
-$ kubecfg ns
-Using namespace ns1
+{
+  "apiVersion": "v1beta3",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development"
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin", "kubernetes"]
+  },
+  "status": {
+    "phase": "Active"
+  },
+  "labels": {
+    "name": "development"
+  }
+}
```

-To reset to the default namespace:

+OpenShift then goes and creates a set of resources (pods, services, etc.) associated
+with the "development" namespace. It also creates its own set of resources in its
+own storage associated with the "development" namespace, unknown to Kubernetes.
+
+The user deletes the Namespace in Kubernetes, and the Namespace now has the following state:

```
-$ kubecfg ns default
+{
+  "apiVersion": "v1beta3",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "..."
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin", "kubernetes"]
+  },
+  "status": {
+    "phase": "Terminating"
+  },
+  "labels": {
+    "name": "development"
+  }
+}
```

-In addition, each kubecfg request may explicitly specify a namespace for the operation via the following OPTION

+The Kubernetes *namespace controller* observes the namespace has a *deletionTimestamp*
+and begins to terminate all of the content in the namespace that it knows about. Upon
+success, it executes a *finalize* action that modifies the *Namespace* by
+removing *kubernetes* from the list of finalizers:

---ns

+```
+{
+  "apiVersion": "v1beta3",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "..."
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin"]
+  },
+  "status": {
+    "phase": "Terminating"
+  },
+  "labels": {
+    "name": "development"
+  }
+}
+```

-When loading resource files specified by the -c OPTION, the kubecfg client will ensure the namespace is set in the
-message body to match the client specified default.
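The *finalize* action described above amounts to removing one token from the namespace's finalizer list; once the list is empty and a *deletionTimestamp* is set, the namespace itself may be deleted from storage. A sketch with illustrative types (not the real API objects):

```go
package main

import "fmt"

// Namespace is a simplified stand-in for the API object; field names and
// types here are illustrative only.
type Namespace struct {
	Name              string
	DeletionTimestamp string // empty means "not being deleted"
	Finalizers        []string
}

// finalize models the *finalize* action: an agent removes its own token
// from the namespace's finalizer list once it has purged its content.
func finalize(ns *Namespace, token string) {
	kept := ns.Finalizers[:0]
	for _, f := range ns.Finalizers {
		if f != token {
			kept = append(kept, f)
		}
	}
	ns.Finalizers = kept
}

// readyForDeletion reports whether the namespace controller may issue the
// final DELETE against storage.
func readyForDeletion(ns *Namespace) bool {
	return ns.DeletionTimestamp != "" && len(ns.Finalizers) == 0
}

func main() {
	ns := &Namespace{
		Name:              "development",
		DeletionTimestamp: "...",
		Finalizers:        []string{"openshift.com/origin", "kubernetes"},
	}
	finalize(ns, "kubernetes") // the Kubernetes namespace controller finishes
	fmt.Println(ns.Finalizers, readyForDeletion(ns)) // [openshift.com/origin] false
	finalize(ns, "openshift.com/origin")             // OpenShift's controller finishes
	fmt.Println(ns.Finalizers, readyForDeletion(ns)) // [] true
}
```

Each agent only ever removes its own token, so the namespace cannot be deleted until every participating system has confirmed its cleanup.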
+OpenShift Origin has its own *namespace controller* that is observing cluster state, and
+it observes that the same namespace has a *deletionTimestamp* assigned to it. It too will
+purge the resources it manages that are associated with that namespace from its own storage.
+Upon completion, it executes a *finalize* action and removes the reference to "openshift.com/origin"
+from the list of finalizers.

-If no default namespace is applied, the client will assume the following default namespace:

+This results in the following state:

-* default

+```
+{
+  "apiVersion": "v1beta3",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "..."
+  },
+  "spec": {
+    "finalizers": []
+  },
+  "status": {
+    "phase": "Terminating"
+  },
+  "labels": {
+    "name": "development"
+  }
+}
+```

-The kubecfg client would store default namespace information in the same manner it caches authentication information today
-as a file on user's file system.

+At this point, the Kubernetes *namespace controller* in its sync loop will see that the namespace
+has a deletion timestamp and that its list of finalizers is empty. As a result, it knows all
+content associated with that namespace has been purged. It performs a final DELETE action
+to remove that Namespace from storage.
+
+At this point, all content associated with that Namespace, and the Namespace itself, are gone.
\ No newline at end of file
--
cgit v1.2.3

From 1569ae19e6a36a99ef7a70a1b9f1d937deecfee7 Mon Sep 17 00:00:00 2001
From: Maciej Szulik
Date: Tue, 24 Mar 2015 12:01:41 +0100
Subject: Fixed markdown

---
 service_accounts.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/service_accounts.md b/service_accounts.md
index 5d86f244..a3a1bb49 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -1,6 +1,6 @@
 #Service Accounts

-## Motivation
+## Motivation

Processes in Pods may need to call the Kubernetes API.
For example: - scheduler @@ -20,7 +20,7 @@ They also may interact with services other than the Kubernetes API, such as: ## Design Overview A service account binds together several things: - a *name*, understood by users, and perhaps by peripheral systems, for an identity - - a *principal* that can be authenticated and (authorized)[../authorization.md] + - a *principal* that can be authenticated and [authorized](../authorization.md) - a [security context](./security_contexts.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other capabilities and controls on interaction with the file system and OS. - a set of [secrets](./secrets.md), which a container may use to @@ -60,7 +60,7 @@ This includes a human running `kubectl` on her desktop and a container in a Pod There is already a notion of a username in kubernetes, which is populated into a request context after authentication. However, there is no API object representing a user. While this may evolve, it is expected that in mature installations, -the canonical storage of user identifiers will be handled by a system external to kubernetes. +the canonical storage of user identifiers will be handled by a system external to kubernetes. Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or may be qualified to allow for federated identity ( @@ -84,7 +84,7 @@ The distinction is useful for a number of reasons: - A Human typically keeps credentials on a machine that is not part of the cluster and so not subject to automatic management. A VM with a role/service-account can have its credentials automatically managed. - the identity of a Pod cannot in general be mapped to a single human. - - If policy allows, it may be created by one human, and then updated by another, and another, until its behavior cannot be attributed to a single human. 
+ - If policy allows, it may be created by one human, and then updated by another, and another, until its behavior cannot be attributed to a single human. **TODO**: consider getting rid of separate serviceAccount object and just rolling its parts into the SecurityContext or Pod Object. @@ -106,7 +106,7 @@ might have some types that do not do anything on apiserver but just get pushed t ### Pods The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If this is unset, then a default value is chosen. If it is set, then the corresponding value of `Pods.Spec.SecurityContext` is set by the -Service Account Finalizer (see below). +Service Account Finalizer (see below). TBD: how policy limits which users can make pods with which service accounts. @@ -122,7 +122,7 @@ Service Account Finalizer is one place where this can happen (see below). ### Kubelet The kubelet will treat as "not ready to run" (needing a finalizer to act on it) any Pod which has an empty -SecurityContext. +SecurityContext. The kubelet will set a default, restrictive, security context for any pods created from non-Apiserver config sources (http, file). @@ -141,7 +141,7 @@ like this: **TODO**: example of pod with explicit refs. Another way is with the *Service Account Finalizer*, a plugin process which is optional, and which handles -business logic around service accounts. +business logic around service accounts. The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount definitions. -- cgit v1.2.3 From 337cdac032efaa215feb2d37215d34b1b4bc77ca Mon Sep 17 00:00:00 2001 From: Wojciech Tyczynski Date: Tue, 24 Mar 2015 13:00:26 +0100 Subject: Change "/ns" to "/namespaces" in few remaining places. 
--- persistent-storage.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index 5b84ddd2..5907e11d 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -65,12 +65,12 @@ Users attach their claim to their pod using a new ```PersistentVolumeClaimVolume | Action | HTTP Verb | Path | Description | | ---- | ---- | ---- | ---- | -| CREATE | POST | /api/{version}/ns/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} | -| GET | GET | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} | -| UPDATE | PUT | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} | -| DELETE | DELETE | /api/{version}/ns/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} | -| LIST | GET | /api/{version}/ns/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} | -| WATCH | GET | /api/{version}/watch/ns/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} | +| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} | +| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} | +| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} | +| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} | +| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} | +| WATCH | GET | 
/api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} | -- cgit v1.2.3 From 6e8f790f1c1def2f1cbce19ae29027008cf38b91 Mon Sep 17 00:00:00 2001 From: Mark Maglana Date: Wed, 25 Mar 2015 14:54:16 -0700 Subject: Fix confusing use of "comprise" The word "comprise" means "be composed of" or "contain" so "applications comprised of multiple containers" would mean "applications composed of of multiple containers" or "applications contained of multiple containers" which is confusing. I understand that this is nitpicking and that "comprise" has a new meaning which is the opposite of its original definition just like how "literally" now means "figuratively" to some people. However, I believe that clarity is of utmost importance in technical documentation which is why I'm proposing this change. --- access.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/access.md b/access.md index 8a2f1edd..9de4d6c8 100644 --- a/access.md +++ b/access.md @@ -15,7 +15,7 @@ Each of these can act as normal users or attackers. - External Users: People who are accessing applications running on K8s (e.g. a web site served by webserver running in a container on K8s), but who do not have K8s API access. - K8s Users : People who access the K8s API (e.g. create K8s API objects like Pods) - K8s Project Admins: People who manage access for some K8s Users - - K8s Cluster Admins: People who control the machines, networks, or binaries that comprise a K8s cluster. + - K8s Cluster Admins: People who control the machines, networks, or binaries that make up a K8s cluster. - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. 
### Threats -- cgit v1.2.3 From 2108ead7d3ed6cdcea54aa73e6fb913010d7849c Mon Sep 17 00:00:00 2001 From: Tamer Tas Date: Wed, 1 Apr 2015 00:56:20 +0300 Subject: Fix typo in Secrets --- secrets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/secrets.md b/secrets.md index d47d6092..3c61de68 100644 --- a/secrets.md +++ b/secrets.md @@ -277,7 +277,7 @@ type Secret struct { // representing the arbitrary (possibly non-string) data value here. Data map[string][]byte `json:"data,omitempty"` - // Used to facilitate programatic handling of secret data. + // Used to facilitate programmatic handling of secret data. Type SecretType `json:"type,omitempty"` } -- cgit v1.2.3 From f08e73cb56a68974b3be06c97bce7ec8dab1c786 Mon Sep 17 00:00:00 2001 From: Tamer Tas Date: Wed, 1 Apr 2015 01:18:49 +0300 Subject: Fix typo in Secrets design document --- secrets.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/secrets.md b/secrets.md index 3c61de68..965f6e90 100644 --- a/secrets.md +++ b/secrets.md @@ -21,7 +21,7 @@ Goals of this design: ## Constraints and Assumptions * This design does not prescribe a method for storing secrets; storage of secrets should be - pluggable to accomodate different use-cases + pluggable to accommodate different use-cases * Encryption of secret data and node security are orthogonal concerns * It is assumed that node and master are secure and that compromising their security could also compromise secrets: @@ -375,7 +375,7 @@ a tmpfs file system of that size to store secret data. Rough accounting of spec For use-cases where the Kubelet's behavior is affected by the secrets associated with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example, if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the Kubelet will need to be changed -to accomodate this. Subsequent proposals can address this on a type-by-type basis. +to accommodate this. 
Subsequent proposals can address this on a type-by-type basis. ## Examples -- cgit v1.2.3 From 149d7ab358aa8c6f190e865daf0cdb846de8b2d0 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Wed, 1 Apr 2015 16:40:27 -0400 Subject: Update design doc for limit range change --- admission_control_limit_range.md | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index e3a56c87..3f2ccd7b 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -25,6 +25,8 @@ type LimitRangeItem struct { Max ResourceList `json:"max,omitempty"` // Min usage constraints on this kind by resource name Min ResourceList `json:"min,omitempty"` + // Default usage constraints on this kind by resource name + Default ResourceList `json:"default,omitempty"` } // LimitRangeSpec defines a min/max usage limit for resources that match on kind @@ -74,6 +76,14 @@ The following min/max limits are imposed: | cpu | Min/Max amount of cpu per pod | | memory | Min/Max amount of memory per pod | +If a resource specifies a default value, it may get applied on the incoming resource. For example, if a default +value is provided for container cpu, it is set on the incoming container if and only if the incoming container +does not specify a resource requirements limit field. + +If a resource specifies a min value, it may get applied on the incoming resource. For example, if a min +value is provided for container cpu, it is set on the incoming container if and only if the incoming container does +not specify a resource requirements requests field. + If the incoming object would cause a violation of the enumerated constraints, the request is denied with a set of messages explaining what constraints were the source of the denial. 
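The defaulting rules described above — a Default fills a missing limit, a Min fills a missing request, and explicit user values always win — can be sketched as follows (plain maps stand in for the real ResourceList type; names are illustrative, not the actual admission plugin code):

```go
package main

import "fmt"

// applyLimitRange applies a LimitRange's Default and Min values to an
// incoming container's limits and requests, but only for resources the
// user did not specify explicitly.
func applyLimitRange(def, min, limits, requests map[string]string) {
	for resource, v := range def {
		if _, ok := limits[resource]; !ok {
			limits[resource] = v // default limit only if unset
		}
	}
	for resource, v := range min {
		if _, ok := requests[resource]; !ok {
			requests[resource] = v // default request only if unset
		}
	}
}

func main() {
	def := map[string]string{"cpu": "250m", "memory": "1Mi"}
	min := map[string]string{"cpu": "250m"}

	limits := map[string]string{"memory": "2Mi"} // user set memory explicitly
	requests := map[string]string{}

	applyLimitRange(def, min, limits, requests)
	fmt.Println(limits["cpu"], limits["memory"], requests["cpu"]) // 250m 2Mi 250m
}
```

After defaulting, the admission controller would still validate the (now fully populated) values against Min/Max and deny the request on any violation.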
@@ -105,12 +115,12 @@ NAME limits $ kubectl describe limits limits Name: limits -Type Resource Min Max ----- -------- --- --- -Pod memory 1Mi 1Gi -Pod cpu 250m 2 -Container memory 1Mi 1Gi -Container cpu 250m 2 +Type Resource Min Max Default +---- -------- --- --- --- +Pod memory 1Mi 1Gi - +Pod cpu 250m 2 - +Container memory 1Mi 1Gi 1Mi +Container cpu 250m 250m 250m ``` ## Future Enhancements: Define limits for a particular pod or container. -- cgit v1.2.3 From 58542e4f17c95567849639312ac58e822f251853 Mon Sep 17 00:00:00 2001 From: Kris Rousey Date: Wed, 1 Apr 2015 14:49:33 -0700 Subject: Changing the case of API to be consistent with surrounding uses. --- identifiers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/identifiers.md b/identifiers.md index 260c237a..d2e5d5c7 100644 --- a/identifiers.md +++ b/identifiers.md @@ -39,7 +39,7 @@ Name 1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency. * Examples: "guestbook.user", "backend-x4eb1" -2. When an object is created via an api, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random). +2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? 
format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random). * Example: "api.k8s.example.com" 3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time. -- cgit v1.2.3 From 95d56057379a06eb74fe16a4992729eaa844d38f Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 16 Apr 2015 09:11:47 -0700 Subject: Stop using dockerfile/* images As per http://blog.docker.com/2015/03/updates-available-to-popular-repos-update-your-images/ docker has stopped answering dockerfile/redis and dockerfile/nginx. Fix all users in our tree. Sadly this means a lot of published examples are now broken. --- persistent-storage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/persistent-storage.md b/persistent-storage.md index 5907e11d..45ab8d42 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -184,7 +184,7 @@ metadata: name: mypod spec: containers: - - image: dockerfile/nginx + - image: nginx name: myfrontend volumeMounts: - mountPath: "/var/www/html" -- cgit v1.2.3 From cabc8404af6b7f84cefb286d161d3ea6204eb7b0 Mon Sep 17 00:00:00 2001 From: Brian Grant Date: Thu, 16 Apr 2015 21:41:07 +0000 Subject: Update docs. Add design principles. Fixes #6133. Fixes #4182. # *** ERROR: *** docs are out of sync between cli and markdown # run hack/run-gendocs.sh > docs/kubectl.md to regenerate # # Your commit will be aborted unless you regenerate docs. 
COMMIT_BLOCKED_ON_GENDOCS --- README.md | 17 +++++++++++++++++ architecture.md | 44 ++++++++++++++++++++++++++++++++++++++++++++ principles.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 116 insertions(+) create mode 100644 README.md create mode 100644 architecture.md create mode 100644 principles.md diff --git a/README.md b/README.md new file mode 100644 index 00000000..cda831a4 --- /dev/null +++ b/README.md @@ -0,0 +1,17 @@ +# Kubernetes Design Overview + +Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. + +Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration. + +Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways. + +Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. 
The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary. + +Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts. + +A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the availability doc](../availability.md) and [cluster federation proposal](../proposals/federation.md) for more details). + +Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner. 
+ +For more about the Kubernetes architecture, see [architecture](architecture.md). diff --git a/architecture.md b/architecture.md new file mode 100644 index 00000000..06a0a0ef --- /dev/null +++ b/architecture.md @@ -0,0 +1,44 @@ +# Kubernetes architecture + +A running Kubernetes cluster contains node agents (kubelet) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making kubelet itself (all our components, really) run within containers, and making the scheduler 100% pluggable. + +![Architecture Diagram](../architecture.png?raw=true "Architecture overview") + +## The Kubernetes Node + +When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that compose the cluster-level control plane. + +The Kubernetes node has the services necessary to run application containers and be managed from the master systems. + +Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers. + +### Kubelet +The **Kubelet** manages [pods](../pods.md) and their containers, their images, their volumes, etc. + +### Kube-Proxy + +Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../docs/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. + +Service endpoints are currently found via [DNS](../dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy. 
+ +## The Kubernetes Control Plane + +The Kubernetes control plane is split into a set of components. Currently they all run on a single _master_ node, but that is expected to change soon in order to support high-availability clusters. These components work together to provide a unified view of the cluster. + +### etcd + +All persistent master state is stored in an instance of `etcd`. This provides a great way to store configuration data reliably. With `watch` support, coordinating components can be notified very quickly of changes. + +### Kubernetes API Server + +The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a CRUD-y server, with most/all business logic implemented in separate components or in plug-ins. It mainly processes REST operations, validates them, and updates the corresponding objects in `etcd` (and eventually other stores). + +### Scheduler + +The scheduler binds unscheduled pods to nodes via the `/binding` API. The scheduler is pluggable, and we expect to support multiple cluster schedulers and even user-provided schedulers in the future. + +### Kubernetes Controller Manager Server + +All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable. + +The [`replicationController`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. diff --git a/principles.md b/principles.md new file mode 100644 index 00000000..499b540b --- /dev/null +++ b/principles.md @@ -0,0 +1,55 @@ +# Design Principles + +Principles to follow when extending Kubernetes. + +## API + +See also the [API conventions](../api-conventions.md). 
+ +* All APIs should be declarative. +* API objects should be complementary and composable, not opaque wrappers. +* The control plane should be transparent -- there are no hidden internal APIs. +* The cost of API operations should be proportional to the number of objects intentionally operated upon. Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior. +* Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation. +* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components, that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation. +* Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs). + +## Control logic + +* Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization. +* Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them. +* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation. 
+* Don't assume a component's decisions will not be overridden or rejected, nor for the component to always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions. +* Components should be self-healing. For example, if you must keep some state (e.g., cache) the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed etc, it will be soon fixed, ideally on timescales that are shorter than what will attract attention from humans. +* Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure. + +## Architecture + +* Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.). +* Compromising a single node shouldn't compromise the cluster. +* Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage). +* All components should keep all relevant state in memory all the time. The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients. +* Watch is preferred over polling. + +## Extensibility + +TODO: pluggability + +## Bootstrapping + +* [Self-hosting](https://github.com/GoogleCloudPlatform/kubernetes/issues/246) of all components is a goal. +* Minimize the number of dependencies, particularly those required for steady-state operation. +* Stratify the dependencies that remain via principled layering. +* Break any circular dependencies by converting hard dependencies to soft dependencies. 
+  * Also accept data from other components via another source, such as local files, which can be manually populated at bootstrap time and then continuously updated once those other components are available.
+  * State should be rediscoverable and/or reconstructable.
+  * Make it easy to run temporary, bootstrap instances of all components in order to create the runtime state needed to run the components in the steady state; use a lock (master election for distributed components, file lock for local components like Kubelet) to coordinate handoff. We call this technique "pivoting".
+  * Have a solution to restart dead components. For distributed components, replication works well. For local components such as Kubelet, a process manager or even a simple shell loop works.
+
+## Availability
+
+TODO
+
+## General principles
+
+* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
--
cgit v1.2.3

From 8e7529849a47e5dc5f7a7b92f501a70474150e89 Mon Sep 17 00:00:00 2001
From: Robert Bailey
Date: Mon, 20 Apr 2015 21:33:03 -0700
Subject: Remove old design file (that has been fully implemented).

---
 isolation_between_nodes_and_master.md | 48 -----------------------------------
 1 file changed, 48 deletions(-)
 delete mode 100644 isolation_between_nodes_and_master.md

diff --git a/isolation_between_nodes_and_master.md b/isolation_between_nodes_and_master.md
deleted file mode 100644
index a91927d8..00000000
--- a/isolation_between_nodes_and_master.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Design: Limit direct access to etcd from within Kubernetes
-
-All nodes have effective access of "root" on the entire Kubernetes cluster today because they have access to etcd, the central data store. The kubelet, the service proxy, and the nodes themselves have a connection to etcd that can be used to read or write any data in the system.
In a cluster with many hosts, any container or user that gains the ability to write to the network device that can reach etcd, on any host, also gains that access.
-
-* The Kubelet and Kube Proxy currently rely on an efficient "wait for changes over HTTP" interface to get their current state and avoid missing changes
-  * This interface is implemented by etcd as the "watch" operation on a given key containing useful data
-
-
-## Options:
-
-1. Do nothing
-2. Introduce an HTTP proxy that limits the ability of nodes to access etcd
-   1. Prevent writes of data from the kubelet
-   2. Prevent reading data not associated with the client responsibilities
-   3. Introduce a security token granting access
-3. Introduce an API on the apiserver that returns the data a node's Kubelet and Kube Proxy need
-   1. Remove the ability of nodes to access etcd via network configuration
-   2. Provide an alternate implementation for the event writing code in the Kubelet
-   3. Implement efficient "watch for changes over HTTP" to offer comparable function with etcd
-   4. Ensure that the apiserver can scale at or above the capacity of the etcd system.
-   5. Implement authorization scoping for the nodes that limits the data they can view
-4. Implement granular access control in etcd
-   1. Authenticate HTTP clients with client certificates, tokens, or BASIC auth and authorize them for read-only access
-   2. Allow read access of certain subpaths based on what the requestor's tokens are
-
-
-## Evaluation:
-
-Option 1 would be considered unacceptable for deployment in a multi-tenant or security conscious environment. It would be acceptable in a low security deployment where all software is trusted. It would be acceptable in proof of concept environments on a single machine.
-
-Option 2 would require implementing an HTTP proxy that, for 2-1, could block POST/PUT/DELETE requests (and potentially HTTP method tunneling parameters accepted by etcd). 
2-2 would be more complicated and would require filtering operations based on deep understanding of the etcd API *and* the underlying schema. It would be possible, but involve extra software. - -Option 3 would involve extending the existing apiserver to return pods associated with a given node over an HTTP "watch for changes" mechanism, which is already implemented. Proper security would involve checking that the caller is authorized to access that data - one imagines a per node token, key, or SSL certificate that could be used to authenticate and then authorize access to only the data belonging to that node. The current event publishing mechanism from the kubelet would also need to be replaced with a secure API endpoint or a change to a polling model. The apiserver would also need to be able to function in a horizontally scalable mode by changing or fixing the "operations" queue to work in a stateless, scalable model. In practice, the amount of traffic even a large Kubernetes deployment would drive towards an apiserver would be tens of requests per second (500 hosts, 1 request per host every minute) which is negligible if well implemented. Implementing this would also decouple the data store schema from the nodes, allowing a different data store technology to be added in the future without affecting existing nodes. This would also expose that data to other consumers for their own purposes (monitoring, implementing service discovery). - -Option 4 would involve extending etcd to [support access control](https://github.com/coreos/etcd/issues/91). Administrators would need to authorize nodes to connect to etcd, and expose network routability directly to etcd. The mechanism for handling this authentication and authorization would be different than the authorization used by Kubernetes controllers and API clients. It would not be possible to completely replace etcd as a data store without also implementing a new Kubelet config endpoint. 
-
-
-## Preferred solution:
-
-Implement the first parts of option 3 - an efficient watch API for the pod, service, and endpoints data for the Kubelet and Kube Proxy. Authorization and authentication are planned in the future - when a solution is available, implement a custom authorization scope that allows API access to be restricted to only the data about a single node or the service endpoint data.
-
-In general, option 4 is desirable in addition to option 3 as a mechanism to further secure the store to infrastructure components that must access it.
-
-
-## Caveats
-
-In all four options, compromise of a host will allow an attacker to imitate that host. For attack vectors that are reproducible from inside containers (privilege escalation), an attacker can distribute himself to other hosts by requesting new containers be spun up. In scenario 1, the cluster is totally compromised immediately. In 2-1, the attacker can view all information about the cluster including keys or authorization data defined with pods. In 2-2 and 3, the attacker must still distribute himself in order to get access to a large subset of information, and cannot see other data that is potentially located in etcd like side storage or system configuration. For attack vectors that are not exploits, but instead allow network access to etcd, an attacker in 2-2 has no ability to spread his influence, and is instead restricted to the subset of information on the host. For 3-5, they can do nothing they could not do already (request access to the nodes / services endpoint) because the token is not visible to them on the host. 
- -- cgit v1.2.3 From 747bb0de5dd4f1fdb64962aebeb89c283c79e81b Mon Sep 17 00:00:00 2001 From: caesarxuchao Date: Tue, 21 Apr 2015 21:22:28 -0700 Subject: fix the link to services.md --- architecture.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/architecture.md b/architecture.md index 06a0a0ef..3f021aaf 100644 --- a/architecture.md +++ b/architecture.md @@ -17,7 +17,7 @@ The **Kubelet** manages [pods](../pods.md) and their containers, their images, t ### Kube-Proxy -Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../docs/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. +Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. Service endpoints are currently found via [DNS](../dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy. -- cgit v1.2.3 From 2a117380292929bf29f390e90f8cbe57d3b8fd74 Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Fri, 10 Apr 2015 16:11:12 -0700 Subject: Suggest a simple rolling update. 
---
 simple-rolling-update.md | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 simple-rolling-update.md

diff --git a/simple-rolling-update.md b/simple-rolling-update.md
new file mode 100644
index 00000000..43b086ae
--- /dev/null
+++ b/simple-rolling-update.md
@@ -0,0 +1,92 @@
+## Simple rolling update
+This is a lightweight design document for simple rolling update in ```kubectl```
+
+Complete execution flow can be found [here](#execution-details).
+
+### Lightweight rollout
+Assume that we have a current replication controller named ```foo``` and it is running image ```image:v1```
+
+```kubectl rolling-update rc foo [foo-v2] --image=myimage:v2```
+
+If the user doesn't specify a name for the 'next' controller, then the 'next' controller is renamed to
+the name of the original controller.
+
+Obviously there is a race here, where if you kill the client between deleting foo and creating the new version of 'foo' you might be surprised about what is there, but I think that's ok.
+See [Recovery](#recovery) below
+
+If the user does specify a name for the 'next' controller, then the 'next' controller is retained with its existing name,
+and the old 'foo' controller is deleted. For the purposes of the rollout, we add a unique-ifying label ```kubernetes.io/deployment``` to both the ```foo``` and ```foo-next``` controllers.
+The value of that label is the hash of the complete JSON representation of the ```foo-next``` or ```foo``` controller. The name of this label can be overridden by the user with the ```--deployment-label-key``` flag.
+
+#### Recovery
+If a rollout fails or is terminated in the middle, it is important that the user be able to resume the rollout. 
+To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the ```kubernetes.io/``` annotation namespace:
+   * ```desired-replicas``` The desired number of replicas for this controller (either N or zero)
+   * ```update-partner``` A pointer to the replication controller resource that is the other half of this update (syntax `````` the namespace is assumed to be identical to the namespace of this replication controller.)
+
+Recovery is achieved by issuing the same command again:
+
+```
+kubectl rolling-update rc foo [foo-v2] --image=myimage:v2
+```
+
+Whenever the rolling update command executes, the kubectl client looks for replication controllers called ```foo``` and ```foo-next```; if they exist, an attempt is
+made to roll ```foo``` to ```foo-next```. If ```foo-next``` does not exist, then it is created, and the rollout is a new rollout. If ```foo``` doesn't exist, then
+it is assumed that the rollout is nearly completed, and ```foo-next``` is renamed to ```foo```. Details of the execution flow are given below.
+
+
+### Aborting a rollout
+Abort is assumed to want to reverse a rollout in progress.
+
+```kubectl rolling-update rc foo [foo-v2] --abort```
+
+This is really just semantic sugar for:
+
+```kubectl rolling-update rc foo-v2 foo```
+
+With the added detail that it moves the ```desired-replicas``` annotation from ```foo-v2``` to ```foo```
+
+
+### Execution Details
+
+For the purposes of this example, assume that we are rolling from ```foo``` to ```foo-next``` where the only change is an image update from `v1` to `v2`
+
+If the user doesn't specify a ```foo-next``` name, then it is discovered from the ```update-partner``` annotation on ```foo```. 
If that annotation doesn't exist,
+then ```foo-next``` is synthesized using the pattern ```-```
+
+#### Initialization
+ * If ```foo``` and ```foo-next``` do not exist:
+   * Exit, and indicate an error to the user, that the specified controller doesn't exist.
+   * Goto Rollout
+ * If ```foo``` exists, but ```foo-next``` does not:
+   * Create ```foo-next```, populate it with the ```v2``` image, set ```desired-replicas``` to ```foo.Spec.Replicas```
+   * Goto Rollout
+ * If ```foo-next``` exists, but ```foo``` does not:
+   * Assume that we are in the rename phase.
+   * Goto Rename
+ * If both ```foo``` and ```foo-next``` exist:
+   * Assume that we are in a partial rollout
+   * If ```foo-next``` is missing the ```desired-replicas``` annotation
+     * Populate the ```desired-replicas``` annotation to ```foo-next``` using the current size of ```foo```
+   * Goto Rollout
+
+#### Rollout
+ * While size of ```foo-next``` < ```desired-replicas``` annotation on ```foo-next```
+   * increase size of ```foo-next```
+   * if size of ```foo``` > 0
+     decrease size of ```foo```
+ * Goto Rename
+
+#### Rename
+ * delete ```foo```
+ * create ```foo``` that is identical to ```foo-next```
+ * delete ```foo-next```
+
+#### Abort
+ * If ```foo-next``` doesn't exist
+   * Exit and indicate to the user that they may want to simply do a new rollout with the old version
+ * If ```foo``` doesn't exist
+   * Exit and indicate not found to the user
+ * Otherwise, ```foo-next``` and ```foo``` both exist
+   * Set ```desired-replicas``` annotation on ```foo``` to match the annotation on ```foo-next```
+   * Goto Rollout with ```foo``` and ```foo-next``` trading places. 
-- cgit v1.2.3 From 812a7b9a63daccadcde843df358f7fb6ea9ccb76 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 23 Apr 2015 16:36:27 -0700 Subject: Make docs links go through docs.k8s.io --- networking.md | 2 +- secrets.md | 6 +++--- security.md | 12 ++++++------ 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/networking.md b/networking.md index d90f56b1..d664806f 100644 --- a/networking.md +++ b/networking.md @@ -83,7 +83,7 @@ We want to be able to assign IP addresses externally from Docker ([Docker issue In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to setup DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically. -[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_BAR) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. 
Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.
+[Service](http://docs.k8s.io/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](http://docs.k8s.io/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.
 
 We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). 
Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier. diff --git a/secrets.md b/secrets.md index 965f6e90..e07f271d 100644 --- a/secrets.md +++ b/secrets.md @@ -72,7 +72,7 @@ service would also consume the secrets associated with the MySQL service. ### Use-Case: Secrets associated with service accounts -[Service Accounts](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) are proposed as a +[Service Accounts](http://docs.k8s.io/design/service_accounts.md) are proposed as a mechanism to decouple capabilities and security contexts from individual human users. A `ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and @@ -236,7 +236,7 @@ memory overcommit on the node. #### Secret data on the node: isolation -Every pod will have a [security context](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/security_context.md). +Every pod will have a [security context](http://docs.k8s.io/design/security_context.md). Secret data on the node should be isolated according to the security context of the container. The Kubelet volume plugin API will be changed so that a volume plugin receives the security context of a volume along with the volume spec. This will allow volume plugins to implement setting the @@ -248,7 +248,7 @@ Several proposals / upstream patches are notable as background for this proposal 1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) 2. 
[Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. [Kubernetes service account proposal](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) +3. [Kubernetes service account proposal](http://docs.k8s.io/design/service_accounts.md) 4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) 5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) diff --git a/security.md b/security.md index 7bdca440..b446f66c 100644 --- a/security.md +++ b/security.md @@ -63,14 +63,14 @@ Automated process users fall into the following categories: A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to is limited by that *service account*. -1. The API should authenticate and authorize user actions [authn and authz](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/access.md) +1. The API should authenticate and authorize user actions [authn and authz](http://docs.k8s.io/design/access.md) 2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. 3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) -4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md) +4. 
When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](http://docs.k8s.io/design/service_accounts.md) 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption 2. If the user who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action -5. When container processes run on the cluster, they should run in a [security context](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. +5. When container processes run on the cluster, they should run in a [security context](http://docs.k8s.io/design/security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID 2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID 3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions @@ -79,7 +79,7 @@ A pod runs in a *security context* under a *service account* that is defined by 6. Developers may need to ensure their images work within higher security requirements specified by administrators 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. 8. 
When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes
-6. Developers should be able to define [secrets](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/secrets.md) that are automatically added to the containers when pods are run
+6. Developers should be able to define [secrets](http://docs.k8s.io/design/secrets.md) that are automatically added to the containers when pods are run
    1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples:
       1. An SSH private key for git cloning remote data
       2. A client certificate for accessing a remote system
@@ -93,11 +93,11 @@ A pod runs in a *security context* under a *service account* that is defined by
 
 ### Related design discussion
 
-* Authorization and authentication https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/access.md
+* Authorization and authentication http://docs.k8s.io/design/access.md
 * Secret distribution via files https://github.com/GoogleCloudPlatform/kubernetes/pull/2030
 * Docker secrets https://github.com/docker/docker/pull/6697
 * Docker vault https://github.com/docker/docker/issues/10310
-* Service Accounts: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/service_accounts.md
+* Service Accounts: http://docs.k8s.io/design/service_accounts.md
 * Secret volumes https://github.com/GoogleCloudPlatform/kubernetes/4126
 
 ## Specific Design Points
-- cgit v1.2.3


From 14760aef25239601cc8859694bd464d18404f55c Mon Sep 17 00:00:00 2001
From: markturansky
Date: Tue, 14 Apr 2015 17:14:39 -0400
Subject: PersistentVolumeClaimBinder implementation

---
 persistent-storage.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/persistent-storage.md b/persistent-storage.md
index 45ab8d42..fb53ad10 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ 
-12,7 +12,7 @@ A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to u One new system component: -`PersistentVolumeManager` is a singleton running in master that manages all PVs in the system, analogous to the node controller. The volume manager watches the API for newly created volumes to manage. The manager also watches for claims by users and binds them to available volumes. +`PersistentVolumeClaimBinder` is a singleton running in master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage. One new volume: @@ -32,7 +32,7 @@ Kubernetes makes no guarantees at runtime that the underlying storage exists or #### Describe available storage -Cluster administrators use the API to manage *PersistentVolumes*. The singleton PersistentVolumeManager watches the Kubernetes API for new volumes and adds them to its internal cache of volumes in the system. All persistent volumes are managed and made available by the volume manager. The manager also watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. +Cluster administrators use the API to manage *PersistentVolumes*. A custom store ```NewPersistentVolumeOrderedIndex``` will index volumes by access modes and sort by storage capacity. The ```PersistentVolumeClaimBinder``` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. PVs are system objects and, thus, have no namespace. @@ -151,7 +151,7 @@ myclaim-1 map[] pending #### Matching and binding - The ```PersistentVolumeManager``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. 
Requests can go unfulfilled if a suitable match is not found.
+ The ```PersistentVolumeClaimBinder``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. Requests can go unfulfilled if a suitable match is not found.
 
 ```
@@ -209,6 +209,6 @@ cluster/kubectl.sh delete pvc myclaim-1
 ```
 
-The ```PersistentVolumeManager``` will reconcile this by removing the claim reference from the PV and change the PVs status to 'Released'.
+The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'.
 
 Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled.
-- cgit v1.2.3


From 4e50c7273b2b096b968aeec4f6a3379d33ac5d8d Mon Sep 17 00:00:00 2001
From: Brendan Burns
Date: Thu, 30 Apr 2015 22:16:59 -0700
Subject: Add a central simple getting started guide with kubernetes guide.

Point several getting started guides at this doc. 
--- simple-rolling-update.md | 1 - 1 file changed, 1 deletion(-) diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 43b086ae..2d2bd826 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -57,7 +57,6 @@ then ```foo-next``` is synthesized using the pattern ```- Date: Tue, 5 May 2015 18:11:58 -0700 Subject: Fix event doc link --- event_compression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/event_compression.md b/event_compression.md index 99dda143..2523c859 100644 --- a/event_compression.md +++ b/event_compression.md @@ -74,5 +74,5 @@ This demonstrates what would have been 20 separate entries (indicating schedulin * Issue [#4073](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress duplicate events * PR [#4157](https://github.com/GoogleCloudPlatform/kubernetes/issues/4157): Add "Update Event" to Kubernetes API * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event - * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress recurring events in to a single event to optimize etcd storage + * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4306): Compress recurring events in to a single event to optimize etcd storage * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map -- cgit v1.2.3 From 24906a08ebbce50f3a4db4b3052f102bffa5bbe7 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Wed, 6 May 2015 10:04:39 -0400 Subject: Fix link to service accounts doc in security context doc --- security_context.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security_context.md b/security_context.md index cd10202e..62e203a5 100644 --- a/security_context.md +++ b/security_context.md @@ -32,7 +32,7 @@ Processes in pods will need to have consistent 
UID/GID/SELinux category labels i * The concept of a security context should not be tied to a particular security mechanism or platform (ie. SELinux, AppArmor) * Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for - [service accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). + [service accounts](./service_accounts.md). ## Use Cases -- cgit v1.2.3 From 950547c92a52860fd55c49631b8dcc08403ff5fd Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Thu, 30 Apr 2015 10:28:36 -0700 Subject: Add support for --rollback. --- simple-rolling-update.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 2d2bd826..c5667b44 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -38,7 +38,7 @@ it is assumed that the rollout is nearly completed, and ```foo-next``` is rename ### Aborting a rollout Abort is assumed to want to reverse a rollout in progress. 
-```kubectl rolling-update rc foo [foo-v2] --abort``` +```kubectl rolling-update rc foo [foo-v2] --rollback``` This is really just semantic sugar for: -- cgit v1.2.3 From 6ab2274c588cbbfaf2ff4a1f7551b2e7b2cc5ad8 Mon Sep 17 00:00:00 2001 From: Weiwei Jiang Date: Thu, 7 May 2015 16:10:50 +0800 Subject: Fix wrong link for security context --- service_accounts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/service_accounts.md b/service_accounts.md index a3a1bb49..5eaa0d99 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -21,7 +21,7 @@ They also may interact with services other than the Kubernetes API, such as: A service account binds together several things: - a *name*, understood by users, and perhaps by peripheral systems, for an identity - a *principal* that can be authenticated and [authorized](../authorization.md) - - a [security context](./security_contexts.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other + - a [security context](./security_context.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other capabilities and controls on interaction with the file system and OS. - a set of [secrets](./secrets.md), which a container may use to access various networked resources. -- cgit v1.2.3 From c7c03fe466e28070d7ecc5194f74a391fcf785a6 Mon Sep 17 00:00:00 2001 From: Paul Weil Date: Fri, 8 May 2015 16:38:28 -0400 Subject: bring doc up to date with actual api types --- security_context.md | 113 +++++++++++++++++++--------------------------------- 1 file changed, 40 insertions(+), 73 deletions(-) diff --git a/security_context.md b/security_context.md index 62e203a5..5f5376b0 100644 --- a/security_context.md +++ b/security_context.md @@ -65,8 +65,8 @@ be addressed with security contexts: ### Overview A *security context* consists of a set of constraints that determine how a container -is secured before getting created and run. 
It has a 1:1 correspondence to a -[service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). A *security context provider* is passed to the Kubelet so it can have a chance +is secured before getting created and run. A security context resides on the container and represents the runtime parameters that will +be used to create and run the container via container APIs. A *security context provider* is passed to the Kubelet so it can have a chance to mutate Docker API calls in order to apply the security context. It is recommended that this design be implemented in two phases: @@ -88,7 +88,7 @@ type SecurityContextProvider interface { // the container is created. // An error is returned if it's not possible to secure the container as // requested with a security context. - ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) error + ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) // ModifyHostConfig is called before the Docker runContainer call. // The security context provider can make changes to the HostConfig, affecting @@ -103,88 +103,55 @@ If the value of the SecurityContextProvider field on the Kubelet is nil, the kub ### Security Context -A security context has a 1:1 correspondence to a service account and it can be included as -part of the service account resource. Following is an example of an initial implementation: +A security context resides on the container and represents the runtime parameters that will +be used to create and run the container via container APIs. Following is an example of an initial implementation: ```go +type type Container struct { + ... other fields omitted ... 
+ // Optional: SecurityContext defines the security options the pod should be run with + SecurityContext *SecurityContext +} -// SecurityContext specifies the security constraints associated with a service account +// SecurityContext holds security configuration that will be applied to a container. SecurityContext +// contains duplication of some existing fields from the Container resource. These duplicate fields +// will be populated based on the Container configuration if they are not set. Defining them on +// both the Container AND the SecurityContext will result in an error. type SecurityContext struct { - // user is the uid to use when running the container - User int - - // AllowPrivileged indicates whether this context allows privileged mode containers - AllowPrivileged bool - - // AllowedVolumeTypes lists the types of volumes that a container can bind - AllowedVolumeTypes []string - - // AddCapabilities is the list of Linux kernel capabilities to add - AddCapabilities []string - - // RemoveCapabilities is the list of Linux kernel capabilities to remove - RemoveCapabilities []string - - // Isolation specifies the type of isolation required for containers - // in this security context - Isolation ContainerIsolationSpec -} + // Capabilities are the capabilities to add/drop when running the container + Capabilities *Capabilities -// ContainerIsolationSpec indicates intent for container isolation -type ContainerIsolationSpec struct { - // Type is the container isolation type (None, Private) - Type ContainerIsolationType - - // FUTURE: IDMapping specifies how users and groups from the host will be mapped - IDMapping *IDMapping -} + // Run the container in privileged mode + Privileged *bool -// ContainerIsolationType is the type of container isolation for a security context -type ContainerIsolationType string + // SELinuxOptions are the labels to be applied to the container + // and volumes + SELinuxOptions *SELinuxOptions -const ( - // ContainerIsolationNone 
means that no additional consraints are added to - // containers to isolate them from their host - ContainerIsolationNone ContainerIsolationType = "None" - - // ContainerIsolationPrivate means that containers are isolated in process - // and storage from their host and other containers. - ContainerIsolationPrivate ContainerIsolationType = "Private" -) - -// IDMapping specifies the requested user and group mappings for containers -// associated with a specific security context -type IDMapping struct { - // SharedUsers is the set of user ranges that must be unique to the entire cluster - SharedUsers []IDMappingRange - - // SharedGroups is the set of group ranges that must be unique to the entire cluster - SharedGroups []IDMappingRange + // RunAsUser is the UID to run the entrypoint of the container process. + RunAsUser *int64 +} - // PrivateUsers are mapped to users on the host node, but are not necessarily - // unique to the entire cluster - PrivateUsers []IDMappingRange +// SELinuxOptions are the labels to be applied to the container. +type SELinuxOptions struct { + // SELinux user label + User string - // PrivateGroups are mapped to groups on the host node, but are not necessarily - // unique to the entire cluster - PrivateGroups []IDMappingRange -} + // SELinux role label + Role string -// IDMappingRange specifies a mapping between container IDs and node IDs -type IDMappingRange struct { - // ContainerID is the starting container UID or GID - ContainerID int + // SELinux type label + Type string - // HostID is the starting host UID or GID - HostID int - - // Length is the length of the UID/GID range - Length int + // SELinux level label. + Level string } - ``` +### Admission +It is up to an admission plugin to determine if the security context is acceptable or not. At the +time of writing, the admission control plugin for security contexts will only allow a context that +has defined capabilities or privileged. 
Contexts that attempt to define a UID or SELinux options +will be denied by default. In the future the admission plugin will base this decision upon +configurable policies that reside within the [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). -#### Security Context Lifecycle - -The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use. -- cgit v1.2.3 From d3a429a76f5e1852ed437c30a3cdb1607e04f713 Mon Sep 17 00:00:00 2001 From: Jeff Lowdermilk Date: Thu, 14 May 2015 15:12:45 -0700 Subject: Add ga-beacon analytics to gendocs scripts hack/run-gendocs.sh puts ga-beacon analytics link into all md files, hack/verify-gendocs.sh verifies presence of link. --- README.md | 3 +++ access.md | 3 +++ admission_control.md | 3 +++ admission_control_limit_range.md | 3 +++ admission_control_resource_quota.md | 3 +++ architecture.md | 3 +++ clustering.md | 3 +++ clustering/README.md | 4 +++- command_execution_port_forwarding.md | 4 +++- event_compression.md | 3 +++ identifiers.md | 3 +++ namespaces.md | 4 +++- networking.md | 3 +++ persistent-storage.md | 3 +++ principles.md | 3 +++ secrets.md | 3 +++ security.md | 3 +++ security_context.md | 3 +++ service_accounts.md | 3 +++ simple-rolling-update.md | 3 +++ 20 files changed, 60 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index cda831a4..b70c5615 100644 --- a/README.md +++ b/README.md @@ -15,3 +15,6 @@ A single Kubernetes cluster is not intended to span multiple availability zones. 
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner. For more about the Kubernetes architecture, see [architecture](architecture.md). + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]() diff --git a/access.md b/access.md index 9de4d6c8..8fd09703 100644 --- a/access.md +++ b/access.md @@ -246,3 +246,6 @@ Initial implementation: Improvements: - API server does logging instead. - Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]() diff --git a/admission_control.md b/admission_control.md index 1e1c1e53..749e949e 100644 --- a/admission_control.md +++ b/admission_control.md @@ -77,3 +77,6 @@ will ensure the following: 6. 
Object is persisted If at any step, there is an error, the request is canceled. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]() diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 3f2ccd7b..daddb425 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -130,3 +130,6 @@ In the current proposal, the **LimitRangeItem** matches purely on **LimitRangeIt It is expected we will want to define limits for particular pods or containers by name/uid and label/field selector. To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index ebad0728..b2dfbe85 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -151,3 +151,6 @@ replicationcontrollers 5 20 resourcequotas 1 1 services 3 5 ``` + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() diff --git a/architecture.md b/architecture.md index 3f021aaf..c50cfe0d 100644 --- a/architecture.md +++ b/architecture.md @@ -42,3 +42,6 @@ The scheduler binds unscheduled pods to nodes via the `/binding` API. The schedu All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable. The [`replicationController`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. 
We eventually plan to port it to a generic plug-in mechanism, once one is implemented. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]() diff --git a/clustering.md b/clustering.md index f447ef10..d57d631d 100644 --- a/clustering.md +++ b/clustering.md @@ -58,3 +58,6 @@ This diagram dynamic clustering using the bootstrap API endpoint. That API endp This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above.This manual intervention could be replaced by code that can verify the signing requests via other means. ![Dynamic Sequence Diagram](clustering/dynamic.png) + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]() diff --git a/clustering/README.md b/clustering/README.md index 7e9d79c8..09d2c4e1 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -23,4 +23,6 @@ If you are using boot2docker and get warnings about clock skew (or if things are ## Automatically rebuild on file changes -If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`. \ No newline at end of file +If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`. + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]() diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 3b9aeec7..3e548d40 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -141,4 +141,6 @@ functionality. We need to make sure that users are not allowed to execute remote commands or do port forwarding to containers they aren't allowed to access. 
-Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts. \ No newline at end of file +Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts. + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]() diff --git a/event_compression.md b/event_compression.md index 2523c859..db0337f0 100644 --- a/event_compression.md +++ b/event_compression.md @@ -76,3 +76,6 @@ This demonstrates what would have been 20 separate entries (indicating schedulin * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4306): Compress recurring events in to a single event to optimize etcd storage * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]() diff --git a/identifiers.md b/identifiers.md index d2e5d5c7..b75577c2 100644 --- a/identifiers.md +++ b/identifiers.md @@ -88,3 +88,6 @@ objectives. 1. Each container is started up with enough metadata to distinguish the pod from whence it came. 2. Each attempt to run a container is assigned a UID (a string) that is unique across time. 1. This may correspond to Docker's container ID. 
+ + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]() diff --git a/namespaces.md b/namespaces.md index 0e89bf56..ade07e11 100644 --- a/namespaces.md +++ b/namespaces.md @@ -332,4 +332,6 @@ has a deletion timestamp and that its list of finalizers is empty. As a result, content associated from that namespace has been purged. It performs a final DELETE action to remove that Namespace from the storage. -At this point, all content associated with that Namespace, and the Namespace itself are gone. \ No newline at end of file +At this point, all content associated with that Namespace, and the Namespace itself are gone. + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]() diff --git a/networking.md b/networking.md index d664806f..f351629e 100644 --- a/networking.md +++ b/networking.md @@ -106,3 +106,6 @@ Another approach could be to create a new host interface alias for each pod, if ### IPv6 IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. 
:-) + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]() diff --git a/persistent-storage.md b/persistent-storage.md index fb53ad10..b52e6b71 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -212,3 +212,6 @@ cluster/kubectl.sh delete pvc myclaim-1 The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and change the PVs status to 'Released'. Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]() diff --git a/principles.md b/principles.md index 499b540b..cf8833a4 100644 --- a/principles.md +++ b/principles.md @@ -53,3 +53,6 @@ TODO ## General principles * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules) + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]() diff --git a/secrets.md b/secrets.md index e07f271d..119c673a 100644 --- a/secrets.md +++ b/secrets.md @@ -558,3 +558,6 @@ source. Both containers will have the following files present on their filesyst /etc/secret-volume/username /etc/secret-volume/password + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]() diff --git a/security.md b/security.md index b446f66c..4c446d10 100644 --- a/security.md +++ b/security.md @@ -115,3 +115,6 @@ Both the Kubelet and Kube Proxy need information related to their specific roles The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. 
Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pods attached volumes. The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() diff --git a/security_context.md b/security_context.md index 5f5376b0..fdacb173 100644 --- a/security_context.md +++ b/security_context.md @@ -155,3 +155,6 @@ has defined capabilities or privileged. Contexts that attempt to define a UID o will be denied by default. In the future the admission plugin will base this decision upon configurable policies that reside within the [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). 
+ + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]() diff --git a/service_accounts.md b/service_accounts.md index 5eaa0d99..9e6bc099 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -162,3 +162,6 @@ to services in the same namespace and read-write access to events in that namesp Finally, it may provide an interface to automate creation of new serviceAccounts. In that case, the user may want to GET serviceAccounts to see what has been created. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]() diff --git a/simple-rolling-update.md b/simple-rolling-update.md index c5667b44..fed1b84f 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -89,3 +89,6 @@ then ```foo-next``` is synthesized using the pattern ```- Date: Tue, 12 May 2015 19:48:29 -0400 Subject: Add variable expansion and design doc --- expansion.md | 407 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 407 insertions(+) create mode 100644 expansion.md diff --git a/expansion.md b/expansion.md new file mode 100644 index 00000000..00c32797 --- /dev/null +++ b/expansion.md @@ -0,0 +1,407 @@ +# Variable expansion in pod command, args, and env + +## Abstract + +A proposal for the expansion of environment variables using a simple `$(var)` syntax. + +## Motivation + +It is extremely common for users to need to compose environment variables or pass arguments to +their commands using the values of environment variables. Kubernetes should provide a facility for +the 80% cases in order to decrease coupling and the use of workarounds. + +## Goals + +1. Define the syntax format +2. Define the scoping and ordering of substitutions +3. Define the behavior for unmatched variables +4. 
Define the behavior for unexpected/malformed input
+
+## Constraints and Assumptions
+
+* This design should describe the simplest possible syntax to accomplish the use-cases
+* Expansion syntax will not support more complicated shell-like behaviors such as default values
+  (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
+
+## Use Cases
+
+1. As a user, I want to compose new environment variables for a container using a substitution
+   syntax to reference other variables in the container's environment and service environment
+   variables
+1. As a user, I want to substitute environment variables into a container's command
+1. As a user, I want to do the above without requiring the container's image to have a shell
+1. As a user, I want to be able to specify a default value for a service variable which may
+   not exist
+1. As a user, I want to see an event associated with the pod if an expansion fails (ie, references
+   variable names that cannot be expanded)
+
+### Use Case: Composition of environment variables
+
+Currently, containers are injected with docker-style environment variables for the services in
+their pod's namespace. There are several variables for each service, but users routinely need
+to compose URLs based on these variables because there is not a variable for the exact format
+they need. Users should be able to build new environment variables with the exact format they need.
+Eventually, it should also be possible to turn off the automatic injection of the docker-style
+variables into pods and let the users consume the exact information they need via the downward API
+and composition.
+
+#### Expanding expanded variables
+
+It should be possible to reference a variable which is itself the result of an expansion, if the
+referenced variable is declared in the container's environment prior to the one referencing it.
+Put another way -- a container's environment is expanded in order, and expanded variables are
+available to subsequent expansions.
+
+### Use Case: Variable expansion in command
+
+Users frequently need to pass the values of environment variables to a container's command.
+Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a
+shell in the container's command and have the shell perform the substitution, or to write a wrapper
+script that sets up the environment and runs the command. This has a number of drawbacks:
+
+1. Solutions that require a shell are unfriendly to images that do not contain a shell
+2. Wrapper scripts make it harder to use images as base images
+3. Wrapper scripts increase coupling to Kubernetes
+
+Users should be able to do the 80% case of variable expansion in command without writing a wrapper
+script or adding a shell invocation to their containers' commands.
+
+### Use Case: Images without shells
+
+The current workaround for variable expansion in a container's command requires the container's
+image to have a shell. This is unfriendly to images that do not contain a shell (`scratch` images,
+for example). Users should be able to perform the other use-cases in this design without regard to
+the content of their images.
+
+### Use Case: See an event for incomplete expansions
+
+It is possible that a container with incorrect variable values or command line may continue to run
+for a long period of time, and that the end-user would have no visual or obvious warning of the
+incorrect configuration. If the kubelet creates an event when an expansion references a variable
+that cannot be expanded, it will help users quickly detect problems with expansions.
+
+## Design Considerations
+
+### What features should be supported?
+
+In order to limit complexity, we want to provide the right amount of functionality so that the 80%
+cases can be realized and nothing more.
We felt that the essentials boiled down to:
+
+1. Ability to perform direct expansion of variables in a string
+2. Ability to specify default values via a prioritized mapping function but without support for
+   defaults as a syntax-level feature
+
+### What should the syntax be?
+
+The exact syntax for variable expansion has a large impact on how users perceive and relate to the
+feature. We considered implementing a very restrictive subset of the shell `${var}` syntax. This
+syntax is an attractive option on some level, because many people are familiar with it. However,
+this syntax also has a large number of lesser known features such as the ability to provide
+default values for unset variables, perform inline substitution, etc.
+
+In the interest of preventing conflation of the expansion feature in Kubernetes with the shell
+feature, we chose a different syntax similar to the one in Makefiles, `$(var)`. We also chose not
+to support the bare `$var` format, since it is not required to implement the required use-cases.
+
+Nested references, ie, variable expansion within variable names, are not supported.
+
+#### How should unmatched references be treated?
+
+Ideally, it should be extremely clear when a variable reference couldn't be expanded. We decided
+the best experience for unmatched variable references would be to have the entire reference, syntax
+included, show up in the output. As an example, if the reference `$(VARIABLE_NAME)` cannot be
+expanded, then `$(VARIABLE_NAME)` should be present in the output.
+
+#### Escaping the operator
+
+Although the `$(var)` syntax does overlap with the `$(command)` form of command substitution
+supported by many shells, because unexpanded variables are present verbatim in the output, we
+expect this will not present a problem to many users.
If there is a collision between a variable
+name and command substitution syntax, the syntax can be escaped with the form `$$(VARIABLE_NAME)`,
+which will evaluate to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
+
+## Design
+
+This design encompasses the variable expansion syntax and specification and the changes needed to
+incorporate the expansion feature into the container's environment and command.
+
+### Syntax and expansion mechanics
+
+This section describes the expansion syntax, evaluation of variable values, and how unexpected or
+malformed inputs are handled.
+
+#### Syntax
+
+The inputs to the expansion feature are:
+
+1. A UTF-8 string (the input string) which may contain variable references
+2. A function (the mapping function) that maps the name of a variable to the variable's value, of
+   type `func(string) string`
+
+Variable references in the input string are indicated exclusively with the syntax
+`$(VARIABLE_NAME)`. The syntax tokens are:
+
+- `$`: the operator
+- `(`: the reference opener
+- `)`: the reference closer
+
+The operator has no meaning unless accompanied by the reference opener and closer tokens. The
+operator can be escaped using `$$`. One literal `$` will be emitted for each `$$` in the input.
+
+The reference opener and closer characters have no meaning when not part of a variable reference.
+If a variable reference is malformed, viz: `$(VARIABLE_NAME` without a closing expression, the
+operator and expression opening characters are treated as ordinary characters without special
+meanings.
+
+#### Scope and ordering of substitutions
+
+The scope in which variable references are expanded is defined by the mapping function. Within the
+mapping function, any arbitrary strategy may be used to determine the value of a variable name.
+The most basic implementation of a mapping function is to use a `map[string]string` to look up the
+value of a variable.
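For concreteness, that single-map strategy can be sketched as follows (an illustrative snippet of ours, not code from the proposal; `mapLookupFunc` is a hypothetical name):

```go
package main

import "fmt"

// mapLookupFunc (a hypothetical helper of ours) builds the most basic
// mapping function: a straight lookup in a single map[string]string.
// Unknown names yield the empty string here; how unmatched references
// are surfaced to users is addressed separately by the spec.
func mapLookupFunc(vars map[string]string) func(string) string {
	return func(name string) string {
		return vars[name]
	}
}

func main() {
	mapping := mapLookupFunc(map[string]string{
		"VAR_A": "A",
		"VAR_B": "B",
	})
	fmt.Println(mapping("VAR_A")) // A
	fmt.Println(mapping("VAR_B")) // B
}
```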
+
+In order to support default values for variables like service variables presented by the kubelet,
+which may not be bound because the service that provides them does not yet exist, there should be a
+mapping function that uses a list of `map[string]string` like:
+
+```go
+func MakeMappingFunc(maps ...map[string]string) func(string) string {
+	return func(input string) string {
+		for _, context := range maps {
+			val, ok := context[input]
+			if ok {
+				return val
+			}
+		}
+
+		return ""
+	}
+}
+
+// elsewhere
+containerEnv := map[string]string{
+	"FOO":           "BAR",
+	"ZOO":           "ZAB",
+	"SERVICE2_HOST": "some-host",
+}
+
+serviceEnv := map[string]string{
+	"SERVICE_HOST": "another-host",
+	"SERVICE_PORT": "8083",
+}
+
+// single-map variation
+mapping := MakeMappingFunc(containerEnv)
+
+// default variables not found in serviceEnv
+mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
+```
+
+### Implementation changes
+
+The necessary changes to implement this functionality are:
+
+1. Add a new interface, `ObjectEventRecorder`, which is like the `EventRecorder` interface, but
+   scoped to a single object, and a function that returns an `ObjectEventRecorder` given an
+   `ObjectReference` and an `EventRecorder`
+2. Introduce `third_party/golang/expansion` package that provides:
+   1. An `Expand(string, func(string) string) string` function
+   2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string` function
+3. Add a new EnvVarSource for expansions and associated tests
+4. Make the kubelet expand environment correctly
+5. Make the kubelet expand command correctly
+
+#### Event Recording
+
+In order to provide an event when an expansion references undefined variables, the mapping function
+must be able to create an event.
In order to facilitate this, we should create a new interface in +the `api/client/record` package which is similar to `EventRecorder`, but scoped to a single object: + +```go +// ObjectEventRecorder knows how to record events about a single object. +type ObjectEventRecorder interface { + // Event constructs an event from the given information and puts it in the queue for sending. + // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will + // be used to automate handling of events, so imagine people writing switch statements to + // handle them. You want to make that easy. + // 'message' is intended to be human readable. + // + // The resulting event will be created in the same namespace as the reference object. + Event(reason, message string) + + // Eventf is just like Event, but with Sprintf for the message field. + Eventf(reason, messageFmt string, args ...interface{}) + + // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. + PastEventf(timestamp util.Time, reason, messageFmt string, args ...interface{}) +} +``` + +There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object` +and an `EventRecorder`: + +```go +type objectRecorderImpl struct { + object runtime.Object + recorder EventRecorder +} + +func (r *objectRecorderImpl) Event(reason, message string) { + r.recorder.Event(r.object, reason, message) +} + +func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder { + return &objectRecorderImpl{object, recorder} +} +``` + +#### Expansion package + +The expansion package should provide two methods: + +```go +// MappingFuncFor returns a mapping function for use with Expand that +// implements the expansion semantics defined in the expansion spec; it +// returns the input string wrapped in the expansion syntax if no mapping +// for the input is found. 
If no expansion is found for a key, an event +// is raised on the given recorder. +func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string { + // ... +} + +// Expand replaces variable references in the input string according to +// the expansion spec using the given mapping function to resolve the +// values of variables. +func Expand(input string, mapping func(string) string) string { + // ... +} +``` + +#### Expansion `EnvVarSource` + +In order to avoid changing the existing behavior of the `EnvVar.Value` field, there should be a new +`EnvVarSource` that represents a variable expansion that an env var's value should come from: + +```go +// EnvVarSource represents a source for the value of an EnvVar. +type EnvVarSource struct { + // Other fields omitted + + Expansion *EnvVarExpansion +} + +type EnvVarExpansion struct { + // The input string to be expanded + Expand string +} +``` + +#### Kubelet changes + +The Kubelet should change to: + +1. Correctly expand environment variables with `Expansion` sources +2. Correctly expand references in the Command and Args + +### Examples + +#### Inputs and outputs + +These examples are in the context of the mapping: + +| Name | Value | +|-------------|------------| +| `VAR_A` | `"A"` | +| `VAR_B` | `"B"` | +| `VAR_C` | `"C"` | +| `VAR_REF` | `$(VAR_A)` | +| `VAR_EMPTY` | `""` | + +No other variables are defined. 
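The expansion semantics specified above can be exercised with a small, self-contained sketch. This is an illustration only, not the proposed `third_party/golang/expansion` implementation; in particular the `ObjectEventRecorder` hook is deliberately omitted, and unresolved keys are simply wrapped back in the `$(...)` syntax:

```go
package main

import (
	"bytes"
	"fmt"
)

// MappingFuncFor returns a mapping function that searches the given maps in
// order and wraps unresolved keys back in the $(...) syntax.
func MappingFuncFor(context ...map[string]string) func(string) string {
	return func(input string) string {
		for _, vars := range context {
			if val, ok := vars[input]; ok {
				return val
			}
		}
		return "$(" + input + ")"
	}
}

// tryReadVariableName examines the text following a '$' operator. It returns
// the text read, whether that text is a variable name, and how far to advance.
func tryReadVariableName(input string) (string, bool, int) {
	switch input[0] {
	case '$':
		return "$", false, 1 // escaped operator: $$ -> $
	case '(':
		for i := 1; i < len(input); i++ {
			if input[i] == ')' {
				return input[1:i], true, i + 1
			}
		}
		return "$(", false, 1 // incomplete reference; pass it through
	default:
		// lone operator followed by a non-syntax rune; pass both through
		return "$" + string(input[0]), false, 1
	}
}

// Expand replaces $(var) references in input using the mapping function.
func Expand(input string, mapping func(string) string) string {
	var buf bytes.Buffer
	checkpoint := 0
	for cursor := 0; cursor < len(input); cursor++ {
		if input[cursor] == '$' && cursor+1 < len(input) {
			// flush everything up to the operator
			buf.WriteString(input[checkpoint:cursor])
			read, isVar, advance := tryReadVariableName(input[cursor+1:])
			if isVar {
				buf.WriteString(mapping(read))
			} else {
				buf.WriteString(read)
			}
			cursor += advance
			checkpoint = cursor + 1
		}
	}
	// append any trailing unprocessed bytes (including a lone trailing '$')
	return buf.String() + input[checkpoint:]
}

func main() {
	mapping := MappingFuncFor(map[string]string{"VAR_A": "A", "VAR_B": "B"})
	fmt.Println(Expand("$(VAR_A)_$(VAR_B)", mapping)) // A_B
	fmt.Println(Expand("$$(VAR_A)", mapping))         // $(VAR_A)
	fmt.Println(Expand("$(VAR_DNE)", mapping))        // $(VAR_DNE)
}
```

The escaping, pass-through, and unresolved-reference rows in the table below can all be checked against this sketch.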
+
| Input                          | Result                     |
|--------------------------------|----------------------------|
| `"$(VAR_A)"`                   | `"A"`                      |
| `"___$(VAR_B)___"`             | `"___B___"`                |
| `"___$(VAR_C)"`                | `"___C"`                   |
| `"$(VAR_A)-$(VAR_A)"`          | `"A-A"`                    |
| `"$(VAR_A)-1"`                 | `"A-1"`                    |
| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"`                  |
| `"$$(VAR_B)_$(VAR_A)"`         | `"$(VAR_B)_A"`             |
| `"$$(VAR_A)_$$(VAR_B)"`        | `"$(VAR_A)_$(VAR_B)"`      |
| `"f000-$$VAR_A"`               | `"f000-$VAR_A"`            |
| `"foo\\$(VAR_C)bar"`           | `"foo\Cbar"`               |
| `"foo\\\\$(VAR_C)bar"`         | `"foo\\Cbar"`              |
| `"foo\\\\\\\\$(VAR_A)bar"`     | `"foo\\\\Abar"`            |
| `"$(VAR_A$(VAR_B))"`           | `"$(VAR_A$(VAR_B))"`       |
| `"$(VAR_A$(VAR_B)"`            | `"$(VAR_A$(VAR_B)"`        |
| `"$(VAR_REF)"`                 | `"$(VAR_A)"`               |
| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
| `"foo$(VAR_EMPTY)bar"`         | `"foobar"`                 |
| `"foo$(VAR_Awhoops!"`          | `"foo$(VAR_Awhoops!"`      |
| `"f00__(VAR_A)__"`             | `"f00__(VAR_A)__"`         |
| `"$?_boo_$!"`                  | `"$?_boo_$!"`              |
| `"$VAR_A"`                     | `"$VAR_A"`                 |
| `"$(VAR_DNE)"`                 | `"$(VAR_DNE)"`             |
| `"$$$$$$(BIG_MONEY)"`          | `"$$$(BIG_MONEY)"`         |
| `"$$$$$$(VAR_A)"`              | `"$$$(VAR_A)"`             |
| `"$$$$$$$(GOOD_ODDS)"`         | `"$$$$(GOOD_ODDS)"`        |
| `"$$$$$$$(VAR_A)"`             | `"$$$A"`                   |
| `"$VAR_A)"`                    | `"$VAR_A)"`                |
| `"${VAR_A}"`                   | `"${VAR_A}"`               |
| `"$(VAR_B)_______$(A"`         | `"B_______$(A"`            |
| `"$(VAR_C)_______$("`          | `"C_______$("`             |
| `"$(VAR_A)foobarzab$"`         | `"Afoobarzab$"`            |
| `"foo-\\$(VAR_A"`              | `"foo-\$(VAR_A"`           |
| `"--$($($($($--"`              | `"--$($($($($--"`          |
| `"$($($($($--foo$("`           | `"$($($($($--foo$("`       |
| `"foo0--$($($($("`             | `"foo0--$($($($("`         |
| `"$(foo$$var)"`                | `"$(foo$$var)"`            |

#### In a pod: building a URL

Notice the `$(var)` syntax.
+ +```yaml +apiVersion: v1beta3 +kind: Pod +metadata: + name: expansion-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh", "-c", "env" ] + env: + - name: PUBLIC_URL + valueFrom: + expansion: + expand: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" + restartPolicy: Never +``` + +#### In a pod: building a URL using downward API + +```yaml +apiVersion: v1beta3 +kind: Pod +metadata: + name: expansion-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh", "-c", "env" ] + env: + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: "metadata.namespace" + - name: PUBLIC_URL + valueFrom: + expansion: + expand: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" + restartPolicy: Never +``` + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]() -- cgit v1.2.3 From aeb81528075456697287e110e1efd4558491de57 Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Tue, 12 May 2015 15:13:03 -0700 Subject: Updating namespaces to be DNS labels instead of DNS names. --- namespaces.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/namespaces.md b/namespaces.md index ade07e11..c4a1a90d 100644 --- a/namespaces.md +++ b/namespaces.md @@ -51,7 +51,7 @@ type Namespace struct { } ``` -A *Namespace* name is a DNS compatible subdomain. +A *Namespace* name is a DNS compatible label. A *Namespace* must exist prior to associating content with it. 
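For context on the change above: a DNS label is a single name component — at most 63 characters of lowercase alphanumerics and `-`, beginning and ending with an alphanumeric — whereas a DNS subdomain may chain several labels with dots. A rough sketch of a label check (illustrative only; `IsDNSLabel` is a hypothetical helper, not the actual Kubernetes validation code):

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches a single RFC 1123 DNS label: lowercase alphanumerics
// and '-', starting and ending with an alphanumeric character.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// IsDNSLabel reports whether name could serve as a Namespace name under
// the rule above (a single DNS label, not a dotted subdomain).
func IsDNSLabel(name string) bool {
	return len(name) <= 63 && dnsLabel.MatchString(name)
}

func main() {
	for _, name := range []string{"development", "Development", "my-ns", "-bad", "a.b"} {
		fmt.Printf("%q: %v\n", name, IsDNSLabel(name))
	}
}
```

Under these assumptions `development` and `my-ns` pass, while uppercase names, names with leading hyphens, and dotted subdomains do not.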
-- cgit v1.2.3 From 93f791e943a103efca378ca82fcfca1cada7f3e7 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Wed, 20 May 2015 17:17:01 -0700 Subject: in docs, update replicationController to replicationcontroller --- architecture.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/architecture.md b/architecture.md index c50cfe0d..ebfb4964 100644 --- a/architecture.md +++ b/architecture.md @@ -41,7 +41,7 @@ The scheduler binds unscheduled pods to nodes via the `/binding` API. The schedu All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable. -The [`replicationController`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. +The [`replicationcontroller`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]() -- cgit v1.2.3 From b4aee3255a668603c1f1b37dac0997e4815cb491 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Wed, 20 May 2015 16:54:53 -0700 Subject: in docs, update "minions" to "nodes" --- access.md | 4 ++-- security.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/access.md b/access.md index 8fd09703..647ce552 100644 --- a/access.md +++ b/access.md @@ -65,7 +65,7 @@ Cluster in Large organization: Org-run cluster: - organization that runs K8s master components is same as the org that runs apps on K8s. 
- - Minions may be on-premises VMs or physical machines; Cloud VMs; or a mix.
+ - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.

Hosted cluster:
  - Offering K8s API as a service, or offering a Paas or Saas built on K8s
@@ -223,7 +223,7 @@ Initially:
Improvements:
- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let "webserver" namespace use 100 cores, but to prevent accidents, don't allow "webserver" namespace and "instance=test" use more than 10 cores.
-- tools to help write consistent quota config files based on number of minions, historical namespace usages, QoS needs, etc.
+- tools to help write consistent quota config files based on number of nodes, historical namespace usages, QoS needs, etc.
- way for K8s Cluster Admin to incrementally adjust Quota objects.

Simple profile:
diff --git a/security.md b/security.md
index 4c446d10..26d543c9 100644
--- a/security.md
+++ b/security.md
@@ -104,7 +104,7 @@ A pod runs in a *security context* under a *service account* that is defined by

### TODO: authorization, authentication

-### Isolate the data store from the minions and supporting infrastructure
+### Isolate the data store from the nodes and supporting infrastructure

Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the cluster.
@@ -114,7 +114,7 @@ Both the Kubelet and Kube Proxy need information related to their specific roles The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pods attached volumes. -The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). +The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a node in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. 
It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() -- cgit v1.2.3 From 5ee2d2ea4d7222bc11a3b667d2d6b6a586ee42e4 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Thu, 21 May 2015 11:05:25 -0700 Subject: update docs/design/secrets.md to v1beta3 --- secrets.md | 229 ++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 120 insertions(+), 109 deletions(-) diff --git a/secrets.md b/secrets.md index 119c673a..5f8cb501 100644 --- a/secrets.md +++ b/secrets.md @@ -389,12 +389,14 @@ To create a pod that uses an ssh key stored as a secret, we first need to create ```json { - "apiVersion": "v1beta2", "kind": "Secret", - "id": "ssh-key-secret", + "apiVersion": "v1beta3", + "metadata": { + "name": "ssh-key-secret" + }, "data": { - "id-rsa.pub": "dmFsdWUtMQ0K", - "id-rsa": "dmFsdWUtMg0KDQo=" + "id-rsa": "dmFsdWUtMg0KDQo=", + "id-rsa.pub": "dmFsdWUtMQ0K" } } ``` @@ -407,38 +409,36 @@ Now we can create a pod which references the secret with the ssh key and consume ```json { - "id": "secret-test-pod", "kind": "Pod", - "apiVersion":"v1beta2", - "labels": { - "name": "secret-test" + "apiVersion": "v1beta3", + "metadata": { + "name": "secret-test-pod", + "labels": { + "name": "secret-test" + } }, - "desiredState": { - "manifest": { - "version": "v1beta1", - "id": "secret-test-pod", - "containers": [{ + "spec": { + "volumes": [ + { + "name": "secret-volume", + "secret": { + "secretName": "ssh-key-secret" + } + } + ], + "containers": [ + { "name": 
"ssh-test-container", "image": "mySshImage", - "volumeMounts": [{ - "name": "secret-volume", - "mountPath": "/etc/secret-volume", - "readOnly": true - }] - }], - "volumes": [{ - "name": "secret-volume", - "source": { - "secret": { - "target": { - "kind": "Secret", - "namespace": "example", - "name": "ssh-key-secret" - } + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" } - } - }] - } + ] + } + ] } } ``` @@ -452,105 +452,116 @@ The container is then free to use the secret data to establish an ssh connection ### Use-Case: Pods with pod / test credentials -Let's compare examples where a pod consumes a secret containing prod credentials and another pod -consumes a secret with test environment credentials. +This example illustrates a pod which consumes a secret containing prod +credentials and another pod which consumes a secret with test environment +credentials. The secrets: ```json -[{ - "apiVersion": "v1beta2", - "kind": "Secret", - "id": "prod-db-secret", - "data": { - "username": "dmFsdWUtMQ0K", - "password": "dmFsdWUtMg0KDQo=" - } -}, { - "apiVersion": "v1beta2", - "kind": "Secret", - "id": "test-db-secret", - "data": { - "username": "dmFsdWUtMQ0K", - "password": "dmFsdWUtMg0KDQo=" - } -}] + "apiVersion": "v1beta3", + "kind": "List", + "items": + [{ + "kind": "Secret", + "apiVersion": "v1beta3", + "metadata": { + "name": "prod-db-secret" + }, + "data": { + "password": "dmFsdWUtMg0KDQo=", + "username": "dmFsdWUtMQ0K" + } + }, + { + "kind": "Secret", + "apiVersion": "v1beta3", + "metadata": { + "name": "test-db-secret" + }, + "data": { + "password": "dmFsdWUtMg0KDQo=", + "username": "dmFsdWUtMQ0K" + } + }] +} ``` The pods: ```json -[{ - "id": "prod-db-client-pod", - "kind": "Pod", - "apiVersion":"v1beta2", - "labels": { - "name": "prod-db-client" - }, - "desiredState": { - "manifest": { - "version": "v1beta1", - "id": "prod-db-pod", - "containers": [{ - "name": "db-client-container", - "image": 
"myClientImage", - "volumeMounts": [{ +{ + "apiVersion": "v1beta3", + "kind": "List", + "items": + [{ + "kind": "Pod", + "apiVersion": "v1beta3", + "metadata": { + "name": "prod-db-client-pod", + "labels": { + "name": "prod-db-client" + } + }, + "spec": { + "volumes": [ + { "name": "secret-volume", - "mountPath": "/etc/secret-volume", - "readOnly": true - }] - }], - "volumes": [{ - "name": "secret-volume", - "source": { "secret": { - "target": { - "kind": "Secret", - "namespace": "example", - "name": "prod-db-secret" - } + "secretName": "prod-db-secret" } } - }] + ], + "containers": [ + { + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" + } + ] + } + ] } - } -}, -{ - "id": "test-db-client-pod", - "kind": "Pod", - "apiVersion":"v1beta2", - "labels": { - "name": "test-db-client" }, - "desiredState": { - "manifest": { - "version": "v1beta1", - "id": "test-db-pod", - "containers": [{ - "name": "db-client-container", - "image": "myClientImage", - "volumeMounts": [{ + { + "kind": "Pod", + "apiVersion": "v1beta3", + "metadata": { + "name": "test-db-client-pod", + "labels": { + "name": "test-db-client" + } + }, + "spec": { + "volumes": [ + { "name": "secret-volume", - "mountPath": "/etc/secret-volume", - "readOnly": true - }] - }], - "volumes": [{ - "name": "secret-volume", - "source": { "secret": { - "target": { - "kind": "Secret", - "namespace": "example", - "name": "test-db-secret" - } + "secretName": "test-db-secret" } } - }] + ], + "containers": [ + { + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" + } + ] + } + ] } - } -}] + }] +} ``` The specs for the two pods differ only in the value of the object referred to by the secret volume -- cgit v1.2.3 From 7934ee41659f70d1f5b309abc907693988a523eb Mon Sep 17 00:00:00 2001 From: Paul Morie 
Date: Fri, 22 May 2015 18:21:03 -0400 Subject: Make kubelet expand var refs in cmd, args, env --- expansion.md | 35 ++++++++++------------------------- 1 file changed, 10 insertions(+), 25 deletions(-) diff --git a/expansion.md b/expansion.md index 00c32797..d15f2501 100644 --- a/expansion.md +++ b/expansion.md @@ -207,9 +207,8 @@ The necessary changes to implement this functionality are: 2. Introduce `third_party/golang/expansion` package that provides: 1. An `Expand(string, func(string) string) string` function 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` function -3. Add a new EnvVarSource for expansions and associated tests -4. Make the kubelet expand environment correctly -5. Make the kubelet expand command correctly +3. Make the kubelet expand environment correctly +4. Make the kubelet expand command correctly #### Event Recording @@ -277,31 +276,17 @@ func Expand(input string, mapping func(string) string) string { } ``` -#### Expansion `EnvVarSource` - -In order to avoid changing the existing behavior of the `EnvVar.Value` field, there should be a new -`EnvVarSource` that represents a variable expansion that an env var's value should come from: - -```go -// EnvVarSource represents a source for the value of an EnvVar. -type EnvVarSource struct { - // Other fields omitted - - Expansion *EnvVarExpansion -} - -type EnvVarExpansion struct { - // The input string to be expanded - Expand string -} -``` - #### Kubelet changes -The Kubelet should change to: +The Kubelet should be made to correctly expand variables references in a container's environment, +command, and args. Changes will need to be made to: -1. Correctly expand environment variables with `Expansion` sources -2. Correctly expand references in the Command and Args +1. The `makeEnvironmentVariables` function in the kubelet; this is used by + `GenerateRunContainerOptions`, which is used by both the docker and rkt container runtimes +2. 
The docker manager `setEntrypointAndCommand` func has to be changed to perform variable
+   expansion
+3. The rkt runtime should be made to support expansion in command and args when support for it is
+   implemented

### Examples
-- cgit v1.2.3


From b04be7e742cd98482e332a1caa3d5b71bbbcf636 Mon Sep 17 00:00:00 2001
From: Tim Hockin
Date: Sat, 23 May 2015 13:41:11 -0700
Subject: Rename 'portal IP' to 'cluster IP' most everywhere

This covers obvious transforms, but not --portal_net, $PORTAL_NET and similar.
---
 networking.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/networking.md b/networking.md
index f351629e..cd2bd0c5 100644
--- a/networking.md
+++ b/networking.md
@@ -83,7 +83,7 @@ We want to be able to assign IP addresses externally from Docker ([Docker issue

In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to setup DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.

-[Service](http://docs.k8s.io/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet.
While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](http://docs.k8s.io/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.
+[Service](http://docs.k8s.io/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service IP](http://docs.k8s.io/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service's IP in DNS, and for that to become the preferred resolution protocol.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.).
Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier. -- cgit v1.2.3 From f2a6d63ddaf19110bf33d32e24710831a1bd9938 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Fri, 29 May 2015 01:00:36 -0400 Subject: Corrections to examples in expansion docs --- expansion.md | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/expansion.md b/expansion.md index d15f2501..b3ef161b 100644 --- a/expansion.md +++ b/expansion.md @@ -359,9 +359,7 @@ spec: command: [ "/bin/sh", "-c", "env" ] env: - name: PUBLIC_URL - valueFrom: - expansion: - expand: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" + value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" restartPolicy: Never ``` @@ -383,9 +381,7 @@ spec: fieldRef: fieldPath: "metadata.namespace" - name: PUBLIC_URL - valueFrom: - expansion: - expand: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" + value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" restartPolicy: Never ``` -- cgit v1.2.3 From 4bdc5177692b9c9b946c039aad134a8d7fbda5e0 Mon Sep 17 00:00:00 2001 From: Ben McCann Date: Mon, 1 Jun 2015 20:10:45 -0700 Subject: Document how a secrets server like Vault or Keywhiz might fit into Kubernetes --- secrets.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/secrets.md b/secrets.md index 5f8cb501..e96b0d89 100644 --- a/secrets.md +++ b/secrets.md @@ -148,7 +148,8 @@ have different preferences for the central store of secret data. Some possibili 1. An etcd collection alongside the storage for other API resources 2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module) -3. An external datastore such as an external etcd, RDBMS, etc. +3. 
A secrets server like [Vault](https://www.vaultproject.io/) or [Keywhiz](https://square.github.io/keywhiz/) +4. An external datastore such as an external etcd, RDBMS, etc. #### Size limit for secrets -- cgit v1.2.3 From bd8e7d842472ad28f3b56e6a425939ad48a1274d Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Thu, 28 May 2015 17:21:32 -0700 Subject: Explain that file-based pods cannot use secrets. --- secrets.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/secrets.md b/secrets.md index 5f8cb501..cbf93ee2 100644 --- a/secrets.md +++ b/secrets.md @@ -1,4 +1,3 @@ -# Secret Distribution ## Abstract @@ -184,6 +183,11 @@ For now, we will not implement validations around these limits. Cluster operato much node storage is allocated to secrets. It will be the operator's responsibility to ensure that the allocated storage is sufficient for the workload scheduled onto a node. +For now, kubelets will only attach secrets to api-sourced pods, and not file- or http-sourced +ones. Doing so would: + - confuse the secrets admission controller in the case of mirror pods. + - create an apiserver-liveness dependency -- avoiding this dependency is a main reason to use non-api-source pods. + ### Use-Case: Kubelet read of secrets for node The use-case where the kubelet reads secrets has several additional requirements: -- cgit v1.2.3 From 1bb3ed53eeed2d0cc493f7974b09e4168c35ad9f Mon Sep 17 00:00:00 2001 From: Scott Konzem Date: Fri, 5 Jun 2015 11:35:17 -0400 Subject: Fix misspellings in documentation --- secrets.md | 2 +- service_accounts.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/secrets.md b/secrets.md index cbf93ee2..0bacb8d4 100644 --- a/secrets.md +++ b/secrets.md @@ -122,7 +122,7 @@ We should consider what the best way to allow this is; there are a few different 3. Give secrets attributes that allow the user to express that the secret should be presented to the container as an environment variable. 
The container's environment would contain the
-    desired values and the software in the container could use them without accomodation the
+    desired values and the software in the container could use them without accommodation the
    command or setup script.

For our initial work, we will treat all secrets as files to narrow the problem space. There will
diff --git a/service_accounts.md b/service_accounts.md
index 9e6bc099..72a10207 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -149,7 +149,7 @@ First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no `P
then it copies in the referenced securityContext and secrets references for the corresponding
`serviceAccount`.

Second, if ServiceAccount definitions change, it may take some actions.
-**TODO**: decide what actions it takes when a serviceAccount defintion changes. Does it stop pods, or just
+**TODO**: decide what actions it takes when a serviceAccount definition changes. Does it stop pods, or just
allow someone to list ones that are out of spec? In general, people may want to customize this?

Third, if a new namespace is created, it may create a new serviceAccount for that namespace. This may include
-- cgit v1.2.3


From db372e1f640d4b903c228d6e0d536f55206a2bd2 Mon Sep 17 00:00:00 2001
From: Kris Rousey
Date: Fri, 5 Jun 2015 12:47:15 -0700
Subject: Updating docs/ to v1

---
 expansion.md          |  4 ++--
 namespaces.md         |  8 ++++----
 persistent-storage.md |  6 +++---
 secrets.md            | 16 ++++++++--------
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/expansion.md b/expansion.md
index b3ef161b..f4c85e8d 100644
--- a/expansion.md
+++ b/expansion.md
@@ -348,7 +348,7 @@ No other variables are defined.

Notice the `$(var)` syntax.
```yaml -apiVersion: v1beta3 +apiVersion: v1 kind: Pod metadata: name: expansion-pod @@ -366,7 +366,7 @@ spec: #### In a pod: building a URL using downward API ```yaml -apiVersion: v1beta3 +apiVersion: v1 kind: Pod metadata: name: expansion-pod diff --git a/namespaces.md b/namespaces.md index c4a1a90d..0fef2bed 100644 --- a/namespaces.md +++ b/namespaces.md @@ -231,7 +231,7 @@ OpenShift creates a Namespace in Kubernetes ``` { - "apiVersion":"v1beta3", + "apiVersion":"v1", "kind": "Namespace", "metadata": { "name": "development", @@ -256,7 +256,7 @@ User deletes the Namespace in Kubernetes, and Namespace now has following state: ``` { - "apiVersion":"v1beta3", + "apiVersion":"v1", "kind": "Namespace", "metadata": { "name": "development", @@ -281,7 +281,7 @@ removing *kubernetes* from the list of finalizers: ``` { - "apiVersion":"v1beta3", + "apiVersion":"v1", "kind": "Namespace", "metadata": { "name": "development", @@ -309,7 +309,7 @@ This results in the following state: ``` { - "apiVersion":"v1beta3", + "apiVersion":"v1", "kind": "Namespace", "metadata": { "name": "development", diff --git a/persistent-storage.md b/persistent-storage.md index b52e6b71..21a5650d 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -98,7 +98,7 @@ An administrator provisions storage by posting PVs to the API. Various way to a POST: kind: PersistentVolume -apiVersion: v1beta3 +apiVersion: v1 metadata: name: pv0001 spec: @@ -128,7 +128,7 @@ The user must be within a namespace to create PVCs. POST: kind: PersistentVolumeClaim -apiVersion: v1beta3 +apiVersion: v1 metadata: name: myclaim-1 spec: @@ -179,7 +179,7 @@ The claim holder owns the claim and its data for as long as the claim exists. 
T POST: kind: Pod -apiVersion: v1beta3 +apiVersion: v1 metadata: name: mypod spec: diff --git a/secrets.md b/secrets.md index 0bacb8d4..4d74e68b 100644 --- a/secrets.md +++ b/secrets.md @@ -394,7 +394,7 @@ To create a pod that uses an ssh key stored as a secret, we first need to create ```json { "kind": "Secret", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "ssh-key-secret" }, @@ -414,7 +414,7 @@ Now we can create a pod which references the secret with the ssh key and consume ```json { "kind": "Pod", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "secret-test-pod", "labels": { @@ -464,12 +464,12 @@ The secrets: ```json { - "apiVersion": "v1beta3", + "apiVersion": "v1", "kind": "List", "items": [{ "kind": "Secret", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "prod-db-secret" }, @@ -480,7 +480,7 @@ The secrets: }, { "kind": "Secret", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "test-db-secret" }, @@ -496,12 +496,12 @@ The pods: ```json { - "apiVersion": "v1beta3", + "apiVersion": "v1", "kind": "List", "items": [{ "kind": "Pod", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "prod-db-client-pod", "labels": { @@ -534,7 +534,7 @@ The pods: }, { "kind": "Pod", - "apiVersion": "v1beta3", + "apiVersion": "v1", "metadata": { "name": "test-db-client-pod", "labels": { -- cgit v1.2.3 From 867825849dbdd7105fd037239a64397bf2c8d969 Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Fri, 5 Jun 2015 14:50:11 -0700 Subject: Purge cluster/kubectl.sh from nearly all docs. Mark cluster/kubectl.sh as deprecated. 
--- persistent-storage.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index 21a5650d..3729f30e 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -110,7 +110,7 @@ spec: -------------------------------------------------- -cluster/kubectl.sh get pv +kubectl get pv NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM pv0001 map[] 10737418240 RWO Pending @@ -140,7 +140,7 @@ spec: -------------------------------------------------- -cluster/kubectl.sh get pvc +kubectl get pvc NAME LABELS STATUS VOLUME @@ -155,13 +155,13 @@ myclaim-1 map[] pending ``` -cluster/kubectl.sh get pv +kubectl get pv NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e -cluster/kubectl.sh get pvc +kubectl get pvc NAME LABELS STATUS VOLUME myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e @@ -205,7 +205,7 @@ When a claim holder is finished with their data, they can delete their claim. ``` -cluster/kubectl.sh delete pvc myclaim-1 +kubectl delete pvc myclaim-1 ``` -- cgit v1.2.3 From a407b64a3d2be8e3ddca9192609c72e92b64a6a9 Mon Sep 17 00:00:00 2001 From: Ed Costello Date: Thu, 11 Jun 2015 01:11:44 -0400 Subject: Copy edits for spelling errors and typos Signed-off-by: Ed Costello --- clustering.md | 2 +- event_compression.md | 2 +- expansion.md | 4 ++-- security.md | 4 ++-- service_accounts.md | 2 +- simple-rolling-update.md | 2 +- 6 files changed, 8 insertions(+), 8 deletions(-) diff --git a/clustering.md b/clustering.md index d57d631d..4cef06f8 100644 --- a/clustering.md +++ b/clustering.md @@ -38,7 +38,7 @@ The proposed solution will provide a range of options for setting up and maintai The building blocks of an easier solution: -* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly idenitfy the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. 
We will also use client certificates for all AuthN. +* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN. * [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate. * **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors. * **Scoped Kubelet Accounts** These accounts are per-minion and (optionally) give a minion permission to register itself. diff --git a/event_compression.md b/event_compression.md index db0337f0..74aba66f 100644 --- a/event_compression.md +++ b/event_compression.md @@ -25,7 +25,7 @@ Instead of a single Timestamp, each event object [contains](https://github.com/G Each binary that generates events: * Maintains a historical record of previously generated events: - * Implmented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/pkg/client/record/events_cache.go). + * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/pkg/client/record/events_cache.go). 
* The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: * ```event.Source.Component``` * ```event.Source.Host``` diff --git a/expansion.md b/expansion.md index f4c85e8d..8b31526a 100644 --- a/expansion.md +++ b/expansion.md @@ -55,7 +55,7 @@ available to subsequent expansions. ### Use Case: Variable expansion in command Users frequently need to pass the values of environment variables to a container's command. -Currently, Kubernetes does not perform any expansion of varibles. The workaround is to invoke a +Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a shell in the container's command and have the shell perform the substitution, or to write a wrapper script that sets up the environment and runs the command. This has a number of drawbacks: @@ -116,7 +116,7 @@ expanded, then `$(VARIABLE_NAME)` should be present in the output. Although the `$(var)` syntax does overlap with the `$(command)` form of command substitution supported by many shells, because unexpanded variables are present verbatim in the output, we -expect this will not present a problem to many users. If there is a collision between a varible +expect this will not present a problem to many users. If there is a collision between a variable name and command substitution syntax, the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not. diff --git a/security.md b/security.md index 26d543c9..6ea611b7 100644 --- a/security.md +++ b/security.md @@ -22,13 +22,13 @@ While Kubernetes today is not primarily a multi-tenant system, the long term evo We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories: -1. 
k8s admin - administers a kubernetes cluster and has access to the undelying components of the system +1. k8s admin - administers a kubernetes cluster and has access to the underlying components of the system 2. k8s project administrator - administrates the security of a small subset of the cluster 3. k8s developer - launches pods on a kubernetes cluster and consumes cluster resources Automated process users fall into the following categories: -1. k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources indepedent of the human users attached to a project +1. k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources independent of the human users attached to a project 2. k8s infrastructure user - the user that kubernetes infrastructure components use to perform cluster functions with clearly defined roles diff --git a/service_accounts.md b/service_accounts.md index 72a10207..e87e8e6c 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -13,7 +13,7 @@ Processes in Pods may need to call the Kubernetes API. For example: They also may interact with services other than the Kubernetes API, such as: - an image repository, such as docker -- both when the images are pulled to start the containers, and for writing images in the case of pods that generate images. - - accessing other cloud services, such as blob storage, in the context of a larged, integrated, cloud offering (hosted + - accessing other cloud services, such as blob storage, in the context of a large, integrated, cloud offering (hosted or private). 
- accessing files in an NFS volume attached to the pod diff --git a/simple-rolling-update.md b/simple-rolling-update.md index fed1b84f..e5b47d98 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -22,7 +22,7 @@ The value of that label is the hash of the complete JSON representation of the`` If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out. To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replicaController in the ```kubernetes.io/``` annotation namespace: * ```desired-replicas``` The desired number of replicas for this controller (either N or zero) - * ```update-partner``` A pointer to the replicaiton controller resource that is the other half of this update (syntax `````` the namespace is assumed to be identical to the namespace of this replication controller.) + * ```update-partner``` A pointer to the replication controller resource that is the other half of this update (syntax `````` the namespace is assumed to be identical to the namespace of this replication controller.) Recovery is achieved by issuing the same command again: -- cgit v1.2.3 From 16355903a3e2954988791e55864edfdf2d82fd5d Mon Sep 17 00:00:00 2001 From: Marek Biskup Date: Wed, 17 Jun 2015 12:36:19 +0200 Subject: double dash replaced by html mdash --- networking.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/networking.md b/networking.md index cd2bd0c5..66234e6b 100644 --- a/networking.md +++ b/networking.md @@ -10,11 +10,11 @@ With the IP-per-pod model, all user containers within a pod behave as if they ar In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. 
People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them. -The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host. +The approach does reduce isolation between containers within a pod — ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host. -When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. 
(We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC. +When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from — each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC. -This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container the peer would see the connect coming from a different IP than the container itself knows. In short - you can never self-register anything from a container, because a container can not be reached on its private IP. +This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container the peer would see the connect coming from a different IP than the container itself knows. In short — you can never self-register anything from a container, because a container can not be reached on its private IP. 
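The self-registration property described here can be sketched concretely. This is an illustrative snippet, not code from the Kubernetes tree: under IP-per-pod, the address a process discovers locally is the same address its peers can reach, so registering it in a store like etcd actually works. The `10.255.255.255` destination is an arbitrary placeholder; a UDP connect sends no packets.

```python
import socket

def discover_own_ip() -> str:
    """Return the IP this host would use on the cluster network.

    Connecting a UDP socket performs no network I/O; it only asks the
    kernel which local address it would pick for that destination.
    Under the IP-per-pod model this address is also what peers see,
    so a pod can safely self-register it. Falls back to loopback when
    no route exists.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("10.255.255.255", 1))  # placeholder destination, never contacted
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()
```

Under the default Docker model the same probe would return the private 172-dot address, which peers cannot reach; that is exactly the self-registration problem described above.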
An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms. @@ -53,7 +53,7 @@ GCE itself does not know anything about these IPs, though. These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. To set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).) -We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container. +We create a container to use for the pod network namespace — a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container. Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode. @@ -89,7 +89,7 @@ We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), no ### External routability -We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. 
And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers). +We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 — not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers). We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP. -- cgit v1.2.3 From 9e35c48d4abfa4b1bae2b4ed3a81047d6604985e Mon Sep 17 00:00:00 2001 From: RichieEscarez Date: Tue, 16 Jun 2015 14:48:51 -0700 Subject: Qualified all references to "controller" so that references to "replication controller" are clear. 
fixes #9404 Also ran hacks/run-gendocs.sh --- access.md | 2 +- service_accounts.md | 2 +- simple-rolling-update.md | 14 +++++++------- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/access.md b/access.md index 647ce552..dd64784e 100644 --- a/access.md +++ b/access.md @@ -193,7 +193,7 @@ K8s authorization should: - Allow for a range of maturity levels, from single-user for those test driving the system, to integration with existing to enterprise authorization systems. - Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service). - Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults. -- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Controllers, Services, and the identities and policies for those Pods and Controllers. +- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Replication Controllers, Services, and the identities and policies for those Pods and Replication Controllers. - Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies. K8s will implement a relatively simple diff --git a/service_accounts.md b/service_accounts.md index e87e8e6c..63c12a30 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -5,7 +5,7 @@ Processes in Pods may need to call the Kubernetes API. 
For example: - scheduler - replication controller - - minion controller + - node controller - a map-reduce type framework which has a controller that then tries to make a dynamically determined number of workers and watch them - continuous build and push system - monitoring system diff --git a/simple-rolling-update.md b/simple-rolling-update.md index e5b47d98..0208b609 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -8,20 +8,20 @@ Assume that we have a current replication controller named ```foo``` and it is r ```kubectl rolling-update rc foo [foo-v2] --image=myimage:v2``` -If the user doesn't specify a name for the 'next' controller, then the 'next' controller is renamed to -the name of the original controller. +If the user doesn't specify a name for the 'next' replication controller, then the 'next' replication controller is renamed to +the name of the original replication controller. Obviously there is a race here, where if you kill the client between delete foo, and creating the new version of 'foo' you might be surprised about what is there, but I think that's ok. See [Recovery](#recovery) below -If the user does specify a name for the 'next' controller, then the 'next' controller is retained with its existing name, -and the old 'foo' controller is deleted. For the purposes of the rollout, we add a unique-ifying label ```kubernetes.io/deployment``` to both the ```foo``` and ```foo-next``` controllers. -The value of that label is the hash of the complete JSON representation of the```foo-next``` or```foo``` controller. The name of this label can be overridden by the user with the ```--deployment-label-key``` flag. +If the user does specify a name for the 'next' replication controller, then the 'next' replication controller is retained with its existing name, +and the old 'foo' replication controller is deleted. 
For the purposes of the rollout, we add a unique-ifying label ```kubernetes.io/deployment``` to both the ```foo``` and ```foo-next``` replication controllers. +The value of that label is the hash of the complete JSON representation of the```foo-next``` or```foo``` replication controller. The name of this label can be overridden by the user with the ```--deployment-label-key``` flag. #### Recovery If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out. -To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replicaController in the ```kubernetes.io/``` annotation namespace: - * ```desired-replicas``` The desired number of replicas for this controller (either N or zero) +To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the ```kubernetes.io/``` annotation namespace: + * ```desired-replicas``` The desired number of replicas for this replication controller (either N or zero) * ```update-partner``` A pointer to the replication controller resource that is the other half of this update (syntax `````` the namespace is assumed to be identical to the namespace of this replication controller.) 
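The label computation above can be sketched as follows. The text only specifies "the hash of the complete JSON representation", so sha256 and key sorting here are illustrative assumptions, not the actual kubectl implementation:

```python
import hashlib
import json

def deployment_label_value(controller: dict) -> str:
    """Compute a kubernetes.io/deployment label value as a hash of the
    replication controller's complete JSON representation. Sorted keys
    make the serialization, and therefore the hash, deterministic."""
    blob = json.dumps(controller, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

foo = {"kind": "ReplicationController", "metadata": {"name": "foo"}}
foo_next = {"kind": "ReplicationController", "metadata": {"name": "foo-next"}}
# Both halves of the rollout carry a label with this value:
label = {"kubernetes.io/deployment": deployment_label_value(foo_next)}
```

Because the value is derived from the controller's own JSON, reissuing the same rolling update recomputes the same label, which is what makes recovery by repeating the command possible.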
Recovery is achieved by issuing the same command again: -- cgit v1.2.3 From 0b9ca955f4ee89ed39f1d8215ec850ac7bf7bbd0 Mon Sep 17 00:00:00 2001 From: Salvatore Dario Minonne Date: Fri, 26 Jun 2015 09:44:28 +0200 Subject: Adding IANA_SVC_NAME definition to docs/design/identifiers.md --- identifiers.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/identifiers.md b/identifiers.md index b75577c2..23b976d3 100644 --- a/identifiers.md +++ b/identifiers.md @@ -20,6 +20,8 @@ Name [rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID) : A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination +[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) port name (IANA_SVC_NAME) +: An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, with the '-' character allowed anywhere except the first or the last character or adjacent to another '-' character; it must contain at least one (a-z) character ## Objectives for names and UIDs -- cgit v1.2.3 From 7c21cef64b9ea25e0a160c0e7384c3af4ccdd258 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Tue, 30 Jun 2015 00:51:16 -0700 Subject: Initial design doc for scheduler. --- scheduler.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) create mode 100644 scheduler.md diff --git a/scheduler.md b/scheduler.md new file mode 100644 index 00000000..e2a9f35d --- /dev/null +++ b/scheduler.md @@ -0,0 +1,50 @@ + +# The Kubernetes Scheduler + +The Kubernetes scheduler runs as a process alongside the other master +components such as the API server. Its interface to the API server is to watch +for Pods with an empty PodSpec.NodeName, and for each Pod, it posts a Binding +indicating where the Pod should be scheduled. + +## The scheduling process + +The scheduler tries to find a node for each Pod, one at a time, as it notices +these Pods via watch. There are three steps.
First, it applies a set of "predicates" that filter out +inappropriate nodes. For example, if the PodSpec specifies resource limits, then the scheduler +will filter out nodes that don't have at least that much of each resource available (computed +as the capacity of the node minus the sum of the resource limits of the containers that +are already running on the node). Second, it applies a set of "priority functions" +that rank the nodes that weren't filtered out by the predicate check. For example, +it tries to spread Pods across nodes while at the same time favoring the least-loaded +nodes (where "load" here is the sum of the resource limits of the containers running on the node, +divided by the node's capacity). +Finally, the node with the highest priority is chosen +(or, if there are multiple such nodes, then one of them is chosen at random). The code +for this main scheduling loop is in the function `Schedule()` in +[plugin/pkg/scheduler/generic_scheduler.go](../../plugin/pkg/scheduler/generic_scheduler.go)
+However, the choice of policies +can be overridden by passing the command-line flag `--policy-config-file` to the scheduler, pointing to a JSON +file specifying which scheduling policies to use. See +[examples/scheduler-policy-config.json](../../examples/scheduler-policy-config.json) for an example +config file. (Note that the config file format is versioned; the API is defined in +[plugin/pkg/scheduler/api/](../../plugin/pkg/scheduler/api/)). +Thus to add a new scheduling policy, you should modify predicates.go or priorities.go, +and either register the policy in `defaultPredicates()` or `defaultPriorities()`, or use a policy config file. + +## Exploring the code + +If you want to get a global picture of how the scheduler works, you can start in +[plugin/cmd/kube-scheduler/app/server.go](../../plugin/cmd/kube-scheduler/app/server.go) + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler.md?pixel)]() -- cgit v1.2.3 From 667782f84caa0247eb227350810b414a2d84c3e0 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Tue, 30 Jun 2015 13:27:31 -0700 Subject: Add user-oriented compute resource doc. Adds docs/compute_resources.md with user-oriented explanation of compute resources. Reveals detail gradually and includes examples and troubleshooting. Examples are tested. Moves design-focused docs/resources.md to docs/design/resources.md. Updates links to that. --- access.md | 2 +- resources.md | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 217 insertions(+), 1 deletion(-) create mode 100644 resources.md diff --git a/access.md b/access.md index dd64784e..72ca969c 100644 --- a/access.md +++ b/access.md @@ -212,7 +212,7 @@ Policy objects may be applicable only to a single namespace or to all namespaces ## Accounting -The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). 
A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources.md](/docs/resources.md)). +The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.md)). Initially: - a `quota` object is immutable. diff --git a/resources.md b/resources.md new file mode 100644 index 00000000..17bb5c18 --- /dev/null +++ b/resources.md @@ -0,0 +1,216 @@ +**Note: this is a design doc, which describes features that have not been completely implemented. +User documentation of the current state is [here](../resources.md). The tracking issue for +implementation of this model is +[#168](https://github.com/GoogleCloudPlatform/kubernetes/issues/168). Currently, only memory and +cpu limits on containers (not pods) are supported. "memory" is in bytes and "cpu" is in +milli-cores.** + +# The Kubernetes resource model + +To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document. + +The resource model aims to be: +* simple, for common cases; +* extensible, to accommodate future growth; +* regular, with few special cases; and +* precise, to avoid misunderstandings and promote pod portability. + +## The resource model +A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth. + +Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. 
This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_. + +Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later. + +### Resource types + +All resources have a _type_ that is identified by their _typename_ (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots). + +A fully-qualified resource typename is constructed from a DNS-style _subdomain_, followed by a slash `/`, followed by a name. +* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) (e.g., `kubernetes.io`, `example.com`). +* The name must be not more than 63 characters, consisting of upper- or lower-case alphanumeric characters, with the `-`, `_`, and `.` characters allowed anywhere except the first or last character. +* As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes _namespace_, `kubernetes.io/` in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace. + +Some example typenames include `memory` (which will be fully-qualified as `kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`. + +For future reference, note that some resources, such as CPU and network bandwidth, are _compressible_, which means that their usage can potentially be throttled in a relatively benign manner. 
All other resources are _incompressible_, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources. + +### Resource quantities + +Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?). + +Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources. + +To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffixes (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M", "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m"). + * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people. + * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1. + * These conventions only apply to resource quantities, not arbitrary values.
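A minimal sketch of these parsing conventions follows. This is a hypothetical helper, not the real Kubernetes quantity parser; suffixes above P/Pi and exponent forms like "129e6" are omitted for brevity:

```python
import re
from decimal import Decimal

# Milli-units per suffixed unit. Suffixes are case-sensitive, per the rules above.
MILLI_PER_UNIT = {
    "m": 1, "": 1000,
    "K": 10**6, "M": 10**9, "G": 10**12, "T": 10**15, "P": 10**18,
    "Ki": 1024 * 1000, "Mi": 1024**2 * 1000,
    "Gi": 1024**3 * 1000, "Ti": 1024**4 * 1000, "Pi": 1024**5 * 1000,
}

def parse_quantity(s: str) -> int:
    """Convert an external quantity ("129M", "123Mi", "0.3", "300m") into
    integer milli-units, scaling on input so that all internal arithmetic
    stays integral. Decimal keeps the conversion exact, avoiding the binary
    floating-point rounding the design warns against."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", s)
    if not m or m.group(2) not in MILLI_PER_UNIT:
        raise ValueError(f"malformed quantity: {s!r}")
    return int(Decimal(m.group(1)) * MILLI_PER_UNIT[m.group(2)])

print(parse_quantity("129M"), parse_quantity("123Mi"))  # 129000000000 128974848000
```

Note that "128974848" and "123Mi" parse to the same number of milli-units, matching the equivalence claimed in the examples above.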
+ +Internally (i.e., everywhere else), Kubernetes will represent resource quantities as integers so it can avoid problems with rounding errors, and will not use strings to represent numeric values. To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not. + * Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations. + * Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. That seems like the right choice. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs. + +### Resource specifications + +Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers, need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes distinguishes between specifications of *desired state*, aka the Spec, and representations of *current state*, aka the Status.
Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now. + +Resource requirements for a container or pod should have the following form: +``` +resourceRequirementSpec: [ + request: [ cpu: 2.5, memory: "40Mi" ], + limit: [ cpu: 4.0, memory: "99Mi" ], +] +``` +Where: +* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses. + +* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory. + +Total capacity for a node should have a similar structure: +``` +resourceCapacitySpec: [ + total: [ cpu: 12, memory: "128Gi" ] +] +``` +Where: +* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes. 
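A scheduler's feasibility test over these specifications can be sketched as follows (a hypothetical helper, not Kubernetes code, assuming quantities have already been scaled to integer internal units):

```python
# Hypothetical sketch of the feasibility check described above: a pod fits on
# a node iff, for every resource type, the sum of already-allocated requests
# plus the pod's request does not exceed the node's total capacity.

def fits(pod_request, allocated_requests, node_total):
    """All arguments are dicts/lists of dicts: resource type -> amount."""
    for resource, capacity in node_total.items():
        used = sum(r.get(resource, 0) for r in allocated_requests)
        if used + pod_request.get(resource, 0) > capacity:
            return False
    return True

node_total = {"cpu": 12000, "memory": 128 * 2**30}   # milli-KCUs, bytes
allocated = [{"cpu": 2500, "memory": 40 * 2**20},
             {"cpu": 4000, "memory": 99 * 2**20}]
print(fits({"cpu": 2500, "memory": 40 * 2**20}, allocated, node_total))  # fits
print(fits({"cpu": 8000, "memory": 1 * 2**20}, allocated, node_total))   # exceeds CPU capacity
```

Missing entries default to 0, mirroring the defaulting rule for user-defined resource types.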
+ +#### Notes + + * It is an error to specify the same resource type more than once in each list. + + * It is an error for the _request_ or _limit_ values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.) + + * If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a _request_, it's probably a good idea. (Note that the _request_ could be filled in by an automated system that is observing actual usage and/or historical data.) + + * Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expect an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet. + + + +## Kubernetes-defined resource types +The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet. + +### Processor cycles + * Name: `cpu` (or `kubernetes.io/cpu`) + * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU") + * Internal representation: milli-KCUs + * Compressible? 
yes + * Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](https://github.com/GoogleCloudPlatform/kubernetes/issues/147) + * [future] `schedulingLatency`: as per lmctfy + * [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0). + +To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code. + +Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource _qualities_ (a future feature). + + +### Memory + * Name: `memory` (or `kubernetes.io/memory`) + * Units: bytes + * Compressible? no (at least initially) + +The precise meaning of "memory" is implementation dependent, but the basic idea is to rely on the underlying `memcg` mechanisms, support, and definitions. + +Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities +rather than decimal ones: "64Mi" rather than "64M". + + +## Resource metadata +A resource type may have an associated read-only ResourceType structure that contains metadata about the type. For example: +``` +resourceTypes: [ + "kubernetes.io/memory": [ + isCompressible: false, ... + ] + "kubernetes.io/cpu": [ + isCompressible: true, internalScaleExponent: 3, ... + ] + "kubernetes.io/disk-space": [ ... ] +] +``` + +Kubernetes will provide ResourceType metadata for its predefined types. 
If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value. + +The defined properties are as follows: + +| field name | type | contents | +| ---------- | ---- | -------- | +| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) | +| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) | +| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". | +| isCompressible | bool, default=false | true if the resource type is compressible | +| defaultRequest | string, default=none | in the same format as a user-supplied value | +| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). | + + +# Appendix: future extensions + +The following are planned future extensions to the resource model, included here to encourage comments. + +## Usage data + +Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD. 
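As an illustration of how a component might consume this metadata, the following sketch applies `internalScaleExponent` when reading a value in (the helper itself is hypothetical; only the field names come from the table above):

```python
# Sketch: applying ResourceType metadata when reading a user-supplied quantity.
# Field names follow the metadata table above; the helper is hypothetical.

RESOURCE_TYPES = {
    "kubernetes.io/cpu":    {"internalScaleExponent": 3, "isCompressible": True},
    "kubernetes.io/memory": {"internalScaleExponent": 0, "isCompressible": False},
}

def to_internal(resource, external_value):
    """Scale an external value to its integer internal representation."""
    meta = RESOURCE_TYPES.get(resource, {})
    # Unknown (user-defined) types default to exponent 0, i.e., no milli-units.
    exponent = meta.get("internalScaleExponent", 0)
    return round(external_value * 10 ** exponent)

print(to_internal("kubernetes.io/cpu", 2.5))          # 2500 (milli-KCUs)
print(to_internal("kubernetes.io/memory", 41943040))  # 41943040 (bytes, unscaled)
```

This mirrors the rule above: with no metadata, a resource type is treated as incompressible and not milli-scaled.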
+ +Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information: + +``` +resourceStatus: [ + usage: [ cpu: <CPU-info>, memory: <memory-info> ], + maxusage: [ cpu: <CPU-info>, memory: <memory-info> ], + predicted: [ cpu: <CPU-info>, memory: <memory-info> ], +] +``` + +where a `<CPU-info>` or `<memory-info>` structure looks like this: +``` +{ + mean: <value> # arithmetic mean + max: <value> # maximum value + min: <value> # minimum value + count: <value> # number of data points + percentiles: [ # map from %iles to values + "10": <10th-percentile-value>, + "50": <50th-percentile-value>, + "99": <99th-percentile-value>, + "99.9": <99.9th-percentile-value>, + ... + ] + } +``` +All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_ + +## Future resource types + +### _[future] Network bandwidth_ + * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`) + * Units: bytes per second + * Compressible? yes + +### _[future] Network operations_ + * Name: "network-iops" (or `kubernetes.io/network-iops`) + * Units: operations (messages) per second + * Compressible? yes + +### _[future] Storage space_ + * Name: "storage-space" (or `kubernetes.io/storage-space`) + * Units: bytes + * Compressible? no + +The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these is left for future work. 
+ +### _[future] Storage time_ + * Name: storage-time (or `kubernetes.io/storage-time`) + * Units: seconds per second of disk time + * Internal representation: milli-units + * Compressible? yes + +This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second. + +### _[future] Storage operations_ + * Name: "storage-iops" (or `kubernetes.io/storage-iops`) + * Units: operations per second + * Compressible? yes + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]() -- cgit v1.2.3 From b2a3f3fbbed6e3fc57151e8016e9de02782f822b Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Mon, 6 Jul 2015 15:58:00 -0700 Subject: De-dup,overhaul networking docs --- networking.md | 247 +++++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 157 insertions(+), 90 deletions(-) diff --git a/networking.md b/networking.md index 66234e6b..8bf03437 100644 --- a/networking.md +++ b/networking.md @@ -1,107 +1,174 @@ # Networking -## Model and motivation - -Kubernetes deviates from the default Docker networking model. The goal is for each pod to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration. 
- -OTOH, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts and reuse and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems. - -With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses. - -In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them. 
- -The approach does reduce isolation between containers within a pod — ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host. - -When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from — each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC. - -This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container the peer would see the connect coming from a different IP than the container itself knows. In short — you can never self-register anything from a container, because a container can not be reached on its private IP. - -An alternative we considered was an additional layer of addressing: pod-centric IP per container. 
Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms. - -## Current implementation - -For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network. - -Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15). - -We start Docker with: - DOCKER_OPTS="--bridge cbr0 --iptables=false" - -We set up this bridge on each node with SaltStack, in [container_bridge.py](cluster/saltbase/salt/_states/container_bridge.py). - - cbr0: - container_bridge.ensure: - - cidr: {{ grains['cbr-cidr'] }} - ... - grains: - roles: - - kubernetes-pool - cbr-cidr: $MINION_IP_RANGE - -We make these addresses routable in GCE: - - gcloud compute routes add "${MINION_NAMES[$i]}" \ - --project "${PROJECT}" \ - --destination-range "${MINION_IP_RANGES[$i]}" \ - --network "${NETWORK}" \ - --next-hop-instance "${MINION_NAMES[$i]}" \ - --next-hop-instance-zone "${ZONE}" & - -The minion IP ranges are /24s in the 10-dot space. 
+There are 4 distinct networking problems to solve: +1. Highly-coupled container-to-container communications +2. Pod-to-Pod communications +3. Pod-to-Service communications +4. External-to-internal communications -GCE itself does not know anything about these IPs, though. - -These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. To set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).) - -We create a container to use for the pod network namespace — a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container. - -Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode. - -1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name. - - creates a new network namespace (netns) and loopback device - - creates a new pair of veth devices and binds them to the netns - - auto-assigns an IP from docker’s IP range - -2. Create the user containers and specify the name of the pod infra container as their “POD” argument. Docker finds the PID of the command running in the pod infra container and attaches to the netns and ipcns of that PID. +## Model and motivation -### Other networking implementation examples -With the primary aim of providing IP-per-pod-model, other implementations exist to serve the purpose outside of GCE. 
+Kubernetes deviates from the default Docker networking model (though as of
+Docker 1.8 their network plugins are getting closer). The goal is for each pod
+to have an IP in a flat shared networking namespace that has full communication
+with other physical computers and containers across the network. IP-per-pod
+creates a clean, backward-compatible model where pods can be treated much like
+VMs or physical hosts from the perspectives of port allocation, networking,
+naming, service discovery, load balancing, application configuration, and
+migration.
+
+Dynamic port allocation, on the other hand, requires supporting both static
+ports (e.g., for externally accessible services) and dynamically allocated
+ports, requires partitioning centrally allocated and locally acquired dynamic
+ports, complicates scheduling (since ports are a scarce resource), is
+inconvenient for users, complicates application configuration, is plagued by
+port conflicts and reuse and exhaustion, requires non-standard approaches to
+naming (e.g. consul or etcd rather than DNS), requires proxies and/or
+redirection for programs using standard naming/addressing mechanisms (e.g. web
+browsers), requires watching and cache invalidation for address/port changes
+for instances in addition to watching group membership changes, and obstructs
+container/pod migration (e.g. using CRIU). NAT introduces additional complexity
+by fragmenting the addressing space, which breaks self-registration mechanisms,
+among other problems.
+
+## Container to container
+
+All containers within a pod behave as if they are on the same host with regard
+to networking. They can all reach each other’s ports on localhost. This offers
+simplicity (static ports known a priori), security (ports bound to localhost
+are visible within the pod but never outside it), and performance. This also
+reduces friction for applications moving from the world of uncontainerized apps
+on physical or virtual hosts. 
People running application stacks together on
+the same host have already figured out how to make ports not conflict and have
+arranged for clients to find them.
+
+## Pod to pod
+
+Because every pod gets a "real" (not machine-private) IP address, pods can
+communicate without proxies or translations. They can use well-known port
+numbers and can avoid the use of higher-level service discovery systems like
+DNS-SD, Consul, or Etcd.
+
+When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
+it sees the same IP that any peer container would see them coming from —
+each pod has its own IP address that other pods can know. By making IP addresses
+and ports the same both inside and outside the pods, we create a NAT-less, flat
+address space. Running "ip addr show" should work as expected. This would enable
+all existing naming/discovery mechanisms to work out of the box, including
+self-registration mechanisms and applications that distribute IP addresses. We
+should be optimizing for inter-pod network communication. Within a pod,
+containers are more likely to use communication through volumes (e.g., tmpfs) or
+IPC.
+
+This is different from the standard Docker model. In that mode, each container
+gets an IP in the 172-dot space and would only see that 172-dot address from
+SIOCGIFADDR. 
If these containers connect to another container the peer would see +the connect coming from a different IP than the container itself knows. In short +— you can never self-register anything from a container, because a +container can not be reached on its private IP. + +An alternative we considered was an additional layer of addressing: pod-centric +IP per container. Each container would have its own local IP address, visible +only within that pod. This would perhaps make it easier for containerized +applications to move from physical/virtual hosts to pods, but would be more +complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) +and to reason about, due to the additional layer of address translation, and +would break self-registration and IP distribution mechanisms. + +Like Docker, ports can still be published to the host node's interface(s), but +the need for this is radically diminished. + +## Implementation + +For the Google Compute Engine cluster configuration scripts, we use [advanced +routing rules](https://developers.google.com/compute/docs/networking#routing) +and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that +get routed to it. This is in addition to the 'main' IP address assigned to the +VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to +differentiate it from `docker0`) is set up outside of Docker proper. + +Example of GCE's advanced routing rules: + +``` +gcloud compute routes add "${MINION_NAMES[$i]}" \ + --project "${PROJECT}" \ + --destination-range "${MINION_IP_RANGES[$i]}" \ + --network "${NETWORK}" \ + --next-hop-instance "${MINION_NAMES[$i]}" \ + --next-hop-instance-zone "${ZONE}" & +``` + +GCE itself does not know anything about these IPs, though. This means that when +a pod tries to egress beyond GCE's project the packets must be SNAT'ed +(masqueraded) to the VM's IP, which GCE recognizes and allows. 
+ +### Other implementations
+
+With the primary aim of providing an IP-per-pod model, other implementations exist
+to serve the purpose outside of GCE.
 - [OpenVSwitch with GRE/VxLAN](../ovs-networking.md)
 - [Flannel](https://github.com/coreos/flannel#flannel)
+ - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
+ ("With Linux Bridge devices" section)
+ - [Weave](https://github.com/zettio/weave) is yet another way to build an
+ overlay network, primarily aiming at Docker integration.
+ - [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
+ container IPs.
+
+## Pod to service
+
+The [service](../services.md) abstraction provides a way to group pods under a
+common access policy (e.g. load-balanced). The implementation of this creates a
+virtual IP which clients can access and which is transparently proxied to the
+pods in a Service. Each node runs a kube-proxy process which programs
+`iptables` rules to trap access to service IPs and redirect them to the correct
+backends. This provides a highly-available load-balancing solution with low
+performance overhead by balancing client traffic from a node on that same node.
+
+## External to internal
+
+So far the discussion has been about how to access a pod or service from within
+the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
+want to offer highly-available, high-performance load balancing to target
+Kubernetes Services. Most public cloud providers are simply not flexible enough
+yet.
+
+The way this is generally implemented is to set up external load balancers (e.g.
+GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
+traffic arrives at a node it is recognized as being part of a particular Service
+and routed to an appropriate backend Pod. This does mean that some traffic will
+get double-bounced on the network. Once cloud providers have better offerings
+we can take advantage of those. 
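The proxying behavior described above can be caricatured with a toy sketch (illustrative Python only; the real kube-proxy programs `iptables` rules rather than proxying in application code):

```python
import itertools

# Toy model of the service-proxy idea: a virtual service IP maps to a set of
# backend pod endpoints, and each connection to the VIP is redirected to one
# of them. This only models the redirection choice, not packet handling.

class ServiceProxy:
    def __init__(self):
        self.backends = {}  # service VIP -> cycling iterator over endpoints

    def set_endpoints(self, service_ip, pod_endpoints):
        self.backends[service_ip] = itertools.cycle(pod_endpoints)

    def route(self, service_ip):
        """Pick the backend pod a connection to this VIP is redirected to."""
        return next(self.backends[service_ip])

proxy = ServiceProxy()
proxy.set_endpoints("10.0.0.10", ["10.244.1.5:8080", "10.244.2.7:8080"])
print(proxy.route("10.0.0.10"))  # 10.244.1.5:8080
print(proxy.route("10.0.0.10"))  # 10.244.2.7:8080
```

Because this choice is made on the node where the client traffic originates, no extra load-balancer hop is needed inside the cluster.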
## Challenges and future work ### Docker API -Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow. +Right now, docker inspect doesn't show the networking configuration of the +containers, since they derive it from another container. That information should +be exposed somehow. ### External IP assignment -We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across pod infra container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the pod infra container dies, all the user containers must be stopped and restarted because the netns of the pod infra container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). - -### Naming, discovery, and load balancing - -In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to setup DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically. 
- -[Service](http://docs.k8s.io/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_BAR) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service IP](http://docs.k8s.io/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service's IP in DNS, and for that to become the preferred resolution protocol. - -We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier. - -### External routability - -We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. 
We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 — not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers). - -We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP. - -So we end up with 3 cases: - -1. Container -> Container or Container <-> VM. These should use 10. addresses directly and there should be no NAT. - -2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There is actually 2 layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with IP tables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1. - -3. Internet -> Container. This also has to go through the primary host IP and also has 2 levels of NAT, ideally. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port. 
- -Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address. +We want to be able to assign IP addresses externally from Docker +[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need +to statically allocate fixed-size IP ranges to each node, so that IP addresses +can be made stable across pod infra container restarts +([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate +pod migration. Right now, if the pod infra container dies, all the user +containers must be stopped and restarted because the netns of the pod infra +container will change on restart, and any subsequent user container restart +will join that new netns, thereby not being able to see its peers. +Additionally, a change in IP address would encounter DNS caching/TTL problems. +External IP assignment would also simplify DNS support (see below). ### IPv6 -- cgit v1.2.3 From b4354021c3968c7fb46996e48e540750a246fdfb Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Tue, 7 Jul 2015 13:06:19 -0700 Subject: Move scheduler overview from docs/design/ to docs/devel/ --- scheduler.md | 50 -------------------------------------------------- 1 file changed, 50 deletions(-) delete mode 100644 scheduler.md diff --git a/scheduler.md b/scheduler.md deleted file mode 100644 index e2a9f35d..00000000 --- a/scheduler.md +++ /dev/null @@ -1,50 +0,0 @@ - -# The Kubernetes Scheduler - -The Kubernetes scheduler runs as a process alongside the other master -components such as the API server. Its interface to the API server is to watch -for Pods with an empty PodSpec.NodeName, and for each Pod, it posts a Binding -indicating where the Pod should be scheduled. - -## The scheduling process - -The scheduler tries to find a node for each Pod, one at a time, as it notices -these Pods via watch. There are three steps. 
First it applies a set of "predicates" that filter out -inappropriate nodes. For example, if the PodSpec specifies resource limits, then the scheduler -will filter out nodes that don't have at least that many resources available (computed -as the capacity of the node minus the sum of the resource limits of the containers that -are already running on the node). Second, it applies a set of "priority functions" -that rank the nodes that weren't filtered out by the predicate check. For example, -it tries to spread Pods across nodes while at the same time favoring the least-loaded -nodes (where "load" here is the sum of the resource limits of the containers running on the node, -divided by the node's capacity). -Finally, the node with the highest priority is chosen -(or, if there are multiple such nodes, then one of them is chosen at random). The code -for this main scheduling loop is in the function `Schedule()` in -[plugin/pkg/scheduler/generic_scheduler.go](../../plugin/pkg/scheduler/generic_scheduler.go) - -## Scheduler extensibility - -The scheduler is extensible: the cluster administrator can choose which of the pre-defined -scheduling policies to apply, and can add new ones. The built-in predicates and priorities are -defined in [plugin/pkg/scheduler/algorithm/predicates/predicates.go](../../plugin/pkg/scheduler/algorithm/predicates/predicates.go) and -[plugin/pkg/scheduler/algorithm/priorities/priorities.go](../../plugin/pkg/scheduler/algorithm/priorities/priorities.go), respectively. -The policies that are applied when scheduling can be chosen in one of two ways. Normally, -the policies used are selected by the functions `defaultPredicates()` and `defaultPriorities()` in -[plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](../../plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go).
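The two-phase flow just described (predicates filter nodes, priority functions rank the survivors, ties broken at random) can be sketched in illustrative Python; this is not the actual Go implementation, and the predicate and priority functions shown are simplified stand-ins:

```python
import random

def schedule(pod, nodes, predicates, priorities):
    """Pick a node for pod: filter with predicates, then rank with priorities."""
    # Phase 1: keep only nodes that pass every predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        raise RuntimeError("no feasible node for pod")
    # Phase 2: score each surviving node by summing all priority functions.
    scores = {n["name"]: sum(f(pod, n) for f in priorities) for n in feasible}
    best = max(scores.values())
    # Break ties among top-scoring nodes at random.
    return random.choice([name for name, s in scores.items() if s == best])

# Example predicate: node must have enough free CPU, computed as capacity
# minus the sum of limits of containers already running on the node.
def enough_cpu(pod, node):
    free = node["cpu_capacity"] - sum(node["cpu_limits_running"])
    return free >= pod["cpu_limit"]

# Example priority: favor the least-loaded node (load = sum of running
# containers' limits divided by capacity).
def least_loaded(pod, node):
    return 1.0 - sum(node["cpu_limits_running"]) / node["cpu_capacity"]

nodes = [
    {"name": "node-a", "cpu_capacity": 4.0, "cpu_limits_running": [3.5]},
    {"name": "node-b", "cpu_capacity": 4.0, "cpu_limits_running": [1.0]},
]
pod = {"cpu_limit": 1.0}
print(schedule(pod, nodes, [enough_cpu], [least_loaded]))  # -> node-b
```

Here node-a is filtered out by the predicate (only 0.5 CPU free), so node-b wins without the priority phase even mattering.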
-However, the choice of policies -can be overridden by passing the command-line flag `--policy-config-file` to the scheduler, pointing to a JSON -file specifying which scheduling policies to use. See -[examples/scheduler-policy-config.json](../../examples/scheduler-policy-config.json) for an example -config file. (Note that the config file format is versioned; the API is defined in -[plugin/pkg/scheduler/api/](../../plugin/pkg/scheduler/api/)). -Thus to add a new scheduling policy, you should modify predicates.go or priorities.go, -and either register the policy in `defaultPredicates()` or `defaultPriorities()`, or use a policy config file. - -## Exploring the code - -If you want to get a global picture of how the scheduler works, you can start in -[plugin/cmd/kube-scheduler/app/server.go](../../plugin/cmd/kube-scheduler/app/server.go) - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler.md?pixel)]() -- cgit v1.2.3 From 1b3281d5d27c495145931aaebdd034c79b55717f Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 2 Jul 2015 09:42:49 -0700 Subject: Make docs links be relative so we can version them --- secrets.md | 6 +++--- security.md | 20 ++++++++++---------- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/secrets.md b/secrets.md index 423ce529..d91a950a 100644 --- a/secrets.md +++ b/secrets.md @@ -71,7 +71,7 @@ service would also consume the secrets associated with the MySQL service. ### Use-Case: Secrets associated with service accounts -[Service Accounts](http://docs.k8s.io/design/service_accounts.md) are proposed as a +[Service Accounts](./service_accounts.md) are proposed as a mechanism to decouple capabilities and security contexts from individual human users. A `ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is associated with a `ServiceAccount`. 
Secrets should have a `Type` field to allow the Kubelet and @@ -241,7 +241,7 @@ memory overcommit on the node. #### Secret data on the node: isolation -Every pod will have a [security context](http://docs.k8s.io/design/security_context.md). +Every pod will have a [security context](./security_context.md). Secret data on the node should be isolated according to the security context of the container. The Kubelet volume plugin API will be changed so that a volume plugin receives the security context of a volume along with the volume spec. This will allow volume plugins to implement setting the @@ -253,7 +253,7 @@ Several proposals / upstream patches are notable as background for this proposal 1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) 2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. [Kubernetes service account proposal](http://docs.k8s.io/design/service_accounts.md) +3. [Kubernetes service account proposal](./service_accounts.md) 4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) 5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) diff --git a/security.md b/security.md index 6ea611b7..733f6818 100644 --- a/security.md +++ b/security.md @@ -63,14 +63,14 @@ Automated process users fall into the following categories: A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to is limited by that *service account*. -1. The API should authenticate and authorize user actions [authn and authz](http://docs.k8s.io/design/access.md) +1. The API should authenticate and authorize user actions [authn and authz](./access.md) 2. 
All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. 3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) -4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](http://docs.k8s.io/design/service_accounts.md) +4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](./service_accounts.md) 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption 2. If the user who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action -5. When container processes run on the cluster, they should run in a [security context](http://docs.k8s.io/design/security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. +5. When container processes run on the cluster, they should run in a [security context](./security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID 2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID 3. 
Administrators should be able to limit which developers and project administrators have access to higher privilege actions @@ -79,7 +79,7 @@ A pod runs in a *security context* under a *service account* that is defined by 6. Developers may need to ensure their images work within higher security requirements specified by administrators 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. 8. When application developers want to share filesytem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes -6. Developers should be able to define [secrets](http://docs.k8s.io/design/secrets.md) that are automatically added to the containers when pods are run +6. Developers should be able to define [secrets](./secrets.md) that are automatically added to the containers when pods are run 1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples: 1. An SSH private key for git cloning remote data 2. 
A client certificate for accessing a remote system @@ -93,12 +93,12 @@ A pod runs in a *security context* under a *service account* that is defined by ### Related design discussion -* Authorization and authentication http://docs.k8s.io/design/access.md -* Secret distribution via files https://github.com/GoogleCloudPlatform/kubernetes/pull/2030 -* Docker secrets https://github.com/docker/docker/pull/6697 -* Docker vault https://github.com/docker/docker/issues/10310 -* Service Accounts: http://docs.k8s.io/design/service_accounts.md -* Secret volumes https://github.com/GoogleCloudPlatform/kubernetes/4126 +* [Authorization and authentication](./access.md) +* [Secret distribution via files](https://github.com/GoogleCloudPlatform/kubernetes/pull/2030) +* [Docker secrets](https://github.com/docker/docker/pull/6697) +* [Docker vault](https://github.com/docker/docker/issues/10310) +* [Service Accounts:](./service_accounts.md) +* [Secret volumes](https://github.com/GoogleCloudPlatform/kubernetes/pull/4126) ## Specific Design Points -- cgit v1.2.3 From 47df5ae18f4f16f0909a1299bb1d4a599dc63879 Mon Sep 17 00:00:00 2001 From: Janet Kuo Date: Wed, 8 Jul 2015 13:19:38 -0700 Subject: Update kubectl output in doc --- persistent-storage.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/persistent-storage.md b/persistent-storage.md index 3729f30e..8e7c6765 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -112,7 +112,7 @@ spec: kubectl get pv -NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM +NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON pv0001 map[] 10737418240 RWO Pending @@ -157,7 +157,7 @@ myclaim-1 map[] pending kubectl get pv -NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM +NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e -- cgit v1.2.3 From 4cc2b50a4497051324ceacf0d1fd7acb92d274d4 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Wed, 8 Jul 2015 
16:26:20 -0400 Subject: Change remaining instances of hostDir in docs to hostPath --- security.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security.md b/security.md index 6ea611b7..1f43772e 100644 --- a/security.md +++ b/security.md @@ -54,7 +54,7 @@ Automated process users fall into the following categories: * are less focused on application security. Focused on operating system security. * protect the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers. * comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. - * decides who can use which Linux Capabilities, run privileged containers, use hostDir, etc. + * decides who can use which Linux Capabilities, run privileged containers, use hostPath, etc. * e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not. -- cgit v1.2.3 From 7c1abe54bef9502d91d4b929497cc2c6d1a85c08 Mon Sep 17 00:00:00 2001 From: jiangyaoguo Date: Wed, 8 Jul 2015 01:37:40 +0800 Subject: change get minions cmd in docs --- clustering.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/clustering.md b/clustering.md index 4cef06f8..442cb4b6 100644 --- a/clustering.md +++ b/clustering.md @@ -41,7 +41,7 @@ The building blocks of an easier solution: * **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN. * [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate. 
* **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors. -* **Scoped Kubelet Accounts** These accounts are per-minion and (optionally) give a minion permission to register itself. +* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give a node permission to register itself. * To start with, we'd have the kubelets generate a cert/account in the form of `kubelet:`. To start we would then hard code policy such that we give that particular account appropriate permissions. Over time, we can make the policy engine more generic. * [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master. -- cgit v1.2.3 From 66f367dcbb7a54e35c176b9737419e729b9eabea Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 9 Jul 2015 18:02:10 -0700 Subject: Auto-fixed docs --- access.md | 4 ++-- resources.md | 2 +- secrets.md | 6 +++--- security.md | 12 ++++++------ security_context.md | 2 +- service_accounts.md | 4 ++-- 6 files changed, 15 insertions(+), 15 deletions(-) diff --git a/access.md b/access.md index 72ca969c..85b9c8ec 100644 --- a/access.md +++ b/access.md @@ -141,7 +141,7 @@ Improvements: ###Namespaces K8s will have a have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies. -Namespaces are described in [namespace.md](namespaces.md). 
+Namespaces are described in [namespaces.md](namespaces.md). In the Enterprise Profile: - a `userAccount` may have permission to access several `namespace`s. @@ -151,7 +151,7 @@ In the Simple Profile: Namespaces versus userAccount vs Labels: - `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s. -- `labels` (see [docs/labels.md](/docs/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities. +- `labels` (see [docs/labels.md](../../docs/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities. - `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people. diff --git a/resources.md b/resources.md index 17bb5c18..8c29a1f6 100644 --- a/resources.md +++ b/resources.md @@ -149,7 +149,7 @@ The following are planned future extensions to the resource model, included here ## Usage data -Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD. +Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. 
See the Appendix for possible representations of usage data, but the representation we'll use is TBD. Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information: diff --git a/secrets.md b/secrets.md index d91a950a..979c07f0 100644 --- a/secrets.md +++ b/secrets.md @@ -71,7 +71,7 @@ service would also consume the secrets associated with the MySQL service. ### Use-Case: Secrets associated with service accounts -[Service Accounts](./service_accounts.md) are proposed as a +[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple capabilities and security contexts from individual human users. A `ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and @@ -241,7 +241,7 @@ memory overcommit on the node. #### Secret data on the node: isolation -Every pod will have a [security context](./security_context.md). +Every pod will have a [security context](security_context.md). Secret data on the node should be isolated according to the security context of the container. The Kubelet volume plugin API will be changed so that a volume plugin receives the security context of a volume along with the volume spec. This will allow volume plugins to implement setting the @@ -253,7 +253,7 @@ Several proposals / upstream patches are notable as background for this proposal 1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) 2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. [Kubernetes service account proposal](./service_accounts.md) +3. [Kubernetes service account proposal](service_accounts.md) 4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) 5. 
[Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) diff --git a/security.md b/security.md index c8f9bec7..4ea7d755 100644 --- a/security.md +++ b/security.md @@ -63,14 +63,14 @@ Automated process users fall into the following categories: A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to is limited by that *service account*. -1. The API should authenticate and authorize user actions [authn and authz](./access.md) +1. The API should authenticate and authorize user actions [authn and authz](access.md) 2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. 3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) -4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](./service_accounts.md) +4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](service_accounts.md) 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption 2. If the user who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action -5. 
When container processes run on the cluster, they should run in a [security context](./security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. +5. When container processes run on the cluster, they should run in a [security context](security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID 2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID 3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions @@ -79,7 +79,7 @@ A pod runs in a *security context* under a *service account* that is defined by 6. Developers may need to ensure their images work within higher security requirements specified by administrators 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. 8. When application developers want to share filesytem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes -6. Developers should be able to define [secrets](./secrets.md) that are automatically added to the containers when pods are run +6. Developers should be able to define [secrets](secrets.md) that are automatically added to the containers when pods are run 1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples: 1. An SSH private key for git cloning remote data 2. 
A client certificate for accessing a remote system @@ -93,11 +93,11 @@ A pod runs in a *security context* under a *service account* that is defined by ### Related design discussion -* [Authorization and authentication](./access.md) +* [Authorization and authentication](access.md) * [Secret distribution via files](https://github.com/GoogleCloudPlatform/kubernetes/pull/2030) * [Docker secrets](https://github.com/docker/docker/pull/6697) * [Docker vault](https://github.com/docker/docker/issues/10310) -* [Service Accounts:](./service_accounts.md) +* [Service Accounts:](service_accounts.md) * [Secret volumes](https://github.com/GoogleCloudPlatform/kubernetes/pull/4126) ## Specific Design Points diff --git a/security_context.md b/security_context.md index fdacb173..61641297 100644 --- a/security_context.md +++ b/security_context.md @@ -32,7 +32,7 @@ Processes in pods will need to have consistent UID/GID/SELinux category labels i * The concept of a security context should not be tied to a particular security mechanism or platform (ie. SELinux, AppArmor) * Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for - [service accounts](./service_accounts.md). + [service accounts](service_accounts.md). 
## Use Cases diff --git a/service_accounts.md b/service_accounts.md index 63c12a30..bd10336f 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -21,9 +21,9 @@ They also may interact with services other than the Kubernetes API, such as: A service account binds together several things: - a *name*, understood by users, and perhaps by peripheral systems, for an identity - a *principal* that can be authenticated and [authorized](../authorization.md) - - a [security context](./security_context.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other + - a [security context](security_context.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other capabilities and controls on interaction with the file system and OS. - - a set of [secrets](./secrets.md), which a container may use to + - a set of [secrets](secrets.md), which a container may use to access various networked resources. ## Design Discussion -- cgit v1.2.3 From c36dd173e4ae2fee9d20fa198d118244f681f6b3 Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 9 Jul 2015 18:31:29 -0700 Subject: manual fixes --- resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resources.md b/resources.md index 8c29a1f6..bb3c05e9 100644 --- a/resources.md +++ b/resources.md @@ -1,5 +1,5 @@ **Note: this is a design doc, which describes features that have not been completely implemented. -User documentation of the current state is [here](../resources.md). The tracking issue for +User documentation of the current state is [here](../compute_resources.md). The tracking issue for implementation of this model is [#168](https://github.com/GoogleCloudPlatform/kubernetes/issues/168). Currently, only memory and cpu limits on containers (not pods) are supported. 
"memory" is in bytes and "cpu" is in -- cgit v1.2.3 From ee97e734b55c5f605944bcc4e324bdc09bd4c476 Mon Sep 17 00:00:00 2001 From: Akshay Aurora Date: Sun, 12 Jul 2015 03:49:01 +0530 Subject: Fix formatting in networking.md --- networking.md | 1 + 1 file changed, 1 insertion(+) diff --git a/networking.md b/networking.md index 8bf03437..af64ed8d 100644 --- a/networking.md +++ b/networking.md @@ -1,6 +1,7 @@ # Networking There are 4 distinct networking problems to solve: + 1. Highly-coupled container-to-container communications 2. Pod-to-Pod communications 3. Pod-to-Service communications -- cgit v1.2.3 From b4deb49a719e9d5c7ece5c930dec4ff225409466 Mon Sep 17 00:00:00 2001 From: Ed Costello Date: Sun, 12 Jul 2015 22:03:06 -0400 Subject: Copy edits for typos --- networking.md | 2 +- security.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/networking.md b/networking.md index af64ed8d..210d10e5 100644 --- a/networking.md +++ b/networking.md @@ -128,7 +128,7 @@ to serve the purpose outside of GCE. The [service](../services.md) abstraction provides a way to group pods under a common access policy (e.g. load-balanced). The implementation of this creates a -virtual IP which clients can access and which is transparantly proxied to the +virtual IP which clients can access and which is transparently proxied to the pods in a Service. Each node runs a kube-proxy process which programs `iptables` rules to trap access to service IPs and redirect them to the correct backends. This provides a highly-available load-balancing solution with low diff --git a/security.md b/security.md index 4ea7d755..c2fd092e 100644 --- a/security.md +++ b/security.md @@ -78,7 +78,7 @@ A pod runs in a *security context* under a *service account* that is defined by 5. Developers should be able to run their own images or images from the community and expect those images to run correctly 6. 
Developers may need to ensure their images work within higher security requirements specified by administrators 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. - 8. When application developers want to share filesytem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes + 8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes 6. Developers should be able to define [secrets](secrets.md) that are automatically added to the containers when pods are run 1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples: 1. An SSH private key for git cloning remote data -- cgit v1.2.3 From a43748bb04d34c4916fe138a2e8a72e9d2f5914f Mon Sep 17 00:00:00 2001 From: Marek Biskup Date: Mon, 13 Jul 2015 15:09:26 +0200 Subject: kubectl-rolling-update-doc-fix --- simple-rolling-update.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 0208b609..fb21c096 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -6,7 +6,7 @@ Complete execution flow can be found [here](#execution-details). ### Lightweight rollout Assume that we have a current replication controller named ```foo``` and it is running image ```image:v1``` -```kubectl rolling-update rc foo [foo-v2] --image=myimage:v2``` +```kubectl rolling-update foo [foo-v2] --image=myimage:v2``` If the user doesn't specify a name for the 'next' replication controller, then the 'next' replication controller is renamed to the name of the original replication controller. 
@@ -27,7 +27,7 @@ To facilitate recovery in the case of a crash of the updating process itself, we Recovery is achieved by issuing the same command again: ``` -kubectl rolling-update rc foo [foo-v2] --image=myimage:v2 +kubectl rolling-update foo [foo-v2] --image=myimage:v2 ``` Whenever the rolling update command executes, the kubectl client looks for replication controllers called ```foo``` and ```foo-next```; if they exist, an attempt is @@ -38,11 +38,11 @@ it is assumed that the rollout is nearly completed, and ```foo-next``` is rename ### Aborting a rollout Abort is assumed to want to reverse a rollout in progress. -```kubectl rolling-update rc foo [foo-v2] --rollback``` +```kubectl rolling-update foo [foo-v2] --rollback``` This is really just semantic sugar for: -```kubectl rolling-update rc foo-v2 foo``` +```kubectl rolling-update foo-v2 foo``` With the added detail that it moves the ```desired-replicas``` annotation from ```foo-v2``` to ```foo``` -- cgit v1.2.3 From 0fc797704e628d863d0154a599e191dadfb3ce67 Mon Sep 17 00:00:00 2001 From: Ed Costello Date: Mon, 13 Jul 2015 10:11:07 -0400 Subject: Copy edits to remove doubled words --- service_accounts.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/service_accounts.md b/service_accounts.md index bd10336f..c6ceb6b2 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -90,7 +90,7 @@ The distinction is useful for a number of reasons: Pod Object. The `secrets` field is a list of references to /secret objects that a process started as that service account should -have access to to be able to assert that role. +have access to be able to assert that role. The secrets are not inline with the serviceAccount object. This way, most or all users can have permission to `GET /serviceAccounts` so they can remind themselves what serviceAccounts are available for use.
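As a rough sketch of the shape this proposal implies (field and object names here are assumptions for illustration, not a finalized API), a serviceAccount carrying secret references might look like:

```yaml
# Hypothetical sketch based on the proposal text; secret data is NOT inline,
# only references to /secret objects the account may assert.
kind: ServiceAccount
apiVersion: v1
metadata:
  name: build-robot
  namespace: default
secrets:
  - name: build-robot-token    # reference to a /secret object
  - name: registry-pull-key    # reference only; data lives in the Secret
```

Because only references appear here, `GET /serviceAccounts` can be broadly readable without exposing secret values.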
@@ -150,7 +150,7 @@ then it copies in the referenced securityContext and secrets references for the Second, if ServiceAccount definitions change, it may take some actions. **TODO**: decide what actions it takes when a serviceAccount definition changes. Does it stop pods, or just -allow someone to list ones that out out of spec? In general, people may want to customize this? +allow someone to list ones that are out of spec? In general, people may want to customize this? Third, if a new namespace is created, it may create a new serviceAccount for that namespace. This may include a new username (e.g. `NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), a new -- cgit v1.2.3 From d4ee5006858aec1fa1fecff18bfda3dfeeb162ff Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Sat, 11 Jul 2015 21:04:52 -0700 Subject: Run gendocs and munges --- README.md | 14 ++++++++++++++ access.md | 14 ++++++++++++++ admission_control.md | 14 ++++++++++++++ admission_control_limit_range.md | 14 ++++++++++++++ admission_control_resource_quota.md | 14 ++++++++++++++ architecture.md | 14 ++++++++++++++ clustering.md | 14 ++++++++++++++ clustering/README.md | 14 ++++++++++++++ command_execution_port_forwarding.md | 14 ++++++++++++++ event_compression.md | 14 ++++++++++++++ expansion.md | 14 ++++++++++++++ identifiers.md | 14 ++++++++++++++ namespaces.md | 14 ++++++++++++++ networking.md | 14 ++++++++++++++ persistent-storage.md | 14 ++++++++++++++ principles.md | 14 ++++++++++++++ resources.md | 14 ++++++++++++++ secrets.md | 14 ++++++++++++++ security.md | 14 ++++++++++++++ security_context.md | 14 ++++++++++++++ service_accounts.md | 14 ++++++++++++++ simple-rolling-update.md | 14 ++++++++++++++ 22 files changed, 308 insertions(+) diff --git a/README.md b/README.md index b70c5615..66265b99 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Kubernetes Design Overview Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. diff --git a/access.md b/access.md index 85b9c8ec..98bf2bdf 100644 --- a/access.md +++ b/access.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # K8s Identity and Access Management Sketch This document suggests a direction for identity and access management in the Kubernetes system. diff --git a/admission_control.md b/admission_control.md index 749e949e..4094156b 100644 --- a/admission_control.md +++ b/admission_control.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Kubernetes Proposal - Admission Control **Related PR:** diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index daddb425..c1914478 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Admission control plugin: LimitRanger ## Background diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index b2dfbe85..cd9282df 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Admission control plugin: ResourceQuota ## Background diff --git a/architecture.md b/architecture.md index ebfb4964..6c82896e 100644 --- a/architecture.md +++ b/architecture.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Kubernetes architecture A running Kubernetes cluster contains node agents (kubelet) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making kubelet itself (all our components, really) run within containers, and making the scheduler 100% pluggable. diff --git a/clustering.md b/clustering.md index 4cef06f8..f88157aa 100644 --- a/clustering.md +++ b/clustering.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Clustering in Kubernetes diff --git a/clustering/README.md b/clustering/README.md index 09d2c4e1..dfd55e96 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + This directory contains diagrams for the clustering design doc. This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 3e548d40..056814e7 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Container Command Execution & Port Forwarding in Kubernetes ## Abstract diff --git a/event_compression.md b/event_compression.md index 74aba66f..4178393c 100644 --- a/event_compression.md +++ b/event_compression.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Kubernetes Event Compression This document captures the design of event compression. diff --git a/expansion.md b/expansion.md index 8b31526a..01a774cb 100644 --- a/expansion.md +++ b/expansion.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Variable expansion in pod command, args, and env ## Abstract diff --git a/identifiers.md b/identifiers.md index 23b976d3..e192b1ed 100644 --- a/identifiers.md +++ b/identifiers.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Identifiers and Names in Kubernetes A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199). diff --git a/namespaces.md b/namespaces.md index 0fef2bed..547d040b 100644 --- a/namespaces.md +++ b/namespaces.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Namespaces ## Abstract diff --git a/networking.md b/networking.md index 210d10e5..5a4a5835 100644 --- a/networking.md +++ b/networking.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Networking There are 4 distinct networking problems to solve: diff --git a/persistent-storage.md b/persistent-storage.md index 8e7c6765..9cc92b42 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Persistent Storage This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. diff --git a/principles.md b/principles.md index cf8833a4..e1bd97da 100644 --- a/principles.md +++ b/principles.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Design Principles Principles to follow when extending Kubernetes. diff --git a/resources.md b/resources.md index bb3c05e9..9539bed2 100644 --- a/resources.md +++ b/resources.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + **Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is [here](../compute_resources.md). The tracking issue for implementation of this model is diff --git a/secrets.md b/secrets.md index 979c07f0..a6d2591f 100644 --- a/secrets.md +++ b/secrets.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + ## Abstract diff --git a/security.md b/security.md index c2fd092e..1d1373d2 100644 --- a/security.md +++ b/security.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Security in Kubernetes Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other, from the cluster infrastructure, and which preserves important boundaries between those who manage the cluster, and those who use the cluster. diff --git a/security_context.md b/security_context.md index 61641297..cbf525a8 100644 --- a/security_context.md +++ b/security_context.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + # Security Contexts ## Abstract A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)): diff --git a/service_accounts.md b/service_accounts.md index c6ceb6b2..896bd68e 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + #Service Accounts ## Motivation diff --git a/simple-rolling-update.md b/simple-rolling-update.md index fb21c096..45005353 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -1,3 +1,17 @@ + + + + +

*** PLEASE NOTE: This document applies to the HEAD of the source +tree only. If you are using a released version of Kubernetes, you almost +certainly want the docs that go with that version.

+ +Documentation for specific releases can be found at +[releases.k8s.io](http://releases.k8s.io). + + + + ## Simple rolling update This is a lightweight design document for simple rolling update in ```kubectl``` -- cgit v1.2.3 From 8601b6ff40148c7be7a02a4a70ccfd1d9e231c33 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Mon, 13 Jul 2015 11:11:34 -0700 Subject: Remove colon from end of doc heading. --- secrets.md | 2 +- security.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/secrets.md b/secrets.md index a6d2591f..c1643a9d 100644 --- a/secrets.md +++ b/secrets.md @@ -261,7 +261,7 @@ Kubelet volume plugin API will be changed so that a volume plugin receives the s a volume along with the volume spec. This will allow volume plugins to implement setting the security context of volumes they manage. -## Community work: +## Community work Several proposals / upstream patches are notable as background for this proposal: diff --git a/security.md b/security.md index 1d1373d2..90dc3237 100644 --- a/security.md +++ b/security.md @@ -32,7 +32,7 @@ While Kubernetes today is not primarily a multi-tenant system, the long term evo ## Use cases -### Roles: +### Roles We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories: @@ -46,7 +46,7 @@ Automated process users fall into the following categories: 2. k8s infrastructure user - the user that kubernetes infrastructure components use to perform cluster functions with clearly defined roles -### Description of roles: +### Description of roles * Developers: * write pod specs. 
-- cgit v1.2.3 From d3293eb75835fbdb3f50dde82e513eff752ca82d Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Mon, 13 Jul 2015 17:13:09 -0700 Subject: Apply mungedocs changes --- README.md | 2 ++ access.md | 2 ++ admission_control.md | 2 ++ admission_control_limit_range.md | 2 ++ admission_control_resource_quota.md | 2 ++ architecture.md | 2 ++ clustering.md | 2 ++ clustering/README.md | 3 +++ command_execution_port_forwarding.md | 3 +++ event_compression.md | 2 ++ expansion.md | 3 +++ identifiers.md | 2 ++ namespaces.md | 3 +++ networking.md | 2 ++ persistent-storage.md | 2 ++ principles.md | 2 ++ resources.md | 2 ++ secrets.md | 2 ++ security.md | 2 ++ security_context.md | 3 ++- service_accounts.md | 3 ++- simple-rolling-update.md | 2 ++ 22 files changed, 48 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 66265b99..5a5b0497 100644 --- a/README.md +++ b/README.md @@ -31,4 +31,6 @@ Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS p For more about the Kubernetes architecture, see [architecture](architecture.md). + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]() + diff --git a/access.md b/access.md index 98bf2bdf..912f93aa 100644 --- a/access.md +++ b/access.md @@ -262,4 +262,6 @@ Improvements: - Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]() + diff --git a/admission_control.md b/admission_control.md index 4094156b..5870a601 100644 --- a/admission_control.md +++ b/admission_control.md @@ -93,4 +93,6 @@ will ensure the following: If at any step, there is an error, the request is canceled. 
+ [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]() + diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index c1914478..e5363cea 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -146,4 +146,6 @@ It is expected we will want to define limits for particular pods or containers b To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() + diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index cd9282df..754e5a00 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -167,4 +167,6 @@ services 3 5 ``` + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() + diff --git a/architecture.md b/architecture.md index 6c82896e..71d606a1 100644 --- a/architecture.md +++ b/architecture.md @@ -58,4 +58,6 @@ All other cluster-level functions are currently performed by the Controller Mana The [`replicationcontroller`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]() + diff --git a/clustering.md b/clustering.md index 95ff3ccc..3e9972ce 100644 --- a/clustering.md +++ b/clustering.md @@ -74,4 +74,6 @@ This flow has the admin manually approving the kubelet signing requests. 
This i ![Dynamic Sequence Diagram](clustering/dynamic.png) + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]() + diff --git a/clustering/README.md b/clustering/README.md index dfd55e96..07dcc7b3 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -39,4 +39,7 @@ If you are using boot2docker and get warnings about clock skew (or if things are If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`. + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]() + diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 056814e7..7d110c3f 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -157,4 +157,7 @@ access. Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts. 
+ + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]() + diff --git a/event_compression.md b/event_compression.md index 4178393c..40dc9e52 100644 --- a/event_compression.md +++ b/event_compression.md @@ -92,4 +92,6 @@ This demonstrates what would have been 20 separate entries (indicating schedulin * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]() + diff --git a/expansion.md b/expansion.md index 01a774cb..4f4511ce 100644 --- a/expansion.md +++ b/expansion.md @@ -399,4 +399,7 @@ spec: restartPolicy: Never ``` + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]() + diff --git a/identifiers.md b/identifiers.md index e192b1ed..49068cc8 100644 --- a/identifiers.md +++ b/identifiers.md @@ -106,4 +106,6 @@ objectives. 1. This may correspond to Docker's container ID. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]() + diff --git a/namespaces.md b/namespaces.md index 547d040b..cd8b5280 100644 --- a/namespaces.md +++ b/namespaces.md @@ -348,4 +348,7 @@ to remove that Namespace from the storage. At this point, all content associated with that Namespace, and the Namespace itself are gone. + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]() + diff --git a/networking.md b/networking.md index 5a4a5835..35248a71 100644 --- a/networking.md +++ b/networking.md @@ -190,4 +190,6 @@ External IP assignment would also simplify DNS support (see below). IPv6 would be a nice option, also, but we can't depend on it yet. 
Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-) + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]() + diff --git a/persistent-storage.md b/persistent-storage.md index 9cc92b42..585cd281 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -228,4 +228,6 @@ The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]() + diff --git a/principles.md b/principles.md index e1bd97da..5071e89d 100644 --- a/principles.md +++ b/principles.md @@ -69,4 +69,6 @@ TODO * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules) + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]() + diff --git a/resources.md b/resources.md index 9539bed2..229e9b76 100644 --- a/resources.md +++ b/resources.md @@ -227,4 +227,6 @@ This is the amount of time a container spends accessing disk, including actuator * Compressible? yes + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]() + diff --git a/secrets.md b/secrets.md index c1643a9d..2fdee537 100644 --- a/secrets.md +++ b/secrets.md @@ -590,4 +590,6 @@ source. 
Both containers will have the following files present on their filesystems: /etc/secret-volume/password + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]() + diff --git a/security.md b/security.md index 90dc3237..bbb735eb 100644 --- a/security.md +++ b/security.md @@ -131,4 +131,6 @@ The controller manager for Replication Controllers and other future controllers The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a node in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() + diff --git a/security_context.md b/security_context.md index 61641297..cbf525a8 100644 --- a/security_context.md +++ b/security_context.md @@ -170,5 +170,6 @@ will be denied by default. In the future the admission plugin will base this de configurable policies that reside within the [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). 
- + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]() + diff --git a/service_accounts.md b/service_accounts.md index 896bd68e..61237853 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -177,5 +177,6 @@ Finally, it may provide an interface to automate creation of new serviceAccounts to GET serviceAccounts to see what has been created. - + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]() + diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 45005353..0f2fe9e6 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -105,4 +105,6 @@ then ```foo-next``` is synthesized using the pattern ```- [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]() + -- cgit v1.2.3 From 9b2fc6d4e38015acae5713a75e7e9e0ea07bb549 Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 9 Jul 2015 13:33:48 -0700 Subject: move admin related docs into docs/admin --- README.md | 2 +- architecture.md | 2 +- namespaces.md | 2 +- networking.md | 2 +- service_accounts.md | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 5a5b0497..2a7c153c 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ Kubernetes enables users to ask a cluster to run a set of containers. The system Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts. -A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the availability doc](../availability.md) and [cluster federation proposal](../proposals/federation.md) for more details). +A single Kubernetes cluster is not intended to span multiple availability zones. 
Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the availability doc](../admin/availability.md) and [cluster federation proposal](../proposals/federation.md) for more details). Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner. diff --git a/architecture.md b/architecture.md index 71d606a1..22d61b27 100644 --- a/architecture.md +++ b/architecture.md @@ -33,7 +33,7 @@ The **Kubelet** manages [pods](../pods.md) and their containers, their images, t Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). 
This reflects `services` (see [the services doc](../services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. -Service endpoints are currently found via [DNS](../dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy. +Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy. ## The Kubernetes Control Plane diff --git a/namespaces.md b/namespaces.md index cd8b5280..b33b8c4a 100644 --- a/namespaces.md +++ b/namespaces.md @@ -86,7 +86,7 @@ distinguish distinct entities, and reference particular entities across operatio A *Namespace* provides an authorization scope for accessing content associated with the *Namespace*. -See [Authorization plugins](../authorization.md) +See [Authorization plugins](../admin/authorization.md) ### Limit Resource Consumption diff --git a/networking.md b/networking.md index 35248a71..1ebc3d47 100644 --- a/networking.md +++ b/networking.md @@ -129,7 +129,7 @@ a pod tries to egress beyond GCE's project the packets must be SNAT'ed With the primary aim of providing IP-per-pod-model, other implementations exist to serve the purpose outside of GCE. 
-  - [OpenVSwitch with GRE/VxLAN](../ovs-networking.md)
+  - [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md)
   - [Flannel](https://github.com/coreos/flannel#flannel)
   - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) ("With Linux Bridge devices" section)
diff --git a/service_accounts.md b/service_accounts.md
index 61237853..3b9e6ed9 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -34,7 +34,7 @@ They also may interact with services other than the Kubernetes API, such as:
 ## Design Overview
 
 A service account binds together several things:
  - a *name*, understood by users, and perhaps by peripheral systems, for an identity
- - a *principal* that can be authenticated and [authorized](../authorization.md)
+ - a *principal* that can be authenticated and [authorized](../admin/authorization.md)
  - a [security context](security_context.md), which defines the Linux Capabilities, User IDs, Group IDs, and other capabilities and controls on interaction with the file system and OS.
  - a set of [secrets](secrets.md), which a container may use to
--
cgit v1.2.3

From 8e5a970d432f598962d97ecd9d6ce4b07d8f79bc Mon Sep 17 00:00:00 2001
From: Daniel Smith
Date: Fri, 10 Jul 2015 12:39:25 -0700
Subject: standardize on - instead of _ in file names

---
 resources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resources.md b/resources.md
index 229e9b76..437aac09 100644
--- a/resources.md
+++ b/resources.md
@@ -13,7 +13,7 @@ certainly want the docs that go with that version.
 
 **Note: this is a design doc, which describes features that have not been completely implemented.
-User documentation of the current state is [here](../compute_resources.md). The tracking issue for
+User documentation of the current state is [here](../compute-resources.md). The tracking issue for
 implementation of this model is [#168](https://github.com/GoogleCloudPlatform/kubernetes/issues/168).
 
 Currently, only memory and cpu limits on containers (not pods) are supported. "memory" is in bytes and "cpu" is in
--
cgit v1.2.3

From 6cf8654ed142562a5b7f0c7a947fd06925439046 Mon Sep 17 00:00:00 2001
From: Eric Tune
Date: Mon, 13 Jul 2015 16:25:16 -0700
Subject: Move versioning.md to design/ -- not user-focused.

---
 versioning.md | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 versioning.md

diff --git a/versioning.md b/versioning.md
new file mode 100644
index 00000000..4d17a939
--- /dev/null
+++ b/versioning.md
@@ -0,0 +1,64 @@
+
+
+
+
+*** PLEASE NOTE: This document applies to the HEAD of the source
+tree only. If you are using a released version of Kubernetes, you almost
+certainly want the docs that go with that version.
+
+Documentation for specific releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+
+
+
+# Kubernetes API and Release Versioning
+
+Legend:
+
+* **Kube <major>.<minor>.<patch>** refers to the version of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc.
+* **API vX[betaY]** refers to the version of the HTTP API.
+
+## Release Timeline
+
+### Minor version timeline
+
+* Kube 1.0.0
+* Kube 1.0.x: We create a 1.0-patch branch and backport critical bugs and security issues to it. Patch releases occur as needed.
+* Kube 1.1-alpha1: Cut from HEAD, smoke tested and released two weeks after Kube 1.0's release. Roughly every two weeks a new alpha is released from HEAD. The timeline is flexible; for example, if there is a critical bugfix, a new alpha can be released ahead of schedule. (This applies to the beta and rc releases as well.)
+* Kube 1.1-beta1: When HEAD is feature complete, we create a 1.1-snapshot branch and release it as a beta. (The 1.1-snapshot branch may be created earlier if something that definitely won't be in 1.1 needs to be merged to HEAD.) This should occur 6-8 weeks after Kube 1.0. Development continues at HEAD and only fixes are backported to 1.1-snapshot.
+* Kube 1.1-rc1: Released from 1.1-snapshot when it is considered stable and ready for testing. Most users should be able to upgrade to this version in production.
+* Kube 1.1: Final release. Should occur between 3 and 4 months after 1.0.
+
+### Major version timeline
+
+There is no mandated timeline for major versions. They only occur when we need to start the clock on deprecating features. A given major version should be the latest major version for at least one year from its original release date.
+
+## Release versions as related to API versions
+
+Here is an example major release cycle:
+
+* **Kube 1.0 should have API v1 without v1beta\* API versions**
+  * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have the stable v1 API. This enables you to migrate all your objects off of the beta API versions and allows us to remove those beta API versions in Kube 1.0 with no effect. There will be tooling to help you detect and migrate any v1beta\* data versions or calls to v1 before you do the upgrade.
+* **Kube 1.x may have API v2beta***
+  * The first incarnation of a new (backwards-incompatible) API in HEAD is v2beta1. By default this will be unregistered in apiserver, so it can change freely. Once it is available by default in apiserver (which may not happen for several minor releases), it cannot change ever again because we serialize objects in versioned form, and we always need to be able to deserialize any objects that are saved in etcd, even between alpha versions. If further changes to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x versions.
+* **Kube 1.y (where y is the last version of the 1.x series) must have final API v2**
+  * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two things: (1) users can upgrade to API v2 when running Kube 1.x and then switch over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can clean up and remove all API v2beta\* versions because no one should have v2beta\* objects left in their database. As mentioned above, tooling will exist to make sure there are no calls or references to a given API version anywhere inside someone's kube installation before someone upgrades.
+  * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. It *may* include the v1 API as well if the burden is not high - this will be determined on a per-major-version basis.
+
+## Rationale for API v2 being complete before v2.0's release
+
+It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options.
+
+# Upgrades
+
+* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. (Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.)
+* No hard breaking changes over version boundaries.
+  * For example, if a user is at Kube 1.x, we may require them to upgrade to Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone to go from 1.x to 1.x+y before they go to 2.x.
+
+There is a separate question of how to track the capabilities of a kubelet to facilitate rolling upgrades. That is not addressed here.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
+
--
cgit v1.2.3

From 53eee3533f5d877d5dbbca69f7171784dacc7fbb Mon Sep 17 00:00:00 2001
From: Mike Danese
Date: Tue, 14 Jul 2015 09:37:37 -0700
Subject: automated link fixes

---
 access.md       | 2 +-
 architecture.md | 6 +++---
 networking.md   | 2 +-
 resources.md    | 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/access.md b/access.md
index 912f93aa..6192792d 100644
--- a/access.md
+++ b/access.md
@@ -165,7 +165,7 @@ In the Simple Profile:
 Namespaces versus userAccount vs Labels:
 - `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
-- `labels` (see [docs/labels.md](../../docs/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
+- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
 - `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.
diff --git a/architecture.md b/architecture.md
index 22d61b27..d2f9d942 100644
--- a/architecture.md
+++ b/architecture.md
@@ -27,11 +27,11 @@ The Kubernetes node has the services necessary to run application containers and
 Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers.
 
 ### Kubelet
-The **Kubelet** manages [pods](../pods.md) and their containers, their images, their volumes, etc.
+The **Kubelet** manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc.
 
 ### Kube-Proxy
-Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.
+Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.
 Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy.
@@ -55,7 +55,7 @@ The scheduler binds unscheduled pods to nodes via the `/binding` API. The schedu
 All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable.
 
-The [`replicationcontroller`](../replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented.
+The [`replicationcontroller`](../user-guide/replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented.
diff --git a/networking.md b/networking.md
index 1ebc3d47..c13daa1b 100644
--- a/networking.md
+++ b/networking.md
@@ -140,7 +140,7 @@ to serve the purpose outside of GCE.
 
 ## Pod to service
 
-The [service](../services.md) abstraction provides a way to group pods under a
+The [service](../user-guide/services.md) abstraction provides a way to group pods under a
 common access policy (e.g. load-balanced). The implementation of this creates a
 virtual IP which clients can access and which is transparently proxied to the
 pods in a Service. Each node runs a kube-proxy process which programs
diff --git a/resources.md b/resources.md
index 437aac09..fb147fa5 100644
--- a/resources.md
+++ b/resources.md
@@ -13,7 +13,7 @@ certainly want the docs that go with that version.
 
 **Note: this is a design doc, which describes features that have not been completely implemented.
-User documentation of the current state is [here](../compute-resources.md). The tracking issue for
+User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for
 implementation of this model is [#168](https://github.com/GoogleCloudPlatform/kubernetes/issues/168).
 
 Currently, only memory and cpu limits on containers (not pods) are supported. "memory" is in bytes and "cpu" is in
@@ -163,7 +163,7 @@ The following are planned future extensions to the resource model, included here
 
 ## Usage data
 
-Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD.
+Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD.
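As a rough, purely hypothetical illustration of the point in the paragraph above — a single instantaneous usage number is not enough, so usage wants a time-windowed, aggregate-friendly representation outside the core objects — a bounded sample window could be sketched as follows. All names and values here are illustrative; this is not the representation the document proposes (which it explicitly leaves TBD):

```python
from collections import deque
from statistics import mean

class UsageWindow:
    """Hypothetical sketch: keep a bounded window of (timestamp, value)
    usage samples so callers can ask for aggregates over time instead of
    reading one instantaneous value off a core API object."""

    def __init__(self, max_samples=60):
        # oldest samples are evicted automatically once the window is full
        self.samples = deque(maxlen=max_samples)

    def observe(self, timestamp, value):
        self.samples.append((timestamp, value))

    def average(self):
        return mean(v for _, v in self.samples) if self.samples else 0.0

    def peak(self):
        return max((v for _, v in self.samples), default=0.0)

# e.g. CPU usage in milli-cores, sampled once per second (made-up values):
cpu = UsageWindow(max_samples=3)
for ts, millicores in [(1, 250.0), (2, 400.0), (3, 100.0)]:
    cpu.observe(ts, millicores)

print(cpu.average())  # 250.0
print(cpu.peak())     # 400.0
```

A consumer of such an API can then ask for the mean or peak over a window, which a singleton "current usage" field on a Pod or Node could never answer.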
 
 Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:
--
cgit v1.2.3

From be001476aac7c7b13cf50083bfc46a27d9d8f08c Mon Sep 17 00:00:00 2001
From: Tim Hockin
Date: Mon, 13 Jul 2015 15:15:35 -0700
Subject: Run gendocs

---
 README.md                            | 10 +++++++++-
 access.md                            | 10 +++++++++-
 admission_control.md                 | 10 +++++++++-
 admission_control_limit_range.md     | 10 +++++++++-
 admission_control_resource_quota.md  | 10 +++++++++-
 architecture.md                      | 10 +++++++++-
 clustering.md                        | 10 +++++++++-
 clustering/README.md                 | 10 +++++++++-
 command_execution_port_forwarding.md | 10 +++++++++-
 event_compression.md                 | 10 +++++++++-
 expansion.md                         | 10 +++++++++-
 identifiers.md                       | 10 +++++++++-
 namespaces.md                        | 10 +++++++++-
 networking.md                        | 10 +++++++++-
 persistent-storage.md                | 10 +++++++++-
 principles.md                        | 10 +++++++++-
 resources.md                         | 10 +++++++++-
 secrets.md                           | 10 +++++++++-
 security.md                          | 10 +++++++++-
 security_context.md                  | 10 +++++++++-
 service_accounts.md                  | 10 +++++++++-
 simple-rolling-update.md             | 10 +++++++++-
 versioning.md                        | 10 +++++++++-
 23 files changed, 207 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 2a7c153c..8d98c34a 100644
--- a/README.md
+++ b/README.md
@@ -2,13 +2,21 @@
-

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/access.md b/access.md index 6192792d..d8060025 100644 --- a/access.md +++ b/access.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/admission_control.md b/admission_control.md index 5870a601..ac488e6f 100644 --- a/admission_control.md +++ b/admission_control.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index e5363cea..b95f87a5 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 754e5a00..22825e8d 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/architecture.md b/architecture.md index d2f9d942..5202147f 100644 --- a/architecture.md +++ b/architecture.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/clustering.md b/clustering.md index 3e9972ce..a2ea1139 100644 --- a/clustering.md +++ b/clustering.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/clustering/README.md b/clustering/README.md index 07dcc7b3..3e390f37 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 7d110c3f..fc06c5d3 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/event_compression.md b/event_compression.md index 40dc9e52..d8984e13 100644 --- a/event_compression.md +++ b/event_compression.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/expansion.md b/expansion.md index 4f4511ce..eb2a78b5 100644 --- a/expansion.md +++ b/expansion.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/identifiers.md b/identifiers.md index 49068cc8..daadc90f 100644 --- a/identifiers.md +++ b/identifiers.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/namespaces.md b/namespaces.md index b33b8c4a..ff2ceb91 100644 --- a/namespaces.md +++ b/namespaces.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/networking.md b/networking.md index c13daa1b..235f8f19 100644 --- a/networking.md +++ b/networking.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/persistent-storage.md b/persistent-storage.md index 585cd281..5a18fa55 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/principles.md b/principles.md index 5071e89d..af831a07 100644 --- a/principles.md +++ b/principles.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/resources.md b/resources.md index fb147fa5..70420ec2 100644 --- a/resources.md +++ b/resources.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/secrets.md b/secrets.md index 2fdee537..d0728044 100644 --- a/secrets.md +++ b/secrets.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/security.md b/security.md index bbb735eb..4f9ed395 100644 --- a/security.md +++ b/security.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/security_context.md b/security_context.md index ad83a6bd..ca12db00 100644 --- a/security_context.md +++ b/security_context.md @@ -2,13 +2,21 @@ -

*** PLEASE NOTE: This document applies to the HEAD of the source +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + +

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

Documentation for specific releases can be found at [releases.k8s.io](http://releases.k8s.io). +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) +![WARNING](http://releases.k8s.io/HEAD/docs/warning.png) + diff --git a/service_accounts.md b/service_accounts.md index 3b9e6ed9..e877b880 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -2,13 +2,21 @@ -

-*** PLEASE NOTE: This document applies to the HEAD of the source
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+
+
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
 certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+

diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index 0f2fe9e6..b8473682 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -2,13 +2,21 @@
 
-*** PLEASE NOTE: This document applies to the HEAD of the source
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+
+
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
 certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+

diff --git a/versioning.md b/versioning.md
index 4d17a939..fff6cbd7 100644
--- a/versioning.md
+++ b/versioning.md
@@ -2,13 +2,21 @@
 
-*** PLEASE NOTE: This document applies to the HEAD of the source
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+
+
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
 certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+

-- cgit v1.2.3


From 43bcff9826eb3752ae4ce19c2f76323387de3f9d Mon Sep 17 00:00:00 2001
From: Tim Hockin
Date: Tue, 14 Jul 2015 17:28:47 -0700
Subject: Run gendocs

---
 README.md                            | 12 ++++++------
 access.md                            | 12 ++++++------
 admission_control.md                 | 12 ++++++------
 admission_control_limit_range.md     | 12 ++++++------
 admission_control_resource_quota.md  | 12 ++++++------
 architecture.md                      | 12 ++++++------
 clustering.md                        | 12 ++++++------
 clustering/README.md                 | 12 ++++++------
 command_execution_port_forwarding.md | 12 ++++++------
 event_compression.md                 | 12 ++++++------
 expansion.md                         | 12 ++++++------
 identifiers.md                       | 12 ++++++------
 namespaces.md                        | 12 ++++++------
 networking.md                        | 12 ++++++------
 persistent-storage.md                | 12 ++++++------
 principles.md                        | 12 ++++++------
 resources.md                         | 12 ++++++------
 secrets.md                           | 12 ++++++------
 security.md                          | 12 ++++++------
 security_context.md                  | 12 ++++++------
 service_accounts.md                  | 12 ++++++------
 simple-rolling-update.md             | 12 ++++++------
 versioning.md                        | 12 ++++++------
 23 files changed, 138 insertions(+), 138 deletions(-)

diff --git a/README.md b/README.md
index 8d98c34a..1f850ffb 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/access.md b/access.md
index d8060025..c3ac41a0 100644
--- a/access.md
+++ b/access.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/admission_control.md b/admission_control.md
index ac488e6f..a80de2b2 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index b95f87a5..125c6d06 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index 22825e8d..d80f38bf 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/architecture.md b/architecture.md
index 5202147f..3deeb3aa 100644
--- a/architecture.md
+++ b/architecture.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/clustering.md b/clustering.md
index a2ea1139..e5307fd7 100644
--- a/clustering.md
+++ b/clustering.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/clustering/README.md b/clustering/README.md
index 3e390f37..cf5a3d50 100644
--- a/clustering/README.md
+++ b/clustering/README.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md
index fc06c5d3..998d1cbd 100644
--- a/command_execution_port_forwarding.md
+++ b/command_execution_port_forwarding.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/event_compression.md b/event_compression.md
index d8984e13..32e52607 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/expansion.md b/expansion.md
index eb2a78b5..f81db3c4 100644
--- a/expansion.md
+++ b/expansion.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/identifiers.md b/identifiers.md
index daadc90f..e66d2d7a 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/namespaces.md b/namespaces.md
index ff2ceb91..70f5e860 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/networking.md b/networking.md
index 235f8f19..052ec128 100644
--- a/networking.md
+++ b/networking.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/persistent-storage.md b/persistent-storage.md
index 5a18fa55..9639a521 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/principles.md b/principles.md
index af831a07..83a1ae91 100644
--- a/principles.md
+++ b/principles.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/resources.md b/resources.md
index 70420ec2..4172cdb4 100644
--- a/resources.md
+++ b/resources.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/secrets.md b/secrets.md
index d0728044..33433dc0 100644
--- a/secrets.md
+++ b/secrets.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/security.md b/security.md
index 4f9ed395..e2ab4fb7 100644
--- a/security.md
+++ b/security.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/security_context.md b/security_context.md
index ca12db00..6b0601e6 100644
--- a/security_context.md
+++ b/security_context.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/service_accounts.md b/service_accounts.md
index e877b880..ddb127f2 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index b8473682..ed2e5349 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

diff --git a/versioning.md b/versioning.md
index fff6cbd7..85e3f56f 100644
--- a/versioning.md
+++ b/versioning.md
@@ -2,9 +2,9 @@
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
 
 PLEASE NOTE: This document applies to the HEAD of the source
 tree only. If you are using a released version of Kubernetes, you almost
@@ -13,9 +13,9 @@ certainly want the docs that go with that version.
 
 Documentation for specific releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
 
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
-![WARNING](http://releases.k8s.io/HEAD/docs/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)
+![WARNING](http://kubernetes.io/img/warning.png)

-- cgit v1.2.3


From 8b825dd62649854c64185070384955f7f59b371c Mon Sep 17 00:00:00 2001
From: David Oppenheimer
Date: Tue, 14 Jul 2015 22:07:44 -0700
Subject: Move some docs from docs/ top-level into docs/{admin/,devel/,user-guide/}.

---
 principles.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/principles.md b/principles.md
index 83a1ae91..212f04bd 100644
--- a/principles.md
+++ b/principles.md
@@ -26,7 +26,7 @@ Principles to follow when extending Kubernetes.
 
 ## API
 
-See also the [API conventions](../api-conventions.md).
+See also the [API conventions](../devel/api-conventions.md).
 
 * All APIs should be declarative.
 * API objects should be complementary and composable, not opaque wrappers.
-- cgit v1.2.3


From 915691255225423c86d92c2625dea9567a32f930 Mon Sep 17 00:00:00 2001
From: David Oppenheimer
Date: Tue, 14 Jul 2015 23:56:51 -0700
Subject: Move diagrams out of top-level docs/ directory and merge
 docs/devel/developer-guide.md into docs/devel/README.md

---
 architecture.dia | Bin 0 -> 6522 bytes
 architecture.md  |   2 +-
 architecture.png | Bin 0 -> 222407 bytes
 architecture.svg | 499 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 500 insertions(+), 1 deletion(-)
 create mode 100644 architecture.dia
 create mode 100644 architecture.png
 create mode 100644 architecture.svg

diff --git a/architecture.dia b/architecture.dia
new file mode 100644
index 00000000..26e0eed2
Binary files /dev/null and b/architecture.dia differ
diff --git a/architecture.md b/architecture.md
index 3deeb3aa..1591068f 100644
--- a/architecture.md
+++ b/architecture.md
@@ -24,7 +24,7 @@ certainly want the docs that go with that version.
 
 A running Kubernetes cluster contains node agents (kubelet) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making kubelet itself (all our components, really) run within containers, and making the scheduler 100% pluggable.
-![Architecture Diagram](../architecture.png?raw=true "Architecture overview")
+![Architecture Diagram](architecture.png?raw=true "Architecture overview")
 
 ## The Kubernetes Node
 
diff --git a/architecture.png b/architecture.png
new file mode 100644
index 00000000..fa39039a
Binary files /dev/null and b/architecture.png differ
diff --git a/architecture.svg b/architecture.svg
new file mode 100644
index 00000000..825c0ace
--- /dev/null
+++ b/architecture.svg
@@ -0,0 +1,499 @@
[architecture.svg: 499 added lines of SVG markup for the architecture diagram. Recoverable labels: Nodes running kubelet, cAdvisor, and Proxy, with Pods of containers under docker; kubectl (user commands) reaching the cluster through a Firewall from the Internet; master components ("colocated, or spread across machines, as dictated by cluster size"): REST APIs (pods, services, replication controllers), authorization/authentication, scheduling actuator, Scheduler, and replication controller; all backed by Distributed Watchable Storage (implemented via etcd).]

-- cgit v1.2.3


From 304af47c459ddc8041c0f03278e2f8d043d87360 Mon Sep 17 00:00:00 2001
From: Mike Danese
Date: Wed, 15 Jul 2015 10:42:59 -0700
Subject: point kubectl -f examples to correct paths

---
 admission_control_limit_range.md    | 2 +-
 admission_control_resource_quota.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index 125c6d06..2420a274 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -131,7 +131,7 @@ For example,
 
 ```shell
 $ kubectl namespace myspace
-$ kubectl create -f examples/limitrange/limit-range.json
+$ kubectl create -f docs/user-guide/limitrange/limits.yaml
 $ kubectl get limits
 NAME
 limits
diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index d80f38bf..7a323689 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -158,7 +158,7 @@ For example,
 
 ```
 $ kubectl namespace myspace
-$ kubectl create -f examples/resourcequota/resource-quota.json
+$ kubectl create -f docs/user-guide/resourcequota/quota.yaml
 $ kubectl get quota
 NAME
 quota

-- cgit v1.2.3


From 7b8b2772975fc34673eee47d44f49fc90b03d089 Mon Sep 17 00:00:00 2001
From: David Oppenheimer
Date: Thu, 16 Jul 2015 02:20:30 -0700
Subject: Take availability.md doc and - extract the portion related to
 multi-cluster operation into a new multi-cluster.md doc - merge the remainder
 (that was basically high-level troubleshooting advice) into
 cluster-troubleshooting.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1f850ffb..2c0455da 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Kubernetes enables users to ask a cluster to run a set of containers.
 
The system Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts. -A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the availability doc](../admin/availability.md) and [cluster federation proposal](../proposals/federation.md) for more details). +A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md) for more details). Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. 
There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner. -- cgit v1.2.3 From c198491ead87e5a970a17f75c25a7f3843006f2a Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 16 Jul 2015 14:54:28 -0700 Subject: (mostly) auto fixed links --- event_compression.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/event_compression.md b/event_compression.md index 32e52607..294d3f41 100644 --- a/event_compression.md +++ b/event_compression.md @@ -35,7 +35,7 @@ Each binary that generates events (for example, ```kubelet```) should keep track Event compression should be best effort (not guaranteed). Meaning, in the worst case, ```n``` identical (minus timestamp) events may still result in ```n``` event entries. ## Design -Instead of a single Timestamp, each event object [contains](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/api/types.go#L1111) the following fields: +Instead of a single Timestamp, each event object [contains](../../pkg/api/types.go#L1111) the following fields: * ```FirstTimestamp util.Time``` * The date/time of the first occurrence of the event. * ```LastTimestamp util.Time``` @@ -47,7 +47,7 @@ Instead of a single Timestamp, each event object [contains](https://github.com/G Each binary that generates events: * Maintains a historical record of previously generated events: - * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](https://github.com/GoogleCloudPlatform/kubernetes/tree/master/pkg/client/record/events_cache.go). + * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](../../pkg/client/record/events_cache.go). 
* The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: * ```event.Source.Component``` * ```event.Source.Host``` @@ -59,7 +59,7 @@ Each binary that generates events: * ```event.Reason``` * ```event.Message``` * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [```pkg/client/record/event.go```](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/client/record/event.go)). + * When an event is generated, the previously generated events cache is checked (see [```pkg/client/record/event.go```](../../pkg/client/record/event.go)). * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). 
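The keying and dedup flow described above can be sketched in Go as follows. This is an illustrative stand-in, not the actual `pkg/client/record` code: the type and function names (`eventKey`, `lruCache`, `record`) are invented here, and the key uses a representative subset of the fields listed above.

```go
package main

import (
	"container/list"
	"fmt"
)

// eventKey holds the aggregation key: everything except
// timestamps, count, and other transient fields.
type eventKey struct {
	SourceComponent string
	SourceHost      string
	InvolvedKind    string
	InvolvedName    string
	Reason          string
	Message         string
}

type cacheEntry struct {
	key   eventKey
	count int
}

// lruCache is a minimal least-recently-used cache capped at maxEntries,
// standing in for the previously-generated-events cache.
type lruCache struct {
	maxEntries int
	order      *list.List // front = most recently used
	entries    map[eventKey]*list.Element
}

func newLRUCache(max int) *lruCache {
	return &lruCache{
		maxEntries: max,
		order:      list.New(),
		entries:    make(map[eventKey]*list.Element),
	}
}

// record returns the occurrence count for key. A count > 1 means the
// event is a duplicate, so the caller would PUT an update to the
// existing event entry instead of creating a new one. Once the cap is
// reached, the least recently used key is evicted.
func (c *lruCache) record(key eventKey) int {
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el)
		entry := el.Value.(*cacheEntry)
		entry.count++
		return entry.count
	}
	if c.order.Len() >= c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(*cacheEntry).key)
	}
	c.entries[key] = c.order.PushFront(&cacheEntry{key: key, count: 1})
	return 1
}

func main() {
	cache := newLRUCache(4096)
	k := eventKey{"kubelet", "node-1", "Pod", "web-1", "FailedScheduling", "no nodes available"}
	fmt.Println(cache.record(k)) // first occurrence: a new event entry would be created
	fmt.Println(cache.record(k)) // duplicate: the existing entry would be updated with count=2
}
```

The count returned by the cache is what drives the create-vs-update decision against etcd; eviction at the cap is what keeps a long-running component's memory bounded.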
-- cgit v1.2.3 From 9f528c422685de25aafd8c76cdaef0125c005855 Mon Sep 17 00:00:00 2001 From: Janet Kuo Date: Wed, 15 Jul 2015 17:28:59 -0700 Subject: Ensure all docs and examples in user guide are reachable --- admission_control_limit_range.md | 3 +++ admission_control_resource_quota.md | 3 +++ persistent-storage.md | 2 +- secrets.md | 4 ++-- simple-rolling-update.md | 4 ++-- 5 files changed, 11 insertions(+), 5 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 2420a274..addd8483 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -153,6 +153,9 @@ It is expected we will want to define limits for particular pods or containers b To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. +## Example +See the [example of Limit Range](../user-guide/limitrange) for more information. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 7a323689..ec2cb20d 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -174,6 +174,9 @@ resourcequotas 1 1 services 3 5 ``` +## More information +See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota) for more information. + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() diff --git a/persistent-storage.md b/persistent-storage.md index 9639a521..1cbed771 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -28,7 +28,7 @@ This document proposes a model for managing persistent, cluster-scoped storage f Two new API kinds: -A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. 
It is analogous to a node. +A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) for how to use it. A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod. diff --git a/secrets.md b/secrets.md index 33433dc0..b4bc8385 100644 --- a/secrets.md +++ b/secrets.md @@ -23,8 +23,8 @@ certainly want the docs that go with that version. ## Abstract -A proposal for the distribution of secrets (passwords, keys, etc) to the Kubelet and to -containers inside Kubernetes using a custom volume type. +A proposal for the distribution of [secrets](../user-guide/secrets.md) (passwords, keys, etc) to the Kubelet and to +containers inside Kubernetes using a custom [volume](../user-guide/volumes.md#secrets) type. See the [secrets example](../user-guide/secrets/) for more information. ## Motivation diff --git a/simple-rolling-update.md b/simple-rolling-update.md index ed2e5349..b74264d6 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -21,9 +21,9 @@ certainly want the docs that go with that version. ## Simple rolling update -This is a lightweight design document for simple rolling update in ```kubectl``` +This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in ```kubectl```. -Complete execution flow can be found [here](#execution-details). +Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. 
### Lightweight rollout Assume that we have a current replication controller named ```foo``` and it is running image ```image:v1``` -- cgit v1.2.3 From 4510fab29da1d028321cf708a5665a14757b7ca7 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Thu, 16 Jul 2015 10:02:26 -0700 Subject: Better scary message --- README.md | 38 +++++++++++++++++++++++------------- access.md | 38 +++++++++++++++++++++++------------- admission_control.md | 38 +++++++++++++++++++++++------------- admission_control_limit_range.md | 38 +++++++++++++++++++++++------------- admission_control_resource_quota.md | 38 +++++++++++++++++++++++------------- architecture.md | 38 +++++++++++++++++++++++------------- clustering.md | 38 +++++++++++++++++++++++------------- clustering/README.md | 38 +++++++++++++++++++++++------------- command_execution_port_forwarding.md | 38 +++++++++++++++++++++++------------- event_compression.md | 38 +++++++++++++++++++++++------------- expansion.md | 38 +++++++++++++++++++++++------------- identifiers.md | 38 +++++++++++++++++++++++------------- namespaces.md | 38 +++++++++++++++++++++++------------- networking.md | 38 +++++++++++++++++++++++------------- persistent-storage.md | 38 +++++++++++++++++++++++------------- principles.md | 38 +++++++++++++++++++++++------------- resources.md | 38 +++++++++++++++++++++++------------- secrets.md | 38 +++++++++++++++++++++++------------- security.md | 38 +++++++++++++++++++++++------------- security_context.md | 38 +++++++++++++++++++++++------------- service_accounts.md | 38 +++++++++++++++++++++++------------- simple-rolling-update.md | 38 +++++++++++++++++++++++------------- versioning.md | 38 +++++++++++++++++++++++------------- 23 files changed, 552 insertions(+), 322 deletions(-) diff --git a/README.md b/README.md index 2c0455da..b0f3115a 100644 --- a/README.md +++ b/README.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) 
-![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/README.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/access.md b/access.md index c3ac41a0..e42d7859 100644 --- a/access.md +++ b/access.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/access.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/admission_control.md b/admission_control.md index a80de2b2..aaa6ed16 100644 --- a/admission_control.md +++ b/admission_control.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/admission_control.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index addd8483..824d4a35 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/admission_control_limit_range.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index ec2cb20d..e262eb2d 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/admission_control_resource_quota.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/architecture.md b/architecture.md index 1591068f..2e4afc62 100644 --- a/architecture.md +++ b/architecture.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/architecture.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/clustering.md b/clustering.md index e5307fd7..8673284f 100644 --- a/clustering.md +++ b/clustering.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/clustering.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/clustering/README.md b/clustering/README.md index cf5a3d50..f05168d6 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/clustering/README.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 998d1cbd..c7408b58 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/command_execution_port_forwarding.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/event_compression.md b/event_compression.md index 294d3f41..0b458c8d 100644 --- a/event_compression.md +++ b/event_compression.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/event_compression.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/expansion.md b/expansion.md index f81db3c4..5cc08c6c 100644 --- a/expansion.md +++ b/expansion.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/expansion.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/identifiers.md b/identifiers.md index e66d2d7a..eda7254b 100644 --- a/identifiers.md +++ b/identifiers.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/identifiers.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/namespaces.md b/namespaces.md index 70f5e860..7bd7ab67 100644 --- a/namespaces.md +++ b/namespaces.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/namespaces.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/networking.md b/networking.md index 052ec128..ac6e5794 100644 --- a/networking.md +++ b/networking.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/networking.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/persistent-storage.md b/persistent-storage.md index 1cbed771..f919baa9 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/persistent-storage.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/principles.md b/principles.md index 212f04bd..1ae3bc3a 100644 --- a/principles.md +++ b/principles.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/principles.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/resources.md b/resources.md index 4172cdb4..0457eb44 100644 --- a/resources.md +++ b/resources.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/resources.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/secrets.md b/secrets.md index b4bc8385..8aab1088 100644 --- a/secrets.md +++ b/secrets.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -
-PLEASE NOTE: This document applies to the HEAD of the source
-tree only. If you are using a released version of Kubernetes, you almost
-certainly want the docs that go with that version.
-
-Documentation for specific releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
-![WARNING](http://kubernetes.io/img/warning.png)
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/secrets.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/security.md b/security.md index e2ab4fb7..2989148b 100644 --- a/security.md +++ b/security.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

- -Documentation for specific releases can be found at -[releases.k8s.io](http://releases.k8s.io). - -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/security.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/security_context.md b/security_context.md index 6b0601e6..6940aae2 100644 --- a/security_context.md +++ b/security_context.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

- -Documentation for specific releases can be found at -[releases.k8s.io](http://releases.k8s.io). - -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/security_context.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/service_accounts.md b/service_accounts.md index ddb127f2..c53b4633 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

- -Documentation for specific releases can be found at -[releases.k8s.io](http://releases.k8s.io). - -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/service_accounts.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/simple-rolling-update.md b/simple-rolling-update.md index b74264d6..b142c6e5 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

- -Documentation for specific releases can be found at -[releases.k8s.io](http://releases.k8s.io). - -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/simple-rolling-update.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- diff --git a/versioning.md b/versioning.md index 85e3f56f..3f9bf614 100644 --- a/versioning.md +++ b/versioning.md @@ -2,20 +2,30 @@ -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) - -

PLEASE NOTE: This document applies to the HEAD of the source tree only. If you are using a released version of Kubernetes, you almost certainly want the docs that go with that version.

- -Documentation for specific releases can be found at -[releases.k8s.io](http://releases.k8s.io). - -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) -![WARNING](http://kubernetes.io/img/warning.png) +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/versioning.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- -- cgit v1.2.3 From 60cec0f5fa87f28f2a7f1357817d06db433b1e75 Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 16 Jul 2015 19:01:02 -0700 Subject: apply changes --- admission_control_limit_range.md | 2 +- admission_control_resource_quota.md | 2 +- event_compression.md | 1 + resources.md | 7 +++++++ security_context.md | 1 + service_accounts.md | 1 + 6 files changed, 12 insertions(+), 2 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 824d4a35..90329815 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -164,7 +164,7 @@ It is expected we will want to define limits for particular pods or containers b To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. ## Example -See the [example of Limit Range](../user-guide/limitrange) for more information. +See the [example of Limit Range](../user-guide/limitrange/) for more information. diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index e262eb2d..d5cdc9a1 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -185,7 +185,7 @@ services 3 5 ``` ## More information -See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota) for more information. +See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota/) for more information. 
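The used-versus-hard accounting shown in the ResourceQuota hunks above (e.g. `services 3 5`) reduces to a simple admission check: deny any creation that would push observed usage past the hard limit. The following Python sketch illustrates the idea only — it is not the actual Go admission plugin, and all names in it are invented.

```python
# Illustrative sketch of a ResourceQuota-style admission check.
# Not the actual plugin code; all names here are invented for the example.

def admit(usage, hard, resource, requested=1):
    """Deny a request if it would push usage over the hard limit."""
    used = usage.get(resource, 0)
    limit = hard.get(resource)
    if limit is not None and used + requested > limit:
        return False  # deny admission: quota would be exceeded
    usage[resource] = used + requested  # account for the admitted object
    return True

hard = {"services": 5, "pods": 10}
usage = {"services": 3}          # mirrors the "services 3 5" sample output

assert admit(usage, hard, "services")        # 3 -> 4, under the cap
assert admit(usage, hard, "services")        # 4 -> 5, exactly at the cap
assert not admit(usage, hard, "services")    # 6 > 5: denied
```

Resources with no hard limit configured are admitted unconditionally in this sketch, which matches the intuition that quota only constrains what it tracks.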
diff --git a/event_compression.md b/event_compression.md index 0b458c8d..af823972 100644 --- a/event_compression.md +++ b/event_compression.md @@ -84,6 +84,7 @@ Each binary that generates events: ## Example Sample kubectl output + ``` FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet. diff --git a/resources.md b/resources.md index 0457eb44..2effb5cf 100644 --- a/resources.md +++ b/resources.md @@ -87,23 +87,27 @@ Internally (i.e., everywhere else), Kubernetes will represent resource quantitie Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes divides specifications of *desired state*, aka the Spec, and representations of *current state*, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now. Resource requirements for a container or pod should have the following form: + ``` resourceRequirementSpec: [ request: [ cpu: 2.5, memory: "40Mi" ], limit: [ cpu: 4.0, memory: "99Mi" ], ] ``` + Where: * _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). 
If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses. * _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory. Total capacity for a node should have a similar structure: + ``` resourceCapacitySpec: [ total: [ cpu: 12, memory: "128Gi" ] ] ``` + Where: * _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes. @@ -149,6 +153,7 @@ rather than decimal ones: "64MiB" rather than "64MB". ## Resource metadata A resource type may have an associated read-only ResourceType structure, that contains metadata about the type. For example: + ``` resourceTypes: [ "kubernetes.io/memory": [ @@ -194,6 +199,7 @@ resourceStatus: [ ``` where a `` or `` structure looks like this: + ``` { mean: # arithmetic mean @@ -209,6 +215,7 @@ where a `` or `` structure looks like this: ] } ``` + All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. 
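The mean/stddev/percentile summary structure above could be filled in from raw usage samples roughly as follows. This is an illustrative Python sketch (nearest-rank percentiles, invented function names), not part of the design itself.

```python
# Sketch of computing the mean/stddev/percentile summary described above
# from raw usage samples. Illustrative only; names are invented.
import math

def summarize(samples, percentiles=(50, 90, 95, 99)):
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    ordered = sorted(samples)
    def pct(p):
        # nearest-rank percentile on the sorted samples
        return ordered[min(n - 1, int(math.ceil(p / 100 * n)) - 1)]
    return {"mean": mean, "stddev": math.sqrt(variance),
            "percentiles": {p: pct(p) for p in percentiles}}

s = summarize(list(range(1, 101)))  # samples 1..100
assert s["mean"] == 50.5
assert s["percentiles"][50] == 50
assert s["percentiles"][99] == 99
```

As the design notes, a real implementation would also carry the averaging window, confidence level, and data-quality metadata alongside these quantities.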
_[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_ and predicted diff --git a/security_context.md b/security_context.md index 6940aae2..bc76495a 100644 --- a/security_context.md +++ b/security_context.md @@ -179,6 +179,7 @@ type SELinuxOptions struct { Level string } ``` + ### Admission It is up to an admission plugin to determine if the security context is acceptable or not. At the diff --git a/service_accounts.md b/service_accounts.md index c53b4633..c6acbd24 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -61,6 +61,7 @@ A service account binds together several things: ## Design Discussion A new object Kind is added: + ```go type ServiceAccount struct { TypeMeta `json:",inline" yaml:",inline"` -- cgit v1.2.3 From fabd20afce30e947425346fa2938ad0edfa8b867 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Fri, 17 Jul 2015 15:35:41 -0700 Subject: Run gendocs --- README.md | 1 + access.md | 15 ++++++++++++--- admission_control.md | 1 + admission_control_limit_range.md | 2 ++ admission_control_resource_quota.md | 2 ++ architecture.md | 2 ++ clustering.md | 2 ++ clustering/README.md | 1 + command_execution_port_forwarding.md | 3 +++ event_compression.md | 6 ++++++ expansion.md | 1 + identifiers.md | 1 + namespaces.md | 1 + networking.md | 1 + persistent-storage.md | 1 + principles.md | 1 + resources.md | 10 ++++++++++ security.md | 1 + security_context.md | 6 ++++++ service_accounts.md | 6 +++++- simple-rolling-update.md | 9 +++++++++ versioning.md | 1 + 22 files changed, 70 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b0f3115a..62946cb6 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Kubernetes Design Overview Kubernetes is a system for managing containerized applications 
across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. diff --git a/access.md b/access.md index e42d7859..9a0c0d3d 100644 --- a/access.md +++ b/access.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # K8s Identity and Access Management Sketch This document suggests a direction for identity and access management in the Kubernetes system. @@ -43,6 +44,7 @@ High level goals are: - Ease integration with existing enterprise and hosted scenarios. ### Actors + Each of these can act as normal users or attackers. - External Users: People who are accessing applications running on K8s (e.g. a web site served by webserver running in a container on K8s), but who do not have K8s API access. - K8s Users : People who access the K8s API (e.g. create K8s API objects like Pods) @@ -51,6 +53,7 @@ Each of these can act as normal users or attackers. - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. ### Threats + Both intentional attacks and accidental use of privilege are concerns. For both cases it may be useful to think about these categories differently: @@ -81,6 +84,7 @@ K8s Cluster assets: This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins. ### Usage environments + Cluster in Small organization: - K8s Admins may be the same people as K8s Users. - few K8s Admins. @@ -112,6 +116,7 @@ Pods configs should be largely portable between Org-run and hosted configuration # Design + Related discussion: - https://github.com/GoogleCloudPlatform/kubernetes/issues/442 - https://github.com/GoogleCloudPlatform/kubernetes/issues/443 @@ -125,7 +130,9 @@ K8s distribution should include templates of config, and documentation, for simp Features in this doc are divided into "Initial Feature", and "Improvements". Initial features would be candidates for version 1.00. 
## Identity -###userAccount + +### userAccount + K8s will have a `userAccount` API object. - `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs. - `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field. @@ -158,7 +165,8 @@ Enterprise Profile: - each service using the API has its own `userAccount` too. (e.g. `scheduler`, `repcontroller`) - automated jobs to denormalize the ldap group info into the local system list of users into the K8s userAccount file. -###Unix accounts +### Unix accounts + A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity. Initially: @@ -170,7 +178,8 @@ Improvements: - requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids. - any features that help users avoid use of privileged containers (https://github.com/GoogleCloudPlatform/kubernetes/issues/391) -###Namespaces +### Namespaces + K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies. Namespaces are described in [namespaces.md](namespaces.md).
diff --git a/admission_control.md b/admission_control.md index aaa6ed16..c75d5535 100644 --- a/admission_control.md +++ b/admission_control.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Kubernetes Proposal - Admission Control **Related PR:** diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 90329815..ccdb44d8 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Admission control plugin: LimitRanger ## Background @@ -164,6 +165,7 @@ It is expected we will want to define limits for particular pods or containers b To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. ## Example + See the [example of Limit Range](../user-guide/limitrange/) for more information. diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index d5cdc9a1..99d5431a 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Admission control plugin: ResourceQuota ## Background @@ -185,6 +186,7 @@ services 3 5 ``` ## More information + See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota/) for more information. diff --git a/architecture.md b/architecture.md index 2e4afc62..f7c55171 100644 --- a/architecture.md +++ b/architecture.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Kubernetes architecture A running Kubernetes cluster contains node agents (kubelet) and master components (APIs, scheduler, etc), on top of a distributed storage solution. 
This diagram shows our desired eventual state, though we're still working on a few things, like making kubelet itself (all our components, really) run within containers, and making the scheduler 100% pluggable. @@ -45,6 +46,7 @@ The Kubernetes node has the services necessary to run application containers and Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers. ### Kubelet + The **Kubelet** manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. ### Kube-Proxy diff --git a/clustering.md b/clustering.md index 8673284f..1fcb8aa3 100644 --- a/clustering.md +++ b/clustering.md @@ -30,10 +30,12 @@ Documentation for other releases can be found at + # Clustering in Kubernetes ## Overview + The term "clustering" refers to the process of having all members of the kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address. Once a cluster is established, the following is true: diff --git a/clustering/README.md b/clustering/README.md index f05168d6..53649a31 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -41,6 +41,7 @@ pip install seqdiag Just call `make` to regenerate the diagrams. ## Building with Docker + If you are on a Mac or your pip install is messed up, you can easily build with docker. ``` diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index c7408b58..1d319adf 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Container Command Execution & Port Forwarding in Kubernetes ## Abstract @@ -87,12 +88,14 @@ won't be able to work with this mechanism, unless adapters can be written. 
## Process Flow ### Remote Command Execution Flow + 1. The client connects to the Kubernetes Master to initiate a remote command execution request 2. The Master proxies the request to the Kubelet where the container lives 3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container ### Port Forwarding Flow + 1. The client connects to the Kubernetes Master to initiate a remote command execution request 2. The Master proxies the request to the Kubelet where the container lives diff --git a/event_compression.md b/event_compression.md index af823972..29e65917 100644 --- a/event_compression.md +++ b/event_compression.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Kubernetes Event Compression This document captures the design of event compression. @@ -40,11 +41,13 @@ This document captures the design of event compression. Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate ```image_not_existing``` and ```container_is_waiting``` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](https://github.com/GoogleCloudPlatform/kubernetes/issues/3853)). ## Proposal + Each binary that generates events (for example, ```kubelet```) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. Event compression should be best effort (not guaranteed). Meaning, in the worst case, ```n``` identical (minus timestamp) events may still result in ```n``` event entries. 
## Design + Instead of a single Timestamp, each event object [contains](../../pkg/api/types.go#L1111) the following fields: * ```FirstTimestamp util.Time``` * The date/time of the first occurrence of the event. @@ -78,11 +81,13 @@ Each binary that generates events: * An entry for the event is also added to the previously generated events cache. ## Issues/Risks + * Compression is not guaranteed, because each component keeps track of event history in memory * An application restart causes event history to be cleared, meaning event history is not preserved across application restarts and compression will not occur across component restarts. * Because an LRU cache is used to keep track of previously generated events, if too many unique events are generated, old events will be evicted from the cache, so events will only be compressed until they age out of the events cache, at which point any new instance of the event will cause a new entry to be created in etcd. ## Example + Sample kubectl output ``` @@ -104,6 +109,7 @@ Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries. 
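The compression scheme above — collapse events that are identical except for their timestamps into one entry whose `Count` and `LastTimestamp` advance, using a bounded LRU of previously generated events — can be sketched as follows. This is an illustrative Python sketch with invented names, not the kubelet's actual implementation.

```python
# Sketch of the event-compression design described above: recurring events
# (identical minus timestamp) collapse into one entry whose Count and
# LastTimestamp advance; a bounded LRU makes compression best effort only.
from collections import OrderedDict

class EventCompressor:
    def __init__(self, max_entries=4096):
        self.cache = OrderedDict()   # LRU of previously generated events
        self.max_entries = max_entries

    def record(self, source, kind, name, reason, message, now):
        key = (source, kind, name, reason, message)  # everything but time
        if key in self.cache:
            entry = self.cache[key]
            entry["count"] += 1            # recurring event: bump count
            entry["last_timestamp"] = now  # and advance LastTimestamp
            self.cache.move_to_end(key)    # refresh LRU position
        else:
            entry = {"first_timestamp": now, "last_timestamp": now, "count": 1}
            self.cache[key] = entry
            if len(self.cache) > self.max_entries:
                self.cache.popitem(last=False)  # evict oldest: best effort
        return entry

c = EventCompressor()
for t in range(20):
    e = c.record("kubelet", "Pod", "nginx", "failedScheduling", "no fit", t)
assert e["count"] == 20 and e["first_timestamp"] == 0 and e["last_timestamp"] == 19
```

The eviction branch shows why compression is not guaranteed: once a key ages out of the cache, the next occurrence starts a fresh entry, exactly as the Issues/Risks section warns.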
## Related Pull Requests/Issues + * Issue [#4073](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress duplicate events * PR [#4157](https://github.com/GoogleCloudPlatform/kubernetes/issues/4157): Add "Update Event" to Kubernetes API * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event diff --git a/expansion.md b/expansion.md index 5cc08c6c..096b8a9d 100644 --- a/expansion.md +++ b/expansion.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Variable expansion in pod command, args, and env ## Abstract diff --git a/identifiers.md b/identifiers.md index eda7254b..9e269993 100644 --- a/identifiers.md +++ b/identifiers.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Identifiers and Names in Kubernetes A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199). diff --git a/namespaces.md b/namespaces.md index 7bd7ab67..1f1a767c 100644 --- a/namespaces.md +++ b/namespaces.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Namespaces ## Abstract diff --git a/networking.md b/networking.md index ac6e5794..d7822d4d 100644 --- a/networking.md +++ b/networking.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Networking There are 4 distinct networking problems to solve: diff --git a/persistent-storage.md b/persistent-storage.md index f919baa9..3e9edd3e 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Persistent Storage This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. 
diff --git a/principles.md b/principles.md index 1ae3bc3a..c208fb6b 100644 --- a/principles.md +++ b/principles.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Design Principles Principles to follow when extending Kubernetes. diff --git a/resources.md b/resources.md index 2effb5cf..055c5d86 100644 --- a/resources.md +++ b/resources.md @@ -48,6 +48,7 @@ The resource model aims to be: * precise, to avoid misunderstandings and promote pod portability. ## The resource model + A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth. Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_. @@ -124,9 +125,11 @@ Where: ## Kubernetes-defined resource types + The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet. ### Processor cycles + * Name: `cpu` (or `kubernetes.io/cpu`) * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU") * Internal representation: milli-KCUs @@ -141,6 +144,7 @@ Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will ### Memory + * Name: `memory` (or `kubernetes.io/memory`) * Units: bytes * Compressible? 
no (at least initially) @@ -152,6 +156,7 @@ rather than decimal ones: "64MiB" rather than "64MB". ## Resource metadata + A resource type may have an associated read-only ResourceType structure, that contains metadata about the type. For example: ``` @@ -222,16 +227,19 @@ and predicted ## Future resource types ### _[future] Network bandwidth_ + * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`) * Units: bytes per second * Compressible? yes ### _[future] Network operations_ + * Name: "network-iops" (or `kubernetes.io/network-iops`) * Units: operations (messages) per second * Compressible? yes ### _[future] Storage space_ + * Name: "storage-space" (or `kubernetes.io/storage-space`) * Units: bytes * Compressible? no @@ -239,6 +247,7 @@ and predicted The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these, is left for future work. ### _[future] Storage time_ + * Name: storage-time (or `kubernetes.io/storage-time`) * Units: seconds per second of disk time * Internal representation: milli-units @@ -247,6 +256,7 @@ The amount of secondary storage space available to a container. The main target This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second. ### _[future] Storage operations_ + * Name: "storage-iops" (or `kubernetes.io/storage-iops`) * Units: operations per second * Compressible? 
yes diff --git a/security.md b/security.md index 2989148b..522ff4ca 100644 --- a/security.md +++ b/security.md @@ -30,6 +30,7 @@ Documentation for other releases can be found at + # Security in Kubernetes Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other, from the cluster infrastructure, and which preserves important boundaries between those who manage the cluster, and those who use the cluster. diff --git a/security_context.md b/security_context.md index bc76495a..03213927 100644 --- a/security_context.md +++ b/security_context.md @@ -30,8 +30,11 @@ Documentation for other releases can be found at + # Security Contexts + ## Abstract + A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)): 1. Ensure a clear isolation between container and the underlying host it runs on @@ -53,11 +56,13 @@ to the container process. Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers. ### External integration with shared storage + In order to support external integration with shared storage, processes running in a Kubernetes cluster should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established. Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks. ## Constraints and Assumptions + * It is out of the scope of this document to prescribe a specific set of constraints to isolate containers from their host. Different use cases need different settings. 
@@ -96,6 +101,7 @@ be addressed with security contexts: ## Proposed Design ### Overview + A *security context* consists of a set of constraints that determine how a container is secured before getting created and run. A security context resides on the container and represents the runtime parameters that will be used to create and run the container via container APIs. A *security context provider* is passed to the Kubelet so it can have a chance diff --git a/service_accounts.md b/service_accounts.md index c6acbd24..d9535de5 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -30,7 +30,8 @@ Documentation for other releases can be found at -#Service Accounts + +# Service Accounts ## Motivation @@ -50,6 +51,7 @@ They also may interact with services other than the Kubernetes API, such as: - accessing files in an NFS volume attached to the pod ## Design Overview + A service account binds together several things: - a *name*, understood by users, and perhaps by peripheral systems, for an identity - a *principal* that can be authenticated and [authorized](../admin/authorization.md) @@ -137,6 +139,7 @@ are added to the map of tokens used by the authentication process in the apiserv might have some types that do not do anything on apiserver but just get pushed to the kubelet.) ### Pods + The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If this is unset, then a default value is chosen. If it is set, then the corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account Finalizer (see below). @@ -144,6 +147,7 @@ Service Account Finalizer (see below). TBD: how policy limits which users can make pods with which service accounts. ### Authorization + Kubernetes API Authorization Policies refer to users. Pods created with a `Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to authenticate to the Kubernetes APIserver as a particular user. So any policy that is desired can be applied to them. 
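The service-account defaulting described above — `Pods.Spec.ServiceAccountUsername` gets a default value when unset, and is otherwise left alone for the finalizer to act on — can be sketched as below. This is purely illustrative; the field spelling and the default name are assumptions for the example, not the API's actual behavior.

```python
# Sketch of ServiceAccountUsername defaulting as described above.
# Field and default names are invented for illustration.
def finalize_service_account(pod_spec, default="default"):
    """If the pod names no service account, choose a default for it."""
    if not pod_spec.get("serviceAccountUsername"):
        pod_spec["serviceAccountUsername"] = default
    return pod_spec

assert finalize_service_account({})["serviceAccountUsername"] == "default"
assert finalize_service_account(
    {"serviceAccountUsername": "builder"})["serviceAccountUsername"] == "builder"
```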
diff --git a/simple-rolling-update.md b/simple-rolling-update.md index b142c6e5..80bc6566 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -30,12 +30,15 @@ Documentation for other releases can be found at + ## Simple rolling update + This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in ```kubectl```. Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. ### Lightweight rollout + Assume that we have a current replication controller named ```foo``` and it is running image ```image:v1``` ```kubectl rolling-update foo [foo-v2] --image=myimage:v2``` @@ -51,6 +54,7 @@ and the old 'foo' replication controller is deleted. For the purposes of the ro The value of that label is the hash of the complete JSON representation of the```foo-next``` or```foo``` replication controller. The name of this label can be overridden by the user with the ```--deployment-label-key``` flag. #### Recovery + If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out. To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the ```kubernetes.io/``` annotation namespace: * ```desired-replicas``` The desired number of replicas for this replication controller (either N or zero) @@ -68,6 +72,7 @@ it is assumed that the rollout is nearly completed, and ```foo-next``` is rename ### Aborting a rollout + Abort is assumed to want to reverse a rollout in progress. 
```kubectl rolling-update foo [foo-v2] --rollback``` @@ -87,6 +92,7 @@ If the user doesn't specify a ```foo-next``` name, then it is either discovered then ```foo-next``` is synthesized using the pattern ```-``` #### Initialization + * If ```foo``` and ```foo-next``` do not exist: * Exit, and indicate an error to the user, that the specified controller doesn't exist. * If ```foo``` exists, but ```foo-next``` does not: @@ -102,6 +108,7 @@ then ```foo-next``` is synthesized using the pattern ```- 0 @@ -109,11 +116,13 @@ then ```foo-next``` is synthesized using the pattern ```- + # Kubernetes API and Release Versioning Legend: -- cgit v1.2.3 From 33ff550b17290853b10e8106492b05d184c3b98e Mon Sep 17 00:00:00 2001 From: Alex Robinson Date: Sun, 19 Jul 2015 08:46:02 +0000 Subject: Improve design docs syntax highlighting. --- admission_control_limit_range.md | 4 ++-- admission_control_resource_quota.md | 4 ++-- clustering/README.md | 4 ++-- event_compression.md | 2 +- namespaces.md | 14 +++++++------- networking.md | 2 +- persistent-storage.md | 38 ++++++++++++++----------------------- resources.md | 17 +++++++++-------- simple-rolling-update.md | 2 +- 9 files changed, 39 insertions(+), 48 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index ccdb44d8..48a7880f 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -128,7 +128,7 @@ The server is updated to be aware of **LimitRange** objects. The constraints are only enforced if the kube-apiserver is started as follows: -``` +```console $ kube-apiserver -admission_control=LimitRanger ``` @@ -140,7 +140,7 @@ kubectl is modified to support the **LimitRange** resource. 
For example, -```shell +```console $ kubectl namespace myspace $ kubectl create -f docs/user-guide/limitrange/limits.yaml $ kubectl get limits diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 99d5431a..a3781d64 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -140,7 +140,7 @@ The server is updated to be aware of **ResourceQuota** objects. The quota is only enforced if the kube-apiserver is started as follows: -``` +```console $ kube-apiserver -admission_control=ResourceQuota ``` @@ -167,7 +167,7 @@ kubectl is modified to support the **ResourceQuota** resource. For example, -``` +```console $ kubectl namespace myspace $ kubectl create -f docs/user-guide/resourcequota/quota.yaml $ kubectl get quota diff --git a/clustering/README.md b/clustering/README.md index 53649a31..d02b7d50 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -34,7 +34,7 @@ This directory contains diagrams for the clustering design doc. This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with -```bash +```sh pip install seqdiag ``` @@ -44,7 +44,7 @@ Just call `make` to regenerate the diagrams. If you are on a Mac or your pip install is messed up, you can easily build with docker. -``` +```sh make docker ``` diff --git a/event_compression.md b/event_compression.md index 29e65917..3b988048 100644 --- a/event_compression.md +++ b/event_compression.md @@ -90,7 +90,7 @@ Each binary that generates events: Sample kubectl output -``` +```console FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet. 
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet. diff --git a/namespaces.md b/namespaces.md index 1f1a767c..da3bb2c5 100644 --- a/namespaces.md +++ b/namespaces.md @@ -74,7 +74,7 @@ The Namespace provides a unique scope for: A *Namespace* defines a logically named group for multiple *Kind*s of resources. -``` +```go type Namespace struct { TypeMeta `json:",inline"` ObjectMeta `json:"metadata,omitempty"` @@ -125,7 +125,7 @@ See [Admission control: Resource Quota](admission_control_resource_quota.md) Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* objects. -``` +```go type FinalizerName string // These are internal finalizers to Kubernetes, must be qualified name unless defined here @@ -154,7 +154,7 @@ set by default. A *Namespace* may exist in the following phases. -``` +```go type NamespacePhase string const( NamespaceActive NamespacePhase = "Active" @@ -262,7 +262,7 @@ to take part in Namespace termination. OpenShift creates a Namespace in Kubernetes -``` +```json { "apiVersion":"v1", "kind": "Namespace", @@ -287,7 +287,7 @@ own storage associated with the "development" namespace unknown to Kubernetes. User deletes the Namespace in Kubernetes, and Namespace now has following state: -``` +```json { "apiVersion":"v1", "kind": "Namespace", @@ -312,7 +312,7 @@ and begins to terminate all of the content in the namespace that it knows about. success, it executes a *finalize* action that modifies the *Namespace* by removing *kubernetes* from the list of finalizers: -``` +```json { "apiVersion":"v1", "kind": "Namespace", @@ -340,7 +340,7 @@ from the list of finalizers. 
This results in the following state: -``` +```json { "apiVersion":"v1", "kind": "Namespace", diff --git a/networking.md b/networking.md index d7822d4d..b1d5a460 100644 --- a/networking.md +++ b/networking.md @@ -131,7 +131,7 @@ differentiate it from `docker0`) is set up outside of Docker proper. Example of GCE's advanced routing rules: -``` +```sh gcloud compute routes add "${MINION_NAMES[$i]}" \ --project "${PROJECT}" \ --destination-range "${MINION_IP_RANGES[$i]}" \ diff --git a/persistent-storage.md b/persistent-storage.md index 3e9edd3e..9b0cd0d7 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -127,7 +127,7 @@ Events that communicate the state of a mounted volume are left to the volume plu An administrator provisions storage by posting PVs to the API. Various way to automate this task can be scripted. Dynamic provisioning is a future feature that can maintain levels of PVs. -``` +```yaml POST: kind: PersistentVolume @@ -140,15 +140,13 @@ spec: persistentDisk: pdName: "abc123" fsType: "ext4" +``` --------------------------------------------------- - -kubectl get pv +```console +$ kubectl get pv NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON pv0001 map[] 10737418240 RWO Pending - - ``` #### Users request storage @@ -157,9 +155,9 @@ A user requests storage by posting a PVC to the API. Their request contains the The user must be within a namespace to create PVCs. -``` - +```yaml POST: + kind: PersistentVolumeClaim apiVersion: v1 metadata: @@ -170,15 +168,13 @@ spec: resources: requests: storage: 3 +``` --------------------------------------------------- - -kubectl get pvc - +```console +$ kubectl get pvc NAME LABELS STATUS VOLUME myclaim-1 map[] pending - ``` @@ -186,9 +182,8 @@ myclaim-1 map[] pending The ```PersistentVolumeClaimBinder``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. 
Requests can go unfulfilled if a suitable match is not found. -``` - -kubectl get pv +```console +$ kubectl get pv NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e @@ -198,8 +193,6 @@ kubectl get pvc NAME LABELS STATUS VOLUME myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e - - ``` #### Claim usage @@ -208,7 +201,7 @@ The claim holder can use their claim as a volume. The ```PersistentVolumeClaimV The claim holder owns the claim and its data for as long as the claim exists. The pod using the claim can be deleted, but the claim remains in the user's namespace. It can be used again and again by many pods. -``` +```yaml POST: kind: Pod @@ -229,17 +222,14 @@ spec: accessMode: ReadWriteOnce claimRef: name: myclaim-1 - ``` #### Releasing a claim and Recycling a volume When a claim holder is finished with their data, they can delete their claim. -``` - -kubectl delete pvc myclaim-1 - +```console +$ kubectl delete pvc myclaim-1 ``` The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and change the PVs status to 'Released'. diff --git a/resources.md b/resources.md index 055c5d86..7bcce84a 100644 --- a/resources.md +++ b/resources.md @@ -89,7 +89,7 @@ Both users and a number of system components, such as schedulers, (horizontal) a Resource requirements for a container or pod should have the following form: -``` +```yaml resourceRequirementSpec: [ request: [ cpu: 2.5, memory: "40Mi" ], limit: [ cpu: 4.0, memory: "99Mi" ], @@ -103,7 +103,7 @@ Where: Total capacity for a node should have a similar structure: -``` +```yaml resourceCapacitySpec: [ total: [ cpu: 12, memory: "128Gi" ] ] @@ -159,15 +159,16 @@ rather than decimal ones: "64MiB" rather than "64MB". A resource type may have an associated read-only ResourceType structure, that contains metadata about the type. 
For example: -``` +```yaml resourceTypes: [ "kubernetes.io/memory": [ isCompressible: false, ... ] "kubernetes.io/cpu": [ - isCompressible: true, internalScaleExponent: 3, ... + isCompressible: true, + internalScaleExponent: 3, ... ] - "kubernetes.io/disk-space": [ ... } + "kubernetes.io/disk-space": [ ... ] ] ``` @@ -195,7 +196,7 @@ Because resource usage and related metrics change continuously, need to be track Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information: -``` +```yaml resourceStatus: [ usage: [ cpu: , memory: ], maxusage: [ cpu: , memory: ], @@ -205,7 +206,7 @@ resourceStatus: [ where a `` or `` structure looks like this: -``` +```yaml { mean: # arithmetic mean max: # minimum value @@ -218,7 +219,7 @@ where a `` or `` structure looks like this: "99.9": <99.9th-percentile-value>, ... ] - } +} ``` All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. 
_[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_ diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 80bc6566..f5ef348a 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -62,7 +62,7 @@ To facilitate recovery in the case of a crash of the updating process itself, we Recovery is achieved by issuing the same command again: -``` +```sh kubectl rolling-update foo [foo-v2] --image=myimage:v2 ``` -- cgit v1.2.3 From 4bef20df2177f38a04f0cab82d8d1ca5abe8be5c Mon Sep 17 00:00:00 2001 From: Alex Robinson Date: Sun, 19 Jul 2015 05:58:13 +0000 Subject: Replace ``` with ` when emphasizing something inline in docs/ --- admission_control_limit_range.md | 2 +- admission_control_resource_quota.md | 2 +- event_compression.md | 34 +++++++++--------- persistent-storage.md | 2 +- secrets.md | 4 +-- simple-rolling-update.md | 72 ++++++++++++++++++------------------- 6 files changed, 58 insertions(+), 58 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index ccdb44d8..d7a478ab 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -136,7 +136,7 @@ $ kube-apiserver -admission_control=LimitRanger kubectl is modified to support the **LimitRange** resource. -```kubectl describe``` provides a human-readable output of limits. +`kubectl describe` provides a human-readable output of limits. For example, diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 99d5431a..9ac3dd80 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -163,7 +163,7 @@ this being the resource most closely running at the prescribed quota limits. kubectl is modified to support the **ResourceQuota** resource. 
-```kubectl describe``` provides a human-readable output of quota. +`kubectl describe` provides a human-readable output of quota. For example, diff --git a/event_compression.md b/event_compression.md index 29e65917..bbc63155 100644 --- a/event_compression.md +++ b/event_compression.md @@ -38,41 +38,41 @@ This document captures the design of event compression. ## Background -Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate ```image_not_existing``` and ```container_is_waiting``` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](https://github.com/GoogleCloudPlatform/kubernetes/issues/3853)). +Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](https://github.com/GoogleCloudPlatform/kubernetes/issues/3853)). ## Proposal -Each binary that generates events (for example, ```kubelet```) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. +Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. -Event compression should be best effort (not guaranteed). 
Meaning, in the worst case, ```n``` identical (minus timestamp) events may still result in ```n``` event entries. +Event compression should be best effort (not guaranteed). Meaning, in the worst case, `n` identical (minus timestamp) events may still result in `n` event entries. ## Design Instead of a single Timestamp, each event object [contains](../../pkg/api/types.go#L1111) the following fields: - * ```FirstTimestamp util.Time``` + * `FirstTimestamp util.Time` * The date/time of the first occurrence of the event. - * ```LastTimestamp util.Time``` + * `LastTimestamp util.Time` * The date/time of the most recent occurrence of the event. * On first occurrence, this is equal to the FirstTimestamp. - * ```Count int``` + * `Count int` * The number of occurrences of this event between FirstTimestamp and LastTimestamp * On first occurrence, this is 1. Each binary that generates events: * Maintains a historical record of previously generated events: - * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [```pkg/client/record/events_cache.go```](../../pkg/client/record/events_cache.go). + * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go). 
* The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: - * ```event.Source.Component``` - * ```event.Source.Host``` - * ```event.InvolvedObject.Kind``` - * ```event.InvolvedObject.Namespace``` - * ```event.InvolvedObject.Name``` - * ```event.InvolvedObject.UID``` - * ```event.InvolvedObject.APIVersion``` - * ```event.Reason``` - * ```event.Message``` + * `event.Source.Component` + * `event.Source.Host` + * `event.InvolvedObject.Kind` + * `event.InvolvedObject.Namespace` + * `event.InvolvedObject.Name` + * `event.InvolvedObject.UID` + * `event.InvolvedObject.APIVersion` + * `event.Reason` + * `event.Message` * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [```pkg/client/record/event.go```](../../pkg/client/record/event.go)). + * When an event is generated, the previously generated events cache is checked (see [`pkg/client/record/event.go`](../../pkg/client/record/event.go)). * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). 
diff --git a/persistent-storage.md b/persistent-storage.md index 3e9edd3e..d064e701 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -65,7 +65,7 @@ Kubernetes makes no guarantees at runtime that the underlying storage exists or #### Describe available storage -Cluster administrators use the API to manage *PersistentVolumes*. A custom store ```NewPersistentVolumeOrderedIndex``` will index volumes by access modes and sort by storage capacity. The ```PersistentVolumeClaimBinder``` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. +Cluster administrators use the API to manage *PersistentVolumes*. A custom store `NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request. PVs are system objects and, thus, have no namespace. diff --git a/secrets.md b/secrets.md index 8aab1088..876a9390 100644 --- a/secrets.md +++ b/secrets.md @@ -297,7 +297,7 @@ storing it. Secrets contain multiple pieces of data that are presented as differ the secret volume (example: SSH key pair). In order to remove the burden from the end user in specifying every file that a secret consists of, -it should be possible to mount all files provided by a secret with a single ```VolumeMount``` entry +it should be possible to mount all files provided by a secret with a single `VolumeMount` entry in the container specification. ### Secret API Resource @@ -349,7 +349,7 @@ finer points of secrets and resource allocation are fleshed out. 
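The single-`VolumeMount` behavior described above might look roughly like this in a pod spec (secret and mount names are illustrative): one mount exposes every file the secret provides, without enumerating them individually.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ssh-client
spec:
  volumes:
    - name: ssh-keys
      secret:
        secretName: ssh-key-pair   # each piece of data in the secret becomes a file
  containers:
    - name: client
      image: busybox
      volumeMounts:
        - name: ssh-keys
          mountPath: /etc/secret   # a single mount surfaces all of the secret's files
          readOnly: true
```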
### Secret Volume Source -A new `SecretSource` type of volume source will be added to the ```VolumeSource``` struct in the +A new `SecretSource` type of volume source will be added to the `VolumeSource` struct in the API: ```go diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 80bc6566..be38f20e 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -33,15 +33,15 @@ Documentation for other releases can be found at ## Simple rolling update -This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in ```kubectl```. +This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. ### Lightweight rollout -Assume that we have a current replication controller named ```foo``` and it is running image ```image:v1``` +Assume that we have a current replication controller named `foo` and it is running image `image:v1` -```kubectl rolling-update foo [foo-v2] --image=myimage:v2``` +`kubectl rolling-update foo [foo-v2] --image=myimage:v2` If the user doesn't specify a name for the 'next' replication controller, then the 'next' replication controller is renamed to the name of the original replication controller. @@ -50,15 +50,15 @@ Obviously there is a race here, where if you kill the client between delete foo, See [Recovery](#recovery) below If the user does specify a name for the 'next' replication controller, then the 'next' replication controller is retained with its existing name, -and the old 'foo' replication controller is deleted. For the purposes of the rollout, we add a unique-ifying label ```kubernetes.io/deployment``` to both the ```foo``` and ```foo-next``` replication controllers. 
-The value of that label is the hash of the complete JSON representation of the```foo-next``` or```foo``` replication controller. The name of this label can be overridden by the user with the ```--deployment-label-key``` flag. +and the old 'foo' replication controller is deleted. For the purposes of the rollout, we add a unique-ifying label `kubernetes.io/deployment` to both the `foo` and `foo-next` replication controllers. +The value of that label is the hash of the complete JSON representation of the`foo-next` or`foo` replication controller. The name of this label can be overridden by the user with the `--deployment-label-key` flag. #### Recovery If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out. -To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the ```kubernetes.io/``` annotation namespace: - * ```desired-replicas``` The desired number of replicas for this replication controller (either N or zero) - * ```update-partner``` A pointer to the replication controller resource that is the other half of this update (syntax `````` the namespace is assumed to be identical to the namespace of this replication controller.) +To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the `kubernetes.io/` annotation namespace: + * `desired-replicas` The desired number of replicas for this replication controller (either N or zero) + * `update-partner` A pointer to the replication controller resource that is the other half of this update (syntax `` the namespace is assumed to be identical to the namespace of this replication controller.) 
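Concretely, the recovery annotations listed above might appear on a replication controller roughly as follows (replica count and partner name are illustrative):

```yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: foo-next
  annotations:
    kubernetes.io/desired-replicas: "5"  # N for the new controller, zero for the old
    kubernetes.io/update-partner: "foo"  # the other half of this rollout
```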
Recovery is achieved by issuing the same command again: @@ -66,70 +66,70 @@ Recovery is achieved by issuing the same command again: kubectl rolling-update foo [foo-v2] --image=myimage:v2 ``` -Whenever the rolling update command executes, the kubectl client looks for replication controllers called ```foo``` and ```foo-next```, if they exist, an attempt is -made to roll ```foo``` to ```foo-next```. If ```foo-next``` does not exist, then it is created, and the rollout is a new rollout. If ```foo``` doesn't exist, then -it is assumed that the rollout is nearly completed, and ```foo-next``` is renamed to ```foo```. Details of the execution flow are given below. +Whenever the rolling update command executes, the kubectl client looks for replication controllers called `foo` and `foo-next`, if they exist, an attempt is +made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is created, and the rollout is a new rollout. If `foo` doesn't exist, then +it is assumed that the rollout is nearly completed, and `foo-next` is renamed to `foo`. Details of the execution flow are given below. ### Aborting a rollout Abort is assumed to want to reverse a rollout in progress. 
-```kubectl rolling-update foo [foo-v2] --rollback``` +`kubectl rolling-update foo [foo-v2] --rollback` This is really just semantic sugar for: -```kubectl rolling-update foo-v2 foo``` +`kubectl rolling-update foo-v2 foo` -With the added detail that it moves the ```desired-replicas``` annotation from ```foo-v2``` to ```foo``` +With the added detail that it moves the `desired-replicas` annotation from `foo-v2` to `foo` ### Execution Details -For the purposes of this example, assume that we are rolling from ```foo``` to ```foo-next``` where the only change is an image update from `v1` to `v2` +For the purposes of this example, assume that we are rolling from `foo` to `foo-next` where the only change is an image update from `v1` to `v2` -If the user doesn't specify a ```foo-next``` name, then it is either discovered from the ```update-partner``` annotation on ```foo```. If that annotation doesn't exist, -then ```foo-next``` is synthesized using the pattern ```-``` +If the user doesn't specify a `foo-next` name, then it is either discovered from the `update-partner` annotation on `foo`. If that annotation doesn't exist, +then `foo-next` is synthesized using the pattern `-` #### Initialization - * If ```foo``` and ```foo-next``` do not exist: + * If `foo` and `foo-next` do not exist: * Exit, and indicate an error to the user, that the specified controller doesn't exist. - * If ```foo``` exists, but ```foo-next``` does not: - * Create ```foo-next``` populate it with the ```v2``` image, set ```desired-replicas``` to ```foo.Spec.Replicas``` + * If `foo` exists, but `foo-next` does not: + * Create `foo-next` populate it with the `v2` image, set `desired-replicas` to `foo.Spec.Replicas` * Goto Rollout - * If ```foo-next``` exists, but ```foo``` does not: + * If `foo-next` exists, but `foo` does not: * Assume that we are in the rename phase. 
* Goto Rename - * If both ```foo``` and ```foo-next``` exist: + * If both `foo` and `foo-next` exist: * Assume that we are in a partial rollout - * If ```foo-next``` is missing the ```desired-replicas``` annotation - * Populate the ```desired-replicas``` annotation to ```foo-next``` using the current size of ```foo``` + * If `foo-next` is missing the `desired-replicas` annotation + * Populate the `desired-replicas` annotation to `foo-next` using the current size of `foo` * Goto Rollout #### Rollout - * While size of ```foo-next``` < ```desired-replicas``` annotation on ```foo-next``` - * increase size of ```foo-next``` - * if size of ```foo``` > 0 - decrease size of ```foo``` + * While size of `foo-next` < `desired-replicas` annotation on `foo-next` + * increase size of `foo-next` + * if size of `foo` > 0 + decrease size of `foo` * Goto Rename #### Rename - * delete ```foo``` - * create ```foo``` that is identical to ```foo-next``` - * delete ```foo-next``` + * delete `foo` + * create `foo` that is identical to `foo-next` + * delete `foo-next` #### Abort - * If ```foo-next``` doesn't exist + * If `foo-next` doesn't exist * Exit and indicate to the user that they may want to simply do a new rollout with the old version - * If ```foo``` doesn't exist + * If `foo` doesn't exist * Exit and indicate not found to the user - * Otherwise, ```foo-next``` and ```foo``` both exist - * Set ```desired-replicas``` annotation on ```foo``` to match the annotation on ```foo-next``` - * Goto Rollout with ```foo``` and ```foo-next``` trading places. + * Otherwise, `foo-next` and `foo` both exist + * Set `desired-replicas` annotation on `foo` to match the annotation on `foo-next` + * Goto Rollout with `foo` and `foo-next` trading places. 
-- cgit v1.2.3 From 0302cf3c1a511e975f8be11395603a508c52d348 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Mon, 20 Jul 2015 00:25:07 -0700 Subject: Absolutize links that leave the docs/ tree to go anywhere other than to examples/ or back to docs/ --- event_compression.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/event_compression.md b/event_compression.md index 5dfb0311..aea04e41 100644 --- a/event_compression.md +++ b/event_compression.md @@ -48,7 +48,7 @@ Event compression should be best effort (not guaranteed). Meaning, in the worst ## Design -Instead of a single Timestamp, each event object [contains](../../pkg/api/types.go#L1111) the following fields: +Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following fields: * `FirstTimestamp util.Time` * The date/time of the first occurrence of the event. * `LastTimestamp util.Time` @@ -72,7 +72,7 @@ Each binary that generates events: * `event.Reason` * `event.Message` * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [`pkg/client/record/event.go`](../../pkg/client/record/event.go)). + * When an event is generated, the previously generated events cache is checked (see [`pkg/client/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). 
* If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). -- cgit v1.2.3 From 19a1346560fc7b5681e29427e9c1899b5c551b24 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Mon, 20 Jul 2015 09:40:32 -0700 Subject: Collected markedown fixes around syntax. --- admission_control_resource_quota.md | 1 - event_compression.md | 1 - 2 files changed, 2 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 1cc81771..c86577ac 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -100,7 +100,6 @@ type ResourceQuotaList struct { // Items is a list of ResourceQuota objects Items []ResourceQuota `json:"items"` } - ``` ## AdmissionControl plugin: ResourceQuota diff --git a/event_compression.md b/event_compression.md index aea04e41..bfa2c5d6 100644 --- a/event_compression.md +++ b/event_compression.md @@ -103,7 +103,6 @@ Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" Thu, 12 Feb 2015 01:13:20 +0000 
Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal - ``` This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries. -- cgit v1.2.3 From 51f581c03534c250238c6ec0531fc2c1f0f70f95 Mon Sep 17 00:00:00 2001 From: Alex Robinson Date: Mon, 20 Jul 2015 13:45:36 -0700 Subject: Fix capitalization of Kubernetes in the documentation. --- access.md | 2 +- clustering.md | 2 +- expansion.md | 2 +- secrets.md | 4 ++-- security.md | 6 +++--- service_accounts.md | 12 ++++++------ 6 files changed, 14 insertions(+), 14 deletions(-) diff --git a/access.md b/access.md index 9a0c0d3d..d2fe44ca 100644 --- a/access.md +++ b/access.md @@ -200,7 +200,7 @@ Namespaces versus userAccount vs Labels: Goals for K8s authentication: - Include a built-in authentication system with no configuration required to use in single-user mode, and little configuration required to add several user accounts, and no https proxy required. -- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems. The kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users. +- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems. The Kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users. - For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
- So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver. - For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal. diff --git a/clustering.md b/clustering.md index 1fcb8aa3..757c1f0b 100644 --- a/clustering.md +++ b/clustering.md @@ -36,7 +36,7 @@ Documentation for other releases can be found at ## Overview -The term "clustering" refers to the process of having all members of the kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address. +The term "clustering" refers to the process of having all members of the Kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address. Once a cluster is established, the following is true: diff --git a/expansion.md b/expansion.md index 096b8a9d..75c748ca 100644 --- a/expansion.md +++ b/expansion.md @@ -94,7 +94,7 @@ script that sets up the environment and runs the command. This has a number of 1. Solutions that require a shell are unfriendly to images that do not contain a shell 2. Wrapper scripts make it harder to use images as base images -3. Wrapper scripts increase coupling to kubernetes +3. Wrapper scripts increase coupling to Kubernetes Users should be able to do the 80% case of variable expansion in command without writing a wrapper script or adding a shell invocation to their containers' commands. 
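The variable expansion discussed in the expansion.md hunks above (later hunks in this series show the proposed `Expand(input string, mapping func(string) string) string` signature and Makefile-style `$(var)` syntax) can be sketched as a standalone toy. This is a simplified model only: the hypothetical `expand` helper below omits the proposal's escaping rules and error handling.

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces $(var) references in input using the mapping function.
// Escaping (e.g. a literal "$(") from the full proposal is deliberately omitted.
func expand(input string, mapping func(string) string) string {
	var out strings.Builder
	for i := 0; i < len(input); i++ {
		if input[i] == '$' && i+1 < len(input) && input[i+1] == '(' {
			if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
				out.WriteString(mapping(input[i+2 : i+2+end]))
				i += 2 + end // skip past the closing ')'
				continue
			}
		}
		out.WriteByte(input[i])
	}
	return out.String()
}

func main() {
	env := map[string]string{"MESSAGE": "hello", "PORT": "8080"}
	fmt.Println(expand("echo $(MESSAGE) on :$(PORT)", func(k string) string { return env[k] }))
}
```

Passing the mapping as a function rather than a map mirrors the 80% case the text describes: the kubelet can resolve references against a container's environment without the image needing a shell or wrapper script.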
diff --git a/secrets.md b/secrets.md index 876a9390..f5793133 100644 --- a/secrets.md +++ b/secrets.md @@ -81,7 +81,7 @@ Goals of this design: the kubelet implement some reserved behaviors based on the types of secrets the service account consumes: 1. Use credentials for a docker registry to pull the pod's docker image - 2. Present kubernetes auth token to the pod or transparently decorate traffic between the pod + 2. Present Kubernetes auth token to the pod or transparently decorate traffic between the pod and master service 4. As a user, I want to be able to indicate that a secret expires and for that secret's value to be rotated once it expires, so that the system can help me follow good practices @@ -112,7 +112,7 @@ other system components to take action based on the secret's type. #### Example: service account consumes auth token secret As an example, the service account proposal discusses service accounts consuming secrets which -contain kubernetes auth tokens. When a Kubelet starts a pod associated with a service account +contain Kubernetes auth tokens. When a Kubelet starts a pod associated with a service account which consumes this type of secret, the Kubelet may take a number of actions: 1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's diff --git a/security.md b/security.md index 522ff4ca..1d73a529 100644 --- a/security.md +++ b/security.md @@ -55,14 +55,14 @@ While Kubernetes today is not primarily a multi-tenant system, the long term evo We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories: -1. k8s admin - administers a kubernetes cluster and has access to the underlying components of the system +1. k8s admin - administers a Kubernetes cluster and has access to the underlying components of the system 2. k8s project administrator - administrates the security of a small subset of the cluster -3. 
k8s developer - launches pods on a kubernetes cluster and consumes cluster resources +3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster resources Automated process users fall into the following categories: 1. k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources independent of the human users attached to a project -2. k8s infrastructure user - the user that kubernetes infrastructure components use to perform cluster functions with clearly defined roles +2. k8s infrastructure user - the user that Kubernetes infrastructure components use to perform cluster functions with clearly defined roles ### Description of roles diff --git a/service_accounts.md b/service_accounts.md index d9535de5..8e63e045 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -76,7 +76,7 @@ type ServiceAccount struct { ``` The name ServiceAccount is chosen because it is widely used already (e.g. by Kerberos and LDAP) -to refer to this type of account. Note that it has no relation to kubernetes Service objects. +to refer to this type of account. Note that it has no relation to Kubernetes Service objects. The ServiceAccount object does not include any information that could not be defined separately: - username can be defined however users are defined. @@ -90,12 +90,12 @@ These features are explained later. ### Names -From the standpoint of the Kubernetes API, a `user` is any principal which can authenticate to kubernetes API. +From the standpoint of the Kubernetes API, a `user` is any principal which can authenticate to Kubernetes API. This includes a human running `kubectl` on her desktop and a container in a Pod on a Node making API calls. -There is already a notion of a username in kubernetes, which is populated into a request context after authentication. +There is already a notion of a username in Kubernetes, which is populated into a request context after authentication. 
However, there is no API object representing a user. While this may evolve, it is expected that in mature installations, -the canonical storage of user identifiers will be handled by a system external to kubernetes. +the canonical storage of user identifiers will be handled by a system external to Kubernetes. Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or may be qualified to allow for federated identity ( @@ -104,7 +104,7 @@ accounts (e.g. `alice@example.com` vs `build-service-account-a3b7f0@foo-namespac but Kubernetes does not require this. Kubernetes also does not require that there be a distinction between human and Pod users. It will be possible -to setup a cluster where Alice the human talks to the kubernetes API as username `alice` and starts pods that +to setup a cluster where Alice the human talks to the Kubernetes API as username `alice` and starts pods that also talk to the API as user `alice` and write files to NFS as user `alice`. But, this is not recommended. Instead, it is recommended that Pods and Humans have distinct identities, and reference implementations will @@ -153,7 +153,7 @@ get a `Secret` which allows them to authenticate to the Kubernetes APIserver as policy that is desired can be applied to them. A higher level workflow is needed to coordinate creation of serviceAccounts, secrets and relevant policy objects. -Users are free to extend kubernetes to put this business logic wherever is convenient for them, though the +Users are free to extend Kubernetes to put this business logic wherever is convenient for them, though the Service Account Finalizer is one place where this can happen (see below). ### Kubelet -- cgit v1.2.3 From 39c004737b4cd86da696231aa55c4d8eabb11994 Mon Sep 17 00:00:00 2001 From: Brian Grant Date: Wed, 22 Jul 2015 20:16:41 +0000 Subject: Update post-1.0 release versioning proposal. 
--- versioning.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/versioning.md b/versioning.md index 8a547242..5d1cec0e 100644 --- a/versioning.md +++ b/versioning.md @@ -40,14 +40,13 @@ Legend: ## Release Timeline -### Minor version timeline - -* Kube 1.0.0 -* Kube 1.0.x: We create a 1.0-patch branch and backport critical bugs and security issues to it. Patch releases occur as needed. -* Kube 1.1-alpha1: Cut from HEAD, smoke tested and released two weeks after Kube 1.0's release. Roughly every two weeks a new alpha is released from HEAD. The timeline is flexible; for example, if there is a critical bugfix, a new alpha can be released ahead of schedule. (This applies to the beta and rc releases as well.) -* Kube 1.1-beta1: When HEAD is feature complete, we create a 1.1-snapshot branch and release it as a beta. (The 1.1-snapshot branch may be created earlier if something that definitely won't be in 1.1 needs to be merged to HEAD.) This should occur 6-8 weeks after Kube 1.0. Development continues at HEAD and only fixes are backported to 1.1-snapshot. -* Kube 1.1-rc1: Released from 1.1-snapshot when it is considered stable and ready for testing. Most users should be able to upgrade to this version in production. -* Kube 1.1: Final release. Should occur between 3 and 4 months after 1.0. +### Minor version scheme and timeline + +* Kube 1.0.0, 1.0.1 -- DONE! +* Kube 1.0.X (X>1): Standard operating procedure. We patch the release-1.0 branch as needed and increment the patch number. +* Kube 1.1alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. (This applies to the beta releases as well.) +* Kube 1.1beta.X: When HEAD is feature-complete, we go into code freeze 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. Releases continue to be cut from HEAD until we're essentially done. 
+* Kube 1.1.0: Final release. Should occur between 3 and 4 months after 1.0. ### Major version timeline -- cgit v1.2.3 From d6200d0d492377b82ca3afbec822230f857732d8 Mon Sep 17 00:00:00 2001 From: Janet Kuo Date: Wed, 22 Jul 2015 17:16:28 -0700 Subject: Fix doc typos --- networking.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/networking.md b/networking.md index b1d5a460..dfe0f93e 100644 --- a/networking.md +++ b/networking.md @@ -87,7 +87,7 @@ whereas, in general, they don't control what pods land together on a host. ## Pod to pod Because every pod gets a "real" (not machine-private) IP address, pods can -communicate without proxies or translations. The can use well-known port +communicate without proxies or translations. The pod can use well-known port numbers and can avoid the use of higher-level service discovery systems like DNS-SD, Consul, or Etcd. -- cgit v1.2.3 From bd03d6d49788d5dd62e686dcaa3f641b964cea58 Mon Sep 17 00:00:00 2001 From: Brian Grant Date: Thu, 23 Jul 2015 00:42:03 +0000 Subject: Change to semantic versioning. --- versioning.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/versioning.md b/versioning.md index 5d1cec0e..9009dc59 100644 --- a/versioning.md +++ b/versioning.md @@ -44,8 +44,8 @@ Legend: * Kube 1.0.0, 1.0.1 -- DONE! * Kube 1.0.X (X>1): Standard operating procedure. We patch the release-1.0 branch as needed and increment the patch number. -* Kube 1.1alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. (This applies to the beta releases as well.) -* Kube 1.1beta.X: When HEAD is feature-complete, we go into code freeze 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. Releases continue to be cut from HEAD until we're essentially done. +* Kube 1.1.0-alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. 
If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. (This applies to the beta releases as well.) +* Kube 1.1.0-beta.X: When HEAD is feature-complete, we go into code freeze 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. Releases continue to be cut from HEAD until we're essentially done. * Kube 1.1.0: Final release. Should occur between 3 and 4 months after 1.0. ### Major version timeline -- cgit v1.2.3 From 39eedfac6dd5d287611b2b21d60af7a19560aae8 Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Mon, 20 Jul 2015 11:45:58 -0500 Subject: Rewrite how the munger works The basic idea is that in the main mungedocs we run the entirefile and create an annotated set of lines about that file. All mungers then act on a struct mungeLines instead of on a bytes array. Making use of the metadata where appropriete. Helper functions exist to make updating a 'macro block' extremely easy. --- security_context.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security_context.md b/security_context.md index 03213927..8a6dd314 100644 --- a/security_context.md +++ b/security_context.md @@ -114,7 +114,7 @@ It is recommended that this design be implemented in two phases: 2. Implement a security context structure that is part of a service account. The default context provider can then be used to apply a security context based on the service account associated with the pod. - + ### Security Context Provider The Kubelet will have an interface that points to a `SecurityContextProvider`. 
The `SecurityContextProvider` is invoked before creating and running a given container: -- cgit v1.2.3 From b15dad5066d0fb1bd39b514230bfc8b2328ea72c Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Fri, 24 Jul 2015 17:52:18 -0400 Subject: Fix trailing whitespace in all docs --- README.md | 2 +- admission_control_resource_quota.md | 6 +++--- architecture.md | 2 +- event_compression.md | 2 +- expansion.md | 8 ++++---- namespaces.md | 14 +++++++------- persistent-storage.md | 12 ++++++------ principles.md | 8 ++++---- resources.md | 10 +++++----- secrets.md | 6 +++--- security_context.md | 34 +++++++++++++++++----------------- simple-rolling-update.md | 4 ++-- 12 files changed, 54 insertions(+), 54 deletions(-) diff --git a/README.md b/README.md index 62946cb6..72d2c662 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ Documentation for other releases can be found at # Kubernetes Design Overview -Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. +Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration. diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index c86577ac..136603d2 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -104,7 +104,7 @@ type ResourceQuotaList struct { ## AdmissionControl plugin: ResourceQuota -The **ResourceQuota** plug-in introspects all incoming admission requests. 
+The **ResourceQuota** plug-in introspects all incoming admission requests. It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. @@ -125,7 +125,7 @@ Any resource that is not part of core Kubernetes must follow the resource naming This means the resource must have a fully-qualified name (i.e. mycompany.org/shinynewresource) If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a -**ResourceQuotaUsage** document to the server to atomically update the observed usage based on the previously read +**ResourceQuotaUsage** document to the server to atomically update the observed usage based on the previously read **ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) into the system. @@ -184,7 +184,7 @@ resourcequotas 1 1 services 3 5 ``` -## More information +## More information See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota/) for more information. diff --git a/architecture.md b/architecture.md index f7c55171..5f829d68 100644 --- a/architecture.md +++ b/architecture.md @@ -47,7 +47,7 @@ Each node runs Docker, of course. Docker takes care of the details of downloadi ### Kubelet -The **Kubelet** manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. +The **Kubelet** manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. ### Kube-Proxy diff --git a/event_compression.md b/event_compression.md index bfa2c5d6..ce8d1ad4 100644 --- a/event_compression.md +++ b/event_compression.md @@ -49,7 +49,7 @@ Event compression should be best effort (not guaranteed). 
Meaning, in the worst ## Design Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following fields: - * `FirstTimestamp util.Time` + * `FirstTimestamp util.Time` * The date/time of the first occurrence of the event. * `LastTimestamp util.Time` * The date/time of the most recent occurrence of the event. diff --git a/expansion.md b/expansion.md index 75c748ca..24a07f0d 100644 --- a/expansion.md +++ b/expansion.md @@ -87,7 +87,7 @@ available to subsequent expansions. ### Use Case: Variable expansion in command -Users frequently need to pass the values of environment variables to a container's command. +Users frequently need to pass the values of environment variables to a container's command. Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a shell in the container's command and have the shell perform the substitution, or to write a wrapper script that sets up the environment and runs the command. This has a number of drawbacks: @@ -130,7 +130,7 @@ The exact syntax for variable expansion has a large impact on how users perceive feature. We considered implementing a very restrictive subset of the shell `${var}` syntax. This syntax is an attractive option on some level, because many people are familiar with it. However, this syntax also has a large number of lesser known features such as the ability to provide -default values for unset variables, perform inline substitution, etc. +default values for unset variables, perform inline substitution, etc. In the interest of preventing conflation of the expansion feature in Kubernetes with the shell feature, we chose a different syntax similar to the one in Makefiles, `$(var)`. We also chose not @@ -239,7 +239,7 @@ The necessary changes to implement this functionality are: `ObjectReference` and an `EventRecorder` 2. Introduce `third_party/golang/expansion` package that provides: 1. 
An `Expand(string, func(string) string) string` function - 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` function + 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` function 3. Make the kubelet expand environment correctly 4. Make the kubelet expand command correctly @@ -311,7 +311,7 @@ func Expand(input string, mapping func(string) string) string { #### Kubelet changes -The Kubelet should be made to correctly expand variable references in a container's environment, +The Kubelet should be made to correctly expand variable references in a container's environment, command, and args. Changes will need to be made to: 1. The `makeEnvironmentVariables` function in the kubelet; this is used by diff --git a/namespaces.md b/namespaces.md index da3bb2c5..596f6f43 100644 --- a/namespaces.md +++ b/namespaces.md @@ -52,7 +52,7 @@ Each user community has its own: A cluster operator may create a Namespace for each unique user community. -The Namespace provides a unique scope for: +The Namespace provides a unique scope for: 1. named resources (to avoid basic naming collisions) 2. delegated management authority to trusted users @@ -142,7 +142,7 @@ type NamespaceSpec struct { A *FinalizerName* is a qualified name. -The API Server enforces that a *Namespace* can only be deleted from storage if and only if +The API Server enforces that a *Namespace* can only be deleted from storage if and only if its *Namespace.Spec.Finalizers* is empty. A *finalize* operation is the only mechanism to modify the *Namespace.Spec.Finalizers* field post creation. @@ -189,12 +189,12 @@ are known to the cluster. The *namespace controller* enumerates each known resource type in that namespace and deletes it one by one.
Admission control blocks creation of new resources in that namespace in order to prevent a race-condition -where the controller could believe all of a given resource type had been deleted from the namespace, +where the controller could believe all of a given resource type had been deleted from the namespace, when in fact some other rogue client agent had created new objects. Using admission control in this scenario allows each of the registry implementations for the individual objects to not need to take into account Namespace life-cycle. Once all objects known to the *namespace controller* have been deleted, the *namespace controller* -executes a *finalize* operation on the namespace that removes the *kubernetes* value from +executes a *finalize* operation on the namespace that removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list. If the *namespace controller* sees a *Namespace* whose *ObjectMeta.DeletionTimestamp* is set, and @@ -245,13 +245,13 @@ In etcd, we want to continue to still support efficient WATCH across namespaces. Resources that persist content in etcd will have storage paths as follows: -/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name} +/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name} This enables consumers to WATCH /registry/{resourceType} for changes across namespace of a particular {resourceType}. ### Kubelet -The kubelet will register pods it sources from a file or http source with a namespace associated with the +The kubelet will register pods it sources from a file or http source with a namespace associated with the *cluster-id* ### Example: OpenShift Origin managing a Kubernetes Namespace @@ -362,7 +362,7 @@ This results in the following state: At this point, the Kubernetes *namespace controller* in its sync loop will see that the namespace has a deletion timestamp and that its list of finalizers is empty.
As a result, it knows all -content associated with that namespace has been purged. It performs a final DELETE action +content associated with that namespace has been purged. It performs a final DELETE action to remove that Namespace from the storage. At this point, all content associated with that Namespace, and the Namespace itself are gone. diff --git a/persistent-storage.md b/persistent-storage.md index 51cfce89..bb200811 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -41,11 +41,11 @@ Two new API kinds: A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) for how to use it. -A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod. +A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod. One new system component: -`PersistentVolumeClaimBinder` is a singleton running in master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage. +`PersistentVolumeClaimBinder` is a singleton running in master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage. One new volume: @@ -69,7 +69,7 @@ Cluster administrators use the API to manage *PersistentVolumes*. A custom stor PVs are system objects and, thus, have no namespace. -Many means of dynamic provisioning will eventually be implemented for various storage types. +Many means of dynamic provisioning will eventually be implemented for various storage types.
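The `PersistentVolumeClaimBinder` behavior described in the persistent-storage.md hunks above — watch claims and bind each to the closest matching available `PersistentVolume` — can be sketched as a toy matcher. The types and the `bind` helper below are hypothetical simplifications; the real binder also considers access modes, status phases, and operates through API watches rather than in-memory slices.

```go
package main

import "fmt"

type PersistentVolume struct {
	Name     string
	Capacity int64  // bytes, in base units
	ClaimRef string // empty while the volume is available
}

type PersistentVolumeClaim struct {
	Name    string
	Request int64 // bytes
}

// bind picks the smallest unbound volume that still satisfies the claim's
// request — one reading of "closest matching available PersistentVolume" —
// and records the binding on the volume.
func bind(pvs []*PersistentVolume, pvc PersistentVolumeClaim) *PersistentVolume {
	var best *PersistentVolume
	for _, pv := range pvs {
		if pv.ClaimRef != "" || pv.Capacity < pvc.Request {
			continue // already bound, or too small
		}
		if best == nil || pv.Capacity < best.Capacity {
			best = pv
		}
	}
	if best != nil {
		best.ClaimRef = pvc.Name
	}
	return best
}

func main() {
	pvs := []*PersistentVolume{
		{Name: "pv-small", Capacity: 5 << 30},
		{Name: "pv-large", Capacity: 100 << 30},
	}
	pv := bind(pvs, PersistentVolumeClaim{Name: "myclaim-1", Request: 3 << 30})
	fmt.Println(pv.Name) // the 5Gi volume is the closest match
}
```

Choosing the smallest sufficient volume keeps large volumes free for large claims, which is why "closest match" rather than "first match" matters in the design.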
##### PersistentVolume API @@ -116,7 +116,7 @@ TBD #### Events -The implementation of persistent storage will not require events to communicate to the user the state of their claim. The CLI for bound claims contains a reference to the backing persistent volume. This is always present in the API and CLI, making an event to communicate the same unnecessary. +The implementation of persistent storage will not require events to communicate to the user the state of their claim. The CLI for bound claims contains a reference to the backing persistent volume. This is always present in the API and CLI, making an event to communicate the same unnecessary. Events that communicate the state of a mounted volume are left to the volume plugins. @@ -232,9 +232,9 @@ When a claim holder is finished with their data, they can delete their claim. $ kubectl delete pvc myclaim-1 ``` -The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'. +The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'. -Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. +Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. diff --git a/principles.md b/principles.md index c208fb6b..23a20349 100644 --- a/principles.md +++ b/principles.md @@ -33,7 +33,7 @@ Documentation for other releases can be found at # Design Principles -Principles to follow when extending Kubernetes. +Principles to follow when extending Kubernetes. ## API @@ -44,14 +44,14 @@ See also the [API conventions](../devel/api-conventions.md). * The control plane should be transparent -- there are no hidden internal APIs. * The cost of API operations should be proportional to the number of objects intentionally operated upon.
Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior. * Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation. -* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components; that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation. +* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components; that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation. * Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs). ## Control logic * Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization. * Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them.
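The level-based principle stated above — operate on desired versus observed state, with edge-triggered behavior only an optimization — is what lets a replication controller simply replace pods a user kills. A toy sketch of such a sync step (the `reconcile` helper is hypothetical, not the actual controller code, which works through the apiserver):

```go
package main

import "fmt"

// reconcile looks only at the desired replica count and the currently
// observed pods, and returns what must change to converge. Because it never
// depends on which add/delete events were seen, missed updates are harmless.
func reconcile(desired int, observed []string) (create int, remove []string) {
	if len(observed) < desired {
		return desired - len(observed), nil
	}
	return 0, observed[desired:]
}

func main() {
	// A user killed two of three pods out from under the controller; the
	// next sync loop observes current state and schedules replacements.
	n, _ := reconcile(3, []string{"pod-a"})
	fmt.Println("create:", n)
}
```

Running the same function on every sync, regardless of what triggered it, is the degenerate but robust form of level-based control the principles call for.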
-* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation. +* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation. * Don't assume a component's decisions will not be overridden or rejected, nor that the component will always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions. * Components should be self-healing. For example, if you must keep some state (e.g., cache) the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed etc, it will soon be fixed, ideally on timescales that are shorter than what will attract attention from humans. * Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure. @@ -61,7 +61,7 @@ See also the [API conventions](../devel/api-conventions.md). * Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.). * Compromising a single node shouldn't compromise the cluster. * Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage). -* All components should keep all relevant state in memory all the time.
The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients. * Watch is preferred over polling. ## Extensibility diff --git a/resources.md b/resources.md index 7bcce84a..e006d44d 100644 --- a/resources.md +++ b/resources.md @@ -51,7 +51,7 @@ The resource model aims to be: A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth. -Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_. +Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_. Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later. @@ -70,7 +70,7 @@ For future reference, note that some resources, such as CPU and network bandwidt ### Resource quantities -Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per seconds for bandwidth, instances for software licences). 
The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?). +Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?). Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources. @@ -110,7 +110,7 @@ resourceCapacitySpec: [ ``` Where: -* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes. +* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes. #### Notes @@ -194,7 +194,7 @@ The following are planned future extensions to the resource model, included here Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD. 
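The base-unit convention above can be illustrated with a short sketch: once every quantity is an integer count of the natural base unit (bytes), addition, subtraction, and feasibility checking are plain arithmetic, and the MiB/MB ambiguity becomes a presentation-only concern. The constants and the `fits` helper below are illustrative, not a Kubernetes API.

```go
package main

import "fmt"

// Quantities are held in the base unit (bytes); decimal vs. binary
// multipliers matter only when rendering for humans. Illustrative only.
const (
	MB  = 1000 * 1000 // decimal megabyte
	MiB = 1024 * 1024 // binary mebibyte
)

// fits is a minimal feasibility check: a request is allocatable iff the
// already-allocated total plus the request stays within capacity.
func fits(capacityBytes, allocatedBytes, requestBytes int64) bool {
	return allocatedBytes+requestBytes <= capacityBytes
}

func main() {
	fmt.Println(MiB - MB)                        // 48576: why "128 MB" alone is ambiguous
	fmt.Println(fits(512*MiB, 400*MiB, 100*MiB)) // true
	fmt.Println(fits(512*MiB, 400*MiB, 200*MiB)) // false: would over-commit
}
```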
-Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information: +Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information: ```yaml resourceStatus: [ @@ -223,7 +223,7 @@ where a `` or `` structure looks like this: ``` All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_ -and predicted +and predicted ## Future resource types diff --git a/secrets.md b/secrets.md index f5793133..3adc57af 100644 --- a/secrets.md +++ b/secrets.md @@ -34,7 +34,7 @@ Documentation for other releases can be found at ## Abstract A proposal for the distribution of [secrets](../user-guide/secrets.md) (passwords, keys, etc) to the Kubelet and to -containers inside Kubernetes using a custom [volume](../user-guide/volumes.md#secrets) type. See the [secrets example](../user-guide/secrets/) for more information. +containers inside Kubernetes using a custom [volume](../user-guide/volumes.md#secrets) type. See the [secrets example](../user-guide/secrets/) for more information. ## Motivation @@ -117,7 +117,7 @@ which consumes this type of secret, the Kubelet may take a number of actions: 1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's file system -2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the +2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the `kubernetes-master` service with the auth token, e. g. 
by adding a header to the request (see the [LOAS Daemon](https://github.com/GoogleCloudPlatform/kubernetes/issues/2209) proposal) @@ -146,7 +146,7 @@ We should consider what the best way to allow this is; there are a few different export MY_SECRET_ENV=MY_SECRET_VALUE The user could `source` the file at `/etc/secrets/my-secret` prior to executing the command for - the image either inline in the command or in an init script, + the image either inline in the command or in an init script, 2. Give secrets an attribute that allows users to express the intent that the platform should generate the above syntax in the file used to present a secret. The user could consume these diff --git a/security_context.md b/security_context.md index 8a6dd314..7a80c01d 100644 --- a/security_context.md +++ b/security_context.md @@ -48,55 +48,55 @@ The problem of securing containers in Kubernetes has come up [before](https://gi ### Container isolation -In order to improve container isolation from host and other containers running on the host, containers should only be -granted the access they need to perform their work. To this end it should be possible to take advantage of Docker -features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) +In order to improve container isolation from host and other containers running on the host, containers should only be +granted the access they need to perform their work. To this end it should be possible to take advantage of Docker +features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) to the container process. 
Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers. ### External integration with shared storage -In order to support external integration with shared storage, processes running in a Kubernetes cluster -should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established. +In order to support external integration with shared storage, processes running in a Kubernetes cluster +should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established. Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks. ## Constraints and Assumptions -* It is out of the scope of this document to prescribe a specific set +* It is out of the scope of this document to prescribe a specific set of constraints to isolate containers from their host. Different use cases need different settings. -* The concept of a security context should not be tied to a particular security mechanism or platform +* The concept of a security context should not be tied to a particular security mechanism or platform (i.e., SELinux, AppArmor) * Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for [service accounts](service_accounts.md). ## Use Cases -In order of increasing complexity, following are example use cases that would +In order of increasing complexity, the following are example use cases that would be addressed with security contexts: 1. Kubernetes is used to run a single cloud application. 
In order to protect nodes from containers: * All containers run as a single non-root user * Privileged containers are disabled - * All containers run with a particular MCS label + * All containers run with a particular MCS label * Kernel capabilities like CHOWN and MKNOD are removed from containers - + 2. Just like case #1, except that I have more than one application running on the Kubernetes cluster. * Each application is run in its own namespace to avoid name collisions * For each application a different uid and MCS label is used - -3. Kubernetes is used as the base for a PAAS with - multiple projects, each project represented by a namespace. + +3. Kubernetes is used as the base for a PAAS with + multiple projects, each project represented by a namespace. * Each namespace is associated with a range of uids/gids on the node that - are mapped to uids/gids on containers using linux user namespaces. + are mapped to uids/gids on containers using linux user namespaces. * Certain pods in each namespace have special privileges to perform system actions such as talking back to the server for deployment, run docker builds, etc. * External NFS storage is assigned to each namespace and permissions set - using the range of uids/gids assigned to that namespace. + using the range of uids/gids assigned to that namespace. ## Proposed Design @@ -109,7 +109,7 @@ to mutate Docker API calls in order to apply the security context. It is recommended that this design be implemented in two phases: -1. Implement the security context provider extension point in the Kubelet +1. Implement the security context provider extension point in the Kubelet so that a default security context can be applied on container run and creation. 2. Implement a security context structure that is part of a service account. 
The default context provider can then be used to apply a security context based @@ -137,7 +137,7 @@ type SecurityContextProvider interface { } ``` -If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today. +If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today. ### Security Context diff --git a/simple-rolling-update.md b/simple-rolling-update.md index d99e7b25..720f4cbf 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -33,9 +33,9 @@ Documentation for other releases can be found at ## Simple rolling update -This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. +This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. -Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. +Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. ### Lightweight rollout -- cgit v1.2.3 From 4a1dcd958ef57876885631f8b19b8cc803e6316e Mon Sep 17 00:00:00 2001 From: Ananya Kumar Date: Thu, 30 Jul 2015 20:02:06 -0700 Subject: Update admission_control.md I tested out a Limit Ranger, and it seems like the admission happens *before* Validation. Please correct me if I'm wrong though, I didn't look at the code in detail. In any case, I think it makes sense for admission to happen before validation because code in admission can change containers. 
By the way I think it's pretty hard to find flows like this in the code, so it's useful if we add links to code in the design docs (for prospective developers) :) --- admission_control.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/admission_control.md b/admission_control.md index c75d5535..8cc6cf03 100644 --- a/admission_control.md +++ b/admission_control.md @@ -104,9 +104,9 @@ will ensure the following: 1. Incoming request 2. Authenticate user 3. Authorize user -4. If operation=create|update, then validate(object) -5. If operation=create|update|delete, then admission.Admit(requestAttributes) - a. invoke each admission.Interface object in sequence +4. If operation=create|update|delete, then admission.Admit(requestAttributes) + a. invoke each admission.Interface object in sequence +5. If operation=create|update, then validate(object) 6. Object is persisted If at any step, there is an error, the request is canceled. -- cgit v1.2.3 From 0a0fbb58fe67fbfb864145956bf3b8b86625d190 Mon Sep 17 00:00:00 2001 From: Ananya Kumar Date: Mon, 3 Aug 2015 23:00:48 -0700 Subject: Update admission_control.md --- admission_control.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/admission_control.md b/admission_control.md index 8cc6cf03..b84b2543 100644 --- a/admission_control.md +++ b/admission_control.md @@ -98,16 +98,17 @@ func init() { Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations. -This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow -will ensure the following: +This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow will ensure the following: 1. Incoming request 2. Authenticate user 3. Authorize user -4. If operation=create|update|delete, then admission.Admit(requestAttributes) - a. 
invoke each admission.Interface object in sequence -5. If operation=create|update, then validate(object) -6. Object is persisted +4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes) + - invoke each admission.Interface object in sequence +5. Case on the operation: + - If operation=create|update, then validate(object) and persist + - If operation=delete, delete the object + - If operation=connect, exec If at any step, there is an error, the request is canceled. -- cgit v1.2.3 From 3c23de245b41ab8b3d027af5ca9a4e7cf83fc4d3 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Tue, 4 Aug 2015 10:46:51 -0400 Subject: LimitRange documentation should be under admin --- admission_control_limit_range.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index b1baf1f0..595d72e9 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -142,7 +142,7 @@ For example, ```console $ kubectl namespace myspace -$ kubectl create -f docs/user-guide/limitrange/limits.yaml +$ kubectl create -f docs/admin/limitrange/limits.yaml $ kubectl get limits NAME limits @@ -166,7 +166,7 @@ To make a **LimitRangeItem** more restrictive, we will intend to add these addit ## Example -See the [example of Limit Range](../user-guide/limitrange/) for more information. +See the [example of Limit Range](../admin/limitrange/) for more information. 
-- cgit v1.2.3 From 94ec57fba832c57a013d9acc9bff51d8b4a42ce3 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Wed, 5 Aug 2015 14:06:36 -0400 Subject: Update resource quota design to align with requests and limits --- admission_control_resource_quota.md | 148 ++++++++++++++++++++++-------------- 1 file changed, 91 insertions(+), 57 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 136603d2..bb7c6e0a 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -35,13 +35,17 @@ Documentation for other releases can be found at ## Background -This document proposes a system for enforcing hard resource usage limits per namespace as part of admission control. +This document describes a system for enforcing hard resource usage limits per namespace as part of admission control. -## Model Changes +## Use cases -A new resource, **ResourceQuota**, is introduced to enumerate hard resource limits in a Kubernetes namespace. +1. Ability to enumerate resource usage limits per namespace. +2. Ability to monitor resource usage for tracked resources. +3. Ability to reject resource usage exceeding hard quotas. -A new resource, **ResourceQuotaUsage**, is introduced to support atomic updates of a **ResourceQuota** status. +## Data Model + +The **ResourceQuota** object is scoped to a **Namespace**. 
```go // The following identify resource constants for Kubernetes object types @@ -54,109 +58,139 @@ const ( ResourceReplicationControllers ResourceName = "replicationcontrollers" // ResourceQuotas, number ResourceQuotas ResourceName = "resourcequotas" + // ResourceSecrets, number + ResourceSecrets ResourceName = "secrets" + // ResourcePersistentVolumeClaims, number + ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims" ) // ResourceQuotaSpec defines the desired hard limits to enforce for Quota type ResourceQuotaSpec struct { // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty"` + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` } // ResourceQuotaStatus defines the enforced hard limits and observed use type ResourceQuotaStatus struct { // Hard is the set of enforced hard limits for each named resource - Hard ResourceList `json:"hard,omitempty"` + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` // Used is the current observed total usage of the resource in the namespace - Used ResourceList `json:"used,omitempty"` + Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"` } // ResourceQuota sets aggregate quota restrictions enforced per namespace type ResourceQuota struct { TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` + ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` // Spec defines the desired quota - Spec ResourceQuotaSpec 
`json:"spec,omitempty"` - - // Status defines the actual enforced quota and its current usage - Status ResourceQuotaStatus `json:"status,omitempty"` -} - -// ResourceQuotaUsage captures system observed quota status per namespace -// It is used to enforce atomic updates of a backing ResourceQuota.Status field in storage -type ResourceQuotaUsage struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` + Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` // Status defines the actual enforced quota and its current usage - Status ResourceQuotaStatus `json:"status,omitempty"` + Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` } // ResourceQuotaList is a list of ResourceQuota items type ResourceQuotaList struct { TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty"` + ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` // Items is a list of ResourceQuota objects - Items []ResourceQuota `json:"items"` + Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` } ``` -## AdmissionControl plugin: ResourceQuota +## Quota Tracked Resources -The **ResourceQuota** plug-in introspects all incoming admission requests. +The following resources are supported by the quota system. -It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request -namespace. 
If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. - -The following resource limits are imposed as part of core Kubernetes at the namespace level: - -| ResourceName | Description | +| Resource | Description | | ------------ | ----------- | -| cpu | Total cpu usage | -| memory | Total memory usage | -| pods | Total number of pods | +| cpu | Total requested cpu usage | +| memory | Total requested memory usage | +| pods | Total number of active pods where phase is pending or active. | | services | Total number of services | | replicationcontrollers | Total number of replication controllers | | resourcequotas | Total number of resource quotas | +| secrets | Total number of secrets | +| persistentvolumeclaims | Total number of persistent volume claims | -Any resource that is not part of core Kubernetes must follow the resource naming convention prescribed by Kubernetes. +If a third-party wants to track additional resources, it must follow the resource naming conventions prescribed +by Kubernetes. This means the resource must have a fully-qualified name (i.e. mycompany.org/shinynewresource) -This means the resource must have a fully-qualified name (i.e. mycompany.org/shinynewresource) +## Resource Requirements: Requests vs Limits -If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a -**ResourceQuotaUsage** document to the server to atomically update the observed usage based on the previously read -**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) -into the system. +If a resource supports the ability to distinguish between a request and a limit for a resource, +the quota tracking system will only cost the request value against the quota usage. 
If a resource +is tracked by quota, and no request value is provided, the associated entity is rejected as part of admission. -To optimize system performance, it is encouraged that all resource quotas are tracked on the same **ResourceQuota** document. As a result, -its encouraged to actually impose a cap on the total number of individual quotas that are tracked in the **Namespace** to 1 by explicitly -capping it in **ResourceQuota** document. +For an example, consider the following scenarios relative to tracking quota on CPU: -## kube-apiserver +| Pod | Container | Request CPU | Limit CPU | Result | +| --- | --------- | ----------- | --------- | ------ | +| X | C1 | 100m | 500m | The quota usage is incremented 100m | +| Y | C2 | 100m | none | The quota usage is incremented 100m | +| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit | +| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. | -The server is updated to be aware of **ResourceQuota** objects. +The rationale for accounting for the requested amount of a resource versus the limit is the belief +that a user should only be charged for what they are scheduled against in the cluster. In addition, +attempting to track usage against actual usage, where request < actual < limit, is considered highly +volatile. -The quota is only enforced if the kube-apiserver is started as follows: +As a consequence of this decision, the user is able to spread its usage of a resource across multiple tiers +of service. Let's demonstrate this via an example with a 4 cpu quota. 
-```console -$ kube-apiserver -admission_control=ResourceQuota -``` +The quota may be allocated as follows: + +| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage | +| --- | --------- | ----------- | --------- | ---- | ----------- | +| X | C1 | 1 | 4 | Burstable | 1 | +| Y | C2 | 2 | 2 | Guaranteed | 2 | +| Z | C3 | 1 | 3 | Burstable | 1 | -## kube-controller-manager +It is possible that the pods may consume 9 cpu over a given time period depending on the nodes' available cpu +that held pod X and Z, but since we scheduled X and Z relative to the request, we only track the requested +value against their allocated quota. If one wants to restrict the ratio between the request and limit, +it is encouraged that the user define a **LimitRange** with **LimitRequestRatio** to control burst-out behavior. +This would, in effect, let an administrator keep the difference between request and limit more in line with +tracked usage if desired. -A new controller is defined that runs a synch loop to calculate quota usage across the namespace. +## Status API -**ResourceQuota** usage is only calculated if a namespace has a **ResourceQuota** object. +A REST API endpoint to update the status section of the **ResourceQuota** is exposed. It requires an atomic compare-and-swap +in order to keep resource usage tracking consistent. -If the observed usage is different than the recorded usage, the controller sends a **ResourceQuotaUsage** resource -to the server to atomically update. +The synchronization loop frequency will control how quickly DELETE actions are recorded in the system and usage is ticked down. +## Resource Quota Controller +A resource quota controller monitors observed usage for tracked resources in the **Namespace**. + +If there is an observed difference between the current usage stats versus the current **ResourceQuota.Status**, the controller +posts an update of the currently observed usage metrics to the **ResourceQuota** via the /status endpoint. 
+ + +The resource quota controller is the only component capable of monitoring and recording usage updates after a DELETE operation +since admission control is incapable of guaranteeing a DELETE request actually succeeded. + +## AdmissionControl plugin: ResourceQuota + +The **ResourceQuota** plug-in introspects all incoming admission requests. + +To enable the plug-in and support for ResourceQuota, the kube-apiserver must be configured as follows: + +``` +$ kube-apiserver -admission_control=ResourceQuota +``` + +It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request +namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. + +If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a +**ResourceQuota.Status** document to the server to atomically update the observed usage based on the previously read +**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) +into the system. -To optimize the synchronization loop, this controller will WATCH on Pod resources to track DELETE events, and in response, recalculate -usage. This is because a Pod deletion will have the most impact on observed cpu and memory usage in the system, and we anticipate -this being the resource most closely running at the prescribed quota limits. +To optimize system performance, it is encouraged that all resource quotas are tracked on the same **ResourceQuota** document in a **Namespace**. As a result, it's encouraged to cap the number of individual **ResourceQuota** documents tracked in the **Namespace** +at 1. 
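The quota accounting described above can be sketched in a few lines: charge the request value, default a missing request to the limit, and reject when neither is set or when the hard limit would be exceeded. The function names and millicore representation are illustrative only; the real plug-in also performs the atomic compare-and-swap on **ResourceQuota.ResourceVersion**, which is elided here.

```go
package main

import (
	"errors"
	"fmt"
)

// charge applies the request-vs-limit rule: quota is costed at the
// requested amount, a missing request defaults to the limit, and an
// entity with neither is rejected at admission. Illustrative names.
func charge(requestMilli, limitMilli int64) (int64, error) {
	switch {
	case requestMilli > 0:
		return requestMilli, nil
	case limitMilli > 0:
		return limitMilli, nil // request defaults to limit
	default:
		return 0, errors.New("no request enumerated: rejected by quota")
	}
}

// admit checks the charge against the hard limit and returns the new
// usage; a real implementation would CAS this into ResourceQuota.Status.
func admit(hard, used, requestMilli, limitMilli int64) (int64, error) {
	c, err := charge(requestMilli, limitMilli)
	if err != nil {
		return used, err
	}
	if used+c > hard {
		return used, errors.New("exceeds hard quota: request denied")
	}
	return used + c, nil
}

func main() {
	used := int64(0)
	used, _ = admit(4000, used, 1000, 4000) // pod X: charged 1000m, not its 4000m limit
	used, _ = admit(4000, used, 2000, 2000) // pod Y: charged 2000m
	used, _ = admit(4000, used, 0, 500)     // request defaults to limit: +500m
	fmt.Println(used)                       // 3500
	_, err := admit(4000, used, 0, 0)
	fmt.Println(err != nil) // true: no request enumerated
}
```

Note how pod X is charged 1000m even though its limit is 4000m, matching the rationale that users are charged for what they are scheduled against, not their burst ceiling.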
## kubectl -- cgit v1.2.3 From a74ffb6a381cf9a7bd8282c8d9806bae41680f3d Mon Sep 17 00:00:00 2001 From: Mike Danese Date: Wed, 5 Aug 2015 18:08:26 -0700 Subject: rewrite all links to issues to k8s links --- access.md | 8 ++++---- admission_control.md | 2 +- command_execution_port_forwarding.md | 6 +++--- event_compression.md | 10 +++++----- identifiers.md | 2 +- principles.md | 2 +- resources.md | 4 ++-- secrets.md | 6 +++--- security_context.md | 2 +- 9 files changed, 21 insertions(+), 21 deletions(-) diff --git a/access.md b/access.md index d2fe44ca..92840f73 100644 --- a/access.md +++ b/access.md @@ -118,8 +118,8 @@ Pods configs should be largely portable between Org-run and hosted configuration # Design Related discussion: -- https://github.com/GoogleCloudPlatform/kubernetes/issues/442 -- https://github.com/GoogleCloudPlatform/kubernetes/issues/443 +- http://issue.k8s.io/442 +- http://issue.k8s.io/443 This doc describes two security profiles: - Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users. @@ -176,7 +176,7 @@ Initially: Improvements: - Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572) - requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids. -- any features that help users avoid use of privileged containers (https://github.com/GoogleCloudPlatform/kubernetes/issues/391) +- any features that help users avoid use of privileged containers (http://issue.k8s.io/391) ### Namespaces @@ -253,7 +253,7 @@ Policy objects may be applicable only to a single namespace or to all namespaces ## Accounting -The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). 
A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.md)). +The API should have a `quota` concept (see http://issue.k8s.io/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.md)). Initially: - a `quota` object is immutable. diff --git a/admission_control.md b/admission_control.md index b84b2543..9245aa7d 100644 --- a/admission_control.md +++ b/admission_control.md @@ -37,7 +37,7 @@ Documentation for other releases can be found at | Topic | Link | | ----- | ---- | -| Separate validation from RESTStorage | https://github.com/GoogleCloudPlatform/kubernetes/issues/2977 | +| Separate validation from RESTStorage | http://issue.k8s.io/2977 | ## Background diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 1d319adf..852e761e 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -44,9 +44,9 @@ This describes an approach for providing support for: There are several related issues/PRs: -- [Support attach](https://github.com/GoogleCloudPlatform/kubernetes/issues/1521) -- [Real container ssh](https://github.com/GoogleCloudPlatform/kubernetes/issues/1513) -- [Provide easy debug network access to services](https://github.com/GoogleCloudPlatform/kubernetes/issues/1863) +- [Support attach](http://issue.k8s.io/1521) +- [Real container ssh](http://issue.k8s.io/1513) +- [Provide easy debug network access to services](http://issue.k8s.io/1863) - [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576) ## Motivation diff --git a/event_compression.md b/event_compression.md index ce8d1ad4..b14d5206 100644 --- a/event_compression.md +++ b/event_compression.md @@ -38,7 +38,7 @@ This document captures the design of event compression. 
## Background -Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](https://github.com/GoogleCloudPlatform/kubernetes/issues/3853)). +Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)). 
## Proposal @@ -109,10 +109,10 @@ This demonstrates what would have been 20 separate entries (indicating schedulin ## Related Pull Requests/Issues - * Issue [#4073](https://github.com/GoogleCloudPlatform/kubernetes/issues/4073): Compress duplicate events - * PR [#4157](https://github.com/GoogleCloudPlatform/kubernetes/issues/4157): Add "Update Event" to Kubernetes API - * PR [#4206](https://github.com/GoogleCloudPlatform/kubernetes/issues/4206): Modify Event struct to allow compressing multiple recurring events in to a single event - * PR [#4306](https://github.com/GoogleCloudPlatform/kubernetes/issues/4306): Compress recurring events in to a single event to optimize etcd storage + * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events + * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API + * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing multiple recurring events in to a single event + * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single event to optimize etcd storage * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map diff --git a/identifiers.md b/identifiers.md index 9e269993..7deff9e9 100644 --- a/identifiers.md +++ b/identifiers.md @@ -33,7 +33,7 @@ Documentation for other releases can be found at # Identifiers and Names in Kubernetes -A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199). +A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](http://issue.k8s.io/199). 
## Definitions diff --git a/principles.md b/principles.md index 23a20349..be3dff55 100644 --- a/principles.md +++ b/principles.md @@ -70,7 +70,7 @@ TODO: pluggability ## Bootstrapping -* [Self-hosting](https://github.com/GoogleCloudPlatform/kubernetes/issues/246) of all components is a goal. +* [Self-hosting](http://issue.k8s.io/246) of all components is a goal. * Minimize the number of dependencies, particularly those required for steady-state operation. * Stratify the dependencies that remain via principled layering. * Break any circular dependencies by converting hard dependencies to soft dependencies. diff --git a/resources.md b/resources.md index e006d44d..fe6f0ec7 100644 --- a/resources.md +++ b/resources.md @@ -33,7 +33,7 @@ Documentation for other releases can be found at **Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for implementation of this model is -[#168](https://github.com/GoogleCloudPlatform/kubernetes/issues/168). Currently, only memory and +[#168](http://issue.k8s.io/168). Currently, only memory and cpu limits on containers (not pods) are supported. "memory" is in bytes and "cpu" is in milli-cores.** @@ -134,7 +134,7 @@ The following resource types are predefined ("reserved") by Kubernetes in the `k * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU") * Internal representation: milli-KCUs * Compressible? 
yes - * Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](https://github.com/GoogleCloudPlatform/kubernetes/issues/147) + * Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](http://issue.k8s.io/147) * [future] `schedulingLatency`: as per lmctfy * [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0). diff --git a/secrets.md b/secrets.md index 3adc57af..350d151b 100644 --- a/secrets.md +++ b/secrets.md @@ -119,7 +119,7 @@ which consumes this type of secret, the Kubelet may take a number of actions: file system 2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the `kubernetes-master` service with the auth token, e. g. by adding a header to the request - (see the [LOAS Daemon](https://github.com/GoogleCloudPlatform/kubernetes/issues/2209) proposal) + (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal) #### Example: service account consumes docker registry credentials @@ -263,11 +263,11 @@ the right storage size for their installation and configuring their Kubelets cor Configuring each Kubelet is not the ideal story for operator experience; it is more intuitive that the cluster-wide storage size be readable from a central configuration store like the one proposed -in [#1553](https://github.com/GoogleCloudPlatform/kubernetes/issues/1553). When such a store +in [#1553](http://issue.k8s.io/1553). When such a store exists, the Kubelet could be modified to read this configuration item from the store. 
When the Kubelet is modified to advertise node resources (as proposed in -[#4441](https://github.com/GoogleCloudPlatform/kubernetes/issues/4441)), the capacity calculation +[#4441](http://issue.k8s.io/4441)), the capacity calculation for available memory should factor in the potential size of the node-level tmpfs in order to avoid memory overcommit on the node. diff --git a/security_context.md b/security_context.md index 7a80c01d..4704caab 100644 --- a/security_context.md +++ b/security_context.md @@ -42,7 +42,7 @@ A security context is a set of constraints that are applied to a container in or ## Background -The problem of securing containers in Kubernetes has come up [before](https://github.com/GoogleCloudPlatform/kubernetes/issues/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface. +The problem of securing containers in Kubernetes has come up [before](http://issue.k8s.io/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface. 
## Motivation -- cgit v1.2.3 From 09d971bc58179999aea2545bd2b922a6f170a3ef Mon Sep 17 00:00:00 2001 From: Mike Danese Date: Wed, 5 Aug 2015 18:09:50 -0700 Subject: rewrite all links to prs to k8s links --- event_compression.md | 2 +- security.md | 4 ++-- security_context.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/event_compression.md b/event_compression.md index b14d5206..1187edb6 100644 --- a/event_compression.md +++ b/event_compression.md @@ -113,7 +113,7 @@ This demonstrates what would have been 20 separate entries (indicating schedulin * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing multiple recurring events in to a single event * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single event to optimize etcd storage - * PR [#4444](https://github.com/GoogleCloudPlatform/kubernetes/pull/4444): Switch events history to use LRU cache instead of map + * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache instead of map diff --git a/security.md b/security.md index 1d73a529..5c187d69 100644 --- a/security.md +++ b/security.md @@ -127,11 +127,11 @@ A pod runs in a *security context* under a *service account* that is defined by ### Related design discussion * [Authorization and authentication](access.md) -* [Secret distribution via files](https://github.com/GoogleCloudPlatform/kubernetes/pull/2030) +* [Secret distribution via files](http://pr.k8s.io/2030) * [Docker secrets](https://github.com/docker/docker/pull/6697) * [Docker vault](https://github.com/docker/docker/issues/10310) * [Service Accounts:](service_accounts.md) -* [Secret volumes](https://github.com/GoogleCloudPlatform/kubernetes/pull/4126) +* [Secret volumes](http://pr.k8s.io/4126) ## Specific Design Points diff --git a/security_context.md b/security_context.md index 4704caab..1d2b4f71 100644 --- a/security_context.md +++ 
b/security_context.md @@ -192,7 +192,7 @@ It is up to an admission plugin to determine if the security context is acceptab time of writing, the admission control plugin for security contexts will only allow a context that has defined capabilities or privileged. Contexts that attempt to define a UID or SELinux options will be denied by default. In the future the admission plugin will base this decision upon -configurable policies that reside within the [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). +configurable policies that reside within the [service account](http://pr.k8s.io/2297). -- cgit v1.2.3 From ecf3f1ba5e7d8399acd5a631c810816d2c9b4fca Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Thu, 6 Aug 2015 00:53:01 -0400 Subject: Fix typo in security context proposal --- security_context.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security_context.md b/security_context.md index 7a80c01d..6f0b92b0 100644 --- a/security_context.md +++ b/security_context.md @@ -145,7 +145,7 @@ A security context resides on the container and represents the runtime parameter be used to create and run the container via container APIs. Following is an example of an initial implementation: ```go -type type Container struct { +type Container struct { ... other fields omitted ... 
// Optional: SecurityContext defines the security options the pod should be run with SecurityContext *SecurityContext -- cgit v1.2.3 From 2414459b8225d0b3702b3e232b5da3376631eddb Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Thu, 6 Aug 2015 10:58:55 -0400 Subject: Update design for LimitRange to handle requests --- admission_control_limit_range.md | 183 ++++++++++++++++++++++++--------------- 1 file changed, 114 insertions(+), 69 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 595d72e9..885ef664 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -35,139 +35,184 @@ Documentation for other releases can be found at ## Background -This document proposes a system for enforcing min/max limits per resource as part of admission control. +This document proposes a system for enforcing resource requirements constraints as part of admission control. -## Model Changes +## Use cases -A new resource, **LimitRange**, is introduced to enumerate min/max limits for a resource type scoped to a -Kubernetes namespace. +1. Ability to enumerate resource requirement constraints per namespace +2. Ability to enumerate min/max resource constraints for a pod +3. Ability to enumerate min/max resource constraints for a container +4. Ability to specify default resource limits for a container +5. Ability to specify default resource requests for a container +6. Ability to enforce a ratio between request and limit for a resource. + +## Data Model + +The **LimitRange** resource is scoped to a **Namespace**. 
+ +### Type ```go +// A type of object that is limited +type LimitType string + const ( // Limit that applies to all pods in a namespace - LimitTypePod string = "Pod" + LimitTypePod LimitType = "Pod" // Limit that applies to all containers in a namespace - LimitTypeContainer string = "Container" + LimitTypeContainer LimitType = "Container" ) // LimitRangeItem defines a min/max usage limit for any resource that matches on kind type LimitRangeItem struct { // Type of resource that this limit applies to - Type string `json:"type,omitempty"` + Type LimitType `json:"type,omitempty" description:"type of resource that this limit applies to"` // Max usage constraints on this kind by resource name - Max ResourceList `json:"max,omitempty"` + Max ResourceList `json:"max,omitempty" description:"max usage constraints on this kind by resource name"` // Min usage constraints on this kind by resource name - Min ResourceList `json:"min,omitempty"` - // Default usage constraints on this kind by resource name - Default ResourceList `json:"default,omitempty"` + Min ResourceList `json:"min,omitempty" description:"min usage constraints on this kind by resource name"` + // Default resource limits on this kind by resource name + Default ResourceList `json:"default,omitempty" description:"default resource limits values on this kind by resource name if omitted"` + // DefaultRequests resource requests on this kind by resource name + DefaultRequests ResourceList `json:"defaultRequests,omitempty" description:"default resource requests values on this kind by resource name if omitted"` + // LimitRequestRatio is the ratio of limit over request that is the maximum allowed burst for the named resource + LimitRequestRatio ResourceList `json:"limitRequestRatio,omitempty" description:"the ratio of limit over request that is the maximum allowed burst for the named resource. 
if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value"` } // LimitRangeSpec defines a min/max usage limit for resources that match on kind type LimitRangeSpec struct { // Limits is the list of LimitRangeItem objects that are enforced - Limits []LimitRangeItem `json:"limits"` + Limits []LimitRangeItem `json:"limits" description:"limits is the list of LimitRangeItem objects that are enforced"` } // LimitRange sets resource usage limits for each kind of resource in a Namespace type LimitRange struct { TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` + ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` // Spec defines the limits enforced - Spec LimitRangeSpec `json:"spec,omitempty"` + Spec LimitRangeSpec `json:"spec,omitempty" description:"spec defines the limits enforced; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` } // LimitRangeList is a list of LimitRange items. type LimitRangeList struct { TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty"` + ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` // Items is a list of LimitRange objects - Items []LimitRange `json:"items"` + Items []LimitRange `json:"items" description:"items is a list of LimitRange objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md"` } ``` -## AdmissionControl plugin: LimitRanger +### Validation -The **LimitRanger** plug-in introspects all incoming admission requests. +Validation of a **LimitRange** enforces that for a given named resource the following rules apply: -It makes decisions by evaluating the incoming object against all defined **LimitRange** objects in the request context namespace. 
+Min (if specified) <= DefaultRequests (if specified) <= Default (if specified) <= Max (if specified) -The following min/max limits are imposed: +### Default Value Behavior -**Type: Container** +The following default value behaviors are applied to a LimitRange for a given named resource. -| ResourceName | Description | -| ------------ | ----------- | -| cpu | Min/Max amount of cpu per container | -| memory | Min/Max amount of memory per container | +``` +if LimitRangeItem.Default[resourceName] is undefined + if LimitRangeItem.Max[resourceName] is defined + LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName] +``` -**Type: Pod** +``` +if LimitRangeItem.DefaultRequests[resourceName] is undefined + if LimitRangeItem.Default[resourceName] is defined + LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Default[resourceName] + else if LimitRangeItem.Min[resourceName] is defined + LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Min[resourceName] +``` -| ResourceName | Description | -| ------------ | ----------- | -| cpu | Min/Max amount of cpu per pod | -| memory | Min/Max amount of memory per pod | +## AdmissionControl plugin: LimitRanger -If a resource specifies a default value, it may get applied on the incoming resource. For example, if a default -value is provided for container cpu, it is set on the incoming container if and only if the incoming container -does not specify a resource requirements limit field. +The **LimitRanger** plug-in introspects all incoming pod requests and evaluates the constraints defined on a LimitRange. -If a resource specifies a min value, it may get applied on the incoming resource. For example, if a min -value is provided for container cpu, it is set on the incoming container if and only if the incoming container does -not specify a resource requirements requests field. +If a constraint is not specified for an enumerated resource, it is not enforced or tracked. 
-If the incoming object would cause a violation of the enumerated constraints, the request is denied with a set of -messages explaining what constraints were the source of the denial. +To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows: -If a constraint is not enumerated by a **LimitRange** it is not tracked. +```console +$ kube-apiserver -admission_control=LimitRanger +``` -## kube-apiserver +### Enforcement of constraints -The server is updated to be aware of **LimitRange** objects. +**Type: Container** -The constraints are only enforced if the kube-apiserver is started as follows: +Supported Resources: -```console -$ kube-apiserver -admission_control=LimitRanger -``` +1. memory +2. cpu -## kubectl +Supported Constraints: -kubectl is modified to support the **LimitRange** resource. +Per container, the following must hold true -`kubectl describe` provides a human-readable output of limits. +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min <= Request (required) <= Limit (optional) | +| Max | Limit (required) <= Max | +| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) | -For example, +Supported Defaults: -```console -$ kubectl namespace myspace -$ kubectl create -f docs/admin/limitrange/limits.yaml -$ kubectl get limits -NAME -limits -$ kubectl describe limits limits -Name: limits -Type Resource Min Max Default ----- -------- --- --- --- -Pod memory 1Mi 1Gi - -Pod cpu 250m 2 - -Container memory 1Mi 1Gi 1Mi -Container cpu 250m 250m 250m -``` +1. Default - if the named resource has no enumerated value, the Limit is equal to the Default +2. DefaultRequest - if the named resource has no enumerated value, the Request is equal to the DefaultRequest + +**Type: Pod** + +Supported Resources: -## Future Enhancements: Define limits for a particular pod or container. +1. memory +2. 
cpu -In the current proposal, the **LimitRangeItem** matches purely on **LimitRangeItem.Type** +Supported Constraints: -It is expected we will want to define limits for particular pods or containers by name/uid and label/field selector. +Across all containers in a pod, the following must hold true -To make a **LimitRangeItem** more restrictive, we will intend to add these additional restrictions at a future point in time. +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min <= Request (required) <= Limit (optional) | +| Max | Limit (required) <= Max | +| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) | + +## Run-time configuration + +The default ```LimitRange``` that is applied via Salt configuration will be updated as follows: + +``` +apiVersion: "v1" +kind: "LimitRange" +metadata: + name: "limits" + namespace: default +spec: + limits: + - type: "Container" + defaultRequests: + cpu: "100m" +``` ## Example -See the [example of Limit Range](../admin/limitrange/) for more information. +An example LimitRange configuration: + +| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio | +| ---- | -------- | --- | --- | ------- | -------------- | ----------------- | +| Container | cpu | .1 | 1 | 500m | 250m | 4 | +| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | | + +Assuming an incoming container that specified no incoming resource requirements, +the following would happen. +1. The incoming container cpu would request 250m with a limit of 500m. +2. The incoming container memory would request 250Mi with a limit of 500Mi. +3. If the container is later resized, its cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4. 
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() -- cgit v1.2.3 From 56b54ec64f3062926424ddf36cac20ebdc983b37 Mon Sep 17 00:00:00 2001 From: Ben McCann Date: Fri, 7 Aug 2015 00:13:15 -0700 Subject: Fix the architecture diagram such that the arrow from the api server to the node doesn't go through/under etcd --- architecture.dia | Bin 6522 -> 6519 bytes architecture.png | Bin 222407 -> 223860 bytes architecture.svg | 98 +++++++++++++++++++++++++++---------------------------- 3 files changed, 49 insertions(+), 49 deletions(-) diff --git a/architecture.dia b/architecture.dia index 26e0eed2..441e3563 100644 Binary files a/architecture.dia and b/architecture.dia differ diff --git a/architecture.png b/architecture.png index fa39039a..b03cfe88 100644 Binary files a/architecture.png and b/architecture.png differ diff --git a/architecture.svg b/architecture.svg index 825c0ace..cacc7fbf 100644 --- a/architecture.svg +++ b/architecture.svg [architecture.svg hunks not reproduced: the SVG markup was lost in extraction; the change only repositions diagram elements such as "replication controller", "Scheduler", "APIs", "Colocated, or spread across machines, as dictated by cluster size.", and "Distributed Watchable Storage (implemented via etcd)".] -- cgit v1.2.3 From c89196ac7341fdfbfbed5d27bb568bb66d4eafec Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Tue, 11 Aug 2015 16:29:50 -0400 Subject: Update code to use - in flag names instead of _ --- admission_control.md | 4 ++-- admission_control_limit_range.md | 2 +- admission_control_resource_quota.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/admission_control.md b/admission_control.md index 9245aa7d..a2b5700b 100644 --- a/admission_control.md +++ b/admission_control.md @@ -63,8 +63,8 @@ The kube-apiserver takes the following OPTIONAL arguments to enable admission co | Option | Behavior | | ------ | -------- | -| admission_control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. | -| admission_control_config_file | File with admission control configuration parameters to boot-strap plug-in. | +| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. | +| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. 
| An **AdmissionControl** plug-in is an implementation of the following interface: diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 885ef664..621fd564 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -137,7 +137,7 @@ If a constraint is not specified for an enumerated resource, it is not enforced To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows: ```console -$ kube-apiserver -admission_control=LimitRanger +$ kube-apiserver --admission-control=LimitRanger ``` ### Enforcement of constraints diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index bb7c6e0a..86fae451 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -178,7 +178,7 @@ The **ResourceQuota** plug-in introspects all incoming admission requests. To enable the plug-in and support for ResourceQuota, the kube-apiserver must be configured as follows: ``` -$ kube-apiserver -admission_control=ResourceQuota +$ kube-apiserver --admission-control=ResourceQuota ``` It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request -- cgit v1.2.3 From 4fa0f3a7b2c2d43815967c6f4671713a0c2ffa40 Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Mon, 27 Jul 2015 12:49:06 -0700 Subject: Add initial storage types to the Kubernetes API --- extending-api.md | 222 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 222 insertions(+) create mode 100644 extending-api.md diff --git a/extending-api.md b/extending-api.md new file mode 100644 index 00000000..cca257bd --- /dev/null +++ b/extending-api.md @@ -0,0 +1,222 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/extending-api.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Adding custom resources to the Kubernetes API server + +This document describes the design for implementing the storage of custom API types in the Kubernetes API Server. + + +## Resource Model + +### The ThirdPartyResource + +The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add +to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource, attempting to place it in a namespace +will return an error. + +Each `ThirdPartyResource` resource has the following: + * Standard Kubernetes object metadata. + * ResourceKind - The kind of the resources described by this third party resource. + * Description - A free text description of the resource. + * APIGroup - An API group that this resource should be placed into. + * Versions - One or more `Version` objects. + +### The `Version` Object + +The `Version` object describes a single concrete version of a custom resource. The `Version` object currently +only specifies: + * The `Name` of the version. + * The `APIGroup` this version should belong to. + +## Expectations about third party objects + +Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes +compatible [object metadata](../devel/api-conventions.md#metadata). This requirement enables the +Kubernetes API server to provide the following features: + * Filtering lists of objects via LabelQueries + * `resourceVersion`-based optimistic concurrency via compare-and-swap + * Versioned storage + * Event recording + * Integration with basic `kubectl` command line tooling. + * Watch for resource changes. 
+ +The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be +programmatically convertible to the name of the resource using +the following conversion. Kinds are expected to be of the form ``, the +`APIVersion` for the object is expected to be `//`. + +For example, `example.com/stable/v1` + +`domain-name` is expected to be a fully qualified domain name. + +`CamelCaseKind` is the specific type name. + +To convert this into the `metadata.name` for the `ThirdPartyResource` resource instance, +the `` is copied verbatim, and the `CamelCaseKind` is +then converted +using '-' instead of capitalization ('camel-case'), with the first character being assumed to be +capitalized. In pseudo code: + +```go +var result string +for ix := range kindName { + if ix > 0 && isCapital(kindName[ix]) { + result = append(result, '-') + } + result = append(result, toLowerCase(kindName[ix])) +} +``` + +As a concrete example, the resource named `camel-case-kind.example.com` defines resources of Kind `CamelCaseKind`, in +the APIGroup with the prefix `example.com/...`. + +The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information. +This is also the reason why `ThirdPartyResource` is not namespaced. + +## Usage + +When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced +RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects, +deleting a namespace deletes all third party resources in that namespace. 
+ +For example, if a user creates: + +```yaml +metadata: + name: cron-tab.example.com +apiVersion: experimental/v1 +kind: ThirdPartyResource +description: "A specification of a Pod to run on a cron style schedule" +versions: + - name: stable/v1 + - name: experimental/v2 +``` + +Then the API server will program in two new RESTful resource paths: + * `/thirdparty/example.com/stable/v1/namespaces//crontabs/...` + * `/thirdparty/example.com/experimental/v2/namespaces//crontabs/...` + + +Now that this schema has been created, a user can `POST`: + +```json +{ + "metadata": { + "name": "my-new-cron-object" + }, + "apiVersion": "example.com/stable/v1", + "kind": "CronTab", + "cronSpec": "* * * * /5", + "image": "my-awesome-cron-image" +} +``` + +to: `/third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object` + +and the corresponding data will be stored into etcd by the APIServer, so that when the user issues: + +``` +GET /third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object +``` + +they will get back the same data, but with additional Kubernetes metadata +(e.g. `resourceVersion`, `createdTimestamp`) filled in. + +Likewise, to list all resources, a user can issue: + +``` +GET /third-party/example.com/stable/v1/namespaces/default/crontabs +``` + +and get back: + +```json +{ + "apiVersion": "example.com/stable/v1", + "kind": "CronTabList", + "items": [ + { + "metadata": { + "name": "my-new-cron-object" + }, + "apiVersion": "example.com/stable/v1", + "kind": "CronTab", + "cronSpec": "* * * * /5", + "image": "my-awesome-cron-image" + } + ] +} +``` + +Because all objects are expected to contain standard Kubernetes metadata fields, these +list operations can also use `Label` queries to filter requests down to specific subsets. + +Likewise, clients can use watch endpoints to watch for changes to stored objects. 
## Storage

In order to store custom user data in a versioned fashion inside of etcd, we need to also introduce a
`Codec`-compatible object for persistent storage in etcd. This object is `ThirdPartyResourceData` and it contains:
 * Standard API Metadata
 * `Data`: The raw JSON data for this custom object.

### Storage key specification

Each custom object stored by the API server needs a custom key in storage; the key is constructed as described below.

#### Definitions

 * `resource-namespace`: the namespace of the particular resource that is being stored
 * `resource-name`: the name of the particular resource being stored
 * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored
 * `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored

#### Key

Given the definitions above, the key for a specific third-party object is:

```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
```

Thus, listing all third-party resources of a given type in a namespace can be achieved by listing the directory:

```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
```

-- cgit v1.2.3


From 818f69e30b787f90b8678eef776039bc015b44bc Mon Sep 17 00:00:00 2001
From: He Simei
Date: Thu, 30 Jul 2015 14:09:15 +0800
Subject: fix service-account related doc

---
 secrets.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/secrets.md b/secrets.md
index 350d151b..895d9448 100644
--- a/secrets.md
+++ b/secrets.md
@@ -321,9 +321,9 @@ type Secret struct {
 type SecretType string
 
 const (
-	SecretTypeOpaque              SecretType = "Opaque"             // Opaque (arbitrary data; default)
-	SecretTypeKubernetesAuthToken SecretType = "KubernetesAuth"     // Kubernetes auth token
-	SecretTypeDockerRegistryAuth  SecretType = "DockerRegistryAuth" // Docker registry auth
+	SecretTypeOpaque              SecretType = "Opaque"                              // Opaque (arbitrary data; default)
+	SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token
+	SecretTypeDockercfg           SecretType = "kubernetes.io/dockercfg"             // Docker registry auth
 
 	// FUTURE: other type values
 )
-- cgit v1.2.3


From 4434a3aca668a7dbdef7fc9d7787b3fdf6e69819 Mon Sep 17 00:00:00 2001
From: Kris Rousey
Date: Wed, 12 Aug 2015 10:35:07 -0700
Subject: Moving client libs to unversioned dir

---
 event_compression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/event_compression.md b/event_compression.md
index 1187edb6..4525c097 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -60,7 +60,7 @@ Instead of a single Timestamp, each event object [contains](http://releases.k8s.
 Each binary that generates events:
 * Maintains a historical record of previously generated events:
-    * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
+    * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/unversioned/record/events_cache.go`](../../pkg/client/unversioned/record/events_cache.go).
* The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: * `event.Source.Component` * `event.Source.Host` -- cgit v1.2.3 From e9f50fabe0f2a1e869344f955e3063bb12cfe186 Mon Sep 17 00:00:00 2001 From: Ilya Dmitrichenko Date: Wed, 19 Aug 2015 12:01:50 +0100 Subject: Make typography more consistent --- architecture.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/architecture.md b/architecture.md index 5f829d68..b17345ef 100644 --- a/architecture.md +++ b/architecture.md @@ -33,7 +33,7 @@ Documentation for other releases can be found at # Kubernetes architecture -A running Kubernetes cluster contains node agents (kubelet) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making kubelet itself (all our components, really) run within containers, and making the scheduler 100% pluggable. +A running Kubernetes cluster contains node agents (`kubelet`) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making `kubelet` itself (all our components, really) run within containers, and making the scheduler 100% pluggable. ![Architecture Diagram](architecture.png?raw=true "Architecture overview") @@ -45,21 +45,21 @@ The Kubernetes node has the services necessary to run application containers and Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers. -### Kubelet +### `kubelet` -The **Kubelet** manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. +The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. 
-### Kube-Proxy +### `kube-proxy` Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. -Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy. +Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are supported). These variables resolve to ports managed by the service proxy. ## The Kubernetes Control Plane The Kubernetes control plane is split into a set of components. Currently they all run on a single _master_ node, but that is expected to change soon in order to support high-availability clusters. These components work together to provide a unified view of the cluster. -### etcd +### `etcd` All persistent master state is stored in an instance of `etcd`. This provides a great way to store configuration data reliably. With `watch` support, coordinating components can be notified very quickly of changes. 
-- cgit v1.2.3 From 15509db93f1f3ac79e50bd5e18e34216cbd369c3 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Tue, 11 Aug 2015 11:23:56 -0400 Subject: Remove trailing commas --- namespaces.md | 48 ++++++++++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/namespaces.md b/namespaces.md index 596f6f43..bb907c67 100644 --- a/namespaces.md +++ b/namespaces.md @@ -268,16 +268,16 @@ OpenShift creates a Namespace in Kubernetes "kind": "Namespace", "metadata": { "name": "development", + "labels": { + "name": "development" + } }, "spec": { - "finalizers": ["openshift.com/origin", "kubernetes"], + "finalizers": ["openshift.com/origin", "kubernetes"] }, "status": { - "phase": "Active", - }, - "labels": { - "name": "development" - }, + "phase": "Active" + } } ``` @@ -294,16 +294,16 @@ User deletes the Namespace in Kubernetes, and Namespace now has following state: "metadata": { "name": "development", "deletionTimestamp": "..." + "labels": { + "name": "development" + } }, "spec": { - "finalizers": ["openshift.com/origin", "kubernetes"], + "finalizers": ["openshift.com/origin", "kubernetes"] }, "status": { - "phase": "Terminating", - }, - "labels": { - "name": "development" - }, + "phase": "Terminating" + } } ``` @@ -319,16 +319,16 @@ removing *kubernetes* from the list of finalizers: "metadata": { "name": "development", "deletionTimestamp": "..." + "labels": { + "name": "development" + } }, "spec": { - "finalizers": ["openshift.com/origin"], + "finalizers": ["openshift.com/origin"] }, "status": { - "phase": "Terminating", - }, - "labels": { - "name": "development" - }, + "phase": "Terminating" + } } ``` @@ -347,16 +347,16 @@ This results in the following state: "metadata": { "name": "development", "deletionTimestamp": "..." 
+ "labels": { + "name": "development" + } }, "spec": { - "finalizers": [], + "finalizers": [] }, "status": { - "phase": "Terminating", - }, - "labels": { - "name": "development" - }, + "phase": "Terminating" + } } ``` -- cgit v1.2.3 From a80aba14e93201b5dc674e2f0db56cd8aae91772 Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Mon, 24 Aug 2015 15:17:34 -0400 Subject: Use singular, make LimitRequestRatio MaxLimitRequestRatio --- admission_control_limit_range.md | 64 ++++++++++++++++++++++------------------ 1 file changed, 35 insertions(+), 29 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 621fd564..e7c706ef 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -53,7 +53,7 @@ The **LimitRange** resource is scoped to a **Namespace**. ### Type ```go -// A type of object that is limited +// LimitType is a type of object that is limited type LimitType string const ( @@ -63,44 +63,50 @@ const ( LimitTypeContainer LimitType = "Container" ) -// LimitRangeItem defines a min/max usage limit for any resource that matches on kind +// LimitRangeItem defines a min/max usage limit for any resource that matches on kind. 
type LimitRangeItem struct { - // Type of resource that this limit applies to - Type LimitType `json:"type,omitempty" description:"type of resource that this limit applies to"` - // Max usage constraints on this kind by resource name - Max ResourceList `json:"max,omitempty" description:"max usage constraints on this kind by resource name"` - // Min usage constraints on this kind by resource name - Min ResourceList `json:"min,omitempty" description:"min usage constraints on this kind by resource name"` - // Default resource limits on this kind by resource name - Default ResourceList `json:"default,omitempty" description:"default resource limits values on this kind by resource name if omitted"` - // DefaultRequests resource requests on this kind by resource name - DefaultRequests ResourceList `json:"defaultRequests,omitempty" description:"default resource requests values on this kind by resource name if omitted"` - // LimitRequestRatio is the ratio of limit over request that is the maximum allowed burst for the named resource - LimitRequestRatio ResourceList `json:"limitRequestRatio,omitempty" description:"the ratio of limit over request that is the maximum allowed burst for the named resource. if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value"` + // Type of resource that this limit applies to. + Type LimitType `json:"type,omitempty"` + // Max usage constraints on this kind by resource name. + Max ResourceList `json:"max,omitempty"` + // Min usage constraints on this kind by resource name. + Min ResourceList `json:"min,omitempty"` + // Default resource requirement limit value by resource name if resource limit is omitted. + Default ResourceList `json:"default,omitempty"` + // DefaultRequest is the default resource requirement request value by resource name if resource request is omitted. 
+ DefaultRequest ResourceList `json:"defaultRequest,omitempty"` + // MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource. + MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` } -// LimitRangeSpec defines a min/max usage limit for resources that match on kind +// LimitRangeSpec defines a min/max usage limit for resources that match on kind. type LimitRangeSpec struct { - // Limits is the list of LimitRangeItem objects that are enforced - Limits []LimitRangeItem `json:"limits" description:"limits is the list of LimitRangeItem objects that are enforced"` + // Limits is the list of LimitRangeItem objects that are enforced. + Limits []LimitRangeItem `json:"limits"` } -// LimitRange sets resource usage limits for each kind of resource in a Namespace +// LimitRange sets resource usage limits for each kind of resource in a Namespace. type LimitRange struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + TypeMeta `json:",inline"` + // Standard object's metadata. + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata + ObjectMeta `json:"metadata,omitempty"` - // Spec defines the limits enforced - Spec LimitRangeSpec `json:"spec,omitempty" description:"spec defines the limits enforced; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` + // Spec defines the limits enforced. + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status + Spec LimitRangeSpec `json:"spec,omitempty"` } // LimitRangeList is a list of LimitRange items. 
type LimitRangeList struct { TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + // Standard list metadata. + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds + ListMeta `json:"metadata,omitempty"` - // Items is a list of LimitRange objects - Items []LimitRange `json:"items" description:"items is a list of LimitRange objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md"` + // Items is a list of LimitRange objects. + // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md + Items []LimitRange `json:"items"` } ``` @@ -108,7 +114,7 @@ type LimitRangeList struct { Validation of a **LimitRange** enforces that for a given named resource the following rules apply: -Min (if specified) <= DefaultRequests (if specified) <= Default (if specified) <= Max (if specified) +Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified) ### Default Value Behavior @@ -121,11 +127,11 @@ if LimitRangeItem.Default[resourceName] is undefined ``` ``` -if LimitRangeItem.DefaultRequests[resourceName] is undefined +if LimitRangeItem.DefaultRequest[resourceName] is undefined if LimitRangeItem.Default[resourceName] is defined - LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Default[resourceName] + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] else if LimitRangeItem.Min[resourceName] is defined - LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Min[resourceName] + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] ``` ## AdmissionControl plugin: LimitRanger -- cgit v1.2.3 From b003d62099bfbaed2bf801ed16048ae3a9a57117 Mon Sep 17 00:00:00 2001 From: Ed Costello Date: Tue, 25 Aug 2015 10:47:58 -0400 Subject: Copy edits for typos (resubmitted) --- 
extending-api.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/extending-api.md b/extending-api.md index cca257bd..bbd02a54 100644 --- a/extending-api.md +++ b/extending-api.md @@ -71,7 +71,7 @@ Kubernetes API server to provide the following features: * Watch for resource changes. The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be -programnatically convertible to the name of the resource using +programmatically convertible to the name of the resource using the following conversion. Kinds are expected to be of the form ``, the `APIVersion` for the object is expected to be `//`. @@ -178,7 +178,7 @@ and get back: } ``` -Because all objects are expected to contain standard Kubernetes metdata fileds, these +Because all objects are expected to contain standard Kubernetes metadata fields, these list operations can also use `Label` queries to filter requests down to specific subsets. Likewise, clients can use watch endpoints to watch for changes to stored objects. -- cgit v1.2.3 From 816629623e6870d81ca1c41e6f6b61ab78b71fb7 Mon Sep 17 00:00:00 2001 From: Max Forbes Date: Wed, 26 Aug 2015 10:31:58 -0700 Subject: Add patch notes to versioning doc. --- versioning.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/versioning.md b/versioning.md index 9009dc59..ede6b450 100644 --- a/versioning.md +++ b/versioning.md @@ -68,6 +68,14 @@ Here is an example major release cycle: It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options. 
+# Patches + +Patch releases are intended for critical bug fixes to the latest minor version, such as addressing security vulnerabilities, fixes to problems affecting a large number of users, severe problems with no workaround, and blockers for products based on Kubernetes. + +They should not contain miscellaneous feature additions or improvements, and especially no incompatibilities should be introduced between patch versions of the same minor version (or even major version). + +Dependencies, such as Docker or Etcd, should also not be changed unless absolutely necessary, and also just to fix critical bugs (so, at most patch version changes, not new major nor minor versions). + # Upgrades * Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. (Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.) -- cgit v1.2.3 From f78bbabe8344f2f52bc3945c65fee5e4717688da Mon Sep 17 00:00:00 2001 From: Piotr Szczesniak Date: Thu, 27 Aug 2015 10:50:50 +0200 Subject: Revert "LimitRange updates for Resource Requirements Requests" --- admission_control_limit_range.md | 64 ++++++++++++++++++---------------------- 1 file changed, 29 insertions(+), 35 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index e7c706ef..621fd564 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -53,7 +53,7 @@ The **LimitRange** resource is scoped to a **Namespace**. ### Type ```go -// LimitType is a type of object that is limited +// A type of object that is limited type LimitType string const ( @@ -63,50 +63,44 @@ const ( LimitTypeContainer LimitType = "Container" ) -// LimitRangeItem defines a min/max usage limit for any resource that matches on kind. 
+// LimitRangeItem defines a min/max usage limit for any resource that matches on kind type LimitRangeItem struct { - // Type of resource that this limit applies to. - Type LimitType `json:"type,omitempty"` - // Max usage constraints on this kind by resource name. - Max ResourceList `json:"max,omitempty"` - // Min usage constraints on this kind by resource name. - Min ResourceList `json:"min,omitempty"` - // Default resource requirement limit value by resource name if resource limit is omitted. - Default ResourceList `json:"default,omitempty"` - // DefaultRequest is the default resource requirement request value by resource name if resource request is omitted. - DefaultRequest ResourceList `json:"defaultRequest,omitempty"` - // MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource. - MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` + // Type of resource that this limit applies to + Type LimitType `json:"type,omitempty" description:"type of resource that this limit applies to"` + // Max usage constraints on this kind by resource name + Max ResourceList `json:"max,omitempty" description:"max usage constraints on this kind by resource name"` + // Min usage constraints on this kind by resource name + Min ResourceList `json:"min,omitempty" description:"min usage constraints on this kind by resource name"` + // Default resource limits on this kind by resource name + Default ResourceList `json:"default,omitempty" description:"default resource limits values on this kind by resource name if omitted"` + // DefaultRequests resource requests on this kind by resource name + DefaultRequests ResourceList `json:"defaultRequests,omitempty" description:"default resource requests values on this kind by resource name if omitted"` + // LimitRequestRatio is the ratio of limit over 
request that is the maximum allowed burst for the named resource + LimitRequestRatio ResourceList `json:"limitRequestRatio,omitempty" description:"the ratio of limit over request that is the maximum allowed burst for the named resource. if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value"` } -// LimitRangeSpec defines a min/max usage limit for resources that match on kind. +// LimitRangeSpec defines a min/max usage limit for resources that match on kind type LimitRangeSpec struct { - // Limits is the list of LimitRangeItem objects that are enforced. - Limits []LimitRangeItem `json:"limits"` + // Limits is the list of LimitRangeItem objects that are enforced + Limits []LimitRangeItem `json:"limits" description:"limits is the list of LimitRangeItem objects that are enforced"` } -// LimitRange sets resource usage limits for each kind of resource in a Namespace. +// LimitRange sets resource usage limits for each kind of resource in a Namespace type LimitRange struct { - TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata - ObjectMeta `json:"metadata,omitempty"` + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - // Spec defines the limits enforced. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status - Spec LimitRangeSpec `json:"spec,omitempty"` + // Spec defines the limits enforced + Spec LimitRangeSpec `json:"spec,omitempty" description:"spec defines the limits enforced; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` } // LimitRangeList is a list of LimitRange items. type LimitRangeList struct { TypeMeta `json:",inline"` - // Standard list metadata. 
- // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds - ListMeta `json:"metadata,omitempty"` + ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - // Items is a list of LimitRange objects. - // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md - Items []LimitRange `json:"items"` + // Items is a list of LimitRange objects + Items []LimitRange `json:"items" description:"items is a list of LimitRange objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md"` } ``` @@ -114,7 +108,7 @@ type LimitRangeList struct { Validation of a **LimitRange** enforces that for a given named resource the following rules apply: -Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified) +Min (if specified) <= DefaultRequests (if specified) <= Default (if specified) <= Max (if specified) ### Default Value Behavior @@ -127,11 +121,11 @@ if LimitRangeItem.Default[resourceName] is undefined ``` ``` -if LimitRangeItem.DefaultRequest[resourceName] is undefined +if LimitRangeItem.DefaultRequests[resourceName] is undefined if LimitRangeItem.Default[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] + LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Default[resourceName] else if LimitRangeItem.Min[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] + LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Min[resourceName] ``` ## AdmissionControl plugin: LimitRanger -- cgit v1.2.3 From ddba4b452707662f2fbd8d05c236c552c15cb377 Mon Sep 17 00:00:00 2001 From: qiaolei Date: Fri, 28 Aug 2015 16:40:59 +0800 Subject: Update quota example in admission_control_resource_quota.md Two modifications: 1, The example used in this document is outdated so 
update it 2, Delete the old `kubectl namespace myspace` since it produces an error `error: namespace has been superceded by the context.namespace field of .kubeconfig files` --- admission_control_resource_quota.md | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 86fae451..1931143c 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -201,21 +201,25 @@ kubectl is modified to support the **ResourceQuota** resource. For example, ```console -$ kubectl namespace myspace -$ kubectl create -f docs/user-guide/resourcequota/quota.yaml -$ kubectl get quota -NAME -quota -$ kubectl describe quota quota -Name: quota -Resource Used Hard --------- ---- ---- -cpu 0m 20 -memory 0 1Gi -pods 5 10 -replicationcontrollers 5 20 -resourcequotas 1 1 -services 3 5 +$ kubectl create -f docs/user-guide/resourcequota/namespace.yaml +namespace "quota-example" created +$ kubectl create -f docs/user-guide/resourcequota/quota.yaml --namespace=quota-example +resourcequota "quota" created +$ kubectl describe quota quota --namespace=quota-example +Name: quota +Namespace: quota-example +Resource Used Hard +-------- ---- ---- +cpu 0 20 +memory 0 1Gi +persistentvolumeclaims 0 10 +pods 0 10 +replicationcontrollers 0 20 +resourcequotas 1 1 +secrets 1 10 +services 0 5 + + ``` ## More information -- cgit v1.2.3 From 76238cf01e317aa95009ab78cd5f12756346cb57 Mon Sep 17 00:00:00 2001 From: Prashanth B Date: Fri, 28 Aug 2015 09:26:36 -0700 Subject: Revert "Revert "LimitRange updates for Resource Requirements Requests"" --- admission_control_limit_range.md | 64 ++++++++++++++++++++++------------------ 1 file changed, 35 insertions(+), 29 deletions(-) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 621fd564..e7c706ef 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md 
@@ -53,7 +53,7 @@ The **LimitRange** resource is scoped to a **Namespace**. ### Type ```go -// A type of object that is limited +// LimitType is a type of object that is limited type LimitType string const ( @@ -63,44 +63,50 @@ const ( LimitTypeContainer LimitType = "Container" ) -// LimitRangeItem defines a min/max usage limit for any resource that matches on kind +// LimitRangeItem defines a min/max usage limit for any resource that matches on kind. type LimitRangeItem struct { - // Type of resource that this limit applies to - Type LimitType `json:"type,omitempty" description:"type of resource that this limit applies to"` - // Max usage constraints on this kind by resource name - Max ResourceList `json:"max,omitempty" description:"max usage constraints on this kind by resource name"` - // Min usage constraints on this kind by resource name - Min ResourceList `json:"min,omitempty" description:"min usage constraints on this kind by resource name"` - // Default resource limits on this kind by resource name - Default ResourceList `json:"default,omitempty" description:"default resource limits values on this kind by resource name if omitted"` - // DefaultRequests resource requests on this kind by resource name - DefaultRequests ResourceList `json:"defaultRequests,omitempty" description:"default resource requests values on this kind by resource name if omitted"` - // LimitRequestRatio is the ratio of limit over request that is the maximum allowed burst for the named resource - LimitRequestRatio ResourceList `json:"limitRequestRatio,omitempty" description:"the ratio of limit over request that is the maximum allowed burst for the named resource. if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value"` + // Type of resource that this limit applies to. + Type LimitType `json:"type,omitempty"` + // Max usage constraints on this kind by resource name. 
+ Max ResourceList `json:"max,omitempty"` + // Min usage constraints on this kind by resource name. + Min ResourceList `json:"min,omitempty"` + // Default resource requirement limit value by resource name if resource limit is omitted. + Default ResourceList `json:"default,omitempty"` + // DefaultRequest is the default resource requirement request value by resource name if resource request is omitted. + DefaultRequest ResourceList `json:"defaultRequest,omitempty"` + // MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource. + MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` } -// LimitRangeSpec defines a min/max usage limit for resources that match on kind +// LimitRangeSpec defines a min/max usage limit for resources that match on kind. type LimitRangeSpec struct { - // Limits is the list of LimitRangeItem objects that are enforced - Limits []LimitRangeItem `json:"limits" description:"limits is the list of LimitRangeItem objects that are enforced"` + // Limits is the list of LimitRangeItem objects that are enforced. + Limits []LimitRangeItem `json:"limits"` } -// LimitRange sets resource usage limits for each kind of resource in a Namespace +// LimitRange sets resource usage limits for each kind of resource in a Namespace. type LimitRange struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + TypeMeta `json:",inline"` + // Standard object's metadata. 
+ // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata + ObjectMeta `json:"metadata,omitempty"` - // Spec defines the limits enforced - Spec LimitRangeSpec `json:"spec,omitempty" description:"spec defines the limits enforced; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` + // Spec defines the limits enforced. + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status + Spec LimitRangeSpec `json:"spec,omitempty"` } // LimitRangeList is a list of LimitRange items. type LimitRangeList struct { TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + // Standard list metadata. + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds + ListMeta `json:"metadata,omitempty"` - // Items is a list of LimitRange objects - Items []LimitRange `json:"items" description:"items is a list of LimitRange objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md"` + // Items is a list of LimitRange objects. 
+ // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md + Items []LimitRange `json:"items"` } ``` @@ -108,7 +114,7 @@ type LimitRangeList struct { Validation of a **LimitRange** enforces that for a given named resource the following rules apply: -Min (if specified) <= DefaultRequests (if specified) <= Default (if specified) <= Max (if specified) +Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified) ### Default Value Behavior @@ -121,11 +127,11 @@ if LimitRangeItem.Default[resourceName] is undefined ``` ``` -if LimitRangeItem.DefaultRequests[resourceName] is undefined +if LimitRangeItem.DefaultRequest[resourceName] is undefined if LimitRangeItem.Default[resourceName] is defined - LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Default[resourceName] + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] else if LimitRangeItem.Min[resourceName] is defined - LimitRangeItem.DefaultRequests[resourceName] = LimitRangeItem.Min[resourceName] + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] ``` ## AdmissionControl plugin: LimitRanger -- cgit v1.2.3 From 29834b4bb59b41ba2a609cef1d459ccb47e595c7 Mon Sep 17 00:00:00 2001 From: qiaolei Date: Mon, 31 Aug 2015 14:39:44 +0800 Subject: Fix dead link in event_compression.md Where `pkg/client/record/event.go` should be `pkg/client/unversioned/record/event.go` --- event_compression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/event_compression.md b/event_compression.md index 4525c097..e8f9775b 100644 --- a/event_compression.md +++ b/event_compression.md @@ -72,7 +72,7 @@ Each binary that generates events: * `event.Reason` * `event.Message` * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. 
Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [`pkg/client/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). + * When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/unversioned/record/event.go)). * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). -- cgit v1.2.3 From e3bbb1eb035637aad7118bf5ec21ced68b5284f5 Mon Sep 17 00:00:00 2001 From: qiaolei Date: Wed, 2 Sep 2015 15:11:22 +0800 Subject: Update quota example Update quota example to track latest changes --- admission_control_resource_quota.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 1931143c..4b417ead 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -218,8 +218,6 @@ replicationcontrollers 0 20 resourcequotas 1 1 secrets 1 10 services 0 5 - - ``` ## More information -- cgit v1.2.3 From 8c4c1cb764238293cb3805074b78c70327258865 Mon Sep 17 00:00:00 2001 From: hw-qiaolei Date: Sat, 29 Aug 2015 21:08:46 +0000 Subject: Adjust the architecture diagram Some modifications of the architecture diagram: 1. 
adjust the order of authz and authn; since the API server usually first authenticate user, if it is a valid user then authorize it
2. adjust the arrow to point to kubelet instead of to node of the second node
3. change `replication controller` to `controller manager(replication controller etc.)` which connects to the REST API Server
4. some tiny adjustments of the arrow position
5. affected files: architecture.svg, architecture.png and architecture.dia
---
 architecture.dia |  Bin 6519 -> 6523 bytes
 architecture.png |  Bin 223860 -> 268126 bytes
 architecture.svg | 2220 ++++++++++++++++++++++++++++++++++++++++++++----------
 3 files changed, 1832 insertions(+), 388 deletions(-)

diff --git a/architecture.dia b/architecture.dia
index 441e3563..5c87409f 100644
Binary files a/architecture.dia and b/architecture.dia differ
diff --git a/architecture.png b/architecture.png
index b03cfe88..0ee8bceb 100644
Binary files a/architecture.png and b/architecture.png differ
diff --git a/architecture.svg b/architecture.svg
index cacc7fbf..d6b6aab0 100644
--- a/architecture.svg
+++ b/architecture.svg
@@ -1,499 +1,1943 @@
[SVG source diff omitted: the extracted text dropped the XML markup, leaving only +/- markers and text labels. Recoverable label changes: "authorization / authentication" reordered to "authentication / authorization", and "replication controller" renamed to "controller manager (replication controller etc.)". The Node, kubelet, Pod/container, cAdvisor, Proxy, docker, kubectl, Firewall/Internet, Scheduler, REST APIs, and "Distributed Watchable Storage (implemented via etcd)" labels are unchanged.]
-- cgit v1.2.3
From 4ad8a68e14a047e5cf7be93b222b00198315882c Mon Sep 17 00:00:00 2001
From: Eric Paris
Date: Thu, 3 Sep 2015 10:10:11 -0400
Subject: s|github.com/GoogleCloudPlatform/kubernetes|github.com/kubernetes/kubernetes|

---
 architecture.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/architecture.md b/architecture.md
index b17345ef..2a761dea 100644
--- a/architecture.md
+++ b/architecture.md
@@ -51,7 +51,7 @@ The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their

 ### `kube-proxy`

-Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/GoogleCloudPlatform/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.
+Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.

 Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are supported). These variables resolve to ports managed by the service proxy.
-- cgit v1.2.3 From 1a62ae0c98bf9f280d27f4ab59d88c03f8d3f3dc Mon Sep 17 00:00:00 2001 From: dinghaiyang Date: Fri, 4 Sep 2015 18:44:56 +0800 Subject: Replace limits with request where appropriate --- resources.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/resources.md b/resources.md index fe6f0ec7..f9bbc8db 100644 --- a/resources.md +++ b/resources.md @@ -33,8 +33,8 @@ Documentation for other releases can be found at **Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for implementation of this model is -[#168](http://issue.k8s.io/168). Currently, only memory and -cpu limits on containers (not pods) are supported. "memory" is in bytes and "cpu" is in +[#168](http://issue.k8s.io/168). Currently, both limits and requests of memory and +cpu on containers (not pods) are supported. "memory" is in bytes and "cpu" is in milli-cores.** # The Kubernetes resource model @@ -123,7 +123,6 @@ Where: * Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expected an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet. - ## Kubernetes-defined resource types The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet. -- cgit v1.2.3 From b1fef7374e6dc465845feb0459b703827d0e4081 Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 3 Sep 2015 14:50:12 -0700 Subject: Manually fixing docs, since gendocs messes up the links. 
--- event_compression.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/event_compression.md b/event_compression.md index e8f9775b..3d900e07 100644 --- a/event_compression.md +++ b/event_compression.md @@ -60,7 +60,7 @@ Instead of a single Timestamp, each event object [contains](http://releases.k8s. Each binary that generates events: * Maintains a historical record of previously generated events: - * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/unversioned/record/events_cache.go`](../../pkg/client/unversioned/record/events_cache.go). + * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go). * The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: * `event.Source.Component` * `event.Source.Host` -- cgit v1.2.3 From 15a05b820df9675188b8dc83341014ad3ec3319a Mon Sep 17 00:00:00 2001 From: derekwaynecarr Date: Tue, 8 Sep 2015 11:03:08 -0400 Subject: Move resource quota doc from user-guide to admin --- admission_control_resource_quota.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 4b417ead..a9de7a9c 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -201,9 +201,9 @@ kubectl is modified to support the **ResourceQuota** resource. 
For example, ```console -$ kubectl create -f docs/user-guide/resourcequota/namespace.yaml +$ kubectl create -f docs/admin/resourcequota/namespace.yaml namespace "quota-example" created -$ kubectl create -f docs/user-guide/resourcequota/quota.yaml --namespace=quota-example +$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example resourcequota "quota" created $ kubectl describe quota quota --namespace=quota-example Name: quota @@ -222,8 +222,7 @@ services 0 5 ## More information -See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../user-guide/resourcequota/) for more information. - +See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() -- cgit v1.2.3 From f1c3e8db656da678e73bf5f6ef0173e2d943fb57 Mon Sep 17 00:00:00 2001 From: eulerzgy Date: Mon, 14 Sep 2015 17:46:59 +0800 Subject: fix document --- access.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/access.md b/access.md index 92840f73..123516f9 100644 --- a/access.md +++ b/access.md @@ -66,18 +66,18 @@ This document is primarily concerned with K8s API paths, and secondarily with In ### Assets to protect External User assets: - - Personal information like private messages, or images uploaded by External Users - - web server logs + - Personal information like private messages, or images uploaded by External Users. + - web server logs. K8s User assets: - - External User assets of each K8s User + - External User assets of each K8s User. 
- things private to the K8s app, like: - credentials for accessing other services (docker private repos, storage services, facebook, etc) - SSL certificates for web servers - proprietary data and code K8s Cluster assets: - - Assets of each K8s User + - Assets of each K8s User. - Machine Certificates or secrets. - The value of K8s cluster computing resources (cpu, memory, etc). @@ -104,7 +104,7 @@ Org-run cluster: - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix. Hosted cluster: - - Offering K8s API as a service, or offering a Paas or Saas built on K8s + - Offering K8s API as a service, or offering a Paas or Saas built on K8s. - May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure. - May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case.) - Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded). @@ -137,7 +137,7 @@ K8s will have a `userAccount` API object. - `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs. - `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field. - `userAccount` is not related to the unix username of processes in Pods created by that userAccount. -- `userAccount` API objects can have labels +- `userAccount` API objects can have labels. 
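The `userAccount` shape described by the bullets above could be sketched as follows. The struct and field names are illustrative only, not the actual API type.

```go
package main

import "fmt"

// UserAccount is an illustrative sketch of the userAccount object described
// above; the real API type and field names may differ.
type UserAccount struct {
	UID    string            // immutable; associates objects and audit-log actions with the user
	Name   string            // human-readable and unique among userAccounts; an email address is suggested
	Labels map[string]string // userAccount API objects can have labels
}

func main() {
	ua := UserAccount{
		UID:    "u-1234",
		Name:   "alice@example.com",
		Labels: map[string]string{"team": "infra"},
	}
	fmt.Printf("%s (%s) labels=%v\n", ua.Name, ua.UID, ua.Labels)
}
```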
The system may associate one or more Authentication Methods with a `userAccount` (but they are not formally part of the userAccount object.) -- cgit v1.2.3 From 744c48405562541de55f5ebb8f54bd94d5129a28 Mon Sep 17 00:00:00 2001 From: eulerzgy Date: Wed, 16 Sep 2015 02:30:42 +0800 Subject: fix the change of minions to nodes --- event_compression.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/event_compression.md b/event_compression.md index 3d900e07..424f9ac2 100644 --- a/event_compression.md +++ b/event_compression.md @@ -96,11 +96,11 @@ Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet. Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet. Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-2.c.saad-dev-vms.internal} Starting kubelet. 
-Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no minions available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod 
implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal ``` -- cgit v1.2.3 From 6583718cd4689f89955e6a86827c7d891b5a694a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20Mart=C3=AD?= Date: Thu, 17 Sep 2015 15:21:55 -0700 Subject: Move pkg/util.Time to pkg/api/unversioned.Time Along with our time.Duration wrapper, as suggested by @lavalamp. --- event_compression.md | 4 ++-- expansion.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/event_compression.md b/event_compression.md index 424f9ac2..b9861717 100644 --- a/event_compression.md +++ b/event_compression.md @@ -49,9 +49,9 @@ Event compression should be best effort (not guaranteed). Meaning, in the worst ## Design Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following fields: - * `FirstTimestamp util.Time` + * `FirstTimestamp unversioned.Time` * The date/time of the first occurrence of the event. - * `LastTimestamp util.Time` + * `LastTimestamp unversioned.Time` * The date/time of the most recent occurrence of the event. * On first occurrence, this is equal to the FirstTimestamp. * `Count int` diff --git a/expansion.md b/expansion.md index 24a07f0d..b19731b9 100644 --- a/expansion.md +++ b/expansion.md @@ -265,7 +265,7 @@ type ObjectEventRecorder interface { Eventf(reason, messageFmt string, args ...interface{}) // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. 
- PastEventf(timestamp util.Time, reason, messageFmt string, args ...interface{}) + PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) } ``` -- cgit v1.2.3 From 05697f05b991d67f13f9063a60fb17a75d487004 Mon Sep 17 00:00:00 2001 From: AnanyaKumar Date: Mon, 31 Aug 2015 00:01:13 -0400 Subject: Add daemon design doc --- daemon.md | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) create mode 100644 daemon.md diff --git a/daemon.md b/daemon.md new file mode 100644 index 00000000..a948b78a --- /dev/null +++ b/daemon.md @@ -0,0 +1,128 @@ +# Daemons in Kubernetes + +**Author**: Ananya Kumar (@AnanyaKumar) + +**Status**: Draft proposal; prototype in progress. + +This document presents the design of a daemon controller for Kubernetes, outlines relevant Kubernetes concepts, describes use cases, and lays out milestones for its development. + +## Motivation + +In Kubernetes, a Replication Controller ensures that the specified number of a specified pod are running in the cluster at all times (pods are restarted if they are killed). With the Replication Controller, users cannot control which nodes their pods run on - Kubernetes decides how to schedule the pods onto nodes. However, many users want control over how certain pods are scheduled. In particular, many users have requested for a way to run a daemon on every node in the cluster, or on a certain set of nodes in the cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the daemon controller, a way to conveniently create and manage daemon-like workloads in Kubernetes. + +## Use Cases + +The daemon controller can be used for user-specified system services, cluster level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category. 
+ +### User-Specified System Services: +Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The daemon controller can be used to run a data collection service (for example fluentd) and send the data to a service like ElasticSearch for analysis. + +### Cluster-Level Applications +Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A daemon controller is a convenient way to implement such a datastore. + +For other uses, see the related [feature request](https://github.com/GoogleCloudPlatform/kubernetes/issues/1518) + +## Functionality + +The Daemon Controller will support standard API features: +- create + - The spec for daemon controllers will have a pod template field. + - Using the pod’s node selector field, Daemon controllers can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘database’. You can use a daemon controller to launch a datastore pod on exactly those nodes labeled ‘database’. + - Using the pod's node name field, Daemon controllers can be restricted to operate on a specified node. + - The spec for pod templates that run with the Daemon Controller is the same as the spec for pod templates that run with the Replication Controller, except there will not be a ‘replicas’ field (exactly 1 daemon pod will be launched per node). 
+ - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this controller. + - The Daemon Controller will not guarantee that Daemon pods show up on nodes (for example because of resource limitations of the node), but will make a best effort to launch Daemon pods (like Replication Controllers do with pods) + - A daemon controller named “foo” will add a “controller: foo” annotation to all the pods that it creates + - YAML example: +```YAML + apiVersion: v1 + kind: Daemon + metadata: + labels: + name: datastore + name: datastore + spec: + template: + metadata: + labels: + name: datastore-shard + spec: + node-selector: + name: datastore-node + containers: + name: datastore-shard + image: kubernetes/sharded + ports: + - containerPort: 9042 + name: main +``` + - commands that get info + - get (e.g. kubectl get dc) + - describe + - Modifiers + - delete + - stop: first we turn down the Daemon Controller foo, and then we turn down all pods matching the query “controller: foo” + - label + - update + - Daemon controllers will have labels, so you could, for example, list all daemon controllers with a certain label (the same way you would for a Replication Controller). + - In general, for all the supported features like get, describe, update, etc, the Daemon Controller will work in a similar way to the Replication Controller. However, note that the Daemon Controller and the Replication Controller are different constructs. + +### Health checks + - Ordinary health checks specified in the pod template will of course work to keep pods created by a Daemon Controller running. + +### Cluster Mutations + - When a new node is added to the cluster the daemon controller should start the daemon on the node (if the node’s labels match the user-specified selectors). This is a big advantage of the Daemon Controller compared to alternative ways of launching daemons and configuring clusters. 
+ - Suppose the user launches a daemon controller that runs a logging daemon on all nodes labeled “tolog”. If the user then adds the “tolog” label to a node (that did not initially have the “tolog” label), the logging daemon should be launched on the node. Additionally, if a user removes the “tolog” label from a node, the logging daemon on that node should be killed. + +## Alternatives Considered + +An alternative way to launch daemons is to avoid going through the API server, and instead provide ways to package the daemon into the node. For example, users could: + +1. Include the daemon in the machine image +2. Use config files to launch daemons +3. Use static pod manifests to launch daemon pods when the node initializes + +These alternatives don’t work as well because the daemons won’t be well integrated into Kubernetes. In particular, + +1. In alternatives (1) and (2), health checking for the daemons would need to be re-implemented, or would not exist at all (because the daemons are not run inside pods). In the current proposal, the Kubelet will health-check daemon pods and restart them if necessary. +2. In alternatives (1) and (2), binding services to a group of daemons is difficult (which is needed in use cases such as the sharded data store use case described above), because the daemons are not run inside pods +3. A big disadvantage of these methods is that adding new daemons in existing nodes is difficult (for example, if a cluster manager wants to add a logging daemon after a cluster has been deployed). +4. The above alternatives are less user-friendly. Users need to learn two ways of launching pods: using the API when launching pods associated with Replication Controllers, and using manifests when launching daemons. So in the alternatives, deployment is more difficult. +5. It’s difficult to upgrade binaries launched in any of those three ways. 
+ +Another alternative is for the user to explicitly assign pods to specific nodes (using the Pod spec) when creating pods. A big disadvantage of this alternative is that the user would need to manually check whether new nodes satisfy the desired labels, and if so add the daemon to the node. This makes deployment painful, and could lead to costly mistakes (if a certain daemon is not launched on a new node which it is supposed to run on). In essence, every user will be re-implementing the Daemon Controller for themselves. + +A third alternative is to generalize the Replication Controller. We could add a field for the user to specify that she wishes to bind pods to certain nodes in the cluster. Or we could add a field to the pod-spec allowing the user to specify that each node can have exactly one instance of a pod (so the user would create a Replication Controller with a very large number of replicas, and set the anti-affinity field to true preventing more than one pod with that label from being scheduled onto a single node). The disadvantage of these methods is that the Daemon Controller and the Replication Controller are very different concepts. The Daemon Controller operates on a per-node basis, while the Replication Controller operates on a per-job basis (in particular, the Daemon Controller will take action when a node is changed or added). So presenting them as different concepts makes for a better user interface. Having small and directed controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having one controller to rule them all. + +## Design + +#### Client +- Add support for daemon controller commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API. 
+ +#### Apiserver +- Accept, parse, validate client commands +- REST API calls will be handled in registry/daemon + - In particular, the api server will add the object to etcd + - DaemonManager listens for updates to etcd (using Framework.informer) +- API object for Daemon Controller will be created in expapi/v1/types.go and expapi/v1/register.go +- Validation code is in expapi/validation + +#### Daemon Manager +- Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector. +- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each daemon controller. If the label of the node matches the selector of the daemon controller, then the daemon manager will create the corresponding daemon pod in the new node. +- The daemon manager will create a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname) + +#### Kubelet +- Does not need to be modified, but health checking for the daemon pods and revive the pods if they are killed (we will set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always. + +## Testing + +Unit Tests: +Each component was unit tested, fakes were implemented when necessary. For example, when testing the client, a fake API server was used. + +End to End Tests: +One end-to-end test was implemented. The end-to-end test verified that the daemon manager runs the daemon on every node, that when a daemon pod is stopped it restarts, that the daemon controller can be reaped (stopped), and that the daemon adds/removes daemon pods appropriately from nodes when their labels change. 
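The label-driven add/remove behavior exercised by the end-to-end test above can be sketched as a simple selector check: a node is eligible for a daemon pod when every key/value pair in the daemon controller's node selector is present in the node's labels. This is an illustrative sketch, not the real implementation.

```go
package main

import "fmt"

// selectorMatches reports whether a node's labels satisfy a daemon
// controller's node selector: every selector key/value pair must be
// present on the node. (Illustrative sketch only.)
func selectorMatches(selector, nodeLabels map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"role": "tolog"}
	withLabel := map[string]string{"role": "tolog", "zone": "a"} // label added -> daemon pod should be created
	withoutLabel := map[string]string{"zone": "a"}               // label removed -> daemon pod should be killed
	fmt.Println(selectorMatches(selector, withLabel), selectorMatches(selector, withoutLabel))
}
```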
+ +## Open Issues +- Rolling updates across nodes should be performed according to the [anti-affinity policy in scheduler](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/plugin/pkg/scheduler/api/v1/types.go). We need to figure out how to share that configuration. +- See how this can work with [Deployment design](https://github.com/GoogleCloudPlatform/kubernetes/issues/1743). -- cgit v1.2.3 From 17b1ec3333214e52782126da286e3952c5012859 Mon Sep 17 00:00:00 2001 From: Ananya Kumar Date: Tue, 1 Sep 2015 22:03:22 -0400 Subject: Update daemon.md --- daemon.md | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/daemon.md b/daemon.md index a948b78a..9cadfc4d 100644 --- a/daemon.md +++ b/daemon.md @@ -1,14 +1,14 @@ -# Daemons in Kubernetes +# Daemon Controller in Kubernetes **Author**: Ananya Kumar (@AnanyaKumar) **Status**: Draft proposal; prototype in progress. -This document presents the design of a daemon controller for Kubernetes, outlines relevant Kubernetes concepts, describes use cases, and lays out milestones for its development. +This document presents the design of the Kubernetes daemon controller, describes use cases, and gives an overview of the code. ## Motivation -In Kubernetes, a Replication Controller ensures that the specified number of a specified pod are running in the cluster at all times (pods are restarted if they are killed). With the Replication Controller, users cannot control which nodes their pods run on - Kubernetes decides how to schedule the pods onto nodes. However, many users want control over how certain pods are scheduled. In particular, many users have requested for a way to run a daemon on every node in the cluster, or on a certain set of nodes in the cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. 
In comes the daemon controller, a way to conveniently create and manage daemon-like workloads in Kubernetes. +Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the daemon controller, a way to conveniently create and manage daemon-like workloads in Kubernetes. ## Use Cases @@ -24,15 +24,15 @@ For other uses, see the related [feature request](https://github.com/GoogleCloud ## Functionality -The Daemon Controller will support standard API features: +The Daemon Controller supports standard API features: - create - - The spec for daemon controllers will have a pod template field. + - The spec for daemon controllers has a pod template field. - Using the pod’s node selector field, Daemon controllers can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘database’. You can use a daemon controller to launch a datastore pod on exactly those nodes labeled ‘database’. - Using the pod's node name field, Daemon controllers can be restricted to operate on a specified node. - The spec for pod templates that run with the Daemon Controller is the same as the spec for pod templates that run with the Replication Controller, except there will not be a ‘replicas’ field (exactly 1 daemon pod will be launched per node). - - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this controller. 
- - The Daemon Controller will not guarantee that Daemon pods show up on nodes (for example because of resource limitations of the node), but will make a best effort to launch Daemon pods (like Replication Controllers do with pods) - - A daemon controller named “foo” will add a “controller: foo” annotation to all the pods that it creates + - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this controller. + - The initial implementation of Daemon Controller does not guarantee that Daemon pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch Daemon pods (like Replication Controllers do with pods). Subsequent revisions might ensure that Daemon pods show up on nodes, pushing out other pods if necessary. + - A daemon controller named “foo” adds a “controller: foo” annotation to all the pods that it creates - YAML example: ```YAML apiVersion: v1 @@ -61,18 +61,19 @@ The Daemon Controller will support standard API features: - describe - Modifiers - delete - - stop: first we turn down the Daemon Controller foo, and then we turn down all pods matching the query “controller: foo” + - stop: first we turn down all the pods controller by the daemon (by setting the nodeName to a non-existed name). Then we turn down the daemon controller. - label - update - - Daemon controllers will have labels, so you could, for example, list all daemon controllers with a certain label (the same way you would for a Replication Controller). - - In general, for all the supported features like get, describe, update, etc, the Daemon Controller will work in a similar way to the Replication Controller. However, note that the Daemon Controller and the Replication Controller are different constructs. + - Daemon controllers have labels, so you could, for example, list all daemon controllers with a certain label (the same way you would for a Replication Controller). 
+ - In general, for all the supported features like get, describe, update, etc, the Daemon Controller works in a similar way to the Replication Controller. However, note that the Daemon Controller and the Replication Controller are different constructs. -### Health checks - - Ordinary health checks specified in the pod template will of course work to keep pods created by a Daemon Controller running. +### Persisting Pods + - Ordinary health checks specified in the pod template work to keep pods created by a Daemon Controller running. + - If a daemon pod is killed or stopped, the daemon controller will create a new replica of the daemon pod on the node. ### Cluster Mutations - - When a new node is added to the cluster the daemon controller should start the daemon on the node (if the node’s labels match the user-specified selectors). This is a big advantage of the Daemon Controller compared to alternative ways of launching daemons and configuring clusters. - - Suppose the user launches a daemon controller that runs a logging daemon on all nodes labeled “tolog”. If the user then adds the “tolog” label to a node (that did not initially have the “tolog” label), the logging daemon should be launched on the node. Additionally, if a user removes the “tolog” label from a node, the logging daemon on that node should be killed. + - When a new node is added to the cluster the daemon controller starts the daemon on the node (if the node’s labels match the user-specified selectors). This is a big advantage of the Daemon Controller compared to alternative ways of launching daemons and configuring clusters. + - Suppose the user launches a daemon controller that runs a logging daemon on all nodes labeled “tolog”. If the user then adds the “tolog” label to a node (that did not initially have the “tolog” label), the logging daemon will launch on the node. Additionally, if a user removes the “tolog” label from a node, the logging daemon on that node will be killed. 
## Alternatives Considered @@ -101,19 +102,19 @@ A third alternative is to generalize the Replication Controller. We could add a #### Apiserver - Accept, parse, validate client commands -- REST API calls will be handled in registry/daemon +- REST API calls are handled in registry/daemon - In particular, the api server will add the object to etcd - DaemonManager listens for updates to etcd (using Framework.informer) -- API object for Daemon Controller will be created in expapi/v1/types.go and expapi/v1/register.go +- API objects for Daemon Controller were created in expapi/v1/types.go and expapi/v1/register.go - Validation code is in expapi/validation #### Daemon Manager - Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector. - Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each daemon controller. If the label of the node matches the selector of the daemon controller, then the daemon manager will create the corresponding daemon pod in the new node. -- The daemon manager will create a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname) +- The daemon manager creates a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname) #### Kubelet -- Does not need to be modified, but health checking for the daemon pods and revive the pods if they are killed (we will set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always. 
+- Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always. ## Testing @@ -124,5 +125,4 @@ End to End Tests: One end-to-end test was implemented. The end-to-end test verified that the daemon manager runs the daemon on every node, that when a daemon pod is stopped it restarts, that the daemon controller can be reaped (stopped), and that the daemon adds/removes daemon pods appropriately from nodes when their labels change. ## Open Issues -- Rolling updates across nodes should be performed according to the [anti-affinity policy in scheduler](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/plugin/pkg/scheduler/api/v1/types.go). We need to figure out how to share that configuration. - See how this can work with [Deployment design](https://github.com/GoogleCloudPlatform/kubernetes/issues/1743). -- cgit v1.2.3 From dc60757674f4b9c663261cf8caf123f9ae8d4ac6 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Mon, 21 Sep 2015 17:15:44 -0700 Subject: Design doc for daemon controller. Originally started as PR #13368. --- daemon.md | 86 ++++++++++++++++++++++++++++----------------------------------- 1 file changed, 38 insertions(+), 48 deletions(-) diff --git a/daemon.md b/daemon.md index 9cadfc4d..c4187c7b 100644 --- a/daemon.md +++ b/daemon.md @@ -1,54 +1,54 @@ -# Daemon Controller in Kubernetes +# DaemonSet in Kubernetes **Author**: Ananya Kumar (@AnanyaKumar) -**Status**: Draft proposal; prototype in progress. +**Status**: Implemented. -This document presents the design of the Kubernetes daemon controller, describes use cases, and gives an overview of the code. +This document presents the design of the Kubernetes DaemonSet, describes use cases, and gives an overview of the code. 
## Motivation -Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the daemon controller, a way to conveniently create and manage daemon-like workloads in Kubernetes. +Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes. ## Use Cases -The daemon controller can be used for user-specified system services, cluster level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category. +The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category. ### User-Specified System Services: -Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The daemon controller can be used to run a data collection service (for example fluentd) and send the data to a service like ElasticSearch for analysis. +Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. 
The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis. ### Cluster-Level Applications -Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A daemon controller is a convenient way to implement such a datastore. +Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘app=datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore. -For other uses, see the related [feature request](https://github.com/GoogleCloudPlatform/kubernetes/issues/1518) +For other uses, see the related [feature request](https://issues.k8s.io/1518) ## Functionality -The Daemon Controller supports standard API features: +The DaemonSet supports standard API features: - create - - The spec for daemon controllers has a pod template field. - - Using the pod’s node selector field, Daemon controllers can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘database’. You can use a daemon controller to launch a datastore pod on exactly those nodes labeled ‘database’. - - Using the pod's node name field, Daemon controllers can be restricted to operate on a specified node. 
- - The spec for pod templates that run with the Daemon Controller is the same as the spec for pod templates that run with the Replication Controller, except there will not be a ‘replicas’ field (exactly 1 daemon pod will be launched per node). - - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this controller. - - The initial implementation of Daemon Controller does not guarantee that Daemon pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch Daemon pods (like Replication Controllers do with pods). Subsequent revisions might ensure that Daemon pods show up on nodes, pushing out other pods if necessary. - - A daemon controller named “foo” adds a “controller: foo” annotation to all the pods that it creates + - The spec for DaemonSets has a pod template field. + - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’. + - Using the pod's node name field, DaemonSets can be restricted to operate on a specified nodeName. + - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec usedby the Replication Controller. + - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this abstraction in the initial implementation. + - The initial implementation of DaemonSet does not guarantee that Daemon pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch Daemon pods (like Replication Controllers do with pods). Subsequent revisions might ensure that Daemon pods show up on nodes, preempting other pods if necessary. 
+ - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \"
 - YAML example:
 ```YAML
 apiVersion: v1
 kind: Daemon
 metadata:
   labels:
-    name: datastore
+    app: datastore
   name: datastore
 spec:
   template:
     metadata:
       labels:
-        name: datastore-shard
+        app: datastore-shard
     spec:
-      node-selector:
-        name: datastore-node
+      nodeSelector:
+        app: datastore-node
       containers:
         name: datastore-shard
         image: kubernetes/sharded
@@ -57,31 +57,29 @@ The Daemon Controller supports standard API features:
       name: main
 ```
 - commands that get info
-  - get (e.g. kubectl get dc)
+  - get (e.g. kubectl get daemonsets)
   - describe
 - Modifiers
-  - delete
-  - stop: first we turn down all the pods controller by the daemon (by setting the nodeName to a non-existed name). Then we turn down the daemon controller.
+  - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeName to a non-existent name); then it deletes the DaemonSet; then it deletes the pods)
   - label
-  - update
-  - Daemon controllers have labels, so you could, for example, list all daemon controllers with a certain label (the same way you would for a Replication Controller).
+  - update (only allowed to selector and to nodeSelector and nodeName of pod template)
+  - DaemonSets have labels, so you could, for example, list all DaemonSets with a certain label (the same way you would for a Replication Controller).
### Persisting Pods - - Ordinary health checks specified in the pod template work to keep pods created by a Daemon Controller running. - - If a daemon pod is killed or stopped, the daemon controller will create a new replica of the daemon pod on the node. + - Ordinary livenes probes specified in the pod template work to keep pods created by a DaemonSet running. + - If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node. ### Cluster Mutations - - When a new node is added to the cluster the daemon controller starts the daemon on the node (if the node’s labels match the user-specified selectors). This is a big advantage of the Daemon Controller compared to alternative ways of launching daemons and configuring clusters. - - Suppose the user launches a daemon controller that runs a logging daemon on all nodes labeled “tolog”. If the user then adds the “tolog” label to a node (that did not initially have the “tolog” label), the logging daemon will launch on the node. Additionally, if a user removes the “tolog” label from a node, the logging daemon on that node will be killed. + - When a new node is added to the cluster the DaemonSet starts the daemon on the node (if the node’s labels match the user-specified selectors). + - Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed. ## Alternatives Considered An alternative way to launch daemons is to avoid going through the API server, and instead provide ways to package the daemon into the node. For example, users could: 1. Include the daemon in the machine image -2. Use config files to launch daemons -3. Use static pod manifests to launch daemon pods when the node initializes +2. 
Use static pod manifests to launch daemon pods when the node initializes These alternatives don’t work as well because the daemons won’t be well integrated into Kubernetes. In particular, @@ -91,38 +89,30 @@ These alternatives don’t work as well because the daemons won’t be well inte 4. The above alternatives are less user-friendly. Users need to learn two ways of launching pods: using the API when launching pods associated with Replication Controllers, and using manifests when launching daemons. So in the alternatives, deployment is more difficult. 5. It’s difficult to upgrade binaries launched in any of those three ways. -Another alternative is for the user to explicitly assign pods to specific nodes (using the Pod spec) when creating pods. A big disadvantage of this alternative is that the user would need to manually check whether new nodes satisfy the desired labels, and if so add the daemon to the node. This makes deployment painful, and could lead to costly mistakes (if a certain daemon is not launched on a new node which it is supposed to run on). In essence, every user will be re-implementing the Daemon Controller for themselves. +Another alternative is for the user to explicitly assign pods to specific nodes (using the Pod spec) when creating pods. A big disadvantage of this alternative is that the user would need to manually check whether new nodes satisfy the desired labels, and if so add the daemon to the node. This makes deployment painful, and could lead to costly mistakes (if a certain daemon is not launched on a new node which it is supposed to run on). In essence, every user will be re-implementing the DaemonSet for themselves. -A third alternative is to generalize the Replication Controller. We could add a field for the user to specify that she wishes to bind pods to certain nodes in the cluster. 
Or we could add a field to the pod-spec allowing the user to specify that each node can have exactly one instance of a pod (so the user would create a Replication Controller with a very large number of replicas, and set the anti-affinity field to true preventing more than one pod with that label from being scheduled onto a single node). The disadvantage of these methods is that the Daemon Controller and the Replication Controller are very different concepts. The Daemon Controller operates on a per-node basis, while the Replication Controller operates on a per-job basis (in particular, the Daemon Controller will take action when a node is changed or added). So presenting them as different concepts makes for a better user interface. Having small and directed controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having one controller to rule them all. +A third alternative is to generalize the Replication Controller. We could add a field for the user to specify that she wishes to bind pods to certain nodes in the cluster. Or we could add a field to the pod-spec allowing the user to specify that each node can have exactly one instance of a pod (so the user would create a Replication Controller with a very large number of replicas, and set the anti-affinity field to true preventing more than one pod with that label from being scheduled onto a single node). The disadvantage of these methods is that the DaemonSet and the Replication Controller are very different concepts. The DaemonSet operates on a per-node basis, while the Replication Controller operates on a per-job basis (in particular, the DaemonSet will take action when a node is changed or added). So presenting them as different concepts makes for a better user interface. 
Having small and directed controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having one controller to rule them all (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058)). ## Design #### Client -- Add support for daemon controller commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API. +- Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API. #### Apiserver - Accept, parse, validate client commands - REST API calls are handled in registry/daemon - In particular, the api server will add the object to etcd - DaemonManager listens for updates to etcd (using Framework.informer) -- API objects for Daemon Controller were created in expapi/v1/types.go and expapi/v1/register.go +- API objects for DaemonSet were created in expapi/v1/types.go and expapi/v1/register.go - Validation code is in expapi/validation #### Daemon Manager -- Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector. -- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each daemon controller. If the label of the node matches the selector of the daemon controller, then the daemon manager will create the corresponding daemon pod in the new node. +- Creates new DaemonSets when requested. 
Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet’s selector.
+- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node.
 - The daemon manager creates a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname)

 #### Kubelet
-- Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always.
-
-## Testing
-
-Unit Tests:
-Each component was unit tested, fakes were implemented when necessary. For example, when testing the client, a fake API server was used.
-
-End to End Tests:
-One end-to-end test was implemented. The end-to-end test verified that the daemon manager runs the daemon on every node, that when a daemon pod is stopped it restarts, that the daemon controller can be reaped (stopped), and that the daemon adds/removes daemon pods appropriately from nodes when their labels change.
+- Does not need to be modified; health checking will occur for the daemon pods, and the Kubelet will revive them if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always.

 ## Open Issues
-- See how this can work with [Deployment design](https://github.com/GoogleCloudPlatform/kubernetes/issues/1743).
+- See how this can work with [Deployment design](http://issues.k8s.io/1743).
-- cgit v1.2.3 From 8ad8f8cff031cb1072a13aaba55c495b4d6976ec Mon Sep 17 00:00:00 2001 From: Zichang Lin Date: Wed, 23 Sep 2015 14:58:16 +0800 Subject: Change a describe in docs/design/secrets.md --- secrets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/secrets.md b/secrets.md index 895d9448..e8a5e42f 100644 --- a/secrets.md +++ b/secrets.md @@ -73,7 +73,7 @@ Goals of this design: 2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials from a `.dockercfg` file, so that containers can push images 3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys, - so that I can push and fetch to and from the repository + so that I can push to and fetch from the repository 2. As a user, I want to allow containers to consume supplemental information about services such as username and password which should be kept secret, so that I can share secrets about a service amongst the containers in my application securely -- cgit v1.2.3 From 28aa2acb97f52a83e26514b73ee2c201b1a39660 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Wed, 16 Sep 2015 22:15:05 -0700 Subject: move experimental/v1 to experimental/v1alpha1; use "group/version" in many places where used to expect "version" only. --- extending-api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extending-api.md b/extending-api.md index bbd02a54..628b5a16 100644 --- a/extending-api.md +++ b/extending-api.md @@ -114,7 +114,7 @@ For example, if a user creates: ```yaml metadata: name: cron-tab.example.com -apiVersion: experimental/v1 +apiVersion: experimental/v1alpha1 kind: ThirdPartyResource description: "A specification of a Pod to run on a cron style schedule" versions: -- cgit v1.2.3 From 635b078fd44bb12e409674542882bb556e9d5855 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Thu, 24 Sep 2015 16:22:10 -0700 Subject: Respond to reviewer comments. 
--- daemon.md | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/daemon.md b/daemon.md
index c4187c7b..43c49465 100644
--- a/daemon.md
+++ b/daemon.md
@@ -28,15 +28,15 @@ The DaemonSet supports standard API features:
 - create
   - The spec for DaemonSets has a pod template field.
   - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’.
-  - Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified nodeName.
-  - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec usedby the Replication Controller.
-  - We will not guarantee that daemon pods show up on nodes before regular pods - run ordering is out of scope for this abstraction in the initial implementation.
-  - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods).
+  - Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node.
+  - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller.
+  - The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods.
+  - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods).
Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary. - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \" - YAML example: ```YAML apiVersion: v1 - kind: Daemon + kind: DaemonSet metadata: labels: app: datastore @@ -62,36 +62,28 @@ The DaemonSet supports standard API features: - Modifiers - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeName to a non-existant name); then it deletes the DaemonSet; then it deletes the pods) - label - - update (only allowed to selector and to nodeSelector and nodeName of pod template) - - DaemonSets have labels, so you could, for example, list all DaemonSets with a certain label (the same way you would for a Replication Controller). + - annotate + - update operations like patch and replace (only allowed to selector and to nodeSelector and nodeName of pod template) + - DaemonSets have labels, so you could, for example, list all DaemonSets with certain labels (the same way you would for a Replication Controller). - In general, for all the supported features like get, describe, update, etc, the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs. ### Persisting Pods - - Ordinary livenes probes specified in the pod template work to keep pods created by a DaemonSet running. + - Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running. - If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node. ### Cluster Mutations - - When a new node is added to the cluster the DaemonSet starts the daemon on the node (if the node’s labels match the user-specified selectors). 
+ - When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node’s labels. - Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed. ## Alternatives Considered -An alternative way to launch daemons is to avoid going through the API server, and instead provide ways to package the daemon into the node. For example, users could: +We considered several alternatives that were deemed inferior to the approach of creating a new DaemonSet abstraction. -1. Include the daemon in the machine image -2. Use static pod manifests to launch daemon pods when the node initializes +One alternative is to include the daemon in the machine image. In this case it would run outside of Kubernetes proper, and thus not be monitored, health checked, usable as a service endpoint, easily upgradable, etc. -These alternatives don’t work as well because the daemons won’t be well integrated into Kubernetes. In particular,
In alternatives (1) and (2), binding services to a group of daemons is difficult (which is needed in use cases such as the sharded data store use case described above), because the daemons are not run inside pods -3. A big disadvantage of these methods is that adding new daemons in existing nodes is difficult (for example, if a cluster manager wants to add a logging daemon after a cluster has been deployed). -4. The above alternatives are less user-friendly. Users need to learn two ways of launching pods: using the API when launching pods associated with Replication Controllers, and using manifests when launching daemons. So in the alternatives, deployment is more difficult. -5. It’s difficult to upgrade binaries launched in any of those three ways. - -Another alternative is for the user to explicitly assign pods to specific nodes (using the Pod spec) when creating pods. A big disadvantage of this alternative is that the user would need to manually check whether new nodes satisfy the desired labels, and if so add the daemon to the node. This makes deployment painful, and could lead to costly mistakes (if a certain daemon is not launched on a new node which it is supposed to run on). In essence, every user will be re-implementing the DaemonSet for themselves. - -A third alternative is to generalize the Replication Controller. We could add a field for the user to specify that she wishes to bind pods to certain nodes in the cluster. Or we could add a field to the pod-spec allowing the user to specify that each node can have exactly one instance of a pod (so the user would create a Replication Controller with a very large number of replicas, and set the anti-affinity field to true preventing more than one pod with that label from being scheduled onto a single node). The disadvantage of these methods is that the DaemonSet and the Replication Controller are very different concepts. 
The DaemonSet operates on a per-node basis, while the Replication Controller operates on a per-job basis (in particular, the DaemonSet will take action when a node is changed or added). So presenting them as different concepts makes for a better user interface. Having small and directed controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having one controller to rule them all (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058)). +A third alternative is to generalize the Replication Controller. We would do something like: if you set the `replicas` field of the ReplicationControllerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number of nodes ever expected in the cluster -- and would use some anti-affinity mechanism to ensure that no more than one Pod from the ReplicationController runs on any given node. There are two downsides to this approach. First, there would always be a large number of Pending pods in the scheduler (these will be scheduled onto new machines when they are added to the cluster). The second downside is more philosophical: DaemonSet and the Replication Controller are very different concepts. We believe that having small, targeted controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having larger multi-functional controllers (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for some discussion of this topic). ## Design @@ -115,4 +107,4 @@ A third alternative is to generalize the Replication Controller. We could add a - Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). 
We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always. ## Open Issues -- See how this can work with [Deployment design](http://issues.k8s.io/1743). +- Should work similarly to [Deployment](http://issues.k8s.io/1743). -- cgit v1.2.3 From f2ae9d3ebcb40e07c308210b93e1bce2992e3ff0 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Thu, 24 Sep 2015 17:17:39 -0700 Subject: Ran update-generated-docs.sh --- daemon.md | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/daemon.md b/daemon.md index 43c49465..c88fcec7 100644 --- a/daemon.md +++ b/daemon.md @@ -1,3 +1,36 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/daemon.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + # DaemonSet in Kubernetes **Author**: Ananya Kumar (@AnanyaKumar) @@ -8,16 +41,18 @@ This document presents the design of the Kubernetes DaemonSet, describes use cas ## Motivation -Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes. +Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes. ## Use Cases The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category. ### User-Specified System Services: + Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis. 
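The label-based node targeting that these use cases rely on reduces to a subset check: a DaemonSet's pod runs on exactly those nodes whose labels contain every key/value pair in the pod template's nodeSelector. A minimal sketch in Python (the helper and the cluster data here are hypothetical, not the actual controller code):

```python
def selector_matches(node_labels, node_selector):
    # An empty selector matches every node; otherwise every key/value
    # pair in the selector must appear in the node's labels.
    return all(node_labels.get(key) == value for key, value in node_selector.items())

# Hypothetical cluster state: two labeled nodes and a fluentd-style selector.
nodes = {
    "node-1": {"logger": "fluentd", "zone": "us-west-2a"},
    "node-2": {"zone": "us-west-2a"},
}
selector = {"logger": "fluentd"}

targets = [name for name, labels in nodes.items() if selector_matches(labels, selector)]
print(targets)  # ['node-1']
```

The same check, run against an empty selector, would target every node in the cluster, which is the "run a daemon on every node" case.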
### Cluster-Level Applications + Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘app=datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore. For other uses, see the related [feature request](https://issues.k8s.io/1518) @@ -34,6 +69,7 @@ The DaemonSet supports standard API features: - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary. - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \" - YAML example: + ```YAML apiVersion: v1 kind: DaemonSet @@ -56,6 +92,7 @@ The DaemonSet supports standard API features: - containerPort: 9042 name: main ``` + - commands that get info - get (e.g. kubectl get daemonsets) - describe @@ -68,10 +105,12 @@ The DaemonSet supports standard API features: - In general, for all the supported features like get, describe, update, etc, the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs. ### Persisting Pods + - Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running. - If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node. ### Cluster Mutations + - When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node’s labels. 
- Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed. @@ -88,9 +127,11 @@ A third alternative is to generalize the Replication Controller. We would do som ## Design #### Client + - Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API. #### Apiserver + - Accept, parse, validate client commands - REST API calls are handled in registry/daemon - In particular, the api server will add the object to etcd @@ -99,12 +140,20 @@ A third alternative is to generalize the Replication Controller. We would do som - Validation code is in expapi/validation #### Daemon Manager + - Creates new DaemonSets when requested. Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet’s selector. - Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node. - The daemon manager creates a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname) #### Kubelet + - Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). 
We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always. ## Open Issues + - Should work similarly to [Deployment](http://issues.k8s.io/1743). + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]() + -- cgit v1.2.3 From 10a5a94e2db162bc55f7924fd02ff0bb50f6e2a9 Mon Sep 17 00:00:00 2001 From: Daniel Smith Date: Thu, 24 Sep 2015 14:00:27 -0700 Subject: Propose combining domain name & group Also remove group from versions. --- extending-api.md | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/extending-api.md b/extending-api.md index 628b5a16..beb3d7ac 100644 --- a/extending-api.md +++ b/extending-api.md @@ -73,11 +73,11 @@ Kubernetes API server to provide the following features: The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be programmatically convertible to the name of the resource using the following conversion. Kinds are expected to be of the form ``, the -`APIVersion` for the object is expected to be `//`. +`APIVersion` for the object is expected to be `/`. To +prevent collisions, it's expected that you'll use a fully qualified domain +name for the API group, e.g. `example.com`. -For example `example.com/stable/v1` - -`domain-name` is expected to be a fully qualified domain name. +For example `stable.example.com/v1` 'CamelCaseKind' is the specific type name. 
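The naming convention can be illustrated mechanically: split the ThirdPartyResource name into a CamelCase kind plus a fully qualified group, then derive the versioned resource path. A rough Python sketch; the helper names are hypothetical, and the pluralization here is naive lowercase-plus-"s", not the API server's actual implementation:

```python
def parse_tpr_name(name):
    # 'cron-tab.stable.example.com' -> kind 'CronTab', group 'stable.example.com'
    kind_part, group = name.split(".", 1)
    kind = "".join(word.capitalize() for word in kind_part.split("-"))
    return kind, group

def resource_path(group, version, namespace, kind):
    # Naive pluralization: lowercase the kind and append 's'.
    resource = kind.lower() + "s"
    return "/apis/%s/%s/namespaces/%s/%s" % (group, version, namespace, resource)

kind, group = parse_tpr_name("cron-tab.stable.example.com")
print(kind, group)                                  # CronTab stable.example.com
print(resource_path(group, "v1", "default", kind))
# /apis/stable.example.com/v1/namespaces/default/crontabs
```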
@@ -113,18 +113,17 @@ For example, if a user creates: ```yaml metadata: - name: cron-tab.example.com + name: cron-tab.stable.example.com apiVersion: experimental/v1alpha1 kind: ThirdPartyResource description: "A specification of a Pod to run on a cron style schedule" versions: - - name: stable/v1 - - name: experimental/v2 +- name: v1 +- name: v2 ``` -Then the API server will program in two new RESTful resource paths: - * `/thirdparty/example.com/stable/v1/namespaces//crontabs/...` - * `/thirdparty/example.com/experimental/v2/namespaces//crontabs/...` +Then the API server will program in the new RESTful resource path: + * `/apis/stable.example.com/v1/namespaces//crontabs/...` Now that this schema has been created, a user can `POST`: @@ -134,19 +133,19 @@ Now that this schema has been created, a user can `POST`: "metadata": { "name": "my-new-cron-object" }, - "apiVersion": "example.com/stable/v1", + "apiVersion": "stable.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", "image": "my-awesome-chron-image" } ``` -to: `/third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object` +to: `/apis/stable.example.com/v1/namespaces/default/crontabs` and the corresponding data will be stored into etcd by the APIServer, so that when the user issues: ``` -GET /third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object` +GET /apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object` ``` And when they do that, they will get back the same data, but with additional Kubernetes metadata @@ -155,21 +154,21 @@ And when they do that, they will get back the same data, but with additional Kub Likewise, to list all resources, a user can issue: ``` -GET /third-party/example.com/stable/v1/namespaces/default/crontabs +GET /apis/stable.example.com/v1/namespaces/default/crontabs ``` and get back: ```json { - "apiVersion": "example.com/stable/v1", + "apiVersion": "stable.example.com/v1", "kind": "CronTabList", "items": [ { 
"metadata": { "name": "my-new-cron-object" }, - "apiVersion": "example.com/stable/v1", + "apiVersion": "stable.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", "image": "my-awesome-chron-image" -- cgit v1.2.3 From cb58afd814634c8053405d4a15c3d8a6040d05fe Mon Sep 17 00:00:00 2001 From: Isaac Hollander McCreery Date: Thu, 8 Oct 2015 16:57:05 -0700 Subject: Proposed versioning changes --- versioning.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/versioning.md b/versioning.md index ede6b450..c764a585 100644 --- a/versioning.md +++ b/versioning.md @@ -44,9 +44,9 @@ Legend: * Kube 1.0.0, 1.0.1 -- DONE! * Kube 1.0.X (X>1): Standard operating procedure. We patch the release-1.0 branch as needed and increment the patch number. -* Kube 1.1.0-alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. (This applies to the beta releases as well.) -* Kube 1.1.0-beta.X: When HEAD is feature-complete, we go into code freeze 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. Releases continue to be cut from HEAD until we're essentially done. -* Kube 1.1.0: Final release. Should occur between 3 and 4 months after 1.0. +* Kube 1.1.0-alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. +* Kube 1.1.0-beta: When HEAD is feature-complete, we will cut the release-1.1.0 branch 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. This cut will be marked as 1.1.0-beta, and HEAD will be revved to 1.2.0-alpha.0. +* Kube 1.1.0: Final release, cut from the release-1.1.0 branch cut two weeks prior. Should occur between 3 and 4 months after 1.0. 1.1.1-beta will be tagged at the same time on the same branch. 
### Major version timeline -- cgit v1.2.3 From 0942ca8f34a43393c1036ec4a5688fc6078c46d7 Mon Sep 17 00:00:00 2001 From: Mike Danese Date: Thu, 8 Oct 2015 15:52:11 -0700 Subject: simplify DaemonReaper by using NodeSelector --- daemon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/daemon.md b/daemon.md index c88fcec7..a72b8755 100644 --- a/daemon.md +++ b/daemon.md @@ -97,7 +97,7 @@ The DaemonSet supports standard API features: - get (e.g. kubectl get daemonsets) - describe - Modifiers - - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeName to a non-existant name); then it deletes the DaemonSet; then it deletes the pods) + - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is unlikely to be set on any node); then it deletes the DaemonSet; then it deletes the pods) - label - annotate - update operations like patch and replace (only allowed to selector and to nodeSelector and nodeName of pod template) -- cgit v1.2.3 From 1f6336e0656d5f00e1b25e9e4810fea1f738e875 Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Mon, 12 Oct 2015 17:47:16 -0700 Subject: refactor "experimental" to "extensions" in documents --- extending-api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/extending-api.md b/extending-api.md index beb3d7ac..077d5530 100644 --- a/extending-api.md +++ b/extending-api.md @@ -114,7 +114,7 @@ For example, if a user creates: ```yaml metadata: name: cron-tab.stable.example.com -apiVersion: experimental/v1alpha1 +apiVersion: extensions/v1beta1 kind: ThirdPartyResource description: "A specification of a Pod to run on a cron style schedule" versions: -- cgit v1.2.3 From ad3c044039b6b08a491de7c7921cdc8948a977f0 Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Tue, 28 Jul 2015 14:18:50 -0400 Subject: AWS "under the hood" document Document how we 
implement kubernetes on AWS, so that configuration tools other than kube-up can have a reference for what they should do, and generally to help developers get up to speed. --- aws_under_the_hood.md | 271 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 271 insertions(+) create mode 100644 aws_under_the_hood.md diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md new file mode 100644 index 00000000..eece5dfb --- /dev/null +++ b/aws_under_the_hood.md @@ -0,0 +1,271 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/aws_under_the_hood.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +## Peeking under the hood of kubernetes on AWS + +We encourage you to use kube-up (or CloudFormation) to create a cluster. But +it is useful to know what is being created: for curiosity, to understand any +problems that may arise, or if you have to create things manually because the +scripts are unsuitable for any reason. We don't recommend manual configuration +(please file an issue and let us know what's missing if there's something you +need) but sometimes it is the only option. + +This document sets out to document how kubernetes on AWS maps to AWS objects. +Familiarity with AWS is assumed. + +### Top-level + +Kubernetes consists of a single master node, and a collection of minion nodes. +Other documents describe the general architecture of Kubernetes (all nodes run +Docker; the kubelet agent runs on each node and launches containers; the +kube-proxy relays traffic between the nodes etc). + +By default on AWS: + +* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently + modern kernel to give a good experience with Docker, it doesn't require a + reboot. (The default SSH user is `ubuntu` for this and other ubuntu images) +* By default we run aufs over ext4 as the filesystem / container storage on the + nodes (mostly because this is what GCE uses). + +These defaults can be changed by passing different environment variables to +kube-up. + +### Storage + +AWS does support persistent volumes via EBS. These can then be attached to +pods that should store persistent data (e.g. if you're running a database). + +Minions do not have persistent volumes otherwise. 
In general, kubernetes +containers do not have persistent storage unless you attach a persistent +volume, and so minions on AWS use instance storage. Instance storage is +cheaper, often faster, and historically more reliable. This does mean that you +should pick an instance type that has sufficient instance storage, unless you +can make do with whatever space is left on your root partition. + +The master _does_ have a persistent volume attached to it. Containers are +mostly run against instance storage, just like the minions, except that we +repoint some important data onto the persistent volume. + +By default we use aufs over ext4. `DOCKER_STORAGE=btrfs` is also a good choice +for a filesystem: it is relatively reliable with Docker; btrfs itself is much +more reliable than it used to be with modern kernels. It can easily span +multiple volumes, which is particularly useful when we are using an instance +type with multiple ephemeral instance disks. + +### AutoScaling + +We run the minions in an AutoScalingGroup. Currently auto-scaling (e.g. based +on CPU) is not actually enabled (#11935). Instead, the auto-scaling group +means that AWS will relaunch any minions that are terminated. + +We do not currently run the master in an AutoScalingGroup, but we should +(#11934). + +### Networking + +Kubernetes uses an IP-per-pod model. This means that a node, which runs many +pods, must have many IPs. The way we implement this on AWS is to use VPCs and +the advanced routing support that it allows. Each node is assigned a /24 CIDR; +then this CIDR is configured to route to an instance in the VPC routing table. + +It is also possible to use overlay networking on AWS, but the default kube-up +configuration does not. + +### NodePort & LoadBalancing + +Kubernetes on AWS integrates with ELB. 
When you create a service with +Type=LoadBalancer, kubernetes (the kube-controller-manager) will create an ELB, +create a security group for the ELB which allows access on the service ports, +attach all the minions to the ELB, and modify the security group for the +minions to allow traffic from the ELB to the minions. This traffic reaches +kube-proxy where it is then forwarded to the pods. + +ELB requires that all minions listen on a single port, and it acts as a layer-7 +forwarding proxy (i.e. the source IP is not preserved). It is therefore not trivial for +kube-proxy to recognize the traffic. So, LoadBalancer services are +also exposed as NodePort services. For NodePort services, a cluster-wide port +is assigned by kubernetes to the service, and kube-proxy listens externally on +that port on every minion, and forwards traffic to the pods. So for a +load-balanced service, ELB is configured to proxy traffic on the public port +(e.g. port 80) to the NodePort assigned to the service (e.g. 31234), kube-proxy +recognizes the traffic coming to the NodePort by the inbound port number, and +sends it to the correct pods for the service. + +Note that we do not automatically open NodePort services in the AWS firewall +(although we do open LoadBalancer services). This is because we expect that +NodePort services are more of a building block for things like inter-cluster +services or for LoadBalancer. To consume a NodePort service externally, you +will likely have to open the port in the minion security group +(`kubernetes-minion-`). + +### IAM + +kube-up sets up two IAM roles, one for the master called +[kubernetes-master](cluster/aws/templates/iam/kubernetes-master-policy.json) +and one for the minions called +[kubernetes-minion](cluster/aws/templates/iam/kubernetes-minion-policy.json). + +The master is responsible for creating ELBs and configuring them, as well as +setting up advanced VPC routing. 
Currently it has blanket permissions on EC2, +along with rights to create and destroy ELBs. + +The minion does not need a lot of access to the AWS APIs. It needs to download +a distribution file, and then it is responsible for attaching and detaching EBS +volumes to itself. + +The minion policy is relatively minimal. The master policy is probably overly +permissive. The security conscious may want to lock down the IAM policies +further (#11936). + +We should make it easier to extend IAM permissions and also ensure that they +are correctly configured (#???) + +### Tagging + +All AWS resources are tagged with a tag named "KubernetesCluster". This tag is +used to identify a particular 'instance' of Kubernetes, even if two clusters +are deployed into the same VPC. (The script doesn't do this by default, but it +can be done.) + +Within the AWS cloud provider logic, we filter requests to the AWS APIs to +match resources with our cluster tag. So we only see our own AWS objects. + +If you choose not to use kube-up, you must tag everything with a +KubernetesCluster tag with a unique per-cluster value. + + +# AWS Objects + +The kube-up script does a number of things in AWS: + +* Creates an S3 bucket (`AWS_S3_BUCKET`) and copies the kubernetes distribution + and the salt scripts into it. They are made world-readable and the HTTP URLs +are passed to instances; this is how kubernetes code gets onto the machines. +* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`. + `kubernetes-master` is used by the master node; `kubernetes-minion` is used +by minion nodes. +* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is + the OpenSSH key fingerprint, so that multiple users can run the script with +different keys and their keys will not collide (with near-certainty). It will +use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create +one there. 
(With the default ubuntu images, if you have to SSH in: the user is +`ubuntu` and that user can `sudo`) +* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16), and + enables the `dns-support` and `dns-hostnames` options. +* Creates an internet gateway for the VPC. +* Creates a route table for the VPC, with the internet gateway as the default + route. +* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` + (defaults to us-west-2a). Currently kubernetes runs in a single AZ; there +are two philosophies on how to achieve HA: cluster-per-AZ and +cross-AZ-clusters. cluster-per-AZ says you should have an independent cluster +for each AZ; they are entirely separate. cross-AZ-clusters allows a single +cluster to span multiple AZs. The debate is open here: cluster-per-AZ is more +robust but cross-AZ-clusters are more convenient. For now though, each AWS +Kubernetes cluster lives in one AZ. +* Associates the subnet to the route table +* Creates security groups for the master node (`kubernetes-master-`) + and the minion nodes (`kubernetes-minion-`) +* Configures security groups so that masters & minions can intercommunicate, + and opens SSH to the world on master & minions, and opens port 443 to the +world on the master (for the HTTPS API endpoint) +* Creates an EBS volume for the master node of size `MASTER_DISK_SIZE` and type + `MASTER_DISK_TYPE` +* Launches a master node with a fixed IP address (172.20.0.9), with the + security group, IAM credentials etc. An instance script is used to pass +vital configuration information to Salt. The hope is that over time we can +reduce the amount of configuration information that must be passed in this way. +* Once the instance is up, it attaches the EBS volume & sets up a manual + routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to +10.246.0.0/24) +* Creates an auto-scaling launch-configuration and group for the minions. 
The + name for both is `-minion-group`, which defaults to +`kubernetes-minion-group`. The auto-scaling group has size min & max both set +to `NUM_MINIONS`. You can change the size of the auto-scaling group to add or +remove minions (directly through the AWS API/Console). The minion nodes +self-configure: they come up, run Salt with the stored configuration; connect +to the master and are assigned an internal CIDR; the master configures the +route-table with the minion CIDR. The script does health-check the minions, +but this is a self-check; it is not required. + +If attempting this configuration manually, I highly recommend following along +with the kube-up script, and being sure to tag everything with a +`KubernetesCluster`=`` tag. Also, passing the right configuration +options to Salt when not using the script is tricky: the plan here is to +simplify this by having Kubernetes take on more node configuration, and even +potentially remove Salt altogether. + + +## Manual infrastructure creation + +While this work is not yet complete, advanced users may choose to create (some) +AWS objects themselves, and still make use of the kube-up script (to configure +Salt, for example). + +* `AWS_S3_BUCKET` will use an existing S3 bucket +* `VPC_ID` will reuse an existing VPC +* `SUBNET_ID` will reuse an existing subnet +* If your route table is tagged with the correct `KubernetesCluster`, it will + be reused +* If your security groups are appropriately named, they will be reused. + +Currently there is no way to do the following with kube-up. 
If these affect +you, please open an issue with a description of what you're trying to do (your +use-case) and we'll see what we can do: + +* Use an existing AWS SSH key with an arbitrary name +* Override the IAM credentials in a sensible way (but this is in-progress) +* Use different security group permissions +* Configure your own auto-scaling groups + +# Instance boot + +The instance boot procedure is currently pretty complicated, primarily because +we must marshal configuration from Bash to Salt via the AWS instance script. +As we move more post-boot configuration out of Salt and into Kubernetes, we +will hopefully be able to simplify this. + +When the kube-up script launches instances, it builds an instance startup +script which includes some configuration options passed to kube-up, and +concatenates some of the scripts found in the cluster/aws/templates directory. +These scripts are responsible for mounting and formatting volumes, downloading +Salt & Kubernetes from the S3 bucket, and then triggering Salt to actually +install Kubernetes. + + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() + -- cgit v1.2.3 From 1d67698c10c7f102f41ebb46ead1304949f04e82 Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Sat, 19 Sep 2015 12:53:19 -0400 Subject: Changes per reviews --- aws_under_the_hood.md | 300 ++++++++++++++++++++++++++++---------------------- 1 file changed, 167 insertions(+), 133 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index eece5dfb..17ac1543 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -31,21 +31,31 @@ Documentation for other releases can be found at -## Peeking under the hood of kubernetes on AWS +# Peeking under the hood of Kubernetes on AWS -We encourage you to use kube-up (or CloudFormation) to create a cluster. 
But
-it is useful to know what is being created: for curiosity, to understand any
-problems that may arise, or if you have to create things manually because the
-scripts are unsuitable for any reason. We don't recommend manual configuration
-(please file an issue and let us know what's missing if there's something you
-need) but sometimes it is the only option.
+This document provides high-level insight into how Kubernetes works on AWS and
+maps to AWS objects. We assume that you are familiar with AWS.

-This document sets out to document how kubernetes on AWS maps to AWS objects.
-Familiarity with AWS is assumed.
+We encourage you to use [kube-up](../getting-started-guides/aws.md) (or
+[CloudFormation](../getting-started-guides/aws-coreos.md) to create clusters on
+AWS. We recommend that you avoid manual configuration but are aware that
+sometimes it's the only option.

-### Top-level
+Tip: You should open an issue and let us know what enhancements can be made to
+the scripts to better suit your needs.
+
+That said, it's also useful to know what's happening under the hood when
+Kubernetes clusters are created on AWS. This can be particularly useful if
+problems arise or in circumstances where the provided scripts are lacking and
+you manually created or configured your cluster.
+
+### Architecture overview
+
+Kubernetes is a cluster of several machines that consists of a Kubernetes
+master and a set number of nodes (previously known as 'minions') for which the
+master is responsible. See the [Architecture](architecture.md) topic for
+more details.

-Kubernetes consists of a single master node, and a collection of minion nodes.
 Other documents describe the general architecture of Kubernetes (all nodes run
 Docker; the kubelet agent runs on each node and launches containers; the
 kube-proxy relays traffic between the nodes etc).
@@ -53,171 +63,192 @@ kube-proxy relays traffic between the nodes etc).
By default on AWS: * Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently - modern kernel to give a good experience with Docker, it doesn't require a - reboot. (The default SSH user is `ubuntu` for this and other ubuntu images) + modern kernel that parise well with Docker and doesn't require a + reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.) * By default we run aufs over ext4 as the filesystem / container storage on the nodes (mostly because this is what GCE uses). -These defaults can be changed by passing different environment variables to +You can override these defaults by passing different environment variables to kube-up. ### Storage -AWS does support persistent volumes via EBS. These can then be attached to -pods that should store persistent data (e.g. if you're running a database). - -Minions do not have persistent volumes otherwise. In general, kubernetes -containers do not have persistent storage unless you attach a persistent -volume, and so minions on AWS use instance storage. Instance storage is -cheaper, often faster, and historically more reliable. This does mean that you -should pick an instance type that has sufficient instance storage, unless you -can make do with whatever space is left on your root partition. - -The master _does_ have a persistent volume attached to it. Containers are -mostly run against instance storage, just like the minions, except that we -repoint some important data onto the peristent volume. - -By default we use aufs over ext4. `DOCKER_STORAGE=btrfs` is also a good choice -for a filesystem: it is relatively reliable with Docker; btrfs itself is much -more reliable than it used to be with modern kernels. It can easily span -multiple volumes, which is particularly useful when we are using an instance -type with multiple ephemeral instance disks. +AWS supports persistent volumes by using [Elastic Block Store +(EBS)](../user-guide/volumes.md#awselasticblockstore). 
These can then be +attached to pods that should store persistent data (e.g. if you're running a +database). + +By default, nodes in AWS use `[instance +storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)' +unless you create pods with persistent volumes +`[(EBS)](../user-guide/volumes.md#awselasticblockstore)`. In general, +Kubernetes containers do not have persistent storage unless you attach a +persistent volume, and so nodes on AWS use instance storage. Instance +storage is cheaper, often faster, and historically more reliable. This does +mean that you should pick an instance type that has sufficient instance +storage, unless you can make do with whatever space is left on your root +partition. + +Note: Master uses a persistent volume ([etcd](architecture.html#etcd)) to track +its state but similar to the nodes, container are mostly run against instance +storage, except that we repoint some important data onto the peristent volume. + +The default storage driver for Docker images is aufs. Passing the environment +variable `DOCKER_STORAGE=btrfs` is also a good choice for a filesystem. btrfs +is relatively reliable with Docker and has improved its reliability with modern +kernels. It can easily span multiple volumes, which is particularly useful +when we are using an instance type with multiple ephemeral instance disks. ### AutoScaling -We run the minions in an AutoScalingGroup. Currently auto-scaling (e.g. based -on CPU) is not actually enabled (#11935). Instead, the auto-scaling group -means that AWS will relaunch any minions that are terminated. +Nodes (except for the master) are run in an +`[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) +on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled +([#11935](http://issues.k8s.io/11935)). Instead, the auto-scaling group means +that AWS will relaunch any non-master nodes that are terminated. 
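Since the group's minimum and maximum sizes are pinned, resizing the group is how nodes are added or removed. A minimal sketch of that pinning (the `asg_size_flags` helper is hypothetical, not part of kube-up; it only prints the flags one would pass to the real `aws autoscaling` CLI):

```shell
# Hypothetical helper: kube-up pins min, max, and desired capacity to the
# same NUM_MINIONS value, so the group replaces failed nodes but never
# scales on its own.
asg_size_flags() {
  local num_minions=$1
  echo "--min-size $num_minions --max-size $num_minions --desired-capacity $num_minions"
}

asg_size_flags 4
# e.g. aws autoscaling update-auto-scaling-group \
#        --auto-scaling-group-name kubernetes-minion-group \
#        $(asg_size_flags 5)
```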
We do not currently run the master in an AutoScalingGroup, but we should -(#11934) +([#11934](http://issues.k8s.io/11934)). ### Networking Kubernetes uses an IP-per-pod model. This means that a node, which runs many -pods, must have many IPs. The way we implement this on AWS is to use VPCs and -the advanced routing support that it allows. Each pod is assigned a /24 CIDR; -then this CIDR is configured to route to an instance in the VPC routing table. - -It is also possible to use overlay networking on AWS, but the default kube-up -configuration does not. - -### NodePort & LoadBalancing - -Kubernetes on AWS integrates with ELB. When you create a service with -Type=LoadBalancer, kubernetes (the kube-controller-manager) will create an ELB, -create a security group for the ELB which allows access on the service ports, -attach all the minions to the ELB, and modify the security group for the -minions to allow traffic from the ELB to the minions. This traffic reaches -kube-proxy where it is then forwarded to the pods. - -ELB requires that all minions listen on a single port, and it acts as a layer-7 -forwarding proxy (i.e. the source IP is not preserved). It is not trivial for -kube-proxy to recognize the traffic therefore. So, LoadBalancer services are -also exposed as NodePort services. For NodePort services, a cluster-wide port -is assigned by kubernetes to the service, and kube-proxy listens externally on -that port on every minion, and forwards traffic to the pods. So for a -load-balanced service, ELB is configured to proxy traffic on the public port -(e.g. port 80) to the NodePort assigned to the service (e.g. 31234), kube-proxy -recognizes the traffic coming to the NodePort by the inbound port number, and -send it to the correct pods for the service. +pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced +routing support so each pod is assigned a /24 CIDR. 
The assigned
+CIDR is then configured to route to an instance in the VPC
+routing table.
+
+It is also possible to use overlay networking on AWS, but that is not the
+configuration of the kube-up script.
+
+### NodePort and LoadBalancing
+
+Kubernetes on AWS integrates with [Elastic Load Balancing
+(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
+When you create a service with `Type=LoadBalancer`, Kubernetes (the
+kube-controller-manager) will create an ELB, create a security group for the
+ELB which allows access on the service ports, attach all the nodes to the ELB,
+and modify the security group for the nodes to allow traffic from the ELB to
+the nodes. This traffic reaches kube-proxy where it is then forwarded to the
+pods.
+
+ELB has some restrictions: it requires that all nodes listen on a single port,
+and it acts as a forwarding proxy (i.e. the source IP is not preserved). To
+work with these restrictions, in Kubernetes, `[LoadBalancer
+services](../user-guide/services.html#type-loadbalancer)` are exposed as
+`[NodePort services](../user-guide/services.html#type-nodeport)`. Then
+kube-proxy listens externally on the cluster-wide port that's assigned to
+NodePort services and forwards traffic to the corresponding pods. So ELB is
+configured to proxy traffic on the public port (e.g. port 80) to the NodePort
+that is assigned to the service (e.g. 31234). Any incoming traffic sent to
+the NodePort (e.g. port 31234) is recognized by kube-proxy and then sent to the
+correct pods for that service.

 Note that we do not automatically open NodePort services in the AWS firewall
 (although we do open LoadBalancer services). This is because we expect that
 NodePort services are more of a building block for things like inter-cluster
 services or for LoadBalancer.
To consume a NodePort service externally, you -will likely have to open the port in the minion security group +will likely have to open the port in the node security group (`kubernetes-minion-`). -### IAM +### Identity and Access Management (IAM) kube-proxy sets up two IAM roles, one for the master called -(kubernetes-master)[cluster/aws/templates/iam/kubernetes-master-policy.json] -and one for the minions called -(kubernetes-minion)[cluster/aws/templates/iam/kubernetes-minion-policy.json]. +[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json) +and one for the non-master nodes called +[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json). The master is responsible for creating ELBs and configuring them, as well as setting up advanced VPC routing. Currently it has blanket permissions on EC2, along with rights to create and destroy ELBs. -The minion does not need a lot of access to the AWS APIs. It needs to download -a distribution file, and then it is responsible for attaching and detaching EBS -volumes to itself. +The (non-master) nodes do not need a lot of access to the AWS APIs. They need to download +a distribution file, and then are responsible for attaching and detaching EBS +volumes from itself. -The minion policy is relatively minimal. The master policy is probably overly +The (non-master) node policy is relatively minimal. The master policy is probably overly permissive. The security concious may want to lock-down the IAM policies -further (#11936) +further ([#11936](http://issues.k8s.io/11936)). We should make it easier to extend IAM permissions and also ensure that they -are correctly configured (#???) +are correctly configured ([#14226](http://issues.k8s.io/14226)). ### Tagging -All AWS resources are tagged with a tag named "KuberentesCluster". This tag is -used to identify a particular 'instance' of Kubernetes, even if two clusters -are deployed into the same VPC. 
(The script doesn't do this by default, but it
-can be done.)
+All AWS resources are tagged with a tag named "KubernetesCluster", with a value
+that is the unique cluster-id. This tag is used to identify a particular
+'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
+Resources are considered to belong to the same cluster if and only if they have
+the same value in the tag named "KubernetesCluster". (The kube-up script is
+not configured to create multiple clusters in the same VPC by default, but it
+is possible to create another cluster in the same VPC.)

 Within the AWS cloud provider logic, we filter requests to the AWS APIs to
-match resources with our cluster tag. So we only see our own AWS objects.
-
-If you choose not to use kube-up, you must tag everything with a
-KubernetesCluster tag with a unique per-cluster value.
+match resources with our cluster tag. By filtering the requests, we ensure
+that we see only our own AWS objects.

+Important: If you choose not to use kube-up, you must pick a unique cluster-id
+value, and ensure that all AWS resources have a tag with
+`Name=KubernetesCluster,Value=`.

-# AWS Objects
+### AWS Objects

 The kube-up script does a number of things in AWS:

-* Creates an S3 bucket (`AWS_S3_BUCKET`) and copy the kubernetes distribution
+* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution
   and the salt scripts into it. They are made world-readable and the HTTP URLs
-are passed to instances; this is how kubernetes code gets onto the machines.
+are passed to instances; this is how Kubernetes code gets onto the machines.
-* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`.
-  `kubernetes-master` is used by the master node; `kubernetes-minion` is used
-by minion nodes.
+* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`:
+  * `kubernetes-master` is used by the master node
+  * `kubernetes-minion` is used by non-master nodes.
* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is the OpenSSH key fingerprint, so that multiple users can run the script with -different keys and their keys will not collide (with near-certainty) It will +different keys and their keys will not collide (with near-certainty). It will use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create one there. (With the default ubuntu images, if you have to SSH in: the user is `ubuntu` and that user can `sudo`) -* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16)., and +* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and enables the `dns-support` and `dns-hostnames` options. * Creates an internet gateway for the VPC. * Creates a route table for the VPC, with the internet gateway as the default route * Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` - (defaults to us-west-2a). Currently kubernetes runs in a single AZ; there -are two philosophies on how to achieve HA: cluster-per-AZ and -cross-AZ-clusters. cluster-per-AZ says you should have an independent cluster -for each AZ, they are entirely separate. cross-AZ-clusters allows a single -cluster to span multiple AZs. The debate is open here: cluster-per-AZ is more -robust but cross-AZ-clusters are more convenient. For now though, each AWS -kuberentes cluster lives in one AZ. + (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a +single AZ on AWS. Although, there are two philosophies in discussion on how to +achieve High Availability (HA): + * cluster-per-AZ: An independent cluster for each AZ, where each cluster + is entirely separate. + * cross-AZ-clusters: A single cluster spans multiple AZs. +The debate is open here, where cluster-per-AZ is discussed as more robust but +cross-AZ-clusters are more convenient. 
* Associates the subnet to the route table * Creates security groups for the master node (`kubernetes-master-`) - and the minion nodes (`kubernetes-minion-`) -* Configures security groups so that masters & minions can intercommunicate, - and opens SSH to the world on master & minions, and opens port 443 to the -world on the master (for the HTTPS API endpoint) + and the non-master nodes (`kubernetes-minion-`) +* Configures security groups so that masters and nodes can communicate. This + includes intercommunication between masters and nodes, opening SSH publicly +for both masters and nodes, and opening port 443 on the master for the HTTPS +API endpoints. * Creates an EBS volume for the master node of size `MASTER_DISK_SIZE` and type `MASTER_DISK_TYPE` -* Launches a master node with a fixed IP address (172.20.0.9), with the - security group, IAM credentials etc. An instance script is used to pass -vital configuration information to Salt. The hope is that over time we can -reduce the amount of configuration information that must be passed in this way. -* Once the instance is up, it attaches the EBS volume & sets up a manual +* Launches a master node with a fixed IP address (172.20.0.9) that is also + configured for the security group and all the necessary IAM credentials. An +instance script is used to pass vital configuration information to Salt. Note: +The hope is that over time we can reduce the amount of configuration +information that must be passed in this way. +* Once the instance is up, it attaches the EBS volume and sets up a manual routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to 10.246.0.0/24) -* Creates an auto-scaling launch-configuration and group for the minions. The - name for both is `-minion-group`, defaults to -`kubernetes-minion-group`. The auto-scaling group has size min & max both set -to `NUM_MINIONS`. You can change the size of the auto-scaling group to add or -remove minions (directly though the AWS API/Console). 
The minion nodes
-self-configure: they come up, run Salt with the stored configuration; connect
-to the master and are assigned an internal CIDR; the master configures the
-route-table with the minion CIDR. The script does health-check the minions,
-but this is a self-check, it is not required.
+* For auto-scaling, it creates a launch configuration and group for the nodes.
+  The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default
+name is kubernetes-minion-group. The auto-scaling group has a min and max size
+that are both set to NUM_MINIONS. You can change the size of the auto-scaling
+group to add or remove nodes from within the AWS API or
+Console. Each node self-configures, meaning that it comes up; runs Salt with
+the stored configuration; connects to the master; is assigned an internal CIDR;
+and then the master configures the route-table with the assigned CIDR. The
+kube-up script performs a health-check on the nodes but it's a self-check that
+is not required.
+

 If attempting this configuration manually, I highly recommend following along
 with the kube-up script, and being sure to tag everything with a
@@ -227,29 +258,32 @@ simplify this by having Kubernetes take on more node configuration, and even
 potentially remove Salt altogether.


-## Manual infrastructure creation
+### Manual infrastructure creation

-While this work is not yet complete, advanced users may choose to create (some)
-AWS objects themselves, and still make use of the kube-up script (to configure
-Salt, for example).
+While this work is not yet complete, advanced users might choose to manually
+certain AWS objects while still making use of the kube-up script (to configure
+Salt, for example).
These objects can currently be manually created:

-* `AWS_S3_BUCKET` will use an existing S3 bucket
-* `VPC_ID` will reuse an existing VPC
-* `SUBNET_ID` will reuse an existing subnet
-* If your route table is tagged with the correct `KubernetesCluster`, it will
-  be reused
+* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
+* Set the `VPC_ID` environment variable to reuse an existing VPC.
+* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
+* If your route table has a matching `KubernetesCluster` tag, it will
+  be reused.
 * If your security groups are appropriately named, they will be reused.

-Currently there is no way to do the following with kube-up. If these affect
-you, please open an issue with a description of what you're trying to do (your
-use-case) and we'll see what we can do:
+Currently there is no way to do the following with kube-up:
+
+* Use an existing AWS SSH key with an arbitrary name.
+* Override the IAM credentials in a sensible way
+  ([#14226](http://issues.k8s.io/14226)).
+* Use different security group permissions.
+* Configure your own auto-scaling groups.

-* Use an existing AWS SSH key with an arbitrary name
-* Override the IAM credentials in a sensible way (but this is in-progress)
-* Use different security group permissions
-* Configure your own auto-scaling groups
+If any of the above items apply to your situation, open an issue to request an
+enhancement to the kube-up script. You should provide a complete description of
+the use-case, including all the details around what you want to accomplish.

-# Instance boot
+### Instance boot

 The instance boot procedure is currently pretty complicated, primarily because
 we must marshal configuration from Bash to Salt via the AWS instance script.
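The marshalling step described here can be sketched in shell. This is an illustrative stand-in, not the real kube-up code: the template names, their contents, and the S3 URL are made up, but the pattern matches the description above (serialize configuration as variable assignments, then concatenate the template scripts after it):

```shell
#!/bin/bash
# Illustrative sketch of building an instance startup script. Configuration
# is written first as shell assignments; template stand-ins are appended.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/templates"

# Stand-ins for cluster/aws/templates scripts, which really format volumes,
# download Salt and Kubernetes from S3, and trigger the install.
echo 'echo "downloading release from $SERVER_BINARY_TAR_URL"' > "$workdir/templates/10-download-release.sh"
echo 'echo "formatting and mounting volumes"' > "$workdir/templates/20-format-disks.sh"

# Marshal the Bash configuration, then append the templates in glob order.
{
  echo '#!/bin/bash'
  echo "SERVER_BINARY_TAR_URL='https://example-bucket.s3.amazonaws.com/kubernetes-server.tar.gz'"
  cat "$workdir/templates/"*.sh
} > "$workdir/user-data.sh"

# Run the assembled script; the marshalled configuration is in scope for
# every concatenated template.
bash "$workdir/user-data.sh"
```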
@@ -260,7 +294,7 @@ When the kube-up script launches instances, it builds an instance startup script which includes some configuration options passed to kube-up, and concatenates some of the scripts found in the cluster/aws/templates directory. These scripts are responsible for mounting and formatting volumes, downloading -Salt & Kubernetes from the S3 bucket, and then triggering Salt to actually +Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually install Kubernetes. -- cgit v1.2.3 From a8965e59a1235e89ede3ea8c1aaa4f92ec98303f Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Sat, 19 Sep 2015 13:16:52 -0400 Subject: Fix some typos from my read-through --- aws_under_the_hood.md | 85 +++++++++++++++++++++++++-------------------------- 1 file changed, 41 insertions(+), 44 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 17ac1543..6c54dcc4 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -37,7 +37,7 @@ This document provides high-level insight into how Kubernetes works on AWS and maps to AWS objects. We assume that you are familiar with AWS. We encourage you to use [kube-up](../getting-started-guides/aws.md) (or -[CloudFormation](../getting-started-guides/aws-coreos.md) to create clusters on +[CloudFormation](../getting-started-guides/aws-coreos.md)) to create clusters on AWS. We recommend that you avoid manual configuration but are aware that sometimes it's the only option. @@ -63,7 +63,7 @@ kube-proxy relays traffic between the nodes etc). By default on AWS: * Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently - modern kernel that parise well with Docker and doesn't require a + modern kernel that pairs well with Docker and doesn't require a reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.) * By default we run aufs over ext4 as the filesystem / container storage on the nodes (mostly because this is what GCE uses). 
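The defaults above are overridden by exporting environment variables before running kube-up. A minimal sketch of the fallback pattern such config scripts use — `config_summary` is a hypothetical helper for illustration, though `KUBE_AWS_ZONE`'s `us-west-2a` default and the `DOCKER_STORAGE` values are the ones named in this document:

```shell
# Sketch of the environment-override pattern: each setting falls back to a
# documented default unless the caller has exported a value.
config_summary() {
  local zone=${KUBE_AWS_ZONE:-us-west-2a}
  local storage=${DOCKER_STORAGE:-aufs}
  echo "zone=$zone storage=$storage"
}

config_summary                                                # defaults
KUBE_AWS_ZONE=us-west-2b DOCKER_STORAGE=btrfs config_summary  # overridden
```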
@@ -73,39 +73,36 @@ kube-up. ### Storage -AWS supports persistent volumes by using [Elastic Block Store -(EBS)](../user-guide/volumes.md#awselasticblockstore). These can then be +AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore). These can then be attached to pods that should store persistent data (e.g. if you're running a database). -By default, nodes in AWS use `[instance -storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)' +By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) unless you create pods with persistent volumes -`[(EBS)](../user-guide/volumes.md#awselasticblockstore)`. In general, -Kubernetes containers do not have persistent storage unless you attach a -persistent volume, and so nodes on AWS use instance storage. Instance -storage is cheaper, often faster, and historically more reliable. This does -mean that you should pick an instance type that has sufficient instance -storage, unless you can make do with whatever space is left on your root -partition. - -Note: Master uses a persistent volume ([etcd](architecture.html#etcd)) to track -its state but similar to the nodes, container are mostly run against instance +[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes +containers do not have persistent storage unless you attach a persistent +volume, and so nodes on AWS use instance storage. Instance storage is cheaper, +often faster, and historically more reliable. This does mean that you should +pick an instance type that has sufficient instance storage, unless you can make +do with whatever space is left on your root partition. 
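To give a pod durable data despite the ephemeral instance storage described above, the pod spec references an existing EBS volume. A minimal sketch — the manifest is only written to a file here, and the pod name, image, and volume ID are placeholders:

```shell
# Sketch only: writes (but does not apply) a pod manifest that mounts a
# pre-existing EBS volume via the awsElasticBlockStore volume type.
cat > pod-with-ebs.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ebs-backed-db
spec:
  containers:
  - name: db
    image: postgres
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    awsElasticBlockStore:
      volumeID: vol-0123456789abcdef0   # placeholder; must be in the node's AZ
      fsType: ext4
EOF
# Apply with: kubectl create -f pod-with-ebs.yaml
```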
+
+Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track
+its state but similar to the nodes, containers are mostly run against instance
 storage, except that we repoint some important data onto the persistent volume.

-The default storage driver for Docker images is aufs. Passing the environment
-variable `DOCKER_STORAGE=btrfs` is also a good choice for a filesystem. btrfs
+The default storage driver for Docker images is aufs. Specifying btrfs (by passing the environment
+variable `DOCKER_STORAGE=btrfs` to kube-up) is also a good choice for a filesystem. btrfs
 is relatively reliable with Docker and has improved its reliability with modern
 kernels. It can easily span multiple volumes, which is particularly useful
 when we are using an instance type with multiple ephemeral instance disks.

 ### AutoScaling

-Nodes (except for the master) are run in an
-`[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
+Nodes (but not the master) are run in an
+[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
 on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
 ([#11935](http://issues.k8s.io/11935)). Instead, the auto-scaling group means
-that AWS will relaunch any non-master nodes that are terminated.
+that AWS will relaunch any nodes that are terminated.
Then
+work with these restrictions, in Kubernetes, [LoadBalancer
+services](../user-guide/services.html#type-loadbalancer) are exposed as
+[NodePort services](../user-guide/services.html#type-nodeport). Then
 kube-proxy listens externally on the cluster-wide port that's assigned to
 NodePort services and forwards traffic to the corresponding pods. So ELB is
 configured to proxy traffic on the public port (e.g. port 80) to the NodePort
@@ -155,18 +152,18 @@ will likely have to open the port in the node security group

 kube-proxy sets up two IAM roles, one for the master called
 [kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
-and one for the non-master nodes called
+and one for the nodes called
 [kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).

 The master is responsible for creating ELBs and configuring them, as well as
 setting up advanced VPC routing. Currently it has blanket permissions on EC2,
 along with rights to create and destroy ELBs.

-The (non-master) nodes do not need a lot of access to the AWS APIs. They need to download
+The nodes do not need a lot of access to the AWS APIs. They need to download
 a distribution file, and then are responsible for attaching and detaching EBS
 volumes from themselves.

-The (non-master) node policy is relatively minimal. The master policy is probably overly
+The node policy is relatively minimal. The master policy is probably overly
 permissive. The security conscious may want to lock down the IAM policies
 further ([#11936](http://issues.k8s.io/11936)).

@@ -198,9 +195,9 @@ The kube-up script does a number of things in AWS:

 * Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution
   and the salt scripts into it. They are made world-readable and the HTTP URLs
 are passed to instances; this is how Kubernetes code gets onto the machines.
-* Creates two IAM profiles based on templates in `cluster/aws/templates/iam`: - * `kubernetes-master` is used by the master node - * `kubernetes-minion` is used by non-master nodes. +* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam): + * `kubernetes-master` is used by the master + * `kubernetes-minion` is used by nodes. * Creates an AWS SSH key named `kubernetes-`. Fingerprint here is the OpenSSH key fingerprint, so that multiple users can run the script with different keys and their keys will not collide (with near-certainty). It will @@ -215,22 +212,22 @@ one there. (With the default ubuntu images, if you have to SSH in: the user is * Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a single AZ on AWS. Although, there are two philosophies in discussion on how to -achieve High Availability (HA): +achieve High Availability (HA): * cluster-per-AZ: An independent cluster for each AZ, where each cluster - is entirely separate. - * cross-AZ-clusters: A single cluster spans multiple AZs. + is entirely separate. + * cross-AZ-clusters: A single cluster spans multiple AZs. The debate is open here, where cluster-per-AZ is discussed as more robust but -cross-AZ-clusters are more convenient. +cross-AZ-clusters are more convenient. * Associates the subnet to the route table -* Creates security groups for the master node (`kubernetes-master-`) - and the non-master nodes (`kubernetes-minion-`) +* Creates security groups for the master (`kubernetes-master-`) + and the nodes (`kubernetes-minion-`) * Configures security groups so that masters and nodes can communicate. This includes intercommunication between masters and nodes, opening SSH publicly for both masters and nodes, and opening port 443 on the master for the HTTPS API endpoints. 
-* Creates an EBS volume for the master node of size `MASTER_DISK_SIZE` and type +* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type `MASTER_DISK_TYPE` -* Launches a master node with a fixed IP address (172.20.0.9) that is also +* Launches a master with a fixed IP address (172.20.0.9) that is also configured for the security group and all the necessary IAM credentials. An instance script is used to pass vital configuration information to Salt. Note: The hope is that over time we can reduce the amount of configuration @@ -251,17 +248,17 @@ is not required. If attempting this configuration manually, I highly recommend following along -with the kube-up script, and being sure to tag everything with a -`KubernetesCluster`=`` tag. Also, passing the right configuration -options to Salt when not using the script is tricky: the plan here is to -simplify this by having Kubernetes take on more node configuration, and even -potentially remove Salt altogether. +with the kube-up script, and being sure to tag everything with a tag with name +`KubernetesCluster` and value set to a unique cluster-id. Also, passing the +right configuration options to Salt when not using the script is tricky: the +plan here is to simplify this by having Kubernetes take on more node +configuration, and even potentially remove Salt altogether. ### Manual infrastructure creation While this work is not yet complete, advanced users might choose to manually -certain AWS objects while still making use of the kube-up script (to configure +create certain AWS objects while still making use of the kube-up script (to configure Salt, for example). These objects can currently be manually created: * Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket. 
-- cgit v1.2.3 From 7582e453bcb112311b010b25dd7a038aecbff9bf Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Sat, 19 Sep 2015 15:20:20 -0400 Subject: Two small fixes (to keep doc-gen happy) --- aws_under_the_hood.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 6c54dcc4..ac9efe55 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -133,7 +133,7 @@ ELB has some restrictions: it requires that all nodes listen on a single port, and it acts as a forwarding proxy (i.e. the source IP is not preserved). To work with these restrictions, in Kubernetes, [LoadBalancer services](../user-guide/services.html#type-loadbalancer) are exposed as -[NodePort services](../user-guide/services.html#type-nodeport). Then +[NodePort services](../user-guide/services.md#type-nodeport). Then kube-proxy listens externally on the cluster-wide port that's assigned to NodePort services and forwards traffic to the corresponding pods. So ELB is configured to proxy traffic on the public port (e.g. port 80) to the NodePort @@ -195,7 +195,7 @@ The kube-up script does a number of things in AWS: * Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution and the salt scripts into it. They are made world-readable and the HTTP URLs are passed to instances; this is how Kubernetes code gets onto the machines. -* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam): +* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/): * `kubernetes-master` is used by the master * `kubernetes-minion` is used by nodes. * Creates an AWS SSH key named `kubernetes-`. 
Fingerprint here is -- cgit v1.2.3 From 9c150b14df3ffdc37582bd0c6b887171da233974 Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Mon, 19 Oct 2015 13:55:43 -0400 Subject: More fixes based on comments --- aws_under_the_hood.md | 119 ++++++++++++++++++++++++++++---------------------- 1 file changed, 66 insertions(+), 53 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index ac9efe55..845964f2 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -49,6 +49,18 @@ Kubernetes clusters are created on AWS. This can be particularly useful if problems arise or in circumstances where the provided scripts are lacking and you manually created or configured your cluster. +**Table of contents:** + * [Architecture overview](#architecture-overview) + * [Storage](#storage) + * [Auto Scaling group](#auto-scaling-group) + * [Networking](#networking) + * [NodePort and LoadBalancing services](#nodeport-and-loadbalancing-services) + * [Identity and access management (IAM)](#identity-and-access-management-iam) + * [Tagging](#tagging) + * [AWS objects](#aws-objects) + * [Manual infrastructure creation](#manual-infrastructure-creation) + * [Instance boot](#instance-boot) + ### Architecture overview Kubernetes is a cluster of several machines that consists of a Kubernetes @@ -56,17 +68,13 @@ master and a set number of nodes (previously known as 'minions') for which the master is responsible. See the [Architecture](architecture.md) topic for more details. -Other documents describe the general architecture of Kubernetes (all nodes run -Docker; the kubelet agent runs on each node and launches containers; the -kube-proxy relays traffic between the nodes etc). - By default on AWS: * Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently modern kernel that pairs well with Docker and doesn't require a reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.)
-* By default we run aufs over ext4 as the filesystem / container storage on the - nodes (mostly because this is what GCE uses). +* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly + because this is what Google Compute Engine uses). You can override these defaults by passing different environment variables to kube-up. @@ -82,12 +90,12 @@ unless you create pods with persistent volumes [(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes containers do not have persistent storage unless you attach a persistent volume, and so nodes on AWS use instance storage. Instance storage is cheaper, -often faster, and historically more reliable. This does mean that you should -pick an instance type that has sufficient instance storage, unless you can make -do with whatever space is left on your root partition. +often faster, and historically more reliable. Unless you can make do with whatever +space is left on your root partition, you must choose an instance type that provides +you with sufficient instance storage for your needs. Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track -its state but similar to the nodes, containers are mostly run against instance +its state. Similar to nodes, containers are mostly run against instance storage, except that we repoint some important data onto the persistent volume. The default storage driver for Docker images is aufs. Specifying btrfs (by passing the environment @@ -96,12 +104,12 @@ is relatively reliable with Docker and has improved its reliability with modern kernels. It can easily span multiple volumes, which is particularly useful when we are using an instance type with multiple ephemeral instance disks.
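As the text above notes, these defaults can be overridden by passing different environment variables to kube-up. A minimal, hedged sketch of what that might look like — `MASTER_DISK_SIZE`, `MASTER_DISK_TYPE`, `KUBE_AWS_ZONE`, and `AWS_S3_BUCKET` are quoted elsewhere in this document, while `KUBERNETES_PROVIDER` and the `cluster/kube-up.sh` path are assumptions about the surrounding tooling:

```shell
# Illustrative only: environment overrides exported before invoking kube-up.
export KUBERNETES_PROVIDER=aws    # assumed: selects the AWS provider scripts
export KUBE_AWS_ZONE=us-west-2a   # AZ for the cluster (the documented default)
export MASTER_DISK_SIZE=20        # size (GB) of the master's EBS volume
export MASTER_DISK_TYPE=gp2       # EBS volume type for the master
export AWS_S3_BUCKET=my-kube-bkt  # existing S3 bucket to reuse
# then run the kube-up script, e.g.: cluster/kube-up.sh
```

The exact spellings should be checked against the cluster scripts for the release in use.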
-### AutoScaling +### Auto Scaling group Nodes (but not the master) are run in an -[AutoScalingGroup](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) +[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled -([#11935](http://issues.k8s.io/11935)). Instead, the auto-scaling group means +([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means that AWS will relaunch any nodes that are terminated. We do not currently run the master in an AutoScalingGroup, but we should @@ -111,14 +119,13 @@ We do not currently run the master in an AutoScalingGroup, but we should Kubernetes uses an IP-per-pod model. This means that a node, which runs many pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced -routing support so each pod is assigned a /24 CIDR. Each pod is assigned a /24 -CIDR; the assigned CIDR is then configured to route to an instance in the VPC -routing table. +routing support so each pod is assigned a /24 CIDR. The assigned CIDR is then +configured to route to an instance in the VPC routing table. -It is also possible to use overlay networking on AWS, but that is not the +It is also possible to use overlay networking on AWS, but that is not the default configuration of the kube-up script. -### NodePort and LoadBalancing +### NodePort and LoadBalancing services Kubernetes on AWS integrates with [Elastic Load Balancing (ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html). @@ -129,17 +136,23 @@ and modify the security group for the nodes to allow traffic from the ELB to the nodes. This traffic reaches kube-proxy where it is then forwarded to the pods. -ELB has some restrictions: it requires that all nodes listen on a single port, -and it acts as a forwarding proxy (i.e. the source IP is not preserved). 
To -work with these restrictions, in Kubernetes, [LoadBalancer -services](../user-guide/services.html#type-loadbalancer) are exposed as +ELB has some restrictions: +* it requires that all nodes listen on a single port, +* it acts as a forwarding proxy (i.e. the source IP is not preserved). + +To work with these restrictions, in Kubernetes, [LoadBalancer +services](../user-guide/services.md#type-loadbalancer) are exposed as [NodePort services](../user-guide/services.md#type-nodeport). Then kube-proxy listens externally on the cluster-wide port that's assigned to -NodePort services and forwards traffic to the corresponding pods. So ELB is -configured to proxy traffic on the public port (e.g. port 80) to the NodePort -that is assigned to the service (e.g. 31234). Any in-coming traffic sent to -the NodePort (e.g. port 31234) is recognized by kube-proxy and then sent to the -correct pods for that service. +NodePort services and forwards traffic to the corresponding pods. + +So for example, if we configure a service of Type LoadBalancer with a +public port of 80: +* Kubernetes will assign a NodePort to the service (e.g. 31234) +* ELB is configured to proxy traffic on the public port 80 to the NodePort + that is assigned to the service (31234). +* Then any in-coming traffic that ELB forwards to the NodePort (e.g. port 31234) + is recognized by kube-proxy and sent to the correct pods for that service. Note that we do not automatically open NodePort services in the AWS firewall (although we do open LoadBalancer services). This is because we expect that @@ -188,31 +201,31 @@ Important: If you choose not to use kube-up, you must pick a unique cluster-id value, and ensure that all AWS resources have a tag with `Name=KubernetesCluster,Value=`. -### AWS Objects +### AWS objects The kube-up script does a number of things in AWS: * Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution and the salt scripts into it. 
They are made world-readable and the HTTP URLs -are passed to instances; this is how Kubernetes code gets onto the machines. + are passed to instances; this is how Kubernetes code gets onto the machines. * Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/): - * `kubernetes-master` is used by the master + * `kubernetes-master` is used by the master. * `kubernetes-minion` is used by nodes. * Creates an AWS SSH key named `kubernetes-`. Fingerprint here is the OpenSSH key fingerprint, so that multiple users can run the script with -different keys and their keys will not collide (with near-certainty). It will -use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create -one there. (With the default ubuntu images, if you have to SSH in: the user is -`ubuntu` and that user can `sudo`) + different keys and their keys will not collide (with near-certainty). It will + use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create + one there. (With the default Ubuntu images, if you have to SSH in: the user is + `ubuntu` and that user can `sudo`). * Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and enables the `dns-support` and `dns-hostnames` options. * Creates an internet gateway for the VPC. * Creates a route table for the VPC, with the internet gateway as the default - route + route. * Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a -single AZ on AWS. Although, there are two philosophies in discussion on how to -achieve High Availability (HA): + single AZ on AWS. Although, there are two philosophies in discussion on how to + achieve High Availability (HA): * cluster-per-AZ: An independent cluster for each AZ, where each cluster is entirely separate. * cross-AZ-clusters: A single cluster spans multiple AZs. 
@@ -220,31 +233,31 @@ The debate is open here, where cluster-per-AZ is discussed as more robust but cross-AZ-clusters are more convenient. * Associates the subnet to the route table * Creates security groups for the master (`kubernetes-master-`) - and the nodes (`kubernetes-minion-`) + and the nodes (`kubernetes-minion-`). * Configures security groups so that masters and nodes can communicate. This includes intercommunication between masters and nodes, opening SSH publicly -for both masters and nodes, and opening port 443 on the master for the HTTPS -API endpoints. + for both masters and nodes, and opening port 443 on the master for the HTTPS + API endpoints. * Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type - `MASTER_DISK_TYPE` + `MASTER_DISK_TYPE`. * Launches a master with a fixed IP address (172.20.0.9) that is also configured for the security group and all the necessary IAM credentials. An -instance script is used to pass vital configuration information to Salt. Note: -The hope is that over time we can reduce the amount of configuration -information that must be passed in this way. + instance script is used to pass vital configuration information to Salt. Note: + The hope is that over time we can reduce the amount of configuration + information that must be passed in this way. * Once the instance is up, it attaches the EBS volume and sets up a manual routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to -10.246.0.0/24) + 10.246.0.0/24). * For auto-scaling, on each node it creates a launch configuration and group. The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default -name is kubernetes-minion-group. The auto-scaling group has a min and max size -that are both set to NUM_MINIONS. You can change the size of the auto-scaling -group to add or remove the total number of nodes from within the AWS API or -Console.
Each nodes self-configures, meaning that they come up; run Salt with -the stored configuration; connect to the master; are assigned an internal CIDR; -and then the master configures the route-table with the assigned CIDR. The -kube-up script performs a health-check on the nodes but it's a self-check that -is not required. + name is kubernetes-minion-group. The auto-scaling group has a min and max size + that are both set to NUM_MINIONS. You can change the size of the auto-scaling + group to add or remove the total number of nodes from within the AWS API or + Console. Each node self-configures, meaning that it comes up; runs Salt with + the stored configuration; connects to the master; is assigned an internal CIDR; + and then the master configures the route-table with the assigned CIDR. The + kube-up script performs a health-check on the nodes but it's a self-check that + is not required. If attempting this configuration manually, I highly recommend following along -- cgit v1.2.3 From 5d938fc28fe94a988711f8cdc68c08451e05bab3 Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Mon, 19 Oct 2015 14:06:32 -0400 Subject: Remove broken link to CloudFormation setup --- aws_under_the_hood.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 845964f2..3eaf20cf 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -36,10 +36,9 @@ Documentation for other releases can be found at This document provides high-level insight into how Kubernetes works on AWS and maps to AWS objects. We assume that you are familiar with AWS. -We encourage you to use [kube-up](../getting-started-guides/aws.md) (or -[CloudFormation](../getting-started-guides/aws-coreos.md)) to create clusters on -AWS. We recommend that you avoid manual configuration but are aware that -sometimes it's the only option. +We encourage you to use [kube-up](../getting-started-guides/aws.md) to create +clusters on AWS.
We recommend that you avoid manual configuration but are aware +that sometimes it's the only option. Tip: You should open an issue and let us know what enhancements can be made to the scripts to better suit your needs. -- cgit v1.2.3 From 830e70f0da5a28b1cc93209f511c0be6e7f47aab Mon Sep 17 00:00:00 2001 From: Justin Santa Barbara Date: Mon, 19 Oct 2015 15:43:41 -0400 Subject: Rename LoadBalancing -> LoadBalancer To match the Type value --- aws_under_the_hood.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 3eaf20cf..ec8a31c2 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -53,7 +53,7 @@ you manually created or configured your cluster. * [Storage](#storage) * [Auto Scaling group](#auto-scaling-group) * [Networking](#networking) - * [NodePort and LoadBalancing services](#nodeport-and-loadbalancing-services) + * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services) * [Identity and access management (IAM)](#identity-and-access-management-iam) * [Tagging](#tagging) * [AWS objects](#aws-objects) @@ -124,7 +124,7 @@ configured to route to an instance in the VPC routing table. It is also possible to use overlay networking on AWS, but that is not the default configuration of the kube-up script. -### NodePort and LoadBalancing services +### NodePort and LoadBalancer services Kubernetes on AWS integrates with [Elastic Load Balancing (ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html). -- cgit v1.2.3 From 0880787aa4ba01a636a4da90e75dde425360dd39 Mon Sep 17 00:00:00 2001 From: Jerzy Szczepkowski Date: Fri, 16 Oct 2015 12:04:43 +0200 Subject: Proposal for horizontal pod autoscaler updated and moved to design. Proposal for horizontal pod autoscaler updated and moved to design. Related to #15652. 
--- horizontal-pod-autoscaler.md | 272 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 272 insertions(+) create mode 100644 horizontal-pod-autoscaler.md diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md new file mode 100644 index 00000000..35991847 --- /dev/null +++ b/horizontal-pod-autoscaler.md @@ -0,0 +1,272 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest 1.0.x release of this document can be found +[here](http://releases.k8s.io/release-1.0/docs/design/horizontal-pod-autoscaler.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Horizontal Pod Autoscaling + +## Preface + +This document briefly describes the design of the horizontal autoscaler for pods. +The autoscaler (implemented as a Kubernetes API resource and controller) is responsible for dynamically controlling +the number of replicas of some collection (e.g. the pods of a ReplicationController) to meet some objective(s), +for example a target per-pod CPU utilization. + +This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md). + +## Overview + +The resource usage of a serving application usually varies over time: sometimes the demand for the application rises, +and sometimes it drops. +In Kubernetes version 1.0, a user can only manually set the number of serving pods. +Our aim is to provide a mechanism for the automatic adjustment of the number of pods based on CPU utilization statistics +(a future version will allow autoscaling based on other resources/metrics). + +## Scale Subresource + +In Kubernetes version 1.1, we are introducing the Scale subresource and implementing horizontal autoscaling of pods based on it. +The Scale subresource is supported for replication controllers and deployments. +The Scale subresource is a virtual resource (it does not correspond to an object stored in etcd). +It is only present in the API as an interface that a controller (in this case the HorizontalPodAutoscaler) can use to dynamically scale +the number of replicas controlled by some other API object (currently ReplicationController and Deployment) and to learn the current number of replicas.
+Scale is a subresource of the API object that it serves as the interface for. +The Scale subresource is useful because whenever we introduce another type we want to autoscale, we just need to implement the Scale subresource for it. +The wider discussion regarding Scale took place in [#1629](https://github.com/kubernetes/kubernetes/issues/1629). + +Scale subresource is in API for replication controller or deployment under the following paths: + +`apis/extensions/v1beta1/replicationcontrollers/myrc/scale` + +`apis/extensions/v1beta1/deployments/mydeployment/scale` + +It has the following structure: + +```go +// represents a scaling request for a resource. +type Scale struct { + unversioned.TypeMeta + api.ObjectMeta + + // defines the behavior of the scale. + Spec ScaleSpec + + // current status of the scale. + Status ScaleStatus +} + +// describes the attributes of a scale subresource +type ScaleSpec struct { + // desired number of instances for the scaled object. + Replicas int `json:"replicas,omitempty"` +} + +// represents the current status of a scale subresource. +type ScaleStatus struct { + // actual number of observed instances of the scaled object. + Replicas int `json:"replicas"` + + // label query over pods that should match the replicas count. + Selector map[string]string `json:"selector,omitempty"` +} +``` + +Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment associated with +the given Scale subresource. +`ScaleStatus.Replicas` reports how many pods are currently running in the replication controller/deployment, +and `ScaleStatus.Selector` returns selector for the pods. + +## HorizontalPodAutoscaler Object + +In Kubernetes version 1.1, we are introducing HorizontalPodAutoscaler object. It is accessible under: + +`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler` + +It has the following structure: + +```go +// configuration of a horizontal pod autoscaler. 
+type HorizontalPodAutoscaler struct { + unversioned.TypeMeta + api.ObjectMeta + + // behavior of autoscaler. + Spec HorizontalPodAutoscalerSpec + + // current information about the autoscaler. + Status HorizontalPodAutoscalerStatus +} + +// specification of a horizontal pod autoscaler. +type HorizontalPodAutoscalerSpec struct { + // reference to Scale subresource; horizontal pod autoscaler will learn the current resource + // consumption from its status,and will set the desired number of pods by modifying its spec. + ScaleRef SubresourceReference + // lower limit for the number of pods that can be set by the autoscaler, default 1. + MinReplicas *int + // upper limit for the number of pods that can be set by the autoscaler. + // It cannot be smaller than MinReplicas. + MaxReplicas int + // target average CPU utilization (represented as a percentage of requested CPU) over all the pods; + // if not specified it defaults to the target CPU utilization at 80% of the requested resources. + CPUUtilization *CPUTargetUtilization +} + +type CPUTargetUtilization struct { + // fraction of the requested CPU that should be utilized/used, + // e.g. 70 means that 70% of the requested CPU should be in use. + TargetPercentage int +} + +// current status of a horizontal pod autoscaler +type HorizontalPodAutoscalerStatus struct { + // most recent generation observed by this autoscaler. + ObservedGeneration *int64 + + // last time the HorizontalPodAutoscaler scaled the number of pods; + // used by the autoscaler to control how often the number of pods is changed. + LastScaleTime *unversioned.Time + + // current number of replicas of pods managed by this autoscaler. + CurrentReplicas int + + // desired number of replicas of pods managed by this autoscaler. + DesiredReplicas int + + // current average CPU utilization over all pods, represented as a percentage of requested CPU, + // e.g. 70 means that an average pod is using now 70% of its requested CPU. 
+ CurrentCPUUtilizationPercentage *int +} +``` + +`ScaleRef` is a reference to the Scale subresource. +`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler configuration. +We are also introducing the HorizontalPodAutoscalerList object to enable listing all autoscalers in a namespace: + +```go +// list of horizontal pod autoscaler objects. +type HorizontalPodAutoscalerList struct { + unversioned.TypeMeta + unversioned.ListMeta + + // list of horizontal pod autoscaler objects. + Items []HorizontalPodAutoscaler +} +``` + +## Autoscaling Algorithm + +The autoscaler is implemented as a control loop. It periodically queries pods described by `Status.PodSelector` of the Scale subresource, and collects their CPU utilization. +Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in `Spec.CPUUtilization`, +and adjusts the replicas of the Scale if needed to match the target +(preserving condition: MinReplicas <= Replicas <= MaxReplicas). + +The period of the autoscaler is controlled by the `--horizontal-pod-autoscaler-sync-period` flag of the controller manager. +The default value is 30 seconds. + + +CPU utilization is the recent CPU usage of a pod (average across the last 1 minute) divided by the CPU requested by the pod. +In Kubernetes version 1.1, CPU usage is taken directly from Heapster. +In the future, there will be an API on the master for this purpose +(see [#11951](https://github.com/kubernetes/kubernetes/issues/11951)). + +The target number of pods is calculated from the following formula: + +``` +TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target) +``` + +Starting and stopping pods may introduce noise to the metric (for instance, starting may temporarily increase CPU). +So, after each action, the autoscaler should wait some time for reliable data. +Scale-up can only happen if there was no rescaling within the last 3 minutes. +Scale-down will wait for 5 minutes from the last rescaling.
+Moreover, any scaling will only be made if: `avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 (10% tolerance). +Such an approach has two benefits: + +* Autoscaler works in a conservative way. + If new user load appears, it is important for us to rapidly increase the number of pods, + so that user requests will not be rejected. + Lowering the number of pods is not that urgent. + +* Autoscaler avoids thrashing, i.e.: it prevents rapid execution of conflicting decisions if the load is not stable. + +## Relative vs. absolute metrics + +We chose values of the target metric to be relative (e.g. 90% of requested CPU resource) rather than absolute (e.g. 0.6 core) for the following reason. +If we chose an absolute metric, the user would need to guarantee that the target is lower than the request. +Otherwise, overloaded pods may not be able to consume more than the autoscaler's absolute target utilization, +thereby preventing the autoscaler from seeing high enough utilization to trigger it to scale up. +This may be especially troublesome when a user changes the requested resources for a pod +because they would need to also change the autoscaler utilization threshold. +Therefore, we decided to choose a relative metric. +For the user, it is enough to set it to a value smaller than 100%, and further changes of requested resources will not invalidate it. + +## Support in kubectl + +To make manipulation of the HorizontalPodAutoscaler object simpler, we added support for +creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. +In addition, in the future, we are planning to add kubectl support for the following use-cases: +* When creating a replication controller or deployment with `kubectl create [-f]`, there should be + a possibility to specify an additional autoscaler object. + (This should work out-of-the-box when creation of autoscaler is supported by kubectl as we may include + multiple objects in the same config file).
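To make the control loop above concrete, here is a small, self-contained Go sketch combining the `TargetNumOfPods` formula with the 10% tolerance check and the MinReplicas/MaxReplicas bounds. All names here are illustrative; this is not the actual controller code:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the sizing rule described above: the target number
// of pods is ceil(sum(CurrentPodsCPUUtilization) / Target), no rescaling is
// proposed while avg/target stays within the 10% tolerance band, and the
// result is clamped to [minReplicas, maxReplicas].
func desiredReplicas(podUtilization []float64, target float64, minReplicas, maxReplicas int) int {
	current := len(podUtilization)
	if current == 0 {
		return minReplicas // no metrics yet; stay at the lower bound
	}
	sum := 0.0
	for _, u := range podUtilization {
		sum += u
	}
	avg := sum / float64(current)

	// 10% tolerance: only rescale if avg/target drops below 0.9 or rises above 1.1.
	if ratio := avg / target; ratio > 0.9 && ratio < 1.1 {
		return current
	}

	desired := int(math.Ceil(sum / target))
	// Preserve the invariant MinReplicas <= Replicas <= MaxReplicas.
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// Three pods averaging 60% utilization against a 40% target:
	// ceil((60+55+65)/40) = ceil(4.5) = 5 replicas.
	fmt.Println(desiredReplicas([]float64{60, 55, 65}, 40, 1, 10))
	// Average 41% vs. a 40% target is within the 10% tolerance: no change.
	fmt.Println(desiredReplicas([]float64{41, 41, 41}, 40, 1, 10))
}
```

Note that the real controller additionally applies the 3-minute scale-up and 5-minute scale-down cooldowns described above, which this sketch omits.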
+* *[future]* When running an image with `kubectl run`, there should be an additional option to create + an autoscaler for it. +* *[future]* We will add a new command `kubectl autoscale` that will allow for easy creation of an autoscaler object + for an already existing replication controller/deployment. + +## Next steps + +We list here some features that are not supported in Kubernetes version 1.1. +However, we want to keep them in mind, as they will most probably be needed in the future. +Our design is in general compatible with them. +* *[future]* **Autoscale pods based on metrics other than CPU** (e.g. memory, network traffic, qps). + This includes scaling based on a custom/application metric. +* *[future]* **Autoscale pods based on an aggregate metric.** + Autoscaler, instead of computing an average for a target metric across pods, will use a single, external metric (e.g. a qps metric from the load balancer). + The metric will be aggregated while the target will remain per-pod + (e.g. when observing 100 qps on the load balancer while the target is 20 qps per pod, the autoscaler will set the number of replicas to 5). +* *[future]* **Autoscale pods based on multiple metrics.** + If the target numbers of pods for different metrics are different, choose the largest target number of pods. +* *[future]* **Scale the number of pods starting from 0.** + All pods can be turned off, and then turned on when there is a demand for them. + When a request to a service with no pods arrives, kube-proxy will generate an event for the autoscaler + to create a new pod. + Discussed in [#3247](https://github.com/kubernetes/kubernetes/issues/3247). +* *[future]* **When scaling down, make a more educated decision about which pods to kill.** + E.g.: if two or more pods from the same replication controller are on the same node, kill one of them. + Discussed in [#4301](https://github.com/kubernetes/kubernetes/issues/4301).
+ + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]() + -- cgit v1.2.3 From c19a1f90d8c5c47210af5c8e8577c4950999a078 Mon Sep 17 00:00:00 2001 From: Isaac Hollander McCreery Date: Fri, 9 Oct 2015 15:29:17 -0700 Subject: Updates to versioning.md --- versioning.md | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/versioning.md b/versioning.md index c764a585..75cdffce 100644 --- a/versioning.md +++ b/versioning.md @@ -35,23 +35,26 @@ Documentation for other releases can be found at Legend: -* **Kube <major>.<minor>.<patch>** refers to the version of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc. +* **Kube X.Y.Z** refers to the version of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the major version, **Y** is the minor version, and **Z** is the patch version.) * **API vX[betaY]** refers to the version of the HTTP API. -## Release Timeline +## Release versioning ### Minor version scheme and timeline -* Kube 1.0.0, 1.0.1 -- DONE! -* Kube 1.0.X (X>1): Standard operating procedure. We patch the release-1.0 branch as needed and increment the patch number. -* Kube 1.1.0-alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule. -* Kube 1.1.0-beta: When HEAD is feature-complete, we will cut the release-1.1.0 branch 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. This cut will be marked as 1.1.0-beta, and HEAD will be revved to 1.2.0-alpha.0. -* Kube 1.1.0: Final release, cut from the release-1.1.0 branch cut two weeks prior. Should occur between 3 and 4 months after 1.0. 1.1.1-beta will be tagged at the same time on the same branch. 
+* Kube X.Y.0-alpha.W, W > 0: Alpha releases are released roughly every two weeks directly from the master branch. No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule.
+* Kube X.Y.Z-beta: When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. This cut will be marked as X.Y.0-beta, and master will be revved to X.Y+1.0-alpha.0.
+* Kube X.Y.0: Final release, cut from the release-X.Y branch that was cut two weeks prior. X.Y.1-beta will be tagged at the same commit on the same branch. X.Y.0 should occur 3 to 4 months after X.Y-1.0.
+* Kube X.Y.Z, Z > 0: [Patch releases](#patches) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta is tagged on the same commit.
 
 ### Major version timeline
 
 There is no mandated timeline for major versions. They only occur when we need to start the clock on deprecating features. A given major version should be the latest major version for at least one year from its original release date.
 
+### CI version scheme
+
+* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.C+bbbb is C commits after X.Y.Z-beta, with an additional +bbbb build suffix added.
+
 ## Release versions as related to API versions
 
 Here is an example major release cycle:
@@ -64,11 +67,11 @@ Here is an example major release cycle:
 * Before Kube 2.0 is cut, API v2 must be released in 1.x.
This enables two things: (1) users can upgrade to API v2 when running Kube 1.x and then switch over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can cleanup and remove all API v2beta\* versions because no one should have v2beta\* objects left in their database. As mentioned above, tooling will exist to make sure there are no calls or references to a given API version anywhere inside someone's kube installation before someone upgrades. * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. It *may* include the v1 API as well if the burden is not high - this will be determined on a per-major-version basis. -## Rationale for API v2 being complete before v2.0's release +### Rationale for API v2 being complete before v2.0's release It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options. -# Patches +## Patches Patch releases are intended for critical bug fixes to the latest minor version, such as addressing security vulnerabilities, fixes to problems affecting a large number of users, severe problems with no workaround, and blockers for products based on Kubernetes. @@ -76,7 +79,7 @@ They should not contain miscellaneous feature additions or improvements, and esp Dependencies, such as Docker or Etcd, should also not be changed unless absolutely necessary, and also just to fix critical bugs (so, at most patch version changes, not new major nor minor versions). -# Upgrades +## Upgrades * Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. 
(Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.)
 * No hard breaking changes over version boundaries.
-- 
cgit v1.2.3


From 77d47119045fbaa5128ecddab9150341599c0049 Mon Sep 17 00:00:00 2001
From: Isaac Hollander McCreery
Date: Thu, 29 Oct 2015 15:10:00 -0700
Subject: Fixups of docs and scripts

---
 versioning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/versioning.md b/versioning.md
index 75cdffce..a189d0cf 100644
--- a/versioning.md
+++ b/versioning.md
@@ -53,7 +53,7 @@ There is no mandated timeline for major versions. They only occur when we need t
 
 ### CI version scheme
 
-* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.C+bbbb is C commits after X.Y.Z-beta, with an additional +bbbb build suffix added.
+* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.C+bbbb is C commits after X.Y.Z-beta, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (with things in the tree that are not checked in,) will have -dirty appended.
 
 ## Release versions as related to API versions
-- 
cgit v1.2.3


From de53b2aec614a1f98bea17159be3670ac969b6e0 Mon Sep 17 00:00:00 2001
From: Isaac Hollander McCreery
Date: Mon, 2 Nov 2015 14:54:11 -0800
Subject: Versioned beta releases

---
 versioning.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/versioning.md b/versioning.md
index a189d0cf..7b63059c 100644
--- a/versioning.md
+++ b/versioning.md
@@ -43,9 +43,9 @@ Legend:
 ### Minor version scheme and timeline
 
 * Kube X.Y.0-alpha.W, W > 0: Alpha releases are released roughly every two weeks directly from the master branch.
No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule.
-* Kube X.Y.Z-beta: When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. This cut will be marked as X.Y.0-beta, and master will be revved to X.Y+1.0-alpha.0.
-* Kube X.Y.0: Final release, cut from the release-X.Y branch that was cut two weeks prior. X.Y.1-beta will be tagged at the same commit on the same branch. X.Y.0 should occur 3 to 4 months after X.Y-1.0.
-* Kube X.Y.Z, Z > 0: [Patch releases](#patches) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta is tagged on the same commit.
+* Kube X.Y.Z-beta.W: When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, (X.Y.0-beta.W | W > 0) as necessary.
+* Kube X.Y.0: Final release, cut from the release-X.Y branch that was cut two weeks prior. X.Y.1-beta.0 will be tagged at the same commit on the same branch. X.Y.0 should occur 3 to 4 months after X.Y-1.0.
+* Kube X.Y.Z, Z > 0: [Patch releases](#patches) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the same commit.
 
 ### Major version timeline
 
@@ -53,7 +53,7 @@ There is no mandated timeline for major versions. They only occur when we need t
 
 ### CI version scheme
 
-* Continuous integration versions also exist, and are versioned off of alpha and beta releases.
X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.C+bbbb is C commits after X.Y.Z-beta, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (with things in the tree that are not checked in,) will have -dirty appended.
+* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (with things in the tree that are not checked in,) will have -dirty appended.
 
 ## Release versions as related to API versions
-- 
cgit v1.2.3


From 06e6f72355b8e5bf1700e966a322453186c0fae4 Mon Sep 17 00:00:00 2001
From: Isaac Hollander McCreery
Date: Tue, 3 Nov 2015 09:42:49 -0800
Subject: Clarify -dirty language, and add --no-dry-run to usage

---
 versioning.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/versioning.md b/versioning.md
index 7b63059c..341e3b36 100644
--- a/versioning.md
+++ b/versioning.md
@@ -51,9 +51,9 @@ Legend:
 
 There is no mandated timeline for major versions. They only occur when we need to start the clock on deprecating features. A given major version should be the latest major version for at least one year from its original release date.
 
-### CI version scheme
+### CI and dev version scheme
 
-* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (with things in the tree that are not checked in,) will have -dirty appended.
+* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (during development, with things in the tree that are not checked in,) will have -dirty appended.
 
 ## Release versions as related to API versions
-- 
cgit v1.2.3


From 2dce50054d005a77e8007b348f0db065b99217bd Mon Sep 17 00:00:00 2001
From: derekwaynecarr
Date: Tue, 3 Nov 2015 11:38:52 -0500
Subject: Add event correlation to client

---
 event_compression.md | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/event_compression.md b/event_compression.md
index b9861717..e1a95165 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -35,14 +35,23 @@ Documentation for other releases can be found at
 
 This document captures the design of event compression.
 
-
 ## Background
 
-Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)).
+Kubernetes components can get into a state where they generate tons of events.
+
+The events can be categorized in one of two ways:
+
+1. same - the event is identical to previous events except it varies only on timestamp
+2.
similar - the event is identical to previous events except it varies on timestamp and message
+
+For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)).
+
+The goal is to introduce event counting to increment same events, and event aggregation to collapse similar events.
 
 ## Proposal
 
-Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event.
+Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. In addition, if many similar events are
+created, events should be aggregated into a single event to reduce spam.
 
 Event compression should be best effort (not guaranteed). Meaning, in the worst case, `n` identical (minus timestamp) events may still result in `n` event entries.
 
@@ -61,6 +70,24 @@ Instead of a single Timestamp, each event object [contains](http://releases.k8s.
 Each binary that generates events:
 * Maintains a historical record of previously generated events:
    * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
+   * Implemented behind an `EventCorrelator` that manages two subcomponents: `EventAggregator` and `EventLogger`
+   * The `EventCorrelator` observes all incoming events and lets each subcomponent visit and modify the event in turn.
+ * The `EventAggregator` runs an aggregation function over each event. This function buckets each event based on an `aggregateKey`,
+ and identifies the event uniquely with a `localKey` in that bucket.
+ * The default aggregation function groups similar events that differ only by `event.Message`. Its `localKey` is `event.Message` and its aggregate key is produced by joining:
+ * `event.Source.Component`
+ * `event.Source.Host`
+ * `event.InvolvedObject.Kind`
+ * `event.InvolvedObject.Namespace`
+ * `event.InvolvedObject.Name`
+ * `event.InvolvedObject.UID`
+ * `event.InvolvedObject.APIVersion`
+ * `event.Reason`
+ * If the `EventAggregator` observes a similar event produced 10 times in a 10 minute window, it drops the event that was provided as
+ input and creates a new event that differs only on the message. The message denotes that this event is used to group similar events
+ that matched on reason. This aggregated `Event` is then used in the event processing sequence.
+ * The `EventLogger` observes the event coming out of the `EventAggregator` and tracks the number of times it has observed that event previously
+ by incrementing a key in a cache associated with that matching event.
   * The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event:
     * `event.Source.Component`
     * `event.Source.Host`
     * `event.InvolvedObject.Kind`
     * `event.InvolvedObject.Namespace`
     * `event.InvolvedObject.Name`
     * `event.InvolvedObject.UID`
     * `event.InvolvedObject.APIVersion`
     * `event.Reason`
     * `event.Message`
-  * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache.
+  * The LRU cache is capped at 4096 events for both `EventAggregator` and `EventLogger`. That means if a component (e.g.
kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. * When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/unversioned/record/event.go)). * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. -- cgit v1.2.3 From f10d80bd8b208da1c5470177e0d843fe1d0de830 Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Tue, 3 Nov 2015 10:17:57 -0800 Subject: Run update-gendocs --- README.md | 4 ++-- access.md | 4 ++-- admission_control.md | 4 ++-- admission_control_limit_range.md | 4 ++-- admission_control_resource_quota.md | 4 ++-- architecture.md | 4 ++-- aws_under_the_hood.md | 4 ++-- clustering.md | 4 ++-- clustering/README.md | 4 ++-- command_execution_port_forwarding.md | 4 ++-- daemon.md | 4 ++-- event_compression.md | 4 ++-- expansion.md | 4 ++-- extending-api.md | 4 ++-- horizontal-pod-autoscaler.md | 4 ++-- identifiers.md | 4 ++-- namespaces.md | 4 ++-- networking.md | 4 ++-- persistent-storage.md | 4 ++-- principles.md | 4 ++-- resources.md | 4 ++-- secrets.md | 4 ++-- security.md | 4 ++-- security_context.md | 4 ++-- service_accounts.md | 4 ++-- simple-rolling-update.md | 4 ++-- versioning.md | 4 ++-- 27 files changed, 54 insertions(+), 54 deletions(-) diff --git a/README.md b/README.md index 72d2c662..ef5a1157 100644 --- a/README.md +++ b/README.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should 
refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/README.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/access.md b/access.md index 123516f9..10a0c9fe 100644 --- a/access.md +++ b/access.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/access.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/access.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control.md b/admission_control.md index a2b5700b..e9303728 100644 --- a/admission_control.md +++ b/admission_control.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/admission_control.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/admission_control.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index e7c706ef..d13a98f1 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
-The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/admission_control_limit_range.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/admission_control_limit_range.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index a9de7a9c..31d4a147 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/admission_control_resource_quota.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/architecture.md b/architecture.md index 2a761dea..3bb24e44 100644 --- a/architecture.md +++ b/architecture.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/architecture.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/architecture.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index ec8a31c2..9fe46d6f 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
-The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/aws_under_the_hood.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/aws_under_the_hood.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering.md b/clustering.md index 757c1f0b..66bd0784 100644 --- a/clustering.md +++ b/clustering.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/clustering.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/clustering.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering/README.md b/clustering/README.md index d02b7d50..073deb05 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/clustering/README.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/clustering/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 852e761e..dbd7b0eb 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
-The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/command_execution_port_forwarding.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/command_execution_port_forwarding.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/daemon.md b/daemon.md index a72b8755..6e783d8f 100644 --- a/daemon.md +++ b/daemon.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/daemon.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/daemon.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/event_compression.md b/event_compression.md index e1a95165..c7982712 100644 --- a/event_compression.md +++ b/event_compression.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/event_compression.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/event_compression.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/expansion.md b/expansion.md index b19731b9..770ec054 100644 --- a/expansion.md +++ b/expansion.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/expansion.md). 
+The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/expansion.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/extending-api.md b/extending-api.md index 077d5530..303ebeac 100644 --- a/extending-api.md +++ b/extending-api.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/extending-api.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/extending-api.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index 35991847..42cd27bb 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/horizontal-pod-autoscaler.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/horizontal-pod-autoscaler.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/identifiers.md b/identifiers.md index 7deff9e9..04ee4ab1 100644 --- a/identifiers.md +++ b/identifiers.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/identifiers.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/identifiers.md). 
Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/namespaces.md b/namespaces.md index bb907c67..b5965348 100644 --- a/namespaces.md +++ b/namespaces.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/namespaces.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/namespaces.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/networking.md b/networking.md index dfe0f93e..56009d5b 100644 --- a/networking.md +++ b/networking.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/networking.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/networking.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/persistent-storage.md b/persistent-storage.md index bb200811..a95ba305 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/persistent-storage.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/persistent-storage.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/principles.md b/principles.md index be3dff55..20343ac4 100644 --- a/principles.md +++ b/principles.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/principles.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/principles.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/resources.md b/resources.md index f9bbc8db..9b6ac51b 100644 --- a/resources.md +++ b/resources.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/resources.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/resources.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/secrets.md b/secrets.md index e8a5e42f..763c5567 100644 --- a/secrets.md +++ b/secrets.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/secrets.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/secrets.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/security.md b/security.md index 5c187d69..e845c925 100644 --- a/security.md +++ b/security.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
-The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/security.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/security.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/security_context.md b/security_context.md index 434d275e..413e2a2e 100644 --- a/security_context.md +++ b/security_context.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/security_context.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/security_context.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/service_accounts.md b/service_accounts.md index 8e63e045..fb065d1a 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/service_accounts.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/service_accounts.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 720f4cbf..31f31d67 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/simple-rolling-update.md). 
+The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/simple-rolling-update.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/versioning.md b/versioning.md index 341e3b36..def20a03 100644 --- a/versioning.md +++ b/versioning.md @@ -19,8 +19,8 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. -The latest 1.0.x release of this document can be found -[here](http://releases.k8s.io/release-1.0/docs/design/versioning.md). +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/versioning.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). -- cgit v1.2.3 From 442bba37e2d93ae560adffe6c20b626e5916751c Mon Sep 17 00:00:00 2001 From: Brendan Burns Date: Thu, 19 Nov 2015 13:35:24 -0800 Subject: Update daemon.md --- daemon.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/daemon.md b/daemon.md index 6e783d8f..29f7e913 100644 --- a/daemon.md +++ b/daemon.md @@ -71,7 +71,7 @@ The DaemonSet supports standard API features: - YAML example: ```YAML - apiVersion: v1 + apiVersion: extensions/v1beta1 kind: DaemonSet metadata: labels: -- cgit v1.2.3 From 53e1173488dc198aad3424fc7526452dd71f8644 Mon Sep 17 00:00:00 2001 From: Brad Erickson Date: Mon, 23 Nov 2015 19:03:44 -0800 Subject: Minion->Node rename: NODE_IP_BASE, NODE_IP_RANGES, NODE_IP_RANGE, etc NODE_IPS NODE_IP NODE_MEMORY_MB --- networking.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/networking.md b/networking.md index 56009d5b..3259a83a 100644 --- a/networking.md +++ b/networking.md @@ -134,7 +134,7 @@ Example of GCE's advanced routing rules: ```sh gcloud compute routes add "${MINION_NAMES[$i]}" \ --project "${PROJECT}" \ - --destination-range "${MINION_IP_RANGES[$i]}" \ + --destination-range "${NODE_IP_RANGES[$i]}" \ 
--network "${NETWORK}" \ --next-hop-instance "${MINION_NAMES[$i]}" \ --next-hop-instance-zone "${ZONE}" & -- cgit v1.2.3 From 718787711fc99207d148873711743279af124215 Mon Sep 17 00:00:00 2001 From: Brad Erickson Date: Mon, 23 Nov 2015 19:04:40 -0800 Subject: Minion->Node rename: NODE_NAMES, NODE_NAME, NODE_PORT --- networking.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/networking.md b/networking.md index 3259a83a..b110ca75 100644 --- a/networking.md +++ b/networking.md @@ -132,11 +132,11 @@ differentiate it from `docker0`) is set up outside of Docker proper. Example of GCE's advanced routing rules: ```sh -gcloud compute routes add "${MINION_NAMES[$i]}" \ +gcloud compute routes add "${NODE_NAMES[$i]}" \ --project "${PROJECT}" \ --destination-range "${NODE_IP_RANGES[$i]}" \ --network "${NETWORK}" \ - --next-hop-instance "${MINION_NAMES[$i]}" \ + --next-hop-instance "${NODE_NAMES[$i]}" \ --next-hop-instance-zone "${ZONE}" & ``` -- cgit v1.2.3 From 2a9c9d4c4984dde0acebbef17383e26a20be1312 Mon Sep 17 00:00:00 2001 From: Brad Erickson Date: Mon, 23 Nov 2015 19:06:36 -0800 Subject: Minion->Node rename: NUM_NODES --- aws_under_the_hood.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 9fe46d6f..a55c09e3 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -250,7 +250,7 @@ cross-AZ-clusters are more convenient. * For auto-scaling, on each nodes it creates a launch configuration and group. The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default name is kubernetes-minion-group. The auto-scaling group has a min and max size - that are both set to NUM_MINIONS. You can change the size of the auto-scaling + that are both set to NUM_NODES. You can change the size of the auto-scaling group to add or remove the total number of nodes from within the AWS API or Console. 
Each node self-configures, meaning that they come up; run Salt with the stored configuration; connect to the master; are assigned an internal CIDR; -- cgit v1.2.3 From 7c869b4d00b438bccece82d38ba3f13570ee8877 Mon Sep 17 00:00:00 2001 From: Ravi Gadde Date: Thu, 3 Sep 2015 23:50:14 -0700 Subject: Scheduler extension --- scheduler_extender.md | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 scheduler_extender.md diff --git a/scheduler_extender.md b/scheduler_extender.md new file mode 100644 index 00000000..0c10de59 --- /dev/null +++ b/scheduler_extender.md @@ -0,0 +1,117 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/scheduler_extender.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Scheduler extender + +There are three ways to add new scheduling rules (predicates and priority functions) to Kubernetes: (1) by adding these rules to the scheduler and recompiling (described here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md), (2) implementing your own scheduler process that runs instead of, or alongside of, the standard Kubernetes scheduler, (3) implementing a "scheduler extender" process that the standard Kubernetes scheduler calls out to as a final pass when making scheduling decisions. + +This document describes the third approach. This approach is needed for use cases where scheduling decisions need to be made on resources not directly managed by the standard Kubernetes scheduler. The extender helps make scheduling decisions based on such resources. (Note that the three approaches are not mutually exclusive.) + +When scheduling a pod, the extender allows an external process to filter and prioritize nodes. Two separate http/https calls are issued to the extender, one for "filter" and one for "prioritize" actions. To use the extender, you must create a scheduler policy configuration file. The configuration specifies how to reach the extender, whether to use http or https and the timeout. + +```go +// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty, +// it is assumed that the extender chose not to provide that extension. +type ExtenderConfig struct { + // URLPrefix at which the extender is available + URLPrefix string `json:"urlPrefix"` + // Verb for the filter call, empty if not supported. 
This verb is appended to the URLPrefix when issuing the filter call to extender. + FilterVerb string `json:"filterVerb,omitempty"` + // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender. + PrioritizeVerb string `json:"prioritizeVerb,omitempty"` + // The numeric multiplier for the node scores that the prioritize call generates. + // The weight should be a positive integer + Weight int `json:"weight,omitempty"` + // EnableHttps specifies whether https should be used to communicate with the extender + EnableHttps bool `json:"enableHttps,omitempty"` + // TLSConfig specifies the transport layer security config + TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"` + // HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize + // timeout is ignored, k8s/other extenders priorities are used to select the node. + HTTPTimeout time.Duration `json:"httpTimeout,omitempty"` +} +``` + +A sample scheduler policy file with extender configuration: + +```json +{ + "predicates": [ + { + "name": "HostName" + }, + { + "name": "MatchNodeSelector" + }, + { + "name": "PodFitsResources" + } + ], + "priorities": [ + { + "name": "LeastRequestedPriority", + "weight": 1 + } + ], + "extenders": [ + { + "urlPrefix": "http://127.0.0.1:12345/api/scheduler", + "filterVerb": "filter", + "enableHttps": false + } + ] +} +``` + +Arguments passed to the FilterVerb endpoint on the extender are the set of nodes filtered through the k8s predicates and the pod. Arguments passed to the PrioritizeVerb endpoint on the extender are the set of nodes filtered through the k8s predicates and extender predicates and the pod. + +```go +// ExtenderArgs represents the arguments needed by the extender to filter/prioritize +// nodes for a pod. 
+type ExtenderArgs struct { + // Pod being scheduled + Pod api.Pod `json:"pod"` + // List of candidate nodes where the pod can be scheduled + Nodes api.NodeList `json:"nodes"` +} +``` + +The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call returns priorities for each node (schedulerapi.HostPriorityList). + +The "filter" call may prune the set of nodes based on its predicates. Scores returned by the "prioritize" call are added to the k8s scores (computed through its priority functions) and used for final host selection. + +Multiple extenders can be configured in the scheduler policy. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]() + -- cgit v1.2.3 From 0f4b7ce1b071c0eb3e18a63067e953f3f896cd86 Mon Sep 17 00:00:00 2001 From: Isaac Hollander McCreery Date: Wed, 25 Nov 2015 16:22:11 -0800 Subject: Add "supported releases" language to versioning.md --- versioning.md | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/versioning.md b/versioning.md index def20a03..399752d8 100644 --- a/versioning.md +++ b/versioning.md @@ -45,7 +45,7 @@ Legend: * Kube X.Y.0-alpha.W, W > 0: Alpha releases are released roughly every two weeks directly from the master branch. No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule. * Kube X.Y.Z-beta.W: When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, (X.Y.0-beta.W | W > 0) as necessary. * Kube X.Y.0: Final release, cut from the release-X.Y branch cut two weeks prior. X.Y.1-beta.0 will be tagged at the same commit on the same branch. X.Y.0 occur 3 to 4 months after X.Y-1.0. 
-* Kube X.Y.Z, Z > 0: [Patch releases](#patches) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the same commit. +* Kube X.Y.Z, Z > 0: [Patch releases](#patch-releases) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the same commit. ### Major version timeline @@ -55,7 +55,19 @@ There is no mandated timeline for major versions. They only occur when we need t * Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (during development, with things in the tree that are not checked it,) it will be appended with -dirty. -## Release versions as related to API versions +### Supported releases + +We expect users to stay reasonably up-to-date with the versions of Kubernetes they use in production, but understand that it may take time to upgrade. + +We expect users to be running approximately the latest patch release of a given minor release; we often include critical bug fixes in [patch releases](#patch-release), and so encourage users to upgrade as soon as possible. Furthermore, we expect to "support" three minor releases at a time. With minor releases happening approximately every three months, that means a minor release is supported for approximately nine months. 
For example, when v1.3 comes out, v1.0 will no longer be considered "fit for use": basically, that means that the reasonable response to the question "my v1.0 cluster isn't working," is, "you should probably upgrade it, (and probably should have some time ago)". + +This does *not* mean that we expect to introduce breaking changes between v1.0 and v1.3, but it does mean that we probably won't have reasonable confidence in clusters where some components are running at v1.0 and others running at v1.3. + +This policy is in line with [GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade). + +## API versioning + +### Release versions as related to API versions Here is an example major release cycle: @@ -67,11 +79,11 @@ Here is an example major release cycle: * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two things: (1) users can upgrade to API v2 when running Kube 1.x and then switch over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can cleanup and remove all API v2beta\* versions because no one should have v2beta\* objects left in their database. As mentioned above, tooling will exist to make sure there are no calls or references to a given API version anywhere inside someone's kube installation before someone upgrades. * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. It *may* include the v1 API as well if the burden is not high - this will be determined on a per-major-version basis. -### Rationale for API v2 being complete before v2.0's release +#### Rationale for API v2 being complete before v2.0's release It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. 
There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options. -## Patches +## Patch releases Patch releases are intended for critical bug fixes to the latest minor version, such as addressing security vulnerabilities, fixes to problems affecting a large number of users, severe problems with no workaround, and blockers for products based on Kubernetes. @@ -82,8 +94,9 @@ Dependencies, such as Docker or Etcd, should also not be changed unless absolute ## Upgrades * Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. (Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.) + * However, we do not recommend upgrading more than two minor releases at a time (see [Supported releases](#supported-releases)), and do not recommend running non-latest patch releases of a given minor release. * No hard breaking changes over version boundaries. - * For example, if a user is at Kube 1.x, we may require them to upgrade to Kube 1.x+y before upgrading to Kube 2.x. In others words, an upgrade across major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone to go from 1.x to 1.x+y before they go to 2.x. + * For example, if a user is at Kube 1.x, we may require them to upgrade to Kube 1.x+y before upgrading to Kube 2.x. In others words, an upgrade across major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone to go from 1.x to 1.x+y before they go to 2.x. There is a separate question of how to track the capabilities of a kubelet to facilitate rolling upgrades. That is not addressed here. 
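The support and upgrade windows described above can be sketched as a small helper. This is purely illustrative (the function names and the minor-version arithmetic are assumptions, not a Kubernetes API): three minor releases are supported at a time, and rolling upgrades are recommended across at most two minor versions.

```go
package main

import "fmt"

// supported reports whether a running minor release falls inside the
// three-most-recent-minors support window described above.
func supported(latestMinor, runningMinor int) bool {
	return runningMinor <= latestMinor && latestMinor-runningMinor < 3
}

// upgradeOK reports whether an upgrade crosses at most two minor versions,
// per the recommendation in the Upgrades section.
func upgradeOK(fromMinor, toMinor int) bool {
	return toMinor >= fromMinor && toMinor-fromMinor <= 2
}

func main() {
	fmt.Println(supported(3, 1)) // v1.1 when v1.3 is latest -> true
	fmt.Println(supported(3, 0)) // v1.0 when v1.3 is latest -> false
	fmt.Println(upgradeOK(1, 3)) // two minors at once -> true
}
```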
-- cgit v1.2.3 From 5a8872ff13c3b70dea5458c481b39e2fee2bf489 Mon Sep 17 00:00:00 2001 From: Isaac Hollander McCreery Date: Tue, 1 Dec 2015 14:07:23 -0800 Subject: Clarify what is meant by 'support' --- versioning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/versioning.md b/versioning.md index 399752d8..ab7d7ecb 100644 --- a/versioning.md +++ b/versioning.md @@ -59,7 +59,7 @@ There is no mandated timeline for major versions. They only occur when we need t We expect users to stay reasonably up-to-date with the versions of Kubernetes they use in production, but understand that it may take time to upgrade. -We expect users to be running approximately the latest patch release of a given minor release; we often include critical bug fixes in [patch releases](#patch-release), and so encourage users to upgrade as soon as possible. Furthermore, we expect to "support" three minor releases at a time. With minor releases happening approximately every three months, that means a minor release is supported for approximately nine months. For example, when v1.3 comes out, v1.0 will no longer be considered "fit for use": basically, that means that the reasonable response to the question "my v1.0 cluster isn't working," is, "you should probably upgrade it, (and probably should have some time ago)". +We expect users to be running approximately the latest patch release of a given minor release; we often include critical bug fixes in [patch releases](#patch-release), and so encourage users to upgrade as soon as possible. Furthermore, we expect to "support" three minor releases at a time. "Support" means we expect users to be running that version in production, though we may not port fixes back before the latest minor version. 
For example, when v1.3 comes out, v1.0 will no longer be supported: basically, that means that the reasonable response to the question "my v1.0 cluster isn't working," is, "you should probably upgrade it, (and probably should have some time ago)". With minor releases happening approximately every three months, that means a minor release is supported for approximately nine months. This does *not* mean that we expect to introduce breaking changes between v1.0 and v1.3, but it does mean that we probably won't have reasonable confidence in clusters where some components are running at v1.0 and others running at v1.3. -- cgit v1.2.3 From c821f1f430b0525b76e27ab346b87fcb323b2455 Mon Sep 17 00:00:00 2001 From: Alex Robinson Date: Tue, 1 Dec 2015 22:24:58 -0800 Subject: Typo fixes in docs --- extending-api.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/extending-api.md b/extending-api.md index 303ebeac..1f76235f 100644 --- a/extending-api.md +++ b/extending-api.md @@ -41,7 +41,7 @@ This document describes the design for implementing the storage of custom API ty ### The ThirdPartyResource The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add -to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource, attempting to place it in a resource +to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource; attempting to place it in a namespace will return an error. Each `ThirdPartyResource` resource has the following: @@ -63,18 +63,18 @@ only specifies: Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata). 
This requirement enables the Kubernetes API server to provide the following features: - * Filtering lists of objects via LabelQueries + * Filtering lists of objects via label queries * `resourceVersion`-based optimistic concurrency via compare-and-swap * Versioned storage * Event recording - * Integration with basic `kubectl` command line tooling. - * Watch for resource changes. + * Integration with basic `kubectl` command line tooling + * Watch for resource changes The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be programmatically convertible to the name of the resource using -the following conversion. Kinds are expected to be of the form ``, the +the following conversion. Kinds are expected to be of the form ``, and the `APIVersion` for the object is expected to be `/`. To -prevent collisions, it's expected that you'll use a fulling qualified domain +prevent collisions, it's expected that you'll use a fully qualified domain name for the API group, e.g. `example.com`. For example `stable.example.com/v1` @@ -106,8 +106,8 @@ This is also the reason why `ThirdPartyResource` is not namespaced. ## Usage When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced -RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects -deleting a namespace, deletes all third party resources in that namespace. +RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects, +deleting a namespace deletes all third party resources in that namespace. 
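The kind-to-resource-path conversion described above can be illustrated with a hypothetical helper. The naive lowercase-plus-"s" pluralization and the helper itself are assumptions for illustration only; the server's actual conversion rules are more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// resourcePath sketches how a third-party kind like CronTab in group
// stable.example.com maps to a namespaced RESTful resource path.
// Naive pluralization: lowercase the kind and append "s".
func resourcePath(group, version, namespace, kind string) string {
	resource := strings.ToLower(kind) + "s"
	return fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s", group, version, namespace, resource)
}

func main() {
	fmt.Println(resourcePath("stable.example.com", "v1", "default", "CronTab"))
	// -> /apis/stable.example.com/v1/namespaces/default/crontabs
}
```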
For example, if a user creates: @@ -136,7 +136,7 @@ Now that this schema has been created, a user can `POST`: "apiVersion": "stable.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", - "image": "my-awesome-chron-image" + "image": "my-awesome-cron-image" } ``` @@ -171,14 +171,14 @@ and get back: "apiVersion": "stable.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", - "image": "my-awesome-chron-image" + "image": "my-awesome-cron-image" } ] } ``` Because all objects are expected to contain standard Kubernetes metadata fields, these -list operations can also use `Label` queries to filter requests down to specific subsets. +list operations can also use label queries to filter requests down to specific subsets. Likewise, clients can use watch endpoints to watch for changes to stored objects. @@ -196,10 +196,10 @@ Each custom object stored by the API server needs a custom key in storage, this #### Definitions - * `resource-namespace` : the namespace of the particular resource that is being stored + * `resource-namespace`: the namespace of the particular resource that is being stored * `resource-name`: the name of the particular resource being stored - * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored. - * `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored. 
+ * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored + * `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored #### Key -- cgit v1.2.3 From 7aa2f920eaa3eeffee040b117ecf3c28e5820337 Mon Sep 17 00:00:00 2001 From: deads2k Date: Fri, 23 Jan 2015 10:37:11 -0500 Subject: enhance pluggable policy --- enhance-pluggable-policy.md | 379 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 379 insertions(+) create mode 100644 enhance-pluggable-policy.md diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md new file mode 100644 index 00000000..6a881250 --- /dev/null +++ b/enhance-pluggable-policy.md @@ -0,0 +1,379 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.1/docs/design/enhance-pluggable-policy.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Enhance Pluggable Policy + +While trying to develop an authorization plugin for Kubernetes, we found a few places where API extensions would ease development and add power. There are a few goals: + 1. Provide an authorization plugin that can evaluate a .Authorize() call based on the full content of the request to RESTStorage. This includes information like the full verb, the content of creates and updates, and the names of resources being acted upon. + 1. Provide a way to ask whether a user is permitted to take an action without running in process with the API Authorizer. For instance, a proxy for exec calls could ask whether a user can run the exec they are requesting. + 1. Provide a way to ask who can perform a given action on a given resource. This is useful for answering questions like, "who can create replication controllers in my namespace". + +This proposal adds to and extends the existing API so that authorizers may provide the functionality described above. It does not attempt to describe how the policies themselves can be expressed; that is up to the authorization plugins themselves. + + +## Enhancements to existing Authorization interfaces + +The existing Authorization interfaces are described here: [docs/admin/authorization.md](../admin/authorization.md). A couple additions will allow the development of an Authorizer that matches based on different rules than the existing implementation. + +### Request Attributes + +The existing authorizer.Attributes only has 5 attributes (user, groups, isReadOnly, kind, and namespace).
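For illustration, that narrow attribute set can be sketched as follows. The type and function names here are hypothetical stand-ins, not the real `authorizer` package; the point is that policy over these five fields can only express coarse rules.

```go
package main

import "fmt"

// oldAttributes mirrors the five fields available to pre-existing
// authorizers: user, groups, a read-only flag, kind, and namespace.
type oldAttributes struct {
	User       string
	Groups     []string
	IsReadOnly bool
	Kind       string
	Namespace  string
}

// authorize is a toy policy expressible over these attributes:
// any authenticated user may perform read-only actions.
func authorize(a oldAttributes) bool {
	return a.IsReadOnly && a.User != ""
}

func main() {
	fmt.Println(authorize(oldAttributes{User: "clark", IsReadOnly: true, Kind: "pods", Namespace: "ns1"})) // true
	fmt.Println(authorize(oldAttributes{User: "clark", IsReadOnly: false}))                                // false
}
```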
If we add more detailed verbs, content, and resource names, then Authorizer plugins will have the same level of information available to RESTStorage components in order to express more detailed policy. The replacement excerpt is below. + +An API request has the following attributes that can be considered for authorization: + - user - the user-string which a user was authenticated as. This is included in the Context. + - groups - the groups to which the user belongs. This is included in the Context. + - verb - string describing the requested action. Today we have: get, list, watch, create, update, and delete. The old `readOnly` behavior is equivalent to allowing get, list, watch. + - namespace - the namespace of the object being accessed, or the empty string if the endpoint does not support namespaced objects. This is included in the Context. + - resourceGroup - the API group of the resource being accessed + - resourceVersion - the API version of the resource being accessed + - resource - which resource is being accessed + - applies only to the API endpoints, such as + `/api/v1beta1/pods`. For miscellaneous endpoints, like `/version`, the kind is the empty string. + - resourceName - the name of the resource during a get, update, or delete action. + - subresource - which subresource is being accessed + +A non-API request has 2 attributes: + - verb - the HTTP verb of the request + - path - the path of the URL being requested + + +### Authorizer Interface + +The existing Authorizer interface is very simple, but there isn't a way to provide details about allows, denies, or failures. The extended detail is useful for UIs that want to describe why certain actions are allowed or disallowed. Not all Authorizers will want to provide that information, but for those that do, having that capability is useful.
In addition, adding a `GetAllowedSubjects` method that returns back the users and groups that can perform a particular action makes it possible to answer questions like, "who can see resources in my namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down). + +```go +// OLD +type Authorizer interface { + Authorize(a Attributes) error +} +``` + +```go +// NEW +// Authorizer provides the ability to determine if a particular user can perform a particular action +type Authorizer interface { + // Authorize takes a Context (for namespace, user, and traceability) and Attributes to make a policy determination. + // reason is an optional return value that can describe why a policy decision was made. Reasons are useful during + // debugging when trying to figure out why a user or group has access to perform a particular action. + Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error) +} + +// AuthorizerIntrospection is an optional interface that provides the ability to determine which users and groups can perform a particular action. +// This is useful for building caches of who can see what: for instance, "which namespaces can this user see". That would allow +// someone to see only the namespaces they are allowed to view instead of having to choose between listing them all or listing none. +type AuthorizerIntrospection interface { + // GetAllowedSubjects takes a Context (for namespace and traceability) and Attributes to determine which users and + // groups are allowed to perform the described action in the namespace. This API enables the ResourceBasedReview requests below + GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error) +} +``` + +### SubjectAccessReviews + +This set of APIs answers the question: can a user or group (use authenticated user if none is specified) perform a given action. 
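That question can be answered generically over any authorizer, as the following sketch shows. The types here are hypothetical simplifications of the interfaces above (single concrete attribute struct, no Context), meant only to show the review-to-Authorize delegation.

```go
package main

import "fmt"

// attributes is a simplified stand-in for the proposal's Attributes.
type attributes struct {
	User, Verb, Resource, Namespace string
}

// authorizer mirrors the proposed Authorize signature in simplified form.
type authorizer interface {
	Authorize(a attributes) (allowed bool, reason string, err error)
}

// allowReads is a toy policy: read verbs are allowed, writes are denied.
type allowReads struct{}

func (allowReads) Authorize(a attributes) (bool, string, error) {
	if a.Verb == "get" || a.Verb == "list" || a.Verb == "watch" {
		return true, "read-only verbs are allowed", nil
	}
	return false, "write verbs are denied", nil
}

// subjectAccessReview answers "can this user perform this action?" by
// building Attributes from the review and delegating to any Authorizer.
func subjectAccessReview(auth authorizer, user, verb, resource, ns string) bool {
	allowed, _, err := auth.Authorize(attributes{User: user, Verb: verb, Resource: resource, Namespace: ns})
	return err == nil && allowed
}

func main() {
	fmt.Println(subjectAccessReview(allowReads{}, "Clark", "create", "pods", "my-ns")) // false
	fmt.Println(subjectAccessReview(allowReads{}, "Clark", "get", "pods", "my-ns"))    // true
}
```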
Given the Authorizer interface (proposed or existing), this endpoint can be implemented generically against any Authorizer by creating the correct Attributes and making an .Authorize() call. + +There are three different flavors: + +1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this checks to see if a specified user or group can perform a given action at the cluster scope or across all namespaces. +This is a highly privileged operation. It allows a cluster-admin to inspect rights of any person across the entire cluster and against cluster level resources. +2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` - this checks to see if the current user (including his groups) can perform a given action at any specified scope. +This is an unprivileged operation. It doesn't expose any information that a user couldn't discover simply by trying an endpoint themselves. +3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` - this checks to see if a specified user or group can perform a given action in **this** namespace. +This is a moderately privileged operation. In a multi-tenant environment, having a namespace-scoped resource makes it very easy to reason about powers granted to a namespace admin. +This allows a namespace admin (someone able to manage permissions inside of one namespace, but not all namespaces) the power to inspect whether a given user or group +can manipulate resources in his namespace. + + +SubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a SubjectAccessReview to this URL and he gets a SubjectAccessReviewResponse back. Here is an example of a call and its corresponding return.
+ +``` +// input +{ + "kind": "SubjectAccessReview", + "apiVersion": "authorization.kubernetes.io/v1", + "authorizationAttributes": { + "verb": "create", + "resource": "pods", + "user": "Clark", + "groups": ["admins", "managers"] + } +} + +// POSTed like this +curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json +// or +accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject) + +// output +{ + "kind": "SubjectAccessReviewResponse", + "apiVersion": "authorization.kubernetes.io/v1", + "allowed": true +} +``` + +PersonalSubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL and he gets a PersonalSubjectAccessReviewResponse back. Here is an example of a call and its corresponding return. + +``` +// input +{ + "kind": "PersonalSubjectAccessReview", + "apiVersion": "authorization.kubernetes.io/v1", + "authorizationAttributes": { + "verb": "create", + "resource": "pods", + "namespace": "any-ns" + } +} + +// POSTed like this +curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json +// or +accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(personalSubjectAccessReviewObject) + +// output +{ + "kind": "PersonalSubjectAccessReviewResponse", + "apiVersion": "authorization.kubernetes.io/v1", + "allowed": true +} +``` + +LocalSubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and he gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and its corresponding return.
+
+```
+// input
+{
+  "kind": "LocalSubjectAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "namespace": "my-ns",
+  "authorizationAttributes": {
+    "verb": "create",
+    "resource": "pods",
+    "user": "Clark",
+    "groups": ["admins", "managers"]
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/localSubjectAccessReviews -d @local-subject-access-review.json
+// or
+accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject)
+
+// output
+{
+  "kind": "LocalSubjectAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "namespace": "my-ns",
+  "allowed": true
+}
+```
+
+The actual Go objects look like this:
+
+```go
+type AuthorizationAttributes struct {
+	// Namespace is the namespace of the action being requested. Currently, there is no distinction between no namespace and all namespaces
+	Namespace string `json:"namespace" description:"namespace of the action being requested"`
+	// Verb is one of: get, list, watch, create, update, delete
+	Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`
+	// ResourceGroup is the API group of the resource being requested
+	ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"`
+	// ResourceVersion is the API version of the resource being requested
+	ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
+	// Resource is one of the existing resource types
+	Resource string `json:"resource" description:"one of the existing resource types"`
+	// ResourceName is the name of the resource being requested for a "get" or deleted for a "delete"
+	ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
+	// Subresource is one of the existing subresource types
+	Subresource string `json:"subresource" description:"one of the existing subresources"`
+}
+
+// SubjectAccessReview is an object
for requesting information about whether a user or group can perform an action +type SubjectAccessReview struct { + kapi.TypeMeta `json:",inline"` + + // AuthorizationAttributes describes the action being tested. + AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` + // User is optional, but at least one of User or Groups must be specified + User string `json:"user" description:"optional, user to check"` + // Groups is optional, but at least one of User or Groups must be specified + Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` +} + +// SubjectAccessReviewResponse describes whether or not a user or group can perform an action +type SubjectAccessReviewResponse struct { + kapi.TypeMeta + + // Allowed is required. True if the action would be allowed, false otherwise. + Allowed bool + // Reason is optional. It indicates why a request was allowed or denied. + Reason string +} + +// PersonalSubjectAccessReview is an object for requesting information about whether a user or group can perform an action +type PersonalSubjectAccessReview struct { + kapi.TypeMeta `json:",inline"` + + // AuthorizationAttributes describes the action being tested. + AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` +} + +// PersonalSubjectAccessReviewResponse describes whether this user can perform an action +type PersonalSubjectAccessReviewResponse struct { + kapi.TypeMeta + + // Namespace is the namespace used for the access review + Namespace string + // Allowed is required. True if the action would be allowed, false otherwise. + Allowed bool + // Reason is optional. It indicates why a request was allowed or denied. 
+	Reason string
+}
+
+// LocalSubjectAccessReview is an object for requesting information about whether a user or group can perform an action
+type LocalSubjectAccessReview struct {
+	kapi.TypeMeta `json:",inline"`
+
+	// AuthorizationAttributes describes the action being tested.
+	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+	// User is optional, but at least one of User or Groups must be specified
+	User string `json:"user" description:"optional, user to check"`
+	// Groups is optional, but at least one of User or Groups must be specified
+	Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
+}
+
+// LocalSubjectAccessReviewResponse describes whether or not a user or group can perform an action
+type LocalSubjectAccessReviewResponse struct {
+	kapi.TypeMeta
+
+	// Namespace is the namespace used for the access review
+	Namespace string
+	// Allowed is required. True if the action would be allowed, false otherwise.
+	Allowed bool
+	// Reason is optional. It indicates why a request was allowed or denied.
+	Reason string
+}
+```
+
+
+### ResourceAccessReview
+
+This set of APIs answers the question: which users and groups can perform the specified verb on the specified resource kind. Given the Authorizer interface described above, this endpoint can be implemented generically against any Authorizer by calling the .GetAllowedSubjects() function.
+
+There are two different flavors:
+
+1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReviews` - this checks to see which users and groups can perform a given action at the cluster scope or across all namespaces.
+This is a highly privileged operation. It allows a cluster-admin to inspect the rights of all subjects across the entire cluster and against cluster level resources.
+2.
`/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` - this checks to see which users and groups can perform a given action in **this** namespace.
+This is a moderately privileged operation. In a multi-tenant environment, having a namespace-scoped resource makes it easy to reason about the powers granted to a namespace admin.
+This gives a namespace admin (someone able to manage permissions inside one namespace, but not all namespaces) the power to inspect which users and groups
+can manipulate resources in that namespace.
+
+ResourceAccessReview is a runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a ResourceAccessReview to this URL and gets a ResourceAccessReviewResponse back. Here is an example of a call and its corresponding return.
+
+```
+// input
+{
+  "kind": "ResourceAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "authorizationAttributes": {
+    "verb": "list",
+    "resource": "replicationcontrollers"
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json
+// or
+accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject)
+
+// output
+{
+  "kind": "ResourceAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "users": ["Clark", "Hubert"],
+  "groups": ["cluster-admins"]
+}
+```
+
+The actual Go objects look like this:
+
+```go
+// ResourceAccessReview is a means to request a list of which users and groups are authorized to perform the
+// action specified by spec
+type ResourceAccessReview struct {
+	kapi.TypeMeta `json:",inline"`
+
+	// AuthorizationAttributes describes the action being tested.
+ AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` +} + +// ResourceAccessReviewResponse describes who can perform the action +type ResourceAccessReviewResponse struct { + kapi.TypeMeta + + // Users is the list of users who can perform the action + Users []string + // Groups is the list of groups who can perform the action + Groups []string +} + +// LocalResourceAccessReview is a means to request a list of which users and groups are authorized to perform the +// action specified in a specific namespace +type LocalResourceAccessReview struct { + kapi.TypeMeta `json:",inline"` + + // AuthorizationAttributes describes the action being tested. + AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` +} + +// LocalResourceAccessReviewResponse describes who can perform the action +type LocalResourceAccessReviewResponse struct { + kapi.TypeMeta + + // Namespace is the namespace used for the access review + Namespace string + // Users is the list of users who can perform the action + Users []string + // Groups is the list of groups who can perform the action + Groups []string +} + +``` + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]() + -- cgit v1.2.3 From 4634819e313be2270469f45a4c4d4afe53dc6c65 Mon Sep 17 00:00:00 2001 From: Brad Erickson Date: Thu, 3 Dec 2015 15:42:10 -0800 Subject: Minion->Node rename: docs/ machine names only, except gce/aws --- event_compression.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/event_compression.md b/event_compression.md index c7982712..a8c5916b 100644 --- a/event_compression.md +++ b/event_compression.md @@ -119,17 +119,17 @@ Sample kubectl output ```console FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE -Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion 
starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-2.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. 
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal ``` This demonstrates what would have been 20 separate entries (indicating scheduling failure) 
collapsed/compressed down to 5 entries. -- cgit v1.2.3 From bdd5d6654f654d21a18df8c1364f066f0149071b Mon Sep 17 00:00:00 2001 From: Chao Xu Date: Mon, 14 Dec 2015 10:37:38 -0800 Subject: run hack/update-generated-docs.sh --- README.md | 1 + access.md | 1 + admission_control.md | 1 + admission_control_limit_range.md | 1 + admission_control_resource_quota.md | 1 + architecture.md | 1 + aws_under_the_hood.md | 4 ---- clustering.md | 1 + clustering/README.md | 1 + command_execution_port_forwarding.md | 1 + daemon.md | 1 + enhance-pluggable-policy.md | 4 ---- event_compression.md | 1 + expansion.md | 1 + extending-api.md | 1 + horizontal-pod-autoscaler.md | 1 + identifiers.md | 1 + namespaces.md | 1 + networking.md | 1 + persistent-storage.md | 1 + principles.md | 1 + resources.md | 1 + scheduler_extender.md | 4 ---- secrets.md | 1 + security.md | 1 + security_context.md | 1 + service_accounts.md | 1 + simple-rolling-update.md | 1 + versioning.md | 1 + 29 files changed, 26 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index ef5a1157..e7beb90b 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/README.md). diff --git a/access.md b/access.md index 10a0c9fe..fa173392 100644 --- a/access.md +++ b/access.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/access.md). diff --git a/admission_control.md b/admission_control.md index e9303728..37cf5e1f 100644 --- a/admission_control.md +++ b/admission_control.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
+ The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/admission_control.md). diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index d13a98f1..890ba37d 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/admission_control_limit_range.md). diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 31d4a147..2b01ea7e 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md). diff --git a/architecture.md b/architecture.md index 3bb24e44..93213066 100644 --- a/architecture.md +++ b/architecture.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/architecture.md). diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index a55c09e3..7d895627 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -18,10 +18,6 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/aws_under_the_hood.md). - Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/clustering.md b/clustering.md index 66bd0784..01df7410 100644 --- a/clustering.md +++ b/clustering.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/clustering.md). diff --git a/clustering/README.md b/clustering/README.md index 073deb05..6f3d379c 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/clustering/README.md). diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index dbd7b0eb..89ed7665 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/command_execution_port_forwarding.md). diff --git a/daemon.md b/daemon.md index 29f7e913..d8ed8d43 100644 --- a/daemon.md +++ b/daemon.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/daemon.md). diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 6a881250..1ee9bf29 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -18,10 +18,6 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. 
- -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/enhance-pluggable-policy.md). - Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/event_compression.md b/event_compression.md index a8c5916b..c8030559 100644 --- a/event_compression.md +++ b/event_compression.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/event_compression.md). diff --git a/expansion.md b/expansion.md index 770ec054..371f7c86 100644 --- a/expansion.md +++ b/expansion.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/expansion.md). diff --git a/extending-api.md b/extending-api.md index 1f76235f..5f5e6c0a 100644 --- a/extending-api.md +++ b/extending-api.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/extending-api.md). diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index 42cd27bb..7c54da06 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/horizontal-pod-autoscaler.md). 
diff --git a/identifiers.md b/identifiers.md index 04ee4ab1..ca2c95df 100644 --- a/identifiers.md +++ b/identifiers.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/identifiers.md). diff --git a/namespaces.md b/namespaces.md index b5965348..45e07f72 100644 --- a/namespaces.md +++ b/namespaces.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/namespaces.md). diff --git a/networking.md b/networking.md index b110ca75..e5807b50 100644 --- a/networking.md +++ b/networking.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/networking.md). diff --git a/persistent-storage.md b/persistent-storage.md index a95ba305..7aa9bfa9 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/persistent-storage.md). diff --git a/principles.md b/principles.md index 20343ac4..52b839fb 100644 --- a/principles.md +++ b/principles.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/principles.md). 
diff --git a/resources.md b/resources.md index 9b6ac51b..069ddd6c 100644 --- a/resources.md +++ b/resources.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/resources.md). diff --git a/scheduler_extender.md b/scheduler_extender.md index 0c10de59..3a55139d 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -18,10 +18,6 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/scheduler_extender.md). - Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/secrets.md b/secrets.md index 763c5567..a9941cb3 100644 --- a/secrets.md +++ b/secrets.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/secrets.md). diff --git a/security.md b/security.md index e845c925..db380250 100644 --- a/security.md +++ b/security.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/security.md). diff --git a/security_context.md b/security_context.md index 413e2a2e..8b9b8c12 100644 --- a/security_context.md +++ b/security_context.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/security_context.md). 
diff --git a/service_accounts.md b/service_accounts.md index fb065d1a..72c3df81 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/service_accounts.md). diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 31f31d67..e34e695c 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/simple-rolling-update.md). diff --git a/versioning.md b/versioning.md index ab7d7ecb..99caa6e6 100644 --- a/versioning.md +++ b/versioning.md @@ -18,6 +18,7 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + The latest release of this document can be found [here](http://releases.k8s.io/release-1.1/docs/design/versioning.md). -- cgit v1.2.3 From f04f12d31546c069df69a9f706fef41542e51a6e Mon Sep 17 00:00:00 2001 From: Ed Costello Date: Thu, 29 Oct 2015 14:36:29 -0400 Subject: Copy edits for typos --- aws_under_the_hood.md | 6 +++--- daemon.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index a55c09e3..d7feb8fc 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -95,7 +95,7 @@ you with sufficient instance storage for your needs. Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track its state. Similar to nodes, containers are mostly run against instance -storage, except that we repoint some important data onto the peristent volume. +storage, except that we repoint some important data onto the persistent volume. 
The default storage driver for Docker images is aufs. Specifying btrfs (by passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a good choice for a filesystem. btrfs @@ -176,7 +176,7 @@ a distribution file, and then are responsible for attaching and detaching EBS volumes from itself. The node policy is relatively minimal. The master policy is probably overly -permissive. The security concious may want to lock-down the IAM policies +permissive. The security conscious may want to lock-down the IAM policies further ([#11936](http://issues.k8s.io/11936)). We should make it easier to extend IAM permissions and also ensure that they @@ -275,7 +275,7 @@ Salt, for example). These objects can currently be manually created: * Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket. * Set the `VPC_ID` environment variable to reuse an existing VPC. -* Set the `SUBNET_ID` environemnt variable to reuse an existing subnet. +* Set the `SUBNET_ID` environment variable to reuse an existing subnet. * If your route table has a matching `KubernetesCluster` tag, it will be reused. * If your security groups are appropriately named, they will be reused. diff --git a/daemon.md b/daemon.md index 29f7e913..a5ff3215 100644 --- a/daemon.md +++ b/daemon.md @@ -65,7 +65,7 @@ The DaemonSet supports standard API features: - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’. - Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node. - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller. - - The initial implementation will not guarnatee that DaemonSet pods are created on nodes before other pods. 
+ - The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods. - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary. - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \" - YAML example: -- cgit v1.2.3 From c8f51acbd8546cc5943da54b3e0af2b3d6b0e407 Mon Sep 17 00:00:00 2001 From: Christophe Augello Date: Thu, 21 Jan 2016 13:19:05 +0100 Subject: rename anchor tl;dr to Abstract --- persistent-storage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/persistent-storage.md b/persistent-storage.md index 7aa9bfa9..5db565c7 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -36,7 +36,7 @@ Documentation for other releases can be found at This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. -### tl;dr +### Abstract Two new API kinds: -- cgit v1.2.3 From 19371b130701cd44089049c4965e7796c0eae92c Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Sun, 6 Dec 2015 13:01:35 -0800 Subject: Design doc for Indexed Jobs. --- indexed-job.md | 895 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 895 insertions(+) create mode 100644 indexed-job.md diff --git a/indexed-job.md b/indexed-job.md new file mode 100644 index 00000000..7b72cc0f --- /dev/null +++ b/indexed-job.md @@ -0,0 +1,895 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +

+--
+
+
+
+
+
+# Design: Indexed Feature of Job object
+
+
+## Summary
+
+This design extends Kubernetes with user-friendly support for
+running embarrassingly parallel jobs.
+
+Here, *parallel* means on multiple nodes, which means multiple pods.
+By *embarrassingly parallel*, it is meant that the pods
+have no dependencies between each other. In particular, neither
+ordering between pods nor gang scheduling is supported.
+
+Users already have two other options for running embarrassingly parallel
+Jobs (described in the next section), but both have ease-of-use issues.
+
+Therefore, this document proposes extending the Job resource type to support
+a third way to run embarrassingly parallel programs, with a focus on
+ease of use.
+
+This new style of Job is called an *indexed job*, because each Pod of the Job
+is specialized to work on a particular *index* from a fixed-length array of work items.
+
+## Background
+
+The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
+the embarrassingly parallel use case through *workqueue jobs*.
+While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns)
+are very flexible, they can be difficult to use.
+They (1) typically require running a message queue
+or other database service, (2) typically require modifications
+to existing binaries and images, and (3) are prone to subtle race conditions
+that are easy to overlook.
+
+Users also have another option for parallel jobs: creating [multiple Job objects
+from a template](../../docs/user-guide/jobs.md#job-patterns).
+For small numbers of Jobs, this is a fine choice. Labels make it easy to view and
+delete multiple Job objects at once. But that approach also has its drawbacks:
+(1) for large levels of parallelism (hundreds or thousands of pods) this approach
+means that listing all jobs presents too much information, (2) users want a single
+source of information about the success or failure of what the user views as a single
+logical process.
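To make the "multiple Job objects from a template" alternative concrete, here is a small Go sketch that stamps out one Job manifest per input file. The manifest shape, image name, label, and file list are illustrative assumptions, not part of this proposal:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"text/template"
)

// jobTmpl is a deliberately minimal Job manifest used only to
// illustrate the templating approach; a real manifest would carry
// a complete pod spec.
const jobTmpl = `apiVersion: extensions/v1beta1
kind: Job
metadata:
  name: process-file-{{.Index}}
  labels:
    jobgroup: process-files
spec:
  template:
    spec:
      containers:
      - name: worker
        image: example/worker
        command: ["/usr/local/bin/process_file", "{{.File}}"]
      restartPolicy: Never
`

type workItem struct {
	Index int
	File  string
}

// renderJob expands the template for a single work item.
func renderJob(index int, file string) (string, error) {
	t, err := template.New("job").Parse(jobTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, workItem{Index: index, File: file}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// One Job manifest per work item; documents are separated by "---".
	for i, f := range []string{"12342.dat", "97283.dat", "38732.dat"} {
		manifest, err := renderJob(i, f)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Print(manifest)
		fmt.Println("---")
	}
}
```

Each rendered manifest would then be fed to something like `kubectl create -f -`; the shared `jobgroup` label is what makes it possible to view and delete the whole group at once, as noted above.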
+
+Indexed job provides a third option with better ease of use for common use cases.
+
+## Requirements
+
+### User Requirements
+
+- Users want an easy way to run a Pod to completion *for each* item within a
+  [work list](#example-use-cases).
+
+- Users want to run these pods in parallel for speed, but to vary the level of
+  parallelism as needed, independent of the number of work items.
+
+- Users want to do this without requiring changes to existing images,
+or source-to-image pipelines.
+
+- Users want a single object that encompasses the lifetime of the parallel
+  program. Deleting it should delete all dependent objects. It should report
+  the status of the overall process. Users should be
+  able to wait for it to complete, and can refer to it from other resource types, such as
+  [ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).
+
+
+### Example Use Cases
+
+Here are several examples of *work lists*: lists of command lines that the
+user wants to run, each line as its own Pod. (Note that in practice, a work
+list may never be written out in this form, but it exists in the mind of
+the Job creator, and it is a useful way to talk about the intent of the user when discussing alternatives for specifying Indexed Jobs.)
+
+Note that we will not have the user express their requirements in work list
+form; it is just a format for presenting use cases. Subsequent discussion
+will reference these work lists.
+
+#### Work List 1
+
+Process several files with the same program
+
+```
+/usr/local/bin/process_file 12342.dat
+/usr/local/bin/process_file 97283.dat
+/usr/local/bin/process_file 38732.dat
+```
+
+#### Work List 2
+
+Process a matrix (or image, etc) in rectangular blocks
+
+```
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+#### Work List 3
+
+Build a program at several different git commits
+
+```
+HASH=3cab5cb4a; git checkout $HASH && make clean && make VERSION=$HASH
+HASH=fe97ef90b; git checkout $HASH && make clean && make VERSION=$HASH
+HASH=a8b5e34c5; git checkout $HASH && make clean && make VERSION=$HASH
+```
+
+#### Work List 4
+
+Render several frames of a movie.
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
+```
+
+#### Work List 5
+
+Render several blocks of frames. (Rendering in blocks avoids Pod startup overhead for every frame.)
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
+```
+
+## Design Discussion
+
+### Converting Work Lists into Indexed Jobs
+
+Given a work list, like in the [example use cases](#example-use-cases) above,
+the information from the work list needs to get into each Pod of the Job.
+
+Users will typically not want to create a new image for each job they
+run. They will want to use existing images. So, the image is not the place
+for the work list.
+
+A work list can be stored on networked storage, and mounted by pods of the job.
+Also, as a shortcut, for small work lists, it can be included in an annotation on the Job object,
+which is then exposed as a volume in the pod via the downward API.
+
+### What Varies Between Pods of a Job
+
+Pods need to differ in some way to do something different. (They do not
+differ in the work-queue style of Job, but that style has ease-of-use issues.)
+
+A general approach would be to allow pods to differ from each other in arbitrary ways.
+For example, the Job object could have a list of PodSpecs to run.
+However, this is so general that it provides little value. It would:
+
+- make the Job Spec very verbose, especially for jobs with thousands of work items
+- make Job such a vague concept that it is hard to explain to users
+- not match practice: we do not see cases where many pods differ across many fields of their
+  specs yet need to run as a group with no ordering constraints
+- require CLIs and UIs to support more options for creating a Job
+- complicate monitoring and accounting databases, which want to aggregate data for pods
+  with the same controller; pods with very different specs may not make sense
+  to aggregate
+- mean that profiling, debugging, accounting, auditing and monitoring tools cannot assume common
+  images/files, behaviors, provenance and so on between Pods of a Job.
+
+Also, variety has another cost. Pods which differ in ways that affect scheduling
+(node constraints, resource requirements, labels) prevent the scheduler
+from treating them as fungible, which is an important scheduler optimization.
+
+Therefore, we will not allow Pods from the same Job to differ arbitrarily
+(anyway, users can use multiple Job objects for that case). We will try to
+allow as little as possible to differ between pods of the same Job, while
+still allowing users to express common parallel patterns easily.
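To make the restriction concrete, here is a sketch (illustrative only, not kubectl or controller code) of how work list 1 fits this model: every pod runs an identical command line, and only the value of a single env var `F` differs. Each loop iteration stands in for one pod.

```shell
# Each iteration simulates one pod of the Job: the command line (single-quoted,
# so $F is expanded inside the simulated pod, not here) is identical for every
# pod; only the injected env var F varies.
for F in 12342.dat 97283.dat 38732.dat; do
  line=$(F="$F" sh -c 'echo "process_file $F"')
  echo "$line"
done
```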
+For users who need to run jobs which differ in other ways, they can create multiple
+Jobs, and manage them as a group using labels.
+
+From the above work lists, we see a need for Pods which differ in their command
+lines, and in their environment variables. These work lists do not require the
+pods to differ in other ways.
+
+Experience with a [similar system](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) has shown this model to be applicable
+to a very broad range of problems, despite this restriction.
+
+Therefore, we propose to allow pods in the same Job to differ **only** in the following aspects:
+
+- command line
+- environment variables
+
+
+### Composition of existing images
+
+The docker image that is used in a job may not be maintained by the person
+running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
+If we require people to specify the complete command line to use Indexed Job,
+then they will not automatically pick up changes in the default
+command or args.
+
+This needs more thought.
+
+### Running Ad-Hoc Jobs using kubectl
+
+A user should be able to easily start an Indexed Job using `kubectl`.
+For example, to run [work list 1](#work-list-1), a user should be able
+to type something simple like:
+
+```
+kubectl run process-files --image=myfileprocessor \
+  --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+  --restart=OnFailure \
+  -- \
+  /usr/local/bin/process_file '$F'
+```
+
+In the above example:
+
+- `--restart=OnFailure` implies creating a job instead of a replicationController.
+- Each pod's command line is `/usr/local/bin/process_file $F`.
+- `--per-completion-env=` implies the job's `.spec.completions` is set to the length of the argument array (3 in the example).
+- `--per-completion-env=F=` causes an env var named `F` to be available in the environment when the command line is evaluated.
+
+How exactly this happens is discussed later in the doc: this is a sketch of the user experience.
+
+In practice, the list of files might be much longer and stored in a file
+on the user's local host, like:
+
+```
+$ cat files-to-process.txt
+12342.dat
+97283.dat
+38732.dat
+...
+```
+
+So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
+
+However, `kubectl` should also support a format like:
+ `--per-completion-env=F=@files-to-process.txt`.
+That allows `kubectl` to parse the file, point out any syntax errors, and avoid running up against command-line length limits (2 MB is common; as little as 4 kB is POSIX-compliant).
+
+One case we do not try to handle is where the file of work is stored on a cloud filesystem, and not accessible from the user's local host. Then we cannot easily use indexed job, because we do not know the number of completions. The user needs to copy the file locally first or use the Work-Queue style of Job (already supported).
+
+Another case we do not try to handle is where the input file does not exist yet because this Job is to be run at a future time, or depends on another job. The workflow and scheduled job proposals need to consider this case. For that case, you could use an indexed job which runs a program which shards the input file (map-reduce-style).
+
+#### Multiple parameters
+
+The user may also have multiple parameters, like in [work list 2](#work-list-2).
+One way is to just list all the command lines already expanded, one per line, in a file, like this:
+
+```
+$ cat matrix-commandlines.txt
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+and run the Job like this:
+
+```
+kubectl run process-matrix --image=my/matrix \
+  --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
+  --restart=OnFailure \
+  -- \
+  'eval "$COMMAND_LINE"'
+```
+
+However, this may have some subtleties with shell escaping. Also, it depends on the user
+knowing all the correct arguments to the docker image being used (more on this later).
+
+Instead, kubectl should support multiple instances of the `--per-completion-env` flag. For example, to implement work list 2, a user could do:
+
+```
+kubectl run process-matrix --image=my/matrix \
+  --per-completion-env=SR="0 16 0 16" \
+  --per-completion-env=ER="15 31 15 31" \
+  --per-completion-env=SC="0 0 16 16" \
+  --per-completion-env=EC="15 15 31 31" \
+  --restart=OnFailure \
+  -- \
+  /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC
+```
+
+### Composition With Workflows and ScheduledJob
+
+A user should be able to create a job (Indexed or not) which runs at a specific time or times.
+For example:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+  --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+  --restart=OnFailure \
+  --runAt=2015-07-21T14:00:00Z \
+  -- \
+  /usr/local/bin/process_file '$F'
+created "scheduledJob/process-files-37dt3"
+```
+
+Kubectl should build the same JobSpec, and then put it into a ScheduledJob (#11980) and create that.
+
+For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a complete workflow from a single command line would be messy, because of the need to specify all the arguments multiple times.
+
+For that use case, the user could create a workflow message by hand.
+Or the user could create a job template, and then make a workflow from the templates, perhaps like this:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+  --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+  --restart=OnFailure \
+  --asTemplate \
+  -- \
+  /usr/local/bin/process_file '$F'
+created "jobTemplate/process-files"
+$ kubectl run merge-files --image=mymerger \
+  --restart=OnFailure \
+  --asTemplate \
+  -- \
+  /usr/local/bin/mergefiles 12342.out 97283.out 38732.out
+created "jobTemplate/merge-files"
+$ kubectl create-workflow process-and-merge \
+  --job=jobTemplate/process-files \
+  --job=jobTemplate/merge-files \
+  --dependency=process-files:merge-files
+created "workflow/process-and-merge"
+```
+
+### Completion Indexes
+
+A JobSpec specifies the number of times a pod needs to complete successfully,
+through the `job.Spec.Completions` field. The number of completions
+will be equal to the number of work items in the work list.
+
+Each pod that the job controller creates is intended to complete one work item
+from the work list. Since a pod may fail, several pods may, serially,
+attempt to complete the same index. Therefore, we call it
+a *completion index* (or just *index*), but not a *pod index*.
+
+For each completion index, in the range 0 to `.spec.completions - 1`,
+the job controller will create a pod with that index, and keep creating them
+on failure, until each index is completed.
+
+A dense integer index, rather than a sparse string index (e.g. using just
+`metadata.generateName`), makes it easy to use the index to look up parameters
+in, for example, an array in shared storage.
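With a dense index, for instance, a pod can select its work item by line number from a file on shared storage. A minimal sketch (the file path and the `INDEX` variable are illustrative assumptions, not part of the API):

```shell
# A local file standing in for an array of work items on shared storage.
printf '12342.dat\n97283.dat\n38732.dat\n' > /tmp/work-items.txt

INDEX=1   # the pod's completion index, however it was delivered to the pod
# Dense 0-based indexes map directly onto 1-based file line numbers.
ITEM=$(sed -n "$((INDEX + 1))p" /tmp/work-items.txt)
echo "$ITEM"
```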
+
+### Pod Identity and Template Substitution in Job Controller
+
+The JobSpec contains a single pod template. When the job controller creates a particular
+pod, it copies the pod template and modifies it in some way to make that pod distinctive.
+Whatever is distinctive about that pod is its *identity*.
+
+We consider several options.
+
+#### Index Substitution Only
+
+The job controller substitutes only the *completion index* of the pod into the
+pod template when creating it. The JSON it POSTs differs only in a single
+field.
+
+We would put the completion index, as a stringified integer, into an
+annotation of the pod. The user can extract it from the annotation
+into an env var via the downward API, or put it in a file via a Downward
+API volume, and parse it themselves.
+
+
+Once it is an environment variable in the pod (say `$INDEX`),
+then one of two things can happen.
+
+First, the main program can know how to map from an integer index to what it
+needs to do.
+For example, from Work List 4 above:
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
+```
+
+Second, a shell script can be prepended to the original command line which maps the
+index to one or more string parameters. For example, to implement Work List 5 above,
+you could do:
+
+```
+/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
+```
+
+In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX` and exports `$START_FRAME` and `$END_FRAME`.
+
+The shell script could be part of the image, but more usefully, it could be generated by a program and stuffed in an annotation
+or a configMap, and from there added to a volume.
+
+The first approach may require the user
+to modify an existing image (see next section) to be able to accept an `$INDEX` env var or argument.
+The second approach requires that the image have a shell.
We think that together these two options
+cover a wide range of use cases (though not all).
+
+#### Multiple Substitution
+
+In this option, the JobSpec is extended to include a list of values to substitute,
+and which fields to substitute them into. For example, a worklist like this:
+
+```
+FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
+FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
+FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
+```
+
+Can be broken down into a template like this, with three parameter slots (written here as `<1>`, `<2>`, and `<3>`):
+
+```
+<1> process-fruit -a -b -c <2> <3>
+```
+
+and a list of parameter tuples, like this:
+
+```
+("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
+("FRUIT_COLOR=yellow", "-f banana.txt", "")
+("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
+```
+
+The JobSpec can be extended to hold a list of parameter tuples (which
+are more easily expressed as a list of lists of individual parameters).
+For example:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+  completions: 3
+  ...
+  template:
+    ...
+  perCompletionArgs:
+    container: 0
+    args:
+    - - "-f apple.txt"
+      - "-f banana.txt"
+      - "-f cherry.txt"
+    - - "--remove-seeds"
+      - ""
+      - "--remove-pit"
+  perCompletionEnvVars:
+  - name: "FRUIT_COLOR"
+    values:
+    - "green"
+    - "yellow"
+    - "red"
+```
+
+However, just providing custom env vars, and not arguments, is sufficient
+for many use cases: parameters can be put into env vars and then
+substituted on the command line.
+
+#### Comparison
+
+The multiple substitution approach:
+
+- keeps the *per completion parameters* in the JobSpec.
+- Drawback: makes the job spec large for jobs with thousands of completions. (But for very large jobs, the work-queue style or another type of controller, such as map-reduce or Spark, may be a better fit.)
+- Drawback: is a form of server-side templating, which we want in Kubernetes but have not fully designed
+  (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
+
+
+The index-only approach:
+
+- requires that the user keep the *per completion parameters* in separate storage, such as a configData or networked storage.
+- makes no changes to the JobSpec.
+- Drawback: while in separate storage, the parameters could be mutated, which would have unexpected effects.
+- Drawback: logic for using the index to look up parameters needs to be in the Pod.
+- Drawback: CLIs and UIs are limited to using the "index" as the identity of a pod
+  from a job. They cannot easily say, for example, `repeated failures on the pod processing banana.txt`.
+
+
+The index-only approach relies on at least one of the following being true:
+
+1. the image contains a shell and certain shell commands (not all images have this)
+1. the main program directly consumes the index from an annotation (file or env var) and maps it to the specific behavior required.
+
+Also, using the index-only approach from
+non-kubectl clients requires that they mimic the script-generation step,
+or use only the direct-consumption style.
+
+#### Decision
+
+We have decided to implement the index-only approach for now. Once the server-side
+templating design is complete for Kubernetes, and we have feedback from users,
+we can consider adding multiple substitution.
+
+## Detailed Design
+
+#### Job Resource Schema Changes
+
+No changes are made to the JobSpec.
+
+
+The JobStatus is also not changed.
+The user can gauge the progress of the job by the `.status.succeeded` count.
+
+
+#### Job Spec Compatibility
+
+A job spec written before this change will work exactly the same
+as before with the new controller.
+The Pods it creates will have the same environment as before.
+They will have a new annotation, but pods are expected to tolerate
+unfamiliar annotations.
+
+However, if the job controller version is reverted to a version before this change,
+jobs whose pod specs depend on the new annotation will fail. This is
+okay for a Beta resource.
+
+#### Job Controller Changes
+
+The Job controller will maintain for each Job a data structure which
+indicates the status of each completion index. We call this the
+*scoreboard* for short. It is an array of length `.spec.completions`.
+Elements of the array are of an `enum` type, with possible values including
+`complete`, `running`, and `notStarted`.
+
+The scoreboard is stored in Job Controller
+memory for efficiency. It can be reconstructed by
+watching the pods of the job (such as on a controller manager restart);
+the completion index of each pod can be extracted from the pod annotation.
+
+When the Job controller sees that the number of running pods is less than the desired
+parallelism of the job, it finds the first index in the scoreboard with value
+`notStarted`. It creates a pod with this completion index.
+
+When it creates a pod with completion index `i`, it makes a copy
+of the `.spec.template`, and sets
+`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]`
+to `i`. It does this in both the index-only and multiple-substitution options.
+
+Then it creates the pod.
+
+When the controller notices that a pod has completed, is running, or has failed,
+it updates the scoreboard.
+
+When all entries in the scoreboard are `complete`, the job is complete.
+
+
+#### Downward API Changes
+
+The downward API is changed to support extracting a specific annotation key
+into a single environment variable. So, the following would be supported:
+
+```
+kind: Pod
+version: v1
+spec:
+  containers:
+  - name: foo
+    env:
+    - name: MY_INDEX
+      valueFrom:
+        fieldRef:
+          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+This requires kubelet changes.
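A rough shell approximation of the intended extraction (the real change would be in the kubelet, in Go, and the flattened `key="value"` annotation format used here is an assumption for illustration):

```shell
# Illustrative only: select the value of a single annotation key out of a
# flattened key="value" list, as the proposed downward API change would.
annotations='build="two" kubernetes.io/job/completion-index="2"'
MY_INDEX=$(echo "$annotations" | tr ' ' '\n' \
  | grep '^kubernetes.io/job/completion-index=' | cut -d '"' -f 2)
echo "$MY_INDEX"
```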
+
+Users who fail to upgrade their kubelets at the same time as they upgrade their controller
+manager will see pods created by the controller fail to run.
+The Kubelet will send an event about the failure to create the pod.
+`kubectl describe job` will show many failed pods.
+
+
+#### Kubectl Interface Changes
+
+The `--completions` and `--completion-index-var-name` flags are added to kubectl.
+
+For example, this command:
+
+```
+kubectl run say-number --image=busybox \
+  --completions=3 \
+  --completion-index-var-name=I \
+  -- \
+  sh -c 'echo "My index is $I" && sleep 5'
+```
+
+will run 3 pods to completion, each printing one of the following lines:
+
+```
+My index is 1
+My index is 2
+My index is 0
+```
+
+Kubectl would create the following pod:
+
+
+
+Kubectl will also support the `--per-completion-env` flag, as described previously.
+For example, this command:
+
+```
+kubectl run say-fruit --image=busybox \
+  --per-completion-env=FRUIT="apple banana cherry" \
+  --per-completion-env=COLOR="green yellow red" \
+  -- \
+  sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or equivalently:
+
+```
+echo "apple banana cherry" > fruits.txt
+echo "green yellow red" > colors.txt
+
+kubectl run say-fruit --image=busybox \
+  --per-completion-env=FRUIT="$(cat fruits.txt)" \
+  --per-completion-env=COLOR="$(cat colors.txt)" \
+  -- \
+  sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or similarly:
+
+```
+kubectl run say-fruit --image=busybox \
+  --per-completion-env=FRUIT=@fruits.txt \
+  --per-completion-env=COLOR=@colors.txt \
+  -- \
+  sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+will all run 3 pods in parallel. Index 0 pod will log:
+
+```
+Have a nice green apple
+```
+
+and so on.
+
+
+Notes:
+
+- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a quoted
+  space separated list or `@` and the name of a text file containing a list.
+- `--per-completion-env=` can be specified several times, but all lists must have the
+  same length.
+- `--completions=N` with `N` equal to the list length is implied.
+- The flag `--completions=3` sets `job.spec.completions=3`.
+- The flag `--completion-index-var-name=I` causes an env var named `I` to be created in each pod, with the index in it.
+- The flag `--restart=OnFailure` is implied by `--completions` or any job-specific arguments. The user can also specify
+  `--restart=Never` if they desire, but may not specify `--restart=Always` with job-related flags.
+- Setting any of these flags in turn tells kubectl to create a Job, not a replicationController.
+
+#### How Kubectl Creates Job Specs
+
+To pass in the parameters, kubectl will generate a shell script which
+can:
+- parse the index from the annotation
+- hold all the parameter lists.
+- look up the entry for that index in each parameter list and set an env var.
+
+For example, consider this command:
+
+```
+kubectl run say-fruit --image=busybox \
+  --per-completion-env=FRUIT="apple banana cherry" \
+  --per-completion-env=COLOR="green yellow red" \
+  -- \
+  sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+First, kubectl generates the PodSpec as it normally does for `kubectl run`.
+
+But, then it will generate this script:
+
+```sh
+#!/bin/sh
+# Generated by kubectl run ...
+# Check for needed commands
+if ! type cat > /dev/null 2>&1
+then
+  echo "$0: Image does not include required command: cat"
+  exit 2
+fi
+if ! type grep > /dev/null 2>&1
+then
+  echo "$0: Image does not include required command: grep"
+  exit 2
+fi
+# Check that annotations are mounted from downward API
+if [ ! -e /etc/annotations ]
+then
+  echo "$0: Cannot find /etc/annotations"
+  exit 2
+fi
+# Get our index from the annotations file
+I=$(grep 'kubernetes.io/job/completion-index' /etc/annotations | cut -f 2 -d '"') || echo "$0: failed to extract index"
+export I
+
+# Our parameter lists are stored inline in this script.
+FRUIT_0="apple"
+FRUIT_1="banana"
+FRUIT_2="cherry"
+# Extract the right parameter value based on our index.
+# This works on any Bourne-based shell.
+FRUIT=$(eval echo \$"FRUIT_$I")
+export FRUIT
+
+COLOR_0="green"
+COLOR_1="yellow"
+COLOR_2="red"
+
+COLOR=$(eval echo \$"COLOR_$I")
+export COLOR
+```
+
+Then it POSTs this script, encoded, inside a Secret (later, a configData),
+and attaches that volume to the PodSpec.
+
+Then it will edit the command line of the Pod to run this script before the rest of
+the command line.
+
+Then it appends a DownwardAPI volume to the pod spec to get the annotations into a file,
+and the Secret (later, configData) volume with the script in it.
+
+So, the Pod template that kubectl creates (inside the job template) looks like this:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+  ...
+  template:
+    ...
+    spec:
+      containers:
+      - name: c
+        image: gcr.io/google_containers/busybox
+        command:
+        - 'sh'
+        - '-c'
+        - '/etc/job-params/job-params.sh; echo "this is the rest of the command"'
+        volumeMounts:
+        - name: annotations
+          mountPath: /etc
+        - name: script
+          mountPath: /etc/job-params
+      volumes:
+      - name: annotations
+        downwardAPI:
+          items:
+          - path: "annotations"
+            fieldRef:
+              fieldPath: metadata.annotations
+      - name: script
+        secret:
+          secretName: jobparams-abc123
+```
+
+##### Alternatives
+
+Kubectl could append a `valueFrom` line like this to
+get the index into the environment:
+
+```yaml
+apiVersion: extensions/v1beta1
+kind: Job
+metadata:
+  ...
+spec:
+  ...
+  template:
+    ...
+    spec:
+      containers:
+      - name: foo
+        ...
+        env:
+        # following block added:
+        - name: I
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+However, in order to inject other env vars from the parameter lists,
+kubectl still needs to edit the command line.
+
+Parameter lists could be passed via a configData volume instead of a secret.
+Kubectl can be changed to work that way once the configData implementation is
+complete.
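As an aside on the generated script itself: its parameter lookup relies on plain POSIX-sh variable indirection via `eval`, which works even in shells without bash arrays (such as busybox's `/bin/sh`). A minimal standalone demonstration:

```shell
# eval forces a second round of expansion: after the first pass the command
# is `echo $FRUIT_1`, so eval yields the value of FRUIT_1.
I=1
FRUIT_0="apple"; FRUIT_1="banana"; FRUIT_2="cherry"
FRUIT=$(eval echo \$"FRUIT_$I")
echo "$FRUIT"
```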
+
+Parameter lists could be passed inside an EnvVar. This would have length
+limitations and would pollute the output of `kubectl describe pods` and `kubectl
+get pods -o json`.
+
+Parameter lists could be passed inside an annotation. This would have length
+limitations and would pollute the output of `kubectl describe pods` and `kubectl
+get pods -o json`. Also, currently annotations can only be extracted into a
+single file. Complex logic is then needed to filter out exactly the desired
+annotation data.
+
+Bash array variables could simplify extraction of a particular parameter from a
+list of parameters. However, some popular base images do not include
+`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation
+that does not support array syntax.
+
+Kubelet does support [expanding variables without a
+shell](http://kubernetes.io/v1.1/docs/design/expansion.html). But it does not
+allow for recursive substitution, which is required to extract the correct
+parameter from a list based on the completion index of the pod. The syntax
+could be extended, but doing so seems complex and would be an unfamiliar syntax
+for users.
+
+Putting all the command line editing into a script and running that causes
+the least pollution to the original command line, and it allows
+for complex error handling.
+
+Kubectl could store the script in an [Inline Volume](
+https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
+is approved. That would remove the need to manage the lifetime of the
+configData/secret, and prevent the case where someone changes the
+configData mid-job, and breaks things in a hard-to-debug way.
+
+
+
+## Interactions with other features
+
+#### Supporting Work Queue Jobs too
+
+For Work Queue Jobs, completions has no meaning; parallelism should be allowed to be greater than it, and pods have no identity. So, the job controller should not create a scoreboard in the JobStatus, just a count.
Therefore, we need to add one of the following to JobSpec:
+
+- allow unset `.spec.completions` to indicate no scoreboard, and no index for tasks (identical tasks)
+- allow `.spec.completions=-1` to indicate the same.
+- add `.spec.indexed` to the job to indicate the need for a scoreboard.
+
+#### Interaction with vertical autoscaling
+
+Since pods of the same job will not be created with different resources,
+a vertical autoscaler will need to:
+
+- suggest index-specific initial resources at admission time, if it has such
+  suggestions; it will need to understand indexes.
+- mutate resource requests on already created pods based on usage trends or previous container failures
+- modify the job template, affecting all indexes.
+
+#### Comparison to PetSets
+
+
+The *index substitution-only* option corresponds roughly to PetSet Proposal 1b.
+The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more restrictive and thus less verbose.
+
+It would be easier for users if Indexed Job and PetSet were similar where possible.
+However, PetSet differs in several key respects:
+
+- PetSet is for ones to tens of instances. Indexed job should work with tens of
+  thousands of instances.
+- When you have few instances, you may want to give them pet names. When you have many
+  instances, integer indexes make more sense.
+- When you have thousands of instances, storing the work-list in the JobSpec
+  is verbose. For PetSet, this is less of a problem.
+- PetSets (apparently) need to differ in more fields than indexed Jobs.
+
+In short, Indexed Job differs from PetSet in that PetSet uses names rather than indexes,
+and is intended to support ones to tens of things.
+
+
+
+
+
+
+
-- 
cgit v1.2.3


From bed31a43a9e5c33e430ca48221cbb01a344344e6 Mon Sep 17 00:00:00 2001
From: Rudi Chiarito
Date: Fri, 8 Jan 2016 19:02:05 -0500
Subject: ECR credential provider

---
 aws_under_the_hood.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index a551c07c..019b07d6 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -171,7 +171,11 @@ The nodes do not need a lot of access to the AWS APIs. They need to download a
 distribution file, and then are responsible for attaching and detaching EBS
 volumes from itself.
 
-The node policy is relatively minimal. The master policy is probably overly
+The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
+authorization tokens, refresh them every 12 hours if needed, and fetch Docker
+images from it, as long as the appropriate permissions are enabled. Those in
+[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly),
+without write access, should suffice. The master policy is probably overly
 permissive. The security conscious may want to lock-down the IAM policies
 further ([#11936](http://issues.k8s.io/11936)).
 
@@ -180,7 +184,7 @@ are correctly configured ([#14226](http://issues.k8s.io/14226)).
 
 ### Tagging
 
-All AWS resources are tagged with a tag named "KuberentesCluster", with a value
+All AWS resources are tagged with a tag named "KubernetesCluster", with a value
 that is the unique cluster-id. This tag is used to identify a particular
 'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
Resources are considered to belong to the same cluster if and only if they have -- cgit v1.2.3 From 04ec5d64a1de8e5dce8f61edc8a9c321d88d6ea5 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Fri, 4 Dec 2015 23:59:50 -0800 Subject: Inter-pod topological affinity/anti-affinity design doc. --- podaffinity.md | 615 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 615 insertions(+) create mode 100644 podaffinity.md diff --git a/podaffinity.md b/podaffinity.md new file mode 100644 index 00000000..108697c6 --- /dev/null +++ b/podaffinity.md @@ -0,0 +1,615 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +# Inter-pod topological affinity and anti-affinity + +## Introduction + +NOTE: It is useful to read about [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) first. + +This document describes a proposal for specifying and implementing inter-pod topological affinity and +anti-affinity. By that we mean: rules that specify that certain pods should be placed +in the same topological domain (e.g. same node, same rack, same zone, same +power domain, etc.) as some other pods, or, conversely, should *not* be placed in the +same topological domain as some other pods. + +Here are a few example rules; we explain how to express them using the API described +in this doc later, in the section "Examples." +* Affinity + * Co-locate the pods from a particular service or Job in the same availability zone, + without specifying which zone that should be. + * Co-locate the pods from service S1 with pods from service S2 because S1 uses S2 + and thus it is useful to minimize the network latency between them. Co-location + might mean same nodes and/or same availability zone. +* Anti-affinity + * Spread the pods of a service across nodes and/or availability zones, + e.g. to reduce correlated failures + * Give a pod "exclusive" access to a node to guarantee resource isolation -- it must never share the node with other pods + * Don't schedule the pods of a particular service on the same nodes as pods of + another service that are known to interfere with the performance of the pods of the first service. + +For both affinity and anti-affinity, there are three variants. Two variants have the +property of requiring the affinity/anti-affinity to be satisfied for the pod to be allowed +to schedule onto a node; the difference between them is that if the condition ceases to +be met later on at runtime, for one of them the system will try to eventually evict the pod, +while for the other the system may not try to do so. 
The third variant
+simply provides scheduling-time *hints* that the scheduler will try
+to satisfy but may not be able to. These three variants are directly analogous to the three
+variants of [node affinity](https://github.com/kubernetes/kubernetes/pull/18261).
+
+Note that this proposal is only about *inter-pod* topological affinity and anti-affinity.
+There are other forms of topological affinity and anti-affinity. For example,
+you can use [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) to require (prefer)
+that a set of pods all be scheduled in some specific zone Z. Node affinity is not
+capable of expressing inter-pod dependencies, and conversely the API
+we describe in this document is not capable of expressing node affinity rules.
+For simplicity, we will use the terms "affinity" and "anti-affinity" to mean
+"inter-pod topological affinity" and "inter-pod topological anti-affinity," respectively,
+in the remainder of this document.
+
+## API
+
+We will add one field to `PodSpec`
+
+```go
+Affinity *Affinity `json:"affinity,omitempty"`
+```
+
+The `Affinity` type is defined as follows
+
+```go
+type Affinity struct {
+  PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
+  PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
+}
+
+type PodAffinity struct {
+  // If the affinity requirements specified by this field are not met at
+  // scheduling time, the pod will not be scheduled onto the node.
+  // If the affinity requirements specified by this field cease to be met
+  // at some point during pod execution (e.g. due to a pod label update), the
+  // system will try to eventually evict the pod from its node.
+  // When there are multiple elements, the lists of nodes corresponding to each
+  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
+ RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system may or may not try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. + PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` +} + +type PodAntiAffinity struct { + // If the anti-affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the anti-affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system will try to eventually evict the pod from its node. 
+ // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the anti-affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the anti-affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system may or may not try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the anti-affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling anti-affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. 
+  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
+}
+
+type WeightedPodAffinityTerm struct {
+  // weight is in the range 1-100
+  Weight int `json:"weight"`
+  PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"`
+}
+
+type PodAffinityTerm struct {
+  LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
+  // namespaces specifies which namespaces the LabelSelector applies to (matches against);
+  // nil list means "this pod's namespace," empty list means "all namespaces"
+  // The json tag here is not "omitempty" since we need to distinguish nil and empty.
+  // See https://golang.org/pkg/encoding/json/#Marshal for more details.
+  Namespaces []api.Namespace `json:"namespaces"`
+  // empty topology key is interpreted by the scheduler as "all topologies"
+  TopologyKey string `json:"topologyKey,omitempty"`
+}
+```
+
+Note that the `Namespaces` field is necessary because normal `LabelSelector` is scoped
+to the pod's namespace, but we need to be able to match against all pods globally.
+
+To explain how this API works, let's say that the `PodSpec` of a pod `P` has an `Affinity`
+that is configured as follows (note that we've omitted and collapsed some fields for
+simplicity, but this should sufficiently convey the intent of the design):
+
+```go
+PodAffinity {
+  RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
+  PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
+}
+PodAntiAffinity {
+  RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
+  PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
+}
+```
+
+Then when scheduling pod P, the scheduler
+* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key "node" and value specifying their node name.)
+* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key "zone" and value specifying their zone.)
+* Cannot schedule P onto any racks that are running pods that satisfy `P3`. (Assumes all nodes have a label with key "rack" and value specifying their rack name.)
+* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key "power" and value specifying their power domain.)
+
+When `RequiredDuringScheduling` has multiple elements, the requirements are ANDed.
+For `PreferredDuringScheduling` the weights are added for the terms that are satisfied for each node, and
+the node(s) with the highest weight(s) are the most preferred.
+
+In reality there are two variants of `RequiredDuringScheduling`: one suffixed with
+`RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. For the
+first variant, if the affinity/anti-affinity ceases to be met at some point during
+pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod
+from its node. In the second variant, the system may or may not try to eventually
+evict the pod from its node.
+
+## A comment on symmetry
+
+One thing that makes affinity and anti-affinity tricky is symmetry.
+
+Imagine a cluster that is running pods from two services, S1 and S2. Imagine that the pods of S1 have a RequiredDuringScheduling anti-affinity rule
+"do not run me on nodes that are running pods from S2." It is not sufficient just to check that there are no S2 pods on a node when
+you are scheduling a S1 pod. You also need to ensure that there are no S1 pods on a node when you are scheduling a S2 pod,
+*even though the S2 pod does not have any anti-affinity rules*. Otherwise if an S1 pod schedules before an S2 pod, the S1
+pod's RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving S2 pod.
More specifically, if S1 has the aforementioned +RequiredDuringScheduling anti-affinity rule, then +* if a node is empty, you can schedule S1 or S2 onto the node +* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node + +Note that while RequiredDuringScheduling anti-affinity is symmetric, +RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running +pods from S2," it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node. More +specifically, if S1 has the aforementioned RequiredDuringScheduling affinity rule, then +* if a node is empty, you can schedule S2 onto the node +* if a node is empty, you cannot schedule S1 onto the node +* if a node is running S2, you can schedule S1 onto the node +* if a node is running S1+S2 and S1 terminates, S2 continues running +* if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually) + +However, although RequiredDuringScheduling affinity is not symmetric, there is an implicit PreferredDuringScheduling affinity rule corresponding to every +RequiredDuringScheduling affinity rule: if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running +pods from S2" then it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node, +but it would be better if there are. + +PreferredDuringScheduling is symmetric. +If the pods of S1 had a PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that are running pods from S2" +then we would prefer to keep a S1 pod that we are scheduling off of nodes that are running S2 pods, and also +to keep a S2 pod that we are scheduling off of nodes that are running S1 pods. 
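
The symmetric feasibility check for RequiredDuringScheduling anti-affinity can be sketched as a check in *both* directions. This is a minimal model, not the scheduler's actual code; the `Pod` type, the `matches` helper, and the restriction to a single node-level rule are assumptions for illustration:

```go
package main

import "fmt"

// Simplified model: a pod has labels and at most one anti-affinity
// "selector" (a label key/value set it refuses to share a node with).
type Pod struct {
	Labels       map[string]string
	AntiAffinity map[string]string // nil means no anti-affinity rule
}

// matches reports whether selector sel matches pod p's labels.
// A nil selector matches nothing (the pod has no rule).
func matches(sel map[string]string, p Pod) bool {
	if sel == nil {
		return false
	}
	for k, v := range sel {
		if p.Labels[k] != v {
			return false
		}
	}
	return true
}

// feasible applies anti-affinity symmetrically: the incoming pod's rule
// must not match any resident pod, AND no resident pod's rule may match
// the incoming pod -- even if the incoming pod has no rule of its own.
func feasible(incoming Pod, residents []Pod) bool {
	for _, r := range residents {
		if matches(incoming.AntiAffinity, r) { // forward direction
			return false
		}
		if matches(r.AntiAffinity, incoming) { // symmetric direction
			return false
		}
	}
	return true
}

func main() {
	s1 := Pod{Labels: map[string]string{"service": "S1"},
		AntiAffinity: map[string]string{"service": "S2"}}
	s2 := Pod{Labels: map[string]string{"service": "S2"}} // no rule of its own

	fmt.Println(feasible(s2, []Pod{s1})) // false: S1's rule blocks S2 symmetrically
	fmt.Println(feasible(s1, []Pod{s2})) // false: forward direction
	fmt.Println(feasible(s2, nil))       // true: empty node
}
```

A real implementation evaluates full label selectors per topology domain (as in the Algorithm section below); the two-directional check is the essential point here.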
Likewise if the pods of
+S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that are running pods from S2" then we would prefer
+to place a S1 pod that we are scheduling onto a node that is running a S2 pod, and also to place
+a S2 pod that we are scheduling onto a node that is running a S1 pod.
+
+## Examples
+
+Here are some examples of how you would express various affinity and anti-affinity rules using the API we described.
+
+### Affinity
+
+In the examples below, the word "put" is intentionally ambiguous; the rules are the same
+whether "put" means "must put" (RequiredDuringScheduling) or "try to put"
+(PreferredDuringScheduling)--all that changes is which field the rule goes into.
+Also, we only discuss scheduling-time, and ignore the execution-time.
+Finally, some of the examples
+use "zone" and some use "node," just to make the examples more interesting; any of the examples
+with "zone" will also work for "node" if you change the `TopologyKey`, and vice-versa.
+
+* **Put the pod in zone Z**:
+Tricked you! It is not possible to express this using the API described here. For this you should use node affinity.
+
+* **Put the pod in a zone that is running at least one pod from service S**:
+`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
+
+* **Put the pod on a node that is already running a pod that requires a license for software package P**:
+Assuming pods that require a license for software package P have a label `{key=license, value=P}`:
+`{LabelSelector: "license" In "P", TopologyKey: "node"}`
+
+* **Put this pod in the same zone as other pods from its same service**:
+Assuming pods from this pod's service have some label `{key=service, value=S}`:
+`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
+
+This last example illustrates a small issue with this API when it is used
+with a scheduler that processes the pending queue one pod at a time, like the current
+Kubernetes scheduler.
The RequiredDuringScheduling rule +`{LabelSelector: "service" In "S", TopologyKey: "zone"}` +only "works" once one pod from service S has been scheduled. But if all pods in service +S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule +will block the first +pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from +the same service. And of course that means none of the pods of the service will be able +to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not +PreferredDuringScheduling affinity or any variant of anti-affinity. +There are at least three ways to solve this problem +* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement +matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement. +This approach has a corner case when running parallel schedulers that are allowed to +schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to +schedule pods from the set +at the same time and think there are no other pods from that set scheduled yet (e.g. they are +trying to schedule the first two pods from the set), but by the time +the second binding is committed, the first one has already been committed, leaving you with +two pods running that do not respect their RequiredDuringScheduling affinity. There is no +simple way to detect this "conflict" at scheduling time given the current system implementation. +* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those +pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate. +* **very long-term/speculative**: controllers could present the scheduler with a group of pods from +the same PodTemplate as a single unit. 
This is similar to the first approach described above but
+avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow
+the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845)
+since it could receive an entire gang simultaneously as a single unit.
+
+### Anti-affinity
+
+As with the affinity examples, the examples here can be RequiredDuringScheduling or
+PreferredDuringScheduling anti-affinity, i.e.
+"don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears
+in `RequiredDuringScheduling` or `PreferredDuringScheduling`.
+
+* **Spread the pods of this service S across nodes and zones**:
+`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
+(note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second
+clause will force the scheduler to not put more than one pod from S in the same zone, and thus by
+definition it will not put more than one pod from S on the same node, assuming each node is in one zone.
+This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in
+[Ubernetes](../../docs/proposals/federation.md) clusters.)
+
+* **Don't co-locate pods of this service with pods from service "evilService"**:
+`{LabelSelector: <selector that matches evilService's pods>, TopologyKey: "node"}`
+
+* **Don't co-locate pods of this service with any other pods including pods of this service**:
+`{LabelSelector: empty, TopologyKey: "node"}`
+
+* **Don't co-locate pods of this service with any other pods except other pods of this service**:
+Assuming pods from the service have some label `{key=service, value=S}`:
+`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
+Note that this works because `"service" NotIn "S"` matches pods with no key "service"
+as well as pods with key "service" and a corresponding value that is not "S."
+
+## Algorithm
+
+An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows.
+There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's
+semantics are implementable.
+
+Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler
+predicates for scheduling P onto N. Note that this algorithm is only concerned about scheduling
+time, thus it makes no distinction between RequiredDuringExecution and IgnoredDuringExecution.
+
+To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand
+for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity" as shorthand for
+"PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
+
+** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account;
+currently it assumes all terms have weight 1.
**
+
+```
+Z = the pod you are scheduling
+{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z
+// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
+X = {Z's PodSpec's HardPodAffinity}
+foreach element H of {X}
+  P = {all pods in the system that match H.LabelSelector}
+  M map[string]int // topology value -> number of pods running on nodes with that topology value
+  foreach pod Q of {P}
+    L = {labels of the node on which Q is running, represented as a map from label key to label value}
+    M[L[H.TopologyKey]]++
+  {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
+// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
+// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
+X = {Z's PodSpec's HardPodAntiAffinity}
+foreach element H of {X}
+  P = {all pods in the system that match H.LabelSelector}
+  M map[string]int // topology value -> number of pods running on nodes with that topology value
+  foreach pod Q of {P}
+    L = {labels of the node on which Q is running, represented as a map from label key to label value}
+    M[L[H.TopologyKey]]++
+  {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
+// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
+foreach node A of {N}
+  foreach pod B that is bound to A
+    if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
+// At this point, all nodes in {N} are feasible for Z.
+// Step 3a: Soft version of Step 1a +Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node +Initialize the keys of Y to all of the nodes in {N}, and the values to 0 +X = {Z's PodSpec's SoftPodAffinity} +Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" +// Step 3b: Soft version of Step 1b +X = {Z's PodSpec's SoftPodAntiAffinity} +Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" +// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft +foreach node A of {N} + foreach pod B that is bound to A + increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A +// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is +// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with +// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better). +``` + +## Special considerations for RequiredDuringScheduling anti-affinity + +In this section we discuss three issues with RequiredDuringScheduling anti-affinity: +Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill. +See issue #18265 for additional discussion of these topics. + +### Denial of Service + +Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally +or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity. + +The most notable danger is the ability for a +pod that arrives first to some topology domain, to block all other pods from +scheduling there by stating a conflict with all other pods. 
+The standard approach +to preventing resource hogging is quota, but simple resource quota cannot prevent +this scenario because the pod may request very little resources. Addressing this +using quota requires a quota scheme that charges based on "opportunity cost" rather +than based simply on requested resources. For example, when handling a pod that expresses +RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` +(i.e. exclusive access to a node), it could charge for the resources of the +average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling +anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the +entire cluster. If a cluster administrator wants to overcommit quota, for +example to allow more than N pods across all users to request exclusive node +access in a cluster with N nodes, then a priority/preemption scheme should be added +so that the most important pods run when resource demand exceeds supply. + +Our initial implementation will use quota that charges based on opportunity cost. + +A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade +the scheduling quality of another pod, but not completely block it from scheduling. +For example, a set of pods S1 could use node affinity to request to schedule onto a set +of nodes that some other set of pods S2 prefers to schedule onto. If the pods in S1 +have RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for S2, +then due to the symmetry property of anti-affinity, they can prevent the pods in S2 from +scheduling onto their preferred nodes if they arrive first (for sure in the RequiredDuringScheduling case, and +with some probability that depends on the weighting scheme for the PreferredDuringScheduling case). 
+A very sophisticated priority and/or quota scheme could mitigate this, or alternatively +we could eliminate the symmetry property of the implementation of PreferredDuringScheduling anti-affinity. +Then only RequiredDuringScheduling anti-affinity could affect scheduling quality +of another pod, and as we described in the previous paragraph, such pods could be charged +quota for the full topology domain, thereby reducing the potential for abuse. + +We won't try to address this issue in our initial implementation; we can consider one +of the approaches mentioned above if it turns out to be a problem in practice. + +### Co-existing with daemons + +A cluster administrator +may wish to allow pods that express anti-affinity against all pods, to nonetheless co-exist with +system daemon pods, such as those run by DaemonSet. In principle, we would like the specification +for RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or more +other pods (see #18263 for a more detailed explanation of the toleration concept). There are +at least two ways to accomplish this: + +* Scheduler special-cases the namespace(s) where daemons live, in the + sense that it ignores pods in those namespaces when it is + determining feasibility for pods with anti-affinity. The name(s) of + the special namespace(s) could be a scheduler configuration + parameter, and default to `kube-system`. We could allow + multiple namespaces to be specified if we want cluster admins to be + able to give their own daemons this special power (they would add + their namespace to the list in the scheduler configuration). And of + course this would be symmetric, so daemons could schedule onto a node + that is already running a pod with anti-affinity. 
+
+* We could add an explicit "toleration" concept/field to allow the
+  user to specify namespaces that are excluded when they use
+  RequiredDuringScheduling anti-affinity, and use an admission
+  controller/defaulter to ensure these namespaces are always listed.
+
+Our initial implementation will use the first approach.
+
+### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)
+
+Because anti-affinity is symmetric, in the case of RequiredDuringSchedulingRequiredDuringExecution
+anti-affinity, the system must determine which pod(s) to kill when a pod's labels are updated in
+such a way as to cause them to conflict with one or more other pods' RequiredDuringSchedulingRequiredDuringExecution
+anti-affinity rules. In the absence of a priority/preemption scheme, our rule will be that the pod
+with the anti-affinity rule that becomes violated should be the one killed.
+A pod should only specify constraints that apply to
+namespaces it trusts to not do malicious things. Once we have priority/preemption, we can
+change the rule to say that the lowest-priority pod(s) are killed until all
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
+
+## Special considerations for RequiredDuringScheduling affinity
+
+The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its symmetry:
+if a pod P requests anti-affinity, P cannot schedule onto a node with conflicting pods,
+and pods that conflict with P cannot schedule onto the node once P has been scheduled there.
+The design we have described says that the symmetry property for RequiredDuringScheduling *affinity*
+is weaker: if a pod P says it can only schedule onto nodes running pod Q, this
+does not mean Q can only run on a node that is running P, but the scheduler will try
+to schedule Q onto a node that is running P (i.e. treats the reverse direction as
+preferred).
This raises the same scheduling quality concern as we mentioned at the
+end of the Denial of Service section above, and can be addressed in similar ways.
+
+The nature of affinity (as opposed to anti-affinity) means that there is no issue of
+determining which pod(s) to kill
+when a pod's labels change: it is obviously the pod with the affinity rule that becomes
+violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule;
+it can only "fix" violation of an anti-affinity rule.) However, affinity does have a
+different question related to killing: how long should the system wait before declaring
+that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime?
+For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed
+so that it can be updated to a new binary version, should that trigger killing of P? More
+generally, how long should the system wait before declaring that P's affinity is
+violated? (Of course affinity is expressed in terms of label selectors, not for a specific
+pod, but the scenario is easier to describe using a concrete pod.) This is closely related to
+the concept of forgiveness (see issue #1574). In theory we could make this time duration be
+configurable by the user on a per-pod basis, but for the first version of this feature we will
+make it a configurable property of whichever component does the killing and that applies across
+all pods using the feature. Making it configurable by the user would require a nontrivial change
+to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution
+affinity).
+
+## Implementation plan
+
+1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
+2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
+affinity and anti-affinity into account.
Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
+3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account
+4. Implement a quota mechanism that charges for the entire topology domain when `RequiredDuringScheduling` anti-affinity is used. Later
+this should be refined to only apply when it is used to request exclusive access, not when it is used to express conflict with specific pods.
+5. Implement the recommended solution to the "co-existing with daemons" issue
+6. At this point, the feature can be deployed.
+7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
+the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take
+`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism,
+the "co-existing with daemons" solution).
+8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision
+9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
+`RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet then only for "node" `TopologyKey`;
+if controller then potentially for all `TopologyKey`s.
+(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+Do so in a way that addresses the "determining which pod(s) to kill" issue.
+
+We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
+domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
+
+## Backward compatibility
+
+Old versions of the scheduler will ignore `Affinity`.
+
+Users should not start using `Affinity` until the full implementation has
+been in Kubelet and the master for enough binary versions that we feel
+comfortable that we will not need to roll back either Kubelet or
+master to a version that does not support them. Longer-term we will
+use a programmatic approach to enforcing this (#4855).
+
+## Extensibility
+
+The design described here is the result of careful analysis of use cases, a decade of experience
+with Borg at Google, and a review of similar features in other open-source container orchestration
+systems. We believe that it properly balances the goal of expressiveness against the goals of
+simplicity and efficiency of implementation. However, we recognize that
+use cases may arise in the future that cannot be expressed using the syntax described here.
+Although we are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
+users to get a consistent experience, etc.), the regular Kubernetes
+annotation mechanism can be used to add or replace affinity rules. The way this would work is
+1. Define one or more annotations to describe the new affinity rule(s)
+1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
+If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
+from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
+annotation(s).
+1. Scheduler takes the annotation(s) into account when scheduling.
+
+If some particular new syntax becomes popular, we would consider upstreaming it by integrating
+it into the standard `Affinity`.
+ +## Future work and non-work + +One can imagine that in the anti-affinity RequiredDuringScheduling case +one might want to associate a number with the rule, +for example "do not allow this pod to share a rack with more than three other +pods (in total, or from the same service as the pod)." We could allow this to be +specified by adding an integer `Limit` to `PodAffinityTerm` just for the +`RequiredDuringScheduling` case. However, this flexibility complicates the +system and we do not intend to implement it. + +It is likely that the specification and implementation of pod anti-affinity +can be unified with [taints and tolerations](https://github.com/kubernetes/kubernetes/pull/18263), +and likewise that the specification and implementation of pod affinity +can be unified with [node affinity](https://github.com/kubernetes/kubernetes/pull/18261). +The basic idea is that pod labels would be "inherited" by the node, and pods +would only be able to specify affinity and anti-affinity for a node's labels. +Our main motivation for not unifying taints and tolerations with +pod anti-affinity is that we foresee taints and tolerations as being a concept that +only cluster administrators need to understand (and indeed in some setups taints and +tolerations wouldn't even be directly manipulated by a cluster administrator, +instead they would only be set by an admission controller that is implementing the administrator's +high-level policy about different classes of special machines and the users who belong to the groups +allowed to access them). Moreover, the concept of nodes "inheriting" labels +from pods seems complicated; it seems conceptually simpler to separate rules involving +relatively static properties of nodes from rules involving which other pods are running +on the same node or larger topology domain. + +Data/storage affinity is related to pod affinity, and is likely to draw on some of the +ideas we have used for pod affinity. 
Today, data/storage affinity is expressed using +node affinity, on the assumption that the pod knows which node(s) store(s) the data +it wants. But a more flexible approach would allow the pod to name the data rather than +the node. + +## Related issues + +The review for this proposal is in #18265. + +The topic of affinity/anti-affinity has generated a lot of discussion. The main issue +is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341, #1965, and #2906 +all have additional discussion and use cases. + +As the examples in this document have demonstrated, topological affinity is very useful +in clusters that are spread across availability zones, e.g. to co-locate pods of a service +in the same zone to avoid a wide-area network hop, or to spread pods across zones for +failure tolerance. #17059, #13056, #13063, and #4235 are relevant. + +Issue #15675 describes connection affinity, which is vaguely related. + +This proposal is to satisfy #14816. + +## Related work + +** TODO: cite references ** + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]() + -- cgit v1.2.3 From 682d41b9532908e03e6fb5a2114a302804f9a999 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Sat, 5 Dec 2015 13:11:27 -0800 Subject: Design doc for node affinity, including NodeSelector. --- nodeaffinity.md | 263 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 263 insertions(+) create mode 100644 nodeaffinity.md diff --git a/nodeaffinity.md b/nodeaffinity.md new file mode 100644 index 00000000..5deda322 --- /dev/null +++ b/nodeaffinity.md @@ -0,0 +1,263 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+--
+
+
+
+
+
+# Node affinity and NodeSelector
+
+## Introduction
+
+This document proposes a new label selector representation, called `NodeSelector`,
+that is similar in many ways to `LabelSelector`, but is a bit more flexible and is
+intended to be used only for selecting nodes.
+
+In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler
+currently uses as part of restricting the set of nodes onto which a pod is
+eligible to schedule, with a field of type `Affinity` that contains one or
+more affinity specifications. In this document we discuss `NodeAffinity`, which
+contains one or more of the following:
+* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
+represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
+the current `map[string]string` but still serves the purpose of restricting
+the set of nodes onto which the pod can schedule. In addition, unlike the behavior
+of the current `map[string]string`, when it becomes violated the system will
+try to eventually evict the pod from its node.
+* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is identical
+to `RequiredDuringSchedulingRequiredDuringExecution` except that the system
+may or may not try to eventually evict the pod from its node.
+* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that specifies which nodes are
+preferred for scheduling among those that meet all scheduling requirements.
+
+(In practice, as discussed later, we will actually *add* the `Affinity` field
+rather than replacing `map[string]string`, due to backward compatibility requirements.)
+
+The affinity specifications described above allow a pod to request various properties
+that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a
+multi-zone cluster, "run this pod on a node in zone Z."
+([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
+some of the properties that a node might publish as labels, which affinity expressions
+can match against.)
+They do *not* allow a pod to request to schedule
+(or not schedule) on a node based on what other pods are running on the node. That
+feature is called "inter-pod topological affinity/anti-affinity" and is described
+[here](https://github.com/kubernetes/kubernetes/pull/18265).
+
+## API
+
+### NodeSelector
+
+```go
+// A node selector represents the union of the results of one or more label queries
+// over a set of nodes; that is, it represents the OR of the selectors represented
+// by the nodeSelectorTerms.
+type NodeSelector struct {
+    // nodeSelectorTerms is a list of node selector terms. The terms are ORed.
+    NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
+}
+
+// An empty node selector term matches all objects. A null node selector term
+// matches no objects.
+type NodeSelectorTerm struct {
+    // matchExpressions is a list of node selector requirements. The requirements are ANDed.
+    MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
+}
+
+// A node selector requirement is a selector that contains values, a key, and an operator
+// that relates the key and values.
+type NodeSelectorRequirement struct {
+    // key is the label key that the selector applies to.
+    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+    // operator represents a key's relationship to a set of values.
+    // Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
+    Operator NodeSelectorOperator `json:"operator"`
+    // values is an array of string values. If the operator is In or NotIn,
+    // the values array must be non-empty. If the operator is Exists or DoesNotExist,
+    // the values array must be empty. If the operator is Gt or Lt, the values
+    // array must have a single element, which will be interpreted as an integer.
+ // This array is replaced during a strategic merge patch. + Values []string `json:"values,omitempty"` +} + +// A node selector operator is the set of operators that can be used in +// a node selector requirement. +type NodeSelectorOperator string + +const ( + NodeSelectorOpIn NodeSelectorOperator = "In" + NodeSelectorOpNotIn NodeSelectorOperator = "NotIn" + NodeSelectorOpExists NodeSelectorOperator = "Exists" + NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist" + NodeSelectorOpGt NodeSelectorOperator = "Gt" + NodeSelectorOpLt NodeSelectorOperator = "Lt" +) +``` + +### NodeAffinity + +We will add one field to `PodSpec` + +```go +Affinity *Affinity `json:"affinity,omitempty"` +``` + +The `Affinity` type is defined as follows + +```go +type Affinity struct { + NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"` +} + +type NodeAffinity struct { + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a node label update), + // the system will try to eventually evict the pod from its node. + RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a node label update), + // the system may or may not try to eventually evict the pod from its node. 
+ RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. + PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` +} + +// An empty preferred scheduling term matches all objects with implicit weight 0 +// (i.e. it's a no-op). A null preferred scheduling term matches no objects. +type PreferredSchedulingTerm struct { + // weight is in the range 1-100 + Weight int `json:"weight"` + // matchExpressions is a list of node selector requirements. The requirements are ANDed. + MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"` +} +``` + +Unfortunately, the name of the existing `map[string]string` field in PodSpec is `NodeSelector` +and we can't change it since this name is part of the API. Hopefully this won't +cause too much confusion. 
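To make the matching semantics defined above concrete (terms are ORed, the requirements within a term are ANDed), here is a minimal sketch of how a scheduler predicate could evaluate a `NodeSelector` against a node's labels. The types below are trimmed-down local copies of the proposed API, not the real definitions, and the label keys in the example are hypothetical (see #9044 for what nodes may actually publish):

```go
package main

import (
	"fmt"
	"strconv"
)

// Trimmed-down local copies of the proposed types (sketch only; the real
// definitions are the ones given in the API section above).
type NodeSelectorOperator string

const (
	NodeSelectorOpIn     NodeSelectorOperator = "In"
	NodeSelectorOpNotIn  NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists NodeSelectorOperator = "Exists"
	NodeSelectorOpGt     NodeSelectorOperator = "Gt"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator NodeSelectorOperator
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

// matches implements the documented semantics: the terms are ORed.
func matches(sel NodeSelector, labels map[string]string) bool {
	for _, term := range sel.NodeSelectorTerms {
		if matchesTerm(term, labels) {
			return true
		}
	}
	return false
}

// matchesTerm ANDs the requirements within a single term.
func matchesTerm(term NodeSelectorTerm, labels map[string]string) bool {
	for _, req := range term.MatchExpressions {
		val, ok := labels[req.Key]
		switch req.Operator {
		case NodeSelectorOpIn:
			if !ok || !contains(req.Values, val) {
				return false
			}
		case NodeSelectorOpNotIn:
			if ok && contains(req.Values, val) {
				return false
			}
		case NodeSelectorOpExists:
			if !ok {
				return false
			}
		case NodeSelectorOpGt:
			// Gt interprets the single element of Values as an integer.
			n, err1 := strconv.Atoi(val)
			m, err2 := strconv.Atoi(req.Values[0])
			if !ok || err1 != nil || err2 != nil || n <= m {
				return false
			}
		}
	}
	return true
}

func contains(vals []string, s string) bool {
	for _, v := range vals {
		if v == s {
			return true
		}
	}
	return false
}

func main() {
	// "Run this pod on a node with an Intel or AMD CPU, in zone us-central1-a."
	// The label keys here are hypothetical, chosen only for illustration.
	sel := NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{{
		MatchExpressions: []NodeSelectorRequirement{
			{Key: "cpu-vendor", Operator: NodeSelectorOpIn, Values: []string{"intel", "amd"}},
			{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"us-central1-a"}},
		},
	}}}
	fmt.Println(matches(sel, map[string]string{"cpu-vendor": "intel", "zone": "us-central1-a"})) // true
	fmt.Println(matches(sel, map[string]string{"cpu-vendor": "arm", "zone": "us-central1-a"}))   // false
}
```

A real predicate would of course operate on the actual API types and honor all six operators; the sketch is only meant to pin down the OR-of-ANDs evaluation order.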
+
+## Examples
+
+** TODO: fill in this section **
+
+* Run this pod on a node with an Intel or AMD CPU
+
+* Run this pod on a node in availability zone Z
+
+
+## Backward compatibility
+
+When we add `Affinity` to PodSpec, we will deprecate, but not remove, the current field in PodSpec
+
+```go
+NodeSelector map[string]string `json:"nodeSelector,omitempty"`
+```
+
+Old versions of the scheduler will ignore the `Affinity` field.
+New versions of the scheduler will apply their scheduling predicates to both `Affinity` and `nodeSelector`,
+i.e. the pod can only schedule onto nodes that satisfy both sets of requirements. We will not
+attempt to convert between `Affinity` and `nodeSelector`.
+
+Old versions of non-scheduling clients will not know how to do anything semantically meaningful
+with `Affinity`, but we don't expect that this will cause a problem.
+
+See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
+for more discussion.
+
+Users should not start using `NodeAffinity` until the full implementation has been in Kubelet and the master
+for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet
+or master to a version that does not support them. Longer-term we will use a programmatic approach to
+enforcing this (#4855).
+
+## Implementation plan
+
+1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, `PreferredDuringSchedulingIgnoredDuringExecution`,
+and `RequiredDuringSchedulingIgnoredDuringExecution` types to the API
+2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` into account
+3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` into account
+4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be marked as deprecated
+5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API
+6. 
Modify the scheduler predicate from step 2 to also take `RequiredDuringSchedulingRequiredDuringExecution` into account
+7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission decision
+8. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
+`RequiredDuringSchedulingRequiredDuringExecution`
+(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+
+We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
+domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
+
+## Extensibility
+
+The design described here is the result of careful analysis of use cases, a decade of experience
+with Borg at Google, and a review of similar features in other open-source container orchestration
+systems. We believe that it properly balances the goal of expressiveness against the goals of
+simplicity and efficiency of implementation. However, we recognize that
+use cases may arise in the future that cannot be expressed using the syntax described here.
+Although we are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
+users to get a consistent experience, etc.), the regular Kubernetes
+annotation mechanism can be used to add or replace affinity rules. The way this would work is
+
+1. Define one or more annotations to describe the new affinity rule(s)
+1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
+If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
+from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
+annotation(s).
+1. Scheduler takes the annotation(s) into account when scheduling.
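As a sketch of how step 3 might look, the snippet below shows a custom scheduler (or scheduler extension) decoding extra rules carried in an annotation. The annotation key and the JSON payload shape are invented for illustration and are not part of this proposal; any concrete extension would define its own:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation key; not defined by this proposal.
const customAffinityAnnotation = "example.com/custom-node-affinity"

// customRule is a hypothetical payload shape for the annotation value.
type customRule struct {
	Key    string   `json:"key"`
	Values []string `json:"values"`
}

// rulesFromAnnotations extracts the extra rules that a scheduler honoring
// this hypothetical extension would consider alongside `Affinity`.
func rulesFromAnnotations(annotations map[string]string) ([]customRule, error) {
	payload, ok := annotations[customAffinityAnnotation]
	if !ok {
		return nil, nil // pod does not use the extension
	}
	var rules []customRule
	if err := json.Unmarshal([]byte(payload), &rules); err != nil {
		return nil, err
	}
	return rules, nil
}

func main() {
	ann := map[string]string{
		customAffinityAnnotation: `[{"key":"disktype","values":["ssd"]}]`,
	}
	rules, err := rulesFromAnnotations(ann)
	fmt.Println(len(rules), rules[0].Key, err) // 1 disktype <nil>
}
```

The scheduler would then fold the decoded rules into its predicate or priority functions alongside the structured `Affinity` field.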
+ +If some particular new syntax becomes popular, we would consider upstreaming it by integrating +it into the standard `Affinity`. + +## Future work + +Are there any other fields we should convert from `map[string]string` to `NodeSelector`? + +## Related issues + +The review for this proposal is in #18261. + +The main related issue is #341. Issue #367 is also related. Those issues reference other +related issues. + + + + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]() + -- cgit v1.2.3 From ac0204808b911ef7c19bf4d3e64928541e94701a Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Sat, 5 Dec 2015 16:06:17 -0800 Subject: Dedicated nodes, taints, and tolerations design doc. --- taint-toleration-dedicated.md | 301 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 301 insertions(+) create mode 100644 taint-toleration-dedicated.md diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md new file mode 100644 index 00000000..cca2ee44 --- /dev/null +++ b/taint-toleration-dedicated.md @@ -0,0 +1,301 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
+
+
+
+# Taints, Tolerations, and Dedicated Nodes
+
+## Introduction
+
+This document describes *taints* and *tolerations*, which constitute a generic mechanism for restricting
+the set of pods that can use a node. We also describe one concrete use case for the mechanism,
+namely to limit the set of users (or more generally, authorization domains)
+who can access a set of nodes (a feature we call
+*dedicated nodes*). There are many other uses--for example, a set of nodes with a particular
+piece of hardware could
+be reserved for pods that require that hardware, or a node could be marked as unschedulable
+when it is being drained before shutdown, or a node could trigger evictions when it experiences
+hardware or software problems or abnormal node configurations; see #17190 and #3885 for more discussion.
+
+## Taints, tolerations, and dedicated nodes
+
+A *taint* is a new type that is part of the `NodeSpec`; when present, it prevents pods
+from scheduling onto the node unless the pod *tolerates* the taint (tolerations are listed
+in the `PodSpec`). Note that there are actually multiple flavors of taints: taints that
+prevent scheduling on a node, taints that cause the scheduler to try to avoid scheduling
+on a node but do not prevent it, taints that prevent a pod from starting on Kubelet even
+if the pod's `NodeName` was written directly (i.e. pod did not go through the scheduler),
+and taints that evict already-running pods.
+[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
+has more background on these different scenarios. We will focus on the first
+kind of taint in this doc, since it is the kind required for the "dedicated nodes" use case.
+ +Implementing dedicated nodes using taints and tolerations is straightforward: in essence, a node that +is dedicated to group A gets taint `dedicated=A` and the pods belonging to group A get +toleration `dedicated=A`. (The exact syntax and semantics of taints and tolerations are +described later in this doc.) This keeps all pods except those belonging to group A off of the nodes. +This approach easily generalizes to pods that are allowed to +schedule into multiple dedicated node groups, and nodes that are a member of multiple +dedicated node groups. + +Note that because tolerations are at the granularity of pods, +the mechanism is very flexible -- any policy can be used to determine which tolerations +should be placed on a pod. So the "group A" mentioned above could be all pods from a +particular namespace or set of namespaces, or all pods with some other arbitrary characteristic +in common. We expect that any real-world usage of taints and tolerations will employ an admission controller +to apply the tolerations. For example, to give all pods from namespace A access to dedicated +node group A, an admission controller would add the corresponding toleration to all +pods from namespace A. Or to give all pods that require GPUs access to GPU nodes, an admission +controller would add the toleration for GPU taints to pods that request the GPU resource. + +Everything that can be expressed using taints and tolerations can be expressed using +[node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. in the example +in the previous paragraph, you could put a label `dedicated=A` on the set of dedicated nodes and +a node affinity `dedicated NotIn A` on all pods *not* belonging to group A. But it is +cumbersome to express exclusion policies using node affinity because every time you add +a new type of restricted node, all pods that aren't allowed to use those nodes need to start avoiding those +nodes using node affinity. 
This means the node affinity list can get quite long in clusters with lots of different
+groups of special nodes (lots of dedicated node groups, lots of different kinds of special hardware, etc.).
+Moreover, you need to also update any Pending pods when you add new types of special nodes.
+In contrast, with taints and tolerations,
+when you add a new type of special node, "regular" pods are unaffected, and you just need to add
+the necessary toleration to the pods you subsequently create that need to use the new type of special nodes.
+To put it another way, with taints and tolerations, only pods that use a set of special nodes
+need to know about those special nodes; with the node affinity approach, pods that have
+no interest in those special nodes need to know about all of the groups of special nodes.
+
+One final comment: in practice, it is often desirable to not
+only keep "regular" pods off of special nodes, but also to keep "special" pods off of
+regular nodes. An example in the dedicated nodes case is to not only keep regular
+users off of dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
+nodes. In this case, the "non-dedicated" nodes can be modeled as their own dedicated node group
+(for example, tainted as `dedicated=shared`), and pods that are not given access to any
+dedicated nodes ("regular" pods) would be given a toleration for `dedicated=shared`. (As mentioned earlier,
+we expect tolerations will be added by an admission controller.) In this case taints/tolerations
+are still better than node affinity because with taints/tolerations each pod only needs one special "marking",
+versus in the node affinity case where every time you add a dedicated node group (i.e. a new
+`dedicated=` value), you need to add a new node affinity rule to all pods (including pending pods)
+except the ones allowed to use that new dedicated node group.
+
+## API
+
+```go
+// The node this Taint is attached to has the effect "effect" on
+// any pod that does not tolerate the Taint.
+type Taint struct {
+    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+    Value string `json:"value,omitempty"`
+    Effect TaintEffect `json:"effect"`
+}
+
+type TaintEffect string
+
+const (
+    // Do not allow new pods to schedule unless they tolerate the taint,
+    // but allow all pods submitted to Kubelet without going through the scheduler
+    // to start, and allow all already-running pods to continue running.
+    // Enforced by the scheduler.
+    TaintEffectNoSchedule TaintEffect = "NoSchedule"
+    // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
+    // new pods onto the node, rather than prohibiting new pods from scheduling
+    // onto the node. Enforced by the scheduler.
+    TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
+    // Do not allow new pods to schedule unless they tolerate the taint,
+    // do not allow pods to start on Kubelet unless they tolerate the taint,
+    // but allow all already-running pods to continue running.
+    // Enforced by the scheduler and Kubelet.
+    TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
+    // Do not allow new pods to schedule unless they tolerate the taint,
+    // do not allow pods to start on Kubelet unless they tolerate the taint,
+    // and try to eventually evict any already-running pods that do not tolerate the taint.
+    // Enforced by the scheduler and Kubelet.
+    TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute"
+)
+
+// The pod this Toleration is attached to tolerates any taint that matches
+// the triple <key,value,effect> using the matching operator <operator>.
+type Toleration struct {
+    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+    // operator represents a key's relationship to the value.
+    // Valid operators are Exists and Equal. Defaults to Equal.
+    // Exists is equivalent to wildcard for value, so that a pod can
+    // tolerate all taints of a particular category.
+    Operator TolerationOperator `json:"operator"`
+    Value string `json:"value,omitempty"`
+    Effect TaintEffect `json:"effect"`
+    // TODO: For forgiveness (#1574), we'd eventually add at least a grace period
+    // here, and possibly an occurrence threshold and period.
+}
+
+// A toleration operator is the set of operators that can be used in a toleration.
+type TolerationOperator string
+
+const (
+    TolerationOpExists TolerationOperator = "Exists"
+    TolerationOpEqual TolerationOperator = "Equal"
+)
+
+```
+
+(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
+to understand the motivation for the various taint effects.)
+
+We will add
+
+```go
+    // Multiple tolerations with the same key are allowed.
+    Tolerations []Toleration `json:"tolerations,omitempty"`
+```
+
+to `PodSpec`. A pod must tolerate all of a node's taints (except taints
+of type TaintEffectPreferNoSchedule) in order to be able
+to schedule onto that node.
+
+We will add
+
+```go
+    // Multiple taints with the same key are not allowed.
+    Taints []Taint `json:"taints,omitempty"`
+```
+
+to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
+of the taints specified by various sources. For now, the only source is
+the `NodeSpec` itself, but in the future one could imagine a node inheriting
+taints from pods (if we were to allow taints to be attached to pods), from
+the node's startup configuration, etc. The scheduler should look at the `Taints`
+in `NodeStatus`, not in `NodeSpec`.
+
+Taints and tolerations are not scoped to namespace.
+
+## Implementation plan: taints, tolerations, and dedicated nodes
+
+Using taints and tolerations to implement dedicated nodes requires these steps:
+
+1. Add the API described above
+1. 
Add a scheduler predicate function that respects taints and tolerations (for TaintEffectNoSchedule)
+and a scheduler priority function that respects taints and tolerations (for TaintEffectPreferNoSchedule).
+1. Add to the Kubelet code to implement the "no admit" behavior of TaintEffectNoScheduleNoAdmit and
+TaintEffectNoScheduleNoAdmitNoExecute
+1. Implement code in Kubelet that evicts a pod that no longer satisfies
+TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the controllers
+instead, but since taints might be used to enforce security policies, it is better
+to do this in Kubelet because Kubelet can respond quickly and can guarantee the rules will
+be applied to all pods.
+Eviction may need to happen under a variety of circumstances: when a taint is added, when an existing
+taint is updated, when a toleration is removed from a pod, or when a toleration is modified on a pod.
+1. Add a new `kubectl` command that adds/removes taints to/from nodes,
+1. (This is the one step that is specific to dedicated nodes)
+Implement an admission controller that adds tolerations to pods that are supposed
+to be allowed to use dedicated nodes (for example, based on pod's namespace).
+
+In the future one can imagine a generic policy configuration that configures
+an admission controller to apply the appropriate tolerations to the desired class of pods and
+taints to Nodes upon node creation. It could be used not just for policies about dedicated nodes,
+but also other uses of taints and tolerations, e.g. nodes that are restricted
+due to their hardware configuration.
+
+The `kubectl` command to add and remove taints on nodes will be modeled after `kubectl label`.
+Example usages:
+
+```sh
+# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
+# If a taint with that key already exists, its value and effect are replaced as specified.
+$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute
+
+# Remove from node 'foo' the taint with key 'dedicated' if one exists.
+$ kubectl taint nodes foo dedicated-
+```
+
+## Example: implementing a dedicated nodes policy
+
+Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
+only to pods in a particular namespace `banana`. First the administrator does
+
+```sh
+$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute
+
+```
+
+(assuming they want to evict pods that are already running on those nodes if those
+pods don't already tolerate the new taint)
+
+Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
+a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
+
+In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
+to enumerate them by name.
+
+## Future work
+
+At present, the Kubernetes security model allows any user to add and remove any taints and tolerations.
+Obviously this makes it impossible to securely enforce
+rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints`
+field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`)
+and from mutating the `Tolerations` field of their pods. #17549 is relevant.
+
+Another security vulnerability arises if nodes are added to the cluster before receiving
+their taint. Thus we need to ensure that a new node does not become "Ready" until it has been
+configured with its taints. One way to do this is to have an admission controller that adds the taint whenever
+a Node object is created.
+
+A quota policy may want to treat nodes differently based on what taints, if any,
+they have.
For example, if a particular namespace is only allowed to access dedicated nodes,
+then it may be convenient to give the namespace unlimited quota. (To use finite quota,
+you'd have to size the namespace's quota to the sum of the sizes of the machines in the
+dedicated node group, and update it when nodes are added/removed to/from the group.)
+
+It's conceivable that taints and tolerations could be unified with [pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
+We have chosen not to do this for the reasons described in the "Future work" section of that doc.
+
+## Backward compatibility
+
+Old scheduler versions will ignore taints and tolerations. New scheduler versions
+will respect them.
+
+Users should not start using taints and tolerations until the full implementation
+has been in Kubelet and the master for enough binary versions that we
+feel comfortable that we will not need to roll back either Kubelet or
+master to a version that does not support them. Longer-term we will
+use a programmatic approach to enforcing this (#4855).
+
+## Related issues
+
+This proposal is based on the discussion in #17190. There are a number of other
+related issues, all of which are linked to from #17190.
+
+The relationship between taints and node drains is discussed in #1574.
+
+The concepts of taints and tolerations were originally developed as part of the
+Omega project at Google.
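As an illustration of the core scheduling rule stated in the API section (a pod must tolerate all of a node's taints, except taints of type TaintEffectPreferNoSchedule, in order to schedule onto that node), here is a minimal sketch using trimmed-down local copies of the proposed types, not the exact API definitions:

```go
package main

import "fmt"

// Trimmed-down local copies of the proposed types (sketch only).
type TaintEffect string

const (
	TaintEffectNoSchedule       TaintEffect = "NoSchedule"
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
)

type Taint struct {
	Key, Value string
	Effect     TaintEffect
}

type Toleration struct {
	Key      string
	Operator string // "Equal" or "Exists"
	Value    string
	Effect   TaintEffect
}

// tolerated reports whether the toleration matches the taint on the
// <key, value, effect> triple; Exists acts as a wildcard on value.
func tolerated(tol Toleration, t Taint) bool {
	if tol.Key != t.Key || tol.Effect != t.Effect {
		return false
	}
	return tol.Operator == "Exists" || tol.Value == t.Value
}

// podSchedulable implements the documented rule: a pod must tolerate all of
// a node's taints, except taints of type TaintEffectPreferNoSchedule (which
// only influence priority), to be feasible on the node.
func podSchedulable(tolerations []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect == TaintEffectPreferNoSchedule {
			continue // influences priority only, not feasibility
		}
		ok := false
		for _, tol := range tolerations {
			if tolerated(tol, t) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// The dedicated-nodes example from above: node tainted dedicated=banana.
	taints := []Taint{{Key: "dedicated", Value: "banana", Effect: TaintEffectNoSchedule}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "banana", Effect: TaintEffectNoSchedule}}
	fmt.Println(podSchedulable(tols, taints)) // true
	fmt.Println(podSchedulable(nil, taints))  // false
}
```

A real implementation would also handle the NoAdmit and NoExecute effects and the Equal default for an unset operator; the sketch only pins down the feasibility check.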
+ + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]() + -- cgit v1.2.3 From e8e4d9c7f9bcd2f8093b80aff31cde38e84065b1 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Thu, 21 Jan 2016 10:47:44 -0500 Subject: Update proposal for ConfigMap volume --- configmap.md | 323 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 323 insertions(+) create mode 100644 configmap.md diff --git a/configmap.md b/configmap.md new file mode 100644 index 00000000..aceb3342 --- /dev/null +++ b/configmap.md @@ -0,0 +1,323 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Generic Configuration Object + +## Abstract + +The `ConfigMap` API resource stores data used for the configuration of applications deployed on +Kubernetes. + +The main focus of this resource is to: + +* Provide dynamic distribution of configuration data to deployed applications. +* Encapsulate configuration information and simplify `Kubernetes` deployments. +* Create a flexible configuration model for `Kubernetes`. + +## Motivation + +A `Secret`-like API resource is needed to store configuration data that pods can consume. + +Goals of this design: + +1. Describe a `ConfigMap` API resource +2. Describe the semantics of consuming `ConfigMap` as environment variables +3. Describe the semantics of consuming `ConfigMap` as files in a volume + +## Use Cases + +1. As a user, I want to be able to consume configuration data as environment variables +2. As a user, I want to be able to consume configuration data as files in a volume +3. As a user, I want my view of configuration data in files to be eventually consistent with changes + to the data + +### Consuming `ConfigMap` as Environment Variables + +Many programs read their configuration from environment variables. `ConfigMap` should be possible +to consume in environment variables. The rough series of events for consuming `ConfigMap` this way +is: + +1. A `ConfigMap` object is created +2. A pod that consumes the configuration data via environment variables is created +3. The pod is scheduled onto a node +4. The kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and starts the container + processes with the appropriate data in environment variables + +### Consuming `ConfigMap` in Volumes + +Many programs read their configuration from configuration files. 
`ConfigMap` should be possible +to consume in a volume. The rough series of events for consuming `ConfigMap` this way +is: + +1. A `ConfigMap` object is created +2. A new pod using the `ConfigMap` via the volume plugin is created +3. The pod is scheduled onto a node +4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` method +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects + the appropriate data into the volume + +### Consuming `ConfigMap` Updates + +Any long-running system has configuration that is mutated over time. Changes made to configuration +data must be made visible to pods consuming data in volumes so that they can respond to those +changes. + +The `resourceVersion` of the `ConfigMap` object will be updated by the API server every time the +object is modified. After an update, modifications will be made visible to the consumer container: + +1. A `ConfigMap` object is created +2. A new pod using the `ConfigMap` via the volume plugin is created +3. The pod is scheduled onto a node +4. During the sync loop, the Kubelet creates an instance of the volume plugin and calls its + `Setup()` method +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects + the appropriate data into the volume +6. The `ConfigMap` referenced by the pod is updated +7. During the next iteration of the `syncLoop`, the Kubelet creates an instance of the volume plugin + and calls its `Setup()` method +8. The volume plugin projects the updated data into the volume atomically + +It is the consuming pod's responsibility to make use of the updated data once it is made visible. + +Because environment variables cannot be updated without restarting a container, configuration data +consumed in environment variables will not be updated. 
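The atomic projection in step 8 is typically implemented with a write-new-payload-then-swap-a-symlink scheme, the same technique the downward API volume plugin uses. A simplified sketch, assuming a POSIX filesystem (where `rename(2)` over an existing link is atomic) and hypothetical link names:

```python
import os
import tempfile

def project_atomically(volume_root, data):
    """Write every key into a fresh payload directory, then atomically
    retarget a '..data' symlink; a reader that resolves the link never
    observes a half-written update."""
    payload = tempfile.mkdtemp(prefix="..payload-", dir=volume_root)
    for filename, contents in data.items():
        with open(os.path.join(payload, filename), "w") as f:
            f.write(contents)
    tmp_link = os.path.join(volume_root, "..data_tmp")
    final_link = os.path.join(volume_root, "..data")
    os.symlink(os.path.basename(payload), tmp_link)
    os.rename(tmp_link, final_link)  # atomic swap on POSIX

volume = tempfile.mkdtemp()
project_atomically(volume, {"redis.conf": "port 6379\n"})
project_atomically(volume, {"redis.conf": "port 6380\n"})
print(open(os.path.join(volume, "..data", "redis.conf")).read())  # port 6380
```

Garbage-collecting superseded payload directories is elided here; a real plugin would also have to handle that.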
+ +### Advantages + +* Easy to consume in pods; consumer-agnostic +* Configuration data is persistent and versioned +* Consumers of configuration data in volumes can respond to changes in the data + +## Proposed Design + +### API Resource + +The `ConfigMap` resource will be added to the main API: + +```go +package api + +// ConfigMap holds configuration data for pods to consume. +type ConfigMap struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Data contains the configuration data. Each key must be a valid DNS_SUBDOMAIN or leading + // dot followed by valid DNS_SUBDOMAIN. + Data map[string]string `json:"data,omitempty"` +} + +type ConfigMapList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty"` + + Items []ConfigMap `json:"items"` +} +``` + +A `Registry` implementation for `ConfigMap` will be added to `pkg/registry/configmap`. + +### Environment Variables + +The `EnvVarSource` will be extended with a new selector for `ConfigMap`: + +```go +package api + +// EnvVarSource represents a source for the value of an EnvVar. +type EnvVarSource struct { + // other fields omitted + + // Specifies a ConfigMap key + ConfigMap *ConfigMapSelector `json:"configMap,omitempty"` +} + +// ConfigMapSelector selects a key of a ConfigMap. +type ConfigMapSelector struct { + // The name of the ConfigMap to select a key from. + ConfigMapName string `json:"configMapName"` + // The key of the ConfigMap to select. + Key string `json:"key"` +} +``` + +### Volume Source + +A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` object will be +added to the `VolumeSource` struct in the API: + +```go +package api + +type VolumeSource struct { + // other fields omitted + ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"` +} + +// Represents a volume that holds configuration data. +type ConfigMapVolumeSource struct { + LocalObjectReference `json:",inline"` + // A list of keys to project into the volume. 
+ // If unspecified, each key-value pair in the Data field of the
+ // referenced ConfigMap will be projected into the volume as a file whose name
+ // is the key and content is the value.
+ // If specified, the listed keys will be projected into the specified paths, and
+ // unlisted keys will not be present.
+ Items []KeyToPath `json:"items,omitempty"`
+}
+
+// Represents a mapping of a key to a relative path.
+type KeyToPath struct {
+ // The name of the key to select
+ Key string `json:"key"`
+
+ // The relative path name of the file to be created.
+ // Must not be absolute or contain the '..' path. Must be utf-8 encoded.
+ // The first item of the relative path must not start with '..'
+ Path string `json:"path"`
+}
+```
+
+**Note:** The update logic used in the downward API volume plug-in will be extracted and re-used in
+the volume plug-in for `ConfigMap`.
+
+### Changes to Secret
+
+We will update the Secret volume plugin to have a similar API to the new ConfigMap volume plugin.
+The secret volume plugin will also begin updating secret content in the volume when secrets change.
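The constraints on `KeyToPath.Path` above imply validation along these lines (an illustrative Python sketch, not the actual validation code):

```python
def validate_path(path):
    """Check a KeyToPath.Path value against the constraints stated above:
    non-empty, relative, no '..' segment, and the first segment must not
    start with '..'. Returns a list of problems (empty means valid)."""
    errors = []
    if not path:
        return ["must not be empty"]
    if path.startswith("/"):
        errors.append("must be a relative path")
    segments = path.split("/")
    if ".." in segments:
        errors.append("must not contain the '..' path")
    if segments[0].startswith(".."):
        errors.append("first segment must not start with '..'")
    return errors

print(validate_path("etc/redis.conf"))  # []
print(validate_path("../escape"))       # problems reported
```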
+
+## Examples
+
+#### Consuming `ConfigMap` as Environment Variables
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: etcd-env-config
+data:
+  number-of-members: "1"
+  initial-cluster-state: new
+  initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN
+  discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN
+  discovery-url: http://etcd-discovery:2379
+  etcdctl-peers: http://etcd:2379
+```
+
+This pod consumes the `ConfigMap` as environment variables:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: config-env-example
+spec:
+  containers:
+  - name: etcd
+    image: openshift/etcd-20-centos7
+    ports:
+    - containerPort: 2379
+      protocol: TCP
+    - containerPort: 2380
+      protocol: TCP
+    env:
+    - name: ETCD_NUM_MEMBERS
+      valueFrom:
+        configMap:
+          configMapName: etcd-env-config
+          key: number-of-members
+    - name: ETCD_INITIAL_CLUSTER_STATE
+      valueFrom:
+        configMap:
+          configMapName: etcd-env-config
+          key: initial-cluster-state
+    - name: ETCD_DISCOVERY_TOKEN
+      valueFrom:
+        configMap:
+          configMapName: etcd-env-config
+          key: discovery-token
+    - name: ETCD_DISCOVERY_URL
+      valueFrom:
+        configMap:
+          configMapName: etcd-env-config
+          key: discovery-url
+    - name: ETCDCTL_PEERS
+      valueFrom:
+        configMap:
+          configMapName: etcd-env-config
+          key: etcdctl-peers
+```
+
+#### Consuming `ConfigMap` as Volumes
+
+`redis-volume-config` is intended to be used as a volume containing a config file:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: redis-volume-config
+data:
+  redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n"
+```
+
+The following pod consumes the `redis-volume-config` in a volume:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: config-volume-example
+spec:
+  containers:
+  - name: redis
+    image: kubernetes/redis
+    command: ["redis-server", "/mnt/config-map/etc/redis.conf"]
+    ports:
+    - containerPort: 6379
+    volumeMounts:
+    - name: config-map-volume
+      mountPath: /mnt/config-map
+  volumes:
+  - name:
config-map-volume + configMap: + name: redis-volume-config + items: + - path: "etc/redis.conf" + key: redis.conf +``` + +## Future Improvements + +In the future, we may add the ability to specify an init-container that can watch the volume +contents for updates and respond to changes when they occur. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]() + -- cgit v1.2.3 From 7ef51e5a7487a5e3ae476cec091d7a8aafe96fed Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Mon, 25 Jan 2016 15:33:52 -0800 Subject: Cleanup. --- podaffinity.md | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/podaffinity.md b/podaffinity.md index 108697c6..8c1372b9 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -31,7 +31,7 @@ Documentation for other releases can be found at ## Introduction -NOTE: It is useful to read about [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) first. +NOTE: It is useful to read about [node affinity](nodeaffinity.md) first. This document describes a proposal for specifying and implementing inter-pod topological affinity and anti-affinity. By that we mean: rules that specify that certain pods should be placed @@ -61,11 +61,11 @@ be met later on at runtime, for one of them the system will try to eventually ev while for the other the system may not try to do so. The third variant simply provides scheduling-time *hints* that the scheduler will try to satisfy but may not be able to. These three variants are directly analogous to the three -variants of [node affinity](https://github.com/kubernetes/kubernetes/pull/18261). +variants of [node affinity](nodeaffinity.md). Note that this proposal is only about *inter-pod* topological affinity and anti-affinity. There are other forms of topological affinity and anti-affinity. 
For example, -you can use [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) to require (prefer) +you can use [node affinity](nodeaffinity.md) to require (prefer) that a set of pods all be scheduled in some specific zone Z. Node affinity is not capable of expressing inter-pod dependencies, and conversely the API we descibe in this document is not capable of expressing node affinity rules. @@ -159,7 +159,7 @@ type PodAffinityTerm struct { // nil list means "this pod's namespace," empty list means "all namespaces" // The json tag here is not "omitempty" since we need to distinguish nil and empty. // See https://golang.org/pkg/encoding/json/#Marshal for more details. - Namespaces []api.Namespace `json:"namespaces"` + Namespaces []api.Namespace `json:"namespaces,omitempty"` // empty topology key is interpreted by the scheduler as "all topologies" TopologyKey string `json:"topologyKey,omitempty"` } @@ -405,12 +405,27 @@ RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` (i.e. exclusive access to a node), it could charge for the resources of the average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the -entire cluster. If a cluster administrator wants to overcommit quota, for +entire cluster. If node affinity is used to +constrain the pod to a particular topology domain, then the admission-time quota +charging should take that into account (e.g. not charge for the average/largest machine +if the PodSpec constrains the pod to a specific machine with a known size; instead charge +for the size of the actual machine that the pod was constrained to). In all cases +once the pod is scheduled, the quota charge should be adjusted down to the +actual amount of resources allocated (e.g. the size of the actual machine that was +assigned, not the average/largest). 
If a cluster administrator wants to overcommit quota, for example to allow more than N pods across all users to request exclusive node access in a cluster with N nodes, then a priority/preemption scheme should be added so that the most important pods run when resource demand exceeds supply. -Our initial implementation will use quota that charges based on opportunity cost. +An alternative approach, which is a bit of a blunt hammer, is to use a +capability mechanism to restrict use of RequiredDuringScheduling anti-affinity +to trusted users. A more complex capability mechanism might only restrict it when +using a non-"node" TopologyKey. + +Our initial implementation will use a variant of the capability approach, which +requires no configuration: we will simply reject ALL requests, regardless of user, +that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity. +This allows the "exclusive node" use case while prohibiting the more dangerous ones. A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade the scheduling quality of another pod, but not completely block it from scheduling. @@ -505,8 +520,8 @@ affinity). 2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod). 3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account -4. Implement a quota mechanism that charges for the entire topology domain when `RequiredDuringScheduling` anti-affinity is used. Later -this should be refined to only apply when it is used to request exclusive access, not when it is used to express conflict with specific pods. +4. 
Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity. +This admission controller should be enabled by default. 5. Implement the recommended solution to the "co-existing with daemons" issue 6. At this point, the feature can be deployed. 7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure @@ -565,9 +580,9 @@ specified by adding an integer `Limit` to `PodAffinityTerm` just for the system and we do not intend to implement it. It is likely that the specification and implementation of pod anti-affinity -can be unified with [taints and tolerations](https://github.com/kubernetes/kubernetes/pull/18263), +can be unified with [taints and tolerations](taint-toleration-dedicated.md), and likewise that the specification and implementation of pod affinity -can be unified with [node affinity](https://github.com/kubernetes/kubernetes/pull/18261). +can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod labels would be "inherited" by the node, and pods would only be able to specify affinity and anti-affinity for a node's labels. Our main motivation for not unifying taints and tolerations with -- cgit v1.2.3 From c45cdde2d11b36d5ca63eeabc53cb3718ab3b11d Mon Sep 17 00:00:00 2001 From: Prayag Verma Date: Mon, 1 Feb 2016 01:39:12 +0530 Subject: Fix Typos Minor spelling mistakes - descibe > describe menioned > mentioned compatiblity > compatibility programatic > programmatic --- podaffinity.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/podaffinity.md b/podaffinity.md index 8c1372b9..4e30303b 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -68,7 +68,7 @@ There are other forms of topological affinity and anti-affinity. For example, you can use [node affinity](nodeaffinity.md) to require (prefer) that a set of pods all be scheduled in some specific zone Z. 
Node affinity is not capable of expressing inter-pod dependencies, and conversely the API -we descibe in this document is not capable of expressing node affinity rules. +we describe in this document is not capable of expressing node affinity rules. For simplicity, we will use the terms "affinity" and "anti-affinity" to mean "inter-pod topological affinity" and "inter-pod topological anti-affinity," respectively, in the remainder of this document. @@ -492,7 +492,7 @@ The design we have described says that the symmetry property for RequiredDuringS is weaker: if a pod P says it can only schedule onto nodes running pod Q, this does not mean Q can only run on a node that is running P, but the scheduler will try to schedule Q onto a node that is running P (i.e. treats the reverse direction as -preferred). This raises the same scheduling quality concern as we menioned at the +preferred). This raises the same scheduling quality concern as we mentioned at the end of the Denial of Service section above, and can be addressed in similar ways. The nature of affinity (as opposed to anti-affinity) means that there is no issue of @@ -538,7 +538,7 @@ Do so in a way that addresses the "determining which pod(s) to kill" issue. We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling domains (e.g. node name, rack name, availability zone name, etc.). See #9044. -## Backward compatiblity +## Backward compatibility Old versions of the scheduler will ignore `Affinity`. @@ -546,7 +546,7 @@ Users should not start using `Affinity` until the full implementation has been in Kubelet and the master for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet or master to a version that does not support them. Longer-term we will -use a programatic approach to enforcing this (#4855). +use a programmatic approach to enforcing this (#4855). 
## Extensibility -- cgit v1.2.3 From e5f17e1f4ab0a23eb75df9d8ec9f1594646f4dd6 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Wed, 3 Feb 2016 01:30:10 -0500 Subject: Add boilerplate checks for Makefiles --- clustering/Makefile | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/clustering/Makefile b/clustering/Makefile index f6aa53ed..b1743cf4 100644 --- a/clustering/Makefile +++ b/clustering/Makefile @@ -1,3 +1,17 @@ +# Copyright 2016 The Kubernetes Authors All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + FONT := DroidSansMono.ttf PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag)) -- cgit v1.2.3 From 2ca0b80d70c336df29f591b2e07f3e010dd66942 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Wed, 3 Feb 2016 02:11:37 -0500 Subject: Add boilerplate checks for Dockerfiles --- clustering/Dockerfile | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/clustering/Dockerfile b/clustering/Dockerfile index 3353419d..60d258c4 100644 --- a/clustering/Dockerfile +++ b/clustering/Dockerfile @@ -1,3 +1,17 @@ +# Copyright 2016 The Kubernetes Authors All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + FROM debian:jessie RUN apt-get update -- cgit v1.2.3 From 2002ccb1f356ec36bb35172af4171758cafd7363 Mon Sep 17 00:00:00 2001 From: David Oppenheimer Date: Sat, 5 Dec 2015 14:10:25 -0800 Subject: MetadataPolicy design doc. --- metadata-policy.md | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 160 insertions(+) create mode 100644 metadata-policy.md diff --git a/metadata-policy.md b/metadata-policy.md new file mode 100644 index 00000000..1d02fcf4 --- /dev/null +++ b/metadata-policy.md @@ -0,0 +1,160 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system + +## Introduction + +This document describes a new API resource, `MetadataPolicy`, that configures an +admission controller to take one or more actions based on an object's metadata. +Initially the metadata fields that the predicates can examine are labels and annotations, +and the actions are to add one or more labels and/or annotations, or to reject creation/update +of the object. In the future other actions might be supported, such as applying an initializer. + +The first use of `MetadataPolicy` will be to decide which scheduler should schedule a pod +in a [multi-scheduler](../proposals/multiple-schedulers.md) Kubernetes system. In particular, the +policy will add the scheduler name annotation to a pod based on an annotation that +is already on the pod that indicates the QoS of the pod. +(That annotation was presumably set by a simpler admission controller that +uses code, rather than configuration, to map the resource requests and limits of a pod +to QoS, and attaches the corresponding annotation.) + +We anticipate a number of other uses for `MetadataPolicy`, such as defaulting for +labels and annotations, prohibiting/requiring particular labels or annotations, or +choosing a scheduling policy within a scheduler. We do not discuss them in this doc. + + +## API + +```go +// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource. +// Every rule is applied, in an unspecified order, but if the action for any rule +// that matches is to reject the object, then the object is rejected without being mutated. 
+type MetadataPolicySpec struct {
+ Rules []MetadataPolicyRule `json:"rules,omitempty"`
+}
+
+// If the PolicyPredicate is met, then the PolicyAction is applied.
+// Example rules:
+// reject object if label with key X is not present (i.e. require X)
+// reject object if label with key X is present (i.e. forbid X)
+// add label X=Y if label with key X is not present (i.e. default X)
+// add annotation A=B if object has annotation C=D or E=F
+type MetadataPolicyRule struct {
+ PolicyPredicate PolicyPredicate `json:"policyPredicate"`
+ PolicyAction PolicyAction `json:"policyAction"`
+}
+
+// All criteria must be met for the PolicyPredicate to be considered met.
+type PolicyPredicate struct {
+ // Note that Namespace is not listed here because MetadataPolicy is per-Namespace.
+ LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
+ AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"`
+}
+
+// Apply the indicated Labels and/or Annotations (if present), unless Reject is set
+// to true, in which case reject the object without mutating it.
+type PolicyAction struct {
+ // If true, the object will be rejected and not mutated.
+ Reject bool `json:"reject"`
+ // The labels to add or update, if any.
+ UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"`
+ // The annotations to add or update, if any.
+ UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"`
+}
+
+// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying
+// policies that should be applied to objects based on the objects' metadata. All MetadataPolicy objects
+// are applied to all objects in the namespace; the order of evaluation is not guaranteed,
+// but if any of the matching policies have an action of rejecting the object, then the object
+// will be rejected without being mutated.
+type MetadataPolicy struct {
+ unversioned.TypeMeta `json:",inline"`
+ // Standard object's metadata.
+ // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+ ObjectMeta `json:"metadata,omitempty"`
+
+ // Spec defines the metadata policy that should be enforced.
+ // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+ Spec MetadataPolicySpec `json:"spec,omitempty"`
+}
+
+// MetadataPolicyList is a list of MetadataPolicy items.
+type MetadataPolicyList struct {
+ unversioned.TypeMeta `json:",inline"`
+ // Standard list metadata.
+ // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+ unversioned.ListMeta `json:"metadata,omitempty"`
+
+ // Items is a list of MetadataPolicy objects.
+ // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+ Items []MetadataPolicy `json:"items"`
+}
+```
+
+## Implementation plan
+
+1. Create `MetadataPolicy` API resource
+1. Create admission controller that implements policies defined in `MetadataPolicy`
+1. Create admission controller that sets annotation
+`scheduler.alpha.kubernetes.io/qos: QOS`
+(where `QOS` is one of `Guaranteed, Burstable, BestEffort`)
+based on pod's resource request and limit.
+
+## Future work
+
+Longer-term we will have QoS be set on create and update by the registry, similar to `Pending` phase today,
+instead of having an admission controller (that runs before the one that takes `MetadataPolicy` as input)
+do it.
+
+We plan to eventually move from having an admission controller
+set the scheduler name as a pod annotation, to using the initializer concept. In particular, the
+scheduler will be an initializer, and the admission controller that decides which scheduler to use
+will add the scheduler's name to the list of initializers for the pod (presumably the scheduler
+will be the last initializer to run on each pod).
+The admission controller would still be configured using the `MetadataPolicy` described here, only the +mechanism the admission controller uses to record its decision of which scheduler to use would change. + +## Related issues + +The main issue for multiple schedulers is #11793. There was also a lot of discussion +in PRs #17197 and #17865. + +We could use the approach described here to choose a scheduling +policy within a single scheduler, as opposed to choosing a scheduler, a desire mentioned in #9920. +Issue #17097 describes a scenario unrelated to scheduler-choosing where `MetadataPolicy` could be used. +Issue #17324 proposes to create a generalized API for matching +"claims" to "service classes"; matching a pod to a scheduler would be one use for such an API. + + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]() + -- cgit v1.2.3 From 6b860503708c4546c4b2b57a736d673176685b96 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Mon, 8 Feb 2016 13:07:44 -0800 Subject: Move selector-generation from proposal to design --- selector-generation.md | 133 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 selector-generation.md diff --git a/selector-generation.md b/selector-generation.md new file mode 100644 index 00000000..27107c31 --- /dev/null +++ b/selector-generation.md @@ -0,0 +1,133 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
+
+
+
+Proposed Design
+=============
+
+# Goals
+
+Make it really hard to accidentally create a job which has an overlapping selector, while still making it possible to choose an arbitrary selector, and without adding complex constraint solving to the APIserver.
+
+# Use Cases
+
+1. user can leave all label and selector fields blank and system will fill in reasonable ones: non-overlappingness guaranteed.
+2. user can put on the pod template some labels that are useful to the user, without reasoning about non-overlappingness. System adds an additional label to assure non-overlap.
+3. If user wants to reparent pods to new job (very rare case) and knows what they are doing, they can completely disable this behavior and specify an explicit selector.
+4. If a controller that makes jobs, like scheduled job, wants to use different labels, such as the time and date of the run, it can do that.
+5. If User reads v1beta1 documentation or reuses v1beta1 Job definitions and just changes the API group, the user should not automatically be allowed to specify a selector, since this is very rarely what people want to do and is error prone.
+6. If User downloads an existing job definition, e.g. with `kubectl get jobs/old -o yaml` and tries to modify and post it, he should not create an overlapping job.
+7. If User downloads an existing job definition, e.g. with `kubectl get jobs/old -o yaml` and tries to modify and post it, and he accidentally copies the uniquifying label from the old one, then he should not get an error from a label-key conflict, nor get erratic behavior.
+8. If user reads swagger docs and sees the selector field, he should not be able to set it without realizing the risks.
+9. 
(Deferred requirement:) If user wants to specify a preferred name for the non-overlappingness key, they can pick a name.
+
+# Proposed changes
+
+## APIserver
+
+`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as follows.
+
+There are two allowed usage modes:
+
+### Automatic Mode
+
+- User does not specify `job.spec.selector`.
+- User is probably unaware of the `job.spec.noAutoSelector` field and does not think about it.
+- User optionally puts labels on the pod template; the user does not think about uniqueness, just labeling for the user's own reasons.
+- Defaulting logic sets `job.spec.selector` to `matchLabels["controller-uid"]="$UIDOFJOB"`.
+- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
+  - The first label is controller-uid=$UIDOFJOB.
+  - The second label is "job-name=$NAMEOFJOB".
+
+### Manual Mode
+
+- User means User or Controller for the rest of this list.
+- User does specify `job.spec.selector`.
+- User does specify `job.spec.noAutoSelector=true`.
+- User puts a unique label or label(s) on the pod template (required); the user thinks carefully about uniqueness.
+- No defaulting of pod labels or the selector happens.
+
+### Common to both modes
+
+- Validation code ensures that the selector on the job selects the labels on the pod template, and rejects if not.
+
+### Rationale
+
+UID is better than Name in that:
+- it allows cross-namespace control someday if we need it.
+- it is unique across all kinds. `controller-name=foo` does not ensure uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the latter cannot use label `job-name=foo`, though there is a temptation to do so.
+- it uniquely identifies the controller across time. This prevents the case where, for example, someone deletes a job via the REST api or client (where cascade=false), leaving pods around. We don't want those to be picked up unintentionally.
It also prevents the case where a user looks at an old job that finished but is not deleted, tries to select its pods, and gets the wrong impression that it is still running.
+
+Job name is more user-friendly. It is self-documenting.
+
+Commands like `kubectl get pods -l job-name=myjob` should do exactly what is wanted 99.9% of the time. Automated control loops should still use the `controller-uid` label.
+
+Using both gets the benefits of both, at the cost of some label verbosity.
+
+### Overriding Unique Labels
+
+If the user does specify `job.spec.selector` then the user must also specify `job.spec.noAutoSelector`.
+This ensures the user knows that what they are doing is not the normal thing to do.
+
+To prevent users from copying the `job.spec.noAutoSelector` flag from existing jobs, it will be
+optional and default to false, which means that when you GET an existing job that didn't use this feature, you don't even see the `job.spec.noAutoSelector` flag, so you are not tempted to wonder if you should fiddle with it.
+
+## Job Controller
+
+No changes
+
+## Kubectl
+
+No required changes.
+Suggest moving SELECTOR to wide output of `kubectl get jobs` since users don't write the selector.
+
+## Docs
+
+Remove examples that use selector and remove labels from pod templates.
+Recommend `kubectl get pods -l job-name=name` as the way to find the pods of a job.
+
+# Cross Version Compat
+
+`v1beta1` will not have a `job.spec.noAutoSelector` and will not provide a default selector.
+
+Conversion from v1beta1 to v1 will use the user-provided selector and set `job.spec.noAutoSelector=true`.
+
+# Future Work
+
+Follow this pattern for Deployments, ReplicaSets, and DaemonSets when going to v1, if it works well for Job.
+
+Docs will be edited to show examples without a `job.spec.selector`.
+
+We probably want as much as possible the same behavior for Job and ReplicationController. 
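Editor's sketch of the automatic mode described above (not part of the original proposal): a user submits a Job with no selector, and the defaulting logic fills in the `controller-uid` selector and appends the two labels. All field values here are illustrative; in particular, the `controller-uid` value would really be the system-assigned UID of the Job, not the made-up value shown.

```yaml
# Automatic mode, as submitted: no selector, only the user's own label.
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  template:
    metadata:
      labels:
        app: transcoder            # user's label; uniqueness not required
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: busybox
        command: ["echo", "hello"]
---
# The same Job as stored after defaulting (sketch; UID value is made up).
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  selector:
    matchLabels:
      controller-uid: 11111111-2222-3333-4444-555555555555
  template:
    metadata:
      labels:
        app: transcoder
        controller-uid: 11111111-2222-3333-4444-555555555555  # appended by defaulting
        job-name: myjob                                       # appended by defaulting
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: busybox
        command: ["echo", "hello"]
```

With these labels in place, `kubectl get pods -l job-name=myjob` is the human-friendly query, while automated control loops match on the `controller-uid` label.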
+ + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selector-generation.md?pixel)]() + -- cgit v1.2.3 From 8fb57ee0a7c18ccc6cd865a3b0eebf7ba3fc9614 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Mon, 8 Feb 2016 13:11:43 -0800 Subject: Update paths after move. Also improve doc slightly. --- selector-generation.md | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/selector-generation.md b/selector-generation.md index 27107c31..032129b6 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -48,11 +48,17 @@ Make it really hard to accidentally create a job which has an overlapping select # Proposed changes -## APIserver +## API `extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes change as follows. -There are two allowed usage modes: +Field `job.spec.noAutoSelector` is added. It controls whether selectors are automatically +generated. In automatic mode, user cannot make the mistake of creating non-unique selectors. +In manual mode, certain rare use cases are supported. + +Validation is not changed. A selector must be provided, and it must select the pod template. + +Defaulting changes. Defaulting happens in one of two modes: ### Automatic Mode @@ -72,10 +78,6 @@ There are two allowed usage modes: - User puts a unique label or label(s) on pod template (required). user does think carefully about uniqueness. - No defaulting of pod labels or the selector happen. -### Common to both modes - -- Validation code ensures that the selector on the job selects the labels on the pod template, and rejects if not. - ### Rationale UID is better than Name in that: @@ -89,6 +91,10 @@ Commands like `kubectl get pods -l job-name=myjob` should do exactly what is wa Using both gets the benefits of both, at the cost of some label verbosity. +The field is a `*bool`. 
Since false is expected to be much more common, +and since the feature is complex, it is better to leave it unspecified so that +users looking at a stored pod spec do not need to be aware of this field. + ### Overriding Unique Labels If user does specify `job.spec.selector` then the user must also specify `job.spec.noAutoSelector`. @@ -128,6 +134,8 @@ We probably want as much as possible the same behavior for Job and ReplicationCo + + -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/selector-generation.md?pixel)]() +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]() -- cgit v1.2.3 From 48e81b76a3d9042b11b58a022c80787e53c30e1c Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Fri, 12 Feb 2016 14:06:32 -0800 Subject: Renamed noAutoSelector to manualSelector Avoids double negative. --- selector-generation.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/selector-generation.md b/selector-generation.md index 032129b6..545de9cf 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -52,7 +52,7 @@ Make it really hard to accidentally create a job which has an overlapping select `extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes change as follows. -Field `job.spec.noAutoSelector` is added. It controls whether selectors are automatically +Field `job.spec.manualSelector` is added. It controls whether selectors are automatically generated. In automatic mode, user cannot make the mistake of creating non-unique selectors. In manual mode, certain rare use cases are supported. @@ -63,7 +63,7 @@ Defaulting changes. Defaulting happens in one of two modes: ### Automatic Mode - User does not specify `job.spec.selector`. -- User is probably unaware of the `job.spec.noAutoSelector` field and does not think about it. +- User is probably unaware of the `job.spec.manualSelector` field and does not think about it. 
- User optionally puts labels on pod template (optional). user does not think about uniqueness, just labeling for user's own reasons. - Defaulting logic sets `job.spec.selector` to `matchLabels["controller-uid"]="$UIDOFJOB"` - Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`. @@ -74,7 +74,7 @@ Defaulting changes. Defaulting happens in one of two modes: - User means User or Controller for the rest of this list. - User does specify `job.spec.selector`. -- User does specify `job.spec.noAutoSelector=true` +- User does specify `job.spec.manualSelector=true` - User puts a unique label or label(s) on pod template (required). user does think carefully about uniqueness. - No defaulting of pod labels or the selector happen. @@ -97,11 +97,11 @@ users looking at a stored pod spec do not need to be aware of this field. ### Overriding Unique Labels -If user does specify `job.spec.selector` then the user must also specify `job.spec.noAutoSelector`. +If user does specify `job.spec.selector` then the user must also specify `job.spec.manualSelector`. This ensures the user knows that what he is doing is not the normal thing to do. -To prevent users from copying the `job.spec.noAutoSelector` flag from existing jobs, it will be -optional and default to false, which means when you ask GET and existing job back that didn't use this feature, you don't even see the `job.spec.noAutoSelector` flag, so you are not tempted to wonder if you should fiddle with it. +To prevent users from copying the `job.spec.manualSelector` flag from existing jobs, it will be +optional and default to false, which means when you ask GET and existing job back that didn't use this feature, you don't even see the `job.spec.manualSelector` flag, so you are not tempted to wonder if you should fiddle with it. ## Job Controller @@ -119,9 +119,9 @@ Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job. 
# Cross Version Compat -`v1beta1` will not have a `job.spec.noAutoSelector` and will not provide a default selector. +`v1beta1` will not have a `job.spec.manualSelector` and will not provide a default selector. -Conversion from v1beta1 to v1 will use the user-provided selector and set `job.spec.noAutoSelector=true`. +Conversion from v1beta1 to v1 will use the user-provided selector and set `job.spec.manualSelector=true`. # Future Work -- cgit v1.2.3 From e53f03ae47499aaacd684ab074200f04e4ac1ec5 Mon Sep 17 00:00:00 2001 From: laushinka Date: Sat, 13 Feb 2016 02:33:32 +0700 Subject: Spelling fixes inspired by github.com/client9/misspell --- clustering/dynamic.seqdiag | 2 +- enhance-pluggable-policy.md | 2 +- indexed-job.md | 16 ++++++++-------- taint-toleration-dedicated.md | 4 ++-- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/clustering/dynamic.seqdiag b/clustering/dynamic.seqdiag index 95bb395e..567d5bf9 100644 --- a/clustering/dynamic.seqdiag +++ b/clustering/dynamic.seqdiag @@ -15,7 +15,7 @@ seqdiag { user ->> kubelet [label="start\n- bootstrap-cluster-uri"]; kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"]; - kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="retuns\n- kubelet-cert"]; + kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"]; user => master [label="getSignRequests"]; user => master [label="approveSignRequests"]; kubelet <<-- master [label="returns\n- kubelet-cert"]; diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 1ee9bf29..9cdd9a2d 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -54,7 +54,7 @@ An API request has the following attributes that can be considered for authoriza - resourceVersion - the API version of the resource being accessed - resource - which resource is being accessed - applies only to the API endpoints, such as - `/api/v1beta1/pods`. 
For miscelaneous endpoints, like `/version`, the kind is the empty string. + `/api/v1beta1/pods`. For miscellaneous endpoints, like `/version`, the kind is the empty string. - resourceName - the name of the resource during a get, update, or delete action. - subresource - which subresource is being accessed diff --git a/indexed-job.md b/indexed-job.md index 7b72cc0f..b928f722 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -33,18 +33,18 @@ Documentation for other releases can be found at ## Summary This design extends kubernetes with user-friendly support for -running embarassingly parallel jobs. +running embarrassingly parallel jobs. Here, *parallel* means on multiple nodes, which means multiple pods. -By *embarassingly parallel*, it is meant that the pods +By *embarrassingly parallel*, it is meant that the pods have no dependencies between each other. In particular, neither ordering between pods nor gang scheduling are supported. -Users already have two other options for running embarassingly parallel +Users already have two other options for running embarrassingly parallel Jobs (described in the next section), but both have ease-of-use issues. Therefore, this document proposes extending the Job resource type to support -a third way to run embarassingly parallel programs, with a focus on +a third way to run embarrassingly parallel programs, with a focus on ease of use. This new style of Job is called an *indexed job*, because each Pod of the Job @@ -53,7 +53,7 @@ is specialized to work on a particular *index* from a fixed length array of work ## Background The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports -the embarassingly parallel use case through *workqueue jobs*. +the embarrassingly parallel use case through *workqueue jobs*. While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very flexible, they can be difficult to use. 
They: (1) typically require running a message queue @@ -242,7 +242,7 @@ In the above example: - `--restart=OnFailure` implies creating a job instead of replicationController. - Each pods command line is `/usr/local/bin/process_file $F`. - `--per-completion-env=` implies the jobs `.spec.completions` is set to the length of the argument array (3 in the example). -- `--per-completion-env=F=` causes env var with `F` to be available in the enviroment when the command line is evaluated. +- `--per-completion-env=F=` causes env var with `F` to be available in the environment when the command line is evaluated. How exactly this happens is discussed later in the doc: this is a sketch of the user experience. @@ -269,7 +269,7 @@ Another case we do not try to handle is where the input file does not exist yet #### Multiple parameters -The user may also have multiple paramters, like in [work list 2](#work-list-2). +The user may also have multiple parameters, like in [work list 2](#work-list-2). One way is to just list all the command lines already expanded, one per line, in a file, like this: ``` @@ -491,7 +491,7 @@ The index-only approach: - requires that the user keep the *per completion parameters* in a separate storage, such as a configData or networked storage. - makes no changes to the JobSpec. - Drawback: while in separate storage, they could be mutatated, which would have unexpected effects -- Drawback: Logic for using index to lookup paramters needs to be in the Pod. +- Drawback: Logic for using index to lookup parameters needs to be in the Pod. - Drawback: CLIs and UIs are limited to using the "index" as the identity of a pod from a job. They cannot easily say, for example `repeated failures on the pod processing banana.txt`. 
diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index cca2ee44..7eb37da9 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -51,7 +51,7 @@ on a node but do not prevent it, taints that prevent a pod from starting on Kube if the pod's `NodeName` was written directly (i.e. pod did not go through the scheduler), and taints that evict already-running pods. [This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) -has more background on these diffrent scenarios. We will focus on the first +has more background on these different scenarios. We will focus on the first kind of taint in this doc, since it is the kind required for the "dedicated nodes" use case. Implementing dedicated nodes using taints and tolerations is straightforward: in essence, a node that @@ -264,7 +264,7 @@ their taint. Thus we need to ensure that a new node does not become "Ready" unti configured with its taints. One way to do this is to have an admission controller that adds the taint whenever a Node object is created. -A quota policy may want to treat nodes diffrently based on what taints, if any, +A quota policy may want to treat nodes differently based on what taints, if any, they have. For example, if a particular namespace is only allowed to access dedicated nodes, then it may be convenient to give the namespace unlimited quota. 
(To use finite quota, you'd have to size the namespace's quota to the sum of the sizes of the machines in the -- cgit v1.2.3 From d4a1d08cfc63e9de357ae503a4676a6436c9f354 Mon Sep 17 00:00:00 2001 From: Eric Tune Date: Mon, 22 Feb 2016 09:59:53 -0800 Subject: Explain conversion for manualSelector --- selector-generation.md | 31 ++++++++++++++++++++++++++----- 1 file changed, 26 insertions(+), 5 deletions(-) diff --git a/selector-generation.md b/selector-generation.md index 545de9cf..3ada304c 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -27,7 +27,7 @@ Documentation for other releases can be found at -Proposed Design +Design ============= # Goals @@ -110,18 +110,39 @@ No changes ## Kubectl No required changes. -Suggest moving SELECTOR to wide output of `kubectl get jobs` since users don't write the selector. +Suggest moving SELECTOR to wide output of `kubectl get jobs` since users do not write the selector. ## Docs Remove examples that use selector and remove labels from pod templates. Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job. -# Cross Version Compat +# Conversion -`v1beta1` will not have a `job.spec.manualSelector` and will not provide a default selector. +The following applies to Job, as well as to other types that adopt this pattern. -Conversion from v1beta1 to v1 will use the user-provided selector and set `job.spec.manualSelector=true`. +- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`. +- Both the internal type and the `batch/v1` type will get `job.spec.manualSelector`. +- The fields `manualSelector` and `autoSelector` have opposite meanings. +- Each field defaults to false when unset, and so v1beta1 has a different default than v1 and internal. This is intentional: we want new + uses to default to the less error-prone behavior, and we do not want to change the behavior + of v1beta1. 
+
+*Note*: since the internal default is changing, client
+library consumers that create Jobs may need to add "job.spec.manualSelector=true" to keep working, or switch
+to auto selectors.
+
+Conversion is as follows:
+- `extensions/__internal` to `extensions/v1beta1`: the value of `__internal.Spec.ManualSelector` is defaulted to false if nil, negated, defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
+- `extensions/v1beta1` to `extensions/__internal`: the value of `v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to nil if false, and written to `__internal.Spec.ManualSelector`.
+
+This conversion gives the following properties.
+
+1. Users that previously used v1beta1 do not start seeing a new field when they get back objects.
+2. Distinction between originally unset versus explicitly set to false is not preserved (it would have been nice to do so, but that requires a more complicated
+   solution).
+3. Users who only created v1beta1 examples or v1 examples will never see the existence of either field.
+4. Since v1beta1 objects are convertible to/from v1, the storage location (path in etcd) does not need to change, allowing scriptable rollforward/rollback.

# Future Work

-- cgit v1.2.3


From 87e88066dfe295c78bdec1706a1a64bb742c8239 Mon Sep 17 00:00:00 2001
From: Chao Xu
Date: Wed, 24 Feb 2016 10:41:16 -0800
Subject: fix links

---
 event_compression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/event_compression.md b/event_compression.md
index c8030559..3a05db21 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -100,7 +100,7 @@ Each binary that generates events:
 * `event.Reason`
 * `event.Message`
 * The LRU cache is capped at 4096 events for both `EventAggregator` and `EventLogger`. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. 
Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/unversioned/record/event.go)). + * When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). -- cgit v1.2.3 From b231e7c92753e3be3b05e16de430e19afe659c4c Mon Sep 17 00:00:00 2001 From: Gao Zheng Date: Wed, 2 Mar 2016 08:27:00 +0800 Subject: clerical error of nodeaffinity.md clerical error of nodeaffinity.md --- nodeaffinity.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nodeaffinity.md b/nodeaffinity.md index 5deda322..a8ee2a18 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -37,7 +37,7 @@ intended to be used only for selecting nodes. In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler currently uses as part of restricting the set of nodes onto which a pod is -eligible to schedule, with a field of type `Affinity` that contains contains one or +eligible to schedule, with a field of type `Affinity` that contains one or more affinity specifications. 
In this document we discuss `NodeAffinity`, which contains one or more of the following * a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be -- cgit v1.2.3 From bbbede5eb69076167d635489d2cf48657be00dc1 Mon Sep 17 00:00:00 2001 From: Quinton Hoole Date: Tue, 5 Jan 2016 17:53:54 -0800 Subject: RFC design docs for Cluster Federation/Ubernetes. --- control-plane-resilience.md | 269 ++++++++++++++++++++++ federated-services.md | 550 ++++++++++++++++++++++++++++++++++++++++++++ federation-phase-1.md | 434 ++++++++++++++++++++++++++++++++++ ubernetes-cluster-state.png | Bin 0 -> 13824 bytes ubernetes-design.png | Bin 0 -> 20358 bytes ubernetes-scheduling.png | Bin 0 -> 39094 bytes 6 files changed, 1253 insertions(+) create mode 100644 control-plane-resilience.md create mode 100644 federated-services.md create mode 100644 federation-phase-1.md create mode 100644 ubernetes-cluster-state.png create mode 100644 ubernetes-design.png create mode 100644 ubernetes-scheduling.png diff --git a/control-plane-resilience.md b/control-plane-resilience.md new file mode 100644 index 00000000..8becccec --- /dev/null +++ b/control-plane-resilience.md @@ -0,0 +1,269 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+# Kubernetes/Ubernetes Control Plane Resilience
+
+## Long Term Design and Current Status
+
+### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
+
+### December 14, 2015
+
+## Summary
+
+Some amount of confusion exists around how we currently, and in the
+future, want to ensure resilience of the Kubernetes (and by implication
+Ubernetes) control plane. This document is an attempt to capture that
+definitively. It covers areas including self-healing, high
+availability, bootstrapping and recovery. Most of the information in
+this document already exists in the form of GitHub comments,
+PRs/proposals, scattered documents, and corridor conversations, so this
+document is primarily a consolidation and clarification of existing
+ideas.
+
+## Terms
+
+* **Self-healing:** automatically restarting or replacing failed
+  processes and machines without human intervention
+* **High availability:** continuing to be available and work correctly
+  even if some components are down or uncontactable. This typically
+  involves multiple replicas of critical services, and a reliable way
+  to find available replicas. Note that it's possible (but not
+  desirable) to have high
+  availability properties (e.g. multiple replicas) in the absence of
+  self-healing properties (e.g. if a replica fails, nothing replaces
+  it). Fairly obviously, given enough time, such systems typically
+  become unavailable (after enough replicas have failed).
+* **Bootstrapping**: creating an empty cluster from nothing
+* **Recovery**: recreating a non-empty cluster after perhaps
+  catastrophic failure/unavailability/data corruption
+
+## Overall Goals
+
+1. 
**Resilience to single failures:** Kubernetes clusters constrained + to single availability zones should be resilient to individual + machine and process failures by being both self-healing and highly + available (within the context of such individual failures). +1. **Ubiquitous resilience by default:** The default cluster creation + scripts for (at least) GCE, AWS and basic bare metal should adhere + to the above (self-healing and high availability) by default (with + options available to disable these features to reduce control plane + resource requirements if so required). It is hoped that other + cloud providers will also follow the above guidelines, but the + above 3 are the primary canonical use cases. +1. **Resilience to some correlated failures:** Kubernetes clusters + which span multiple availability zones in a region should by + default be resilient to complete failure of one entire availability + zone (by similarly providing self-healing and high availability in + the default cluster creation scripts as above). +1. **Default implementation shared across cloud providers:** The + differences between the default implementations of the above for + GCE, AWS and basic bare metal should be minimized. This implies + using shared libraries across these providers in the default + scripts in preference to highly customized implementations per + cloud provider. This is not to say that highly differentiated, + customized per-cloud cluster creation processes (e.g. for GKE on + GCE, or some hosted Kubernetes provider on AWS) are discouraged. + But those fall squarely outside the basic cross-platform OSS + Kubernetes distro. +1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms + for achieving system resilience (replication controllers, health + checking, service load balancing etc) should be used in preference + to building a separate set of mechanisms to achieve the same thing. 
+ This implies that self-hosting (the Kubernetes control plane on
+ Kubernetes) is strongly preferred, with the caveat below.
+1. **Recovery from catastrophic failure:** The ability to quickly and
+  reliably recover a cluster from catastrophic failure is critical,
+  and should not be compromised by the above goal to self-host
+  (i.e. it goes without saying that the cluster should be quickly and
+  reliably recoverable, even if the cluster control plane is
+  broken). This implies that such catastrophic failure scenarios
+  should be carefully thought out, and the subject of regular
+  continuous integration testing, and disaster recovery exercises.
+
+## Relative Priorities
+
+1. **(Possibly manual) recovery from catastrophic failures:** having a Kubernetes cluster, and all
+  applications running inside it, disappear forever is perhaps the worst
+  possible failure mode. So it is critical that we be able to
+  recover the applications running inside a cluster from such
+  failures in some well-bounded time period.
+  1. In theory a cluster can be recovered by replaying all API calls
+     that have ever been executed against it, in order, but most
+     often that state has been lost, and/or is scattered across
+     multiple client applications or groups. So in general it is
+     probably infeasible.
+  1. In theory a cluster can also be recovered to some relatively
+     recent non-corrupt backup/snapshot of the disk(s) backing the
+     etcd cluster state. But we have no default consistent
+     backup/snapshot, verification or restoration process. And we
+     don't routinely test restoration, so even if we did routinely
+     perform and verify backups, we have no hard evidence that we
+     can in practice effectively recover from catastrophic cluster
+     failure or data corruption by restoring from these backups. So
+     there's more work to be done here.
+1. 
**Self-healing:** Most major cloud providers provide the ability to
+  easily and automatically replace failed virtual machines within a
+  small number of minutes (e.g. GCE
+  [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
+  and Managed Instance Groups,
+  AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
+  and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
+  can fairly trivially be used to reduce control-plane down-time due
+  to machine failure to a small number of minutes per failure
+  (i.e. typically around "3 nines" availability), provided that:
+  1. cluster persistent state (i.e. etcd disks) is either:
+     1. truly persistent (i.e. remote persistent disks), or
+     1. reconstructible (e.g. using etcd [dynamic member
+        addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
+        or [backup and
+        recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
+
+  1. and boot disks are either:
+     1. truly persistent (i.e. remote persistent disks), or
+     1. reconstructible (e.g. using boot-from-snapshot,
+        boot-from-pre-configured-image or
+        boot-from-auto-initializing image).
+1. **High Availability:** This has the potential to increase
+  availability above the approximately "3 nines" level provided by
+  automated self-healing, but it's somewhat more complex, and
+  requires additional resources (e.g. redundant API servers and etcd
+  quorum members). In environments where cloud-assisted automatic
+  self-healing might be infeasible (e.g. on-premise bare-metal
+  deployments), it also gives cluster administrators more time to
+  respond (e.g. replace/repair failed machines) without incurring
+  system downtime.
+
+## Design and Status (as of December 2015)
+
Control Plane Component | Resilience Plan | Current Status
API Server
+
+Multiple stateless, self-hosted, self-healing API servers behind an HA
+load balancer, built out by the default "kube-up" automation on GCE,
+AWS and basic bare metal (BBM). Note that the single-host approach of
+having etcd listen only on localhost to ensure that only the API server can
+connect to it will no longer work, so alternative security will be
+needed in this regard (either using firewall rules, SSL certs, or
+something else). All necessary flags are currently supported to enable
+SSL between API server and etcd (OpenShift runs like this out of the
+box), but this needs to be woven into the "kube-up" and related
+scripts. Detailed design of self-hosting and related bootstrapping
+and catastrophic failure recovery will be covered in a separate
+design doc.
+
+
+
+No scripted self-healing or HA on GCE, AWS or basic bare metal
+currently exists in the OSS distro. To be clear, "no self-healing"
+means that even if multiple e.g. API servers are provisioned for HA
+purposes, if they fail, nothing replaces them, so eventually the
+system will fail. Self-healing and HA can be set up
+manually by following documented instructions, but this is not
+currently an automated process, and it is not tested as part of
+continuous integration. So it's probably safest to assume that it
+doesn't actually work in practice.
+
Controller manager and scheduler
+
+Multiple self-hosted, self-healing, warm-standby, stateless controller
+managers and schedulers with leader election and automatic failover of API server
+clients, automatically installed by the default "kube-up" automation.
+
+As above.
etcd + +Multiple (3-5) etcd quorum members behind a load balancer with session +affinity (to prevent clients from being bounced from one to another). + +Regarding self-healing, if a node running etcd goes down, it is always necessary to do three +things: +
    +
  1. allocate a new node (not necessary if running etcd as a pod, in
+which case specific measures are required to prevent user pods from
+interfering with system pods, for example using node selectors as
+described in dynamic member addition).
+In the case of remote persistent disk, the etcd state can be recovered
+by attaching the remote persistent disk to the replacement node, thus
+the state is recoverable even if all other replicas are down.
+
+There are also significant performance differences between local disks and remote
+persistent disks. For example, the sustained throughput of
+local disks in GCE is approximately 20x that of remote disks.
+
+Hence we suggest that self-healing be provided by remotely mounted persistent disks in
+non-performance-critical, single-zone cloud deployments. For
+performance-critical installations, faster local SSDs should be used,
+in which case remounting on node failure is not an option, so
+etcd runtime configuration
+should be used to replace the failed machine. Similarly, for
+cross-zone self-healing, cloud persistent disks are zonal, so
+automatic
+runtime configuration
+is required. Likewise, basic bare metal deployments cannot generally
+rely on
+remote persistent disks, so the same approach applies there.
+ +Somewhat vague instructions exist +on how to set some of this up manually in a self-hosted +configuration. But automatic bootstrapping and self-healing is not +described (and is not implemented for the non-PD cases). This all +still needs to be automated and continuously tested. +
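As an aside, the 3-5 member quorum sizes discussed in this section follow from etcd's majority-quorum rule: an n-member cluster stays available only while a majority (n/2 + 1, rounded down) of its members are healthy. A quick sanity check in plain Python (illustrative only, no etcd required):

```python
def fault_tolerance(members: int) -> int:
    # etcd commits a write once a majority of members acknowledge it,
    # so the cluster remains available while a majority is healthy.
    quorum = members // 2 + 1
    return members - quorum

for n in (1, 3, 4, 5):
    print(f"{n} members -> survives {fault_tolerance(n)} failure(s)")
```

Three members survive one failure and five survive two, while an even count of four still only survives one, which is why odd quorum sizes are recommended.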
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]() + diff --git a/federated-services.md b/federated-services.md new file mode 100644 index 00000000..6febfb21 --- /dev/null +++ b/federated-services.md @@ -0,0 +1,550 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
+
+
+
+# Kubernetes Cluster Federation (a.k.a. "Ubernetes")
+
+## Cross-cluster Load Balancing and Service Discovery
+
+### Requirements and System Design
+
+### by Quinton Hoole, Dec 3 2015
+
+## Requirements
+
+### Discovery, Load-balancing and Failover
+
+1. **Internal discovery and connection**: Pods/containers (running in
+   a Kubernetes cluster) must be able to easily discover and connect
+   to endpoints for Kubernetes services on which they depend in a
+   consistent way, irrespective of whether those services exist in the
+   same or a different Kubernetes cluster within the same cluster federation.
+   Henceforth referred to as "cluster-internal clients", or simply
+   "internal clients".
+1. **External discovery and connection**: External clients (running
+   outside a Kubernetes cluster) must be able to discover and connect
+   to endpoints for Kubernetes services on which they depend.
+   1. **External clients predominantly speak HTTP(S)**: External
+      clients are most often, but not always, web browsers, or at
+      least speak HTTP(S) - notable exceptions include Enterprise
+      Message Buses (Java, TLS), DNS servers (UDP),
+      SIP servers and databases.
+1. **Find the "best" endpoint:** Upon initial discovery and
+   connection, both internal and external clients should ideally find
+   "the best" endpoint if multiple eligible endpoints exist. "Best"
+   in this context implies the closest (by network topology) endpoint
+   that is both operational (as defined by some positive health check)
+   and not overloaded (by some published load metric). For example:
+   1. An internal client should find an endpoint which is local to its
+      own cluster if one exists, in preference to one in a remote
+      cluster (if both are operational and non-overloaded).
+ Similarly, one in a nearby cluster (e.g. in the same zone or + region) is preferable to one further afield. + 1. An external client (e.g. in New York City) should find an + endpoint in a nearby cluster (e.g. U.S. East Coast) in + preference to one further away (e.g. Japan). +1. **Easy fail-over:** If the endpoint to which a client is connected + becomes unavailable (no network response/disconnected) or + overloaded, the client should reconnect to a better endpoint, + somehow. + 1. In the case where there exist one or more connection-terminating + load balancers between the client and the serving Pod, failover + might be completely automatic (i.e. the client's end of the + connection remains intact, and the client is completely + oblivious of the fail-over). This approach incurs network speed + and cost penalties (by traversing possibly multiple load + balancers), but requires zero smarts in clients, DNS libraries, + recursing DNS servers etc, as the IP address of the endpoint + remains constant over time. + 1. In a scenario where clients need to choose between multiple load + balancer endpoints (e.g. one per cluster), multiple DNS A + records associated with a single DNS name enable even relatively + dumb clients to try the next IP address in the list of returned + A records (without even necessarily re-issuing a DNS resolution + request). For example, all major web browsers will try all A + records in sequence until a working one is found (TBD: justify + this claim with details for Chrome, IE, Safari, Firefox). + 1. In a slightly more sophisticated scenario, upon disconnection, a + smarter client might re-issue a DNS resolution query, and + (modulo DNS record TTL's which can typically be set as low as 3 + minutes, and buggy DNS resolvers, caches and libraries which + have been known to completely ignore TTL's), receive updated A + records specifying a new set of IP addresses to which to + connect. 
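The client-side failover behaviors described above can be sketched as follows. This is a minimal, illustrative Python sketch of a "well-behaved" client, not code from any real client library: it walks the returned A records in order and, if every address fails, re-issues the DNS query in case the records have changed. `resolve` and `try_connect` are hypothetical stand-ins for a DNS lookup and a TCP connection attempt:

```python
def connect_with_failover(hostname, port, resolve, try_connect, max_rounds=2):
    """Try each A record in the order returned; if all fail, re-resolve
    (records may have changed once stale TTLs expire) before giving up.
    `resolve(hostname)` returns a list of IPs, best first;
    `try_connect(ip, port)` returns True on a successful connection."""
    for _ in range(max_rounds):
        for ip in resolve(hostname):
            if try_connect(ip, port):
                return ip
    raise ConnectionError(f"no reachable endpoint for {hostname}")

# Example with canned data: the first (closest) endpoint is down,
# so the client falls through to the next A record.
records = ["104.197.117.10", "104.197.74.77", "104.197.38.157"]
up = {"104.197.74.77"}
ip = connect_with_failover(
    "my-service.my-namespace.my-federation.my-domain.com", 2379,
    resolve=lambda _h: records,
    try_connect=lambda ip, _p: ip in up,
)
print(ip)  # -> 104.197.74.77
```

A connection-terminating load balancer removes the need for any of this logic in the client, at the network cost noted above.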
+ +### Portability + +A Kubernetes application configuration (e.g. for a Pod, Replication +Controller, Service etc) should be able to be successfully deployed +into any Kubernetes Cluster or Ubernetes Federation of Clusters, +without modification. More specifically, a typical configuration +should work correctly (although possibly not optimally) across any of +the following environments: + +1. A single Kubernetes Cluster on one cloud provider (e.g. Google + Compute Engine, GCE) +1. A single Kubernetes Cluster on a different cloud provider + (e.g. Amazon Web Services, AWS) +1. A single Kubernetes Cluster on a non-cloud, on-premise data center +1. A Federation of Kubernetes Clusters all on the same cloud provider + (e.g. GCE) +1. A Federation of Kubernetes Clusters across multiple different cloud + providers and/or on-premise data centers (e.g. one cluster on + GCE/GKE, one on AWS, and one on-premise). + +### Trading Portability for Optimization + +It should be possible to explicitly opt out of portability across some +subset of the above environments in order to take advantage of +non-portable load balancing and DNS features of one or more +environments. More specifically, for example: + +1. For HTTP(S) applications running on GCE-only Federations, + [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + should be usable. These provide single, static global IP addresses + which load balance and fail over globally (i.e. across both regions + and zones). These allow for really dumb clients, but they only + work on GCE, and only for HTTP(S) traffic. +1. For non-HTTP(S) applications running on GCE-only Federations within + a single region, + [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + should be usable. These provide TCP (i.e. both HTTP/S and + non-HTTP/S) load balancing and failover, but only on GCE, and only + within a single region. 
+   [Google Cloud DNS](https://cloud.google.com/dns) can be used to
+   route traffic between regions (and between different cloud
+   providers and on-premise clusters, as it's plain DNS, IP only).
+1. For applications running on AWS-only Federations,
+   [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+   should be usable. These provide both L7 (HTTP(S)) and L4 load
+   balancing, but only within a single region, and only on AWS
+   ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
+   used to load balance and fail over across multiple regions, and is
+   also capable of resolving to non-AWS endpoints).
+
+## Component Cloud Services
+
+Ubernetes cross-cluster load balancing is built on top of the following:
+
+1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
+   provide single, static global IP addresses which load balance and
+   fail over globally (i.e. across both regions and zones). These
+   allow for really dumb clients, but they only work on GCE, and only
+   for HTTP(S) traffic.
+1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
+   provide both HTTP(S) and non-HTTP(S) load balancing and failover,
+   but only on GCE, and only within a single region.
+1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
+   provide both L7 (HTTP(S)) and L4 load balancing, but only within a
+   single region, and only on AWS.
+1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
+   programmable DNS service, like
+   [CloudFlare](http://www.cloudflare.com)) can be used to route
+   traffic between regions (and between different cloud providers and
+   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
+   doesn't provide any built-in geo-DNS, latency-based routing, health
+   checking, weighted round robin or other advanced capabilities.
+   It's plain old DNS.
We would need to build all the aforementioned
+   on top of it. It can provide internal DNS services (i.e. serve RFC
+   1918 addresses).
+   1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
+    be used to load balance and fail over across regions, and is also
+    capable of routing to non-AWS endpoints. It provides built-in
+    geo-DNS, latency-based routing, health checking, weighted
+    round robin and optional tight integration with some other
+    AWS services (e.g. Elastic Load Balancers).
+1. Kubernetes L4 Service Load Balancing: This provides both a
+   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
+   and a
+   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
+   service IP which is load-balanced (currently simple round-robin)
+   across the healthy pods comprising a service within a single
+   Kubernetes cluster.
+1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy.
+
+## Ubernetes API
+
+The Ubernetes API for load balancing should be compatible with the
+equivalent Kubernetes API, to ease porting of clients between
+Ubernetes and Kubernetes. Further details below.
+
+## Common Client Behavior
+
+To be useful, our load balancing solution needs to work properly with
+real client applications. There are a few different classes of
+those...
+
+### Browsers
+
+These are the most common external clients. These are all well-written. See below.
+
+### Well-written clients
+
+1. Do a DNS resolution every time they connect.
+1. Don't cache beyond TTL (although a small percentage of the DNS
+   servers on which they rely might).
+1. Do try multiple A records (in order) to connect.
+1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
+
+Examples:
+
++ all common browsers (except for SRV records)
++ ...
+
+### Dumb clients
+
+1. Don't do a DNS resolution every time they connect (or do cache
+   beyond the TTL).
+1. Do try multiple A records
+
+Examples:
+
++ ...
+
+### Dumber clients
+
+1. Only do a DNS lookup once on startup.
+1. Only try the first returned DNS A record.
+
+Examples:
+
++ ...
+
+### Dumbest clients
+
+1. Never do a DNS lookup - they are pre-configured with a single (or
+   possibly multiple) fixed server IP(s). Nothing else matters.
+
+## Architecture and Implementation
+
+### General control plane architecture
+
+Each cluster hosts one or more Ubernetes master components (Ubernetes API servers, controller managers with leader election, and
+etcd quorum members). This is documented in more detail in a
+[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
+
+In the description below, assume that 'n' clusters, named
+'cluster-1'... 'cluster-n', have been registered against an Ubernetes
+Federation "federation-1", each with their own set of Kubernetes API
+endpoints, e.g.
+[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
+[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
+... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).
+
+### Federated Services
+
+Ubernetes Services are pretty straightforward. They comprise
+multiple equivalent underlying Kubernetes Services, each with their
+own external endpoint, and a load balancing mechanism across them.
+Let's work through how exactly that works in practice.
+
+Our user creates the following Ubernetes Service (against an Ubernetes
+API endpoint):
+
+    $ kubectl create -f my-service.yaml --context="federation-1"
+
+where `my-service.yaml` contains the following:
+
+    kind: Service
+    metadata:
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+    spec:
+      ports:
+      - port: 2379
+        protocol: TCP
+        targetPort: 2379
+        name: client
+      - port: 2380
+        protocol: TCP
+        targetPort: 2380
+        name: peer
+      selector:
+        run: my-service
+      type: LoadBalancer
+
+Ubernetes in turn creates one equivalent service (identical config to
+the above) in each of the underlying Kubernetes clusters, each of
+which results in something like this:
+
+    $ kubectl get -o yaml --context="cluster-1" service my-service
+
+    apiVersion: v1
+    kind: Service
+    metadata:
+      creationTimestamp: 2015-11-25T23:35:25Z
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+      resourceVersion: "147365"
+      selfLink: /api/v1/namespaces/my-namespace/services/my-service
+      uid: 33bfc927-93cd-11e5-a38c-42010af00002
+    spec:
+      clusterIP: 10.0.153.185
+      ports:
+      - name: client
+        nodePort: 31333
+        port: 2379
+        protocol: TCP
+        targetPort: 2379
+      - name: peer
+        nodePort: 31086
+        port: 2380
+        protocol: TCP
+        targetPort: 2380
+      selector:
+        run: my-service
+      sessionAffinity: None
+      type: LoadBalancer
+    status:
+      loadBalancer:
+        ingress:
+        - ip: 104.197.117.10
+
+Similar services are created in `cluster-2` and `cluster-3`, each of
+which is allocated its own `spec.clusterIP` and
+`status.loadBalancer.ingress.ip`. 
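The per-cluster fan-out just described (one equivalent Service per underlying cluster, then a single DNS name spanning the resulting ingress IPs) can be sketched as a small reconciliation loop. The sketch below is illustrative Python, not Ubernetes code: `StubClusterClient` and `StubDNS` are hypothetical stand-ins for a per-cluster Kubernetes API client and a programmable-DNS API, and none of their method names are real library calls:

```python
class StubClusterClient:
    """Hypothetical per-cluster Kubernetes API client; 'creating' a
    service just returns that cluster's external LB ingress IP."""
    def __init__(self, ingress_ip):
        self.ingress_ip = ingress_ip

    def create_service(self, service_manifest):
        return self.ingress_ip

class StubDNS:
    """Hypothetical programmable-DNS wrapper (Cloud DNS, Route 53, ...)."""
    def __init__(self):
        self.records = {}

    def set_a_records(self, fqdn, ips):
        self.records[fqdn] = list(ips)

def reconcile_service(service, clients, dns, federation_domain):
    # Create one equivalent Service in every underlying cluster, then
    # publish all resulting ingress IPs under one federated DNS name.
    ips = [client.create_service(service) for client in clients.values()]
    fqdn = f"{service['name']}.{service['namespace']}.{federation_domain}"
    dns.set_a_records(fqdn, ips)
    return fqdn, ips

clients = {f"cluster-{i}": StubClusterClient(ip) for i, ip in
           enumerate(["104.197.117.10", "104.197.74.77", "104.197.38.157"], 1)}
dns = StubDNS()
fqdn, ips = reconcile_service(
    {"name": "my-service", "namespace": "my-namespace"},
    clients, dns, "my-federation.my-domain.com")
print(fqdn, ips)
```

This mirrors the `dig` output shown later in this section: three A records, one per cluster ingress IP, behind a single federated hostname.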
+ +In Ubernetes `federation-1`, the resulting federated service looks as follows: + + $ kubectl get -o yaml --context="federation-1" service my-service + + apiVersion: v1 + kind: Service + metadata: + creationTimestamp: 2015-11-25T23:35:23Z + labels: + run: my-service + name: my-service + namespace: my-namespace + resourceVersion: "157333" + selfLink: /api/v1/namespaces/my-namespace/services/my-service + uid: 33bfc927-93cd-11e5-a38c-42010af00007 + spec: + clusterIP: + ports: + - name: client + nodePort: 31333 + port: 2379 + protocol: TCP + targetPort: 2379 + - name: peer + nodePort: 31086 + port: 2380 + protocol: TCP + targetPort: 2380 + selector: + run: my-service + sessionAffinity: None + type: LoadBalancer + status: + loadBalancer: + ingress: + - hostname: my-service.my-namespace.my-federation.my-domain.com + +Note that the federated service: + +1. Is API-compatible with a vanilla Kubernetes service. +1. has no clusterIP (as it is cluster-independent) +1. has a federation-wide load balancer hostname + +In addition to the set of underlying Kubernetes services (one per +cluster) described above, Ubernetes has also created a DNS name +(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or +[AWS Route 53](https://aws.amazon.com/route53/), depending on +configuration) which provides load balancing across all of those +services. For example, in a very basic configuration: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +Each of the above IP addresses (which are just the external load +balancer ingress IP's of each cluster service) is of course load +balanced across the pods comprising the service in each cluster. + +In a more sophisticated configuration (e.g. 
on GCE or GKE), Ubernetes +automatically creates a +[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) +which exposes a single, globally load-balanced IP: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 + +Optionally, Ubernetes also configures the local DNS servers (SkyDNS) +in each Kubernetes cluster to preferentially return the local +clusterIP for the service in that cluster, with other clusters' +external service IP's (or a global load-balanced IP) also configured +for failover purposes: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +If Ubernetes Global Service Health Checking is enabled, multiple +service health checkers running across the federated clusters +collaborate to monitor the health of the service endpoints, and +automatically remove unhealthy endpoints from the DNS record (e.g. a +majority quorum is required to vote a service endpoint unhealthy, to +avoid false positives due to individual health checker network +isolation). + +### Federated Replication Controllers + +So far we have a federated service defined, with a resolvable load +balancer hostname by which clients can reach it, but no pods serving +traffic directed there. So now we need a Federated Replication +Controller. These are also fairly straight-forward, being comprised +of multiple underlying Kubernetes Replication Controllers which do the +hard work of keeping the desired number of Pod replicas alive in each +Kubernetes cluster. 
+
+    $ kubectl create -f my-service-rc.yaml --context="federation-1"
+
+where `my-service-rc.yaml` contains the following:
+
+    kind: ReplicationController
+    metadata:
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+    spec:
+      replicas: 6
+      selector:
+        run: my-service
+      template:
+        metadata:
+          labels:
+            run: my-service
+        spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+
+Ubernetes in turn creates one equivalent replication controller
+(identical config to the above, except for the replica count) in each
+of the underlying Kubernetes clusters, each of which results in
+something like this:
+
+    $ ./kubectl get -o yaml rc my-service --context="cluster-1"
+    kind: ReplicationController
+    metadata:
+      creationTimestamp: 2015-12-02T23:00:47Z
+      labels:
+        run: my-service
+      name: my-service
+      namespace: my-namespace
+      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
+      uid: 86542109-9948-11e5-a38c-42010af00002
+    spec:
+      replicas: 2
+      selector:
+        run: my-service
+      template:
+        metadata:
+          labels:
+            run: my-service
+        spec:
+          containers:
+          - image: gcr.io/google_samples/my-service:v1
+            name: my-service
+            ports:
+            - containerPort: 2379
+              protocol: TCP
+            - containerPort: 2380
+              protocol: TCP
+            resources: {}
+          dnsPolicy: ClusterFirst
+          restartPolicy: Always
+    status:
+      replicas: 2
+
+The exact number of replicas created in each underlying cluster will
+of course depend on what scheduling policy is in force. In the above
+example, the scheduler created an equal number of replicas (2) in each
+of the three underlying clusters, to make up the total of 6 replicas
+required. To handle entire cluster failures, various approaches are possible,
+including:
+
+1. **simple overprovisioning**, such that sufficient replicas remain even if a
+   cluster fails. This wastes some resources, but is simple and
+   reliable.
+2. 
**pod autoscaling**, where the replication controller in each
+   cluster automatically and autonomously increases the number of
+   replicas in its cluster in response to the additional traffic
+   diverted from the
+   failed cluster. This saves resources and is relatively simple,
+   but there is some delay in the autoscaling.
+3. **federated replica migration**, where the Ubernetes Federation
+   Control Plane detects the cluster failure and automatically
+   increases the replica count in the remaining clusters to make up
+   for the lost replicas in the failed cluster. This does not seem to
+   offer any benefits relative to pod autoscaling above, and is
+   arguably more complex to implement, but we note it here as a
+   possibility.
+
+### Implementation Details
+
+The implementation approach and architecture are very similar to
+Kubernetes, so if you're familiar with how Kubernetes works, none of
+what follows will be surprising. One additional design driver not
+present in Kubernetes is that Ubernetes aims to be resilient to
+individual cluster and availability zone failures. So the control
+plane spans multiple clusters. More specifically:
+
++ Ubernetes runs its own distinct set of API servers (typically one
+  or more per underlying Kubernetes cluster). These are completely
+  distinct from the Kubernetes API servers for each of the underlying
+  clusters.
++ Ubernetes runs its own distinct quorum-based metadata store (etcd,
+  by default). Approximately 1 quorum member runs in each underlying
+  cluster ("approximately" because we aim for an odd number of quorum
+  members, and typically don't want more than 5 quorum members, even
+  if we have a larger number of federated clusters, so 2 clusters->3
+  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
+
+Cluster Controllers in Ubernetes watch against the Ubernetes API
+server/etcd state, and apply changes to the underlying kubernetes
+clusters accordingly. 
They also have the anti-entropy mechanism for +reconciling ubernetes "desired desired" state against kubernetes +"actual desired" state. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]() + diff --git a/federation-phase-1.md b/federation-phase-1.md new file mode 100644 index 00000000..baf1e472 --- /dev/null +++ b/federation-phase-1.md @@ -0,0 +1,434 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
+
+
+
+# Ubernetes Design Spec (phase one)
+
+**Huawei PaaS Team**
+
+## INTRODUCTION
+
+In this document we propose a design for the “Control Plane” of
+Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
+this work please refer to
+[this proposal](../../docs/proposals/federation.md).
+The document is arranged as follows. First we briefly list scenarios
+and use cases that motivate K8S federation work. These use cases drive
+the design and they also verify the design. We summarize the
+functionality requirements from these use cases, and define the “in
+scope” functionalities that will be covered by this design (phase
+one). After that we give an overview of the proposed architecture, API
+and building blocks. We also go through several activity flows to
+see how these building blocks work together to support use cases.
+
+## REQUIREMENTS
+
+There are many reasons why customers may want to build a K8S
+federation:
+
++ **High Availability:** Customers want to be immune to the outage of
+  a single availability zone, region or even a cloud provider.
++ **Sensitive workloads:** Some workloads can only run on a particular
+  cluster. They cannot be scheduled to or migrated to other clusters.
++ **Capacity overflow:** Customers prefer to run workloads on a
+  primary cluster. But if the capacity of the cluster is not
+  sufficient, workloads should be automatically distributed to other
+  clusters.
++ **Vendor lock-in avoidance:** Customers want to spread their
+  workloads on different cloud providers, and can easily increase or
+  decrease the workload proportion of a specific provider.
++ **Cluster Size Enhancement:** Currently a K8S cluster can only support
+a limited size. 
While the community is actively improving it, it can
+be expected that cluster size will be a problem if K8S is used for
+large workloads or public PaaS infrastructure. While we can separate
+different tenants to different clusters, it would be good to have a
+unified view.
+
+Here are the functionality requirements derived from the above use cases:
+
++ Clients of the federation control plane API server can register and deregister clusters.
++ Workloads should be spread to different clusters according to the
+  workload distribution policy.
++ Pods are able to discover and connect to services hosted in other
+  clusters (in cases where inter-cluster networking is necessary,
+  desirable and implemented).
++ Traffic to these pods should be spread across clusters (in a manner
+  similar to load balancing, although it might not be strictly
+  speaking balanced).
++ The control plane needs to know when a cluster is down, and migrate
+  the workloads to other clusters.
++ Clients have a unified view and a central control point for the above
+  activities.
+
+## SCOPE
+
+It’s difficult to have a perfect design in one pass that implements
+all the above requirements. Therefore we will go with an iterative
+approach to design and build the system. This document describes
+phase one of the whole work. In phase one we will cover only the
+following objectives:
+
++ Define the basic building blocks and API objects of the control plane
++ Implement a basic end-to-end workflow
+  + Clients register federated clusters
+  + Clients submit a workload
+  + The workload is distributed to different clusters
+  + Service discovery
+  + Load balancing
+
+The following parts are NOT covered in phase one:
+
++ Authentication and authorization (other than basic client
+  authentication against the ubernetes API, and from ubernetes control
+  plane to the underlying kubernetes clusters). 
++ Deployment units other than replication controller and service
++ Complex distribution policy of workloads
++ Service affinity and migration
+
+## ARCHITECTURE
+
+The overall architecture of a control plane is shown as follows:
+
+![Ubernetes Architecture](ubernetes-design.png)
+
+Some design principles we are following in this architecture:
+
+1. Keep the underlying K8S clusters independent. They should have no
+   knowledge of the control plane or of each other.
+1. Keep the Ubernetes API interface compatible with the K8S API as much as
+   possible.
+1. Re-use concepts from K8S as much as possible. This reduces
+customers’ learning curve and is good for adoption. Below is a brief
+description of each module contained in the above diagram.
+
+## Ubernetes API Server
+
+The API Server in the Ubernetes control plane works just like the API
+Server in K8S. It talks to a distributed key-value store to persist,
+retrieve and watch API objects. This store is completely distinct
+from the kubernetes key-value stores (etcd) in the underlying
+kubernetes clusters. We still use `etcd` as the distributed
+storage so customers don’t need to learn and manage a different
+storage system, although it is envisaged that other storage systems
+(Consul, ZooKeeper) will probably be developed and supported over
+time.
+
+## Ubernetes Scheduler
+
+The Ubernetes Scheduler schedules resources onto the underlying
+Kubernetes clusters. For example it watches for unscheduled Ubernetes
+replication controllers (those that have not yet been scheduled onto
+underlying Kubernetes clusters) and performs the global scheduling
+work. For each unscheduled replication controller, it calls the policy
+engine to decide how to split workloads among clusters. It creates a
+Kubernetes Replication Controller on one or more underlying clusters,
+and posts them back to `etcd` storage. 
+
+One subtlety worth noting here is that the scheduling decision is
+arrived at by combining the application-specific request from the user (which might
+include, for example, placement constraints), and the global policy specified
+by the federation administrator (for example, "prefer on-premise
+clusters over AWS clusters" or "spread load equally across clusters").
+
+## Ubernetes Cluster Controller
+
+The cluster controller
+performs the following two kinds of work:
+
+1. It watches all the sub-resources that are created by Ubernetes
+   components, like a sub-RC or a sub-service. It then creates the
+   corresponding API objects on the underlying K8S clusters.
+1. It periodically retrieves the available resource metrics from the
+   underlying K8S cluster, and updates them as object status of the
+   `cluster` API object. An alternative design might be to run a pod
+   in each underlying cluster that reports metrics for that cluster to
+   the Ubernetes control plane. Which approach is better remains an
+   open topic of discussion.
+
+## Ubernetes Service Controller
+
+The Ubernetes service controller is a federation-level implementation
+of the K8S service controller. It watches service resources created on
+the control plane, and creates corresponding K8S services on each involved K8S
+cluster. Besides interacting with service resources on each
+individual K8S cluster, the Ubernetes service controller also
+performs some global DNS registration work.
+
+## API OBJECTS
+
+## Cluster
+
+Cluster is a new first-class API object introduced in this design. For
+each registered K8S cluster there will be such an API resource in
+the control plane. The way clients register or deregister a cluster is to
+send corresponding REST requests to the following URL:
+`/api/{$version}/clusters`. Because the control plane is behaving like a
+regular K8S client to the underlying clusters, the spec of a cluster
+object contains necessary properties like the K8S cluster address and
+credentials. 
The status of a cluster API object will contain
+the following information:
+
+1. The phase of its lifecycle
+1. Cluster resource metrics for scheduling decisions.
+1. Other metadata, like the version of the cluster
+
+$version.clusterSpec
+
+| Name | Description | Required | Schema | Default |
+| ---- | ----------- | -------- | ------ | ------- |
+| Address | address of the cluster | yes | address | |
+| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |
+
+$version.clusterStatus
+
+| Name | Description | Required | Schema | Default |
+| ---- | ----------- | -------- | ------ | ------- |
+| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
+| Capacity | represents the available resources of a cluster | yes | any | |
+| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |
+
+ +**For simplicity we didn’t introduce a separate “cluster metrics” API +object here**. The cluster resource metrics are stored in cluster +status section, just like what we did to nodes in K8S. In phase one it +only contains available CPU resources and memory resources. The +cluster controller will periodically poll the underlying cluster API +Server to get cluster capability. In phase one it gets the metrics by +simply aggregating metrics from all nodes. In future we will improve +this with more efficient ways like leveraging heapster, and also more +metrics will be supported. Similar to node phases in K8S, the “phase” +field includes following values: + ++ pending: newly registered clusters or clusters suspended by admin + for various reasons. They are not eligible for accepting workloads ++ running: clusters in normal status that can accept workloads ++ offline: clusters temporarily down or not reachable ++ terminated: clusters removed from federation + +Below is the state transition diagram. + +![Cluster State Transition Diagram](ubernetes-cluster-state.png) + +## Replication Controller + +A global workload submitted to control plane is represented as an +Ubernetes replication controller. When a replication controller +is submitted to control plane, clients need a way to express its +requirements or preferences on clusters. Depending on different use +cases it may be complex. For example: + ++ This workload can only be scheduled to cluster Foo. It cannot be + scheduled to any other clusters. (use case: sensitive workloads). ++ This workload prefers cluster Foo. But if there is no available + capacity on cluster Foo, it’s OK to be scheduled to cluster Bar + (use case: workload ) ++ Seventy percent of this workload should be scheduled to cluster Foo, + and thirty percent should be scheduled to cluster Bar (use case: + vendor lock-in avoidance). In phase one, we only introduce a + _clusterSelector_ field to filter acceptable clusters. 
By default
+there is no such selector, which means any cluster is
+acceptable.
+
+Below is a sample of the YAML to create such a replication controller.
+
+```
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: nginx-controller
+spec:
+  replicas: 5
+  selector:
+    app: nginx
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: nginx
+        ports:
+        - containerPort: 80
+  clusterSelector:
+    name in (Foo, Bar)
+```
+
+Currently clusterSelector (implemented as a
+[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
+only supports a simple list of acceptable clusters. Workloads will be
+evenly distributed across these acceptable clusters in phase one.
+After phase one we will define syntax to represent more advanced
+constraints, like cluster preference ordering, the desired number of
+split workloads, the desired ratio of workloads spread across
+different clusters, etc.
+
+Besides this explicit “clusterSelector” filter, a workload may have
+some implicit scheduling restrictions. For example, it may define a
+“nodeSelector” that can only be satisfied on some particular
+clusters. How to handle this will be addressed after phase one.
+
+## Ubernetes Services
+
+The Service API object exposed by Ubernetes is similar to service
+objects on Kubernetes. It defines the access to a group of pods. The
+Ubernetes service controller will create corresponding Kubernetes
+service objects on underlying clusters. These are detailed in a
+separate design document: [Federated Services](federated-services.md).
+
+## Pod
+
+In phase one we only support scheduling replication controllers. Pod
+scheduling will be supported in a later phase. This is primarily in
+order to keep the Ubernetes API compatible with the Kubernetes API.
+
+## ACTIVITY FLOWS
+
+## Scheduling
+
+The diagram below shows how workloads are scheduled on the Ubernetes control plane:
+
+1. A replication controller is created by the client.
+1.
APIServer persists it into storage.
+1. The cluster controller periodically polls the latest available resource
+   metrics from the underlying clusters.
+1. The scheduler watches all pending RCs. It picks up an RC, makes
+   policy-driven decisions, and splits it into different sub-RCs.
+1. Each cluster controller watches the sub-RCs bound to its
+   corresponding cluster. It picks up the newly created sub-RC.
+1. The cluster controller issues requests to the underlying cluster
+API Server to create the RC. In phase one we don’t support complex
+distribution policies. The scheduling rule is basically:
+   1. If an RC does not specify any nodeSelector, it will be scheduled
+      to the least loaded K8S cluster(s) that have enough available
+      resources.
+   1. If an RC specifies _N_ acceptable clusters in the
+      clusterSelector, all replicas will be evenly distributed among
+      these clusters.
+
+There is a potential race condition here. Say at time _T1_ the control
+plane learns there are _m_ available resources in a K8S cluster. As
+the cluster is working independently, it still accepts workload
+requests from other K8S clients or even another Ubernetes control
+plane. The Ubernetes scheduling decision is based on this snapshot of
+available resources. However, when the actual RC creation happens on
+the cluster at time _T2_, the cluster may not have enough resources
+at that time. We will address this problem in later phases with
+proposed solutions such as resource reservation mechanisms.
+
+![Ubernetes Scheduling](ubernetes-scheduling.png)
+
+## Service Discovery
+
+This part is covered in the section “Federated Service” of the
+document
+“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
+refer to that document for details.
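The phase-one scheduling rule above (filter clusters with the clusterSelector, then spread replicas evenly) can be sketched as follows. This is a hypothetical illustration, not the actual Ubernetes scheduler code; the function names, the remainder-handling order, and the representation of clusters as plain names are all invented for this sketch.

```python
def select_clusters(clusters, acceptable_names):
    """Keep only clusters named in the clusterSelector's `name in (...)` list.
    An empty selector means every cluster is acceptable."""
    if not acceptable_names:
        return list(clusters)
    return [c for c in clusters if c in acceptable_names]


def split_replicas(replicas, clusters):
    """Evenly split an RC's replica count into per-cluster sub-RC sizes.
    Any remainder goes to the earliest clusters, one extra replica each."""
    base, extra = divmod(replicas, len(clusters))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(clusters)}


# 5 replicas, clusterSelector `name in (Foo, Bar)`, three registered clusters:
targets = select_clusters(["Foo", "Bar", "Baz"], {"Foo", "Bar"})
print(split_replicas(5, targets))  # prints {'Foo': 3, 'Bar': 2}
```

A real scheduler would additionally weigh the polled cluster metrics (the least-loaded rule) before choosing where the remainder lands; here the split depends only on list order.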
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]() + diff --git a/ubernetes-cluster-state.png b/ubernetes-cluster-state.png new file mode 100644 index 00000000..56ec2df8 Binary files /dev/null and b/ubernetes-cluster-state.png differ diff --git a/ubernetes-design.png b/ubernetes-design.png new file mode 100644 index 00000000..44924846 Binary files /dev/null and b/ubernetes-design.png differ diff --git a/ubernetes-scheduling.png b/ubernetes-scheduling.png new file mode 100644 index 00000000..01774882 Binary files /dev/null and b/ubernetes-scheduling.png differ -- cgit v1.2.3 From 77ef1108ced94037f20b47f5544fdfeba88d5497 Mon Sep 17 00:00:00 2001 From: Vladimir Rutsky Date: Fri, 4 Mar 2016 14:51:06 +0300 Subject: add missing comma in JSON --- namespaces.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/namespaces.md b/namespaces.md index 45e07f72..73ebc7b1 100644 --- a/namespaces.md +++ b/namespaces.md @@ -294,7 +294,7 @@ User deletes the Namespace in Kubernetes, and Namespace now has following state: "kind": "Namespace", "metadata": { "name": "development", - "deletionTimestamp": "..." + "deletionTimestamp": "...", "labels": { "name": "development" } @@ -319,7 +319,7 @@ removing *kubernetes* from the list of finalizers: "kind": "Namespace", "metadata": { "name": "development", - "deletionTimestamp": "..." + "deletionTimestamp": "...", "labels": { "name": "development" } @@ -347,7 +347,7 @@ This results in the following state: "kind": "Namespace", "metadata": { "name": "development", - "deletionTimestamp": "..." 
+ "deletionTimestamp": "...", "labels": { "name": "development" } -- cgit v1.2.3 From 90806da373f288e3281b612ef9ca6455078763ee Mon Sep 17 00:00:00 2001 From: mdshuai Date: Mon, 7 Mar 2016 10:58:27 +0800 Subject: Update configmaps doc charaters error --- configmap.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configmap.md b/configmap.md index aceb3342..7c337d13 100644 --- a/configmap.md +++ b/configmap.md @@ -222,7 +222,7 @@ kind: ConfigMap metadata: name: etcd-env-config data: - number-of-members: 1 + number-of-members: "1" initial-cluster-state: new initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN -- cgit v1.2.3 From b00a17a66ea36282baf8189c8f836b54049b28db Mon Sep 17 00:00:00 2001 From: mdshuai Date: Mon, 7 Mar 2016 14:39:05 +0800 Subject: Update configmap design doc --- configmap.md | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/configmap.md b/configmap.md index 7c337d13..cf9c6d65 100644 --- a/configmap.md +++ b/configmap.md @@ -154,15 +154,15 @@ package api type EnvVarSource struct { // other fields omitted - // Specifies a ConfigMap key - ConfigMap *ConfigMapSelector `json:"configMap,omitempty"` + // Selects a key of a ConfigMap. + ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` } -// ConfigMapSelector selects a key of a ConfigMap. -type ConfigMapSelector struct { - // The name of the ConfigMap to select a key from. - ConfigMapName string `json:"configMapName"` - // The key of the ConfigMap to select. +// Selects a key from a ConfigMap. +type ConfigMapKeySelector struct { + // The ConfigMap to select from. + LocalObjectReference `json:",inline"` + // The key to select. 
Key string `json:"key"` } ``` @@ -249,28 +249,28 @@ spec: env: - name: ETCD_NUM_MEMBERS valueFrom: - configMap: - configMapName: etcd-env-config + configMapKeyRef: + name: etcd-env-config key: number-of-members - name: ETCD_INITIAL_CLUSTER_STATE valueFrom: - configMap: - configMapName: etcd-env-config + configMapKeyRef: + name: etcd-env-config key: initial-cluster-state - name: ETCD_DISCOVERY_TOKEN valueFrom: - configMap: - configMapName: etcd-env-config + configMapKeyRef: + name: etcd-env-config key: discovery-token - name: ETCD_DISCOVERY_URL valueFrom: - configMap: - configMapName: etcd-env-config + configMapKeyRef: + name: etcd-env-config key: discovery-url - name: ETCDCTL_PEERS valueFrom: - configMap: - configMapName: etcd-env-config + configMapKeyRef: + name: etcd-env-config key: etcdctl-peers ``` @@ -279,12 +279,12 @@ spec: `redis-volume-config` is intended to be used as a volume containing a config file: ```yaml -apiVersion: extensions/v1beta1 +apiVersion: v1 kind: ConfigMap metadata: name: redis-volume-config data: - redis.conf: "pidfile /var/run/redis.pid\nport6379\ntcp-backlog 511\n databases 1\ntimeout 0\n" + redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n" ``` The following pod consumes the `redis-volume-config` in a volume: @@ -298,7 +298,7 @@ spec: containers: - name: redis image: kubernetes/redis - command: "redis-server /mnt/config-map/etc/redis.conf" + command: ["redis-server", "/mnt/config-map/etc/redis.conf"] ports: - containerPort: 6379 volumeMounts: -- cgit v1.2.3 From 12a89672f5e9c908eba2fbfd93c1bc9fb27dc0bb Mon Sep 17 00:00:00 2001 From: David McMahon Date: Tue, 8 Mar 2016 18:06:40 -0800 Subject: Update the latestReleaseBranch to release-1.2 in the munger. 
--- README.md | 2 +- access.md | 2 +- admission_control.md | 2 +- admission_control_limit_range.md | 2 +- admission_control_resource_quota.md | 2 +- architecture.md | 2 +- aws_under_the_hood.md | 5 +++++ clustering.md | 2 +- clustering/README.md | 2 +- command_execution_port_forwarding.md | 2 +- configmap.md | 5 +++++ daemon.md | 2 +- enhance-pluggable-policy.md | 5 +++++ event_compression.md | 2 +- expansion.md | 2 +- extending-api.md | 2 +- horizontal-pod-autoscaler.md | 2 +- identifiers.md | 2 +- indexed-job.md | 5 +++++ metadata-policy.md | 5 +++++ namespaces.md | 2 +- networking.md | 2 +- nodeaffinity.md | 5 +++++ persistent-storage.md | 2 +- podaffinity.md | 5 +++++ principles.md | 2 +- resources.md | 2 +- scheduler_extender.md | 5 +++++ secrets.md | 2 +- security.md | 2 +- security_context.md | 2 +- selector-generation.md | 5 +++++ service_accounts.md | 2 +- simple-rolling-update.md | 2 +- taint-toleration-dedicated.md | 5 +++++ versioning.md | 2 +- 36 files changed, 76 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index e7beb90b..ac32e59e 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/README.md). +[here](http://releases.k8s.io/release-1.2/docs/design/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/access.md b/access.md index fa173392..d7b6b8ec 100644 --- a/access.md +++ b/access.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/access.md). +[here](http://releases.k8s.io/release-1.2/docs/design/access.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/admission_control.md b/admission_control.md index 37cf5e1f..d85f1bfa 100644 --- a/admission_control.md +++ b/admission_control.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/admission_control.md). +[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 890ba37d..26b424f4 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/admission_control_limit_range.md). +[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 2b01ea7e..99e92b8f 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md). +[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/architecture.md b/architecture.md index 93213066..f3ff5e2f 100644 --- a/architecture.md +++ b/architecture.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/architecture.md). 
+[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 019b07d6..da24fb19 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering.md b/clustering.md index 01df7410..fbf6892c 100644 --- a/clustering.md +++ b/clustering.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/clustering.md). +[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering/README.md b/clustering/README.md index 6f3d379c..3bfd1905 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/clustering/README.md). +[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 89ed7665..d687f3e2 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. 
The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/command_execution_port_forwarding.md). +[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/configmap.md b/configmap.md index 7c337d13..72bb4415 100644 --- a/configmap.md +++ b/configmap.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/daemon.md b/daemon.md index 2c393374..a08e4c3b 100644 --- a/daemon.md +++ b/daemon.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/daemon.md). +[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 9cdd9a2d..4eef8831 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/event_compression.md b/event_compression.md index 3a05db21..b94d6560 100644 --- a/event_compression.md +++ b/event_compression.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. 
The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/event_compression.md). +[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/expansion.md b/expansion.md index 371f7c86..9012b2c5 100644 --- a/expansion.md +++ b/expansion.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/expansion.md). +[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/extending-api.md b/extending-api.md index 5f5e6c0a..ee53a7d6 100644 --- a/extending-api.md +++ b/extending-api.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/extending-api.md). +[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index 7c54da06..a5969c01 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/horizontal-pod-autoscaler.md). +[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/identifiers.md b/identifiers.md index ca2c95df..fc5e6925 100644 --- a/identifiers.md +++ b/identifiers.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. 
The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/identifiers.md). +[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/indexed-job.md b/indexed-job.md index b928f722..b4d06dde 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/metadata-policy.md b/metadata-policy.md index 1d02fcf4..090241d4 100644 --- a/metadata-policy.md +++ b/metadata-policy.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/namespaces.md b/namespaces.md index 73ebc7b1..e2a532b2 100644 --- a/namespaces.md +++ b/namespaces.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/namespaces.md). +[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/networking.md b/networking.md index e5807b50..711a709a 100644 --- a/networking.md +++ b/networking.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/networking.md). 
+[here](http://releases.k8s.io/release-1.2/docs/design/networking.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/nodeaffinity.md b/nodeaffinity.md index a8ee2a18..dda04a51 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/persistent-storage.md b/persistent-storage.md index 5db565c7..4c3b08e6 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/persistent-storage.md). +[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/podaffinity.md b/podaffinity.md index 4e30303b..e0245a52 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/principles.md b/principles.md index 52b839fb..5e0e8252 100644 --- a/principles.md +++ b/principles.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/principles.md). +[here](http://releases.k8s.io/release-1.2/docs/design/principles.md). 
Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/resources.md b/resources.md index 069ddd6c..6a7ee449 100644 --- a/resources.md +++ b/resources.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/resources.md). +[here](http://releases.k8s.io/release-1.2/docs/design/resources.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/scheduler_extender.md b/scheduler_extender.md index 3a55139d..8612c39c 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/secrets.md b/secrets.md index a9941cb3..f73c1c22 100644 --- a/secrets.md +++ b/secrets.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/secrets.md). +[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/security.md b/security.md index db380250..b9c7942a 100644 --- a/security.md +++ b/security.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/security.md). +[here](http://releases.k8s.io/release-1.2/docs/design/security.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/security_context.md b/security_context.md index 8b9b8c12..24a34878 100644 --- a/security_context.md +++ b/security_context.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/security_context.md). +[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/selector-generation.md b/selector-generation.md index 3ada304c..28db17fc 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/service_accounts.md b/service_accounts.md index 72c3df81..445de310 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/service_accounts.md). +[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/simple-rolling-update.md b/simple-rolling-update.md index e34e695c..0ac77d23 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/simple-rolling-update.md). +[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md). 
Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index 7eb37da9..5b4ebcb6 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/versioning.md b/versioning.md index 99caa6e6..15102178 100644 --- a/versioning.md +++ b/versioning.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.1/docs/design/versioning.md). +[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). -- cgit v1.2.3 From 2bc06274dc8bcd1451e92c45e9b861671cbd9cfb Mon Sep 17 00:00:00 2001 From: Michail Kargakis Date: Sat, 12 Mar 2016 12:53:11 +0100 Subject: docs: pod affinity proposal fix --- podaffinity.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/podaffinity.md b/podaffinity.md index e0245a52..1a2da4af 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -189,10 +189,10 @@ PodAntiAffinity { ``` Then when scheduling pod P, the scheduler -* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key "node" and value specifying their node name.) -* Should try to schedule P onto zones that are running pods that satisfy `P3`. (Assumes all nodes have a label with key "zone" and value specifying their zone.) -* Cannot schedule P onto any racks that are running pods that satisfy `P2`. 
(Assumes all nodes have a label with key "rack" and value specifying their rack name.) -* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key "power" and value specifying their power domain.) +* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key `node` and value specifying their node name.) +* Should try to schedule P onto zones that are running pods that satisfy `P2`. (Assumes all nodes have a label with key `zone` and value specifying their zone.) +* Cannot schedule P onto any racks that are running pods that satisfy `P3`. (Assumes all nodes have a label with key `rack` and value specifying their rack name.) +* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key `power` and value specifying their power domain.) When `RequiredDuringScheduling` has multiple elements, the requirements are ANDed. For `PreferredDuringScheduling` the weights are added for the terms that are satisfied for each node, and -- cgit v1.2.3 From 3f92a009dae62fcb572fcde63e5d967c617d3150 Mon Sep 17 00:00:00 2001 From: yeasy Date: Tue, 15 Mar 2016 15:19:15 +0800 Subject: Remove duplicated words --- access.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/access.md b/access.md index d7b6b8ec..dafceb6a 100644 --- a/access.md +++ b/access.md @@ -181,7 +181,7 @@ Improvements: ### Namespaces -K8s will have a have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies. +K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. 
It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies. Namespaces are described in [namespaces.md](namespaces.md). -- cgit v1.2.3 From c2b9df1eabdcf677bd7833f46a90db0ae6bec629 Mon Sep 17 00:00:00 2001 From: David McMahon Date: Tue, 19 Jan 2016 15:16:02 -0800 Subject: Add section on branched patch releases. Some minor formatting changes. Add semver.org reference to file. Ref #19849 --- versioning.md | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/versioning.md b/versioning.md index 99caa6e6..f0a6b9cc 100644 --- a/versioning.md +++ b/versioning.md @@ -34,19 +34,37 @@ Documentation for other releases can be found at # Kubernetes API and Release Versioning +Reference: [Semantic Versioning](http://semver.org) + Legend: -* **Kube X.Y.Z** refers to the version of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the major version, **Y** is the minor version, and **Z** is the patch version.) +* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the major version, **Y** is the minor version, and **Z** is the patch version.) * **API vX[betaY]** refers to the version of the HTTP API. ## Release versioning ### Minor version scheme and timeline -* Kube X.Y.0-alpha.W, W > 0: Alpha releases are released roughly every two weeks directly from the master branch. No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule. -* Kube X.Y.Z-beta.W: When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. 
This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, (X.Y.0-beta.W | W > 0) as necessary.
-* Kube X.Y.0: Final release, cut from the release-X.Y branch cut two weeks prior. X.Y.1-beta.0 will be tagged at the same commit on the same branch. X.Y.0 occur 3 to 4 months after X.Y-1.0.
-* Kube X.Y.Z, Z > 0: [Patch releases](#patch-releases) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the same commit.
+* Kube X.Y.0-alpha.W, W > 0 (Branch: master)
+  * Alpha releases are released roughly every two weeks directly from the master branch.
+  * No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule.
+* Kube X.Y.Z-beta.W (Branch: release-X.Y)
+  * When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y.
+  * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0.
+  * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, (X.Y.0-beta.W | W > 0) as necessary.
+* Kube X.Y.0 (Branch: release-X.Y)
+  * Final release, cut from the release-X.Y branch that was cut two weeks prior.
+  * X.Y.1-beta.0 will be tagged at the same commit on the same branch.
+  * X.Y.0 occurs 3 to 4 months after X.(Y-1).0.
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y)
+  * [Patch releases](#patch-releases) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed.
+  * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the followup commit that updates pkg/version/base.go with the beta version.
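The tag ordering this scheme relies on follows Semantic Versioning precedence: a pre-release such as X.Y.0-beta.0 sorts before the final X.Y.0. A small illustrative Python sketch of that ordering (not actual project release tooling):

```python
import re

def parse_tag(tag):
    """Turn a Kubernetes-style tag like 'v1.2.0-beta.1' into a sortable key.
    Per semver precedence, a pre-release (alpha < beta) sorts before the
    final X.Y.Z release."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-(alpha|beta)\.(\d+))?", tag)
    if m is None:
        raise ValueError("unrecognized tag: " + tag)
    x, y, z = (int(m.group(i)) for i in (1, 2, 3))
    if m.group(4) is None:
        pre = (2, 0)  # a final release outranks any alpha/beta pre-release
    else:
        pre = ({"alpha": 0, "beta": 1}[m.group(4)], int(m.group(5)))
    return (x, y, z) + pre

tags = ["v1.2.0", "v1.2.1-beta.0", "v1.2.0-beta.1", "v1.2.0-alpha.3", "v1.2.0-beta.0"]
ordered = sorted(tags, key=parse_tag)
```

Sorting with this key places all alpha tags before the betas, the betas before the final X.Y.0, and X.Y.1-beta.0 (same commit as X.Y.0) after it.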
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z) + * These are special and different in that the X.Y.Z tag is branched to isolate the emergency/critical fix from all other changes that have landed on the release branch since the previous tag + * Cut release-X.Y.Z branch to hold the isolated patch release + * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1) + * Branched [patch releases](#patch-releases) are rarely needed but used for emergency/critical fixes to the latest release + * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed for this kind of release to be possible. ### Major version timeline -- cgit v1.2.3 From 3dc5930b7ffcb40de988b3d72cc9086e65ca296e Mon Sep 17 00:00:00 2001 From: saadali Date: Tue, 22 Mar 2016 22:12:21 -0700 Subject: Rename volume.Builder to Mounter and volume.Cleaner to Unmounter --- secrets.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/secrets.md b/secrets.md index f73c1c22..a403ce4f 100644 --- a/secrets.md +++ b/secrets.md @@ -391,11 +391,11 @@ type Host interface { The secret volume plugin will be responsible for: -1. Returning a `volume.Builder` implementation from `NewBuilder` that: +1. Returning a `volume.Mounter` implementation from `NewMounter` that: 1. Retrieves the secret data for the volume from the API server 2. Places the secret data onto the container's filesystem 3. Sets the correct security attributes for the volume based on the pod's `SecurityContext` -2. Returning a `volume.Cleaner` implementation from `NewClear` that cleans the volume from the +2. 
Returning a `volume.Unmounter` implementation from `NewUnmounter` that cleans the volume from the container's filesystem ### Kubelet: Node-level secret storage -- cgit v1.2.3 From d2ab00b82036d3396df1e51f5a43ff4755f8f915 Mon Sep 17 00:00:00 2001 From: mikebrow Date: Wed, 13 Apr 2016 19:55:22 -0500 Subject: address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree Signed-off-by: mikebrow --- README.md | 66 +++- access.md | 291 +++++++++----- admission_control.md | 38 +- admission_control_limit_range.md | 67 ++-- admission_control_resource_quota.md | 91 +++-- architecture.md | 61 ++- aws_under_the_hood.md | 182 +++++---- clustering.md | 116 ++++-- clustering/README.md | 17 +- command_execution_port_forwarding.md | 53 +-- configmap.md | 122 +++--- control-plane-resilience.md | 91 ++--- daemon.md | 154 ++++++-- enhance-pluggable-policy.md | 217 +++++++---- event_compression.md | 132 +++++-- expansion.md | 282 +++++++------- extending-api.md | 111 +++--- federated-services.md | 173 ++++----- federation-phase-1.md | 38 +- horizontal-pod-autoscaler.md | 184 +++++---- identifiers.md | 108 +++--- indexed-job.md | 424 ++++++++++---------- metadata-policy.md | 72 ++-- namespaces.md | 180 +++++---- networking.md | 40 +- nodeaffinity.md | 165 ++++---- persistent-storage.md | 96 +++-- podaffinity.md | 723 +++++++++++++++++++---------------- principles.md | 77 +++- resources.md | 249 +++++++++--- scheduler_extender.md | 37 +- secrets.md | 399 ++++++++++--------- security.md | 202 +++++++--- security_context.md | 109 ++++-- selector-generation.md | 147 ++++--- service_accounts.md | 212 +++++----- simple-rolling-update.md | 77 ++-- taint-toleration-dedicated.md | 306 ++++++++------- versioning.md | 136 +++++-- 39 files changed, 3807 insertions(+), 2438 deletions(-) diff --git a/README.md b/README.md index ac32e59e..2f1de058 100644 --- a/README.md +++ b/README.md @@ -34,19 +34,59 @@ Documentation for other releases can be found at # Kubernetes 
Design Overview -Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. - -Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration. - -Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways. - -Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary. - -Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts. - -A single Kubernetes cluster is not intended to span multiple availability zones. 
Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md) for more details). - -Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner. +Kubernetes is a system for managing containerized applications across multiple +hosts, providing basic mechanisms for deployment, maintenance, and scaling of +applications. + +Kubernetes establishes robust declarative primitives for maintaining the desired +state requested by the user. We see these primitives as the main value added by +Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and +replicating containers require active controllers, not just imperative +orchestration. 
+ +Kubernetes is primarily targeted at applications composed of multiple +containers, such as elastic, distributed micro-services. It is also designed to +facilitate migration of non-containerized application stacks to Kubernetes. It +therefore includes abstractions for grouping containers in both loosely coupled +and tightly coupled formations, and provides ways for containers to find and +communicate with each other in relatively familiar ways. + +Kubernetes enables users to ask a cluster to run a set of containers. The system +automatically chooses hosts to run those containers on. While Kubernetes's +scheduler is currently very simple, we expect it to grow in sophistication over +time. Scheduling is a policy-rich, topology-aware, workload-specific function +that significantly impacts availability, performance, and capacity. The +scheduler needs to take into account individual and collective resource +requirements, quality of service requirements, hardware/software/policy +constraints, affinity and anti-affinity specifications, data locality, +inter-workload interference, deadlines, and so on. Workload-specific +requirements will be exposed through the API as necessary. + +Kubernetes is intended to run on a number of cloud providers, as well as on +physical hosts. + +A single Kubernetes cluster is not intended to span multiple availability zones. +Instead, we recommend building a higher-level layer to replicate complete +deployments of highly available applications across multiple zones (see +[the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md) +for more details). + +Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS +platform and toolkit. 
Therefore, architecturally, we want Kubernetes to be built +as a collection of pluggable components and layers, with the ability to use +alternative schedulers, controllers, storage systems, and distribution +mechanisms, and we're evolving its current code in that direction. Furthermore, +we want others to be able to extend Kubernetes functionality, such as with +higher-level PaaS functionality or multi-cluster layers, without modification of +core Kubernetes source. Therefore, its API isn't just (or even necessarily +mainly) targeted at end users, but at tool and extension developers. Its APIs +are intended to serve as the foundation for an open ecosystem of tools, +automation systems, and higher-level API layers. Consequently, there are no +"internal" inter-component APIs. All APIs are visible and available, including +the APIs used by the scheduler, the node controller, the replication-controller +manager, Kubelet's API, etc. There's no glass to break -- in order to handle +more complex use cases, one can just access the lower-level APIs in a fully +transparent, composable manner. For more about the Kubernetes architecture, see [architecture](architecture.md). diff --git a/access.md b/access.md index dafceb6a..7cf1ad39 100644 --- a/access.md +++ b/access.md @@ -34,23 +34,30 @@ Documentation for other releases can be found at # K8s Identity and Access Management Sketch -This document suggests a direction for identity and access management in the Kubernetes system. +This document suggests a direction for identity and access management in the +Kubernetes system. ## Background High level goals are: - - Have a plan for how identity, authentication, and authorization will fit in to the API. - - Have a plan for partitioning resources within a cluster between independent organizational units. + - Have a plan for how identity, authentication, and authorization will fit in +to the API. 
+ - Have a plan for partitioning resources within a cluster between independent +organizational units. - Ease integration with existing enterprise and hosted scenarios. ### Actors Each of these can act as normal users or attackers. - - External Users: People who are accessing applications running on K8s (e.g. a web site served by webserver running in a container on K8s), but who do not have K8s API access. - - K8s Users : People who access the K8s API (e.g. create K8s API objects like Pods) + - External Users: People who are accessing applications running on K8s (e.g. +a web site served by webserver running in a container on K8s), but who do not +have K8s API access. + - K8s Users: People who access the K8s API (e.g. create K8s API objects like +Pods) - K8s Project Admins: People who manage access for some K8s Users - - K8s Cluster Admins: People who control the machines, networks, or binaries that make up a K8s cluster. + - K8s Cluster Admins: People who control the machines, networks, or binaries +that make up a K8s cluster. - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. ### Threats @@ -58,22 +65,31 @@ Each of these can act as normal users or attackers. Both intentional attacks and accidental use of privilege are concerns. For both cases it may be useful to think about these categories differently: - - Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s. + - Application Path - attack by sending network messages from the internet to +the IP/port of any application running on K8s. May exploit weakness in +application or misconfiguration of K8s. - K8s API Path - attack by sending network messages to any K8s API endpoint. - - Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. 
Software errors in K8s system components and administrator error are some types of threat in this category. + - Insider Path - attack on K8s system components. Attacker may have +privileged access to networks, machines or K8s software and data. Software +errors in K8s system components and administrator error are some types of threat +in this category. -This document is primarily concerned with K8s API paths, and secondarily with Internal paths. The Application path also needs to be secure, but is not the focus of this document. +This document is primarily concerned with K8s API paths, and secondarily with +Internal paths. The Application path also needs to be secure, but is not the +focus of this document. ### Assets to protect External User assets: - - Personal information like private messages, or images uploaded by External Users. + - Personal information like private messages, or images uploaded by External +Users. - web server logs. K8s User assets: - External User assets of each K8s User. - things private to the K8s app, like: - - credentials for accessing other services (docker private repos, storage services, facebook, etc) + - credentials for accessing other services (docker private repos, storage +services, facebook, etc) - SSL certificates for web servers - proprietary data and code @@ -82,38 +98,51 @@ K8s Cluster assets: - Machine Certificates or secrets. - The value of K8s cluster computing resources (cpu, memory, etc). -This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins. +This document is primarily about protecting K8s User assets and K8s cluster +assets from other K8s Users and K8s Project and Cluster Admins. ### Usage environments Cluster in Small organization: - K8s Admins may be the same people as K8s Users. - - few K8s Admins. - - prefer ease of use to fine-grained access control/precise accounting, etc. 
- - Product requirement that it be easy for potential K8s Cluster Admin to try out setting up a simple cluster. + - Few K8s Admins. + - Prefer ease of use to fine-grained access control/precise accounting, etc. + - Product requirement that it be easy for potential K8s Cluster Admin to try +out setting up a simple cluster. Cluster in Large organization: - - K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles. + - K8s Admins typically distinct people from K8s Users. May need to divide +K8s Cluster Admin access by roles. - K8s Users need to be protected from each other. - Auditing of K8s User and K8s Admin actions important. - - flexible accurate usage accounting and resource controls important. + - Flexible accurate usage accounting and resource controls important. - Lots of automated access to APIs. - - Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure. + - Need to integrate with existing enterprise directory, authentication, +accounting, auditing, and security policy infrastructure. Org-run cluster: - - organization that runs K8s master components is same as the org that runs apps on K8s. + - Organization that runs K8s master components is same as the org that runs +apps on K8s. - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix. Hosted cluster: - Offering K8s API as a service, or offering a Paas or Saas built on K8s. - - May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure. - - May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case.) - - Precise and accurate accounting of resources needed. 
Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded). + - May already offer web services, and need to integrate with existing customer +account concept, and existing authentication, accounting, auditing, and security +policy infrastructure. + - May want to leverage K8s User accounts and accounting to manage their User +accounts (not a priority to support this use case.) + - Precise and accurate accounting of resources needed. Resource controls +needed for hard limits (Users given limited slice of data) and soft limits +(Users can grow up to some limit and then be expanded). K8s ecosystem services: - - There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case. + - There may be companies that want to offer their existing services (Build, CI, +A/B-test, release automation, etc) for use with K8s. There should be some story +for this case. -Pods configs should be largely portable between Org-run and hosted configurations. +Pods configs should be largely portable between Org-run and hosted +configurations. # Design @@ -123,65 +152,99 @@ Related discussion: - http://issue.k8s.io/443 This doc describes two security profiles: - - Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users. - - Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure. + - Simple profile: like single-user mode. Make it easy to evaluate K8s +without lots of configuring accounts and policies. Protects from unauthorized +users, but does not partition authorized users. + - Enterprise profile: Provide mechanisms needed for large numbers of users. 
+Defense in depth. Should integrate with existing enterprise security +infrastructure. -K8s distribution should include templates of config, and documentation, for simple and enterprise profiles. System should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two Profiles, not a matrix. +K8s distribution should include templates of config, and documentation, for +simple and enterprise profiles. System should be flexible enough for +knowledgeable users to create intermediate profiles, but K8s developers should +only reason about those two Profiles, not a matrix. -Features in this doc are divided into "Initial Feature", and "Improvements". Initial features would be candidates for version 1.00. +Features in this doc are divided into "Initial Feature", and "Improvements". +Initial features would be candidates for version 1.00. ## Identity ### userAccount K8s will have a `userAccount` API object. -- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs. -- `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field. -- `userAccount` is not related to the unix username of processes in Pods created by that userAccount. +- `userAccount` has a UID which is immutable. This is used to associate users +with objects and to record actions in audit logs. +- `userAccount` has a name which is a string and human readable and unique among +userAccounts. It is used to refer to users in Policies, to ensure that the +Policies are human readable. It can be changed only when there are no Policy +objects or other objects which refer to that name. 
An email address is a +suggested format for this field. +- `userAccount` is not related to the unix username of processes in Pods created +by that userAccount. - `userAccount` API objects can have labels. The system may associate one or more Authentication Methods with a `userAccount` (but they are not formally part of the userAccount object.) -In a simple deployment, the authentication method for a -user might be an authentication token which is verified by a K8s server. In a -more complex deployment, the authentication might be delegated to -another system which is trusted by the K8s API to authenticate users, but where -the authentication details are unknown to K8s. + +In a simple deployment, the authentication method for a user might be an +authentication token which is verified by a K8s server. In a more complex +deployment, the authentication might be delegated to another system which is +trusted by the K8s API to authenticate users, but where the authentication +details are unknown to K8s. Initial Features: -- there is no superuser `userAccount` -- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this. -- `userAccount` can have a default `namespace`. If API call does not specify a `namespace`, the default `namespace` for that caller is assumed. -- `userAccount` is global. A single human with access to multiple namespaces is recommended to only have one userAccount. +- There is no superuser `userAccount` +- `userAccount` objects are statically populated in the K8s API store by reading +a config file. Only a K8s Cluster Admin can do this. +- `userAccount` can have a default `namespace`. If API call does not specify a +`namespace`, the default `namespace` for that caller is assumed. +- `userAccount` is global. A single human with access to multiple namespaces is +recommended to only have one userAccount. 
Improvements:
-- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.
+- Make `userAccount` part of a separate API group from core K8s objects like
+`pod`. Facilitates plugging in alternate Access Management.

Simple Profile:
-  - single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.
+  - Single `userAccount`, used by all K8s Users and Project Admins. One access
+token shared by all.

Enterprise Profile:
-  - every human user has own `userAccount`.
-  - `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
-  - each service using the API has own `userAccount` too. (e.g. `scheduler`, `repcontroller`)
-  - automated jobs to denormalize the ldap group info into the local system list of users into the K8s userAccount file.
+  - Every human user has their own `userAccount`.
+  - `userAccount`s have labels that indicate both membership in groups, and
+ability to act in certain roles.
+  - Each service using the API has its own `userAccount` too. (e.g. `scheduler`,
+`repcontroller`)
+  - Automated jobs to denormalize the ldap group info into the local system
+list of users into the K8s userAccount file.

### Unix accounts

-A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.
+A `userAccount` is not a Unix user account. The fact that a pod is started by a
+`userAccount` does not mean that the processes in that pod's containers run as a
+Unix user with a corresponding name or identity.

Initially:
-- The unix accounts available in a container, and used by the processes running in a container are those that are provided by the combination of the base operating system and the Docker manifest.
-- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts. +- The unix accounts available in a container, and used by the processes running +in a container are those that are provided by the combination of the base +operating system and the Docker manifest. +- Kubernetes doesn't enforce any relation between `userAccount` and unix +accounts. Improvements: -- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572) -- requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids. -- any features that help users avoid use of privileged containers (http://issue.k8s.io/391) +- Kubelet allocates disjoint blocks of root-namespace uids for each container. +This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572) +- requires docker to integrate user namespace support, and deciding what +getpwnam() does for these uids. +- any features that help users avoid use of privileged containers +(http://issue.k8s.io/391) ### Namespaces -K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies. +K8s will have a `namespace` API object. It is similar to a Google Compute +Engine `project`. It provides a namespace for objects created by a group of +people co-operating together, preventing name collisions with non-cooperating +groups. It also serves as a reference point for authorization policies. Namespaces are described in [namespaces.md](namespaces.md). @@ -192,20 +255,36 @@ In the Simple Profile: - There is a single `namespace` used by the single user. 
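Since names only need to be unique within a `namespace`, lookups are effectively keyed by the (namespace, name) pair. A toy sketch of that collision behavior (`ObjectStore` is illustrative, not the real API store):

```python
class ObjectStore:
    """Toy store keyed by (namespace, name): the same name may exist in two
    namespaces, but a duplicate within one namespace is a collision."""
    def __init__(self):
        self._objects = {}

    def create(self, namespace, name, obj):
        key = (namespace, name)
        if key in self._objects:
            raise KeyError("%r already exists in namespace %r" % (name, namespace))
        self._objects[key] = obj

    def get(self, namespace, name):
        return self._objects[(namespace, name)]

store = ObjectStore()
store.create("team-a", "frontend", {"kind": "Pod"})
store.create("team-b", "frontend", {"kind": "Pod"})  # same name, different namespace: no collision
```

This is why two non-cooperating groups can both name a pod `frontend` without coordinating, while an authorization policy can be attached at the namespace level.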
Namespaces versus userAccount vs Labels: -- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s. -- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities. -- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people. +- `userAccount`s are intended for audit logging (both name and UID should be +logged), and to define who has access to `namespace`s. +- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md)) +should be used to distinguish pods, users, and other objects that cooperate +towards a common goal but are different in some way, such as version, or +responsibilities. +- `namespace`s prevent name collisions between uncoordinated groups of people, +and provide a place to attach common policies for co-operating groups of people. ## Authentication Goals for K8s authentication: -- Include a built-in authentication system with no configuration required to use in single-user mode, and little configuration required to add several user accounts, and no https proxy required. -- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing to enterprise authorization systems. The Kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users. - - For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication. - - So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver. 
- - For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
-- Avoid mixing authentication and authorization, so that authorization policies be centrally managed, and to allow changes in authentication methods without affecting authorization code.
+- Include a built-in authentication system with no configuration required to use
+in single-user mode, and little configuration required to add several user
+accounts, and no https proxy required.
+- Allow for authentication to be handled by a system external to Kubernetes, to
+allow integration with existing enterprise authorization systems. The
+Kubernetes namespace itself should avoid taking contributions of multiple
+authorization schemes. Instead, a trusted proxy in front of the apiserver can be
+used to authenticate users.
+  - For organizations whose security requirements only allow FIPS compliant
+implementations (e.g. apache) for authentication.
+  - So the proxy can terminate SSL, and isolate the CA-signed certificate from
+less trusted, higher-touch APIserver.
+  - For organizations that already have existing SaaS web services (e.g.
+storage, VMs) and want a common authentication portal.
+- Avoid mixing authentication and authorization, so that authorization policies
+can be centrally managed, and to allow changes in authentication methods without
+affecting authorization code.

Initially:
- Tokens used to authenticate a user.

@@ -213,9 +292,12 @@ Initially:
- Administrator utility generates tokens at cluster setup.
- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
- No scopes for tokens. Authorization happens in the API server
-- Tokens dynamically generated by apiserver to identify pods which are making API calls.
+- Tokens dynamically generated by apiserver to identify pods which are making
+API calls.
- Tokens checked in a module of the APIserver.
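A minimal sketch of the token check described above, using the `Authorization: Bearer <token>` header form from RFC 6750 (the token table and function name here are hypothetical, not the actual apiserver module):

```python
def authenticate(headers, known_tokens):
    """Return the userAccount name for a valid 'Authorization: Bearer <token>'
    header (RFC 6750), or None when the header is missing, malformed, or the
    token is unknown."""
    parts = headers.get("Authorization", "").split(None, 1)
    if len(parts) != 2 or parts[0] != "Bearer":
        return None
    return known_tokens.get(parts[1])

# Token table as generated by the administrator utility at cluster setup
# (values here are made up for illustration).
tokens = {"s3cret-token": "alice@example.com"}
user = authenticate({"Authorization": "Bearer s3cret-token"}, tokens)
```

Note the function only establishes identity; deciding what that identity may do is left to the separate authorization step, matching the goal of keeping the two concerns apart.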
-- Authentication in apiserver can be disabled by flag, to allow testing without authorization enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller. +- Authentication in apiserver can be disabled by flag, to allow testing without +authorization enabled, and to allow use of an authenticating proxy. In this +mode, a query parameter or header added by the proxy will identify the caller. Improvements: - Refresh of tokens. @@ -228,54 +310,86 @@ To be considered for subsequent versions: - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf - http://www.browserauth.net - ## Authorization K8s authorization should: -- Allow for a range of maturity levels, from single-user for those test driving the system, to integration with existing to enterprise authorization systems. -- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service). -- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults. -- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Replication Controllers, Services, and the identities and policies for those Pods and Replication Controllers. -- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies. +- Allow for a range of maturity levels, from single-user for those test driving +the system, to integration with existing enterprise authorization systems. +- Allow for centralized management of users and policies. 
In some +organizations, this will mean that the definition of users and access policies +needs to reside on a system other than k8s and encompass other web services +(such as a storage service). +- Allow processes running in K8s Pods to take on identity, and to allow narrow +scoping of permissions for those identities in order to limit damage from +software faults. +- Have Authorization Policies exposed as API objects so that a single config +file can create or delete Pods, Replication Controllers, Services, and the +identities and policies for those Pods and Replication Controllers. +- Be separate as much as practical from Authentication, to allow Authentication +methods to change over time and space, without impacting Authorization policies. K8s will implement a relatively simple [Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model. -The model will be described in more detail in a forthcoming document. The model will + +The model will be described in more detail in a forthcoming document. The model +will: - Be less complex than XACML - Be easily recognizable to those familiar with Amazon IAM Policies. -- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control. +- Have a subset/aliases/defaults which allow it to be used in a way comfortable +to those users more familiar with Role-Based Access Control. Authorization policy is set by creating a set of Policy objects. -The API Server will be the Enforcement Point for Policy. For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do that resource, etc) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. 
The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance). - -Policy objects may be applicable only to a single namespace or to all namespaces; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.) - +The API Server will be the Enforcement Point for Policy. For each API call that +it receives, it will construct the Attributes needed to evaluate the policy +(what user is making the call, what resource they are accessing, what they are +trying to do to that resource, etc) and pass those attributes to a Decision Point. +The Decision Point code evaluates the Attributes against all the Policies and +allows or denies the API call. The system will be modular enough that the +Decision Point code can either be linked into the APIserver binary, or be +another service that the apiserver calls for each Decision (with appropriate +time-limited caching as needed for performance). + +Policy objects may be applicable only to a single namespace; K8s Project Admins +would be able to create those as needed. Other Policy objects may be applicable +to all namespaces; a K8s Cluster Admin might create those in order to authorize +a new type of controller to be used by all namespaces, or to make a K8s User +into a K8s Project Admin. ## Accounting -The API should have a `quota` concept (see http://issue.k8s.io/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources.md)). +The API should have a `quota` concept (see http://issue.k8s.io/442). 
A quota +object relates a namespace (and optionally a label selector) to a maximum +quantity of resources that may be used (see [resources design doc](resources.md)). Initially: -- a `quota` object is immutable. -- for hosted K8s systems that do billing, Project is recommended level for billing accounts. -- Every object that consumes resources should have a `namespace` so that Resource usage stats are roll-up-able to `namespace`. +- A `quota` object is immutable. +- For hosted K8s systems that do billing, the Project is the recommended level +for billing accounts. +- Every object that consumes resources should have a `namespace` so that +resource usage stats are roll-up-able to `namespace`. - K8s Cluster Admin sets quota objects by writing a config file. Improvements: -- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object. -- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let "webserver" namespace use 100 cores, but to prevent accidents, don't allow "webserver" namespace and "instance=test" use more than 10 cores. -- tools to help write consistent quota config files based on number of nodes, historical namespace usages, QoS needs, etc. -- way for K8s Cluster Admin to incrementally adjust Quota objects. +- Allow one namespace to charge the quota for one or more other namespaces. This +would be controlled by a policy which allows changing a `billing_namespace=` +label on an object. +- Allow quota to be set by namespace owners for (namespace x label) combinations +(e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, +don't allow the "webserver" namespace with "instance=test" to use more than +10 cores). +- Tools to help write consistent quota config files based on number of nodes, +historical namespace usages, QoS needs, etc. +- Way for K8s Cluster Admin to incrementally adjust Quota objects. 
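The quota check implied above (a namespace-scoped hard cap per resource) can be sketched in a few lines of Go. The type and function names (`Quota`, `ResourceList`, `admit`) and the integer resource units are illustrative assumptions, not the real API:

```go
package main

import "fmt"

// ResourceList maps a resource name to an integer quantity (e.g. cores).
type ResourceList map[string]int64

// Quota relates a namespace to a maximum quantity of resources.
type Quota struct {
	Namespace string
	Hard      ResourceList // maximum allowed per resource
}

// admit reports whether adding `request` to the namespace's current `used`
// totals would stay within every hard limit of the quota.
func admit(q Quota, used, request ResourceList) bool {
	for name, max := range q.Hard {
		if used[name]+request[name] > max {
			return false
		}
	}
	return true
}

func main() {
	q := Quota{Namespace: "webserver", Hard: ResourceList{"cpu": 100}}
	fmt.Println(admit(q, ResourceList{"cpu": 95}, ResourceList{"cpu": 4}))  // true
	fmt.Println(admit(q, ResourceList{"cpu": 95}, ResourceList{"cpu": 10})) // false
}
```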
Simple profile: - - a single `namespace` with infinite resource limits. + - A single `namespace` with infinite resource limits. Enterprise profile: - - multiple namespaces each with their own limits. + - Multiple namespaces each with their own limits. Issues: -- need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations. +- Need for locking or "eventual consistency" when multiple apiserver goroutines +are accessing the object store and handling pod creations. ## Audit Logging @@ -287,7 +401,8 @@ Initial implementation: Improvements: - API server does logging instead. -- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions. +- Policies to drop logging for high-rate trusted API calls, or by users +performing audit or other sensitive functions. diff --git a/admission_control.md b/admission_control.md index d85f1bfa..eef323b7 100644 --- a/admission_control.md +++ b/admission_control.md @@ -43,24 +43,30 @@ Documentation for other releases can be found at ## Background High level goals: +* Enable an easy-to-use mechanism to provide admission control to a cluster. +* Enable a provider to support multiple admission control strategies or author +their own. +* Ensure any rejected request can propagate errors back to the caller explaining +why the request failed. -* Enable an easy-to-use mechanism to provide admission control to cluster -* Enable a provider to support multiple admission control strategies or author their own -* Ensure any rejected request can propagate errors back to the caller with why the request failed - -Authorization via policy is focused on answering if a user is authorized to perform an action. +Authorization via policy is focused on answering if a user is authorized to +perform an action. Admission Control is focused on if the system will accept an authorized action. 
-Kubernetes may choose to dismiss an authorized action based on any number of admission control strategies. +Kubernetes may choose to dismiss an authorized action based on any number of +admission control strategies. -This proposal documents the basic design, and describes how any number of admission control plug-ins could be injected. +This proposal documents the basic design, and describes how any number of +admission control plug-ins could be injected. -Implementation of specific admission control strategies are handled in separate documents. +Implementations of specific admission control strategies are handled in +separate documents. ## kube-apiserver -The kube-apiserver takes the following OPTIONAL arguments to enable admission control +The kube-apiserver takes the following OPTIONAL arguments to enable admission +control: | Option | Behavior | | ------ | -------- | @@ -72,7 +78,8 @@ An **AdmissionControl** plug-in is an implementation of the following interface: ```go package admission -// Attributes is an interface used by a plug-in to make an admission decision on a individual request. +// Attributes is an interface used by a plug-in to make an admission decision +// on an individual request. type Attributes interface { GetNamespace() string GetKind() string @@ -88,8 +95,8 @@ type Interface interface { } ``` -A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name, and implementation -of admission.Interface. +A **plug-in** must be compiled with the binary, and is registered as an +available option by providing a name and an implementation of admission.Interface. ```go func init() { @@ -97,9 +104,12 @@ func init() { } ``` -Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations. +Invocation of admission control is handled by the **APIServer** and not +individual **RESTStorage** implementations. 
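To illustrate how a plug-in satisfies the interface above, here is a toy sketch. The `Attributes` interface is trimmed to two of its methods, and `denyKind` is a hypothetical plug-in written for this example, not one that ships with Kubernetes; a real plug-in would live in the `admission` package and be wired up via `init()` registration as described above:

```go
package main

import "fmt"

// Attributes is a trimmed local copy of the admission attributes interface.
type Attributes interface {
	GetNamespace() string
	GetKind() string
}

// Interface is what a plug-in implements, assuming an Admit method whose
// error (if any) is propagated back to the caller as the rejection reason.
type Interface interface {
	Admit(a Attributes) error
}

// denyKind is a toy plug-in that rejects any request for a given kind.
type denyKind struct{ kind string }

func (d denyKind) Admit(a Attributes) error {
	if a.GetKind() == d.kind {
		return fmt.Errorf("admission denied: kind %s not allowed in namespace %s",
			a.GetKind(), a.GetNamespace())
	}
	return nil
}

// fakeAttrs satisfies Attributes for demonstration.
type fakeAttrs struct{ ns, kind string }

func (f fakeAttrs) GetNamespace() string { return f.ns }
func (f fakeAttrs) GetKind() string      { return f.kind }

func main() {
	var plugin Interface = denyKind{kind: "Pod"}
	fmt.Println(plugin.Admit(fakeAttrs{ns: "default", kind: "Pod"}) != nil)     // true
	fmt.Println(plugin.Admit(fakeAttrs{ns: "default", kind: "Service"}) != nil) // false
}
```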
-This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow will ensure the following: +This design assumes that **Issue 297** is adopted, and as a consequence, the +general framework of the APIServer request/response flow will ensure the +following: 1. Incoming request 2. Authenticate user diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 26b424f4..8a6c751d 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -36,7 +36,8 @@ Documentation for other releases can be found at ## Background -This document proposes a system for enforcing resource requirements constraints as part of admission control. +This document proposes a system for enforcing resource requirements constraints +as part of admission control. ## Use cases @@ -64,7 +65,8 @@ const ( LimitTypeContainer LimitType = "Container" ) -// LimitRangeItem defines a min/max usage limit for any resource that matches on kind. +// LimitRangeItem defines a min/max usage limit for any resource that matches +// on kind. type LimitRangeItem struct { // Type of resource that this limit applies to. Type LimitType `json:"type,omitempty"` @@ -72,29 +74,38 @@ type LimitRangeItem struct { Max ResourceList `json:"max,omitempty"` // Min usage constraints on this kind by resource name. Min ResourceList `json:"min,omitempty"` - // Default resource requirement limit value by resource name if resource limit is omitted. + // Default resource requirement limit value by resource name if resource limit + // is omitted. Default ResourceList `json:"default,omitempty"` - // DefaultRequest is the default resource requirement request value by resource name if resource request is omitted. + // DefaultRequest is the default resource requirement request value by + // resource name if resource request is omitted. 
DefaultRequest ResourceList `json:"defaultRequest,omitempty"` - // MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource. + // MaxLimitRequestRatio if specified, the named resource must have a request + // and limit that are both non-zero where limit divided by request is less + // than or equal to the enumerated value; this represents the max burst for + // the named resource. MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` } -// LimitRangeSpec defines a min/max usage limit for resources that match on kind. +// LimitRangeSpec defines a min/max usage limit for resources that match +// on kind. type LimitRangeSpec struct { // Limits is the list of LimitRangeItem objects that are enforced. Limits []LimitRangeItem `json:"limits"` } -// LimitRange sets resource usage limits for each kind of resource in a Namespace. +// LimitRange sets resource usage limits for each kind of resource in a +// Namespace. type LimitRange struct { TypeMeta `json:",inline"` // Standard object's metadata. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata + // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata ObjectMeta `json:"metadata,omitempty"` // Spec defines the limits enforced. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status + // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status Spec LimitRangeSpec `json:"spec,omitempty"` } @@ -102,24 +113,29 @@ type LimitRange struct { type LimitRangeList struct { TypeMeta `json:",inline"` // Standard list metadata. 
- // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds + // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds ListMeta `json:"metadata,omitempty"` // Items is a list of LimitRange objects. - // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md + // More info: + // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md Items []LimitRange `json:"items"` } ``` ### Validation -Validation of a **LimitRange** enforces that for a given named resource the following rules apply: +Validation of a **LimitRange** enforces that for a given named resource the +following rules apply: -Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified) +Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) +<= Max (if specified) ### Default Value Behavior -The following default value behaviors are applied to a LimitRange for a given named resource. +The following default value behaviors are applied to a LimitRange for a given +named resource. ``` if LimitRangeItem.Default[resourceName] is undefined @@ -137,11 +153,14 @@ if LimitRangeItem.DefaultRequest[resourceName] is undefined ## AdmissionControl plugin: LimitRanger -The **LimitRanger** plug-in introspects all incoming pod requests and evaluates the constraints defined on a LimitRange. +The **LimitRanger** plug-in introspects all incoming pod requests and evaluates +the constraints defined on a LimitRange. -If a constraint is not specified for an enumerated resource, it is not enforced or tracked. +If a constraint is not specified for an enumerated resource, it is not enforced +or tracked. 
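The defaulting behavior above (fill in an omitted Limit from Default, and an omitted Request from DefaultRequest) can be sketched in Go. `applyDefaults` and the string-valued `ResourceList` are simplifications for illustration, not the plug-in's real types:

```go
package main

import "fmt"

// ResourceList maps a resource name to a quantity string (e.g. "500m").
type ResourceList map[string]string

// applyDefaults fills in a container's omitted limit and request for one
// resource, following the two defaulting rules above: an absent limit takes
// Default, an absent request takes DefaultRequest.
func applyDefaults(limits, requests, def, defReq ResourceList, resource string) {
	if _, ok := limits[resource]; !ok {
		if v, ok := def[resource]; ok {
			limits[resource] = v
		}
	}
	if _, ok := requests[resource]; !ok {
		if v, ok := defReq[resource]; ok {
			requests[resource] = v
		}
	}
}

func main() {
	limits, requests := ResourceList{}, ResourceList{}
	applyDefaults(limits, requests,
		ResourceList{"cpu": "500m"}, ResourceList{"cpu": "250m"}, "cpu")
	fmt.Println(limits["cpu"], requests["cpu"]) // 500m 250m
}
```

With the run-time configuration shown later (default 500m limit, 250m request for cpu), a container that enumerates neither value would come out exactly as in the main() above.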
-To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows: +To enable the plug-in and support for LimitRange, the kube-apiserver must be +configured as follows: ```console $ kube-apiserver --admission-control=LimitRanger @@ -158,7 +177,7 @@ Supported Resources: Supported Constraints: -Per container, the following must hold true +Per container, the following must hold true: | Constraint | Behavior | | ---------- | -------- | @@ -168,8 +187,10 @@ Per container, the following must hold true Supported Defaults: -1. Default - if the named resource has no enumerated value, the Limit is equal to the Default -2. DefaultRequest - if the named resource has no enumerated value, the Request is equal to the DefaultRequest +1. Default - if the named resource has no enumerated value, the Limit is equal +to the Default +2. DefaultRequest - if the named resource has no enumerated value, the Request +is equal to the DefaultRequest **Type: Pod** @@ -190,7 +211,8 @@ Across all containers in pod, the following must hold true ## Run-time configuration -The default ```LimitRange``` that is applied via Salt configuration will be updated as follows: +The default ```LimitRange``` that is applied via Salt configuration will be +updated as follows: ``` apiVersion: "v1" @@ -219,7 +241,8 @@ the following would happen. 1. The incoming container cpu would request 250m with a limit of 500m. 2. The incoming container memory would request 250Mi with a limit of 500Mi -3. If the container is later resized, it's cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4. +3. If the container is later resized, its cpu would be constrained to between +.1 and 1 and the ratio of limit to request could not exceed 4. 
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 99e92b8f..bfac66eb 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -36,7 +36,8 @@ Documentation for other releases can be found at ## Background -This document describes a system for enforcing hard resource usage limits per namespace as part of admission control. +This document describes a system for enforcing hard resource usage limits per +namespace as part of admission control. ## Use cases @@ -103,7 +104,7 @@ type ResourceQuotaList struct { ## Quota Tracked Resources -The following resources are supported by the quota system. +The following resources are supported by the quota system: | Resource | Description | | ------------ | ----------- | @@ -116,16 +117,19 @@ The following resources are supported by the quota system. | secrets | Total number of secrets | | persistentvolumeclaims | Total number of persistent volume claims | -If a third-party wants to track additional resources, it must follow the resource naming conventions prescribed -by Kubernetes. This means the resource must have a fully-qualified name (i.e. mycompany.org/shinynewresource) +If a third-party wants to track additional resources, it must follow the +resource naming conventions prescribed by Kubernetes. This means the resource +must have a fully-qualified name (i.e. mycompany.org/shinynewresource) ## Resource Requirements: Requests vs Limits -If a resource supports the ability to distinguish between a request and a limit for a resource, -the quota tracking system will only cost the request value against the quota usage. If a resource -is tracked by quota, and no request value is provided, the associated entity is rejected as part of admission. 
+If a resource supports the ability to distinguish between a request and a limit, +the quota tracking system will only cost the request value +against the quota usage. If a resource is tracked by quota, and no request value +is provided, the associated entity is rejected as part of admission. -For an example, consider the following scenarios relative to tracking quota on CPU: +As an example, consider the following scenarios relative to tracking quota on +CPU: | Pod | Container | Request CPU | Limit CPU | Result | | --- | --------- | ----------- | --------- | ------ | @@ -134,13 +138,14 @@ For an example, consider the following scenarios relative to tracking quota on C | Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit | | Z | C3 | none | none | The pod is rejected since it does not enumerate a request. | -The rationale for accounting for the requested amount of a resource versus the limit is the belief -that a user should only be charged for what they are scheduled against in the cluster. In addition, -attempting to track usage against actual usage, where request < actual < limit, is considered highly -volatile. +The rationale for accounting for the requested amount of a resource versus the +limit is the belief that a user should only be charged for what they are +scheduled against in the cluster. In addition, attempting to track usage against +actual usage, where request < actual < limit, is considered highly volatile. -As a consequence of this decision, the user is able to spread its usage of a resource across multiple tiers -of service. Let's demonstrate this via an example with a 4 cpu quota. +As a consequence of this decision, the user is able to spread its usage of a +resource across multiple tiers of service. Let's demonstrate this via an +example with a 4 cpu quota. 
The quota may be allocated as follows: @@ -150,48 +155,62 @@ The quota may be allocated as follows: | Y | C2 | 2 | 2 | Guaranteed | 2 | | Z | C3 | 1 | 3 | Burstable | 1 | -It is possible that the pods may consume 9 cpu over a given time period depending on the nodes available cpu -that held pod X and Z, but since we scheduled X and Z relative to the request, we only track the requesting -value against their allocated quota. If one wants to restrict the ratio between the request and limit, -it is encouraged that the user define a **LimitRange** with **LimitRequestRatio** to control burst out behavior. -This would in effect, let an administrator keep the difference between request and limit more in line with +It is possible that the pods may consume 9 cpu over a given time period +depending on the available cpu of the nodes that held pods X and Z, but since we +scheduled X and Z relative to the request, we only track the requesting value +against their allocated quota. If one wants to restrict the ratio between the +request and limit, it is encouraged that the user define a **LimitRange** with +**LimitRequestRatio** to control burst out behavior. This would, in effect, let +an administrator keep the difference between request and limit more in line with tracked usage if desired. ## Status API -A REST API endpoint to update the status section of the **ResourceQuota** is exposed. It requires an atomic compare-and-swap -in order to keep resource usage tracking consistent. +A REST API endpoint to update the status section of the **ResourceQuota** is +exposed. It requires an atomic compare-and-swap in order to keep resource usage +tracking consistent. ## Resource Quota Controller -A resource quota controller monitors observed usage for tracked resources in the **Namespace**. +A resource quota controller monitors observed usage for tracked resources in the +**Namespace**. 
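The request-based accounting in the table above can be sketched as follows. The `chargeable` helper is hypothetical, quantities are assumed to be in millicores, and a zero value stands in for "none":

```go
package main

import "fmt"

// chargeable returns the amount (in millicores) costed against quota for one
// container: the request is charged; an omitted request defaults to the limit;
// if neither is set, the container is rejected at admission.
func chargeable(requestMilli, limitMilli int64) (int64, error) {
	switch {
	case requestMilli > 0:
		return requestMilli, nil
	case limitMilli > 0:
		return limitMilli, nil // request defaults to limit
	default:
		return 0, fmt.Errorf("rejected: no request enumerated")
	}
}

func main() {
	fmt.Println(chargeable(100, 500)) // 100 <nil>  (request is charged)
	fmt.Println(chargeable(0, 500))   // 500 <nil>  (defaults to limit)
	_, err := chargeable(0, 0)
	fmt.Println(err != nil) // true  (rejected, matching pod Z above)
}
```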
-If there is observed difference between the current usage stats versus the current **ResourceQuota.Status**, the controller -posts an update of the currently observed usage metrics to the **ResourceQuota** via the /status endpoint. +If there is an observed difference between the current usage stats and the +current **ResourceQuota.Status**, the controller posts an update of the +currently observed usage metrics to the **ResourceQuota** via the /status +endpoint. -The resource quota controller is the only component capable of monitoring and recording usage updates after a DELETE operation -since admission control is incapable of guaranteeing a DELETE request actually succeeded. +The resource quota controller is the only component capable of monitoring and +recording usage updates after a DELETE operation since admission control is +incapable of guaranteeing a DELETE request actually succeeded. ## AdmissionControl plugin: ResourceQuota The **ResourceQuota** plug-in introspects all incoming admission requests. -To enable the plug-in and support for ResourceQuota, the kube-apiserver must be configured as follows: +To enable the plug-in and support for ResourceQuota, the kube-apiserver must be +configured as follows: ``` $ kube-apiserver --admission-control=ResourceQuota ``` -It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request -namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied. +It makes decisions by evaluating the incoming object against all defined +**ResourceQuota.Status.Hard** resource limits in the request namespace. If +acceptance of the resource would cause the total usage of a named resource to +exceed its hard limit, the request is denied. 
-If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a -**ResourceQuota.Status** document to the server to atomically update the observed usage based on the previously read -**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally) -into the system. +If the incoming request does not cause the total usage to exceed any of the +enumerated hard resource limits, the plug-in will post a +**ResourceQuota.Status** document to the server to atomically update the +observed usage based on the previously read **ResourceQuota.ResourceVersion**. +This keeps incremental usage atomically consistent, but does introduce a +bottleneck (intentionally) into the system. -To optimize system performance, it is encouraged that all resource quotas are tracked on the same **ResourceQuota** document in a **Namespace**. As a result, its encouraged to impose a cap on the total number of individual quotas that are tracked in the **Namespace** -to 1 in the **ResourceQuota** document. +To optimize system performance, it is encouraged that all resource quotas are +tracked on the same **ResourceQuota** document in a **Namespace**. As a result, +it is recommended to cap the number of **ResourceQuota** documents tracked in +the **Namespace** at 1. ## kubectl @@ -199,7 +218,7 @@ kubectl is modified to support the **ResourceQuota** resource. `kubectl describe` provides a human-readable output of quota. 
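The compare-and-swap update described above can be sketched as follows. `Status` and `casUpdate` are illustrative stand-ins for the real types, and a production implementation would also need a retry loop (re-read, re-apply, re-post) around the read-modify-write:

```go
package main

import "fmt"

// Status is a stripped-down stand-in for ResourceQuota.Status.
type Status struct {
	ResourceVersion int64
	Used            int64
}

// casUpdate applies the compare-and-swap: the write succeeds only if the
// caller's previously read resourceVersion still matches, which serializes
// concurrent usage increments (the intentional bottleneck).
func casUpdate(current *Status, readVersion, delta int64) bool {
	if current.ResourceVersion != readVersion {
		return false // conflict: caller must re-read and retry
	}
	current.Used += delta
	current.ResourceVersion++
	return true
}

func main() {
	s := &Status{ResourceVersion: 7, Used: 3}
	fmt.Println(casUpdate(s, 7, 1)) // true: version matched, usage now 4
	fmt.Println(casUpdate(s, 7, 1)) // false: version is now 8, write rejected
}
```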
-For example, +For example: ```console $ kubectl create -f docs/admin/resourcequota/namespace.yaml diff --git a/architecture.md b/architecture.md index f3ff5e2f..b8ce990f 100644 --- a/architecture.md +++ b/architecture.md @@ -34,49 +34,84 @@ Documentation for other releases can be found at # Kubernetes architecture -A running Kubernetes cluster contains node agents (`kubelet`) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making `kubelet` itself (all our components, really) run within containers, and making the scheduler 100% pluggable. +A running Kubernetes cluster contains node agents (`kubelet`) and master +components (APIs, scheduler, etc), on top of a distributed storage solution. +This diagram shows our desired eventual state, though we're still working on a +few things, like making `kubelet` itself (all our components, really) run within +containers, and making the scheduler 100% pluggable. ![Architecture Diagram](architecture.png?raw=true "Architecture overview") ## The Kubernetes Node -When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that compose the cluster-level control plane. +When looking at the architecture of the system, we'll break it down into +services that run on the worker node and services that compose the cluster-level +control plane. -The Kubernetes node has the services necessary to run application containers and be managed from the master systems. +The Kubernetes node has the services necessary to run application containers and +be managed from the master systems. -Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers. +Each node runs Docker, of course. Docker takes care of the details of +downloading images and running containers. 
### `kubelet` -The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their images, their volumes, etc. +The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their +images, their volumes, etc. ### `kube-proxy` -Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends. +Each node also runs a simple network proxy and load balancer (see the +[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for +more details). This reflects `services` (see +[the services doc](../user-guide/services.md) for more details) as defined in +the Kubernetes API on each node and can do simple TCP and UDP stream forwarding +(round robin) across a set of backends. -Service endpoints are currently found via [DNS](../admin/dns.md) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are supported). These variables resolve to ports managed by the service proxy. +Service endpoints are currently found via [DNS](../admin/dns.md) or through +environment variables (both +[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and +Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are +supported). These variables resolve to ports managed by the service proxy. ## The Kubernetes Control Plane -The Kubernetes control plane is split into a set of components. Currently they all run on a single _master_ node, but that is expected to change soon in order to support high-availability clusters. 
These components work together to provide a unified view of the cluster. +The Kubernetes control plane is split into a set of components. Currently they +all run on a single _master_ node, but that is expected to change soon in order +to support high-availability clusters. These components work together to provide +a unified view of the cluster. ### `etcd` -All persistent master state is stored in an instance of `etcd`. This provides a great way to store configuration data reliably. With `watch` support, coordinating components can be notified very quickly of changes. +All persistent master state is stored in an instance of `etcd`. This provides a +great way to store configuration data reliably. With `watch` support, +coordinating components can be notified very quickly of changes. ### Kubernetes API Server -The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a CRUD-y server, with most/all business logic implemented in separate components or in plug-ins. It mainly processes REST operations, validates them, and updates the corresponding objects in `etcd` (and eventually other stores). +The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a +CRUD-y server, with most/all business logic implemented in separate components +or in plug-ins. It mainly processes REST operations, validates them, and updates +the corresponding objects in `etcd` (and eventually other stores). ### Scheduler -The scheduler binds unscheduled pods to nodes via the `/binding` API. The scheduler is pluggable, and we expect to support multiple cluster schedulers and even user-provided schedulers in the future. +The scheduler binds unscheduled pods to nodes via the `/binding` API. The +scheduler is pluggable, and we expect to support multiple cluster schedulers and +even user-provided schedulers in the future. ### Kubernetes Controller Manager Server -All other cluster-level functions are currently performed by the Controller Manager. 
For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable. +All other cluster-level functions are currently performed by the Controller +Manager. For instance, `Endpoints` objects are created and updated by the +endpoints controller, and nodes are discovered, managed, and monitored by the +node controller. These could eventually be split into separate components to +make them independently pluggable. -The [`replicationcontroller`](../user-guide/replication-controller.md) is a mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. +The [`replicationcontroller`](../user-guide/replication-controller.md) is a +mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md) +API. We eventually plan to port it to a generic plug-in mechanism, once one is +implemented. diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index da24fb19..98d18251 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -35,7 +35,7 @@ Documentation for other releases can be found at # Peeking under the hood of Kubernetes on AWS This document provides high-level insight into how Kubernetes works on AWS and -maps to AWS objects. We assume that you are familiar with AWS. +maps to AWS objects. We assume that you are familiar with AWS. We encourage you to use [kube-up](../getting-started-guides/aws.md) to create clusters on AWS. We recommend that you avoid manual configuration but are aware @@ -72,7 +72,7 @@ By default on AWS: * Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently modern kernel that pairs well with Docker and doesn't require a - reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.) + reboot. 
(The default SSH user is `ubuntu` for this and other ubuntu images.) * Nodes use aufs instead of ext4 as the filesystem / container storage (mostly because this is what Google Compute Engine uses). @@ -81,35 +81,36 @@ kube-up. ### Storage -AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore). These can then be -attached to pods that should store persistent data (e.g. if you're running a -database). +AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore). +These can then be attached to pods that should store persistent data (e.g. if +you're running a database). By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) unless you create pods with persistent volumes -[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes +[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes containers do not have persistent storage unless you attach a persistent volume, and so nodes on AWS use instance storage. Instance storage is cheaper, -often faster, and historically more reliable. Unless you can make do with whatever -space is left on your root partition, you must choose an instance type that provides -you with sufficient instance storage for your needs. +often faster, and historically more reliable. Unless you can make do with +whatever space is left on your root partition, you must choose an instance type +that provides you with sufficient instance storage for your needs. -Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track -its state. Similar to nodes, containers are mostly run against instance +Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to +track its state. 
Similar to nodes, containers are mostly run against instance storage, except that we repoint some important data onto the persistent volume. -The default storage driver for Docker images is aufs. Specifying btrfs (by passing the environment -variable `DOCKER_STORAGE=btrfs` to kube-up) is also a good choice for a filesystem. btrfs -is relatively reliable with Docker and has improved its reliability with modern -kernels. It can easily span multiple volumes, which is particularly useful -when we are using an instance type with multiple ephemeral instance disks. +The default storage driver for Docker images is aufs. Specifying btrfs (by +passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a +good choice for a filesystem. btrfs is relatively reliable with Docker and has +improved its reliability with modern kernels. It can easily span multiple +volumes, which is particularly useful when we are using an instance type with +multiple ephemeral instance disks. ### Auto Scaling group Nodes (but not the master) are run in an [Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) -on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled -([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means +on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled +([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means that AWS will relaunch any nodes that are terminated. We do not currently run the master in an AutoScalingGroup, but we should @@ -117,13 +118,13 @@ We do not currently run the master in an AutoScalingGroup, but we should ### Networking -Kubernetes uses an IP-per-pod model. This means that a node, which runs many -pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced +Kubernetes uses an IP-per-pod model. This means that a node, which runs many +pods, must have many IPs. 
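The storage-driver choice described above is made at cluster creation time. A minimal sketch, assuming the usual repository layout for the kube-up script; the zone value is just an example:

```shell
# Select btrfs as the Docker storage driver when bringing up the cluster:
DOCKER_STORAGE=btrfs KUBE_AWS_ZONE=us-west-2a ./cluster/kube-up.sh
```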
AWS uses virtual private clouds (VPCs) and advanced routing support so each pod is assigned a /24 CIDR. The assigned CIDR is then configured to route to an instance in the VPC routing table. -It is also possible to use overlay networking on AWS, but that is not the default -configuration of the kube-up script. +It is also possible to use overlay networking on AWS, but that is not the +default configuration of the kube-up script. ### NodePort and LoadBalancer services @@ -137,8 +138,8 @@ the nodes. This traffic reaches kube-proxy where it is then forwarded to the pods. ELB has some restrictions: -* it requires that all nodes listen on a single port, -* it acts as a forwarding proxy (i.e. the source IP is not preserved). +* ELB requires that all nodes listen on a single port, +* ELB acts as a forwarding proxy (i.e. the source IP is not preserved). To work with these restrictions, in Kubernetes, [LoadBalancer services](../user-guide/services.md#type-loadbalancer) are exposed as @@ -146,18 +147,18 @@ services](../user-guide/services.md#type-loadbalancer) are exposed as kube-proxy listens externally on the cluster-wide port that's assigned to NodePort services and forwards traffic to the corresponding pods. -So for example, if we configure a service of Type LoadBalancer with a +For example, if we configure a service of Type LoadBalancer with a public port of 80: -* Kubernetes will assign a NodePort to the service (e.g. 31234) +* Kubernetes will assign a NodePort to the service (e.g. port 31234) * ELB is configured to proxy traffic on the public port 80 to the NodePort - that is assigned to the service (31234). -* Then any in-coming traffic that ELB forwards to the NodePort (e.g. port 31234) - is recognized by kube-proxy and sent to the correct pods for that service. +assigned to the service (in this example port 31234). +* Then any in-coming traffic that ELB forwards to the NodePort (31234) +is recognized by kube-proxy and sent to the correct pods for that service. 
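The LoadBalancer flow above can be sketched with `kubectl`; the `my-app` replication controller is a placeholder, and Kubernetes assigns the NodePort itself:

```shell
# Expose an existing replication controller through an ELB-backed service:
kubectl expose rc my-app --port=80 --type=LoadBalancer
# Inspect the NodePort Kubernetes assigned and the ELB hostname:
kubectl describe service my-app
```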
Note that we do not automatically open NodePort services in the AWS firewall -(although we do open LoadBalancer services). This is because we expect that +(although we do open LoadBalancer services). This is because we expect that NodePort services are more of a building block for things like inter-cluster -services or for LoadBalancer. To consume a NodePort service externally, you +services or for LoadBalancer. To consume a NodePort service externally, you will likely have to open the port in the node security group (`kubernetes-minion-`). @@ -169,19 +170,19 @@ and one for the nodes called [kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json). The master is responsible for creating ELBs and configuring them, as well as -setting up advanced VPC routing. Currently it has blanket permissions on EC2, +setting up advanced VPC routing. Currently it has blanket permissions on EC2, along with rights to create and destroy ELBs. -The nodes do not need a lot of access to the AWS APIs. They need to download +The nodes do not need a lot of access to the AWS APIs. They need to download a distribution file, and then are responsible for attaching and detaching EBS volumes from itself. -The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR +The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR authorization tokens, refresh them every 12 hours if needed, and fetch Docker -images from it, as long as the appropriate permissions are enabled. Those in +images from it, as long as the appropriate permissions are enabled. Those in [AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly), -without write access, should suffice. The master policy is probably overly -permissive. The security conscious may want to lock-down the IAM policies +without write access, should suffice. 
The master policy is probably overly +permissive. The security conscious may want to lock down the IAM policies further ([#11936](http://issues.k8s.io/11936)). We should make it easier to extend IAM permissions and also ensure that they @@ -190,106 +191,101 @@ are correctly configured ([#14226](http://issues.k8s.io/14226)). ### Tagging All AWS resources are tagged with a tag named "KubernetesCluster", with a value -that is the unique cluster-id. This tag is used to identify a particular +that is the unique cluster-id. This tag is used to identify a particular 'instance' of Kubernetes, even if two clusters are deployed into the same VPC. Resources are considered to belong to the same cluster if and only if they have -the same value in the tag named "KubernetesCluster". (The kube-up script is +the same value in the tag named "KubernetesCluster". (The kube-up script is not configured to create multiple clusters in the same VPC by default, but it is possible to create another cluster in the same VPC.) Within the AWS cloud provider logic, we filter requests to the AWS APIs to -match resources with our cluster tag. By filtering the requests, we ensure +match resources with our cluster tag. By filtering the requests, we ensure that we see only our own AWS objects. -Important: If you choose not to use kube-up, you must pick a unique cluster-id -value, and ensure that all AWS resources have a tag with +**Important:** If you choose not to use kube-up, you must pick a unique +cluster-id value, and ensure that all AWS resources have a tag with `Name=KubernetesCluster,Value=`. ### AWS objects The kube-up script does a number of things in AWS: - -* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes distribution - and the salt scripts into it. They are made world-readable and the HTTP URLs - are passed to instances; this is how Kubernetes code gets onto the machines.
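For manually created resources, the cluster tag described above can be applied with the AWS CLI; the instance ID and cluster-id below are placeholders:

```shell
# Tag a manually created instance so the cloud provider logic will see it:
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=KubernetesCluster,Value=my-cluster-id
```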
+* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes +distribution and the salt scripts into it. They are made world-readable and the +HTTP URLs are passed to instances; this is how Kubernetes code gets onto the +machines. * Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/): * `kubernetes-master` is used by the master. * `kubernetes-minion` is used by nodes. -* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is - the OpenSSH key fingerprint, so that multiple users can run the script with - different keys and their keys will not collide (with near-certainty). It will - use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create - one there. (With the default Ubuntu images, if you have to SSH in: the user is - `ubuntu` and that user can `sudo`). +* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is +the OpenSSH key fingerprint, so that multiple users can run the script with +different keys and their keys will not collide (with near-certainty). It will +use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create +one there. (With the default Ubuntu images, if you have to SSH in: the user is +`ubuntu` and that user can `sudo`). * Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and - enables the `dns-support` and `dns-hostnames` options. +enables the `dns-support` and `dns-hostnames` options. * Creates an internet gateway for the VPC. * Creates a route table for the VPC, with the internet gateway as the default - route. +route. * Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE` - (defaults to us-west-2a). Currently, each Kubernetes cluster runs in a - single AZ on AWS. Although, there are two philosophies in discussion on how to - achieve High Availability (HA): - * cluster-per-AZ: An independent cluster for each AZ, where each cluster - is entirely separate. 
- * cross-AZ-clusters: A single cluster spans multiple AZs. +(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a +single AZ on AWS. There are two philosophies in discussion on how to +achieve High Availability (HA): + * cluster-per-AZ: An independent cluster for each AZ, where each cluster +is entirely separate. + * cross-AZ-clusters: A single cluster spans multiple AZs. The debate is open here, where cluster-per-AZ is discussed as more robust but cross-AZ-clusters are more convenient. * Associates the subnet to the route table * Creates security groups for the master (`kubernetes-master-`) - and the nodes (`kubernetes-minion-`). +and the nodes (`kubernetes-minion-`). * Configures security groups so that masters and nodes can communicate. This - includes intercommunication between masters and nodes, opening SSH publicly - for both masters and nodes, and opening port 443 on the master for the HTTPS - API endpoints. +includes intercommunication between masters and nodes, opening SSH publicly +for both masters and nodes, and opening port 443 on the master for the HTTPS +API endpoints. * Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type - `MASTER_DISK_TYPE`. +`MASTER_DISK_TYPE`. * Launches a master with a fixed IP address (172.20.0.9) that is also - configured for the security group and all the necessary IAM credentials. An - instance script is used to pass vital configuration information to Salt. Note: - The hope is that over time we can reduce the amount of configuration - information that must be passed in this way. +configured for the security group and all the necessary IAM credentials. An +instance script is used to pass vital configuration information to Salt. Note: +The hope is that over time we can reduce the amount of configuration +information that must be passed in this way.
* Once the instance is up, it attaches the EBS volume and sets up a manual - routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to - 10.246.0.0/24). +routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to +10.246.0.0/24). * For auto-scaling, on each node it creates a launch configuration and group. - The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default - name is kubernetes-minion-group. The auto-scaling group has a min and max size - that are both set to NUM_NODES. You can change the size of the auto-scaling - group to add or remove the total number of nodes from within the AWS API or - Console. Each nodes self-configures, meaning that they come up; run Salt with - the stored configuration; connect to the master; are assigned an internal CIDR; - and then the master configures the route-table with the assigned CIDR. The - kube-up script performs a health-check on the nodes but it's a self-check that - is not required. - - -If attempting this configuration manually, I highly recommend following along +The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default +name is kubernetes-minion-group. The auto-scaling group has a min and max size +that are both set to NUM_NODES. You can change the size of the auto-scaling +group to add or remove the total number of nodes from within the AWS API or +Console. Each node self-configures, meaning that it comes up; runs Salt with +the stored configuration; connects to the master; is assigned an internal CIDR; +and then the master configures the route-table with the assigned CIDR. The +kube-up script performs a health-check on the nodes but it's a self-check that +is not required. + +If attempting this configuration manually, it is recommended to follow
Also, passing the +`KubernetesCluster` and value set to a unique cluster-id. Also, passing the right configuration options to Salt when not using the script is tricky: the plan here is to simplify this by having Kubernetes take on more node configuration, and even potentially remove Salt altogether. - ### Manual infrastructure creation While this work is not yet complete, advanced users might choose to manually -create certain AWS objects while still making use of the kube-up script (to configure -Salt, for example). These objects can currently be manually created: - +create certain AWS objects while still making use of the kube-up script (to +configure Salt, for example). These objects can currently be manually created: * Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket. * Set the `VPC_ID` environment variable to reuse an existing VPC. * Set the `SUBNET_ID` environment variable to reuse an existing subnet. -* If your route table has a matching `KubernetesCluster` tag, it will - be reused. +* If your route table has a matching `KubernetesCluster` tag, it will be reused. * If your security groups are appropriately named, they will be reused. Currently there is no way to do the following with kube-up: - * Use an existing AWS SSH key with an arbitrary name. * Override the IAM credentials in a sensible way - ([#14226](http://issues.k8s.io/14226)). +([#14226](http://issues.k8s.io/14226)). * Use different security group permissions. * Configure your own auto-scaling groups. @@ -312,8 +308,6 @@ Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually install Kubernetes. 
- - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() diff --git a/clustering.md b/clustering.md index fbf6892c..327456b3 100644 --- a/clustering.md +++ b/clustering.md @@ -37,60 +37,122 @@ Documentation for other releases can be found at ## Overview -The term "clustering" refers to the process of having all members of the Kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address. +The term "clustering" refers to the process of having all members of the +Kubernetes cluster find and trust each other. There are multiple different ways +to achieve clustering with different security and usability profiles. This +document attempts to lay out the user experiences for clustering that Kubernetes +aims to address. Once a cluster is established, the following is true: -1. **Master -> Node** The master needs to know which nodes can take work and what their current status is wrt capacity. - 1. **Location** The master knows the name and location of all of the nodes in the cluster. - * For the purposes of this doc, location and name should be enough information so that the master can open a TCP connection to the Node. Most probably we will make this either an IP address or a DNS name. It is going to be important to be consistent here (master must be able to reach kubelet on that DNS name) so that we can verify certificates appropriately. - 2. **Target AuthN** A way to securely talk to the kubelet on that node. Currently we call out to the kubelet over HTTP. This should be over HTTPS and the master should know what CA to trust for that node. - 3. **Caller AuthN/Z** This would be the master verifying itself (and permissions) when calling the node. 
Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though. -2. **Node -> Master** The nodes currently talk to the master to know which pods have been assigned to them and to publish events. +1. **Master -> Node** The master needs to know which nodes can take work and +what their current status is wrt capacity. + 1. **Location** The master knows the name and location of all of the nodes in +the cluster. + * For the purposes of this doc, location and name should be enough +information so that the master can open a TCP connection to the Node. Most +probably we will make this either an IP address or a DNS name. It is going to be +important to be consistent here (master must be able to reach kubelet on that +DNS name) so that we can verify certificates appropriately. + 2. **Target AuthN** A way to securely talk to the kubelet on that node. +Currently we call out to the kubelet over HTTP. This should be over HTTPS and +the master should know what CA to trust for that node. + 3. **Caller AuthN/Z** This would be the master verifying itself (and +permissions) when calling the node. Currently, this is only used to collect +statistics as authorization isn't critical. This may change in the future +though. +2. **Node -> Master** The nodes currently talk to the master to know which pods +have been assigned to them and to publish events. 1. **Location** The nodes must know where the master is at. - 2. **Target AuthN** Since the master is assigning work to the nodes, it is critical that they verify whom they are talking to. - 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to the master. Ideally this authentication is specific to each node so that authorization can be narrowly scoped. The details of the work to run (including things like environment variables) might be considered sensitive and should be locked down also. 
- -**Note:** While the description here refers to a singular Master, in the future we should enable multiple Masters operating in an HA mode. While the "Master" is currently the combination of the API Server, Scheduler and Controller Manager, we will restrict ourselves to thinking about the main API and policy engine -- the API Server. + 2. **Target AuthN** Since the master is assigning work to the nodes, it is +critical that they verify whom they are talking to. + 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to +the master. Ideally this authentication is specific to each node so that +authorization can be narrowly scoped. The details of the work to run (including +things like environment variables) might be considered sensitive and should be +locked down also. + +**Note:** While the description here refers to a singular Master, in the future +we should enable multiple Masters operating in an HA mode. While the "Master" is +currently the combination of the API Server, Scheduler and Controller Manager, +we will restrict ourselves to thinking about the main API and policy engine -- +the API Server. ## Current Implementation -A central authority (generally the master) is responsible for determining the set of machines which are members of the cluster. Calls to create and remove worker nodes in the cluster are restricted to this single authority, and any other requests to add or remove worker nodes are rejected. (1.i). +A central authority (generally the master) is responsible for determining the +set of machines which are members of the cluster. Calls to create and remove +worker nodes in the cluster are restricted to this single authority, and any +other requests to add or remove worker nodes are rejected. (1.i.) -Communication from the master to nodes is currently over HTTP and is not secured or authenticated in any way. (1.ii, 1.iii). 
+Communication from the master to nodes is currently over HTTP and is not secured +or authenticated in any way. (1.ii, 1.iii.) -The location of the master is communicated out of band to the nodes. For GCE, this is done via Salt. Other cluster instructions/scripts use other methods. (2.i) +The location of the master is communicated out of band to the nodes. For GCE, +this is done via Salt. Other cluster instructions/scripts use other methods. +(2.i.) -Currently most communication from the node to the master is over HTTP. When it is done over HTTPS there is currently no verification of the cert of the master (2.ii). +Currently most communication from the node to the master is over HTTP. When it +is done over HTTPS there is currently no verification of the cert of the master +(2.ii.) -Currently, the node/kubelet is authenticated to the master via a token shared across all nodes. This token is distributed out of band (using Salt for GCE) and is optional. If it is not present then the kubelet is unable to publish events to the master. (2.iii) +Currently, the node/kubelet is authenticated to the master via a token shared +across all nodes. This token is distributed out of band (using Salt for GCE) and +is optional. If it is not present then the kubelet is unable to publish events +to the master. (2.iii.) -Our current mix of out of band communication doesn't meet all of our needs from a security point of view and is difficult to set up and configure. +Our current mix of out of band communication doesn't meet all of our needs from +a security point of view and is difficult to set up and configure. ## Proposed Solution -The proposed solution will provide a range of options for setting up and maintaining a secure Kubernetes cluster. We want to both allow for centrally controlled systems (leveraging pre-existing trust and configuration systems) or more ad-hoc automagic systems that are incredibly easy to set up. 
+The proposed solution will provide a range of options for setting up and +maintaining a secure Kubernetes cluster. We want to allow both centrally +controlled systems (leveraging pre-existing trust and configuration systems) and +more ad-hoc automagic systems that are incredibly easy to set up. The building blocks of an easier solution: -* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN. -* [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate. - * **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors. -* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give a node permission to register itself. - * To start with, we'd have the kubelets generate a cert/account in the form of `kubelet:`. To start we would then hard code policy such that we give that particular account appropriate permissions. Over time, we can make the policy engine more generic. -* [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master. +* **Move to TLS** We will move to using TLS for all intra-cluster communication. +We will explicitly identify the trust chain (the set of trusted CAs) as opposed +to trusting the system CAs.
We will also use client certificates for all AuthN. +* [optional] **API driven CA** Optionally, we will run a CA in the master that +will mint certificates for the nodes/kubelets. There will be pluggable policies +that will automatically approve certificate requests here as appropriate. + * **CA approval policy** This is a pluggable policy object that can +automatically approve CA signing requests. Stock policies will include +`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would +be an API for evaluating and accepting/rejecting requests. Cloud providers could +implement a policy here that verifies other out of band information and +automatically approves/rejects based on other external factors. +* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give +a node permission to register itself. + * To start with, we'd have the kubelets generate a cert/account in the form of +`kubelet:`. To start we would then hard code policy such that we give that +particular account appropriate permissions. Over time, we can make the policy +engine more generic. +* [optional] **Bootstrap API endpoint** This is a helper service hosted outside +of the Kubernetes cluster that helps with initial discovery of the master. ### Static Clustering -In this sequence diagram there is out of band admin entity that is creating all certificates and distributing them. It is also making sure that the kubelets know where to find the master. This provides for a lot of control but is more difficult to set up as lots of information must be communicated outside of Kubernetes. +In this sequence diagram there is an out-of-band admin entity that is creating +all certificates and distributing them. It is also making sure that the kubelets +know where to find the master. This provides for a lot of control but is more +difficult to set up as lots of information must be communicated outside of +Kubernetes.
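The out-of-band certificate creation in the static flow can be sketched with `openssl`; the file names and the `kubelet:node1` CN format are illustrative assumptions, not the actual tooling:

```shell
# 1. The admin creates a cluster CA.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=kubernetes-ca"
# 2. A key and signing request are created for one kubelet.
openssl req -newkey rsa:2048 -nodes \
  -keyout node1.key -out node1.csr -subj "/CN=kubelet:node1"
# 3. The CA signs the request (manually here; the `queue` policy would
#    gate this step behind an approval API).
openssl x509 -req -days 365 -in node1.csr \
  -CA ca.crt -CAkey ca.key -CAcreateserial -out node1.crt
# 4. Anyone holding ca.crt can now verify the node certificate.
openssl verify -CAfile ca.crt node1.crt
```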
![Static Sequence Diagram](clustering/static.png) ### Dynamic Clustering -This diagram dynamic clustering using the bootstrap API endpoint. That API endpoint is used to both find the location of the master and communicate the root CA for the master. +This diagram shows dynamic clustering using the bootstrap API endpoint. This +endpoint is used to both find the location of the master and communicate the +root CA for the master. -This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above.This manual intervention could be replaced by code that can verify the signing requests via other means. +This flow has the admin manually approving the kubelet signing requests. This is +the `queue` policy defined above. This manual intervention could be replaced by +code that can verify the signing requests via other means. ![Dynamic Sequence Diagram](clustering/dynamic.png) diff --git a/clustering/README.md b/clustering/README.md index 3bfd1905..193f343b 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -33,7 +33,8 @@ Documentation for other releases can be found at This directory contains diagrams for the clustering design doc. -This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). Assuming you have a non-borked python install, this should be installable with +This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). +Assuming you have a non-borked python install, this should be installable with: ```sh pip install seqdiag @@ -43,7 +44,8 @@ Just call `make` to regenerate the diagrams. ## Building with Docker -If you are on a Mac or your pip install is messed up, you can easily build with docker. +If you are on a Mac or your pip install is messed up, you can easily build with +docker: ```sh make docker @@ -51,13 +53,18 @@ make docker The first run will be slow but things should be fast after that. 
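A throwaway diagram source is a quick way to sanity-check a local `seqdiag` install; the actors and labels here are illustrative, and the real sources live in this directory and are built by `make`:

```shell
cat > /tmp/example.diag <<'EOF'
seqdiag {
  admin -> master [label = "create certs"];
  master -> kubelet [label = "assign pods"];
  kubelet -> master [label = "publish events"];
}
EOF
# Render it if seqdiag is on PATH:
command -v seqdiag >/dev/null && seqdiag /tmp/example.diag -o /tmp/example.png || true
```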
-To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`. +To clean up the docker containers that are created (and other cruft that is left +around) you can run `make docker-clean`. -If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`. +If you are using boot2docker and get warnings about clock skew (or if things +aren't building for some reason) then you can fix that up with +`make fix-clock-skew`. ## Automatically rebuild on file changes -If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`. +If you have the fswatch utility installed, you can have it monitor the file +system and automatically rebuild when files have changed. Just do a +`make watch`. diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index d687f3e2..4e579f8d 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -36,14 +36,13 @@ Documentation for other releases can be found at ## Abstract -This describes an approach for providing support for: - -- executing commands in containers, with stdin/stdout/stderr streams attached -- port forwarding to containers +This document describes how to use Kubernetes to execute commands in containers, +with stdin/stdout/stderr streams attached and how to implement port forwarding +to the containers. ## Background -There are several related issues/PRs: +See the following related issues/PRs: - [Support attach](http://issue.k8s.io/1521) - [Real container ssh](http://issue.k8s.io/1513) @@ -77,34 +76,39 @@ won't be able to work with this mechanism, unless adapters can be written. 
## Constraints and Assumptions -- SSH support is not currently in scope -- CGroup confinement is ultimately desired, but implementing that support is not currently in scope -- SELinux confinement is ultimately desired, but implementing that support is not currently in scope +- SSH support is not currently in scope. +- CGroup confinement is ultimately desired, but implementing that support is not +currently in scope. +- SELinux confinement is ultimately desired, but implementing that support is +not currently in scope. ## Use Cases -- As a user of a Kubernetes cluster, I want to run arbitrary commands in a container, attaching my local stdin/stdout/stderr to the container -- As a user of a Kubernetes cluster, I want to be able to connect to local ports on my computer and have them forwarded to ports in the container +- A user of a Kubernetes cluster wants to run arbitrary commands in a +container with local stdin/stdout/stderr attached to the container. +- A user of a Kubernetes cluster wants to connect to local ports on their computer +and have them forwarded to ports in a container. ## Process Flow ### Remote Command Execution Flow -1. The client connects to the Kubernetes Master to initiate a remote command execution -request -2. The Master proxies the request to the Kubelet where the container lives -3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container +1. The client connects to the Kubernetes Master to initiate a remote command +execution request. +2. The Master proxies the request to the Kubelet where the container lives. +3. The Kubelet executes nsenter + the requested command and streams +stdin/stdout/stderr back and forth between the client and the container. ### Port Forwarding Flow -1. The client connects to the Kubernetes Master to initiate a remote command execution -request -2. The Master proxies the request to the Kubelet where the container lives -3. 
The client listens on each specified local port, awaiting local connections -4. The client connects to one of the local listening ports -4. The client notifies the Kubelet of the new connection -5. The Kubelet executes nsenter + socat and streams data back and forth between the client and the port in the container - +1. The client connects to the Kubernetes Master to initiate a port forwarding +request. +2. The Master proxies the request to the Kubelet where the container lives. +3. The client listens on each specified local port, awaiting local connections. +4. The client connects to one of the local listening ports. +5. The client notifies the Kubelet of the new connection. +6. The Kubelet executes nsenter + socat and streams data back and forth between +the client and the port in the container. ## Design Considerations @@ -177,7 +181,10 @@ functionality. We need to make sure that users are not allowed to execute remote commands or do port forwarding to containers they aren't allowed to access. -Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts. +Additional work is required to ensure that multiple command execution or port +forwarding connections from different clients are not able to see each other's +data. This can most likely be achieved via SELinux labeling and unique process +contexts. diff --git a/configmap.md b/configmap.md index 72bb4415..b12e051a 100644 --- a/configmap.md +++ b/configmap.md @@ -36,8 +36,8 @@ Documentation for other releases can be found at ## Abstract -The `ConfigMap` API resource stores data used for the configuration of applications deployed on -Kubernetes. +The `ConfigMap` API resource stores data used for the configuration of +applications deployed on Kubernetes. 
The main focus of this resource is to: @@ -47,71 +47,74 @@ The main focus of this resource is to: ## Motivation -A `Secret`-like API resource is needed to store configuration data that pods can consume. +A `Secret`-like API resource is needed to store configuration data that pods can +consume. Goals of this design: -1. Describe a `ConfigMap` API resource -2. Describe the semantics of consuming `ConfigMap` as environment variables -3. Describe the semantics of consuming `ConfigMap` as files in a volume +1. Describe a `ConfigMap` API resource. +2. Describe the semantics of consuming `ConfigMap` as environment variables. +3. Describe the semantics of consuming `ConfigMap` as files in a volume. ## Use Cases -1. As a user, I want to be able to consume configuration data as environment variables -2. As a user, I want to be able to consume configuration data as files in a volume -3. As a user, I want my view of configuration data in files to be eventually consistent with changes - to the data +1. As a user, I want to be able to consume configuration data as environment +variables. +2. As a user, I want to be able to consume configuration data as files in a +volume. +3. As a user, I want my view of configuration data in files to be eventually +consistent with changes to the data. ### Consuming `ConfigMap` as Environment Variables -Many programs read their configuration from environment variables. `ConfigMap` should be possible -to consume in environment variables. The rough series of events for consuming `ConfigMap` this way -is: +A series of events for consuming `ConfigMap` as environment variables: -1. A `ConfigMap` object is created -2. A pod that consumes the configuration data via environment variables is created -3. The pod is scheduled onto a node -4. The kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and starts the container - processes with the appropriate data in environment variables +1. Create a `ConfigMap` object. +2. 
Create a pod to consume the configuration data via environment variables. +3. The pod is scheduled onto a node. +4. The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and +starts the container processes with the appropriate configuration data in +environment variables. ### Consuming `ConfigMap` in Volumes -Many programs read their configuration from configuration files. `ConfigMap` should be possible -to consume in a volume. The rough series of events for consuming `ConfigMap` this way -is: +A series of events for consuming `ConfigMap` as configuration files in a volume: -1. A `ConfigMap` object is created -2. A new pod using the `ConfigMap` via the volume plugin is created -3. The pod is scheduled onto a node -4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` method -5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects - the appropriate data into the volume +1. Create a `ConfigMap` object. +2. Create a new pod using the `ConfigMap` via a volume plugin. +3. The pod is scheduled onto a node. +4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` +method. +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod +and projects the appropriate configuration data into the volume. ### Consuming `ConfigMap` Updates -Any long-running system has configuration that is mutated over time. Changes made to configuration -data must be made visible to pods consuming data in volumes so that they can respond to those -changes. +Any long-running system has configuration that is mutated over time. Changes +made to configuration data must be made visible to pods consuming data in +volumes so that they can respond to those changes. -The `resourceVersion` of the `ConfigMap` object will be updated by the API server every time the -object is modified. 
After an update, modifications will be made visible to the consumer container: +The `resourceVersion` of the `ConfigMap` object will be updated by the API +server every time the object is modified. After an update, modifications will be +made visible to the consumer container: -1. A `ConfigMap` object is created -2. A new pod using the `ConfigMap` via the volume plugin is created -3. The pod is scheduled onto a node -4. During the sync loop, the Kubelet creates an instance of the volume plugin and calls its - `Setup()` method -5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod and projects - the appropriate data into the volume -6. The `ConfigMap` referenced by the pod is updated -7. During the next iteration of the `syncLoop`, the Kubelet creates an instance of the volume plugin - and calls its `Setup()` method -8. The volume plugin projects the updated data into the volume atomically +1. Create a `ConfigMap` object. +2. Create a new pod using the `ConfigMap` via the volume plugin. +3. The pod is scheduled onto a node. +4. During the sync loop, the Kubelet creates an instance of the volume plugin +and calls its `Setup()` method. +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod +and projects the appropriate data into the volume. +6. The `ConfigMap` referenced by the pod is updated. +7. During the next iteration of the `syncLoop`, the Kubelet creates an instance +of the volume plugin and calls its `Setup()` method. +8. The volume plugin projects the updated data into the volume atomically. -It is the consuming pod's responsibility to make use of the updated data once it is made visible. +It is the consuming pod's responsibility to make use of the updated data once it +is made visible. -Because environment variables cannot be updated without restarting a container, configuration data -consumed in environment variables will not be updated. 
+Because environment variables cannot be updated without restarting a container, +configuration data consumed in environment variables will not be updated. ### Advantages @@ -133,8 +136,8 @@ type ConfigMap struct { TypeMeta `json:",inline"` ObjectMeta `json:"metadata,omitempty"` - // Data contains the configuration data. Each key must be a valid DNS_SUBDOMAIN or leading - // dot followed by valid DNS_SUBDOMAIN. + // Data contains the configuration data. Each key must be a valid + // DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN. Data map[string]string `json:"data,omitempty"` } @@ -146,7 +149,8 @@ type ConfigMapList struct { } ``` -A `Registry` implementation for `ConfigMap` will be added to `pkg/registry/configmap`. +A `Registry` implementation for `ConfigMap` will be added to +`pkg/registry/configmap`. ### Environment Variables @@ -174,8 +178,8 @@ type ConfigMapSelector struct { ### Volume Source -A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` object will be -added to the `VolumeSource` struct in the API: +A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` +object will be added to the `VolumeSource` struct in the API: ```go package api @@ -209,13 +213,14 @@ type KeyToPath struct { } ``` -**Note:** The update logic used in the downward API volume plug-in will be extracted and re-used in -the volume plug-in for `ConfigMap`. +**Note:** The update logic used in the downward API volume plug-in will be +extracted and re-used in the volume plug-in for `ConfigMap`. ### Changes to Secret -We will update the Secret volume plugin to have a similar API to the new ConfigMap volume plugin. -The secret volume plugin will also begin updating secret content in the volume when secrets change. +We will update the Secret volume plugin to have a similar API to the new +`ConfigMap` volume plugin. The secret volume plugin will also begin updating +secret content in the volume when secrets change. 
## Examples @@ -281,7 +286,8 @@ spec: #### Consuming `ConfigMap` as Volumes -`redis-volume-config` is intended to be used as a volume containing a config file: +`redis-volume-config` is intended to be used as a volume containing a config +file: ```yaml apiVersion: extensions/v1beta1 @@ -320,8 +326,8 @@ spec: ## Future Improvements -In the future, we may add the ability to specify an init-container that can watch the volume -contents for updates and respond to changes when they occur. +In the future, we may add the ability to specify an init-container that can +watch the volume contents for updates and respond to changes when they occur. [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]() diff --git a/control-plane-resilience.md b/control-plane-resilience.md index 8becccec..39110e3a 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -54,7 +54,7 @@ ideas. * **High availability:** continuing to be available and work correctly even if some components are down or uncontactable. This typically involves multiple replicas of critical services, and a reliable way - to find available replicas. Note that it's possible (but not + to find available replicas. Note that it's possible (but not desirable) to have high availability properties (e.g. multiple replicas) in the absence of self-healing properties (e.g. if a replica fails, nothing replaces @@ -109,11 +109,11 @@ ideas. ## Relative Priorities -1. **(Possibly manual) recovery from catastrophic failures:** having a Kubernetes cluster, and all - applications running inside it, disappear forever perhaps is the worst - possible failure mode. So it is critical that we be able to - recover the applications running inside a cluster from such - failures in some well-bounded time period. +1. 
**(Possibly manual) recovery from catastrophic failures:** having a +Kubernetes cluster, and all applications running inside it, disappear forever +is perhaps the worst possible failure mode. So it is critical that we be able to +recover the applications running inside a cluster from such failures in some +well-bounded time period. 1. In theory a cluster can be recovered by replaying all API calls that have ever been executed against it, in order, but most often that state has been lost, and/or is scattered across @@ -121,12 +121,12 @@ ideas. probably infeasible. 1. In theory a cluster can also be recovered to some relatively recent non-corrupt backup/snapshot of the disk(s) backing the - etcd cluster state. But we have no default consistent + etcd cluster state. But we have no default consistent backup/snapshot, verification or restoration process. And we don't routinely test restoration, so even if we did routinely perform and verify backups, we have no hard evidence that we can in practice effectively recover from catastrophic cluster - failure or data corruption by restoring from these backups. So + failure or data corruption by restoring from these backups. So there's more work to be done here. 1. **Self-healing:** Most major cloud providers provide the ability to easily and automatically replace failed virtual machines within a @@ -144,7 +144,6 @@ ideas. addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member) or [backup and recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)). - 1. and boot disks are either: 1. truly persistent (i.e. remote persistent disks), or 1. reconstructible (e.g. using boot-from-snapshot, @@ -157,7 +156,7 @@ ideas. quorum members). In environments where cloud-assisted automatic self-healing might be infeasible (e.g. on-premise bare-metal deployments), it also gives cluster administrators more time to - respond (e.g.
replace/repair failed machines) without incurring + respond (e.g. replace/repair failed machines) without incurring system downtime. ## Design and Status (as of December 2015) @@ -174,7 +173,7 @@ ideas. Multiple stateless, self-hosted, self-healing API servers behind an HA load balancer, built out by the default "kube-up" automation on GCE, -AWS and basic bare metal (BBM). Note that the single-host approach of +AWS and basic bare metal (BBM). Note that the single-host approach of having etcd listen only on localhost to ensure that only the API server can connect to it will no longer work, so alternative security will be needed in this regard (either using firewall rules, SSL certs, or @@ -189,13 +188,13 @@ design doc. No scripted self-healing or HA on GCE, AWS or basic bare metal -currently exists in the OSS distro. To be clear, "no self healing" +currently exists in the OSS distro. To be clear, "no self healing" means that even if multiple e.g. API servers are provisioned for HA purposes, if they fail, nothing replaces them, so eventually the -system will fail. Self-healing and HA can be set up +system will fail. Self-healing and HA can be set up manually by following documented instructions, but this is not currently an automated process, and it is not tested as part of -continuous integration. So it's probably safest to assume that it +continuous integration. So it's probably safest to assume that it doesn't actually work in practice. @@ -205,8 +204,8 @@ doesn't actually work in practice. Multiple self-hosted, self-healing warm standby stateless controller -managers and schedulers with leader election and automatic failover of API server -clients, automatically installed by default "kube-up" automation. +managers and schedulers with leader election and automatic failover of API +server clients, automatically installed by default "kube-up" automation. As above. @@ -218,47 +217,49 @@ clients, automatically installed by default "kube-up" automation. 
Multiple (3-5) etcd quorum members behind a load balancer with session affinity (to prevent clients from being bounced from one to another). -Regarding self-healing, if a node running etcd goes down, it is always necessary to do three -things: +Regarding self-healing, if a node running etcd goes down, it is always necessary +to do three things:
  1. allocate a new node (not necessary if running etcd as a pod, in which case specific measures are required to prevent user pods from interfering with system pods, for example using node selectors as described in
  2. start an etcd replica on that new node, and
  3. have the new replica recover the etcd state.
In the case of local disk (which fails in concert with the machine), the etcd -state must be recovered from the other replicas. This is called dynamic member - addition. -In the case of remote persistent disk, the etcd state can be recovered -by attaching the remote persistent disk to the replacement node, thus -the state is recoverable even if all other replicas are down. +state must be recovered from the other replicas. This is called + +dynamic member addition. + +In the case of remote persistent disk, the etcd state can be recovered by +attaching the remote persistent disk to the replacement node, thus the state is +recoverable even if all other replicas are down. There are also significant performance differences between local disks and remote -persistent disks. For example, the sustained throughput -local disks in GCE is approximatley 20x that of remote disks. - -Hence we suggest that self-healing be provided by remotely mounted persistent disks in -non-performance critical, single-zone cloud deployments. For -performance critical installations, faster local SSD's should be used, -in which case remounting on node failure is not an option, so -etcd runtime configuration -should be used to replace the failed machine. Similarly, for -cross-zone self-healing, cloud persistent disks are zonal, so -automatic -runtime configuration -is required. Similarly, basic bare metal deployments cannot generally -rely on -remote persistent disks, so the same approach applies there. +persistent disks. For example, the + +sustained throughput of local disks in GCE is approximately 20x that of remote +disks. + +Hence we suggest that self-healing be provided by remotely mounted persistent +disks in non-performance critical, single-zone cloud deployments. For +performance critical installations, faster local SSDs should be used, in which +case remounting on node failure is not an option, so + +etcd runtime configuration should be used to replace the failed machine. 
+Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so +automatic +runtime configuration is required. Similarly, basic bare metal deployments +cannot generally rely on remote persistent disks, so the same approach applies +there. -Somewhat vague instructions exist -on how to set some of this up manually in a self-hosted -configuration. But automatic bootstrapping and self-healing is not -described (and is not implemented for the non-PD cases). This all -still needs to be automated and continuously tested. +Somewhat vague instructions exist on how to set some of this up manually in +a self-hosted configuration. But automatic bootstrapping and self-healing is not +described (and is not implemented for the non-PD cases). This all still needs to +be automated and continuously tested. diff --git a/daemon.md b/daemon.md index a08e4c3b..9b66e0e1 100644 --- a/daemon.md +++ b/daemon.md @@ -38,40 +38,68 @@ Documentation for other releases can be found at **Status**: Implemented. -This document presents the design of the Kubernetes DaemonSet, describes use cases, and gives an overview of the code. +This document presents the design of the Kubernetes DaemonSet, describes use +cases, and gives an overview of the code. ## Motivation -Many users have requested for a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes. +Many users have requested a way to run a daemon on every node in a +Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential +for use cases such as building a sharded datastore, or running a logger on every +node. In comes the DaemonSet, a way to conveniently create and manage +daemon-like workloads in Kubernetes. 
## Use Cases -The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category. +The DaemonSet can be used for user-specified system services, cluster-level +applications with strong node ties, and Kubernetes node services. Below are +example use cases in each category. ### User-Specified System Services: -Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis. +Logging: Some users want a way to collect statistics about nodes in a cluster +and send those logs to an external database. For example, system administrators +might want to know if their machines are performing as expected, if they need to +add more machines to the cluster, or if they should switch cloud providers. The +DaemonSet can be used to run a data collection service (for example fluentd) on +every node and send the data to a service like ElasticSearch for analysis. ### Cluster-Level Applications -Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘app=datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore. +Datastore: Users might want to implement a sharded datastore in their cluster. 
A +few nodes in the cluster, labeled ‘app=datastore’, might be responsible for +storing data shards, and pods running on these nodes might serve data. This +architecture requires a way to bind pods to specific nodes, so it cannot be +achieved using a Replication Controller. A DaemonSet is a convenient way to +implement such a datastore. For other uses, see the related [feature request](https://issues.k8s.io/1518) ## Functionality The DaemonSet supports standard API features: -- create + - create - The spec for DaemonSets has a pod template field. - - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled ‘app=database’. - - Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node. - - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller. - - The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods. - - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary. - - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \" + - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate +over nodes that have a certain label. For example, suppose that in a cluster +some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a +datastore pod on exactly those nodes labeled ‘app=database’. 
+ - Using the pod's nodeName field, DaemonSets can be restricted to operate on a +specified node. + - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec +used by the Replication Controller. + - The initial implementation will not guarantee that DaemonSet pods are +created on nodes before other pods. + - The initial implementation of DaemonSet does not guarantee that DaemonSet +pods show up on nodes (for example because of resource limitations of the node), +but makes a best effort to launch DaemonSet pods (like Replication Controllers +do with pods). Subsequent revisions might ensure that DaemonSet pods show up on +nodes, preempting other pods if necessary. + - The DaemonSet controller adds an annotation: +```"kubernetes.io/created-by: \"``` - YAML example: -```YAML + ```YAML apiVersion: extensions/v1beta1 kind: DaemonSet metadata: @@ -94,42 +122,83 @@ The DaemonSet supports standard API features: name: main ``` - - commands that get info + - commands that get info: - get (e.g. kubectl get daemonsets) - describe - - Modifiers - - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is unlikely to be set on any node); then it deletes the DaemonSet; then it deletes the pods) + - Modifiers: + - delete (if --cascade=true, then first the client turns down all the pods +controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is +unlikely to be set on any node); then it deletes the DaemonSet; then it deletes +the pods) - label - - annotate - - update operations like patch and replace (only allowed to selector and to nodeSelector and nodeName of pod template) - - DaemonSets have labels, so you could, for example, list all DaemonSets with certain labels (the same way you would for a Replication Controller). 
- - In general, for all the supported features like get, describe, update, etc, the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs. + - annotate + - update operations like patch and replace (only allowed to selector and to +nodeSelector and nodeName of pod template) + - DaemonSets have labels, so you could, for example, list all DaemonSets +with certain labels (the same way you would for a Replication Controller). + +In general, for all the supported features like get, describe, update, etc, +the DaemonSet works in a similar way to the Replication Controller. However, +note that the DaemonSet and the Replication Controller are different constructs. ### Persisting Pods - - Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running. - - If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node. + - Ordinary liveness probes specified in the pod template work to keep pods +created by a DaemonSet running. + - If a daemon pod is killed or stopped, the DaemonSet will create a new +replica of the daemon pod on the node. ### Cluster Mutations - - When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node’s labels. - - Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed. + - When a new node is added to the cluster, the DaemonSet controller starts +daemon pods on the node for DaemonSets whose pod template nodeSelectors match +the node’s labels. 
+ - Suppose the user launches a DaemonSet that runs a logging daemon on all +nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label +to a node (that did not initially have the label), the logging daemon will +launch on the node. Additionally, if a user removes the label from a node, the +logging daemon on that node will be killed. ## Alternatives Considered -We considered several alternatives, that were deemed inferior to the approach of creating a new DaemonSet abstraction. - -One alternative is to include the daemon in the machine image. In this case it would run outside of Kubernetes proper, and thus not be monitored, health checked, usable as a service endpoint, easily upgradable, etc. - -A related alternative is to package daemons as static pods. This would address most of the problems described above, but they would still not be easily upgradable, and more generally could not be managed through the API server interface. - -A third alternative is to generalize the Replication Controller. We would do something like: if you set the `replicas` field of the ReplicationConrollerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number of nodes ever expected in the cluster -- and would use some anti-affinity mechanism to ensure that no more than one Pod from the ReplicationController runs on any given node. There are two downsides to this approach. First, there would always be a large number of Pending pods in the scheduler (these will be scheduled onto new machines when they are added to the cluster). The second downside is more philosophical: DaemonSet and the Replication Controller are very different concepts. 
We believe that having small, targeted controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having larger multi-functional controllers (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for some discussion of this topic). +We considered several alternatives that were deemed inferior to the approach of +creating a new DaemonSet abstraction. + +One alternative is to include the daemon in the machine image. In this case it +would run outside of Kubernetes proper, and thus not be monitored, health +checked, usable as a service endpoint, easily upgradable, etc. + +A related alternative is to package daemons as static pods. This would address +most of the problems described above, but they would still not be easily +upgradable, and more generally could not be managed through the API server +interface. + +A third alternative is to generalize the Replication Controller. We would do +something like: if you set the `replicas` field of the ReplicationControllerSpec +to -1, then it means "run exactly one replica on every node matching the +nodeSelector in the pod template." The ReplicationController would pretend +`replicas` had been set to some large number -- larger than the largest number +of nodes ever expected in the cluster -- and would use some anti-affinity +mechanism to ensure that no more than one Pod from the ReplicationController +runs on any given node. There are two downsides to this approach. First, +there would always be a large number of Pending pods in the scheduler (these +will be scheduled onto new machines when they are added to the cluster). The +second downside is more philosophical: DaemonSet and the Replication Controller +are very different concepts.
We believe that having small, targeted controllers +for distinct purposes makes Kubernetes easier to understand and use, compared to +having larger multi-functional controllers (see +["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for +some discussion of this topic). ## Design #### Client -- Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API. +- Add support for DaemonSet commands to kubectl and the client. Client code was +added to client/unversioned. The main files in Kubectl that were modified are +kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, +and Update, the client simply forwards the request to the backend via the REST +API. #### Apiserver @@ -137,18 +206,29 @@ A third alternative is to generalize the Replication Controller. We would do som - REST API calls are handled in registry/daemon - In particular, the api server will add the object to etcd - DaemonManager listens for updates to etcd (using Framework.informer) -- API objects for DaemonSet were created in expapi/v1/types.go and expapi/v1/register.go +- API objects for DaemonSet were created in expapi/v1/types.go and +expapi/v1/register.go - Validation code is in expapi/validation #### Daemon Manager -- Creates new DaemonSets when requested. Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet’s selector. -- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. 
If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node. -- The daemon manager creates a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname) +- Creates new DaemonSets when requested. Launches the corresponding daemon pod +on all nodes with labels matching the new DaemonSet’s selector. +- Listens for addition of new nodes to the cluster, by setting up a +framework.NewInformer that watches for the creation of Node API objects. When a +new node is added, the daemon manager will loop through each DaemonSet. If the +label of the node matches the selector of the DaemonSet, then the daemon manager +will create the corresponding daemon pod in the new node. +- The daemon manager creates a pod on a node by sending a command to the API +server, requesting that a pod be bound to the node (the node will be specified +via its hostname). #### Kubelet -- Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don’t have restartPolicy set to Always. +- Does not need to be modified, but health checking will occur for the daemon +pods and revive the pods if they are killed (we set the pod restartPolicy to +Always). We reject DaemonSet objects with pod templates that don’t have +restartPolicy set to Always. ## Open Issues diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 4eef8831..8f184af9 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -34,33 +34,60 @@ Documentation for other releases can be found at # Enhance Pluggable Policy -While trying to develop an authorization plugin for Kubernetes, we found a few places where API extensions would ease development and add power.
There are a few goals: - 1. Provide an authorization plugin that can evaluate a .Authorize() call based on the full content of the request to RESTStorage. This includes information like the full verb, the content of creates and updates, and the names of resources being acted upon. - 1. Provide a way to ask whether a user is permitted to take an action without running in process with the API Authorizer. For instance, a proxy for exec calls could ask whether a user can run the exec they are requesting. - 1. Provide a way to ask who can perform a given action on a given resource. This is useful for answering questions like, "who can create replication controllers in my namespace". - -This proposal adds to and extends the existing API to so that authorizers may provide the functionality described above. It does not attempt to describe how the policies themselves can be expressed, that is up the authorization plugins themselves. +While trying to develop an authorization plugin for Kubernetes, we found a few +places where API extensions would ease development and add power. There are a +few goals: + 1. Provide an authorization plugin that can evaluate a .Authorize() call based +on the full content of the request to RESTStorage. This includes information +like the full verb, the content of creates and updates, and the names of +resources being acted upon. + 1. Provide a way to ask whether a user is permitted to take an action without + running in process with the API Authorizer. For instance, a proxy for exec + calls could ask whether a user can run the exec they are requesting. + 1. Provide a way to ask who can perform a given action on a given resource. +This is useful for answering questions like, "who can create replication +controllers in my namespace". + +This proposal adds to and extends the existing API so that authorizers may +provide the functionality described above.
It does not attempt to describe how +the policies themselves can be expressed; that is up to the authorization plugins +themselves. ## Enhancements to existing Authorization interfaces -The existing Authorization interfaces are described here: [docs/admin/authorization.md](../admin/authorization.md). A couple additions will allow the development of an Authorizer that matches based on different rules than the existing implementation. +The existing Authorization interfaces are described +[here](../admin/authorization.md). A couple additions will allow the development +of an Authorizer that matches based on different rules than the existing +implementation. ### Request Attributes -The existing authorizer.Attributes only has 5 attributes (user, groups, isReadOnly, kind, and namespace). If we add more detailed verbs, content, and resource names, then Authorizer plugins will have the same level of information available to RESTStorage components in order to express more detailed policy. The replacement excerpt is below. - -An API request has the following attributes that can be considered for authorization: - - user - the user-string which a user was authenticated as. This is included in the Context. - - groups - the groups to which the user belongs. This is included in the Context. - - verb - string describing the requesting action. Today we have: get, list, watch, create, update, and delete. The old `readOnly` behavior is equivalent to allowing get, list, watch. - - namespace - the namespace of the object being access, or the empty string if the endpoint does not support namespaced objects. This is included in the Context. +The existing authorizer.Attributes only has 5 attributes (user, groups, +isReadOnly, kind, and namespace). If we add more detailed verbs, content, and +resource names, then Authorizer plugins will have the same level of information +available to RESTStorage components in order to express more detailed policy. +The replacement excerpt is below.
+ +An API request has the following attributes that can be considered for +authorization: + - user - the user-string which a user was authenticated as. This is included +in the Context. + - groups - the groups to which the user belongs. This is included in the +Context. + - verb - string describing the requested action. Today we have: get, list, +watch, create, update, and delete. The old `readOnly` behavior is equivalent to +allowing get, list, watch. + - namespace - the namespace of the object being accessed, or the empty string if +the endpoint does not support namespaced objects. This is included in the +Context. - resourceGroup - the API group of the resource being accessed - resourceVersion - the API version of the resource being accessed - resource - which resource is being accessed - - applies only to the API endpoints, such as - `/api/v1beta1/pods`. For miscellaneous endpoints, like `/version`, the kind is the empty string. - - resourceName - the name of the resource during a get, update, or delete action. + - applies only to the API endpoints, such as `/api/v1beta1/pods`. For +miscellaneous endpoints, like `/version`, the kind is the empty string. + - resourceName - the name of the resource during a get, update, or delete +action. - subresource - which subresource is being accessed A non-API request has 2 attributes: @@ -70,7 +97,14 @@ A non-API request has 2 attributes: ### Authorizer Interface -The existing Authorizer interface is very simple, but there isn't a way to provide details about allows, denies, or failures. The extended detail is useful for UIs that want to describe why certain actions are allowed or disallowed. Not all Authorizers will want to provide that information, but for those that do, having that capability is useful.
In addition, adding a `GetAllowedSubjects` method that returns back the users and groups that can perform a particular action makes it possible to answer questions like, "who can see resources in my namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down). +The existing Authorizer interface is very simple, but there isn't a way to +provide details about allows, denies, or failures. The extended detail is useful +for UIs that want to describe why certain actions are allowed or disallowed. Not +all Authorizers will want to provide that information, but for those that do, +having that capability is useful. In addition, adding a `GetAllowedSubjects` +method that returns back the users and groups that can perform a particular +action makes it possible to answer questions like, "who can see resources in my +namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down). ```go // OLD @@ -81,41 +115,65 @@ type Authorizer interface { ```go // NEW -// Authorizer provides the ability to determine if a particular user can perform a particular action +// Authorizer provides the ability to determine if a particular user can perform +// a particular action type Authorizer interface { - // Authorize takes a Context (for namespace, user, and traceability) and Attributes to make a policy determination. - // reason is an optional return value that can describe why a policy decision was made. Reasons are useful during - // debugging when trying to figure out why a user or group has access to perform a particular action. + // Authorize takes a Context (for namespace, user, and traceability) and + // Attributes to make a policy determination. + // reason is an optional return value that can describe why a policy decision + // was made. Reasons are useful during debugging when trying to figure out + // why a user or group has access to perform a particular action. 
Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error) } -// AuthorizerIntrospection is an optional interface that provides the ability to determine which users and groups can perform a particular action. -// This is useful for building caches of who can see what: for instance, "which namespaces can this user see". That would allow -// someone to see only the namespaces they are allowed to view instead of having to choose between listing them all or listing none. +// AuthorizerIntrospection is an optional interface that provides the ability to +// determine which users and groups can perform a particular action. This is +// useful for building caches of who can see what. For instance, "which +// namespaces can this user see". That would allow someone to see only the +// namespaces they are allowed to view instead of having to choose between +// listing them all or listing none. type AuthorizerIntrospection interface { - // GetAllowedSubjects takes a Context (for namespace and traceability) and Attributes to determine which users and - // groups are allowed to perform the described action in the namespace. This API enables the ResourceBasedReview requests below + // GetAllowedSubjects takes a Context (for namespace and traceability) and + // Attributes to determine which users and groups are allowed to perform the + // described action in the namespace. This API enables the ResourceBasedReview + // requests below GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error) } ``` ### SubjectAccessReviews -This set of APIs answers the question: can a user or group (use authenticated user if none is specified) perform a given action. Given the Authorizer interface (proposed or existing), this endpoint can be implemented generically against any Authorizer by creating the correct Attributes and making an .Authorize() call. 
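To make the generic implementation concrete, here is a rough, self-contained sketch of how a review endpoint could delegate to any plugged-in Authorizer. All names below (`Attributes`, `Authorizer`, `allowPodReaders`, `subjectAccessReview`) are simplified stand-ins for illustration, not the actual Kubernetes types:

```go
package main

import "fmt"

// Attributes is a trimmed stand-in for authorizer.Attributes.
type Attributes struct {
	User      string
	Verb      string
	Namespace string
	Resource  string
}

// Authorizer mirrors the proposed interface shape.
type Authorizer interface {
	Authorize(a Attributes) (allowed bool, reason string, err error)
}

// allowPodReaders is a toy policy: anyone may get pods; all else is denied.
type allowPodReaders struct{}

func (allowPodReaders) Authorize(a Attributes) (bool, string, error) {
	if a.Verb == "get" && a.Resource == "pods" {
		return true, "matched pod-readers policy", nil
	}
	return false, "denied by default", nil
}

// subjectAccessReview implements the review generically: build Attributes
// from the request body and delegate the decision to the Authorizer.
func subjectAccessReview(authz Authorizer, user, verb, ns, resource string) bool {
	allowed, _, err := authz.Authorize(Attributes{
		User: user, Verb: verb, Namespace: ns, Resource: resource,
	})
	return err == nil && allowed
}

func main() {
	fmt.Println(subjectAccessReview(allowPodReaders{}, "Clark", "get", "default", "pods"))    // true
	fmt.Println(subjectAccessReview(allowPodReaders{}, "Clark", "create", "default", "pods")) // false
}
```

Because the review endpoint never inspects policy itself, any Authorizer implementation can back it unchanged.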
+This set of APIs answers the question: can a user or group (defaulting to the +authenticated user if none is specified) perform a given action? Given the Authorizer +interface (proposed or existing), this endpoint can be implemented generically +against any Authorizer by creating the correct Attributes and making an +.Authorize() call. There are three different flavors: -1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this checks to see if a specified user or group can perform a given action at the cluster scope or across all namespaces. -This is a highly privileged operation. It allows a cluster-admin to inspect rights of any person across the entire cluster and against cluster level resources. -2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` - this checks to see if the current user (including his groups) can perform a given action at any specified scope. -This is an unprivileged operation. It doesn't expose any information that a user couldn't discover simply by trying an endpoint themselves. -3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` - this checks to see if a specified user or group can perform a given action in **this** namespace. -This is a moderately privileged operation. In a multi-tenant environment, have a namespace scoped resource makes it very easy to reason about powers granted to a namespace admin. -This allows a namespace admin (someone able to manage permissions inside of one namespaces, but not all namespaces), the power to inspect whether a given user or group -can manipulate resources in his namespace. - - -SubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a SubjectAccessReview to this URL and he gets a SubjectAccessReviewResponse back. Here is an example of a call and its corresponding return. +1.
`/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this +checks to see if a specified user or group can perform a given action at the +cluster scope or across all namespaces. This is a highly privileged operation. +It allows a cluster-admin to inspect rights of any person across the entire +cluster and against cluster level resources. +2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` - +this checks to see if the current user (including his groups) can perform a +given action at any specified scope. This is an unprivileged operation. It +doesn't expose any information that a user couldn't discover simply by trying an +endpoint themselves. +3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` - +this checks to see if a specified user or group can perform a given action in +**this** namespace. This is a moderately privileged operation. In a multi-tenant +environment, having a namespace scoped resource makes it very easy to reason +about powers granted to a namespace admin. This gives a namespace admin +(someone able to manage permissions inside of one namespace, but not all +namespaces) the power to inspect whether a given user or group can manipulate +resources in his namespace. + +SubjectAccessReview is runtime.Object with +associated RESTStorage that only accepts creates. The caller POSTs a SubjectAccessReview to this URL and he gets +a SubjectAccessReviewResponse back. Here is an example of a call and its +corresponding return: ``` // input @@ -141,10 +199,14 @@ accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessRev "apiVersion": "authorization.kubernetes.io/v1", "allowed": true } - -PersonalSubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL and he gets a SubjectAccessReviewResponse back. Here is an example of a call and its corresponding return.
``` +PersonalSubjectAccessReview is runtime.Object with associated RESTStorage that +only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL +and he gets a SubjectAccessReviewResponse back. Here is an example of a call and +its corresponding return: + +``` // input { "kind": "PersonalSubjectAccessReview", @@ -167,8 +229,12 @@ accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(subjectA "apiVersion": "authorization.kubernetes.io/v1", "allowed": true } +``` -LocalSubjectAccessReview is runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and he gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and its corresponding return. +LocalSubjectAccessReview is runtime.Object with associated RESTStorage that only +accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and he +gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and +its corresponding return: ``` // input @@ -196,15 +262,14 @@ accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjec "namespace": "my-ns" "allowed": true } - - ``` The actual Go objects look like this: ```go type AuthorizationAttributes struct { - // Namespace is the namespace of the action being requested. Currently, there is no distinction between no namespace and all namespaces + // Namespace is the namespace of the action being requested. 
Currently, there + // is no distinction between no namespace and all namespaces Namespace string `json:"namespace" description:"namespace of the action being requested"` // Verb is one of: get, list, watch, create, update, delete Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"` @@ -214,13 +279,15 @@ type AuthorizationAttributes struct { ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"` // Resource is one of the existing resource types Resource string `json:"resource" description:"one of the existing resource types"` - // ResourceName is the name of the resource being requested for a "get" or deleted for a "delete" + // ResourceName is the name of the resource being requested for a "get" or + // deleted for a "delete" ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"` // Subresource is one of the existing subresources types Subresource string `json:"subresource" description:"one of the existing subresources"` } -// SubjectAccessReview is an object for requesting information about whether a user or group can perform an action +// SubjectAccessReview is an object for requesting information about whether a +// user or group can perform an action type SubjectAccessReview struct { kapi.TypeMeta `json:",inline"` @@ -232,7 +299,8 @@ type SubjectAccessReview struct { Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` } -// SubjectAccessReviewResponse describes whether or not a user or group can perform an action +// SubjectAccessReviewResponse describes whether or not a user or group can +// perform an action type SubjectAccessReviewResponse struct { kapi.TypeMeta @@ -242,7 +310,8 @@ type SubjectAccessReviewResponse struct { Reason string } -// PersonalSubjectAccessReview is an object for requesting information about whether a user or group can perform an action +// 
PersonalSubjectAccessReview is an object for requesting information about +// whether a user or group can perform an action type PersonalSubjectAccessReview struct { kapi.TypeMeta `json:",inline"` @@ -250,7 +319,8 @@ type PersonalSubjectAccessReview struct { AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` } -// PersonalSubjectAccessReviewResponse describes whether this user can perform an action +// PersonalSubjectAccessReviewResponse describes whether this user can perform +// an action type PersonalSubjectAccessReviewResponse struct { kapi.TypeMeta @@ -262,7 +332,8 @@ type PersonalSubjectAccessReviewResponse struct { Reason string } -// LocalSubjectAccessReview is an object for requesting information about whether a user or group can perform an action +// LocalSubjectAccessReview is an object for requesting information about +// whether a user or group can perform an action type LocalSubjectAccessReview struct { kapi.TypeMeta `json:",inline"` @@ -274,7 +345,8 @@ type LocalSubjectAccessReview struct { Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` } -// LocalSubjectAccessReviewResponse describes whether or not a user or group can perform an action +// LocalSubjectAccessReviewResponse describes whether or not a user or group can +// perform an action type LocalSubjectAccessReviewResponse struct { kapi.TypeMeta @@ -287,21 +359,33 @@ type LocalSubjectAccessReviewResponse struct { } ``` - ### ResourceAccessReview -This set of APIs nswers the question: which users and groups can perform the specified verb on the specified resourceKind. +This set of APIs answers the question: which users and groups can perform the +specified verb on the specified resourceKind.
Given the Authorizer interface +described above, this endpoint can be implemented generically against any +Authorizer by calling the .GetAllowedSubjects() function. There are two different flavors: -1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReview` - this checks to see which users and groups can perform a given action at the cluster scope or across all namespaces. -This is a highly privileged operation. It allows a cluster-admin to inspect rights of all subjects across the entire cluster and against cluster level resources. -2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` - this checks to see which users and groups can perform a given action in **this** namespace. -This is a moderately privileged operation. In a multi-tenant environment, have a namespace scoped resource makes it very easy to reason about powers granted to a namespace admin. -This allows a namespace admin (someone able to manage permissions inside of one namespaces, but not all namespaces), the power to inspect which users and groups -can manipulate resources in his namespace. - -ResourceAccessReview is a runtime.Object with associated RESTStorage that only accepts creates. The caller POSTs a ResourceAccessReview to this URL and he gets a ResourceAccessReviewResponse back. Here is an example of a call and its corresponding return. +1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReview` - this +checks to see which users and groups can perform a given action at the cluster +scope or across all namespaces. This is a highly privileged operation. It allows +a cluster-admin to inspect rights of all subjects across the entire cluster and +against cluster level resources. +2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` - +this checks to see which users and groups can perform a given action in **this** +namespace. This is a moderately privileged operation. 
In a multi-tenant +environment, having a namespace scoped resource makes it very easy to reason +about powers granted to a namespace admin. This gives a namespace admin +(someone able to manage permissions inside of one namespace, but not all +namespaces) the power to inspect which users and groups can manipulate +resources in his namespace. + +ResourceAccessReview is a runtime.Object with associated RESTStorage that only +accepts creates. The caller POSTs a ResourceAccessReview to this URL and he gets +a ResourceAccessReviewResponse back. Here is an example of a call and its +corresponding return: ``` // input @@ -332,8 +416,8 @@ accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessR The actual Go objects look like this: ```go -// ResourceAccessReview is a means to request a list of which users and groups are authorized to perform the -// action specified by spec +// ResourceAccessReview is a means to request a list of which users and groups +// are authorized to perform the action specified by spec type ResourceAccessReview struct { kapi.TypeMeta `json:",inline"` @@ -351,8 +435,8 @@ type ResourceAccessReviewResponse struct { Groups []string } -// LocalResourceAccessReview is a means to request a list of which users and groups are authorized to perform the -// action specified in a specific namespace +// LocalResourceAccessReview is a means to request a list of which users and +// groups are authorized to perform the action specified in a specific namespace type LocalResourceAccessReview struct { kapi.TypeMeta `json:",inline"` @@ -371,7 +455,6 @@ type LocalResourceAccessReviewResponse struct { // Groups is the list of groups who can perform the action Groups []string } - ``` diff --git a/event_compression.md b/event_compression.md index b94d6560..c4dfc154 100644 --- a/event_compression.md +++ b/event_compression.md @@ -42,40 +42,62 @@ Kubernetes components can get into a state where they generate tons of events.
The events can be categorized in one of two ways: -1. same - the event is identical to previous events except it varies only on timestamp -2. similar - the event is identical to previous events except it varies on timestamp and message +1. same - The event is identical to previous events except it varies only on +timestamp. +2. similar - The event is identical to previous events except it varies on +timestamp and message. -For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)). +For example, when pulling a non-existent image, Kubelet will repeatedly generate +`image_not_existing` and `container_is_waiting` events until upstream components +correct the image. When this happens, the spam from the repeated events makes +the entire event mechanism useless. It also appears to cause memory pressure in +etcd (see [#3853](http://issue.k8s.io/3853)). -The goal is introduce event counting to increment same events, and event aggregation to collapse similar events. +The goal is to introduce event counting to increment same events, and event +aggregation to collapse similar events. ## Proposal -Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event. In addition, if many similar events are -created, events should be aggregated into a single event to reduce spam. +Each binary that generates events (for example, `kubelet`) should keep track of +previously generated events so that it can collapse recurring events into a +single event instead of creating a new instance for each new event.
In addition, +if many similar events are created, events should be aggregated into a single +event to reduce spam. -Event compression should be best effort (not guaranteed). Meaning, in the worst case, `n` identical (minus timestamp) events may still result in `n` event entries. +Event compression should be best effort (not guaranteed). That is, in the worst +case, `n` identical (minus timestamp) events may still result in `n` event +entries. ## Design -Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following fields: +Instead of a single Timestamp, each event object +[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following +fields: * `FirstTimestamp unversioned.Time` * The date/time of the first occurrence of the event. * `LastTimestamp unversioned.Time` * The date/time of the most recent occurrence of the event. * On first occurrence, this is equal to the FirstTimestamp. * `Count int` - * The number of occurrences of this event between FirstTimestamp and LastTimestamp + * The number of occurrences of this event between FirstTimestamp and +LastTimestamp. * On first occurrence, this is 1. Each binary that generates events: * Maintains a historical record of previously generated events: - * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go). - * Implemented behind an `EventCorrelator` that manages two subcomponents: `EventAggregator` and `EventLogger` - * The `EventCorrelator` observes all incoming events and lets each subcomponent visit and modify the event in turn. - * The `EventAggregator` runs an aggregation function over each event. This function buckets each event based on an `aggregateKey`, - and identifies the event uniquely with a `localKey` in that bucket.
- * The default aggregation function groups similar events that differ only by `event.Message`. It's `localKey` is `event.Message` and its aggregate key is produced by joining: + * Implemented with +["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) +in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go). + * Implemented behind an `EventCorrelator` that manages two subcomponents: +`EventAggregator` and `EventLogger`. + * The `EventCorrelator` observes all incoming events and lets each +subcomponent visit and modify the event in turn. + * The `EventAggregator` runs an aggregation function over each event. This +function buckets each event based on an `aggregateKey` and identifies the event +uniquely with a `localKey` in that bucket. + * The default aggregation function groups similar events that differ only by +`event.Message`. Its `localKey` is `event.Message` and its aggregate key is +produced by joining: * `event.Source.Component` * `event.Source.Host` * `event.InvolvedObject.Kind` @@ -84,12 +106,17 @@ Each binary that generates events: * `event.InvolvedObject.UID` * `event.InvolvedObject.APIVersion` * `event.Reason` - * If the `EventAggregator` observes a similar event produced 10 times in a 10 minute window, it drops the event that was provided as - input and creates a new event that differs only on the message. The message denotes that this event is used to group similar events - that matched on reason. This aggregated `Event` is then used in the event processing sequence. - * The `EventLogger` observes the event out of `EventAggregation` and tracks the number of times it has observed that event previously - by incrementing a key in a cache associated with that matching event. 
- * The key in the cache is generated from the event object minus timestamps/count/transient fields, specifically the following events fields are used to construct a unique key for an event: + * If the `EventAggregator` observes a similar event produced 10 times in a 10 +minute window, it drops the event that was provided as input and creates a new +event that differs only on the message. The message denotes that this event is +used to group similar events that matched on reason. This aggregated `Event` is +then used in the event processing sequence. + * The `EventLogger` observes the event out of `EventAggregation` and tracks +the number of times it has observed that event previously by incrementing a key +in a cache associated with that matching event. + * The key in the cache is generated from the event object minus +timestamps/count/transient fields, specifically the following events fields are +used to construct a unique key for an event: * `event.Source.Component` * `event.Source.Host` * `event.InvolvedObject.Kind` @@ -99,24 +126,47 @@ Each binary that generates events: * `event.InvolvedObject.APIVersion` * `event.Reason` * `event.Message` - * The LRU cache is capped at 4096 events for both `EventAggregator` and `EventLogger`. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). 
- * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd: - * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count. - * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update). - * If the key for the new event does not match the key for any previously generated event (meaning none of the above fields match between the new event and any previously generated events), then the event is considered to be new/unique and a new event entry is created in etcd: - * The usual POST/create event API is called to create a new event entry in etcd. - * An entry for the event is also added to the previously generated events cache. + * The LRU cache is capped at 4096 events for both `EventAggregator` and +`EventLogger`. That means if a component (e.g. kubelet) runs for a long period +of time and generates tons of unique events, the previously generated events +cache will not grow unchecked in memory. Instead, after 4096 unique events are +generated, the oldest events are evicted from the cache. + * When an event is generated, the previously generated events cache is checked +(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). 
+ * If the key for the new event matches the key for a previously generated +event (meaning all of the above fields match between the new event and some +previously generated event), then the event is considered to be a duplicate and +the existing event entry is updated in etcd: + * The new PUT (update) event API is called to update the existing event +entry in etcd with the new last seen timestamp and count. + * The event is also updated in the previously generated events cache with +an incremented count, updated last seen timestamp, name, and new resource +version (all required to issue a future event update). + * If the key for the new event does not match the key for any previously +generated event (meaning none of the above fields match between the new event +and any previously generated events), then the event is considered to be +new/unique and a new event entry is created in etcd: + * The usual POST/create event API is called to create a new event entry in +etcd. + * An entry for the event is also added to the previously generated events +cache. ## Issues/Risks - * Compression is not guaranteed, because each component keeps track of event history in memory - * An application restart causes event history to be cleared, meaning event history is not preserved across application restarts and compression will not occur across component restarts. - * Because an LRU cache is used to keep track of previously generated events, if too many unique events are generated, old events will be evicted from the cache, so events will only be compressed until they age out of the events cache, at which point any new instance of the event will cause a new entry to be created in etcd. + * Compression is not guaranteed, because each component keeps track of event + history in memory + * An application restart causes event history to be cleared, meaning event +history is not preserved across application restarts and compression will not +occur across component restarts. 
+ * Because an LRU cache is used to keep track of previously generated events, +if too many unique events are generated, old events will be evicted from the +cache, so events will only be compressed until they age out of the events cache, +at which point any new instance of the event will cause a new entry to be +created in etcd. ## Example -Sample kubectl output +Sample kubectl output: ```console FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE @@ -133,15 +183,19 @@ Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal ``` -This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries. +This demonstrates what would have been 20 separate entries (indicating +scheduling failure) collapsed/compressed down to 5 entries. ## Related Pull Requests/Issues - * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events - * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API - * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing multiple recurring events in to a single event - * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single event to optimize etcd storage - * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache instead of map + * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events. + * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API. + * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow +compressing multiple recurring events in to a single event. + * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a +single event to optimize etcd storage. 
+ * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache +instead of map. diff --git a/expansion.md b/expansion.md index 9012b2c5..cf44baed 100644 --- a/expansion.md +++ b/expansion.md @@ -36,13 +36,15 @@ Documentation for other releases can be found at ## Abstract -A proposal for the expansion of environment variables using a simple `$(var)` syntax. +A proposal for the expansion of environment variables using a simple `$(var)` +syntax. ## Motivation -It is extremely common for users to need to compose environment variables or pass arguments to -their commands using the values of environment variables. Kubernetes should provide a facility for -the 80% cases in order to decrease coupling and the use of workarounds. +It is extremely common for users to need to compose environment variables or +pass arguments to their commands using the values of environment variables. +Kubernetes should provide a facility for the 80% cases in order to decrease +coupling and the use of workarounds. ## Goals @@ -53,150 +55,170 @@ the 80% cases in order to decrease coupling and the use of workarounds. ## Constraints and Assumptions -* This design should describe the simplest possible syntax to accomplish the use-cases -* Expansion syntax will not support more complicated shell-like behaviors such as default values - (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc. +* This design should describe the simplest possible syntax to accomplish the +use-cases. +* Expansion syntax will not support more complicated shell-like behaviors such +as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc. ## Use Cases -1. As a user, I want to compose new environment variables for a container using a substitution - syntax to reference other variables in the container's environment and service environment - variables -1. As a user, I want to substitute environment variables into a container's command -1. 
As a user, I want to do the above without requiring the container's image to have a shell -1. As a user, I want to be able to specify a default value for a service variable which may - not exist -1. As a user, I want to see an event associated with the pod if an expansion fails (ie, references - variable names that cannot be expanded) +1. As a user, I want to compose new environment variables for a container using +a substitution syntax to reference other variables in the container's +environment and service environment variables. +1. As a user, I want to substitute environment variables into a container's +command. +1. As a user, I want to do the above without requiring the container's image to +have a shell. +1. As a user, I want to be able to specify a default value for a service +variable which may not exist. +1. As a user, I want to see an event associated with the pod if an expansion +fails (ie, references variable names that cannot be expanded). ### Use Case: Composition of environment variables -Currently, containers are injected with docker-style environment variables for the services in -their pod's namespace. There are several variables for each service, but users routinely need -to compose URLs based on these variables because there is not a variable for the exact format -they need. Users should be able to build new environment variables with the exact format they need. -Eventually, it should also be possible to turn off the automatic injection of the docker-style -variables into pods and let the users consume the exact information they need via the downward API -and composition. +Currently, containers are injected with docker-style environment variables for +the services in their pod's namespace. There are several variables for each +service, but users routinely need to compose URLs based on these variables +because there is not a variable for the exact format they need. 
Users should be +able to build new environment variables with the exact format they need. +Eventually, it should also be possible to turn off the automatic injection of +the docker-style variables into pods and let the users consume the exact +information they need via the downward API and composition. #### Expanding expanded variables -It should be possible to reference an variable which is itself the result of an expansion, if the -referenced variable is declared in the container's environment prior to the one referencing it. -Put another way -- a container's environment is expanded in order, and expanded variables are -available to subsequent expansions. +It should be possible to reference a variable which is itself the result of an +expansion, if the referenced variable is declared in the container's environment +prior to the one referencing it. Put another way -- a container's environment is +expanded in order, and expanded variables are available to subsequent +expansions. ### Use Case: Variable expansion in command -Users frequently need to pass the values of environment variables to a container's command. -Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a -shell in the container's command and have the shell perform the substitution, or to write a wrapper -script that sets up the environment and runs the command. This has a number of drawbacks: +Users frequently need to pass the values of environment variables to a +container's command. Currently, Kubernetes does not perform any expansion of +variables. The workaround is to invoke a shell in the container's command and +have the shell perform the substitution, or to write a wrapper script that sets +up the environment and runs the command. This has a number of drawbacks: -1. Solutions that require a shell are unfriendly to images that do not contain a shell -2. Wrapper scripts make it harder to use images as base images -3.
Wrapper scripts increase coupling to Kubernetes +1. Solutions that require a shell are unfriendly to images that do not contain +a shell. +2. Wrapper scripts make it harder to use images as base images. +3. Wrapper scripts increase coupling to Kubernetes. -Users should be able to do the 80% case of variable expansion in command without writing a wrapper -script or adding a shell invocation to their containers' commands. +Users should be able to do the 80% case of variable expansion in command without +writing a wrapper script or adding a shell invocation to their containers' +commands. ### Use Case: Images without shells -The current workaround for variable expansion in a container's command requires the container's -image to have a shell. This is unfriendly to images that do not contain a shell (`scratch` images, -for example). Users should be able to perform the other use-cases in this design without regard to -the content of their images. +The current workaround for variable expansion in a container's command requires +the container's image to have a shell. This is unfriendly to images that do not +contain a shell (`scratch` images, for example). Users should be able to perform +the other use-cases in this design without regard to the content of their +images. ### Use Case: See an event for incomplete expansions -It is possible that a container with incorrect variable values or command line may continue to run -for a long period of time, and that the end-user would have no visual or obvious warning of the -incorrect configuration. If the kubelet creates an event when an expansion references a variable -that cannot be expanded, it will help users quickly detect problems with expansions. +It is possible that a container with incorrect variable values or command line +may continue to run for a long period of time, and that the end-user would have +no visual or obvious warning of the incorrect configuration. 
If the kubelet +creates an event when an expansion references a variable that cannot be +expanded, it will help users quickly detect problems with expansions. ## Design Considerations ### What features should be supported? -In order to limit complexity, we want to provide the right amount of functionality so that the 80% -cases can be realized and nothing more. We felt that the essentials boiled down to: +In order to limit complexity, we want to provide the right amount of +functionality so that the 80% cases can be realized and nothing more. We felt +that the essentials boiled down to: -1. Ability to perform direct expansion of variables in a string -2. Ability to specify default values via a prioritized mapping function but without support for - defaults as a syntax-level feature +1. Ability to perform direct expansion of variables in a string. +2. Ability to specify default values via a prioritized mapping function but +without support for defaults as a syntax-level feature. ### What should the syntax be? -The exact syntax for variable expansion has a large impact on how users perceive and relate to the -feature. We considered implementing a very restrictive subset of the shell `${var}` syntax. This -syntax is an attractive option on some level, because many people are familiar with it. However, -this syntax also has a large number of lesser known features such as the ability to provide -default values for unset variables, perform inline substitution, etc. +The exact syntax for variable expansion has a large impact on how users perceive +and relate to the feature. We considered implementing a very restrictive subset +of the shell `${var}` syntax. This syntax is an attractive option on some level, +because many people are familiar with it. However, this syntax also has a large +number of lesser known features such as the ability to provide default values +for unset variables, perform inline substitution, etc. 
-In the interest of preventing conflation of the expansion feature in Kubernetes with the shell -feature, we chose a different syntax similar to the one in Makefiles, `$(var)`. We also chose not -to support the bar `$var` format, since it is not required to implement the required use-cases. +In the interest of preventing conflation of the expansion feature in Kubernetes +with the shell feature, we chose a different syntax similar to the one in +Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since +it is not required to implement the required use-cases. -Nested references, ie, variable expansion within variable names, are not supported. +Nested references, ie, variable expansion within variable names, are not +supported. #### How should unmatched references be treated? -Ideally, it should be extremely clear when a variable reference couldn't be expanded. We decided -the best experience for unmatched variable references would be to have the entire reference, syntax -included, show up in the output. As an example, if the reference `$(VARIABLE_NAME)` cannot be -expanded, then `$(VARIABLE_NAME)` should be present in the output. +Ideally, it should be extremely clear when a variable reference couldn't be +expanded. We decided the best experience for unmatched variable references would +be to have the entire reference, syntax included, show up in the output. As an +example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then +`$(VARIABLE_NAME)` should be present in the output. #### Escaping the operator -Although the `$(var)` syntax does overlap with the `$(command)` form of command substitution -supported by many shells, because unexpanded variables are present verbatim in the output, we -expect this will not present a problem to many users.
If there is a collision between a variable -name and command substitution syntax, the syntax can be escaped with the form `$$(VARIABLE_NAME)`, -which will evaluate to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not. +Although the `$(var)` syntax does overlap with the `$(command)` form of command +substitution supported by many shells, because unexpanded variables are present +verbatim in the output, we expect this will not present a problem to many users. +If there is a collision between a variable name and command substitution syntax, +the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate +to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not. ## Design -This design encompasses the variable expansion syntax and specification and the changes needed to -incorporate the expansion feature into the container's environment and command. +This design encompasses the variable expansion syntax and specification and the +changes needed to incorporate the expansion feature into the container's +environment and command. ### Syntax and expansion mechanics -This section describes the expansion syntax, evaluation of variable values, and how unexpected or -malformed inputs are handled. +This section describes the expansion syntax, evaluation of variable values, and +how unexpected or malformed inputs are handled. #### Syntax The inputs to the expansion feature are: -1. A utf-8 string (the input string) which may contain variable references -2. A function (the mapping function) that maps the name of a variable to the variable's value, of - type `func(string) string` +1. A utf-8 string (the input string) which may contain variable references. +2. A function (the mapping function) that maps the name of a variable to the +variable's value, of type `func(string) string`. Variable references in the input string are indicated exclusively with the syntax -`$()`. The syntax tokens are: +`$()`. 
The syntax tokens are: -- `$`: the operator -- `(`: the reference opener -- `)`: the reference closer +- `$`: the operator, +- `(`: the reference opener, and +- `)`: the reference closer. -The operator has no meaning unless accompanied by the reference opener and closer tokens. The -operator can be escaped using `$$`. One literal `$` will be emitted for each `$$` in the input. +The operator has no meaning unless accompanied by the reference opener and +closer tokens. The operator can be escaped using `$$`. One literal `$` will be +emitted for each `$$` in the input. -The reference opener and closer characters have no meaning when not part of a variable reference. -If a variable reference is malformed, viz: `$(VARIABLE_NAME` without a closing expression, the -operator and expression opening characters are treated as ordinary characters without special -meanings. +The reference opener and closer characters have no meaning when not part of a +variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME` +without a closing expression, the operator and expression opening characters are +treated as ordinary characters without special meanings. #### Scope and ordering of substitutions -The scope in which variable references are expanded is defined by the mapping function. Within the -mapping function, any arbitrary strategy may be used to determine the value of a variable name. -The most basic implementation of a mapping function is to use a `map[string]string` to lookup the -value of a variable. +The scope in which variable references are expanded is defined by the mapping +function. Within the mapping function, any arbitrary strategy may be used to +determine the value of a variable name. The most basic implementation of a +mapping function is to use a `map[string]string` to look up the value of a +variable.
-In order to support default values for variables like service variables presented by the kubelet, -which may not be bound because the service that provides them does not yet exist, there should be a -mapping function that uses a list of `map[string]string` like: +In order to support default values for variables like service variables +presented by the kubelet, which may not be bound because the service that +provides them does not yet exist, there should be a mapping function that uses a +list of `map[string]string` like: ```go func MakeMappingFunc(maps ...map[string]string) func(string) string { @@ -235,38 +257,41 @@ mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv) The necessary changes to implement this functionality are: -1. Add a new interface, `ObjectEventRecorder`, which is like the `EventRecorder` interface, but - scoped to a single object, and a function that returns an `ObjectEventRecorder` given an - `ObjectReference` and an `EventRecorder` +1. Add a new interface, `ObjectEventRecorder`, which is like the +`EventRecorder` interface, but scoped to a single object, and a function that +returns an `ObjectEventRecorder` given an `ObjectReference` and an +`EventRecorder`. 2. Introduce `third_party/golang/expansion` package that provides: - 1. An `Expand(string, func(string) string) string` function - 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` function -3. Make the kubelet expand environment correctly -4. Make the kubelet expand command correctly + 1. An `Expand(string, func(string) string) string` function. + 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) string` +function. +3. Make the kubelet expand environment correctly. +4. Make the kubelet expand command correctly. #### Event Recording -In order to provide an event when an expansion references undefined variables, the mapping function -must be able to create an event. 
In order to facilitate this, we should create a new interface in -the `api/client/record` package which is similar to `EventRecorder`, but scoped to a single object: +In order to provide an event when an expansion references undefined variables, +the mapping function must be able to create an event. In order to facilitate +this, we should create a new interface in the `api/client/record` package which +is similar to `EventRecorder`, but scoped to a single object: ```go // ObjectEventRecorder knows how to record events about a single object. type ObjectEventRecorder interface { - // Event constructs an event from the given information and puts it in the queue for sending. - // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will - // be used to automate handling of events, so imagine people writing switch statements to - // handle them. You want to make that easy. - // 'message' is intended to be human readable. - // - // The resulting event will be created in the same namespace as the reference object. - Event(reason, message string) - - // Eventf is just like Event, but with Sprintf for the message field. - Eventf(reason, messageFmt string, args ...interface{}) - - // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. - PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) + // Event constructs an event from the given information and puts it in the queue for sending. + // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will + // be used to automate handling of events, so imagine people writing switch statements to + // handle them. You want to make that easy. + // 'message' is intended to be human readable. + // + // The resulting event will be created in the same namespace as the reference object. + Event(reason, message string) + + // Eventf is just like Event, but with Sprintf for the message field. 
+ Eventf(reason, messageFmt string, args ...interface{}) + + // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. + PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) } ``` @@ -275,16 +300,16 @@ and an `EventRecorder`: ```go type objectRecorderImpl struct { - object runtime.Object - recorder EventRecorder + object runtime.Object + recorder EventRecorder } func (r *objectRecorderImpl) Event(reason, message string) { - r.recorder.Event(r.object, reason, message) + r.recorder.Event(r.object, reason, message) } func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder { - return &objectRecorderImpl{object, recorder} + return &objectRecorderImpl{object, recorder} } ``` @@ -299,28 +324,29 @@ The expansion package should provide two methods: // for the input is found. If no expansion is found for a key, an event // is raised on the given recorder. func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string { - // ... + // ... } // Expand replaces variable references in the input string according to // the expansion spec using the given mapping function to resolve the // values of variables. func Expand(input string, mapping func(string) string) string { - // ... + // ... } ``` #### Kubelet changes -The Kubelet should be made to correctly expand variables references in a container's environment, -command, and args. Changes will need to be made to: +The Kubelet should be made to correctly expand variable references in a +container's environment, command, and args. Changes will need to be made to: 1. The `makeEnvironmentVariables` function in the kubelet; this is used by - `GenerateRunContainerOptions`, which is used by both the docker and rkt container runtimes -2. The docker manager `setEntrypointAndCommand` func has to be changed to perform variable - expansion -3.
The rkt runtime should be made to support expansion in command and args when support for it is - implemented +`GenerateRunContainerOptions`, which is used by both the docker and rkt +container runtimes. +2. The docker manager `setEntrypointAndCommand` func has to be changed to +perform variable expansion. +3. The rkt runtime should be made to support expansion in command and args +when support for it is implemented. ### Examples diff --git a/extending-api.md b/extending-api.md index ee53a7d6..aa1821c8 100644 --- a/extending-api.md +++ b/extending-api.md @@ -34,59 +34,62 @@ Documentation for other releases can be found at # Adding custom resources to the Kubernetes API server -This document describes the design for implementing the storage of custom API types in the Kubernetes API Server. +This document describes the design for implementing the storage of custom API +types in the Kubernetes API Server. ## Resource Model ### The ThirdPartyResource -The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add -to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource; attempting to place it in a namespace -will return an error. +The `ThirdPartyResource` resource describes the multiple versions of a custom +resource that the user wants to add to the Kubernetes API. `ThirdPartyResource` +is a non-namespaced resource; attempting to place it in a namespace will return +an error. Each `ThirdPartyResource` resource has the following: * Standard Kubernetes object metadata. - * ResourceKind - The kind of the resources described by this third party resource. + * ResourceKind - The kind of the resources described by this third party +resource. * Description - A free text description of the resource. * APIGroup - An API group that this resource should be placed into. * Versions - One or more `Version` objects. 
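For illustration, a `ThirdPartyResource` manifest carrying the fields listed above might look like the following sketch. The exact `apiVersion` and the example names are assumptions, not prescribed by this design:

```yaml
# Illustrative only: registers a custom "CronTab" kind in the
# stable.example.com API group, with a single version.
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: cron-tab.stable.example.com
description: "A specification of a Pod to run on a cron style schedule"
versions:
  - name: v1
```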
### The `Version` Object -The `Version` object describes a single concrete version of a custom resource. The `Version` object currently -only specifies: +The `Version` object describes a single concrete version of a custom resource. +The `Version` object currently only specifies: * The `Name` of the version. * The `APIGroup` this version should belong to. ## Expectations about third party objects -Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes -compatible [object metadata](../devel/api-conventions.md#metadata). This requirement enables the -Kubernetes API server to provide the following features: - * Filtering lists of objects via label queries - * `resourceVersion`-based optimistic concurrency via compare-and-swap - * Versioned storage - * Event recording - * Integration with basic `kubectl` command line tooling - * Watch for resource changes - -The `Kind` for an instance of a third-party object (e.g. CronTab) below is expected to be -programmatically convertible to the name of the resource using -the following conversion. Kinds are expected to be of the form ``, and the -`APIVersion` for the object is expected to be `/`. To -prevent collisions, it's expected that you'll use a fully qualified domain -name for the API group, e.g. `example.com`. +Every object that is added to a third-party Kubernetes object store is expected +to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata). +This requirement enables the Kubernetes API server to provide the following +features: + * Filtering lists of objects via label queries. + * `resourceVersion`-based optimistic concurrency via compare-and-swap. + * Versioned storage. + * Event recording. + * Integration with basic `kubectl` command line tooling. + * Watch for resource changes. + +The `Kind` for an instance of a third-party object (e.g. 
CronTab) below is +expected to be programmatically convertible to the name of the resource using +the following conversion. Kinds are expected to be of the form +``, and the `APIVersion` for the object is expected to be +`/`. To prevent collisions, it's expected that you'll +use a fully qualified domain name for the API group, e.g. `example.com`. For example `stable.example.com/v1` 'CamelCaseKind' is the specific type name. -To convert this into the `metadata.name` for the `ThirdPartyResource` resource instance, -the `` is copied verbatim, the `CamelCaseKind` is -then converted -using '-' instead of capitalization ('camel-case'), with the first character being assumed to be -capitalized. In pseudo code: +To convert this into the `metadata.name` for the `ThirdPartyResource` resource +instance, the `` is copied verbatim, the `CamelCaseKind` is then +converted using '-' instead of capitalization ('camel-case'), with the first +character being assumed to be capitalized. In pseudo code: ```go var result string @@ -98,17 +101,20 @@ for ix := range kindName { } ``` -As a concrete example, the resource named `camel-case-kind.example.com` defines resources of Kind `CamelCaseKind`, in -the APIGroup with the prefix `example.com/...`. +As a concrete example, the resource named `camel-case-kind.example.com` defines +resources of Kind `CamelCaseKind`, in the APIGroup with the prefix +`example.com/...`. -The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information. -This is also the reason why `ThirdPartyResource` is not namespaced. +The reason for this is to enable rapid lookup of a `ThirdPartyResource` object +given the kind information. This is also the reason why `ThirdPartyResource` is +not namespaced. ## Usage -When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced -RESTful resource path. For now, non-namespaced objects are not supported. 
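A runnable version of the conversion sketched in the pseudo code above. The helper name `kindToResourceName` is illustrative; the real implementation lives inside the API server.

```go
package main

import (
	"fmt"
	"unicode"
)

// kindToResourceName converts a CamelCase kind into its dashed name
// component, e.g. "CronTab" -> "cron-tab", per the convention above.
func kindToResourceName(kind string) string {
	var result []rune
	for i, r := range kind {
		if unicode.IsUpper(r) {
			if i != 0 {
				result = append(result, '-') // '-' replaces capitalization
			}
			result = append(result, unicode.ToLower(r))
		} else {
			result = append(result, r)
		}
	}
	return string(result)
}

func main() {
	// The metadata.name appends the fully qualified API group domain.
	fmt.Println(kindToResourceName("CamelCaseKind") + ".example.com")
}
```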
As with existing built-in objects, -deleting a namespace deletes all third party resources in that namespace. +When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts +by creating a new, namespaced RESTful resource path. For now, non-namespaced +objects are not supported. As with existing built-in objects, deleting a +namespace deletes all third party resources in that namespace. For example, if a user creates: @@ -143,14 +149,15 @@ Now that this schema has been created, a user can `POST`: to: `/apis/stable.example.com/v1/namespaces/default/crontabs` -and the corresponding data will be stored into etcd by the APIServer, so that when the user issues: +and the corresponding data will be stored into etcd by the APIServer, so that +when the user issues: ``` GET /apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object` ``` -And when they do that, they will get back the same data, but with additional Kubernetes metadata -(e.g. `resourceVersion`, `createdTimestamp`) filled in. +And when they do that, they will get back the same data, but with additional +Kubernetes metadata (e.g. `resourceVersion`, `createdTimestamp`) filled in. Likewise, to list all resources, a user can issue: @@ -178,29 +185,35 @@ and get back: } ``` -Because all objects are expected to contain standard Kubernetes metadata fields, these -list operations can also use label queries to filter requests down to specific subsets. - -Likewise, clients can use watch endpoints to watch for changes to stored objects. +Because all objects are expected to contain standard Kubernetes metadata fields, +these list operations can also use label queries to filter requests down to +specific subsets. +Likewise, clients can use watch endpoints to watch for changes to stored +objects. ## Storage -In order to store custom user data in a versioned fashion inside of etcd, we need to also introduce a -`Codec`-compatible object for persistent storage in etcd. 
This object is `ThirdPartyResourceData` and it contains: - * Standard API Metadata +In order to store custom user data in a versioned fashion inside of etcd, we +need to also introduce a `Codec`-compatible object for persistent storage in +etcd. This object is `ThirdPartyResourceData` and it contains: + * Standard API Metadata. * `Data`: The raw JSON data for this custom object. ### Storage key specification -Each custom object stored by the API server needs a custom key in storage, this is described below: +Each custom object stored by the API server needs a custom key in storage; this +is described below: #### Definitions - * `resource-namespace`: the namespace of the particular resource that is being stored + * `resource-namespace`: the namespace of the particular resource that is +being stored * `resource-name`: the name of the particular resource being stored - * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored - * `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored + * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` +resource that represents the type for the specific instance being stored + * `third-party-resource-name`: the name of the `ThirdPartyResource` resource +that represents the type for the specific instance being stored #### Key diff --git a/federated-services.md b/federated-services.md index 6febfb21..7e9933e3 100644 --- a/federated-services.md +++ b/federated-services.md @@ -76,7 +76,7 @@ Documentation for other releases can be found at load balancers between the client and the serving Pod, failover might be completely automatic (i.e. the client's end of the connection remains intact, and the client is completely - oblivious of the fail-over).
This approach incurs network speed and cost penalties (by traversing possibly multiple load balancers), but requires zero smarts in clients, DNS libraries, recursing DNS servers etc, as the IP address of the endpoint @@ -102,17 +102,17 @@ Documentation for other releases can be found at A Kubernetes application configuration (e.g. for a Pod, Replication Controller, Service etc) should be able to be successfully deployed into any Kubernetes Cluster or Ubernetes Federation of Clusters, -without modification. More specifically, a typical configuration +without modification. More specifically, a typical configuration should work correctly (although possibly not optimally) across any of the following environments: 1. A single Kubernetes Cluster on one cloud provider (e.g. Google - Compute Engine, GCE) + Compute Engine, GCE). 1. A single Kubernetes Cluster on a different cloud provider - (e.g. Amazon Web Services, AWS) + (e.g. Amazon Web Services, AWS). 1. A single Kubernetes Cluster on a non-cloud, on-premise data center 1. A Federation of Kubernetes Clusters all on the same cloud provider - (e.g. GCE) + (e.g. GCE). 1. A Federation of Kubernetes Clusters across multiple different cloud providers and/or on-premise data centers (e.g. one cluster on GCE/GKE, one on AWS, and one on-premise). @@ -122,18 +122,18 @@ the following environments: It should be possible to explicitly opt out of portability across some subset of the above environments in order to take advantage of non-portable load balancing and DNS features of one or more -environments. More specifically, for example: +environments. More specifically, for example: 1. For HTTP(S) applications running on GCE-only Federations, [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - should be usable. These provide single, static global IP addresses + should be usable. 
These provide single, static global IP addresses which load balance and fail over globally (i.e. across both regions - and zones). These allow for really dumb clients, but they only + and zones). These allow for really dumb clients, but they only work on GCE, and only for HTTP(S) traffic. 1. For non-HTTP(S) applications running on GCE-only Federations within a single region, [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - should be usable. These provide TCP (i.e. both HTTP/S and + should be usable. These provide TCP (i.e. both HTTP/S and non-HTTP/S) load balancing and failover, but only on GCE, and only within a single region. [Google Cloud DNS](https://cloud.google.com/dns) can be used to @@ -141,7 +141,7 @@ environments. More specifically, for example: providers and on-premise clusters, as it's plain DNS, IP only). 1. For applications running on AWS-only Federations, [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - should be usable. These provide both L7 (HTTP(S)) and L4 load + should be usable. These provide both L7 (HTTP(S)) and L4 load balancing, but only within a single region, and only on AWS ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be used to load balance and fail over across multiple regions, and is @@ -153,7 +153,7 @@ Ubernetes cross-cluster load balancing is built on top of the following: 1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) provide single, static global IP addresses which load balance and - fail over globally (i.e. across both regions and zones). These + fail over globally (i.e. across both regions and zones). These allow for really dumb clients, but they only work on GCE, and only for HTTP(S) traffic. 1. 
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) @@ -170,7 +170,7 @@ Ubernetes cross-cluster load balancing is built on top of the following: doesn't provide any built-in geo-DNS, latency-based routing, health checking, weighted round robin or other advanced capabilities. It's plain old DNS. We would need to build all the aforementioned - on top of it. It can provide internal DNS services (i.e. serve RFC + on top of it. It can provide internal DNS services (i.e. serve RFC 1918 addresses). 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be used to load balance and fail over across regions, and is also @@ -185,23 +185,24 @@ Ubernetes cross-cluster load balancing is built on top of the following: service IP which is load-balanced (currently simple round-robin) across the healthy pods comprising a service within a single Kubernetes cluster. -1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy. +1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): +A generic wrapper around cloud-provided L4 and L7 load balancing services, and +roll-your-own load balancers run in pods, e.g. HA Proxy. ## Ubernetes API -The Ubernetes API for load balancing should be compatible with the -equivalent Kubernetes API, to ease porting of clients between -Ubernetes and Kubernetes. Further details below. +The Ubernetes API for load balancing should be compatible with the equivalent +Kubernetes API, to ease porting of clients between Ubernetes and Kubernetes. +Further details below. ## Common Client Behavior -To be useful, our load balancing solution needs to work properly with -real client applications. There are a few different classes of -those... +To be useful, our load balancing solution needs to work properly with real +client applications. 
There are a few different classes of those... ### Browsers -These are the most common external clients. These are all well-written. See below. +These are the most common external clients. These are all well-written. See below. ### Well-written clients @@ -218,8 +219,8 @@ Examples: ### Dumb clients -1. Don't do a DNS resolution every time they connect (or do cache - beyond the TTL). +1. Don't do a DNS resolution every time they connect (or do cache beyond the +TTL). 1. Do try multiple A records Examples: @@ -237,34 +238,34 @@ Examples: ### Dumbest clients -1. Never do a DNS lookup - are pre-configured with a single (or - possibly multiple) fixed server IP(s). Nothing else matters. +1. Never do a DNS lookup - are pre-configured with a single (or possibly +multiple) fixed server IP(s). Nothing else matters. ## Architecture and Implementation -### General control plane architecture +### General Control Plane Architecture -Each cluster hosts one or more Ubernetes master components (Ubernetes API servers, controller managers with leader election, and -etcd quorum members. This is documented in more detail in a -[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). +Each cluster hosts one or more Ubernetes master components (Ubernetes API +servers, controller managers with leader election, and etcd quorum members). This +is documented in more detail in a separate design doc: +[Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). -In the description below, assume that 'n' clusters, named -'cluster-1'... 'cluster-n' have been registered against an Ubernetes -Federation "federation-1", each with their own set of Kubernetes API -endpoints,so, +In the description below, assume that 'n' clusters, named 'cluster-1'...
+'cluster-n' have been registered against an Ubernetes Federation "federation-1", +each with their own set of Kubernetes API endpoints, so, "[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), [http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) ... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n) . ### Federated Services -Ubernetes Services are pretty straight-forward. They're comprised of -multiple equivalent underlying Kubernetes Services, each with their -own external endpoint, and a load balancing mechanism across them. -Let's work through how exactly that works in practice. +Ubernetes Services are pretty straight-forward. They're comprised of multiple +equivalent underlying Kubernetes Services, each with their own external +endpoint, and a load balancing mechanism across them. Let's work through how +exactly that works in practice. -Our user creates the following Ubernetes Service (against an Ubernetes -API endpoint): +Our user creates the following Ubernetes Service (against an Ubernetes API +endpoint): $ kubectl create -f my-service.yaml --context="federation-1" @@ -290,9 +291,9 @@ where service.yaml contains the following: run: my-service type: LoadBalancer -Ubernetes in turn creates one equivalent service (identical config to -the above) in each of the underlying Kubernetes clusters, each of -which results in something like this: +Ubernetes in turn creates one equivalent service (identical config to the above) +in each of the underlying Kubernetes clusters, each of which results in +something like this: $ kubectl get -o yaml --context="cluster-1" service my-service @@ -329,9 +330,8 @@ which results in something like this: ingress: - ip: 104.197.117.10 -Similar services are created in `cluster-2` and `cluster-3`, each of -which are allocated their own `spec.clusterIP`, and -`status.loadBalancer.ingress.ip`.
+Similar services are created in `cluster-2` and `cluster-3`, each of which are +allocated their own `spec.clusterIP`, and `status.loadBalancer.ingress.ip`. In Ubernetes `federation-1`, the resulting federated service looks as follows: @@ -376,21 +376,21 @@ Note that the federated service: 1. has no clusterIP (as it is cluster-independent) 1. has a federation-wide load balancer hostname -In addition to the set of underlying Kubernetes services (one per -cluster) described above, Ubernetes has also created a DNS name -(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or -[AWS Route 53](https://aws.amazon.com/route53/), depending on -configuration) which provides load balancing across all of those -services. For example, in a very basic configuration: +In addition to the set of underlying Kubernetes services (one per cluster) +described above, Ubernetes has also created a DNS name (e.g. on +[Google Cloud DNS](https://cloud.google.com/dns) or +[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) +which provides load balancing across all of those services. For example, in a +very basic configuration: $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 -Each of the above IP addresses (which are just the external load -balancer ingress IP's of each cluster service) is of course load -balanced across the pods comprising the service in each cluster. +Each of the above IP addresses (which are just the external load balancer +ingress IP's of each cluster service) is of course load balanced across the pods +comprising the service in each cluster. In a more sophisticated configuration (e.g. 
on GCE or GKE), Ubernetes automatically creates a @@ -411,23 +411,21 @@ for failover purposes: my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 -If Ubernetes Global Service Health Checking is enabled, multiple -service health checkers running across the federated clusters -collaborate to monitor the health of the service endpoints, and -automatically remove unhealthy endpoints from the DNS record (e.g. a -majority quorum is required to vote a service endpoint unhealthy, to -avoid false positives due to individual health checker network +If Ubernetes Global Service Health Checking is enabled, multiple service health +checkers running across the federated clusters collaborate to monitor the health +of the service endpoints, and automatically remove unhealthy endpoints from the +DNS record (e.g. a majority quorum is required to vote a service endpoint +unhealthy, to avoid false positives due to individual health checker network isolation). ### Federated Replication Controllers -So far we have a federated service defined, with a resolvable load -balancer hostname by which clients can reach it, but no pods serving -traffic directed there. So now we need a Federated Replication -Controller. These are also fairly straight-forward, being comprised -of multiple underlying Kubernetes Replication Controllers which do the -hard work of keeping the desired number of Pod replicas alive in each -Kubernetes cluster. +So far we have a federated service defined, with a resolvable load balancer +hostname by which clients can reach it, but no pods serving traffic directed +there. So now we need a Federated Replication Controller. These are also fairly +straight-forward, being comprised of multiple underlying Kubernetes Replication +Controllers which do the hard work of keeping the desired number of Pod replicas +alive in each Kubernetes cluster. 
$ kubectl create -f my-service-rc.yaml --context="federation-1" @@ -495,54 +493,49 @@ something like this: status: replicas: 2 -The exact number of replicas created in each underlying cluster will -of course depend on what scheduling policy is in force. In the above -example, the scheduler created an equal number of replicas (2) in each -of the three underlying clusters, to make up the total of 6 replicas -required. To handle entire cluster failures, various approaches are possible, -including: +The exact number of replicas created in each underlying cluster will of course +depend on what scheduling policy is in force. In the above example, the +scheduler created an equal number of replicas (2) in each of the three +underlying clusters, to make up the total of 6 replicas required. To handle +entire cluster failures, various approaches are possible, including: 1. **simple overprovisioing**, such that sufficient replicas remain even if a - cluster fails. This wastes some resources, but is simple and - reliable. + cluster fails. This wastes some resources, but is simple and reliable. 2. **pod autoscaling**, where the replication controller in each cluster automatically and autonomously increases the number of replicas in its cluster in response to the additional traffic - diverted from the - failed cluster. This saves resources and is reatively simple, - but there is some delay in the autoscaling. + diverted from the failed cluster. This saves resources and is relatively + simple, but there is some delay in the autoscaling. 3. **federated replica migration**, where the Ubernetes Federation Control Plane detects the cluster failure and automatically increases the replica count in the remainaing clusters to make up - for the lost replicas in the failed cluster. This does not seem to + for the lost replicas in the failed cluster. 
This does not seem to offer any benefits relative to pod autoscaling above, and is arguably more complex to implement, but we note it here as a possibility. ### Implementation Details -The implementation approach and architecture is very similar to -Kubernetes, so if you're familiar with how Kubernetes works, none of -what follows will be surprising. One additional design driver not -present in Kubernetes is that Ubernetes aims to be resilient to -individual cluster and availability zone failures. So the control -plane spans multiple clusters. More specifically: +The implementation approach and architecture is very similar to Kubernetes, so +if you're familiar with how Kubernetes works, none of what follows will be +surprising. One additional design driver not present in Kubernetes is that +Ubernetes aims to be resilient to individual cluster and availability zone +failures. So the control plane spans multiple clusters. More specifically: + Ubernetes runs it's own distinct set of API servers (typically one or more per underlying Kubernetes cluster). These are completely distinct from the Kubernetes API servers for each of the underlying clusters. + Ubernetes runs it's own distinct quorum-based metadata store (etcd, - by default). Approximately 1 quorum member runs in each underlying + by default). Approximately 1 quorum member runs in each underlying cluster ("approximately" because we aim for an odd number of quorum members, and typically don't want more than 5 quorum members, even if we have a larger number of federated clusters, so 2 clusters->3 quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc). -Cluster Controllers in Ubernetes watch against the Ubernetes API -server/etcd state, and apply changes to the underlying kubernetes -clusters accordingly. They also have the anti-entropy mechanism for -reconciling ubernetes "desired desired" state against kubernetes -"actual desired" state. 
+Cluster Controllers in Ubernetes watch against the Ubernetes API server/etcd +state, and apply changes to the underlying kubernetes clusters accordingly. They +also have the anti-entropy mechanism for reconciling ubernetes "desired desired" +state against kubernetes "actual desired" state. diff --git a/federation-phase-1.md b/federation-phase-1.md index baf1e472..53087fd8 100644 --- a/federation-phase-1.md +++ b/federation-phase-1.md @@ -71,7 +71,8 @@ unified view. Here are the functionality requirements derived from above use cases: -+ Clients of the federation control plane API server can register and deregister clusters. ++ Clients of the federation control plane API server can register and deregister +clusters. + Workloads should be spread to different clusters according to the workload distribution policy. + Pods are able to discover and connect to services hosted in other @@ -90,7 +91,7 @@ Here are the functionality requirements derived from above use cases: It’s difficult to have a perfect design with one click that implements all the above requirements. Therefore we will go with an iterative approach to design and build the system. This document describes the -phase one of the whole work. In phase one we will cover only the +phase one of the whole work. In phase one we will cover only the following objectives: + Define the basic building blocks and API objects of control plane @@ -130,9 +131,9 @@ description of each module contained in above diagram. The API Server in the Ubernetes control plane works just like the API Server in K8S. It talks to a distributed key-value store to persist, -retrieve and watch API objects. This store is completely distinct +retrieve and watch API objects. This store is completely distinct from the kubernetes key-value stores (etcd) in the underlying -kubernetes clusters. We still use `etcd` as the distributed +kubernetes clusters. 
We still use `etcd` as the distributed storage so customers don’t need to learn and manage a different storage system, although it is envisaged that other storage systems (consol, zookeeper) will probably be developedand supported over @@ -141,16 +142,16 @@ time. ## Ubernetes Scheduler The Ubernetes Scheduler schedules resources onto the underlying -Kubernetes clusters. For example it watches for unscheduled Ubernetes +Kubernetes clusters. For example it watches for unscheduled Ubernetes replication controllers (those that have not yet been scheduled onto underlying Kubernetes clusters) and performs the global scheduling -work. For each unscheduled replication controller, it calls policy +work. For each unscheduled replication controller, it calls policy engine to decide how to spit workloads among clusters. It creates a Kubernetes Replication Controller on one ore more underlying cluster, and post them back to `etcd` storage. -One sublety worth noting here is that the scheduling decision is -arrived at by combining the application-specific request from the user (which might +One subtlety worth noting here is that the scheduling decision is arrived at by +combining the application-specific request from the user (which might include, for example, placement constraints), and the global policy specified by the federation administrator (for example, "prefer on-premise clusters over AWS clusters" or "spread load equally across clusters"). @@ -165,9 +166,9 @@ performs the following two kinds of work: corresponding API objects on the underlying K8S clusters. 1. It periodically retrieves the available resources metrics from the underlying K8S cluster, and updates them as object status of the - `cluster` API object. An alternative design might be to run a pod + `cluster` API object. An alternative design might be to run a pod in each underlying cluster that reports metrics for that cluster to - the Ubernetes control plane.
Which approach is better remains an + the Ubernetes control plane. Which approach is better remains an open topic of discussion. ## Ubernetes Service Controller @@ -187,7 +188,7 @@ Cluster is a new first-class API object introduced in this design. For each registered K8S cluster there will be such an API resource in control plane. The way clients register or deregister a cluster is to send corresponding REST requests to following URL: -`/api/{$version}/clusters`. Because control plane is behaving like a +`/api/{$version}/clusters`. Because control plane is behaving like a regular K8S client to the underlying clusters, the spec of a cluster object contains necessary properties like K8S cluster address and credentials. The status of a cluster API object will contain @@ -294,7 +295,7 @@ $version.clusterStatus **For simplicity we didn’t introduce a separate “cluster metrics” API object here**. The cluster resource metrics are stored in cluster status section, just like what we did to nodes in K8S. In phase one it -only contains available CPU resources and memory resources. The +only contains available CPU resources and memory resources. The cluster controller will periodically poll the underlying cluster API Server to get cluster capability. In phase one it gets the metrics by simply aggregating metrics from all nodes. In future we will improve @@ -315,7 +316,7 @@ Below is the state transition diagram. ## Replication Controller A global workload submitted to control plane is represented as an -Ubernetes replication controller. When a replication controller +Ubernetes replication controller. When a replication controller is submitted to control plane, clients need a way to express its requirements or preferences on clusters. Depending on different use cases it may be complex. For example: @@ -327,7 +328,7 @@ cases it may be complex. 
For example: (use case: workload ) + Seventy percent of this workload should be scheduled to cluster Foo, and thirty percent should be scheduled to cluster Bar (use case: - vendor lock-in avoidance). In phase one, we only introduce a + vendor lock-in avoidance). In phase one, we only introduce a _clusterSelector_ field to filter acceptable clusters. In default case there is no such selector and it means any cluster is acceptable. @@ -376,7 +377,7 @@ clusters. How to handle this will be addressed after phase one. The Service API object exposed by Ubernetes is similar to service objects on Kubernetes. It defines the access to a group of pods. The Ubernetes service controller will create corresponding Kubernetes -service objects on underlying clusters. These are detailed in a +service objects on underlying clusters. These are detailed in a separate design document: [Federated Services](federated-services.md). ## Pod @@ -389,7 +390,8 @@ order to keep the Ubernetes API compatible with the Kubernetes API. ## Scheduling -The below diagram shows how workloads are scheduled on the Ubernetes control plane: +The below diagram shows how workloads are scheduled on the Ubernetes control +plane: 1. A replication controller is created by the client. 1. APIServer persists it into the storage. @@ -425,8 +427,8 @@ proposed solutions like resource reservation mechanisms. This part has been included in the section “Federated Service” of document -“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md))”. Please -refer to that document for details. +“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. +Please refer to that document for details.
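The default "spread load equally across clusters" policy mentioned earlier can be sketched as follows. `spreadReplicas` is a hypothetical helper for illustration, not part of the actual scheduler or policy engine.

```go
package main

import "fmt"

// spreadReplicas divides the requested replica count as evenly as
// possible among the acceptable (selector-filtered) clusters, with
// earlier clusters absorbing any remainder.
func spreadReplicas(total int, clusters []string) map[string]int {
	counts := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return counts
	}
	base := total / len(clusters)
	extra := total % len(clusters)
	for i, c := range clusters {
		counts[c] = base
		if i < extra {
			counts[c]++
		}
	}
	return counts
}

func main() {
	// 6 replicas across 3 clusters yields 2 per cluster, matching the
	// federated replication controller example above.
	fmt.Println(spreadReplicas(6, []string{"cluster-1", "cluster-2", "cluster-3"}))
}
```

A weighted policy (e.g. the seventy/thirty split in the use case above) would replace the equal division with per-cluster weights.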
diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index a5969c01..1b0d78bd 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -36,33 +36,40 @@ Documentation for other releases can be found at ## Preface -This document briefly describes the design of the horizontal autoscaler for pods. -The autoscaler (implemented as a Kubernetes API resource and controller) is responsible for dynamically controlling -the number of replicas of some collection (e.g. the pods of a ReplicationController) to meet some objective(s), +This document briefly describes the design of the horizontal autoscaler for +pods. The autoscaler (implemented as a Kubernetes API resource and controller) +is responsible for dynamically controlling the number of replicas of some +collection (e.g. the pods of a ReplicationController) to meet some objective(s), for example a target per-pod CPU utilization. This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md). ## Overview -The resource usage of a serving application usually varies over time: sometimes the demand for the application rises, -and sometimes it drops. -In Kubernetes version 1.0, a user can only manually set the number of serving pods. -Our aim is to provide a mechanism for the automatic adjustment of the number of pods based on CPU utilization statistics -(a future version will allow autoscaling based on other resources/metrics). +The resource usage of a serving application usually varies over time: sometimes +the demand for the application rises, and sometimes it drops. In Kubernetes +version 1.0, a user can only manually set the number of serving pods. Our aim is +to provide a mechanism for the automatic adjustment of the number of pods based +on CPU utilization statistics (a future version will allow autoscaling based on +other resources/metrics). 
## Scale Subresource -In Kubernetes version 1.1, we are introducing Scale subresource and implementing horizontal autoscaling of pods based on it. -Scale subresource is supported for replication controllers and deployments. -Scale subresource is a Virtual Resource (does not correspond to an object stored in etcd). -It is only present in the API as an interface that a controller (in this case the HorizontalPodAutoscaler) can use to dynamically scale -the number of replicas controlled by some other API object (currently ReplicationController and Deployment) and to learn the current number of replicas. -Scale is a subresource of the API object that it serves as the interface for. -The Scale subresource is useful because whenever we introduce another type we want to autoscale, we just need to implement the Scale subresource for it. -The wider discussion regarding Scale took place in [#1629](https://github.com/kubernetes/kubernetes/issues/1629). - -Scale subresource is in API for replication controller or deployment under the following paths: +In Kubernetes version 1.1, we are introducing the Scale subresource and implementing +horizontal autoscaling of pods based on it. The Scale subresource is supported for +replication controllers and deployments. The Scale subresource is a virtual resource +(it does not correspond to an object stored in etcd). It is only present in the API +as an interface that a controller (in this case the HorizontalPodAutoscaler) can +use to dynamically scale the number of replicas controlled by some other API +object (currently ReplicationController and Deployment) and to learn the current +number of replicas. Scale is a subresource of the API object that it serves as +the interface for. The Scale subresource is useful because whenever we introduce +another type we want to autoscale, we just need to implement the Scale +subresource for it. The wider discussion regarding Scale took place in issue +[#1629](https://github.com/kubernetes/kubernetes/issues/1629).
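The controller-facing role of the Scale subresource can be sketched as follows. This is a simplified, hypothetical model: the real ScaleSpec/ScaleStatus types defined later in this document also carry API metadata, and a real controller reads and writes the subresource through the API server rather than in memory.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Scale subresource types; the
// real API objects also carry TypeMeta/ObjectMeta and are served under the
// extensions API group.
type ScaleSpec struct {
	Replicas int // desired number of replicas
}

type ScaleStatus struct {
	Replicas int               // number of replicas currently observed
	Selector map[string]string // label selector for the scaled pods
}

type Scale struct {
	Spec   ScaleSpec
	Status ScaleStatus
}

// resize mimics what a controller such as the HorizontalPodAutoscaler does
// through the subresource: inspect the observed count, then write the
// desired count, without ever touching the underlying object's full spec.
func resize(s *Scale, desired int) {
	s.Spec.Replicas = desired
}

func main() {
	s := &Scale{Status: ScaleStatus{Replicas: 3, Selector: map[string]string{"app": "myrc"}}}
	fmt.Println("observed:", s.Status.Replicas)
	resize(s, 5)
	fmt.Println("desired:", s.Spec.Replicas)
}
```

The narrow surface is the point of the design: an autoscaler only ever needs these two fields, so any type that implements the subresource becomes scalable.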
+ +The Scale subresource is in the API for replication controllers and deployments under the +following paths: `apis/extensions/v1beta1/replicationcontrollers/myrc/scale` @@ -99,14 +106,15 @@ type ScaleStatus struct { } ``` -Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment associated with -the given Scale subresource. -`ScaleStatus.Replicas` reports how many pods are currently running in the replication controller/deployment, -and `ScaleStatus.Selector` returns selector for the pods. +Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment +associated with the given Scale subresource. `ScaleStatus.Replicas` reports how +many pods are currently running in the replication controller/deployment, and +`ScaleStatus.Selector` returns the selector for the pods. ## HorizontalPodAutoscaler Object -In Kubernetes version 1.1, we are introducing HorizontalPodAutoscaler object. It is accessible under: +In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It +is accessible under: `apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler` @@ -168,8 +176,9 @@ type HorizontalPodAutoscalerStatus struct { ``` `ScaleRef` is a reference to the Scale subresource. -`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler configuration. -We are also introducing HorizontalPodAutoscalerList object to enable listing all autoscalers in a namespace: +`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler +configuration. We are also introducing the HorizontalPodAutoscalerList object to +enable listing all autoscalers in a namespace: ```go // list of horizontal pod autoscaler objects. @@ -184,19 +193,22 @@ type HorizontalPodAutoscalerList struct { ## Autoscaling Algorithm -The autoscaler is implemented as a control loop. It periodically queries pods described by `Status.PodSelector` of Scale subresource, and collects their CPU utilization.
-Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in `Spec.CPUUtilization`, -and adjust the replicas of the Scale if needed to match the target -(preserving condition: MinReplicas <= Replicas <= MaxReplicas). +The autoscaler is implemented as a control loop. It periodically queries pods +described by `Status.PodSelector` of the Scale subresource, and collects their CPU +utilization. Then, it compares the arithmetic mean of the pods' CPU utilization +with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of +the Scale if needed to match the target (preserving condition: MinReplicas <= +Replicas <= MaxReplicas). -The period of the autoscaler is controlled by `--horizontal-pod-autoscaler-sync-period` flag of controller manager. -The default value is 30 seconds. +The period of the autoscaler is controlled by the +`--horizontal-pod-autoscaler-sync-period` flag of the controller manager. The +default value is 30 seconds. -CPU utilization is the recent CPU usage of a pod (average across the last 1 minute) divided by the CPU requested by the pod. -In Kubernetes version 1.1, CPU usage is taken directly from Heapster. -In future, there will be API on master for this purpose -(see [#11951](https://github.com/kubernetes/kubernetes/issues/11951)). +CPU utilization is the recent CPU usage of a pod (average across the last 1 +minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU +usage is taken directly from Heapster. In the future, there will be an API on the +master for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)). The target number of pods is calculated from the following formula: @@ -204,66 +216,76 @@ The target number of pods is calculated from the following formula: TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target) ``` -Starting and stopping pods may introduce noise to the metric (for instance, starting may temporarily increase CPU).
-So, after each action, the autoscaler should wait some time for reliable data. -Scale-up can only happen if there was no rescaling within the last 3 minutes. -Scale-down will wait for 5 minutes from the last rescaling. -Moreover any scaling will only be made if: `avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 (10% tolerance). -Such approach has two benefits: +Starting and stopping pods may introduce noise to the metric (for instance, +starting may temporarily increase CPU). So, after each action, the autoscaler +should wait some time for reliable data. Scale-up can only happen if there was +no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from +the last rescaling. Moreover, any scaling will only be made if +`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 +(10% tolerance). Such an approach has two benefits: -* Autoscaler works in a conservative way. - If new user load appears, it is important for us to rapidly increase the number of pods, - so that user requests will not be rejected. - Lowering the number of pods is not that urgent. +* The autoscaler works in a conservative way. If new user load appears, it is +important for us to rapidly increase the number of pods, so that user requests +will not be rejected. Lowering the number of pods is not that urgent. -* Autoscaler avoids thrashing, i.e.: prevents rapid execution of conflicting decision if the load is not stable. +* The autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting +decisions if the load is not stable. ## Relative vs. absolute metrics -We chose values of the target metric to be relative (e.g. 90% of requested CPU resource) rather than absolute (e.g. 0.6 core) for the following reason. -If we choose absolute metric, user will need to guarantee that the target is lower than the request.
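The sizing formula and the 10% tolerance band can be sketched as a simplified Go model; the cooldown windows and the Heapster integration are omitted, and the function names here are illustrative, not part of the design.

```go
package main

import (
	"fmt"
	"math"
)

// targetNumOfPods applies the formula from the design:
// TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target).
// Utilizations and target are fractions of requested CPU (0.8 == 80%).
func targetNumOfPods(utilizations []float64, target float64) int {
	sum := 0.0
	for _, u := range utilizations {
		sum += u
	}
	return int(math.Ceil(sum / target))
}

// shouldRescale applies the 10% tolerance rule: rescale only when the
// average utilization divided by the target leaves the (0.9, 1.1) band.
// It assumes a non-empty utilization slice.
func shouldRescale(utilizations []float64, target float64) bool {
	sum := 0.0
	for _, u := range utilizations {
		sum += u
	}
	ratio := sum / float64(len(utilizations)) / target
	return ratio < 0.9 || ratio > 1.1
}

func main() {
	// Three pods running at 100% of request against an 80% target:
	// ceil(3.0 / 0.8) = 4, and 1.0 / 0.8 = 1.25 lies outside the band.
	utils := []float64{1.0, 1.0, 1.0}
	fmt.Println(targetNumOfPods(utils, 0.8)) // 4
	fmt.Println(shouldRescale(utils, 0.8))   // true
}
```

In the real controller these two checks are combined with the 3-minute scale-up and 5-minute scale-down waits before any write to the Scale subresource is made.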
-Otherwise, overloaded pods may not be able to consume more than the autoscaler's absolute target utilization, -thereby preventing the autoscaler from seeing high enough utilization to trigger it to scale up. -This may be especially troublesome when user changes requested resources for a pod +We chose values of the target metric to be relative (e.g. 90% of requested CPU +resource) rather than absolute (e.g. 0.6 core) for the following reason. If we +chose an absolute metric, the user would need to guarantee that the target is lower +than the request. Otherwise, overloaded pods may not be able to consume more +than the autoscaler's absolute target utilization, thereby preventing the +autoscaler from seeing high enough utilization to trigger it to scale up. This +may be especially troublesome when the user changes requested resources for a pod because they would need to also change the autoscaler utilization threshold. -Therefore, we decided to choose relative metric. -For user, it is enough to set it to a value smaller than 100%, and further changes of requested resources will not invalidate it. +Therefore, we decided to choose a relative metric. For the user, it is enough to set +it to a value smaller than 100%, and further changes of requested resources will +not invalidate it. ## Support in kubectl -To make manipulation of HorizontalPodAutoscaler object simpler, we added support for -creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. -In addition, in future, we are planning to add kubectl support for the following use-cases: -* When creating a replication controller or deployment with `kubectl create [-f]`, there should be - a possibility to specify an additional autoscaler object. - (This should work out-of-the-box when creation of autoscaler is supported by kubectl as we may include - multiple objects in the same config file). -* *[future]* When running an image with `kubectl run`, there should be an additional option to create - an autoscaler for it.
-* *[future]* We will add a new command `kubectl autoscale` that will allow for easy creation of an autoscaler object - for already existing replication controller/deployment. +To make manipulation of the HorizontalPodAutoscaler object simpler, we added support +for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In +addition, in the future, we plan to add kubectl support for the following +use-cases: +* When creating a replication controller or deployment with +`kubectl create [-f]`, there should be a possibility to specify an additional +autoscaler object. (This should work out-of-the-box when creation of autoscaler +is supported by kubectl as we may include multiple objects in the same config +file). +* *[future]* When running an image with `kubectl run`, there should be an +additional option to create an autoscaler for it. +* *[future]* We will add a new command `kubectl autoscale` that will allow for +easy creation of an autoscaler object for an already existing replication +controller/deployment. ## Next steps We list here some features that are not supported in Kubernetes version 1.1. -However, we want to keep them in mind, as they will most probably be needed in future. +However, we want to keep them in mind, as they will most probably be needed in +the future. Our design is in general compatible with them. -* *[future]* **Autoscale pods based on metrics different than CPU** (e.g. memory, network traffic, qps). - This includes scaling based on a custom/application metric. -* *[future]* **Autoscale pods base on an aggregate metric.** - Autoscaler, instead of computing average for a target metric across pods, will use a single, external, metric (e.g. qps metric from load balancer). - The metric will be aggregated while the target will remain per-pod - (e.g. when observing 100 qps on load balancer while the target is 20 qps per pod, autoscaler will set the number of replicas to 5).
-* *[future]* **Autoscale pods based on multiple metrics.** - If the target numbers of pods for different metrics are different, choose the largest target number of pods. -* *[future]* **Scale the number of pods starting from 0.** - All pods can be turned-off, and then turned-on when there is a demand for them. - When a request to service with no pods arrives, kube-proxy will generate an event for autoscaler - to create a new pod. - Discussed in [#3247](https://github.com/kubernetes/kubernetes/issues/3247). -* *[future]* **When scaling down, make more educated decision which pods to kill.** - E.g.: if two or more pods from the same replication controller are on the same node, kill one of them. - Discussed in [#4301](https://github.com/kubernetes/kubernetes/issues/4301). +* *[future]* **Autoscale pods based on metrics other than CPU** (e.g. +memory, network traffic, qps). This includes scaling based on a custom/application metric. +* *[future]* **Autoscale pods based on an aggregate metric.** The autoscaler, +instead of computing an average for a target metric across pods, will use a single, +external metric (e.g. a qps metric from the load balancer). The metric will be +aggregated while the target will remain per-pod (e.g. when observing 100 qps on +the load balancer while the target is 20 qps per pod, the autoscaler will set the number +of replicas to 5). +* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers +of pods for different metrics are different, choose the largest target number of +pods. +* *[future]* **Scale the number of pods starting from 0.** All pods can be +turned off, and then turned on when there is a demand for them. When a request +to a service with no pods arrives, kube-proxy will generate an event for the +autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
+* *[future]* **When scaling down, make a more educated decision about which pods to +kill.** E.g.: if two or more pods from the same replication controller are on +the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301). diff --git a/identifiers.md b/identifiers.md index fc5e6925..175d25c9 100644 --- a/identifiers.md +++ b/identifiers.md @@ -34,95 +34,111 @@ Documentation for other releases can be found at # Identifiers and Names in Kubernetes -A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](http://issue.k8s.io/199). +A summary of the goals and recommendations for identifiers in Kubernetes. +Described in GitHub issue [#199](http://issue.k8s.io/199). ## Definitions -UID -: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities. +`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time +and space; intended to distinguish between historical occurrences of similar +entities. -Name -: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations. +`Name`: A non-empty string guaranteed to be unique within a given scope at a +particular time; used in resource URLs; provided by clients at creation time and +encouraged to be human friendly; intended to facilitate creation idempotence and +space-uniqueness of singleton objects, distinguish distinct entities, and +reference particular entities across operations.
-[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL) -: An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name +[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL): +An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, +with the '-' character allowed anywhere except the first or last character, +suitable for use as a hostname or segment in a domain name. -[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN) -: One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters +[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN): +One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum +length of 253 characters. -[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID) -: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination +[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID): +A 128 bit generated value that is extremely unlikely to collide across time and +space and requires no central coordination. 
-[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) port name (IANA_SVC_NAME) -: An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, with the '-' character allowed anywhere except the first or the last character or adjacent to another '-' character, it must contain at least a (a-z) character +[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME): +An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, +with the '-' character allowed anywhere except the first or the last character +or adjacent to another '-' character; it must contain at least one letter (a-z). ## Objectives for names and UIDs -1. Uniquely identify (via a UID) an object across space and time - -2. Uniquely name (via a name) an object across space - -3. Provide human-friendly names in API operations and/or configuration files - -4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects - -5. Allow DNS names to be automatically generated for some objects +1. Uniquely identify (via a UID) an object across space and time. +2. Uniquely name (via a name) an object across space. +3. Provide human-friendly names in API operations and/or configuration files. +4. Allow idempotent creation of API resources (#148) and enforcement of +space-uniqueness of singleton objects. +5. Allow DNS names to be automatically generated for some objects. ## General design -1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency. +1.
When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must +be specified. Name must be non-empty and unique within the apiserver. This +enables idempotent and space-unique creation operations. Parts of the system +(e.g. replication controller) may join strings (e.g. a base name and a random +suffix) to create a unique Name. For situations where generating a name is +impractical, some or all objects may support a param to auto-generate a name. +Generating random names will defeat idempotency. * Examples: "guestbook.user", "backend-x4eb1" - -2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random). +2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? +format TBD via #1114) may be specified. Depending on the API receiver, +namespaces might be validated (e.g. apiserver might ensure that the namespace +actually exists). If a namespace is not specified, one will be assigned by the +API receiver. This assignment policy might vary across API receivers (e.g. +apiserver might have a default, kubelet might generate something semi-random). * Example: "api.k8s.example.com" - -3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time. +3. Upon acceptance of an object via an API, the object is assigned a UID +(a UUID). UID must be non-empty and unique across space and time. * Example: "01234567-89ab-cdef-0123-456789abcdef" - ## Case study: Scheduling a pod -Pods can be placed onto a particular node in a number of ways. 
This case -study demonstrates how the above design can be applied to satisfy the -objectives. +Pods can be placed onto a particular node in a number of ways. This case study +demonstrates how the above design can be applied to satisfy the objectives. ### A pod scheduled by a user through the apiserver 1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver. - 2. The apiserver validates the input. 1. A default Namespace is assigned. 2. The pod name must be space-unique within the Namespace. - 3. Each container within the pod has a name which must be space-unique within the pod. - + 3. Each container within the pod has a name which must be space-unique within +the pod. 3. The pod is accepted. 1. A new UID is assigned. - 4. The pod is bound to a node. 1. The kubelet on the node is passed the pod's UID, Namespace, and Name. - 5. Kubelet validates the input. - 6. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is unique across time. - * This may correspond to Docker's container ID. + 1. Each container is started up with enough metadata to distinguish the pod +from whence it came. + 2. Each attempt to run a container is assigned a UID (a string) that is +unique across time. * This may correspond to Docker's container ID. ### A pod placed by a config file on the node -1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor". - +1. A config file is stored on the node, containing a pod with UID="", +Namespace="", and Name="cadvisor". 2. Kubelet validates the input. 1. Since UID is not provided, kubelet generates one. 2. Since Namespace is not provided, kubelet generates one. - 1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path. + 1. 
The generated namespace should be deterministic and cluster-unique for +the source, such as a hash of the hostname and file path. * E.g. Namespace="file-f4231812554558a718a01ca942782d81" - 3. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is unique across time. + 1. Each container is started up with enough metadata to distinguish the pod +from whence it came. + 2. Each attempt to run a container is assigned a UID (a string) that is +unique across time. 1. This may correspond to Docker's container ID. diff --git a/indexed-job.md b/indexed-job.md index b4d06dde..6c41bd64 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -53,64 +53,66 @@ a third way to run embarrassingly parallel programs, with a focus on ease of use. This new style of Job is called an *indexed job*, because each Pod of the Job -is specialized to work on a particular *index* from a fixed length array of work items. +is specialized to work on a particular *index* from a fixed-length array of work +items. ## Background The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports the embarrassingly parallel use case through *workqueue jobs*. -While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) - are very flexible, they can be difficult to use. -They: (1) typically require running a message queue -or other database service, (2) typically require modifications -to existing binaries and images and (3) subtle race conditions -are easy to overlook. +While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very +flexible, they can be difficult to use. They: (1) typically require running a +message queue or other database service, (2) typically require modifications +to existing binaries and images, and (3) have subtle race conditions that are easy to +overlook.
Users also have another option for parallel jobs: creating [multiple Job objects -from a template](hdocs/design/indexed-job.md#job-patterns). -For small numbers of Jobs, this is a fine choice. Labels make it easy to view and -delete multiple Job objects at once. But, that approach also has its drawbacks: -(1) for large levels of parallelism (hundreds or thousands of pods) this approach -means that listing all jobs presents too much information, (2) users want a single -source of information about the success or failure of what the user views as a single +from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of +Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job +objects at once. But, that approach also has its drawbacks: (1) for large levels +of parallelism (hundreds or thousands of pods) this approach means that listing +all jobs presents too much information, (2) users want a single source of +information about the success or failure of what the user views as a single logical process. -Indexed job fills provides a third option with better ease-of-use for common use cases. +Indexed job provides a third option with better ease-of-use for common +use cases. ## Requirements ### User Requirements - Users want an easy way to run a Pod to completion *for each* item within a - [work list](#example-use-cases). +[work list](#example-use-cases). - Users want to run these pods in parallel for speed, but to vary the level of - parallelism as needed, independent of the number of work items. +parallelism as needed, independent of the number of work items. - Users want to do this without requiring changes to existing images, or source-to-image pipelines. - Users want a single object that encompasses the lifetime of the parallel - program. Deleting it should delete all dependent objects. It should report - the status of the overall process.
Users should be - able to wait for it to complete, and can refer to it from other resource types, such as - [ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980). +program. Deleting it should delete all dependent objects. It should report the +status of the overall process. Users should be able to wait for it to complete, +and can refer to it from other resource types, such as +[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980). ### Example Use Cases -Here are several examples of *work lists*: lists of command lines that the -user wants to run, each line its own Pod. (Note that in practice, a work -list may not ever be written out in this form, but it exists in the mind of -the Job creator, and it is a useful way to talk about the the intent of the user when discussing alternatives for specifying Indexed Jobs). +Here are several examples of *work lists*: lists of command lines that the user +wants to run, each line its own Pod. (Note that in practice, a work list may not +ever be written out in this form, but it exists in the mind of the Job creator, +and it is a useful way to talk about the intent of the user when discussing +alternatives for specifying Indexed Jobs). Note that we will not have the user express their requirements in work list -form; it is just a format for presenting use cases. Subsequent discussion -will reference these work lists. +form; it is just a format for presenting use cases. Subsequent discussion will +reference these work lists.
#### Work List 1 -Process several files with the same program +Process several files with the same program: ``` /usr/local/bin/process_file 12342.dat @@ -120,7 +122,7 @@ Process several files with the same program #### Work List 2 -Process a matrix (or image, etc) in rectangular blocks +Process a matrix (or image, etc) in rectangular blocks: ``` /usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15 @@ -131,7 +133,7 @@ Process a matrix (or image, etc) in rectangular blocks #### Work List 3 -Build a program at several different git commits +Build a program at several different git commits: ``` HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH @@ -141,7 +143,7 @@ HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH #### Work List 4 -Render several frames of a movie. +Render several frames of a movie: ``` ./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1 @@ -151,7 +153,8 @@ Render several frames of a movie. #### Work List 5 -Render several blocks of frames. (Render blocks to avoid Pod startup overhead for every frame) +Render several blocks of frames (Render blocks to avoid Pod startup overhead for +every frame): ``` ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100 @@ -167,57 +170,59 @@ Given a work list, like in the [work list examples](#work-list-examples), the information from the work list needs to get into each Pod of the Job. Users will typically not want to create a new image for each job they -run. They will want to use existing images. So, the image is not the place +run. They will want to use existing images. So, the image is not the place for the work list. A work list can be stored on networked storage, and mounted by pods of the job. -Also, as a shortcut, for small worklists, it can be included in an annotation on the Job object, -which is then exposed as a volume in the pod via the downward API. 
+Also, as a shortcut, for small worklists, it can be included in an annotation on
+the Job object, which is then exposed as a volume in the pod via the downward
+API.

### What Varies Between Pods of a Job

-Pods need to differ in some way to do something different. (They do not
-differ in the work-queue style of Job, but that style has ease-of-use issues).
+Pods need to differ in some way to do something different. (They do not differ
+in the work-queue style of Job, but that style has ease-of-use issues).

-A general approach would be to allow pods to differ from each other in arbitrary ways.
-For example, the Job object could have a list of PodSpecs to run.
-However, this is so general that it provides little value. It would:
+A general approach would be to allow pods to differ from each other in arbitrary
+ways. For example, the Job object could have a list of PodSpecs to run.
+However, this is so general that it provides little value. It would:

-- make the Job Spec very verbose, especially for jobs with thousands of work items
+- make the Job Spec very verbose, especially for jobs with thousands of work
+items
- Job becomes such a vague concept that it is hard to explain to users
-- in practice, we do not see cases where many pods which differ across many fields of their
- specs, and need to run as a group, with no ordering constraints.
+- in practice, we do not see cases where many pods differ across many fields of
+their specs and need to run as a group, with no ordering constraints.
- CLIs and UIs need to support more options for creating Job
-- it is useful for monitoring and accounting databases want to aggregate data for pods
- with the same controller. However, pods with very different Specs may not make sense
- to aggregate.
-- profiling, debugging, accounting, auditing and monitoring tools cannot assume common
- images/files, behaviors, provenance and so on between Pods of a Job.
+- it is useful for monitoring and accounting databases to aggregate data
+for pods with the same controller. However, pods with very different Specs may
+not make sense to aggregate.
+- profiling, debugging, accounting, auditing and monitoring tools cannot assume
+common images/files, behaviors, provenance and so on between Pods of a Job.

-Also, variety has another cost. Pods which differ in ways that affect scheduling
-(node constraints, resource requirements, labels) prevent the scheduler
-from treating them as fungible, which is an important optimization for the scheduler.
+Also, variety has another cost. Pods which differ in ways that affect scheduling
+(node constraints, resource requirements, labels) prevent the scheduler from
+treating them as fungible, which is an important optimization for the scheduler.

Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case). We will try to
-allow as little as possible to differ between pods of the same Job, while
-still allowing users to express common parallel patterns easily.
-For users who need to run jobs which differ in other ways, they can create multiple
-Jobs, and manage them as a group using labels.
+allow as little as possible to differ between pods of the same Job, while still
+allowing users to express common parallel patterns easily. For users who need to
+run jobs which differ in other ways, they can create multiple Jobs, and manage
+them as a group using labels.

From the above work lists, we see a need for Pods which differ in their
command lines, and in their environment variables. These work lists do not
require the pods to differ in other ways.

-Experience in a [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) has shown this model to be applicable
-to a very broad range of problems, despite this restriction.
-
-Therefore we to allow pods in the same Job to differ **only** in the following aspects:
+Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
+has shown this model to be applicable to a very broad range of problems, despite
+this restriction.
+Therefore we allow pods in the same Job to differ **only** in the following
+aspects:

- command line
- environment variables
-
### Composition of existing images

The docker image that is used in a job may not be maintained by the person
@@ -230,9 +235,9 @@ This needs more thought.

### Running Ad-Hoc Jobs using kubectl

-A user should be able to easily start an Indexed Job using `kubectl`.
-For example to run [work list 1](#work-list-1), a user should be able
-to type something simple like:
+A user should be able to easily start an Indexed Job using `kubectl`. For
+example to run [work list 1](#work-list-1), a user should be able to type
+something simple like:

```
kubectl run process-files --image=myfileprocessor \
@@ -246,13 +251,16 @@ In the above example:

- `--restart=OnFailure` implies creating a job instead of replicationController.
- Each pods command line is `/usr/local/bin/process_file $F`.
-- `--per-completion-env=` implies the jobs `.spec.completions` is set to the length of the argument array (3 in the example).
-- `--per-completion-env=F=` causes env var with `F` to be available in the environment when the command line is evaluated.
+- `--per-completion-env=` implies the job's `.spec.completions` is set to the
+length of the argument array (3 in the example).
+- `--per-completion-env=F=` causes an env var `F` to be available in
+the environment when the command line is evaluated.

-How exactly this happens is discussed later in the doc: this is a sketch of the user experience.
+How exactly this happens is discussed later in the doc: this is a sketch of the
+user experience.
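As a rough local emulation of the intended semantics (a sketch only; `process_file` and the values are taken from the example above, and the real flag behavior may differ): the same command line is evaluated once per value of `$F`, one Pod per value. Looping locally:

```shell
# Sketch: emulate what each of the three pods would evaluate. In the
# real Job, each pod sees exactly one value of $F; here we loop locally.
for F in 12342.dat 97283.dat 38732.dat; do
  echo "/usr/local/bin/process_file $F"
done
```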
-In practice, the list of files might be much longer and stored in a file
-on the users local host, like:
+In practice, the list of files might be much longer and stored in a file on the
+user's local host, like:

```
$ cat files-to-process.txt
@@ -266,16 +274,27 @@ So, the user could specify instead: `--per-completion-env=F="$(cat files-to-proc

However, `kubectl` should also support a format like:
`--per-completion-env=F=@files-to-process.txt`.

-That allows `kubectl` to parse the file, point out any syntax errors, and would not run up against command line length limits (2MB is common, as low as 4kB is POSIX compliant).
-
-One case we do not try to handle is where the file of work is stored on a cloud filesystem, and not accessible from the users local host. Then we cannot easily use indexed job, because we do not know the number of completions. The user needs to copy the file locally first or use the Work-Queue style of Job (already supported).
-
-Another case we do not try to handle is where the input file does not exist yet because this Job is to be run at a future time, or depends on another job. The workflow and scheduled job proposal need to consider this case. For that case, you could use an indexed job which runs a program which shards the input file (map-reduce-style).
+That allows `kubectl` to parse the file, point out any syntax errors, and would
+not run up against command line length limits (2MB is common, as low as 4kB is
+POSIX compliant).
+
+One case we do not try to handle is where the file of work is stored on a cloud
+filesystem, and not accessible from the user's local host. Then we cannot easily
+use indexed job, because we do not know the number of completions. The user
+needs to copy the file locally first or use the Work-Queue style of Job (already
+supported).
+
+Another case we do not try to handle is where the input file does not exist yet
+because this Job is to be run at a future time, or depends on another job.
The +workflow and scheduled job proposal need to consider this case. For that case, +you could use an indexed job which runs a program which shards the input file +(map-reduce-style). #### Multiple parameters The user may also have multiple parameters, like in [work list 2](#work-list-2). -One way is to just list all the command lines already expanded, one per line, in a file, like this: +One way is to just list all the command lines already expanded, one per line, in +a file, like this: ``` $ cat matrix-commandlines.txt @@ -295,10 +314,12 @@ kubectl run process-matrix --image=my/matrix \ 'eval "$COMMAND_LINE"' ``` -However, this may have some subtleties with shell escaping. Also, it depends on the user -knowing all the correct arguments to the docker image being used (more on this later). +However, this may have some subtleties with shell escaping. Also, it depends on +the user knowing all the correct arguments to the docker image being used (more +on this later). -Instead, kubectl should support multiple instances of the `--per-completion-env` flag. For example, to implement work list 2, a user could do: +Instead, kubectl should support multiple instances of the `--per-completion-env` +flag. For example, to implement work list 2, a user could do: ``` kubectl run process-matrix --image=my/matrix \ @@ -313,8 +334,8 @@ kubectl run process-matrix --image=my/matrix \ ### Composition With Workflows and ScheduledJob -A user should be able to create a job (Indexed or not) which runs at a specific time(s). -For example: +A user should be able to create a job (Indexed or not) which runs at a specific +time(s). For example: ``` $ kubectl run process-files --image=myfileprocessor \ @@ -326,12 +347,16 @@ $ kubectl run process-files --image=myfileprocessor \ created "scheduledJob/process-files-37dt3" ``` -Kubectl should build the same JobSpec, and then put it into a ScheduledJob (#11980) and create that. 
+Kubectl should build the same JobSpec, and then put it into a ScheduledJob
+(#11980) and create that.

-For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a complete workflow from a single command line would be messy, because of the need to specify all the arguments multiple times.
+For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
+complete workflow from a single command line would be messy, because of the need
+to specify all the arguments multiple times.

-For that use case, the user could create a workflow message by hand.
-Or the user could create a job template, and then make a workflow from the templates, perhaps like this:
+For that use case, the user could create a workflow message by hand. Or the user
+could create a job template, and then make a workflow from the templates,
+perhaps like this:

```
$ kubectl run process-files --image=myfileprocessor \
@@ -357,17 +382,17 @@ created "workflow/process-and-merge"

### Completion Indexes

A JobSpec specifies the number of times a pod needs to complete successfully,
-through the `job.Spec.Completions` field. The number of completions
-will be equal to the number of work items in the work list.
+through the `job.Spec.Completions` field. The number of completions will be
+equal to the number of work items in the work list.

Each pod that the job controller creates is intended to complete one work item
-from the work list. Since a pod may fail, several pods may, serially,
-attempt to complete the same index. Therefore, we call it a
-a *completion index* (or just *index*), but not a *pod index*.
+from the work list. Since a pod may fail, several pods may, serially, attempt to
+complete the same index. Therefore, we call it a *completion index* (or just
+*index*), but not a *pod index*.
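Since the completion index is the only distinguishing input a pod receives, the pod can map it to a work item itself. A hedged sketch (assuming a 0-based `$INDEX` environment variable, as in the kubectl examples; the inline value list stands in for a parameter array kept in shared storage):

```shell
# Sketch: select this pod's work item by completion index. $INDEX would
# be populated from the pod's annotation via the downward API.
INDEX=1
set -- 12342.dat 97283.dat 38732.dat  # stand-in for a shared parameter array
shift "$INDEX"                        # skip the first $INDEX items (0-based)
F="$1"
echo "processing $F"
```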
-For each completion index, in the range 1 to `.job.Spec.Completions`, -the job controller will create a pod with that index, and keep creating them -on failure, until each index is completed. +For each completion index, in the range 1 to `.job.Spec.Completions`, the job +controller will create a pod with that index, and keep creating them on failure, +until each index is completed. An dense integer index, rather than a sparse string index (e.g. using just `metadata.generate-name`) makes it easy to use the index to lookup parameters @@ -375,9 +400,9 @@ in, for example, an array in shared storage. ### Pod Identity and Template Substitution in Job Controller -The JobSpec contains a single pod template. When the job controller creates a particular -pod, it copies the pod template and modifies it in some way to make that pod distinctive. -Whatever is distinctive about that pod is its *identity*. +The JobSpec contains a single pod template. When the job controller creates a +particular pod, it copies the pod template and modifies it in some way to make +that pod distinctive. Whatever is distinctive about that pod is its *identity*. We consider several options. @@ -387,45 +412,46 @@ The job controller substitutes only the *completion index* of the pod into the pod template when creating it. The JSON it POSTs differs only in a single fields. -We would put the completion index as a stringified integer, into an -annotation of the pod. The user can extract it from the annotation -into an env var via the downward API, or put it in a file via a Downward -API volume, and parse it himself. +We would put the completion index as a stringified integer, into an annotation +of the pod. The user can extract it from the annotation into an env var via the +downward API, or put it in a file via a Downward API volume, and parse it +himself. - -Once it is an environment variable in the pod (say `$INDEX`), -then one of two things can happen. 
+Once it is an environment variable in the pod (say `$INDEX`), then one of two +things can happen. First, the main program can know how to map from an integer index to what it -needs to do. -For example, from Work List 4 above: +needs to do. For example, from Work List 4 above: ``` ./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX ``` -Second, a shell script can be prepended to the original command line which maps the -index to one or more string parameters. For example, to implement Work List 5 above, -you could do: +Second, a shell script can be prepended to the original command line which maps +the index to one or more string parameters. For example, to implement Work List +5 above, you could do: ``` /vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME ``` -In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX` and exports `$START_FRAME` and `$END_FRAME`. +In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX` +and exports `$START_FRAME` and `$END_FRAME`. -The shell could be part of the image, but more usefully, it could be generated by a program and stuffed in an annotation -or a configMap, and from there added to a volume. +The shell could be part of the image, but more usefully, it could be generated +by a program and stuffed in an annotation or a configMap, and from there added +to a volume. -The first approach may require the user -to modify an existing image (see next section) to be able to accept an `$INDEX` env var or argument. -The second approach requires that the image have a shell. We think that together these two options -cover a wide range of use cases (though not all). +The first approach may require the user to modify an existing image (see next +section) to be able to accept an `$INDEX` env var or argument. The second +approach requires that the image have a shell. 
We think that together these two
+options cover a wide range of use cases (though not all).

#### Multiple Substitution

-In this option, the JobSpec is extended to include a list of values to substitute,
-and which fields to substitute them into. For example, a worklist like this:
+In this option, the JobSpec is extended to include a list of values to
+substitute, and which fields to substitute them into. For example, a worklist
+like this:

```
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
@@ -433,7 +459,7 @@ FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
```

-Can be broken down into a template like this, with three parameters
+Can be broken down into a template like this, with three parameters:

```
; process-fruit -a -b -c
@@ -447,9 +473,8 @@ and a list of parameter tuples, like this:
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
```

-The JobSpec can be extended to hold a list of parameter tuples (which
-are more easily expressed as a list of lists of individual parameters).
-For example:
+The JobSpec can be extended to hold a list of parameter tuples (which are more
+easily expressed as a list of lists of individual parameters). For example:

```
apiVersion: extensions/v1beta1
@@ -477,42 +502,46 @@ spec:
- "red"
```

-However, just providing custom env vars, and not arguments, is sufficient
-for many use cases: parameter can be put into env vars, and then
-substituted on the command line.
+However, just providing custom env vars, and not arguments, is sufficient for
+many use cases: parameters can be put into env vars, and then substituted on the
+command line.

#### Comparison

The multiple substitution approach:

- keeps the *per completion parameters* in the JobSpec.
-- Drawback: makes the job spec large for job with thousands of completions.
(But for very large jobs, the work-queue style or another type of controller, such as map-reduce or spark, may be a better fit.)
-- Drawback: is a form of server-side templating, which we want in Kubernetes but have not fully designed
- (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
-
+- Drawback: makes the job spec large for jobs with thousands of completions. (But
+for very large jobs, the work-queue style or another type of controller, such as
+map-reduce or spark, may be a better fit.)
+- Drawback: is a form of server-side templating, which we want in Kubernetes but
+have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).

The index-only approach:

-- requires that the user keep the *per completion parameters* in a separate storage, such as a configData or networked storage.
-- makes no changes to the JobSpec.
-- Drawback: while in separate storage, they could be mutatated, which would have unexpected effects
+- Requires that the user keep the *per completion parameters* in separate
+storage, such as a configData or networked storage.
+- Makes no changes to the JobSpec.
+- Drawback: while in separate storage, they could be mutated, which would have
+unexpected effects.
- Drawback: Logic for using index to lookup parameters needs to be in the Pod.
-- Drawback: CLIs and UIs are limited to using the "index" as the identity of a pod
- from a job. They cannot easily say, for example `repeated failures on the pod processing banana.txt`.
-
+- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
+pod from a job. They cannot easily say, for example `repeated failures on the
+pod processing banana.txt`.

Index-only approach relies on at least one of the following being true:

-1.
image containing a shell and certain shell commands (not all images have this)
-1. use directly consumes the index from annoations (file or env var) and expands to specific behavior in the main program.
+1. Image containing a shell and certain shell commands (not all images have
+this).
+1. User directly consumes the index from annotations (file or env var) and
+expands to specific behavior in the main program.

-Also Using the index-only approach from
-non-kubectl clients requires that they mimic the script-generation step,
-or only use the second style.
+Also, using the index-only approach from non-kubectl clients requires that they
+mimic the script-generation step, or only use the second style.

#### Decision

-It is decided to implement the Index-only approach now. Once the server-side
+It is decided to implement the Index-only approach now. Once the server-side
templating design is complete for Kubernetes, and we have feedback from users,
we can consider if Multiple Substitution.

@@ -523,43 +552,42 @@ we can consider if Multiple Substitution.

No changes are made to the JobSpec.

-The JobStatus is also not changed.
-The user can gauge the progress of the job by the `.status.succeeded` count.
+The JobStatus is also not changed. The user can gauge the progress of the job by
+the `.status.succeeded` count.

#### Job Spec Compatilibity

-A job spec written before this change will work exactly the same
-as before with the new controller.
-The Pods it creates will have the same environment as before.
-They will have a new annotation, but pod are expected to tolerate
+A job spec written before this change will work exactly the same as before with
+the new controller. The Pods it creates will have the same environment as
+before. They will have a new annotation, but pods are expected to tolerate
unfamiliar annotations.

-However, if the job controller version is reverted, to a version before this change,
-the jobs whose pod specs depend on the the new annotation will fail.
This is
-okay for a Beta resource.
+However, if the job controller version is reverted, to a version before this
+change, the jobs whose pod specs depend on the new annotation will fail.
+This is okay for a Beta resource.

#### Job Controller Changes

The Job controller will maintain for each Job a data structed which
-indicates the status of each completion index. We call this the
-*scoreboard* for short. It is an array of length `.spec.completions`.
+indicates the status of each completion index. We call this the
+*scoreboard* for short. It is an array of length `.spec.completions`.
Elements of the array are `enum` type with possible values including
`complete`, `running`, and `notStarted`.

-The scoreboard is stored in Job Controller
-memory for efficiency. In either case, the Status can be reconstructed from
-watching pods of the job (such as on a controller manager restart).
-The index of the pods can be extracted from the pod annotation.
+The scoreboard is stored in Job Controller memory for efficiency. In either
+case, the Status can be reconstructed from watching pods of the job (such as on
+a controller manager restart). The index of the pods can be extracted from the
+pod annotation.

-When Job controller sees that the number of running pods is less than the desired
-parallelism of the job, it finds the first index in the scoreboard with value
-`notRunning`. It creates a pod with this creation index.
+When the Job controller sees that the number of running pods is less than the
+desired parallelism of the job, it finds the first index in the scoreboard with
+value `notStarted`. It creates a pod with this creation index.

-When it creates a pod with creation index `i`, it makes a copy
-of the `.spec.template`, and sets
-`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]`
-to `i`. It does this in both the index-only and multiple-substitutions options.
+When it creates a pod with creation index `i`, it makes a copy of the +`.spec.template`, and sets +`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to +`i`. It does this in both the index-only and multiple-substitutions options. Then it creates the pod. @@ -571,8 +599,8 @@ When all entries in the scoreboard are `complete`, then the job is complete. #### Downward API Changes -The downward API is changed to support extracting specific key names -into a single environment variable. So, the following would be supported: +The downward API is changed to support extracting specific key names into a +single environment variable. So, the following would be supported: ``` kind: Pod @@ -589,15 +617,16 @@ spec: This requires kubelet changes. -Users who fail to upgrade their kubelets at the same time as they upgrade their controller -manager will see a failure for pods to run when they are created by the controller. -The Kubelet will send an event about failure to create the pod. +Users who fail to upgrade their kubelets at the same time as they upgrade their +controller manager will see a failure for pods to run when they are created by +the controller. The Kubelet will send an event about failure to create the pod. The `kubectl describe job` will show many failed pods. #### Kubectl Interface Changes -The `--completions` and `--completion-index-var-name` flags are added to kubectl. +The `--completions` and `--completion-index-var-name` flags are added to +kubectl. For example, this command: @@ -621,8 +650,8 @@ Kubectl would create the following pod: -Kubectl will also support the `--per-completion-env` flag, as described previously. -For example, this command: +Kubectl will also support the `--per-completion-env` flag, as described +previously. 
For example, this command:

```
kubectl run say-fruit --image=busybox \
@@ -655,7 +684,7 @@ kubectl run say-fruit --image=busybox \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

-will all run 3 pods in parallel. Index 0 pod will log:
+will all run 3 pods in parallel. Index 0 pod will log:

```
Have a nice grenn apple
@@ -666,16 +695,20 @@ and so on.

Notes:

-- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a quoted
- space separated list or `@` and the name of a text file containing a list.
-- `--per-completion-env=` can be specified several times, but all must have the same
- length list
+- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
+quoted space separated list or `@` and the name of a text file containing a
+list.
+- `--per-completion-env=` can be specified several times, but all must have
+lists of the same length.
- `--completions=N` with `N` equal to list length is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
-- The flag `--completion-index-var-name=I` causes an env var to be created named I in each pod, with the index in it.
-- The flag `--restart=OnFailure` is implied by `--completions` or any job-specific arguments. The user can also specify
- `--restart=Never` if they desire but may not specify `--restart=Always` with job-related flags.
-- Setting any of these flags in turn tells kubectl to create a Job, not a replicationController.
+- The flag `--completion-index-var-name=I` causes an env var to be created named
+I in each pod, with the index in it.
+- The flag `--restart=OnFailure` is implied by `--completions` or any
+job-specific arguments. The user can also specify `--restart=Never` if they
+desire but may not specify `--restart=Always` with job-related flags.
+- Setting any of these flags in turn tells kubectl to create a Job, not a
+replicationController.

#### How Kubectl Creates Job Specs.
@@ -850,14 +883,17 @@ configData/secret, and prevent the case where someone changes the configData mid-job, and breaks things in a hard-to-debug way. - ## Interactions with other features #### Supporting Work Queue Jobs too -For Work Queue Jobs, completions has no meaning. Parallelism should be allowed to be greater than it, and pods have no identity. So, the job controller should not create a scoreboard in the JobStatus, just a count. Therefore, we need to add one of the following to JobSpec: +For Work Queue Jobs, completions has no meaning. Parallelism should be allowed +to be greater than it, and pods have no identity. So, the job controller should +not create a scoreboard in the JobStatus, just a count. Therefore, we need to +add one of the following to JobSpec: -- allow unset `.spec.completions` to indicate no scoreboard, and no index for tasks (identical tasks) +- allow unset `.spec.completions` to indicate no scoreboard, and no index for +tasks (identical tasks). - allow `.spec.completions=-1` to indicate the same. - add `.spec.indexed` to job to indicate need for scoreboard. @@ -866,33 +902,31 @@ For Work Queue Jobs, completions has no meaning. Parallelism should be allowed Since pods of the same job will not be created with different resources, a vertical autoscaler will need to: -- if it has index-specific initial resource suggestions, suggest those at admission -time; it will need to understand indexes. -- mutate resource requests on already created pods based on usage trend or previous container failures +- if it has index-specific initial resource suggestions, suggest those at +admission time; it will need to understand indexes. +- mutate resource requests on already created pods based on usage trend or +previous container failures. - modify the job template, affecting all indexes. #### Comparison to PetSets - The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b. 
-The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more restrictive and thus less verbose.
+The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more
+restrictive and thus less verbose.

-It would be easier for users if Indexed Job and PetSet are similar where possible.
-However, PetSet differs in several key respects:
+It would be easier for users if Indexed Job and PetSet are similar where
+possible. However, PetSet differs in several key respects:

- PetSet is for ones to tens of instances. Indexed job should work with tens of
- thousands of instances.
+thousands of instances.
-- When you have few instances, you may want to given them pet names. When you have many
- instances, you that many instances, integer indexes make more sense.
+- When you have few instances, you may want to give them pet names. When you
+have many instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
- is verbose. For PetSet, this is less of a problem.
+is verbose. For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.

-This differs from PetSet in that PetSet uses names and not indexes.
-PetSet is intended to support ones to tens of things.
-
-
-
-
+This differs from PetSet in that PetSet uses names and not indexes. PetSet is
+intended to support ones to tens of things.

diff --git a/metadata-policy.md b/metadata-policy.md
index 090241d4..da7d5425 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -38,21 +38,24 @@ Documentation for other releases can be found at

This document describes a new API resource, `MetadataPolicy`, that configures
an admission controller to take one or more actions based on an object's metadata.
-Initially the metadata fields that the predicates can examine are labels and annotations, -and the actions are to add one or more labels and/or annotations, or to reject creation/update -of the object. In the future other actions might be supported, such as applying an initializer. - -The first use of `MetadataPolicy` will be to decide which scheduler should schedule a pod -in a [multi-scheduler](../proposals/multiple-schedulers.md) Kubernetes system. In particular, the -policy will add the scheduler name annotation to a pod based on an annotation that -is already on the pod that indicates the QoS of the pod. -(That annotation was presumably set by a simpler admission controller that -uses code, rather than configuration, to map the resource requests and limits of a pod -to QoS, and attaches the corresponding annotation.) - -We anticipate a number of other uses for `MetadataPolicy`, such as defaulting for -labels and annotations, prohibiting/requiring particular labels or annotations, or -choosing a scheduling policy within a scheduler. We do not discuss them in this doc. +Initially the metadata fields that the predicates can examine are labels and +annotations, and the actions are to add one or more labels and/or annotations, +or to reject creation/update of the object. In the future other actions might be +supported, such as applying an initializer. + +The first use of `MetadataPolicy` will be to decide which scheduler should +schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md) +Kubernetes system. In particular, the policy will add the scheduler name +annotation to a pod based on an annotation that is already on the pod that +indicates the QoS of the pod. (That annotation was presumably set by a simpler +admission controller that uses code, rather than configuration, to map the +resource requests and limits of a pod to QoS, and attaches the corresponding +annotation.) 
+ +We anticipate a number of other uses for `MetadataPolicy`, such as defaulting +for labels and annotations, prohibiting/requiring particular labels or +annotations, or choosing a scheduling policy within a scheduler. We do not +discuss them in this doc. ## API @@ -126,7 +129,8 @@ type MetadataPolicyList struct { ## Implementation plan 1. Create `MetadataPolicy` API resource -1. Create admission controller that implements policies defined in `MetadataPolicy` +1. Create admission controller that implements policies defined in +`MetadataPolicy` 1. Create admission controller that sets annotation `scheduler.alpha.kubernetes.io/qos: ` (where `QOS` is one of `Guaranteed, Burstable, BestEffort`) @@ -134,30 +138,32 @@ based on pod's resource request and limit. ## Future work -Longer-term we will have QoS be set on create and update by the registry, similar to `Pending` phase today, -instead of having an admission controller (that runs before the one that takes `MetadataPolicy` as input) -do it. +Longer-term we will have QoS be set on create and update by the registry, +similar to `Pending` phase today, instead of having an admission controller +(that runs before the one that takes `MetadataPolicy` as input) do it. -We plan to eventually move from having an admission controller -set the scheduler name as a pod annotation, to using the initializer concept. In particular, the -scheduler will be an initializer, and the admission controller that decides which scheduler to use -will add the scheduler's name to the list of initializers for the pod (presumably the scheduler -will be the last initializer to run on each pod). -The admission controller would still be configured using the `MetadataPolicy` described here, only the -mechanism the admission controller uses to record its decision of which scheduler to use would change. +We plan to eventually move from having an admission controller set the scheduler +name as a pod annotation, to using the initializer concept. 
In particular, the
+scheduler will be an initializer, and the admission controller that decides
+which scheduler to use will add the scheduler's name to the list of initializers
+for the pod (presumably the scheduler will be the last initializer to run on
+each pod). The admission controller would still be configured using the
+`MetadataPolicy` described here, only the mechanism the admission controller
+uses to record its decision of which scheduler to use would change.

 ## Related issues

-The main issue for multiple schedulers is #11793. There was also a lot of discussion
-in PRs #17197 and #17865.
+The main issue for multiple schedulers is #11793. There was also a lot of
+discussion in PRs #17197 and #17865.

-We could use the approach described here to choose a scheduling
-policy within a single scheduler, as opposed to choosing a scheduler, a desire mentioned in #9920.
-Issue #17097 describes a scenario unrelated to scheduler-choosing where `MetadataPolicy` could be used.
-Issue #17324 proposes to create a generalized API for matching
-"claims" to "service classes"; matching a pod to a scheduler would be one use for such an API.
+We could use the approach described here to choose a scheduling policy within a
+single scheduler, as opposed to choosing a scheduler, a desire mentioned in
+issue #9920. Issue #17097 describes a scenario unrelated to scheduler-choosing
+where `MetadataPolicy` could be used. Issue #17324 proposes to create a
+generalized API for matching "claims" to "service classes"; matching a pod to a
+scheduler would be one use for such an API.
diff --git a/namespaces.md b/namespaces.md
index e2a532b2..d63015bc 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -41,9 +41,11 @@ a logically named group.

 ## Motivation

-A single cluster should be able to satisfy the needs of multiple user communities.
+A single cluster should be able to satisfy the needs of multiple user
+communities.
-Each user community wants to be able to work in isolation from other communities. +Each user community wants to be able to work in isolation from other +communities. Each user community has its own: @@ -61,13 +63,16 @@ The Namespace provides a unique scope for: ## Use cases -1. As a cluster operator, I want to support multiple user communities on a single cluster. -2. As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users - in those communities. -3. As a cluster operator, I want to limit the amount of resources each community can consume in order - to limit the impact to other communities using the cluster. -4. As a cluster user, I want to interact with resources that are pertinent to my user community in - isolation of what other user communities are doing on the cluster. +1. As a cluster operator, I want to support multiple user communities on a +single cluster. +2. As a cluster operator, I want to delegate authority to partitions of the +cluster to trusted users in those communities. +3. As a cluster operator, I want to limit the amount of resources each +community can consume in order to limit the impact to other communities using +the cluster. +4. As a cluster user, I want to interact with resources that are pertinent to +my user community in isolation of what other user communities are doing on the +cluster. ## Design @@ -91,20 +96,26 @@ A *Namespace* must exist prior to associating content with it. A *Namespace* must not be deleted if there is content associated with it. -To associate a resource with a *Namespace* the following conditions must be satisfied: +To associate a resource with a *Namespace* the following conditions must be +satisfied: -1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with the server -2. The resource's *TypeMeta.Namespace* field must have a value that references an existing *Namespace* +1. 
The resource's *Kind* must be registered as having *RESTScopeNamespace* with +the server +2. The resource's *TypeMeta.Namespace* field must have a value that references +an existing *Namespace* -The *Name* of a resource associated with a *Namespace* is unique to that *Kind* in that *Namespace*. +The *Name* of a resource associated with a *Namespace* is unique to that *Kind* +in that *Namespace*. -It is intended to be used in resource URLs; provided by clients at creation time, and encouraged to be -human friendly; intended to facilitate idempotent creation, space-uniqueness of singleton objects, -distinguish distinct entities, and reference particular entities across operations. +It is intended to be used in resource URLs; provided by clients at creation +time, and encouraged to be human friendly; intended to facilitate idempotent +creation, space-uniqueness of singleton objects, distinguish distinct entities, +and reference particular entities across operations. ### Authorization -A *Namespace* provides an authorization scope for accessing content associated with the *Namespace*. +A *Namespace* provides an authorization scope for accessing content associated +with the *Namespace*. See [Authorization plugins](../admin/authorization.md) @@ -112,19 +123,21 @@ See [Authorization plugins](../admin/authorization.md) A *Namespace* provides a scope to limit resource consumption. -A *LimitRange* defines min/max constraints on the amount of resources a single entity can consume in -a *Namespace*. +A *LimitRange* defines min/max constraints on the amount of resources a single +entity can consume in a *Namespace*. See [Admission control: Limit Range](admission_control_limit_range.md) -A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and allows cluster operators -to define *Hard* resource usage limits that a *Namespace* may consume. 
+A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and
+allows cluster operators to define *Hard* resource usage limits that a
+*Namespace* may consume.

 See [Admission control: Resource Quota](admission_control_resource_quota.md)

 ### Finalizers

-Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* objects.
+Upon creation of a *Namespace*, the creator may provide a list of *Finalizer*
+objects.

 ```go
 type FinalizerName string
@@ -143,13 +156,14 @@ type NamespaceSpec struct {
 }
 ```

 A *FinalizerName* is a qualified name.

-The API Server enforces that a *Namespace* can only be deleted from storage if and only if
-it's *Namespace.Spec.Finalizers* is empty.
+The API Server enforces that a *Namespace* can only be deleted from storage if
+and only if its *Namespace.Spec.Finalizers* is empty.

-A *finalize* operation is the only mechanism to modify the *Namespace.Spec.Finalizers* field post creation.
+A *finalize* operation is the only mechanism to modify the
+*Namespace.Spec.Finalizers* field post creation.

-Each *Namespace* created has *kubernetes* as an item in its list of initial *Namespace.Spec.Finalizers*
-set by default.
+Each *Namespace* created has *kubernetes* as an item in its list of initial
+*Namespace.Spec.Finalizers* set by default.

 ### Phases

@@ -168,39 +182,48 @@ type NamespaceStatus struct {
 }
 ```

-A *Namespace* is in the **Active** phase if it does not have a *ObjectMeta.DeletionTimestamp*.
+A *Namespace* is in the **Active** phase if it does not have an
+*ObjectMeta.DeletionTimestamp*.

-A *Namespace* is in the **Terminating** phase if it has a *ObjectMeta.DeletionTimestamp*.
+A *Namespace* is in the **Terminating** phase if it has an
+*ObjectMeta.DeletionTimestamp*.

 **Active**

-Upon creation, a *Namespace* goes in the *Active* phase. This means that content may be associated with
-a namespace, and all normal interactions with the namespace are allowed to occur in the cluster.
+Upon creation, a *Namespace* goes in the *Active* phase. This means that content
+may be associated with a namespace, and all normal interactions with the
+namespace are allowed to occur in the cluster.

-If a DELETE request occurs for a *Namespace*, the *Namespace.ObjectMeta.DeletionTimestamp* is set
-to the current server time. A *namespace controller* observes the change, and sets the *Namespace.Status.Phase*
-to *Terminating*.
+If a DELETE request occurs for a *Namespace*, the
+*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
+*namespace controller* observes the change, and sets the
+*Namespace.Status.Phase* to *Terminating*.

 **Terminating**

-A *namespace controller* watches for *Namespace* objects that have a *Namespace.ObjectMeta.DeletionTimestamp*
-value set in order to know when to initiate graceful termination of the *Namespace* associated content that
-are known to the cluster.
+A *namespace controller* watches for *Namespace* objects that have a
+*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
+initiate graceful termination of the *Namespace*-associated content that is
+known to the cluster.

-The *namespace controller* enumerates each known resource type in that namespace and deletes it one by one.
+The *namespace controller* enumerates each known resource type in that namespace
+and deletes it one by one.

-Admission control blocks creation of new resources in that namespace in order to prevent a race-condition
-where the controller could believe all of a given resource type had been deleted from the namespace,
-when in fact some other rogue client agent had created new objects. Using admission control in this
-scenario allows each of registry implementations for the individual objects to not need to take into account Namespace life-cycle.
+Admission control blocks creation of new resources in that namespace in order to
+prevent a race-condition where the controller could believe all of a given
+resource type had been deleted from the namespace, when in fact some other rogue
+client agent had created new objects. Using admission control in this scenario
+allows each of the registry implementations for the individual objects to not
+need to take into account Namespace life-cycle.

-Once all objects known to the *namespace controller* have been deleted, the *namespace controller*
-executes a *finalize* operation on the namespace that removes the *kubernetes* value from
-the *Namespace.Spec.Finalizers* list.
+Once all objects known to the *namespace controller* have been deleted, the
+*namespace controller* executes a *finalize* operation on the namespace that
+removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.

-If the *namespace controller* sees a *Namespace* whose *ObjectMeta.DeletionTimestamp* is set, and
-whose *Namespace.Spec.Finalizers* list is empty, it will signal the server to permanently remove
-the *Namespace* from storage by sending a final DELETE action to the API server.
+If the *namespace controller* sees a *Namespace* whose
+*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers*
+list is empty, it will signal the server to permanently remove the *Namespace*
+from storage by sending a final DELETE action to the API server.

 ### REST API

@@ -232,15 +255,18 @@ To interact with content associated with a Namespace:

 | WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
 | LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |

-The API server verifies the *Namespace* on resource creation matches the *{namespace}* on the path.
+The API server verifies the *Namespace* on resource creation matches the
+*{namespace}* on the path.
-The API server will associate a resource with a *Namespace* if not populated by the end-user based on the *Namespace* context
-of the incoming request. If the *Namespace* of the resource being created, or updated does not match the *Namespace* on the request,
-then the API server will reject the request.
+The API server will associate a resource with a *Namespace* if not populated by
+the end-user, based on the *Namespace* context of the incoming request. If the
+*Namespace* of the resource being created or updated does not match the
+*Namespace* on the request, then the API server will reject the request.

 ### Storage

-A namespace provides a unique identifier space and therefore must be in the storage path of a resource.
+A namespace provides a unique identifier space and therefore must be in the
+storage path of a resource.

 In etcd, we want to continue to still support efficient WATCH across namespaces.

@@ -248,18 +274,19 @@ Resources that persist content in etcd will have storage paths as follows:

 /{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}

-This enables consumers to WATCH /registry/{resourceType} for changes across namespace of a particular {resourceType}.
+This enables consumers to WATCH /registry/{resourceType} for changes across
+namespaces for a particular {resourceType}.

 ### Kubelet

-The kubelet will register pod's it sources from a file or http source with a namespace associated with the
-*cluster-id*
+The kubelet will register pods it sources from a file or http source with a
+namespace associated with the *cluster-id*.

 ### Example: OpenShift Origin managing a Kubernetes Namespace

 In this example, we demonstrate how the design allows for agents built on-top of
-Kubernetes that manage their own set of resource types associated with a *Namespace*
-to take part in Namespace termination.
+Kubernetes that manage their own set of resource types associated with a
+*Namespace* to take part in Namespace termination.
OpenShift creates a Namespace in Kubernetes @@ -282,9 +309,10 @@ OpenShift creates a Namespace in Kubernetes } ``` -OpenShift then goes and creates a set of resources (pods, services, etc) associated -with the "development" namespace. It also creates its own set of resources in its -own storage associated with the "development" namespace unknown to Kubernetes. +OpenShift then goes and creates a set of resources (pods, services, etc) +associated with the "development" namespace. It also creates its own set of +resources in its own storage associated with the "development" namespace unknown +to Kubernetes. User deletes the Namespace in Kubernetes, and Namespace now has following state: @@ -308,10 +336,10 @@ User deletes the Namespace in Kubernetes, and Namespace now has following state: } ``` -The Kubernetes *namespace controller* observes the namespace has a *deletionTimestamp* -and begins to terminate all of the content in the namespace that it knows about. Upon -success, it executes a *finalize* action that modifies the *Namespace* by -removing *kubernetes* from the list of finalizers: +The Kubernetes *namespace controller* observes the namespace has a +*deletionTimestamp* and begins to terminate all of the content in the namespace +that it knows about. Upon success, it executes a *finalize* action that modifies +the *Namespace* by removing *kubernetes* from the list of finalizers: ```json { @@ -333,11 +361,11 @@ removing *kubernetes* from the list of finalizers: } ``` -OpenShift Origin has its own *namespace controller* that is observing cluster state, and -it observes the same namespace had a *deletionTimestamp* assigned to it. It too will go -and purge resources from its own storage that it manages associated with that namespace. -Upon completion, it executes a *finalize* action and removes the reference to "openshift.com/origin" -from the list of finalizers. 
+OpenShift Origin has its own *namespace controller* that is observing cluster
+state, and it observes the same namespace had a *deletionTimestamp* assigned to
+it. It too will go and purge resources from its own storage that it manages
+associated with that namespace. Upon completion, it executes a *finalize* action
+and removes the reference to "openshift.com/origin" from the list of finalizers.

 This results in the following state:

@@ -361,12 +389,14 @@ This results in the following state:
 }
 ```

-At this point, the Kubernetes *namespace controller* in its sync loop will see that the namespace
-has a deletion timestamp and that its list of finalizers is empty. As a result, it knows all
-content associated from that namespace has been purged. It performs a final DELETE action
-to remove that Namespace from the storage.
+At this point, the Kubernetes *namespace controller* in its sync loop will see
+that the namespace has a deletion timestamp and that its list of finalizers is
+empty. As a result, it knows all content associated with that namespace has been
+purged. It performs a final DELETE action to remove that Namespace from the
+storage.

-At this point, all content associated with that Namespace, and the Namespace itself are gone.
+At this point, all content associated with that Namespace, and the Namespace
+itself are gone.
diff --git a/networking.md b/networking.md
index 711a709a..ca2527e5 100644
--- a/networking.md
+++ b/networking.md
@@ -44,9 +44,9 @@ There are 4 distinct networking problems to solve:

 ## Model and motivation

 Kubernetes deviates from the default Docker networking model (though as of
-Docker 1.8 their network plugins are getting closer).  The goal is for each pod
+Docker 1.8 their network plugins are getting closer). The goal is for each pod
 to have an IP in a flat shared networking namespace that has full communication
-with other physical computers and containers across the network.
IP-per-pod +with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and @@ -71,15 +71,15 @@ among other problems. All containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. This offers simplicity (static ports know a priori), security (ports bound to localhost -are visible within the pod but never outside it), and performance. This also +are visible within the pod but never outside it), and performance. This also reduces friction for applications moving from the world of uncontainerized apps -on physical or virtual hosts. People running application stacks together on +on physical or virtual hosts. People running application stacks together on the same host have already figured out how to make ports not conflict and have arranged for clients to find them. The approach does reduce isolation between containers within a pod — ports could conflict, and there can be no container-private ports, but these -seem to be relatively minor issues with plausible future workarounds. Besides, +seem to be relatively minor issues with plausible future workarounds. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod @@ -88,7 +88,7 @@ whereas, in general, they don't control what pods land together on a host. ## Pod to pod Because every pod gets a "real" (not machine-private) IP address, pods can -communicate without proxies or translations. The pod can use well-known port +communicate without proxies or translations. 
The pod can use well-known port numbers and can avoid the use of higher-level service discovery systems like DNS-SD, Consul, or Etcd. @@ -98,7 +98,7 @@ each pod has its own IP address that other pods can know. By making IP addresses and ports the same both inside and outside the pods, we create a NAT-less, flat address space. Running "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including -self-registration mechanisms and applications that distribute IP addresses. We +self-registration mechanisms and applications that distribute IP addresses. We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC. @@ -141,7 +141,7 @@ gcloud compute routes add "${NODE_NAMES[$i]}" \ --next-hop-instance-zone "${ZONE}" & ``` -GCE itself does not know anything about these IPs, though. This means that when +GCE itself does not know anything about these IPs, though. This means that when a pod tries to egress beyond GCE's project the packets must be SNAT'ed (masqueraded) to the VM's IP, which GCE recognizes and allows. @@ -161,26 +161,26 @@ to serve the purpose outside of GCE. ## Pod to service The [service](../user-guide/services.md) abstraction provides a way to group pods under a -common access policy (e.g. load-balanced). The implementation of this creates a +common access policy (e.g. load-balanced). The implementation of this creates a virtual IP which clients can access and which is transparently proxied to the -pods in a Service. Each node runs a kube-proxy process which programs +pods in a Service. Each node runs a kube-proxy process which programs `iptables` rules to trap access to service IPs and redirect them to the correct -backends. This provides a highly-available load-balancing solution with low +backends. 
This provides a highly-available load-balancing solution with low performance overhead by balancing client traffic from a node on that same node. ## External to internal So far the discussion has been about how to access a pod or service from within -the cluster. Accessing a pod from outside the cluster is a bit more tricky. We +the cluster. Accessing a pod from outside the cluster is a bit more tricky. We want to offer highly-available, high-performance load balancing to target -Kubernetes Services. Most public cloud providers are simply not flexible enough +Kubernetes Services. Most public cloud providers are simply not flexible enough yet. The way this is generally implemented is to set up external load balancers (e.g. -GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When +GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When traffic arrives at a node it is recognized as being part of a particular Service -and routed to an appropriate backend Pod. This does mean that some traffic will -get double-bounced on the network. Once cloud providers have better offerings +and routed to an appropriate backend Pod. This does mean that some traffic will +get double-bounced on the network. Once cloud providers have better offerings we can take advantage of those. ## Challenges and future work @@ -207,7 +207,13 @@ External IP assignment would also simplify DNS support (see below). ### IPv6 -IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. 
:-) +IPv6 would be a nice option, also, but we can't depend on it yet. Docker support +is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), +[Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), +[Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). +Additionally, direct ipv6 assignment to instances doesn't appear to be supported +by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull +requests from people running Kubernetes on bare metal, though. :-) diff --git a/nodeaffinity.md b/nodeaffinity.md index dda04a51..8c999fec 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -36,40 +36,41 @@ Documentation for other releases can be found at ## Introduction -This document proposes a new label selector representation, called `NodeSelector`, -that is similar in many ways to `LabelSelector`, but is a bit more flexible and is -intended to be used only for selecting nodes. - -In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler -currently uses as part of restricting the set of nodes onto which a pod is -eligible to schedule, with a field of type `Affinity` that contains one or -more affinity specifications. In this document we discuss `NodeAffinity`, which -contains one or more of the following +This document proposes a new label selector representation, called +`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit +more flexible and is intended to be used only for selecting nodes. + +In addition, we propose to replace the `map[string]string` in `PodSpec` that the +scheduler currently uses as part of restricting the set of nodes onto which a +pod is eligible to schedule, with a field of type `Affinity` that contains one +or more affinity specifications. 
In this document we discuss `NodeAffinity`, +which contains one or more of the following: * a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be represented by a `NodeSelector`, and thus generalizes the scheduling behavior of the current `map[string]string` but still serves the purpose of restricting -the set of nodes onto which the pod can schedule. In addition, unlike the behavior -of the current `map[string]string`, when it becomes violated the system will -try to eventually evict the pod from its node. -* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is identical -to `RequiredDuringSchedulingRequiredDuringExecution` except that the system -may or may not try to eventually evict the pod from its node. -* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that specifies which nodes are -preferred for scheduling among those that meet all scheduling requirements. +the set of nodes onto which the pod can schedule. In addition, unlike the +behavior of the current `map[string]string`, when it becomes violated the system +will try to eventually evict the pod from its node. +* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is +identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the +system may or may not try to eventually evict the pod from its node. +* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that +specifies which nodes are preferred for scheduling among those that meet all +scheduling requirements. (In practice, as discussed later, we will actually *add* the `Affinity` field -rather than replacing `map[string]string`, due to backward compatibility requirements.) +rather than replacing `map[string]string`, due to backward compatibility +requirements.) 
-The affiniy specifications described above allow a pod to request various properties
-that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a
-multi-zone cluster, "run this pod on a node in zone Z."
+The affinity specifications described above allow a pod to request various
+properties that are inherent to nodes, for example "run this pod on a node with
+an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
 ([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
-some of the properties that a node might publish as labels, which affinity expressions
-can match against.)
-They do *not* allow a pod to request to schedule
-(or not schedule) on a node based on what other pods are running on the node. That
-feature is called "inter-pod topological affinity/anti-afinity" and is described
-[here](https://github.com/kubernetes/kubernetes/pull/18265).
+some of the properties that a node might publish as labels, which affinity
+expressions can match against.) They do *not* allow a pod to request to schedule
+(or not schedule) on a node based on what other pods are running on the node.
+That feature is called "inter-pod topological affinity/anti-affinity" and is
+described [here](https://github.com/kubernetes/kubernetes/pull/18265).

 ## API

@@ -171,9 +172,9 @@ type PreferredSchedulingTerm struct {
 }
 ```

-Unfortunately, the name of the existing `map[string]string` field in PodSpec is `NodeSelector`
-and we can't change it since this name is part of the API. Hopefully this won't
-cause too much confusion.
+Unfortunately, the name of the existing `map[string]string` field in PodSpec is
+`NodeSelector` and we can't change it since this name is part of the API.
+Hopefully this won't cause too much confusion.

 ## Examples

@@ -186,81 +187,91 @@ cause too much confusion.
## Backward compatibility

-When we add `Affinity` to PodSpec, we will deprecate, but not remove, the current field in PodSpec
+When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
+current field in PodSpec

 ```go
 NodeSelector map[string]string `json:"nodeSelector,omitempty"`
 ```

-Old version of the scheduler will ignore the `Affinity` field.
-New versions of the scheduler will apply their scheduling predicates to both `Affinity` and `nodeSelector`,
-i.e. the pod can only schedule onto nodes that satisfy both sets of requirements. We will not
-attempt to convert between `Affinity` and `nodeSelector`.
+Old versions of the scheduler will ignore the `Affinity` field. New versions of
+the scheduler will apply their scheduling predicates to both `Affinity` and
+`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
+of requirements. We will not attempt to convert between `Affinity` and
+`nodeSelector`.

-Old versions of non-scheduling clients will not know how to do anything semantically meaningful
-with `Affinity`, but we don't expect that this will cause a problem.
+Old versions of non-scheduling clients will not know how to do anything
+semantically meaningful with `Affinity`, but we don't expect that this will
+cause a problem.

 See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
 for more discussion.

-Users should not start using `NodeAffinity` until the full implementation has been in Kubelet and the master
-for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet
-or master to a version that does not support them. Longer-term we will use a programatic approach to
-enforcing this (#4855).
+Users should not start using `NodeAffinity` until the full implementation has
+been in Kubelet and the master for enough binary versions that we feel
+comfortable that we will not need to roll back either Kubelet or master to a
+version that does not support them. Longer-term we will use a programmatic
+approach to enforcing this (#4855).

## Implementation plan

-1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, `PreferredDuringSchedulingIgnoredDuringExecution`,
-and `RequiredDuringSchedulingIgnoredDuringExecution` types to the API
-2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` into account
-3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` into account
-4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be marked as deprecated
-5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API
-6. Modify the scheduler predicate from step 2 to also take `RequiredDuringSchedulingRequiredDuringExecution` into account
-7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission decision
-8. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
-`RequiredDuringSchedulingRequiredDuringExecution`
-(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
-
-We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
-domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
+1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
+`PreferredDuringSchedulingIgnoredDuringExecution`, and
+`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
+2. Implement a scheduler predicate that takes
+`RequiredDuringSchedulingIgnoredDuringExecution` into account.
+3.
Implement a scheduler priority function that takes +`PreferredDuringSchedulingIgnoredDuringExecution` into account. +4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be +marked as deprecated. +5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API. +6. Modify the scheduler predicate from step 2 to also take +`RequiredDuringSchedulingRequiredDuringExecution` into account. +7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission +decision. +8. Implement code in Kubelet *or* the controllers that evicts a pod that no +longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)). + +We assume Kubelet publishes labels describing the node's membership in all of +the relevant scheduling domains (e.g. node name, rack name, availability zone +name, etc.). See #9044. ## Extensibility -The design described here is the result of careful analysis of use cases, a decade of experience -with Borg at Google, and a review of similar features in other open-source container orchestration -systems. We believe that it properly balances the goal of expressiveness against the goals of -simplicity and efficiency of implementation. However, we recognize that -use cases may arise in the future that cannot be expressed using the syntax described here. -Although we are not implementing an affinity-specific extensibility mechanism for a variety -of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes -users to get a consistent experience, etc.), the regular Kubernetes -annotation mechanism can be used to add or replace affinity rules. The way this work would is +The design described here is the result of careful analysis of use cases, a +decade of experience with Borg at Google, and a review of similar features in +other open-source container orchestration systems. 
We believe that it properly
+balances the goal of expressiveness against the goals of simplicity and
+efficiency of implementation. However, we recognize that use cases may arise in
+the future that cannot be expressed using the syntax described here. Although we
+are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
+for Kubernetes users to get a consistent experience, etc.), the regular
+Kubernetes annotation mechanism can be used to add or replace affinity rules.
+The way this would work is:

1. Define one or more annotations to describe the new affinity rule(s)
-1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
-If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
-from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
-annotation(s).
+1. User (or an admission controller) attaches the annotation(s) to pods to
+request the desired scheduling behavior. If the new rule(s) *replace* one or
+more fields of `Affinity` then the user would omit those fields from `Affinity`;
+if they are *additional rules*, then the user would fill in `Affinity` as well
+as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.

-If some particular new syntax becomes popular, we would consider upstreaming it by integrating
-it into the standard `Affinity`.
+If some particular new syntax becomes popular, we would consider upstreaming it
+by integrating it into the standard `Affinity`.

## Future work

-Are there any other fields we should convert from `map[string]string` to `NodeSelector`?
+Are there any other fields we should convert from `map[string]string` to
+`NodeSelector`?

## Related issues

The review for this proposal is in #18261.

-The main related issue is #341.
Issue #367 is also related. Those issues reference other -related issues. - - - - +The main related issue is #341. Issue #367 is also related. Those issues +reference other related issues. diff --git a/persistent-storage.md b/persistent-storage.md index 4c3b08e6..00eb2fef 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -34,43 +34,60 @@ Documentation for other releases can be found at # Persistent Storage -This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data. +This document proposes a model for managing persistent, cluster-scoped storage +for applications requiring long lived data. ### Abstract Two new API kinds: -A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) for how to use it. +A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. +It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) +for how to use it. -A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod. +A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to +use in a pod. It is analogous to a pod. One new system component: -`PersistentVolumeClaimBinder` is a singleton running in master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage. +`PersistentVolumeClaimBinder` is a singleton running in master that watches all +PersistentVolumeClaims in the system and binds them to the closest matching +available PersistentVolume. The volume manager watches the API for newly created +volumes to manage. One new volume: -`PersistentVolumeClaimVolumeSource` references the user's PVC in the same namespace. 
This volume finds the bound PV and mounts that volume for the pod. A `PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another type of volume that is owned by someone else (the system). +`PersistentVolumeClaimVolumeSource` references the user's PVC in the same +namespace. This volume finds the bound PV and mounts that volume for the pod. A +`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another +type of volume that is owned by someone else (the system). -Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider. +Kubernetes makes no guarantees at runtime that the underlying storage exists or +is available. High availability is left to the storage provider. ### Goals -* Allow administrators to describe available storage -* Allow pod authors to discover and request persistent volumes to use with pods -* Enforce security through access control lists and securing storage to the same namespace as the pod volume -* Enforce quotas through admission control -* Enforce scheduler rules by resource counting -* Ensure developers can rely on storage being available without being closely bound to a particular disk, server, network, or storage device. - +* Allow administrators to describe available storage. +* Allow pod authors to discover and request persistent volumes to use with pods. +* Enforce security through access control lists and securing storage to the same +namespace as the pod volume. +* Enforce quotas through admission control. +* Enforce scheduler rules by resource counting. +* Ensure developers can rely on storage being available without being closely +bound to a particular disk, server, network, or storage device. #### Describe available storage -Cluster administrators use the API to manage *PersistentVolumes*. A custom store `NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by storage capacity. 
The `PersistentVolumeClaimBinder` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request.
+Cluster administrators use the API to manage *PersistentVolumes*. A custom store
+`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
+storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
+storage and binds them to an available volume by matching the volume's
+characteristics (AccessModes and storage size) to the user's request.

PVs are system objects and, thus, have no namespace.

-Many means of dynamic provisioning will be eventually be implemented for various storage types.
+Many means of dynamic provisioning will eventually be implemented for various
+storage types.

##### PersistentVolume API

@@ -87,11 +104,15 @@ Many means of dynamic provisioning will be eventually be implemented for various

#### Request Storage

-Kubernetes users request persistent storage for their pod by creating a ```PersistentVolumeClaim```. Their request for storage is described by their requirements for resources and mount capabilities.
+Kubernetes users request persistent storage for their pod by creating a
+```PersistentVolumeClaim```. Their request for storage is described by their
+requirements for resources and mount capabilities.

-Requests for volumes are bound to available volumes by the volume manager, if a suitable match is found. Requests for resources can go unfulfilled.
+Requests for volumes are bound to available volumes by the volume manager, if a
+suitable match is found. Requests for resources can go unfulfilled.

-Users attach their claim to their pod using a new ```PersistentVolumeClaimVolumeSource``` volume source.
+Users attach their claim to their pod using a new
+```PersistentVolumeClaimVolumeSource``` volume source.
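The binder's matching behavior described above can be sketched as follows. `PV`, `PVC`, and `match` are simplified hypothetical stand-ins (the real logic lives in `NewPersistentVolumeOrderedIndex` and the `PersistentVolumeClaimBinder`), and the sketch assumes "closest match" means the smallest unbound volume that satisfies the claim:

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the API objects; not the real definitions.
type PV struct {
	Name        string
	AccessModes []string
	Capacity    int64 // bytes
	Bound       bool
}

type PVC struct {
	AccessModes []string
	Request     int64 // minimum bytes requested
}

// hasModes reports whether the volume offers every requested access mode.
func hasModes(pv PV, want []string) bool {
	set := map[string]bool{}
	for _, m := range pv.AccessModes {
		set[m] = true
	}
	for _, m := range want {
		if !set[m] {
			return false
		}
	}
	return true
}

// match returns the smallest unbound volume satisfying the claim's access
// modes and capacity request, mirroring the ordered-index idea above.
func match(pvs []PV, claim PVC) (PV, bool) {
	sort.Slice(pvs, func(i, j int) bool { return pvs[i].Capacity < pvs[j].Capacity })
	for _, pv := range pvs {
		if !pv.Bound && pv.Capacity >= claim.Request && hasModes(pv, claim.AccessModes) {
			return pv, true
		}
	}
	return PV{}, false // request goes unfulfilled
}

func main() {
	pvs := []PV{
		{Name: "pv-small", AccessModes: []string{"RWO"}, Capacity: 5 << 30},
		{Name: "pv-big", AccessModes: []string{"RWO"}, Capacity: 20 << 30},
	}
	pv, ok := match(pvs, PVC{AccessModes: []string{"RWO"}, Request: 8 << 30})
	fmt.Println(ok, pv.Name) // true pv-big: the smallest volume that fits
}
```

A claim that no volume can satisfy simply stays pending, matching the "requests for resources can go unfulfilled" behavior above.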
##### PersistentVolumeClaim API

@@ -110,23 +131,31 @@ Users attach their claim to their pod using a new ```PersistentVolumeClaimVolume

#### Scheduling constraints

-Scheduling constraints are to be handled similar to pod resource constraints. Pods will need to be annotated or decorated with the number of resources it requires on a node. Similarly, a node will need to list how many it has used or available.
+Scheduling constraints are to be handled similarly to pod resource constraints.
+Pods will need to be annotated or decorated with the number of resources they
+require on a node. Similarly, a node will need to list how many it has used or
+has available.

TBD

#### Events

-The implementation of persistent storage will not require events to communicate to the user the state of their claim. The CLI for bound claims contains a reference to the backing persistent volume. This is always present in the API and CLI, making an event to communicate the same unnecessary.
-
-Events that communicate the state of a mounted volume are left to the volume plugins.
+The implementation of persistent storage will not require events to communicate
+to the user the state of their claim. The CLI for bound claims contains a
+reference to the backing persistent volume. This is always present in the API
+and CLI, making an event to communicate the same unnecessary.

+Events that communicate the state of a mounted volume are left to the volume
+plugins.

### Example

#### Admin provisions storage

-An administrator provisions storage by posting PVs to the API. Various way to automate this task can be scripted. Dynamic provisioning is a future feature that can maintain levels of PVs.
+An administrator provisions storage by posting PVs to the API. Various ways to
+automate this task can be scripted. Dynamic provisioning is a future feature
+that can maintain levels of PVs.
```yaml POST: @@ -152,7 +181,8 @@ pv0001 map[] 10737418240 RWO #### Users request storage -A user requests storage by posting a PVC to the API. Their request contains the AccessModes they wish their volume to have and the minimum size needed. +A user requests storage by posting a PVC to the API. Their request contains the +AccessModes they wish their volume to have and the minimum size needed. The user must be within a namespace to create PVCs. @@ -181,7 +211,10 @@ myclaim-1 map[] pending #### Matching and binding - The ```PersistentVolumeClaimBinder``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. Requests can go unfulfilled if a suitable match is not found. +The ```PersistentVolumeClaimBinder``` attempts to find an available volume that +most closely matches the user's request. If one exists, they are bound by +putting a reference on the PV to the PVC. Requests can go unfulfilled if a +suitable match is not found. ```console $ kubectl get pv @@ -198,9 +231,12 @@ myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8 #### Claim usage -The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim and mount its volume for a pod. +The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim +and mount its volume for a pod. -The claim holder owns the claim and its data for as long as the claim exists. The pod using the claim can be deleted, but the claim remains in the user's namespace. It can be used again and again by many pods. +The claim holder owns the claim and its data for as long as the claim exists. +The pod using the claim can be deleted, but the claim remains in the user's +namespace. It can be used again and again by many pods. 
```yaml
POST:

@@ -233,9 +269,11 @@ When a claim holder is finished with their data, they can delete their claim.

$ kubectl delete pvc myclaim-1
```

-The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and change the PVs status to 'Released'.
+The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
+reference from the PV and changing the PV's status to 'Released'.

-Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled.
+Admins can script the recycling of released volumes. Future dynamic provisioners
+will understand how a volume should be recycled.

diff --git a/podaffinity.md b/podaffinity.md
index 1a2da4af..2c57ed90 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -38,45 +38,48 @@ Documentation for other releases can be found at

NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

-This document describes a proposal for specifying and implementing inter-pod topological affinity and
-anti-affinity. By that we mean: rules that specify that certain pods should be placed
-in the same topological domain (e.g. same node, same rack, same zone, same
-power domain, etc.) as some other pods, or, conversely, should *not* be placed in the
-same topological domain as some other pods.
-
-Here are a few example rules; we explain how to express them using the API described
-in this doc later, in the section "Examples."
+This document describes a proposal for specifying and implementing inter-pod
+topological affinity and anti-affinity. By that we mean: rules that specify that
+certain pods should be placed in the same topological domain (e.g. same node,
+same rack, same zone, same power domain, etc.) as some other pods, or,
+conversely, should *not* be placed in the same topological domain as some other
+pods.
+ +Here are a few example rules; we explain how to express them using the API +described in this doc later, in the section "Examples." * Affinity - * Co-locate the pods from a particular service or Job in the same availability zone, - without specifying which zone that should be. - * Co-locate the pods from service S1 with pods from service S2 because S1 uses S2 - and thus it is useful to minimize the network latency between them. Co-location - might mean same nodes and/or same availability zone. + * Co-locate the pods from a particular service or Job in the same availability +zone, without specifying which zone that should be. + * Co-locate the pods from service S1 with pods from service S2 because S1 uses +S2 and thus it is useful to minimize the network latency between them. +Co-location might mean same nodes and/or same availability zone. * Anti-affinity - * Spread the pods of a service across nodes and/or availability zones, - e.g. to reduce correlated failures - * Give a pod "exclusive" access to a node to guarantee resource isolation -- it must never share the node with other pods + * Spread the pods of a service across nodes and/or availability zones, e.g. to +reduce correlated failures. + * Give a pod "exclusive" access to a node to guarantee resource isolation -- +it must never share the node with other pods. * Don't schedule the pods of a particular service on the same nodes as pods of - another service that are known to interfere with the performance of the pods of the first service. - -For both affinity and anti-affinity, there are three variants. Two variants have the -property of requiring the affinity/anti-affinity to be satisfied for the pod to be allowed -to schedule onto a node; the difference between them is that if the condition ceases to -be met later on at runtime, for one of them the system will try to eventually evict the pod, -while for the other the system may not try to do so. 
The third variant -simply provides scheduling-time *hints* that the scheduler will try -to satisfy but may not be able to. These three variants are directly analogous to the three -variants of [node affinity](nodeaffinity.md). - -Note that this proposal is only about *inter-pod* topological affinity and anti-affinity. -There are other forms of topological affinity and anti-affinity. For example, -you can use [node affinity](nodeaffinity.md) to require (prefer) -that a set of pods all be scheduled in some specific zone Z. Node affinity is not -capable of expressing inter-pod dependencies, and conversely the API -we describe in this document is not capable of expressing node affinity rules. -For simplicity, we will use the terms "affinity" and "anti-affinity" to mean -"inter-pod topological affinity" and "inter-pod topological anti-affinity," respectively, -in the remainder of this document. +another service that are known to interfere with the performance of the pods of +the first service. + +For both affinity and anti-affinity, there are three variants. Two variants have +the property of requiring the affinity/anti-affinity to be satisfied for the pod +to be allowed to schedule onto a node; the difference between them is that if +the condition ceases to be met later on at runtime, for one of them the system +will try to eventually evict the pod, while for the other the system may not try +to do so. The third variant simply provides scheduling-time *hints* that the +scheduler will try to satisfy but may not be able to. These three variants are +directly analogous to the three variants of [node affinity](nodeaffinity.md). + +Note that this proposal is only about *inter-pod* topological affinity and +anti-affinity. There are other forms of topological affinity and anti-affinity. +For example, you can use [node affinity](nodeaffinity.md) to require (prefer) +that a set of pods all be scheduled in some specific zone Z. 
Node affinity is +not capable of expressing inter-pod dependencies, and conversely the API we +describe in this document is not capable of expressing node affinity rules. For +simplicity, we will use the terms "affinity" and "anti-affinity" to mean +"inter-pod topological affinity" and "inter-pod topological anti-affinity," +respectively, in the remainder of this document. ## API @@ -90,28 +93,28 @@ The `Affinity` type is defined as follows ```go type Affinity struct { - PodAffinity *PodAffinity `json:"podAffinity,omitempty"` - PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"` + PodAffinity *PodAffinity `json:"podAffinity,omitempty"` + PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"` } type PodAffinity struct { - // If the affinity requirements specified by this field are not met at + // If the affinity requirements specified by this field are not met at // scheduling time, the pod will not be scheduled onto the node. // If the affinity requirements specified by this field cease to be met // at some point during pod execution (e.g. due to a pod label update), the // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` // If the affinity requirements specified by this field are not met at // scheduling time, the pod will not be scheduled onto the node. 
// If the affinity requirements specified by this field cease to be met // at some point during pod execution (e.g. due to a pod label update), the // system may or may not try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy // the affinity expressions specified by this field, but it may choose // a node that violates one or more of the expressions. The node that is // most preferred is the one with the greatest sum of weights, i.e. @@ -120,27 +123,27 @@ type PodAffinity struct { // compute a sum by iterating through the elements of this field and adding // "weight" to the sum if the node matches the corresponding MatchExpressions; the // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` + PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` } type PodAntiAffinity struct { - // If the anti-affinity requirements specified by this field are not met at + // If the anti-affinity requirements specified by this field are not met at // scheduling time, the pod will not be scheduled onto the node. 
// If the anti-affinity requirements specified by this field cease to be met // at some point during pod execution (e.g. due to a pod label update), the // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` // If the anti-affinity requirements specified by this field are not met at // scheduling time, the pod will not be scheduled onto the node. // If the anti-affinity requirements specified by this field cease to be met // at some point during pod execution (e.g. due to a pod label update), the // system may or may not try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy // the anti-affinity expressions specified by this field, but it may choose // a node that violates one or more of the expressions. 
The node that is // most preferred is the one with the greatest sum of weights, i.e. @@ -149,7 +152,7 @@ type PodAntiAffinity struct { // compute a sum by iterating through the elements of this field and adding // "weight" to the sum if the node matches the corresponding MatchExpressions; the // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` + PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` } type WeightedPodAffinityTerm struct { @@ -159,23 +162,25 @@ type WeightedPodAffinityTerm struct { } type PodAffinityTerm struct { - LabelSelector *LabelSelector `json:"labelSelector,omitempty"` - // namespaces specifies which namespaces the LabelSelector applies to (matches against); - // nil list means "this pod's namespace," empty list means "all namespaces" - // The json tag here is not "omitempty" since we need to distinguish nil and empty. - // See https://golang.org/pkg/encoding/json/#Marshal for more details. - Namespaces []api.Namespace `json:"namespaces,omitempty"` - // empty topology key is interpreted by the scheduler as "all topologies" - TopologyKey string `json:"topologyKey,omitempty"` + LabelSelector *LabelSelector `json:"labelSelector,omitempty"` + // namespaces specifies which namespaces the LabelSelector applies to (matches against); + // nil list means "this pod's namespace," empty list means "all namespaces" + // The json tag here is not "omitempty" since we need to distinguish nil and empty. + // See https://golang.org/pkg/encoding/json/#Marshal for more details. 
+ Namespaces []api.Namespace `json:"namespaces,omitempty"` + // empty topology key is interpreted by the scheduler as "all topologies" + TopologyKey string `json:"topologyKey,omitempty"` } ``` -Note that the `Namespaces` field is necessary because normal `LabelSelector` is scoped -to the pod's namespace, but we need to be able to match against all pods globally. +Note that the `Namespaces` field is necessary because normal `LabelSelector` is +scoped to the pod's namespace, but we need to be able to match against all pods +globally. -To explain how this API works, let's say that the `PodSpec` of a pod `P` has an `Affinity` -that is configured as follows (note that we've omitted and collapsed some fields for -simplicity, but this should sufficiently convey the intent of the design): +To explain how this API works, let's say that the `PodSpec` of a pod `P` has an +`Affinity` that is configured as follows (note that we've omitted and collapsed +some fields for simplicity, but this should sufficiently convey the intent of +the design): ```go PodAffinity { @@ -188,130 +193,160 @@ PodAntiAffinity { } ``` -Then when scheduling pod P, the scheduler -* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key `node` and value specifying their node name.) -* Should try to schedule P onto zones that are running pods that satisfy `P2`. (Assumes all nodes have a label with key `zone` and value specifying their zone.) -* Cannot schedule P onto any racks that are running pods that satisfy `P3`. (Assumes all nodes have a label with key `rack` and value specifying their rack name.) -* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key `power` and value specifying their power domain.) - -When `RequiredDuringScheduling` has multiple elements, the requirements are ANDed. 
-For `PreferredDuringScheduling` the weights are added for the terms that are satisfied for each node, and
-the node(s) with the highest weight(s) are the most preferred.
-
-In reality there are two variants of `RequiredDuringScheduling`: one suffixed with
-`RequiredDuringEecution` and one suffixed with `IgnoredDuringExecution`. For the
-first variant, if the affinity/anti-affinity ceases to be met at some point during
-pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod
-from its node. In the second variant, the system may or may not try to eventually
-evict the pod from its node.
+Then when scheduling pod P, the scheduler:
+* Can only schedule P onto nodes that are running pods that satisfy `P1`.
+(Assumes all nodes have a label with key `node` and value specifying their node
+name.)
+* Should try to schedule P onto zones that are running pods that satisfy `P2`.
+(Assumes all nodes have a label with key `zone` and value specifying their
+zone.)
+* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
+(Assumes all nodes have a label with key `rack` and value specifying their rack
+name.)
+* Should try not to schedule P onto any power domains that are running pods that
+satisfy `P4`. (Assumes all nodes have a label with key `power` and value
+specifying their power domain.)
+
+When `RequiredDuringScheduling` has multiple elements, the requirements are
+ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
+are satisfied for each node, and the node(s) with the highest weight(s) are the
+most preferred.
+
+In reality there are two variants of `RequiredDuringScheduling`: one suffixed
+with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
+For the first variant, if the affinity/anti-affinity ceases to be met at some
+point during pod execution (e.g. due to a pod label update), the system will try
+to eventually evict the pod from its node.
In the second variant, the system may +or may not try to eventually evict the pod from its node. ## A comment on symmetry One thing that makes affinity and anti-affinity tricky is symmetry. -Imagine a cluster that is running pods from two services, S1 and S2. Imagine that the pods of S1 have a RequiredDuringScheduling anti-affinity rule -"do not run me on nodes that are running pods from S2." It is not sufficient just to check that there are no S2 pods on a node when -you are scheduling a S1 pod. You also need to ensure that there are no S1 pods on a node when you are scheduling a S2 pod, -*even though the S2 pod does not have any anti-affinity rules*. Otherwise if an S1 pod schedules before an S2 pod, the S1 -pod's RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving S2 pod. More specifically, if S1 has the aforementioned -RequiredDuringScheduling anti-affinity rule, then +Imagine a cluster that is running pods from two services, S1 and S2. Imagine +that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not +run me on nodes that are running pods from S2." It is not sufficient just to +check that there are no S2 pods on a node when you are scheduling a S1 pod. You +also need to ensure that there are no S1 pods on a node when you are scheduling +a S2 pod, *even though the S2 pod does not have any anti-affinity rules*. +Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's +RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving +S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling +anti-affinity rule, then: * if a node is empty, you can schedule S1 or S2 onto the node * if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node Note that while RequiredDuringScheduling anti-affinity is symmetric, -RequiredDuringScheduling affinity is *not* symmetric. 
That is, if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running -pods from S2," it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node. More -specifically, if S1 has the aforementioned RequiredDuringScheduling affinity rule, then +RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 +have a RequiredDuringScheduling affinity rule "run me on nodes that are running +pods from S2," it is not required that there be S1 pods on a node in order to +schedule a S2 pod onto that node. More specifically, if S1 has the +aforementioned RequiredDuringScheduling affinity rule, then: * if a node is empty, you can schedule S2 onto the node * if a node is empty, you cannot schedule S1 onto the node * if a node is running S2, you can schedule S1 onto the node * if a node is running S1+S2 and S1 terminates, S2 continues running -* if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually) - -However, although RequiredDuringScheduling affinity is not symmetric, there is an implicit PreferredDuringScheduling affinity rule corresponding to every -RequiredDuringScheduling affinity rule: if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running -pods from S2" then it is not required that there be S1 pods on a node in order to schedule a S2 pod onto that node, -but it would be better if there are. - -PreferredDuringScheduling is symmetric. -If the pods of S1 had a PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that are running pods from S2" -then we would prefer to keep a S1 pod that we are scheduling off of nodes that are running S2 pods, and also -to keep a S2 pod that we are scheduling off of nodes that are running S1 pods. 
Likewise if the pods of -S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that are running pods from S2" then we would prefer -to place a S1 pod that we are scheduling onto a node that is running a S2 pod, and also to place -a S2 pod that we are scheduling onto a node that is running a S1 pod. +* if a node is running S1+S2 and S2 terminates, the system terminates S1 +(eventually) + +However, although RequiredDuringScheduling affinity is not symmetric, there is +an implicit PreferredDuringScheduling affinity rule corresponding to every +RequiredDuringScheduling affinity rule: if the pods of S1 have a +RequiredDuringScheduling affinity rule "run me on nodes that are running pods +from S2" then it is not required that there be S1 pods on a node in order to +schedule a S2 pod onto that node, but it would be better if there are. + +PreferredDuringScheduling is symmetric. If the pods of S1 had a +PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that +are running pods from S2" then we would prefer to keep a S1 pod that we are +scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that +we are scheduling off of nodes that are running S1 pods. Likewise if the pods of +S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that +are running pods from S2" then we would prefer to place a S1 pod that we are +scheduling onto a node that is running a S2 pod, and also to place a S2 pod that +we are scheduling onto a node that is running a S1 pod. ## Examples -Here are some examples of how you would express various affinity and anti-affinity rules using the API we described. +Here are some examples of how you would express various affinity and +anti-affinity rules using the API we described. 
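To make the symmetry property above concrete, here is a minimal, hypothetical sketch (in the document's Go, but deliberately not the real Kubernetes API) of the feasibility check a scheduler might perform for RequiredDuringScheduling anti-affinity with a "node" `TopologyKey`. The `Pod`, `KV`, `matches`, and `feasible` names are simplifications invented for illustration; the real API uses `LabelSelector` and `TopologyKey`:

```go
package main

import "fmt"

// KV is a toy stand-in for a label selector: "pods labeled Key=Value".
type KV struct{ Key, Value string }

// Pod is a toy pod: its labels, plus an optional RequiredDuringScheduling
// anti-affinity term meaning "do not run me with pods matching AntiAffinity".
type Pod struct {
	Labels       map[string]string
	AntiAffinity *KV
}

func matches(sel *KV, labels map[string]string) bool {
	return sel != nil && labels[sel.Key] == sel.Value
}

// feasible reports whether pod p may be placed on a node already running the
// given resident pods, honoring symmetry: p's own anti-affinity must not match
// any resident pod, AND no resident pod's anti-affinity may match p -- even if
// p itself declares no rules.
func feasible(p Pod, resident []Pod) bool {
	for _, q := range resident {
		if matches(p.AntiAffinity, q.Labels) { // forward direction
			return false
		}
		if matches(q.AntiAffinity, p.Labels) { // symmetric direction
			return false
		}
	}
	return true
}

func main() {
	// S1 pods say "do not run me on nodes running pods from S2"; S2 pods
	// declare no rules at all.
	s1 := Pod{Labels: map[string]string{"service": "S1"},
		AntiAffinity: &KV{Key: "service", Value: "S2"}}
	s2 := Pod{Labels: map[string]string{"service": "S2"}}

	fmt.Println(feasible(s1, nil))       // empty node: true
	fmt.Println(feasible(s1, []Pod{s2})) // S1's own rule blocks it: false
	fmt.Println(feasible(s2, []Pod{s1})) // symmetry blocks S2 too: false
}
```

The last call is the interesting one: S2 has no anti-affinity of its own, yet it is still kept off the node, matching the "if a node is running S1 (S2), you cannot schedule S2 (S1)" rule above.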
### Affinity

-In the examples below, the word "put" is intentionally ambiguous; the rules are the same
-whether "put" means "must put" (RequiredDuringScheduling) or "try to put"
-(PreferredDuringScheduling)--all that changes is which field the rule goes into.
-Also, we only discuss scheduling-time, and ignore the execution-time.
-Finally, some of the examples
-use "zone" and some use "node," just to make the examples more interesting; any of the examples
-with "zone" will also work for "node" if you change the `TopologyKey`, and vice-versa.
+In the examples below, the word "put" is intentionally ambiguous; the rules are
+the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
+put" (PreferredDuringScheduling)--all that changes is which field the rule goes
+into. Also, we only discuss scheduling-time, and ignore the execution-time.
+Finally, some of the examples use "zone" and some use "node," just to make the
+examples more interesting; any of the examples with "zone" will also work for
+"node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
-Tricked you! It is not possible express this using the API described here. For this you should use node affinity.
+Tricked you! It is not possible to express this using the API described here.
+For this you should use node affinity.
* **Put the pod in a zone that is running at least one pod from service S**: `{LabelSelector: , TopologyKey: "zone"}` -* **Put the pod on a node that is already running a pod that requires a license for software package P**: -Assuming pods that require a license for software package P have a label `{key=license, value=P}`: +* **Put the pod on a node that is already running a pod that requires a license +for software package P**: Assuming pods that require a license for software +package P have a label `{key=license, value=P}`: `{LabelSelector: "license" In "P", TopologyKey: "node"}` * **Put this pod in the same zone as other pods from its same service**: Assuming pods from this pod's service have some label `{key=service, value=S}`: `{LabelSelector: "service" In "S", TopologyKey: "zone"}` -This last example illustrates a small issue with this API when it is used -with a scheduler that processes the pending queue one pod at a time, like the current +This last example illustrates a small issue with this API when it is used with a +scheduler that processes the pending queue one pod at a time, like the current Kubernetes scheduler. The RequiredDuringScheduling rule `{LabelSelector: "service" In "S", TopologyKey: "zone"}` -only "works" once one pod from service S has been scheduled. But if all pods in service -S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule -will block the first -pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from -the same service. And of course that means none of the pods of the service will be able -to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not -PreferredDuringScheduling affinity or any variant of anti-affinity. 
-There are at least three ways to solve this problem -* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement -matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement. -This approach has a corner case when running parallel schedulers that are allowed to -schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to -schedule pods from the set -at the same time and think there are no other pods from that set scheduled yet (e.g. they are -trying to schedule the first two pods from the set), but by the time -the second binding is committed, the first one has already been committed, leaving you with -two pods running that do not respect their RequiredDuringScheduling affinity. There is no -simple way to detect this "conflict" at scheduling time given the current system implementation. -* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those -pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate. -* **very long-term/speculative**: controllers could present the scheduler with a group of pods from -the same PodTemplate as a single unit. This is similar to the first approach described above but -avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow -the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) -since it could receive an entire gang simultaneously as a single unit. +only "works" once one pod from service S has been scheduled. But if all pods in +service S have this RequiredDuringScheduling rule in their PodSpec, then the +RequiredDuringScheduling rule will block the first pod of the service from ever +scheduling, since it is only allowed to run in a zone with another pod from the +same service. 
And of course that means none of the pods of the service will be +able to schedule. This problem *only* applies to RequiredDuringScheduling +affinity, not PreferredDuringScheduling affinity or any variant of +anti-affinity. There are at least three ways to solve this problem: +* **short-term**: have the scheduler use a rule that if the +RequiredDuringScheduling affinity requirement matches a pod's own labels, and +there are no other such pods anywhere, then disregard the requirement. This +approach has a corner case when running parallel schedulers that are allowed to +schedule pods from the same replicated set (e.g. a single PodTemplate): both +schedulers may try to schedule pods from the set at the same time and think +there are no other pods from that set scheduled yet (e.g. they are trying to +schedule the first two pods from the set), but by the time the second binding is +committed, the first one has already been committed, leaving you with two pods +running that do not respect their RequiredDuringScheduling affinity. There is no +simple way to detect this "conflict" at scheduling time given the current system +implementation. +* **longer-term**: when a controller creates pods from a PodTemplate, for +exactly *one* of those pods, it should omit any RequiredDuringScheduling +affinity rules that select the pods of that PodTemplate. +* **very long-term/speculative**: controllers could present the scheduler with a +group of pods from the same PodTemplate as a single unit. This is similar to the +first approach described above but avoids the corner case. No special logic is +needed in the controllers. Moreover, this would allow the scheduler to do proper +[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since +it could receive an entire gang simultaneously as a single unit. ### Anti-affinity -As with the affinity examples, the examples here can be RequiredDuringScheduling or -PreferredDuringScheduling anti-affinity, i.e. 
-"don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears -in `RequiredDuringScheduling` or `PreferredDuringScheduling`. +As with the affinity examples, the examples here can be RequiredDuringScheduling +or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as +"must not" or as "try not to" depending on whether the rule appears in +`RequiredDuringScheduling` or `PreferredDuringScheduling`. * **Spread the pods of this service S across nodes and zones**: -`{{LabelSelector: , TopologyKey: "node"}, {LabelSelector: , TopologyKey: "zone"}}` -(note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second -clause will force the scheduler to not put more than one pod from S in the same zone, and thus by -definition it will not put more than one pod from S on the same node, assuming each node is in one zone. -This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in +`{{LabelSelector: , TopologyKey: "node"}, +{LabelSelector: , TopologyKey: "zone"}}` +(note that if this is specified as a RequiredDuringScheduling anti-affinity, +then the first clause is redundant, since the second clause will force the +scheduler to not put more than one pod from S in the same zone, and thus by +definition it will not put more than one pod from S on the same node, assuming +each node is in one zone. This rule is more useful as PreferredDuringScheduling +anti-affinity, e.g. one might expect it to be common in [Ubernetes](../../docs/proposals/federation.md) clusters.) * **Don't co-locate pods of this service with pods from service "evilService"**: @@ -323,25 +358,29 @@ This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. 
one mi * **Don't co-locate pods of this service with any other pods except other pods of this service**: Assuming pods from the service have some label `{key=service, value=S}`: `{LabelSelector: "service" NotIn "S", TopologyKey: "node"}` -Note that this works because `"service" NotIn "S"` matches pods with no key "service" -as well as pods with key "service" and a corresponding value that is not "S." +Note that this works because `"service" NotIn "S"` matches pods with no key +"service" as well as pods with key "service" and a corresponding value that is +not "S." ## Algorithm -An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows. -There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's -semantics are implementable. +An example algorithm a scheduler might use to implement affinity and +anti-affinity rules is as follows. There are certainly more efficient ways to +do it; this is just intended to demonstrate that the API's semantics are +implementable. -Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler -predicates for scheduling P onto N. Note that this algorithm is only concerned about scheduling -time, thus it makes no distinction between RequiredDuringExecution and IgnoredDuringExecution. +Terminology definition: We say a pod P is "feasible" on a node N if P meets all +of the scheduler predicates for scheduling P onto N. Note that this algorithm is +only concerned about scheduling time, thus it makes no distinction between +RequiredDuringExecution and IgnoredDuringExecution. -To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand -for "RequiredDuringSchedulingScheduling pod affinity" and "SoftPodAffinity" as shorthand for -"PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity." 
+To make the algorithm slightly more readable, we use the term "HardPodAffinity"
+as shorthand for "RequiredDuringScheduling pod affinity" and
+"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
+Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

-** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account;
-currently it assumes all terms have weight 1. **
+** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
+into account; currently it assumes all terms have weight 1. **

```
Z = the pod you are scheduling
@@ -389,74 +428,81 @@ foreach node A of {N}

## Special considerations for RequiredDuringScheduling anti-affinity

-In this section we discuss three issues with RequiredDuringScheduling anti-affinity:
-Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill.
-See issue #18265 for additional discussion of these topics.
+In this section we discuss three issues with RequiredDuringScheduling
+anti-affinity: Denial of Service (DoS), co-existing with daemons, and
+determining which pod(s) to kill. See issue #18265 for additional discussion of
+these topics.

### Denial of Service

-Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally
-or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity.
-
-The most notable danger is the ability for a
-pod that arrives first to some topology domain, to block all other pods from
-scheduling there by stating a conflict with all other pods.
-The standard approach
-to preventing resource hogging is quota, but simple resource quota cannot prevent
-this scenario because the pod may request very little resources. Addressing this
-using quota requires a quota scheme that charges based on "opportunity cost" rather
-than based simply on requested resources.
For example, when handling a pod that expresses +Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity +can intentionally or unintentionally cause various problems for other pods, due +to the symmetry property of anti-affinity. + +The most notable danger is the ability for a pod that arrives first to some +topology domain, to block all other pods from scheduling there by stating a +conflict with all other pods. The standard approach to preventing resource +hogging is quota, but simple resource quota cannot prevent this scenario because +the pod may request very little resources. Addressing this using quota requires +a quota scheme that charges based on "opportunity cost" rather than based simply +on requested resources. For example, when handling a pod that expresses RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` (i.e. exclusive access to a node), it could charge for the resources of the -average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling -anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the -entire cluster. If node affinity is used to -constrain the pod to a particular topology domain, then the admission-time quota -charging should take that into account (e.g. not charge for the average/largest machine -if the PodSpec constrains the pod to a specific machine with a known size; instead charge -for the size of the actual machine that the pod was constrained to). In all cases -once the pod is scheduled, the quota charge should be adjusted down to the -actual amount of resources allocated (e.g. the size of the actual machine that was -assigned, not the average/largest). If a cluster administrator wants to overcommit quota, for +average or largest node in the cluster. 
Likewise if a pod expresses +RequiredDuringScheduling anti-affinity for all pods using a "cluster" +`TopologyKey`, it could charge for the resources of the entire cluster. If node +affinity is used to constrain the pod to a particular topology domain, then the +admission-time quota charging should take that into account (e.g. not charge for +the average/largest machine if the PodSpec constrains the pod to a specific +machine with a known size; instead charge for the size of the actual machine +that the pod was constrained to). In all cases once the pod is scheduled, the +quota charge should be adjusted down to the actual amount of resources allocated +(e.g. the size of the actual machine that was assigned, not the +average/largest). If a cluster administrator wants to overcommit quota, for example to allow more than N pods across all users to request exclusive node -access in a cluster with N nodes, then a priority/preemption scheme should be added -so that the most important pods run when resource demand exceeds supply. +access in a cluster with N nodes, then a priority/preemption scheme should be +added so that the most important pods run when resource demand exceeds supply. An alternative approach, which is a bit of a blunt hammer, is to use a capability mechanism to restrict use of RequiredDuringScheduling anti-affinity -to trusted users. A more complex capability mechanism might only restrict it when -using a non-"node" TopologyKey. +to trusted users. A more complex capability mechanism might only restrict it +when using a non-"node" TopologyKey. Our initial implementation will use a variant of the capability approach, which -requires no configuration: we will simply reject ALL requests, regardless of user, -that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity. -This allows the "exclusive node" use case while prohibiting the more dangerous ones. 
- -A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade -the scheduling quality of another pod, but not completely block it from scheduling. -For example, a set of pods S1 could use node affinity to request to schedule onto a set -of nodes that some other set of pods S2 prefers to schedule onto. If the pods in S1 -have RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for S2, -then due to the symmetry property of anti-affinity, they can prevent the pods in S2 from -scheduling onto their preferred nodes if they arrive first (for sure in the RequiredDuringScheduling case, and -with some probability that depends on the weighting scheme for the PreferredDuringScheduling case). -A very sophisticated priority and/or quota scheme could mitigate this, or alternatively -we could eliminate the symmetry property of the implementation of PreferredDuringScheduling anti-affinity. -Then only RequiredDuringScheduling anti-affinity could affect scheduling quality -of another pod, and as we described in the previous paragraph, such pods could be charged -quota for the full topology domain, thereby reducing the potential for abuse. - -We won't try to address this issue in our initial implementation; we can consider one -of the approaches mentioned above if it turns out to be a problem in practice. +requires no configuration: we will simply reject ALL requests, regardless of +user, that specify "all namespaces" with non-"node" TopologyKey for +RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use +case while prohibiting the more dangerous ones. + +A weaker variant of the problem described in the previous paragraph is a pod's +ability to use anti-affinity to degrade the scheduling quality of another pod, +but not completely block it from scheduling. 
For example, a set of pods S1 could +use node affinity to request to schedule onto a set of nodes that some other set +of pods S2 prefers to schedule onto. If the pods in S1 have +RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for +S2, then due to the symmetry property of anti-affinity, they can prevent the +pods in S2 from scheduling onto their preferred nodes if they arrive first (for +sure in the RequiredDuringScheduling case, and with some probability that +depends on the weighting scheme for the PreferredDuringScheduling case). A very +sophisticated priority and/or quota scheme could mitigate this, or alternatively +we could eliminate the symmetry property of the implementation of +PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling +anti-affinity could affect scheduling quality of another pod, and as we +described in the previous paragraph, such pods could be charged quota for the +full topology domain, thereby reducing the potential for abuse. + +We won't try to address this issue in our initial implementation; we can +consider one of the approaches mentioned above if it turns out to be a problem +in practice. ### Co-existing with daemons -A cluster administrator -may wish to allow pods that express anti-affinity against all pods, to nonetheless co-exist with -system daemon pods, such as those run by DaemonSet. In principle, we would like the specification -for RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or more -other pods (see #18263 for a more detailed explanation of the toleration concept). There are -at least two ways to accomplish this: +A cluster administrator may wish to allow pods that express anti-affinity +against all pods, to nonetheless co-exist with system daemon pods, such as those +run by DaemonSet. 
In principle, we would like the specification for
+RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
+more other pods (see #18263 for a more detailed explanation of the toleration
+concept). There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s)
where daemons live, in the sense that it ignores pods in those namespaces when it is
@@ -478,147 +524,168 @@ Our initial implementation will use the first approach.

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

-Because anti-affinity is symmetric, in the case of RequiredDuringSchedulingRequiredDuringExecution
-anti-affinity, the system must determine which pod(s) to kill when a pod's labels are updated in
-such as way as to cause them to conflict with one or more other pods' RequiredDuringSchedulingRequiredDuringExecution
-anti-affinity rules. In the absence of a priority/preemption scheme, our rule will be that the pod
-with the anti-affinity rule that becomes violated should be the one killed.
-A pod should only specify constraints that apply to
-namespaces it trusts to not do malicious things. Once we have priority/preemption, we can
-change the rule to say that the lowest-priority pod(s) are killed until all
+Because anti-affinity is symmetric, in the case of
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
+determine which pod(s) to kill when a pod's labels are updated in such a way as
+to cause them to conflict with one or more other pods'
+RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
+absence of a priority/preemption scheme, our rule will be that the pod with the
+anti-affinity rule that becomes violated should be the one killed. A pod should
+only specify constraints that apply to namespaces it trusts to not do malicious
+things.
Once we have priority/preemption, we can change the rule to say that the +lowest-priority pod(s) are killed until all RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied. ## Special considerations for RequiredDuringScheduling affinity -The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its symmetry: -if a pod P requests anti-affinity, P cannot schedule onto a node with conflicting pods, -and pods that conflict with P cannot schedule onto the node one P has been scheduled there. -The design we have described says that the symmetry property for RequiredDuringScheduling *affinity* -is weaker: if a pod P says it can only schedule onto nodes running pod Q, this -does not mean Q can only run on a node that is running P, but the scheduler will try -to schedule Q onto a node that is running P (i.e. treats the reverse direction as -preferred). This raises the same scheduling quality concern as we mentioned at the -end of the Denial of Service section above, and can be addressed in similar ways. - -The nature of affinity (as opposed to anti-affinity) means that there is no issue of -determining which pod(s) to kill -when a pod's labels change: it is obviously the pod with the affinity rule that becomes -violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule; -it can only "fix" violation an anti-affinity rule.) However, affinity does have a -different question related to killing: how long should the system wait before declaring -that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime? -For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed -so that it can be updated to a new binary version, should that trigger killing of P? More -generally, how long should the system wait before declaring that P's affinity is -violated? 
(Of course affinity is expressed in terms of label selectors, not for a specific
-pod, but the scenario is easier to describe using a concrete pod.) This is closely related to
-the concept of forgiveness (see issue #1574). In theory we could make this time duration be
-configurable by the user on a per-pod basis, but for the first version of this feature we will
-make it a configurable property of whichever component does the killing and that applies across
-all pods using the feature. Making it configurable by the user would require a nontrivial change
-to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution
-affinity).
+The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
+symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
+conflicting pods, and pods that conflict with P cannot schedule onto the node
+once P has been scheduled there. The design we have described says that the
+symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
+says it can only schedule onto nodes running pod Q, this does not mean Q can
+only run on a node that is running P, but the scheduler will try to schedule Q
+onto a node that is running P (i.e. treats the reverse direction as preferred).
+This raises the same scheduling quality concern as we mentioned at the end of
+the Denial of Service section above, and can be addressed in similar ways.
+
+The nature of affinity (as opposed to anti-affinity) means that there is no
+issue of determining which pod(s) to kill when a pod's labels change: it is
+obviously the pod with the affinity rule that becomes violated that must be
+killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
+"fix" violation of an anti-affinity rule.)
However, affinity does have a different +question related to killing: how long should the system wait before declaring +that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met +at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q +is temporarily killed so that it can be updated to a new binary version, should +that trigger killing of P? More generally, how long should the system wait +before declaring that P's affinity is violated? (Of course affinity is expressed +in terms of label selectors, not for a specific pod, but the scenario is easier +to describe using a concrete pod.) This is closely related to the concept of +forgiveness (see issue #1574). In theory we could make this time duration be +configurable by the user on a per-pod basis, but for the first version of this +feature we will make it a configurable property of whichever component does the +killing and that applies across all pods using the feature. Making it +configurable by the user would require a nontrivial change to the API syntax +(since the field would only apply to +RequiredDuringSchedulingRequiredDuringExecution affinity). ## Implementation plan -1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types. -2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` -affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod). -3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account -4. Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity. -This admission controller should be enabled by default. +1. 
Add the `Affinity` field to PodSpec and the `PodAffinity` and
+`PodAntiAffinity` types to the API along with all of their descendant types.
+2. Implement a scheduler predicate that takes
+`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
+account. Include a workaround for the issue described at the end of the Affinity
+section of the Examples section (can't schedule first pod).
+3. Implement a scheduler priority function that takes
+`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
+into account.
+4. Implement an admission controller that rejects requests that specify "all
+namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
+anti-affinity. This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue
6. At this point, the feature can be deployed.
-7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
-the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take
-`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism,
-the "co-existing with daemons" solution).
-8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision
-9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
-`RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet then only for "node" `TopologyKey`;
-if controller then potentially for all `TopologyKeys`'s.
-(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+7.
Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
+and anti-affinity, and make sure the pieces of the system already implemented
+for `RequiredDuringSchedulingIgnoredDuringExecution` also take
+`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
+scheduler predicate, the quota mechanism, the "co-existing with daemons"
+solution).
+8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
+`TopologyKey` to Kubelet's admission decision.
+9. Implement code in Kubelet *or* the controllers that evicts a pod that no
+longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet
+then only for "node" `TopologyKey`; if controller then potentially for all
+`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
-We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
-domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
+We assume Kubelet publishes labels describing the node's membership in all of
+the relevant scheduling domains (e.g. node name, rack name, availability zone
+name, etc.). See #9044.

## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

-Users should not start using `Affinity` until the full implementation has
-been in Kubelet and the master for enough binary versions that we feel
-comfortable that we will not need to roll back either Kubelet or
-master to a version that does not support them. Longer-term we will
-use a programmatic approach to enforcing this (#4855).
+Users should not start using `Affinity` until the full implementation has been
+in Kubelet and the master for enough binary versions that we feel comfortable
+that we will not need to roll back either Kubelet or master to a version that
+does not support them.
Longer-term we will use a programmatic approach to
+enforcing this (#4855).

## Extensibility

-The design described here is the result of careful analysis of use cases, a decade of experience
-with Borg at Google, and a review of similar features in other open-source container orchestration
-systems. We believe that it properly balances the goal of expressiveness against the goals of
-simplicity and efficiency of implementation. However, we recognize that
-use cases may arise in the future that cannot be expressed using the syntax described here.
-Although we are not implementing an affinity-specific extensibility mechanism for a variety
-of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
-users to get a consistent experience, etc.), the regular Kubernetes
-annotation mechanism can be used to add or replace affinity rules. The way this work would is
+The design described here is the result of careful analysis of use cases, a
+decade of experience with Borg at Google, and a review of similar features in
+other open-source container orchestration systems. We believe that it properly
+balances the goal of expressiveness against the goals of simplicity and
+efficiency of implementation. However, we recognize that use cases may arise in
+the future that cannot be expressed using the syntax described here. Although we
+are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
+for Kubernetes users to get a consistent experience, etc.), the regular
+Kubernetes annotation mechanism can be used to add or replace affinity rules.
+The way this would work is:

1. Define one or more annotations to describe the new affinity rule(s)
-1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
-If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields -from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the -annotation(s). +1. User (or an admission controller) attaches the annotation(s) to pods to +request the desired scheduling behavior. If the new rule(s) *replace* one or +more fields of `Affinity` then the user would omit those fields from `Affinity`; +if they are *additional rules*, then the user would fill in `Affinity` as well +as the annotation(s). 1. Scheduler takes the annotation(s) into account when scheduling. -If some particular new syntax becomes popular, we would consider upstreaming it by integrating -it into the standard `Affinity`. +If some particular new syntax becomes popular, we would consider upstreaming it +by integrating it into the standard `Affinity`. ## Future work and non-work -One can imagine that in the anti-affinity RequiredDuringScheduling case -one might want to associate a number with the rule, -for example "do not allow this pod to share a rack with more than three other -pods (in total, or from the same service as the pod)." We could allow this to be -specified by adding an integer `Limit` to `PodAffinityTerm` just for the -`RequiredDuringScheduling` case. However, this flexibility complicates the -system and we do not intend to implement it. +One can imagine that in the anti-affinity RequiredDuringScheduling case one +might want to associate a number with the rule, for example "do not allow this +pod to share a rack with more than three other pods (in total, or from the same +service as the pod)." We could allow this to be specified by adding an integer +`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case. +However, this flexibility complicates the system and we do not intend to +implement it. 
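The feasibility check that such a `Limit` would imply can be sketched as follows. This is purely an illustration of the semantics described above, which the document explicitly does not plan to implement; the function name and data shapes are hypothetical:

```python
def feasible_with_limit(candidate_node, all_pods, node_labels,
                        selector_labels, topology_key, limit):
    """Hypothetical feasibility check for an anti-affinity term carrying an
    integer Limit: the candidate node is feasible only if its topology domain
    currently holds no more than `limit` pods matching the label selector."""
    domain = node_labels[candidate_node].get(topology_key)
    # Count pods in the same topology domain whose labels match the selector.
    matching = sum(
        1
        for pod in all_pods
        if node_labels[pod["node"]].get(topology_key) == domain
        and all(pod["labels"].get(k) == v for k, v in selector_labels.items())
    )
    return matching <= limit
```

With `topology_key="rack"` and `limit=3`, this encodes the example from the text: do not allow the pod to share a rack with more than three other pods from the selected set.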
It is likely that the specification and implementation of pod anti-affinity can be unified with [taints and tolerations](taint-toleration-dedicated.md), and likewise that the specification and implementation of pod affinity -can be unified with [node affinity](nodeaffinity.md). -The basic idea is that pod labels would be "inherited" by the node, and pods -would only be able to specify affinity and anti-affinity for a node's labels. -Our main motivation for not unifying taints and tolerations with -pod anti-affinity is that we foresee taints and tolerations as being a concept that -only cluster administrators need to understand (and indeed in some setups taints and -tolerations wouldn't even be directly manipulated by a cluster administrator, -instead they would only be set by an admission controller that is implementing the administrator's -high-level policy about different classes of special machines and the users who belong to the groups -allowed to access them). Moreover, the concept of nodes "inheriting" labels -from pods seems complicated; it seems conceptually simpler to separate rules involving -relatively static properties of nodes from rules involving which other pods are running -on the same node or larger topology domain. - -Data/storage affinity is related to pod affinity, and is likely to draw on some of the -ideas we have used for pod affinity. Today, data/storage affinity is expressed using -node affinity, on the assumption that the pod knows which node(s) store(s) the data -it wants. But a more flexible approach would allow the pod to name the data rather than -the node. +can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod +labels would be "inherited" by the node, and pods would only be able to specify +affinity and anti-affinity for a node's labels. 
Our main motivation for not
+unifying taints and tolerations with pod anti-affinity is that we foresee taints
+and tolerations as being a concept that only cluster administrators need to
+understand (and indeed in some setups taints and tolerations wouldn't even be
+directly manipulated by a cluster administrator; instead they would only be set
+by an admission controller that is implementing the administrator's high-level
+policy about different classes of special machines and the users who belong to
+the groups allowed to access them). Moreover, the concept of nodes "inheriting"
+labels from pods seems complicated; it seems conceptually simpler to separate
+rules involving relatively static properties of nodes from rules involving which
+other pods are running on the same node or larger topology domain.
+
+Data/storage affinity is related to pod affinity, and is likely to draw on some
+of the ideas we have used for pod affinity. Today, data/storage affinity is
+expressed using node affinity, on the assumption that the pod knows which
+node(s) store(s) the data it wants. But a more flexible approach would allow the
+pod to name the data rather than the node.

## Related issues

The review for this proposal is in #18265.

-The topic of affinity/anti-affinity has generated a lot of discussion. The main issue
-is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341, #1965, and #2906
-all have additional discussion and use cases.
+The topic of affinity/anti-affinity has generated a lot of discussion. The main
+issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
+#1965, and #2906 all have additional discussion and use cases.
+
+As the examples in this document have demonstrated, topological affinity is very
+useful in clusters that are spread across availability zones, e.g. to co-locate
+pods of a service in the same zone to avoid a wide-area network hop, or to
+spread pods across zones for failure tolerance.
#17059, #13056, #13063, and
-As the examples in this document have demonstrated, topological affinity is very useful
-in clusters that are spread across availability zones, e.g. to co-locate pods of a service
-in the same zone to avoid a wide-area network hop, or to spread pods across zones for
-failure tolerance. #17059, #13056, #13063, and #4235 are relevant.
+#4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.

diff --git a/principles.md b/principles.md
index 5e0e8252..297ae923 100644
--- a/principles.md
+++ b/principles.md
@@ -43,26 +43,57 @@ See also the [API conventions](../devel/api-conventions.md).

* All APIs should be declarative.
* API objects should be complementary and composable, not opaque wrappers.
* The control plane should be transparent -- there are no hidden internal APIs.
-* The cost of API operations should be proportional to the number of objects intentionally operated upon. Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior.
-* Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.
-* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components, that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation.
-* Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs).
+* The cost of API operations should be proportional to the number of objects
+intentionally operated upon.
Therefore, common filtered lookups must be indexed.
+Beware of patterns of multiple API calls that would incur quadratic behavior.
+* Object status must be 100% reconstructable by observation. Any history kept
+must be just an optimization and not required for correct operation.
+* Cluster-wide invariants are difficult to enforce correctly. Try not to add
+them. If you must have them, don't enforce them atomically in master components;
+that is contention-prone and doesn't provide a recovery path in the case of a
+bug allowing the invariant to be violated. Instead, provide a series of checks
+to reduce the probability of a violation, and make every component involved able
+to recover from an invariant violation.
+* Low-level APIs should be designed for control by higher-level systems.
+Higher-level APIs should be intent-oriented (think SLOs) rather than
+implementation-oriented (think control knobs).

## Control logic

-* Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization.
-* Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them.
-* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation.
-* Don't assume a component's decisions will not be overridden or rejected, nor for the component to always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions.
-* Components should be self-healing.
For example, if you must keep some state (e.g., cache) the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed etc, it will be soon fixed, ideally on timescales that are shorter than what will attract attention from humans.
-* Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
+* Functionality must be *level-based*, meaning the system must operate correctly
+given the desired state and the current/observed state, regardless of how many
+intermediate state updates may have been missed. Edge-triggered behavior must be
+just an optimization.
+* Assume an open world: continually verify assumptions and gracefully adapt to
+external events and/or actors. Example: we allow users to kill pods under
+control of a replication controller; it just replaces them.
+* Do not define comprehensive state machines for objects with behaviors
+associated with state transitions and/or "assumed" states that cannot be
+ascertained by observation.
+* Don't assume a component's decisions will not be overridden or rejected, nor
+that the component will always understand why. For example, etcd may reject
+writes. Kubelet may reject pods. The scheduler may not be able to schedule pods.
+Retry, but back off and/or make alternative decisions.
+* Components should be self-healing. For example, if you must keep some state
+(e.g., cache) the content needs to be periodically refreshed, so that if an item
+does get erroneously stored or a deletion event is missed etc., it will soon be
+fixed, ideally on timescales that are shorter than what will attract attention
+from humans.
+* Component behavior should degrade gracefully. Prioritize actions so that the
+most important activities can continue to function even when overloaded and/or
+in states of partial failure.
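The *level-based* principle above can be illustrated with a minimal reconciliation sketch (hypothetical names, not actual controller code): each pass computes actions purely from a desired-state and observed-state snapshot, so missed intermediate updates never affect correctness.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Derive actions from the difference between desired and observed state.
    Because no event history is consulted, a missed update is harmless: the
    next pass recomputes the same difference from fresh snapshots."""
    creates = [("create", name) for name in sorted(desired.keys() - observed.keys())]
    deletes = [("delete", name) for name in sorted(observed.keys() - desired.keys())]
    return creates + deletes

# An edge-triggered notification would merely wake such a loop up earlier;
# correctness comes from the level-based comparison alone.
```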
## Architecture -* Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.). +* Only the apiserver should communicate with etcd/store, and not other +components (scheduler, kubelet, etc.). * Compromising a single node shouldn't compromise the cluster. -* Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage). -* All components should keep all relevant state in memory all the time. The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients. +* Components should continue to do what they were last told in the absence of +new instructions (e.g., due to network partition or component outage). +* All components should keep all relevant state in memory all the time. The +apiserver should write through to etcd/store, other components should write +through to the apiserver, and they should watch for updates made by other +clients. * Watch is preferred over polling. ## Extensibility @@ -72,13 +103,23 @@ TODO: pluggability ## Bootstrapping * [Self-hosting](http://issue.k8s.io/246) of all components is a goal. -* Minimize the number of dependencies, particularly those required for steady-state operation. +* Minimize the number of dependencies, particularly those required for +steady-state operation. * Stratify the dependencies that remain via principled layering. -* Break any circular dependencies by converting hard dependencies to soft dependencies. - * Also accept that data from other components from another source, such as local files, which can then be manually populated at bootstrap time and then continuously updated once those other components are available. +* Break any circular dependencies by converting hard dependencies to soft +dependencies. 
+ * Also accept data that would otherwise come from other components from
+another source, such as local files, which can then be manually populated at
+bootstrap time and then continuously updated once those other components are
+available.
* State should be rediscoverable and/or reconstructable.
- * Make it easy to run temporary, bootstrap instances of all components in order to create the runtime state needed to run the components in the steady state; use a lock (master election for distributed components, file lock for local components like Kubelet) to coordinate handoff. We call this technique "pivoting".
- * Have a solution to restart dead components. For distributed components, replication works well. For local components such as Kubelet, a process manager or even a simple shell loop works.
+ * Make it easy to run temporary, bootstrap instances of all components in
+order to create the runtime state needed to run the components in the steady
+state; use a lock (master election for distributed components, file lock for
+local components like Kubelet) to coordinate handoff. We call this technique
+"pivoting".
+ * Have a solution to restart dead components. For distributed components,
+replication works well. For local components such as Kubelet, a process manager
+or even a simple shell loop works.

## Availability

diff --git a/resources.md b/resources.md
index 6a7ee449..2a75c987 100644
--- a/resources.md
+++ b/resources.md
@@ -31,16 +31,19 @@ Documentation for other releases can be found at

-**Note: this is a design doc, which describes features that have not been completely implemented.
-User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for
-implementation of this model is
-[#168](http://issue.k8s.io/168). Currently, both limits and requests of memory and
-cpu on containers (not pods) are supported.
"memory" is in bytes and "cpu" is in -milli-cores.** +**Note: this is a design doc, which describes features that have not been +completely implemented. User documentation of the current state is +[here](../user-guide/compute-resources.md). The tracking issue for +implementation of this model is [#168](http://issue.k8s.io/168). Currently, both +limits and requests of memory and cpu on containers (not pods) are supported. +"memory" is in bytes and "cpu" is in milli-cores.** # The Kubernetes resource model -To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document. +To do good pod placement, Kubernetes needs to know how big pods are, as well as +the sizes of the nodes onto which they are being placed. The definition of "how +big" is given by the Kubernetes resource model — the subject of this +document. The resource model aims to be: * simple, for common cases; @@ -50,43 +53,107 @@ The resource model aims to be: ## The resource model -A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth. +A Kubernetes _resource_ is something that can be requested by, allocated to, or +consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, +and network bandwidth. -Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_. 
+Once resources on a node have been allocated to one pod, they should not be
+allocated to another until that pod is removed or exits. This means that
+Kubernetes schedulers should ensure that the sum of the resources allocated
+(requested and granted) to its pods never exceeds the usable capacity of the
+node. Testing whether a pod will fit on a node is called _feasibility checking_.

-Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later.
+Note that the resource model currently prohibits over-committing resources; we
+will want to relax that restriction later.

### Resource types

-All resources have a _type_ that is identified by their _typename_ (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots).
-
-A fully-qualified resource typename is constructed from a DNS-style _subdomain_, followed by a slash `/`, followed by a name.
-* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) (e.g., `kubernetes.io`, `example.com`).
-* The name must be not more than 63 characters, consisting of upper- or lower-case alphanumeric characters, with the `-`, `_`, and `.` characters allowed anywhere except the first or last character.
-* As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes _namespace_, `kubernetes.io/` in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace.
-
-Some example typenames include `memory` (which will be fully-qualified as `kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
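The feasibility check described above (no over-commit: allocations never exceed usable capacity) reduces to a per-resource-type comparison. A minimal sketch, with a hypothetical helper and data shapes, not actual scheduler code:

```python
def fits(node_capacity: dict, allocated: dict, pod_request: dict) -> bool:
    """A pod fits on a node iff, for every resource type it requests, the
    already-allocated quantity plus the new request stays within the node's
    usable capacity (the model currently forbids over-committing)."""
    return all(
        allocated.get(rtype, 0) + qty <= node_capacity.get(rtype, 0)
        for rtype, qty in pod_request.items()
    )
```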
- -For future reference, note that some resources, such as CPU and network bandwidth, are _compressible_, which means that their usage can potentially be throttled in a relatively benign manner. All other resources are _incompressible_, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources. +All resources have a _type_ that is identified by their _typename_ (a string, +e.g., "memory"). Several resource types are predefined by Kubernetes (a full +list is below), although only two will be supported at first: CPU and memory. +Users and system administrators can define their own resource types if they wish +(e.g., Hadoop slots). + +A fully-qualified resource typename is constructed from a DNS-style _subdomain_, +followed by a slash `/`, followed by a name. +* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) +(e.g., `kubernetes.io`, `example.com`). +* The name must be not more than 63 characters, consisting of upper- or +lower-case alphanumeric characters, with the `-`, `_`, and `.` characters +allowed anywhere except the first or last character. +* As a shorthand, any resource typename that does not start with a subdomain and +a slash will automatically be prefixed with the built-in Kubernetes _namespace_, +`kubernetes.io/` in order to fully-qualify it. This namespace is reserved for +code in the open source Kubernetes repository; as a result, all user typenames +MUST be fully qualified, and cannot be created in this namespace. + +Some example typenames include `memory` (which will be fully-qualified as +`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`. + +For future reference, note that some resources, such as CPU and network +bandwidth, are _compressible_, which means that their usage can potentially be +throttled in a relatively benign manner. 
All other resources are +_incompressible_, which means that any attempt to throttle them is likely to +cause grief. This distinction will be important if a Kubernetes implementation +supports over-committing of resources. ### Resource quantities -Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per seconds for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?). - -Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources. - -To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffices (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M" , "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m"). - * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people. - * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1. 
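The external quantity conventions above can be made concrete with a small parser that converts an external string into the integral milli-unit representation discussed later in this section. This is a hypothetical sketch using only the suffixes named in the text, not the actual Kubernetes parsing code:

```python
from fractions import Fraction

# Decimal SI suffixes and their power-of-two equivalents; "m" (milli) is the
# only multiplier below 1 and has no power-of-two counterpart.
SUFFIXES = {
    "E": 10**18, "P": 10**15, "T": 10**12, "G": 10**9, "M": 10**6, "K": 10**3,
    "m": Fraction(1, 1000),
    "Ei": 2**60, "Pi": 2**50, "Ti": 2**40, "Gi": 2**30, "Mi": 2**20, "Ki": 2**10,
}

def parse_quantity_milli(s: str) -> int:
    """Parse an external quantity ("129M", "123Mi", "300m", "0.3", "129e6")
    into an integral number of milli-units, using exact fixed-point
    arithmetic so no float rounding creeps in."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):  # match "Mi" before "M"
        if s.endswith(suffix):
            number, multiplier = s[: -len(suffix)], SUFFIXES[suffix]
            break
    else:
        number, multiplier = s, 1  # unadorned integer or decimal
    return int(Fraction(number) * multiplier * 1000)
```

Under this sketch, "129M" and "129e6" parse to the same value, "123Mi" to a nearby one, and "300m" and "0.3" are identical, matching the equivalences claimed in the text.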
+Initially, all Kubernetes resource types are _quantitative_, and have an
+associated _unit_ for quantities of the associated resource (e.g., bytes for
+memory, bytes per second for bandwidth, instances for software licences). The
+units will always be a resource type's natural base units (e.g., bytes, not MB),
+to avoid confusion between binary and decimal multipliers and the underlying
+unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
+
+Resource quantities can be added and subtracted: for example, a node has a fixed
+quantity of each resource type that can be allocated to pods/containers; once
+such an allocation has been made, the allocated resources cannot be made
+available to other pods/containers without over-committing the resources.
+
+To make life easier for people, quantities can be represented externally as
+unadorned integers, or as fixed-point integers with one of these SI suffixes
+(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi,
+Ki). For example, the following represent roughly the same value: 128974848,
+"129e6", "129M", "123Mi". Small quantities can be represented directly as
+decimals (e.g., 0.3), or using milli-units (e.g., "300m").
+ * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML
+resource specifications that might be generated or read by people.
+ * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI
+suffix. There are no power-of-two equivalents for SI suffixes that represent
+multipliers less than 1.
To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not. - * Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations. - * Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. That seems like the right choice. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs. +Internally (i.e., everywhere else), Kubernetes will represent resource +quantities as integers so it can avoid problems with rounding errors, and will +not use strings to represent numeric values. To achieve this, quantities that +naturally have fractional parts (e.g., CPU seconds/second) will be scaled to +integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. +Internal APIs, data structures, and protobufs will use these scaled integer +units. Raw measurement data such as usage may still need to be tracked and +calculated using floating point values, but internally they should be rescaled +to avoid some values being in milli-units and some not. 
+ * Note that reading in a resource quantity and writing it out again may change
+the way its values are represented, and truncate precision (e.g., 1.0001 may
+become 1.000), so comparison and difference operations (e.g., by an updater)
+must be done on the internal representations.
+ * Avoiding milli-units in external representations has advantages for people
+who will use Kubernetes, but runs the risk of developers forgetting to rescale
+or accidentally using floating-point representations. That seems like the right
+choice. We will try to reduce the risk by providing libraries that automatically
+do the quantization for JSON/YAML inputs.

### Resource specifications

-Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes divides specifications of *desired state*, aka the Spec, and representations of *current state*, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now.
+Both users and a number of system components, such as schedulers, (horizontal)
+auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers,
+need to reason about resource requirements of workloads, resource capacities of
+nodes, and resource usage. Kubernetes divides specifications of *desired state*,
+aka the Spec, from representations of *current state*, aka the Status.
Resource +requirements and total node capacity fall into the specification category, while +resource usage, characterizations derived from usage (e.g., maximum usage, +histograms), and other resource demand signals (e.g., CPU load) clearly fall +into the status category and are discussed in the Appendix for now. Resource requirements for a container or pod should have the following form: @@ -98,9 +165,24 @@ resourceRequirementSpec: [ ``` Where: -* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses. - -* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory. +* _request_ [optional]: the amount of resources being requested, or that were +requested and have been allocated. Scheduler algorithms will use these +quantities to test feasibility (whether a pod will fit onto a node). 
+If a container (or pod) tries to use more resources than its _request_, any +associated SLOs are voided — e.g., the program it is running may be +throttled (compressible resource types), or the attempt may be denied. If +_request_ is omitted for a container, it defaults to _limit_ if that is +explicitly specified, otherwise to an implementation-defined value; this will +always be 0 for a user-defined resource type. If _request_ is omitted for a pod, +it defaults to the sum of the (explicit or implicit) _request_ values for the +containers it encloses. + +* _limit_ [optional]: an upper bound or cap on the maximum amount of resources +that will be made available to a container or pod; if a container or pod uses +more resources than its _limit_, it may be terminated. The _limit_ defaults to +"unbounded"; in practice, this probably means the capacity of an enclosing +container, pod, or node, but may result in non-deterministic behavior, +especially for memory. Total capacity for a node should have a similar structure: @@ -111,36 +193,66 @@ resourceCapacitySpec: [ ``` Where: -* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes. +* _total_: the total allocatable resources of a node. Initially, the resources +at a given scope will bound the resources of the sum of inner scopes. #### Notes - * It is an error to specify the same resource type more than once in each list. + * It is an error to specify the same resource type more than once in each +list. - * It is an error for the _request_ or _limit_ values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.) + * It is an error for the _request_ or _limit_ values for a pod to be less than +the sum of the (explicit or defaulted) values for the containers it encloses. +(We may relax this later.) 
- * If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a _request_, it's probably a good idea. (Note that the _request_ could be filled in by an automated system that is observing actual usage and/or historical data.)
+ * If multiple pods are running on the same node and attempting to use more
+resources than they have requested, the result is implementation-defined. For
+example: unallocated or unused resources might be spread equally across
+claimants, or the assignment might be weighted by the size of the original
+request, or as a function of limits, or priority, or the phase of the moon,
+perhaps modulated by the direction of the tide. Thus, although it's not
+mandatory to provide a _request_, it's probably a good idea. (Note that the
+_request_ could be filled in by an automated system that is observing actual
+usage and/or historical data.)

- * Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expected an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet.
+ * Internally, the Kubernetes master can decide the defaulting behavior and the
+kubelet implementation may expect an absolute specification. For example, if
+the master decided that "the default is unbounded" it would pass 2^64 to the
+kubelet.

## Kubernetes-defined resource types

-The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources.
Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet. +The following resource types are predefined ("reserved") by Kubernetes in the +`kubernetes.io` namespace, and so cannot be used for user-defined resources. +Note that the syntax of all resource types in the resource spec is deliberately +similar, but some resource types (e.g., CPU) may receive significantly more +support than simply tracking quantities in the schedulers and/or the Kubelet. ### Processor cycles * Name: `cpu` (or `kubernetes.io/cpu`) - * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU") + * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to +a canonical "Kubernetes CPU") * Internal representation: milli-KCUs * Compressible? yes - * Qualities: this is a placeholder for the kind of thing that may be supported in the future — see [#147](http://issue.k8s.io/147) + * Qualities: this is a placeholder for the kind of thing that may be supported +in the future — see [#147](http://issue.k8s.io/147) * [future] `schedulingLatency`: as per lmctfy - * [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0). + * [future] `cpuConversionFactor`: property of a node: the speed of a CPU +core on the node's processor divided by the speed of the canonical Kubernetes +CPU (a floating point value; default = 1.0). -To reduce performance portability problems for pods, and to avoid worse-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. 
The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.
+To reduce performance portability problems for pods, and to avoid worst-case
+provisioning behavior, the units of CPU will be normalized to a canonical
+"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be
+equivalent to a single CPU hyperthreaded core for some recent x86 processor. The
+normalization may be implementation-defined, although some reasonable defaults
+will be provided in the open-source Kubernetes code.

-Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource _qualities_ (a future feature).
+Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
+be allocated — control of aspects like this will be handled by resource
+_qualities_ (a future feature).

### Memory

@@ -149,15 +261,18 @@ Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will

 * Units: bytes
 * Compressible? no (at least initially)

-The precise meaning of what "memory" means is implementation dependent, but the basic idea is to rely on the underlying `memcg` mechanisms, support, and definitions.
+The precise meaning of "memory" is implementation dependent, but the basic
+idea is to rely on the underlying `memcg` mechanisms, support, and definitions.

-Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities
-rather than decimal ones: "64MiB" rather than "64MB".
+Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory
+quantities rather than decimal ones: "64MiB" rather than "64MB".

## Resource metadata

-A resource type may have an associated read-only ResourceType structure, that contains metadata about the type.
For example:
+A resource type may have an associated read-only ResourceType structure that
+contains metadata about the type. For example:

```yaml
resourceTypes: [
@@ -172,7 +287,10 @@ resourceTypes: [
]
```

-Kubernetes will provide ResourceType metadata for its predefined types. If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value.
+Kubernetes will provide ResourceType metadata for its predefined types. If no
+resource metadata can be found for a resource type, Kubernetes will assume that
+it is a quantified, incompressible resource that is not specified in
+milli-units, and has no default value.

The defined properties are as follows:

@@ -188,13 +306,21 @@ The defined properties are as follows:

# Appendix: future extensions

-The following are planned future extensions to the resource model, included here to encourage comments.
+The following are planned future extensions to the resource model, included here
+to encourage comments.

## Usage data

-Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs for accessing and managing that data.
+Because resource usage and related metrics change continuously, need to be
+tracked over time (i.e., historically), can be characterized in a variety of
+ways, and are fairly voluminous, we will not include usage in core API objects,
+such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs
+for accessing and managing that data.
See the Appendix for possible
+representations of usage data, but the representation we'll use is TBD.

-Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:
+Singleton values for observed and predicted future usage will rapidly prove
+inadequate, so we will support the following structure for extended usage
+information:

```yaml
resourceStatus: [
@@ -222,8 +348,12 @@ where a `` or `` structure looks like this:
}
```

-All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_
-and predicted
+All parts of this structure are optional, although we strongly encourage
+including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles.
+_[In practice, it will be important to include additional info such as the
+length of the time window over which the averages are calculated, the
+confidence level, and information-quality metrics such as the number of dropped
+or discarded data points.]_

## Future resource types

@@ -245,7 +375,10 @@ and predicted

 * Units: bytes
 * Compressible? no

-The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes.
+The amount of secondary storage space available to a container. The main target
+is local disk drives and SSDs, although this could also be used to qualify
+remotely-mounted volumes.
Specifying whether a resource is a raw disk, an SSD, a +disk array, or a file system fronting any of these, is left for future work. ### _[future] Storage time_ @@ -254,7 +387,9 @@ The amount of secondary storage space available to a container. The main target * Internal representation: milli-units * Compressible? yes -This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second. +This is the amount of time a container spends accessing disk, including actuator +and transfer time. A standard disk drive provides 1.0 diskTime seconds per +second. ### _[future] Storage operations_ diff --git a/scheduler_extender.md b/scheduler_extender.md index 8612c39c..e8ad718f 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -34,11 +34,26 @@ Documentation for other releases can be found at # Scheduler extender -There are three ways to add new scheduling rules (predicates and priority functions) to Kubernetes: (1) by adding these rules to the scheduler and recompiling (described here: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md), (2) implementing your own scheduler process that runs instead of, or alongside of, the standard Kubernetes scheduler, (3) implementing a "scheduler extender" process that the standard Kubernetes scheduler calls out to as a final pass when making scheduling decisions. - -This document describes the third approach. This approach is needed for use cases where scheduling decisions need to be made on resources not directly managed by the standard Kubernetes scheduler. The extender helps make scheduling decisions based on such resources. (Note that the three approaches are not mutually exclusive.) - -When scheduling a pod, the extender allows an external process to filter and prioritize nodes. Two separate http/https calls are issued to the extender, one for "filter" and one for "prioritize" actions. 
To use the extender, you must create a scheduler policy configuration file. The configuration specifies how to reach the extender, whether to use http or https and the timeout.
+There are three ways to add new scheduling rules (predicates and priority
+functions) to Kubernetes: (1) by adding these rules to the scheduler and
+recompiling (described here:
+https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md),
+(2) implementing your own scheduler process that runs instead of, or alongside
+of, the standard Kubernetes scheduler, or (3) implementing a "scheduler extender"
+process that the standard Kubernetes scheduler calls out to as a final pass when
+making scheduling decisions.
+
+This document describes the third approach. This approach is needed for use
+cases where scheduling decisions need to be made on resources not directly
+managed by the standard Kubernetes scheduler. The extender helps make scheduling
+decisions based on such resources. (Note that the three approaches are not
+mutually exclusive.)
+
+When scheduling a pod, the extender allows an external process to filter and
+prioritize nodes. Two separate http/https calls are issued to the extender, one
+for "filter" and one for "prioritize" actions. To use the extender, you must
+create a scheduler policy configuration file. The configuration specifies how to
+reach the extender, whether to use http or https, and the timeout.
+Arguments passed to the FilterVerb endpoint on the extender are the set of nodes +filtered through the k8s predicates and the pod. Arguments passed to the +PrioritizeVerb endpoint on the extender are the set of nodes filtered through +the k8s predicates and extender predicates and the pod. ```go // ExtenderArgs represents the arguments needed by the extender to filter/prioritize @@ -107,9 +125,12 @@ type ExtenderArgs struct { } ``` -The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call returns priorities for each node (schedulerapi.HostPriorityList). +The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call +returns priorities for each node (schedulerapi.HostPriorityList). -The "filter" call may prune the set of nodes based on its predicates. Scores returned by the "prioritize" call are added to the k8s scores (computed through its priority functions) and used for final host selection. +The "filter" call may prune the set of nodes based on its predicates. Scores +returned by the "prioritize" call are added to the k8s scores (computed through +its priority functions) and used for final host selection. Multiple extenders can be configured in the scheduler policy. diff --git a/secrets.md b/secrets.md index a403ce4f..b1b83106 100644 --- a/secrets.md +++ b/secrets.md @@ -34,15 +34,17 @@ Documentation for other releases can be found at ## Abstract -A proposal for the distribution of [secrets](../user-guide/secrets.md) (passwords, keys, etc) to the Kubelet and to -containers inside Kubernetes using a custom [volume](../user-guide/volumes.md#secrets) type. See the [secrets example](../user-guide/secrets/) for more information. +A proposal for the distribution of [secrets](../user-guide/secrets.md) +(passwords, keys, etc) to the Kubelet and to containers inside Kubernetes using +a custom [volume](../user-guide/volumes.md#secrets) type. See the +[secrets example](../user-guide/secrets/) for more information. 
## Motivation -Secrets are needed in containers to access internal resources like the Kubernetes master or -external resources such as git repositories, databases, etc. Users may also want behaviors in the -kubelet that depend on secret data (credentials for image pull from a docker registry) associated -with pods. +Secrets are needed in containers to access internal resources like the +Kubernetes master or external resources such as git repositories, databases, +etc. Users may also want behaviors in the kubelet that depend on secret data +(credentials for image pull from a docker registry) associated with pods. Goals of this design: @@ -52,114 +54,127 @@ Goals of this design: ## Constraints and Assumptions -* This design does not prescribe a method for storing secrets; storage of secrets should be - pluggable to accommodate different use-cases +* This design does not prescribe a method for storing secrets; storage of +secrets should be pluggable to accommodate different use-cases * Encryption of secret data and node security are orthogonal concerns -* It is assumed that node and master are secure and that compromising their security could also - compromise secrets: - * If a node is compromised, the only secrets that could potentially be exposed should be the - secrets belonging to containers scheduled onto it +* It is assumed that node and master are secure and that compromising their +security could also compromise secrets: + * If a node is compromised, the only secrets that could potentially be +exposed should be the secrets belonging to containers scheduled onto it * If the master is compromised, all secrets in the cluster may be exposed -* Secret rotation is an orthogonal concern, but it should be facilitated by this proposal -* A user who can consume a secret in a container can know the value of the secret; secrets must - be provisioned judiciously +* Secret rotation is an orthogonal concern, but it should be facilitated by +this proposal +* A user who can 
consume a secret in a container can know the value of the +secret; secrets must be provisioned judiciously ## Use Cases -1. As a user, I want to store secret artifacts for my applications and consume them securely in - containers, so that I can keep the configuration for my applications separate from the images - that use them: - 1. As a cluster operator, I want to allow a pod to access the Kubernetes master using a custom - `.kubeconfig` file, so that I can securely reach the master - 2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials - from a `.dockercfg` file, so that containers can push images - 3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys, - so that I can push to and fetch from the repository -2. As a user, I want to allow containers to consume supplemental information about services such - as username and password which should be kept secret, so that I can share secrets about a - service amongst the containers in my application securely -3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a secret and have - the kubelet implement some reserved behaviors based on the types of secrets the service account - consumes: +1. As a user, I want to store secret artifacts for my applications and consume +them securely in containers, so that I can keep the configuration for my +applications separate from the images that use them: + 1. As a cluster operator, I want to allow a pod to access the Kubernetes +master using a custom `.kubeconfig` file, so that I can securely reach the +master + 2. As a cluster operator, I want to allow a pod to access a Docker registry +using credentials from a `.dockercfg` file, so that containers can push images + 3. As a cluster operator, I want to allow a pod to access a git repository +using SSH keys, so that I can push to and fetch from the repository +2. 
As a user, I want to allow containers to consume supplemental information
+about services such as username and password which should be kept secret, so
+that I can share secrets about a service amongst the containers in my
+application securely
+3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a
+secret and have the kubelet implement some reserved behaviors based on the types
+of secrets the service account consumes:
    1. Use credentials for a docker registry to pull the pod's docker image
-   2. Present Kubernetes auth token to the pod or transparently decorate traffic between the pod
-      and master service
-4. As a user, I want to be able to indicate that a secret expires and for that secret's value to
-   be rotated once it expires, so that the system can help me follow good practices
+   2. Present Kubernetes auth token to the pod or transparently decorate
+traffic between the pod and master service
+4. As a user, I want to be able to indicate that a secret expires and for that
+secret's value to be rotated once it expires, so that the system can help me
+follow good practices

### Use-Case: Configuration artifacts

-Many configuration files contain secrets intermixed with other configuration information. For
-example, a user's application may contain a properties file than contains database credentials,
-SaaS API tokens, etc. Users should be able to consume configuration artifacts in their containers
-and be able to control the path on the container's filesystems where the artifact will be
-presented.
+Many configuration files contain secrets intermixed with other configuration
+information. For example, a user's application may contain a properties file
+that contains database credentials, SaaS API tokens, etc. Users should be able
+to consume configuration artifacts in their containers and be able to control
+the path on the container's filesystems where the artifact will be presented.
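As a rough sketch of this use case, the pod definition below mounts a secret-backed volume at a path the user chooses. The field names (`secretName`, `mountPath`, the `secret` volume source) are illustrative only and are not the schema defined by this proposal:

```yaml
# Illustrative sketch only: a pod consuming a secret-backed volume at a
# user-chosen path. Field names here are hypothetical, not this proposal's API.
kind: Pod
id: app-pod
desiredState:
  manifest:
    containers:
      - name: app
        image: example/app
        volumeMounts:
          - name: db-creds
            mountPath: /etc/secrets   # the user controls where the artifact appears
    volumes:
      - name: db-creds
        source:
          secret:
            secretName: db-credentials
```

The container would then read, e.g., `/etc/secrets/properties` as an ordinary file, keeping credentials out of the image itself.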
### Use-Case: Metadata about services -Most pieces of information about how to use a service are secrets. For example, a service that -provides a MySQL database needs to provide the username, password, and database name to consumers -so that they can authenticate and use the correct database. Containers in pods consuming the MySQL -service would also consume the secrets associated with the MySQL service. +Most pieces of information about how to use a service are secrets. For example, +a service that provides a MySQL database needs to provide the username, +password, and database name to consumers so that they can authenticate and use +the correct database. Containers in pods consuming the MySQL service would also +consume the secrets associated with the MySQL service. ### Use-Case: Secrets associated with service accounts -[Service Accounts](service_accounts.md) are proposed as a -mechanism to decouple capabilities and security contexts from individual human users. A -`ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is -associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and -other system components to take action based on the secret's type. +[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple +capabilities and security contexts from individual human users. A +`ServiceAccount` contains references to some number of secrets. A `Pod` can +specify that it is associated with a `ServiceAccount`. Secrets should have a +`Type` field to allow the Kubelet and other system components to take action +based on the secret's type. #### Example: service account consumes auth token secret -As an example, the service account proposal discusses service accounts consuming secrets which -contain Kubernetes auth tokens. 
When a Kubelet starts a pod associated with a service account
-which consumes this type of secret, the Kubelet may take a number of actions:
+As an example, the service account proposal discusses service accounts consuming
+secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod
+associated with a service account which consumes this type of secret, the
+Kubelet may take a number of actions:

-1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's
-   file system
-2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the
-   `kubernetes-master` service with the auth token, e. g. by adding a header to the request
-   (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)
+1. Expose the secret in a `.kubernetes_auth` file in a well-known location in
+the container's file system
+2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod
+to the `kubernetes-master` service with the auth token, e.g. by adding a header
+to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)

#### Example: service account consumes docker registry credentials

-Another example use case is where a pod is associated with a secret containing docker registry
-credentials. The Kubelet could use these credentials for the docker pull to retrieve the image.
+Another example use case is where a pod is associated with a secret containing
+docker registry credentials. The Kubelet could use these credentials for the
+docker pull to retrieve the image.

### Use-Case: Secret expiry and rotation

-Rotation is considered a good practice for many types of secret data. It should be possible to
-express that a secret has an expiry date; this would make it possible to implement a system
-component that could regenerate expired secrets. As an example, consider a component that rotates
-expired secrets.
The rotator could periodically regenerate the values for expired secrets of -common types and update their expiry dates. +Rotation is considered a good practice for many types of secret data. It should +be possible to express that a secret has an expiry date; this would make it +possible to implement a system component that could regenerate expired secrets. +As an example, consider a component that rotates expired secrets. The rotator +could periodically regenerate the values for expired secrets of common types and +update their expiry dates. ## Deferral: Consuming secrets as environment variables -Some images will expect to receive configuration items as environment variables instead of files. -We should consider what the best way to allow this is; there are a few different options: +Some images will expect to receive configuration items as environment variables +instead of files. We should consider what the best way to allow this is; there +are a few different options: -1. Force the user to adapt files into environment variables. Users can store secrets that need to - be presented as environment variables in a format that is easy to consume from a shell: +1. Force the user to adapt files into environment variables. Users can store +secrets that need to be presented as environment variables in a format that is +easy to consume from a shell: $ cat /etc/secrets/my-secret.txt export MY_SECRET_ENV=MY_SECRET_VALUE - The user could `source` the file at `/etc/secrets/my-secret` prior to executing the command for - the image either inline in the command or in an init script, + The user could `source` the file at `/etc/secrets/my-secret` prior to +executing the command for the image either inline in the command or in an init +script. -2. Give secrets an attribute that allows users to express the intent that the platform should - generate the above syntax in the file used to present a secret. The user could consume these - files in the same manner as the above option. +2. 
Give secrets an attribute that allows users to express the intent that the +platform should generate the above syntax in the file used to present a secret. +The user could consume these files in the same manner as the above option. -3. Give secrets attributes that allow the user to express that the secret should be presented to - the container as an environment variable. The container's environment would contain the - desired values and the software in the container could use them without accommodation the - command or setup script. +3. Give secrets attributes that allow the user to express that the secret +should be presented to the container as an environment variable. The container's +environment would contain the desired values and the software in the container +could use them without changes to the command or setup script. -For our initial work, we will treat all secrets as files to narrow the problem space. There will -be a future proposal that handles exposing secrets as environment variables. +For our initial work, we will treat all secrets as files to narrow the problem +space. There will be a future proposal that handles exposing secrets as +environment variables. ## Flow analysis of secret data with respect to the API server @@ -170,17 +185,19 @@ There are two fundamentally different use-cases for access to secrets: ### Use-Case: CRUD operations by owners -In use cases for CRUD operations, the user experience for secrets should be no different than for -other API resources. +In use cases for CRUD operations, the user experience for secrets should be no +different than for other API resources. #### Data store backing the REST API -The data store backing the REST API should be pluggable because different cluster operators will -have different preferences for the central store of secret data. 
Some possibilities for storage: +The data store backing the REST API should be pluggable because different +cluster operators will have different preferences for the central store of +secret data. Some possibilities for storage: 1. An etcd collection alongside the storage for other API resources 2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module) -3. A secrets server like [Vault](https://www.vaultproject.io/) or [Keywhiz](https://square.github.io/keywhiz/) +3. A secrets server like [Vault](https://www.vaultproject.io/) or +[Keywhiz](https://square.github.io/keywhiz/) 4. An external datastore such as an external etcd, RDBMS, etc. #### Size limit for secrets @@ -188,101 +205,116 @@ have different preferences for the central store of secret data. Some possibili There should be a size limit for secrets in order to: 1. Prevent DOS attacks against the API server -2. Allow kubelet implementations that prevent secret data from touching the node's filesystem +2. Allow kubelet implementations that prevent secret data from touching the +node's filesystem The size limit should satisfy the following conditions: -1. Large enough to store common artifact types (encryption keypairs, certificates, small - configuration files) -2. Small enough to avoid large impact on node resource consumption (storage, RAM for tmpfs, etc) +1. Large enough to store common artifact types (encryption keypairs, +certificates, small configuration files) +2. Small enough to avoid large impact on node resource consumption (storage, +RAM for tmpfs, etc) To begin discussion, we propose an initial value for this size limit of **1MB**. 
#### Other limitations on secrets -Defining a policy for limitations on how a secret may be referenced by another API resource and how -constraints should be applied throughout the cluster is tricky due to the number of variables -involved: +Defining a policy for limitations on how a secret may be referenced by another +API resource and how constraints should be applied throughout the cluster is +tricky due to the number of variables involved: -1. Should there be a maximum number of secrets a pod can reference via a volume? +1. Should there be a maximum number of secrets a pod can reference via a +volume? 2. Should there be a maximum number of secrets a service account can reference? -3. Should there be a total maximum number of secrets a pod can reference via its own spec and its - associated service account? -4. Should there be a total size limit on the amount of secret data consumed by a pod? +3. Should there be a total maximum number of secrets a pod can reference via +its own spec and its associated service account? +4. Should there be a total size limit on the amount of secret data consumed by +a pod? 5. How will cluster operators want to be able to configure these limits? 6. How will these limits impact API server validations? 7. How will these limits affect scheduling? -For now, we will not implement validations around these limits. Cluster operators will decide how -much node storage is allocated to secrets. It will be the operator's responsibility to ensure that -the allocated storage is sufficient for the workload scheduled onto a node. +For now, we will not implement validations around these limits. Cluster +operators will decide how much node storage is allocated to secrets. It will be +the operator's responsibility to ensure that the allocated storage is sufficient +for the workload scheduled onto a node. -For now, kubelets will only attach secrets to api-sourced pods, and not file- or http-sourced -ones. 
Doing so would: +For now, kubelets will only attach secrets to api-sourced pods, and not file- +or http-sourced ones. Doing so would: - confuse the secrets admission controller in the case of mirror pods. - - create an apiserver-liveness dependency -- avoiding this dependency is a main reason to use non-api-source pods. + - create an apiserver-liveness dependency -- avoiding this dependency is a +main reason to use non-api-source pods. ### Use-Case: Kubelet read of secrets for node The use-case where the kubelet reads secrets has several additional requirements: -1. Kubelets should only be able to receive secret data which is required by pods scheduled onto - the kubelet's node +1. Kubelets should only be able to receive secret data which is required by +pods scheduled onto the kubelet's node 2. Kubelets should have read-only access to secret data 3. Secret data should not be transmitted over the wire insecurely 4. Kubelets must ensure pods do not have access to each other's secrets #### Read of secret data by the Kubelet -The Kubelet should only be allowed to read secrets which are consumed by pods scheduled onto that -Kubelet's node and their associated service accounts. Authorization of the Kubelet to read this -data would be delegated to an authorization plugin and associated policy rule. +The Kubelet should only be allowed to read secrets which are consumed by pods +scheduled onto that Kubelet's node and their associated service accounts. +Authorization of the Kubelet to read this data would be delegated to an +authorization plugin and associated policy rule. #### Secret data on the node: data at rest -Consideration must be given to whether secret data should be allowed to be at rest on the node: +Consideration must be given to whether secret data should be allowed to be at +rest on the node: -1. If secret data is not allowed to be at rest, the size of secret data becomes another draw on - the node's RAM - should it affect scheduling? +1. 
If secret data is not allowed to be at rest, the size of secret data becomes +another draw on the node's RAM - should it affect scheduling? 2. If secret data is allowed to be at rest, should it be encrypted? 1. If so, how should this be done? - 2. If not, what threats exist? What types of secret are appropriate to store this way? + 2. If not, what threats exist? What types of secret are appropriate to +store this way? -For the sake of limiting complexity, we propose that initially secret data should not be allowed -to be at rest on a node; secret data should be stored on a node-level tmpfs filesystem. This -filesystem can be subdivided into directories for use by the kubelet and by the volume plugin. +For the sake of limiting complexity, we propose that initially secret data +should not be allowed to be at rest on a node; secret data should be stored on a +node-level tmpfs filesystem. This filesystem can be subdivided into directories +for use by the kubelet and by the volume plugin. #### Secret data on the node: resource consumption -The Kubelet will be responsible for creating the per-node tmpfs file system for secret storage. -It is hard to make a prescriptive declaration about how much storage is appropriate to reserve for -secrets because different installations will vary widely in available resources, desired pod to -node density, overcommit policy, and other operation dimensions. That being the case, we propose -for simplicity that the amount of secret storage be controlled by a new parameter to the kubelet -with a default value of **64MB**. It is the cluster operator's responsibility to handle choosing -the right storage size for their installation and configuring their Kubelets correctly. - -Configuring each Kubelet is not the ideal story for operator experience; it is more intuitive that -the cluster-wide storage size be readable from a central configuration store like the one proposed -in [#1553](http://issue.k8s.io/1553). 
When such a store -exists, the Kubelet could be modified to read this configuration item from the store. +The Kubelet will be responsible for creating the per-node tmpfs file system for +secret storage. It is hard to make a prescriptive declaration about how much +storage is appropriate to reserve for secrets because different installations +will vary widely in available resources, desired pod to node density, overcommit +policy, and other operational dimensions. That being the case, we propose for +simplicity that the amount of secret storage be controlled by a new parameter to +the kubelet with a default value of **64MB**. It is the cluster operator's +responsibility to choose the right storage size for their installation and to +configure their Kubelets correctly. + +Configuring each Kubelet is not the ideal story for operator experience; it is +more intuitive that the cluster-wide storage size be readable from a central +configuration store like the one proposed in [#1553](http://issue.k8s.io/1553). +When such a store exists, the Kubelet could be modified to read this +configuration item from the store. When the Kubelet is modified to advertise node resources (as proposed in [#4441](http://issue.k8s.io/4441)), the capacity calculation -for available memory should factor in the potential size of the node-level tmpfs in order to avoid -memory overcommit on the node. +for available memory should factor in the potential size of the node-level tmpfs +in order to avoid memory overcommit on the node. #### Secret data on the node: isolation Every pod will have a [security context](security_context.md). -Secret data on the node should be isolated according to the security context of the container. The -Kubelet volume plugin API will be changed so that a volume plugin receives the security context of -a volume along with the volume spec. This will allow volume plugins to implement setting the -security context of volumes they manage. 
+Secret data on the node should be isolated according to the security context of +the container. The Kubelet volume plugin API will be changed so that a volume +plugin receives the security context of a volume along with the volume spec. +This will allow volume plugins to implement setting the security context of +volumes they manage. ## Community work -Several proposals / upstream patches are notable as background for this proposal: +Several proposals / upstream patches are notable as background for this +proposal: 1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) 2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) @@ -292,14 +324,15 @@ Several proposals / upstream patches are notable as background for this proposal ## Proposed Design -We propose a new `Secret` resource which is mounted into containers with a new volume type. Secret -volumes will be handled by a volume plugin that does the actual work of fetching the secret and -storing it. Secrets contain multiple pieces of data that are presented as different files within -the secret volume (example: SSH key pair). +We propose a new `Secret` resource which is mounted into containers with a new +volume type. Secret volumes will be handled by a volume plugin that does the +actual work of fetching the secret and storing it. Secrets contain multiple +pieces of data that are presented as different files within the secret volume +(example: SSH key pair). -In order to remove the burden from the end user in specifying every file that a secret consists of, -it should be possible to mount all files provided by a secret with a single `VolumeMount` entry -in the container specification. +In order to remove the burden from the end user in specifying every file that a +secret consists of, it should be possible to mount all files provided by a +secret with a single `VolumeMount` entry in the container specification. 
### Secret API Resource @@ -331,27 +364,30 @@ const ( const MaxSecretSize = 1 * 1024 * 1024 ``` -A Secret can declare a type in order to provide type information to system components that work -with secrets. The default type is `opaque`, which represents arbitrary user-owned data. +A Secret can declare a type in order to provide type information to system +components that work with secrets. The default type is `opaque`, which +represents arbitrary user-owned data. -Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must be valid DNS -subdomains. +Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must +be valid DNS subdomains. -A new REST API and registry interface will be added to accompany the `Secret` resource. The -default implementation of the registry will store `Secret` information in etcd. Future registry -implementations could store the `TypeMeta` and `ObjectMeta` fields in etcd and store the secret -data in another data store entirely, or store the whole object in another data store. +A new REST API and registry interface will be added to accompany the `Secret` +resource. The default implementation of the registry will store `Secret` +information in etcd. Future registry implementations could store the `TypeMeta` +and `ObjectMeta` fields in etcd and store the secret data in another data store +entirely, or store the whole object in another data store. #### Other validations related to secrets -Initially there will be no validations for the number of secrets a pod references, or the number of -secrets that can be associated with a service account. These may be added in the future as the -finer points of secrets and resource allocation are fleshed out. +Initially there will be no validations for the number of secrets a pod +references, or the number of secrets that can be associated with a service +account. 
These may be added in the future as the finer points of secrets and +resource allocation are fleshed out. ### Secret Volume Source -A new `SecretSource` type of volume source will be added to the `VolumeSource` struct in the -API: +A new `SecretSource` type of volume source will be added to the `VolumeSource` +struct in the API: ```go type VolumeSource struct { @@ -366,19 +402,21 @@ type SecretSource struct { } ``` -Secret volume sources are validated to ensure that the specified object reference actually points -to an object of type `Secret`. +Secret volume sources are validated to ensure that the specified object +reference actually points to an object of type `Secret`. In the future, the `SecretSource` will be extended to allow: -1. Fine-grained control over which pieces of secret data are exposed in the volume +1. Fine-grained control over which pieces of secret data are exposed in the +volume 2. The paths and filenames for how secret data are exposed ### Secret Volume Plugin -A new Kubelet volume plugin will be added to handle volumes with a secret source. This plugin will -require access to the API server to retrieve secret data and therefore the volume `Host` interface -will have to change to expose a client interface: +A new Kubelet volume plugin will be added to handle volumes with a secret +source. This plugin will require access to the API server to retrieve secret +data and therefore the volume `Host` interface will have to change to expose a +client interface: ```go type Host interface { @@ -394,36 +432,42 @@ The secret volume plugin will be responsible for: 1. Returning a `volume.Mounter` implementation from `NewMounter` that: 1. Retrieves the secret data for the volume from the API server 2. Places the secret data onto the container's filesystem - 3. Sets the correct security attributes for the volume based on the pod's `SecurityContext` -2. 
Returning a `volume.Unmounter` implementation from `NewUnmounter` that cleans the volume from the - container's filesystem + 3. Sets the correct security attributes for the volume based on the pod's +`SecurityContext` +2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that +cleans the volume from the container's filesystem ### Kubelet: Node-level secret storage -The Kubelet must be modified to accept a new parameter for the secret storage size and to create -a tmpfs file system of that size to store secret data. Rough accounting of specific changes: +The Kubelet must be modified to accept a new parameter for the secret storage +size and to create a tmpfs file system of that size to store secret data. Rough +accounting of specific changes: -1. The Kubelet should have a new field added called `secretStorageSize`; units are megabytes +1. The Kubelet should have a new field added called `secretStorageSize`; units +are megabytes 2. `NewMainKubelet` should accept a value for secret storage size 3. The Kubelet server should have a new flag added for secret storage size -4. The Kubelet's `setupDataDirs` method should be changed to create the secret storage +4. The Kubelet's `setupDataDirs` method should be changed to create the secret +storage ### Kubelet: New behaviors for secrets associated with service accounts -For use-cases where the Kubelet's behavior is affected by the secrets associated with a pod's -`ServiceAccount`, the Kubelet will need to be changed. For example, if secrets of type -`docker-reg-auth` affect how the pod's images are pulled, the Kubelet will need to be changed -to accommodate this. Subsequent proposals can address this on a type-by-type basis. +For use-cases where the Kubelet's behavior is affected by the secrets associated +with a pod's `ServiceAccount`, the Kubelet will need to be changed. 
For example, +if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the +Kubelet will need to be changed to accommodate this. Subsequent proposals can +address this on a type-by-type basis. ## Examples -For clarity, let's examine some detailed examples of some common use-cases in terms of the -suggested changes. All of these examples are assumed to be created in a namespace called -`example`. +For clarity, let's examine some detailed examples of some common use-cases in +terms of the suggested changes. All of these examples are assumed to be created +in a namespace called `example`. ### Use-Case: Pod with ssh keys -To create a pod that uses an ssh key stored as a secret, we first need to create a secret: +To create a pod that uses an ssh key stored as a secret, we first need to create +a secret: ```json { @@ -443,7 +487,8 @@ To create a pod that uses an ssh key stored as a secret, we first need to create base64 strings. Newlines are not valid within these strings and must be omitted. -Now we can create a pod which references the secret with the ssh key and consumes it in a volume: +Now we can create a pod which references the secret with the ssh key and +consumes it in a volume: ```json { @@ -486,7 +531,8 @@ When the container's command runs, the pieces of the key will be available in: /etc/secret-volume/id-rsa.pub /etc/secret-volume/id-rsa -The container is then free to use the secret data to establish an ssh connection. +The container is then free to use the secret data to establish an ssh +connection. ### Use-Case: Pods with pod / test credentials @@ -602,8 +648,9 @@ The pods: } ``` -The specs for the two pods differ only in the value of the object referred to by the secret volume -source. Both containers will have the following files present on their filesystems: +The specs for the two pods differ only in the value of the object referred to by +the secret volume source. 
Both containers will have the following files present +on their filesystems: /etc/secret-volume/username /etc/secret-volume/password diff --git a/security.md b/security.md index b9c7942a..06bb3979 100644 --- a/security.md +++ b/security.md @@ -34,37 +34,57 @@ Documentation for other releases can be found at # Security in Kubernetes -Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other, from the cluster infrastructure, and which preserves important boundaries between those who manage the cluster, and those who use the cluster. +Kubernetes should define a reasonable set of security best practices that allows +processes to be isolated from each other, from the cluster infrastructure, and +which preserves important boundaries between those who manage the cluster, and +those who use the cluster. -While Kubernetes today is not primarily a multi-tenant system, the long term evolution of Kubernetes will increasingly rely on proper boundaries between users and administrators. The code running on the cluster must be appropriately isolated and secured to prevent malicious parties from affecting the entire cluster. +While Kubernetes today is not primarily a multi-tenant system, the long term +evolution of Kubernetes will increasingly rely on proper boundaries between +users and administrators. The code running on the cluster must be appropriately +isolated and secured to prevent malicious parties from affecting the entire +cluster. ## High Level Goals -1. Ensure a clear isolation between the container and the underlying host it runs on -2. Limit the ability of the container to negatively impact the infrastructure or other containers -3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components -4. 
Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components +1. Ensure a clear isolation between the container and the underlying host it +runs on +2. Limit the ability of the container to negatively impact the infrastructure +or other containers +3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - +ensure components are only authorized to perform the actions they need, and +limit the scope of a compromise by limiting the capabilities of individual +components +4. Reduce the number of systems that have to be hardened and secured by +defining clear boundaries between components 5. Allow users of the system to be cleanly separated from administrators 6. Allow administrative functions to be delegated to users where necessary -7. Allow applications to be run on the cluster that have "secret" data (keys, certs, passwords) which is properly abstracted from "public" data. - +7. Allow applications to be run on the cluster that have "secret" data (keys, +certs, passwords) which is properly abstracted from "public" data. ## Use cases ### Roles -We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories: +We define "user" as a unique identity accessing the Kubernetes API server, which +may be a human or an automated process. Human users fall into the following +categories: -1. k8s admin - administers a Kubernetes cluster and has access to the underlying components of the system -2. k8s project administrator - administrates the security of a small subset of the cluster -3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster resources +1. k8s admin - administers a Kubernetes cluster and has access to the underlying +components of the system +2. k8s project administrator - administrates the security of a small subset of +the cluster +3. 
k8s developer - launches pods on a Kubernetes cluster and consumes cluster +resources Automated process users fall into the following categories: -1. k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources independent of the human users attached to a project -2. k8s infrastructure user - the user that Kubernetes infrastructure components use to perform cluster functions with clearly defined roles - +1. k8s container user - a user that processes running inside a container (on the +cluster) can use to access other cluster resources independent of the human +users attached to a project +2. k8s infrastructure user - the user that Kubernetes infrastructure components +use to perform cluster functions with clearly defined roles ### Description of roles @@ -73,9 +93,11 @@ Automated process users fall into the following categories: * making some of their own images, and using some "community" docker images * know which pods need to talk to which other pods * decide which pods should share files with other pods, and which should not. - * reason about application level security, such as containing the effects of a local-file-read exploit in a webserver pod. + * reason about application level security, such as containing the effects of a +local-file-read exploit in a webserver pod. * do not often reason about operating system or organizational security. - * are not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. + * are not necessarily comfortable reasoning about the security properties of a +system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. 
* Project Admins: * allocate identity and roles within a namespace @@ -85,44 +107,81 @@ Automated process users fall into the following categories: * are less focused about application security * Administrators: - * are less focused on application security. Focused on operating system security. - * protect the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers. - * comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc. - * decides who can use which Linux Capabilities, run privileged containers, use hostPath, etc. - * e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not. + * are less focused on application security. Focused on operating system +security. + * protect the node from bad actors in containers, and properly-configured +innocent containers from bad actors in other containers. + * comfortable reasoning about the security properties of a system at the level +of detail of Linux Capabilities, SELinux, AppArmor, etc. + * decides who can use which Linux Capabilities, run privileged containers, use +hostPath, etc. + * e.g. a team that manages Ceph or a mysql server might be trusted to have +raw access to storage devices in some organizations, but teams that develop the +applications at higher layers would not. ## Proposed Design -A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to is limited by that *service account*. +A pod runs in a *security context* under a *service account* that is defined by +an administrator or project administrator, and the *secrets* a pod has access to +is limited by that *service account*. 1. 
The API should authenticate and authorize user actions [authn and authz](access.md) -2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API. -3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd) -4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](service_accounts.md) - 1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption - 2. If the user who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk - 3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action -5. When container processes run on the cluster, they should run in a [security context](security_context.md) that isolates those processes via Linux user security, user namespaces, and permissions. - 1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID - 2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID - 3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions - 4. Project administrators should be able to run pods within a namespace under different security contexts, and developers must be able to specify which of the available security contexts they may use - 5. 
Developers should be able to run their own images or images from the community and expect those images to run correctly - 6. Developers may need to ensure their images work within higher security requirements specified by administrators - 7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met. - 8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes -6. Developers should be able to define [secrets](secrets.md) that are automatically added to the containers when pods are run - 1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples: +2. All infrastructure components (kubelets, kube-proxies, controllers, +scheduler) should have an infrastructure user that they can authenticate with +and be authorized to perform only the functions they require against the API. +3. Most infrastructure components should use the API as a way of exchanging data +and changing the system, and only the API should have access to the underlying +data store (etcd) +4. When containers run on the cluster and need to talk to other containers or +the API server, they should be identified and authorized clearly as an +autonomous process via a [service account](service_accounts.md) + 1. If the user who started a long-lived process is removed from access to +the cluster, the process should be able to continue without interruption + 2. If the user who started processes is removed from the cluster, +administrators may wish to terminate their processes in bulk + 3. When containers run with a service account, the user that created / +triggered the service account behavior must be associated with the container's +action +5. 
When container processes run on the cluster, they should run in a +[security context](security_context.md) that isolates those processes via Linux +user security, user namespaces, and permissions. + 1. Administrators should be able to configure the cluster to automatically +confine all container processes to a non-root, randomly assigned UID + 2. Administrators should be able to ensure that container processes within +the same namespace are all assigned the same Unix UID + 3. Administrators should be able to limit which developers and project +administrators have access to higher privilege actions + 4. Project administrators should be able to run pods within a namespace +under different security contexts, and developers must be able to specify which +of the available security contexts they may use + 5. Developers should be able to run their own images or images from the +community and expect those images to run correctly + 6. Developers may need to ensure their images work within higher security +requirements specified by administrators + 7. When available, Linux kernel user namespaces can be used to ensure 5.2 +and 5.4 are met. + 8. When application developers want to share filesystem data via distributed +filesystems, the Unix user IDs on those filesystems must be consistent across +different container processes +6. Developers should be able to define [secrets](secrets.md) that are +automatically added to the containers when pods are run + 1. Secrets are files injected into the container whose values should not be +displayed within a pod. Examples: 1. An SSH private key for git cloning remote data 2. A client certificate for accessing a remote system 3. A private key and certificate for a web server - 4. A .kubeconfig file with embedded cert / token data for accessing the Kubernetes master + 4. A .kubeconfig file with embedded cert / token data for accessing the +Kubernetes master 5. A .dockercfg file for pulling images from a protected registry - 2. 
Developers should be able to define the pod spec so that a secret lands in a specific location - 3. Project administrators should be able to limit developers within a namespace from viewing or modifying secrets (anyone who can launch an arbitrary pod can view secrets) - 4. Secrets are generally not copied from one namespace to another when a developer's application definitions are copied + 2. Developers should be able to define the pod spec so that a secret lands +in a specific location + 3. Project administrators should be able to limit developers within a +namespace from viewing or modifying secrets (anyone who can launch an arbitrary +pod can view secrets) + 4. Secrets are generally not copied from one namespace to another when a +developer's application definitions are copied ### Related design discussion @@ -140,15 +199,52 @@ A pod runs in a *security context* under a *service account* that is defined by ### Isolate the data store from the nodes and supporting infrastructure -Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the custer. - -As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply appropriate authorization and authentication of change requests. In the future, etcd may offer granular access control, but that granularity will require an administrator to understand the schema of the data to properly apply security. An administrator must be able to properly secure Kubernetes at a policy level, rather than at an implementation level, and schema changes over time should not risk unintended security leaks. 
- -Both the Kubelet and Kube Proxy need information related to their specific roles - for the Kubelet, the set of pods it should be running, and for the Proxy, the set of services and endpoints to load balance. The Kubelet also needs to provide information about running pods and historical termination data. The access pattern for both Kubelet and Proxy to load their configuration is an efficient "wait for changes" request over HTTP. It should be possible to limit the Kubelet and Proxy to only access the information they need to perform their roles and no more. - -The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pods attached volumes. - -The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a node in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). 
The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). +Access to the central data store (etcd) in Kubernetes allows an attacker to run +arbitrary containers on hosts, to gain access to any protected information +stored in either volumes or in pods (such as access tokens or shared secrets +provided as environment variables), to intercept and redirect traffic from +running services by inserting middlemen, or to simply delete the entire history +of the cluster. + +As a general principle, access to the central data store should be restricted to +the components that need full control over the system and which can apply +appropriate authorization and authentication of change requests. In the future, +etcd may offer granular access control, but that granularity will require an +administrator to understand the schema of the data to properly apply security. +An administrator must be able to properly secure Kubernetes at a policy level, +rather than at an implementation level, and schema changes over time should not +risk unintended security leaks. + +Both the Kubelet and Kube Proxy need information related to their specific roles - +for the Kubelet, the set of pods it should be running, and for the Proxy, the +set of services and endpoints to load balance. The Kubelet also needs to provide +information about running pods and historical termination data. The access +pattern for both Kubelet and Proxy to load their configuration is an efficient +"wait for changes" request over HTTP. It should be possible to limit the Kubelet +and Proxy to only access the information they need to perform their roles and no +more. + +The controller manager for Replication Controllers and other future controllers +act on behalf of a user via delegation to perform automated maintenance on +Kubernetes resources. 
Their ability to access or modify resource state should be +strictly limited to their intended duties and they should be prevented from +accessing information not pertinent to their role. For example, a replication +controller needs only to create a copy of a known pod configuration, to +determine the running state of an existing pod, or to delete an existing pod +that it created - it does not need to know the contents or current state of a +pod, nor have access to any data in the pod's attached volumes. + +The Kubernetes pod scheduler is responsible for reading data from the pod to fit +it onto a node in the cluster. At a minimum, it needs access to view the ID of a +pod (to craft the binding), its current state, any resource information +necessary to identify placement, and other data relevant to concerns like +anti-affinity, zone or region preference, or custom logic. It does not need the +ability to modify pods or see other resources, only to create bindings. It +should not need the ability to delete bindings unless the scheduler takes +control of relocating components on failed hosts (which could be implemented by +a separate component that can delete bindings but not create them). The +scheduler may need read access to user or project-container information to +determine preferential location (underspecified at this time). diff --git a/security_context.md b/security_context.md index 24a34878..2b7d8b96 100644 --- a/security_context.md +++ b/security_context.md @@ -36,41 +36,59 @@ Documentation for other releases can be found at ## Abstract -A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)): +A security context is a set of constraints that are applied to a container in +order to achieve the following goals (from [security design](security.md)): -1. Ensure a clear isolation between container and the underlying host it runs on -2. 
Limit the ability of the container to negatively impact the infrastructure or other containers +1. Ensure a clear isolation between container and the underlying host it runs +on +2. Limit the ability of the container to negatively impact the infrastructure +or other containers ## Background -The problem of securing containers in Kubernetes has come up [before](http://issue.k8s.io/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface. +The problem of securing containers in Kubernetes has come up +[before](http://issue.k8s.io/398) and the potential problems with container +security are [well known](http://opensource.com/business/14/7/docker-security-selinux). +Although it is not possible to completely isolate Docker containers from their +hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) +make it possible to greatly reduce the attack surface. ## Motivation ### Container isolation -In order to improve container isolation from host and other containers running on the host, containers should only be -granted the access they need to perform their work. To this end it should be possible to take advantage of Docker -features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) +In order to improve container isolation from host and other containers running +on the host, containers should only be granted the access they need to perform +their work. 
To this end it should be possible to take advantage of Docker +features such as the ability to +[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) +and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) to the container process. -Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers. +Support for user namespaces has recently been +[merged](https://github.com/docker/libcontainer/pull/304) into Docker's +libcontainer project and should soon surface in Docker itself. It will make it +possible to assign a range of unprivileged uids and gids from the host to each +container, improving the isolation between host and container and between +containers. ### External integration with shared storage -In order to support external integration with shared storage, processes running in a Kubernetes cluster -should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established. -Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks. +In order to support external integration with shared storage, processes running +in a Kubernetes cluster should be able to be uniquely identified by their Unix +UID, such that a chain of ownership can be established. Processes in pods will +need to have consistent UID/GID/SELinux category labels in order to access +shared disks. ## Constraints and Assumptions -* It is out of the scope of this document to prescribe a specific set - of constraints to isolate containers from their host. Different use cases need different - settings. 
-* The concept of a security context should not be tied to a particular security mechanism or platform - (ie. SELinux, AppArmor) -* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for - [service accounts](service_accounts.md). +* It is out of the scope of this document to prescribe a specific set of +constraints to isolate containers from their host. Different use cases need +different settings. +* The concept of a security context should not be tied to a particular security +mechanism or platform (e.g. SELinux, AppArmor) +* Applying a different security context to a scope (namespace or pod) requires +a solution such as the one proposed for [service accounts](service_accounts.md). ## Use Cases @@ -78,47 +96,51 @@ In order of increasing complexity, following are example use cases that would be addressed with security contexts: 1. Kubernetes is used to run a single cloud application. In order to protect - nodes from containers: +nodes from containers: * All containers run as a single non-root user * Privileged containers are disabled * All containers run with a particular MCS label * Kernel capabilities like CHOWN and MKNOD are removed from containers 2. Just like case #1, except that I have more than one application running on - the Kubernetes cluster. +the Kubernetes cluster. * Each application is run in its own namespace to avoid name collisions * For each application a different uid and MCS label is used -3. Kubernetes is used as the base for a PAAS with - multiple projects, each project represented by a namespace. +3. Kubernetes is used as the base for a PaaS with multiple projects, each +project represented by a namespace. * Each namespace is associated with a range of uids/gids on the node that - are mapped to uids/gids on containers using linux user namespaces. +are mapped to uids/gids on containers using Linux user namespaces. 
* Certain pods in each namespace have special privileges to perform system - actions such as talking back to the server for deployment, run docker - builds, etc. +actions such as talking back to the server for deployment, running docker +builds, etc. * External NFS storage is assigned to each namespace and permissions set - using the range of uids/gids assigned to that namespace. +using the range of uids/gids assigned to that namespace. ## Proposed Design ### Overview -A *security context* consists of a set of constraints that determine how a container -is secured before getting created and run. A security context resides on the container and represents the runtime parameters that will -be used to create and run the container via container APIs. A *security context provider* is passed to the Kubelet so it can have a chance -to mutate Docker API calls in order to apply the security context. +A *security context* consists of a set of constraints that determine how a +container is secured before getting created and run. A security context resides +on the container and represents the runtime parameters that will be used to +create and run the container via container APIs. A *security context provider* +is passed to the Kubelet so it can have a chance to mutate Docker API calls in +order to apply the security context. It is recommended that this design be implemented in two phases: 1. Implement the security context provider extension point in the Kubelet - so that a default security context can be applied on container run and creation. +so that a default security context can be applied on container run and creation. 2. Implement a security context structure that is part of a service account. The - default context provider can then be used to apply a security context based - on the service account associated with the pod. +default context provider can then be used to apply a security context based on +the service account associated with the pod. 
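As a concrete illustration of the phase-1 extension point, the sketch below shows a provider that confines every would-be-root container to a non-root UID from a per-namespace range. This is a hedged, self-contained sketch: only the `SecurityContextProvider` name comes from the design, while the simplified `Container` type, the `ModifyContainerConfig` method, and the `defaultProvider` UID-range scheme are illustrative assumptions (the real Kubelet would operate on actual container runtime config structs).

```go
package main

import "fmt"

// Container is a simplified stand-in for the runtime container config;
// the real design would mutate the Docker API call parameters instead.
type Container struct {
	Name string
	User int64 // UID the container process will run as; 0 means root
}

// SecurityContextProvider mirrors the extension point described above:
// it gets a chance to mutate the container config before creation.
type SecurityContextProvider interface {
	ModifyContainerConfig(c *Container)
}

// defaultProvider is a phase-1 style provider: it assigns each container
// that would otherwise run as root a non-root UID from a fixed range.
type defaultProvider struct {
	uidRangeStart, uidRangeSize int64
	next                        int64
}

func (p *defaultProvider) ModifyContainerConfig(c *Container) {
	if c.User == 0 { // only override containers that would run as root
		c.User = p.uidRangeStart + p.next%p.uidRangeSize
		p.next++
	}
}

func main() {
	p := &defaultProvider{uidRangeStart: 10000, uidRangeSize: 1000}
	c := &Container{Name: "web"}
	p.ModifyContainerConfig(c)
	fmt.Println(c.User) // prints 10000
}
```

In phase 2, a provider like this would read the UID range and other constraints from the security context attached to the pod's service account instead of hard-coded values.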
### Security Context Provider -The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container: +The Kubelet will have an interface that points to a `SecurityContextProvider`. +The `SecurityContextProvider` is invoked before creating and running a given +container: ```go type SecurityContextProvider interface { @@ -138,12 +160,14 @@ type SecurityContextProvider interface { } ``` -If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today. +If the value of the SecurityContextProvider field on the Kubelet is nil, the +kubelet will create and run the container as it does today. ### Security Context -A security context resides on the container and represents the runtime parameters that will -be used to create and run the container via container APIs. Following is an example of an initial implementation: +A security context resides on the container and represents the runtime +parameters that will be used to create and run the container via container APIs. +Following is an example of an initial implementation: ```go type Container struct { @@ -189,11 +213,12 @@ type SELinuxOptions struct { ### Admission -It is up to an admission plugin to determine if the security context is acceptable or not. At the -time of writing, the admission control plugin for security contexts will only allow a context that -has defined capabilities or privileged. Contexts that attempt to define a UID or SELinux options -will be denied by default. In the future the admission plugin will base this decision upon -configurable policies that reside within the [service account](http://pr.k8s.io/2297). +It is up to an admission plugin to determine if the security context is +acceptable or not. 
At the time of writing, the admission control plugin for +security contexts will only allow a context that defines capabilities or +privileged mode. Contexts that attempt to define a UID or SELinux options will be +denied by default. In the future the admission plugin will base this decision +upon configurable policies that reside within the [service account](http://pr.k8s.io/2297). diff --git a/selector-generation.md b/selector-generation.md index 28db17fc..cd91615b 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -37,40 +37,64 @@ Design # Goals -Make it really hard to accidentally create a job which has an overlapping selector, while still making it possible to chose an arbitrary selector, and without adding complex constraint solving to the APIserver. +Make it really hard to accidentally create a job which has an overlapping +selector, while still making it possible to choose an arbitrary selector, and +without adding complex constraint solving to the API server. # Use Cases -1. user can leave all label and selector fields blank and system will fill in reasonable ones: non-overlappingness guaranteed. -2. user can put on the pod template some labels that are useful to the user, without reasoning about non-overlappingness. System adds additional label to assure not overlapping. -3. If user wants to reparent pods to new job (very rare case) and knows what they are doing, they can completely disable this behavior and specify explicit selector. -4. If a controller that makes jobs, like scheduled job, wants to use different labels, such as the time and date of the run, it can do that. -5. If User reads v1beta1 documentation or reuses v1beta1 Job definitions and just changes the API group, the user should not automatically be allowed to specify a selector, since this is very rarely what people want to do and is error prone. -6. If User downloads an existing job definition, e.g. 
with `kubectl get jobs/old -o yaml` and tries to modify and post it, he should not create an overlapping job. -7. If User downloads an existing job definition, e.g. with `kubectl get jobs/old -o yaml` and tries to modify and post it, and he accidentally copies the uniquifying label from the old one, then he should not get an error from a label-key conflict, nor get erratic behavior. -8. If user reads swagger docs and sees the selector field, he should not be able to set it without realizing the risks. -8. (Deferred requirement:) If user wants to specify a preferred name for the non-overlappingness key, they can pick a name. +1. user can leave all label and selector fields blank and system will fill in +reasonable ones: non-overlappingness guaranteed. +2. user can put on the pod template some labels that are useful to the user, +without reasoning about non-overlappingness. System adds additional label to +assure not overlapping. +3. If user wants to reparent pods to new job (very rare case) and knows what +they are doing, they can completely disable this behavior and specify explicit +selector. +4. If a controller that makes jobs, like scheduled job, wants to use different +labels, such as the time and date of the run, it can do that. +5. If User reads v1beta1 documentation or reuses v1beta1 Job definitions and +just changes the API group, the user should not automatically be allowed to +specify a selector, since this is very rarely what people want to do and is +error prone. +6. If User downloads an existing job definition, e.g. with +`kubectl get jobs/old -o yaml` and tries to modify and post it, he should not +create an overlapping job. +7. If User downloads an existing job definition, e.g. with +`kubectl get jobs/old -o yaml` and tries to modify and post it, and he +accidentally copies the uniquifying label from the old one, then he should not +get an error from a label-key conflict, nor get erratic behavior. +8. 
If user reads swagger docs and sees the selector field, he should not be able +to set it without realizing the risks. +9. (Deferred requirement:) If user wants to specify a preferred name for the +non-overlappingness key, they can pick a name. # Proposed changes ## API -`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes change as follows. +`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as +follows. -Field `job.spec.manualSelector` is added. It controls whether selectors are automatically -generated. In automatic mode, user cannot make the mistake of creating non-unique selectors. -In manual mode, certain rare use cases are supported. +Field `job.spec.manualSelector` is added. It controls whether selectors are +automatically generated. In automatic mode, the user cannot make the mistake of +creating non-unique selectors. In manual mode, certain rare use cases are +supported. -Validation is not changed. A selector must be provided, and it must select the pod template. +Validation is not changed. A selector must be provided, and it must select the +pod template. -Defaulting changes. Defaulting happens in one of two modes: +Defaulting changes. Defaulting happens in one of two modes: ### Automatic Mode - User does not specify `job.spec.selector`. -- User is probably unaware of the `job.spec.manualSelector` field and does not think about it. -- User optionally puts labels on pod template (optional). user does not think about uniqueness, just labeling for user's own reasons. +- User is probably unaware of the `job.spec.manualSelector` field and does not +think about it. +- User optionally puts labels on pod template. User does not think +about uniqueness, just labeling for user's own reasons. 
+- Defaulting logic sets `job.spec.selector` to +`matchLabels["controller-uid"]="$UIDOFJOB"` - Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`. - The first label is controller-uid=$UIDOFJOB. - The second label is "job-name=$NAMEOFJOB". @@ -80,19 +104,30 @@ Defaulting changes. Defaulting happens in one of two modes: - User means User or Controller for the rest of this list. - User does specify `job.spec.selector`. - User does specify `job.spec.manualSelector=true` -- User puts a unique label or label(s) on pod template (required). user does think carefully about uniqueness. +- User puts a unique label or label(s) on pod template (required). User does +think carefully about uniqueness. - No defaulting of pod labels or the selector happen. ### Rationale UID is better than Name in that: - it allows cross-namespace control someday if we need it. -- it is unique across all kinds. `controller-name=foo` does not ensure uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the latter cannot use label `job-name=foo`, though there is a temptation to do so. -- it uniquely identifies the controller across time. This prevents the case where, for example, someone deletes a job via the REST api or client (where cascade=false), leaving pods around. We don't want those to be picked up unintentionally. It also prevents the case where a user looks at an old job that finished but is not deleted, and tries to select its pods, and gets the wrong impression that it is still running. +- it is unique across all kinds. `controller-name=foo` does not ensure +uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a +problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the +latter cannot use label `job-name=foo`, though there is a temptation to do so. +- it uniquely identifies the controller across time. 
This prevents the case +where, for example, someone deletes a job via the REST api or client +(where cascade=false), leaving pods around. We don't want those to be picked up +unintentionally. It also prevents the case where a user looks at an old job that +finished but is not deleted, and tries to select its pods, and gets the wrong +impression that it is still running. Job name is more user friendly. It is self-documenting. -Commands like `kubectl get pods -l job-name=myjob` should do exactly what is wanted 99.9% of the time. Automated control loops should still use the controller-uid=label. +Commands like `kubectl get pods -l job-name=myjob` should do exactly what is +wanted 99.9% of the time. Automated control loops should still use the +controller-uid label. Using both gets the benefits of both, at the cost of some label verbosity. @@ -102,11 +137,15 @@ users looking at a stored pod spec do not need to be aware of this field. ### Overriding Unique Labels -If user does specify `job.spec.selector` then the user must also specify `job.spec.manualSelector`. -This ensures the user knows that what he is doing is not the normal thing to do. +If user does specify `job.spec.selector` then the user must also specify +`job.spec.manualSelector`. This ensures the user knows that what he is doing is +not the normal thing to do. -To prevent users from copying the `job.spec.manualSelector` flag from existing jobs, it will be -optional and default to false, which means when you ask GET and existing job back that didn't use this feature, you don't even see the `job.spec.manualSelector` flag, so you are not tempted to wonder if you should fiddle with it. +To prevent users from copying the `job.spec.manualSelector` flag from existing +jobs, it will be optional and default to false, which means when you GET an +existing job back that didn't use this feature, you don't even see the +`job.spec.manualSelector` flag, so you are not tempted to wonder if you should +fiddle with it. 
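The automatic-mode defaulting and the manual-mode guard described above can be sketched together in Go. This is a hedged illustration with simplified stand-in types: `JobSpec` collapses the selector to plain match labels, and `defaultJobSelector` is a hypothetical helper name — the real batch API types and defaulting code differ.

```go
package main

import "fmt"

// JobSpec is a deliberately simplified stand-in for the real Job spec.
type JobSpec struct {
	ManualSelector bool
	Selector       map[string]string // matchLabels only, for brevity
	TemplateLabels map[string]string // pod template labels
}

// defaultJobSelector applies the rules described above: in automatic
// mode the selector becomes controller-uid=$UIDOFJOB and matching
// controller-uid/job-name labels are appended to the pod template; in
// manual mode the user-supplied selector is required and left alone.
func defaultJobSelector(spec *JobSpec, jobUID, jobName string) error {
	if spec.ManualSelector {
		if len(spec.Selector) == 0 {
			return fmt.Errorf("manualSelector=true requires an explicit selector")
		}
		return nil // user takes responsibility for uniqueness
	}
	if len(spec.Selector) != 0 {
		return fmt.Errorf("job.spec.selector requires job.spec.manualSelector=true")
	}
	spec.Selector = map[string]string{"controller-uid": jobUID}
	if spec.TemplateLabels == nil {
		spec.TemplateLabels = map[string]string{}
	}
	spec.TemplateLabels["controller-uid"] = jobUID
	spec.TemplateLabels["job-name"] = jobName
	return nil
}

func main() {
	spec := &JobSpec{TemplateLabels: map[string]string{"app": "report"}}
	if err := defaultJobSelector(spec, "abc-123", "myjob"); err != nil {
		panic(err)
	}
	fmt.Println(spec.Selector["controller-uid"], spec.TemplateLabels["job-name"])
}
```

Note how the user's own `app: report` label survives untouched; the system only appends the uniquifying labels it needs.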
## Job Controller @@ -114,8 +153,8 @@ No changes ## Kubectl -No required changes. -Suggest moving SELECTOR to wide output of `kubectl get jobs` since users do not write the selector. +No required changes. Suggest moving SELECTOR to wide output of `kubectl get +jobs` since users do not write the selector. ## Docs @@ -124,42 +163,50 @@ Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job. # Conversion -The following applies to Job, as well as to other types that adopt this pattern. +The following applies to Job, as well as to other types that adopt this pattern: - Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`. -- Both the internal type and the `batch/v1` type will get `job.spec.manualSelector`. +- Both the internal type and the `batch/v1` type will get +`job.spec.manualSelector`. - The fields `manualSelector` and `autoSelector` have opposite meanings. -- Each field defaults to false when unset, and so v1beta1 has a different default than v1 and internal. This is intentional: we want new - uses to default to the less error-prone behavior, and we do not want to change the behavior - of v1beta1. +- Each field defaults to false when unset, and so v1beta1 has a different +default than v1 and internal. This is intentional: we want new uses to default +to the less error-prone behavior, and we do not want to change the behavior of +v1beta1. -*Note*: since the internal default is changing, client -library consumers that create Jobs may need to add "job.spec.manualSelector=true" to keep working, or switch -to auto selectors. +*Note*: since the internal default is changing, client library consumers that +create Jobs may need to add "job.spec.manualSelector=true" to keep working, or +switch to auto selectors. 
Conversion is as follows: -- `extensions/__internal` to `extensions/v1beta1`: the value of `__internal.Spec.ManualSelector` is defaulted to false if nil, negated, defaulted to nil if false, and written `v1beta1.Spec.AutoSelector`. -- `extensions/v1beta1` to `extensions/__internal`: the value of `v1beta1.SpecAutoSelector` is defaulted to false if nil, negated, defaulted to nil if false, and written to `__internal.Spec.ManualSelector`. +- `extensions/__internal` to `extensions/v1beta1`: the value of +`__internal.Spec.ManualSelector` is defaulted to false if nil, negated, +defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`. +- `extensions/v1beta1` to `extensions/__internal`: the value of +`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to +nil if false, and written to `__internal.Spec.ManualSelector`. This conversion gives the following properties: -1. Users that previously used v1beta1 do not start seeing a new field when they get back objects. -2. Distinction between originally unset versus explicitly set to false is not preserved (would have been nice to do so, but requires more complicated - solution). -3. Users who only created v1beta1 examples or v1 examples, will not ever see the existence of either field. -4. Since v1beta1 are convertable to/from v1, the storage location (path in etcd) does not need to change, allowing scriptable rollforward/rollback. +1. Users that previously used v1beta1 do not start seeing a new field when they +get back objects. +2. Distinction between originally unset versus explicitly set to false is not +preserved (would have been nice to do so, but requires a more complicated +solution). +3. Users who only created v1beta1 examples or v1 examples will never see the +existence of either field. +4. Since v1beta1 is convertible to/from v1, the storage location (path in etcd) +does not need to change, allowing scriptable rollforward/rollback. 
# Future Work -Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if it works well for job. +Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if +it works well for job. Docs will be edited to show examples without a `job.spec.selector`. -We probably want as much as possible the same behavior for Job and ReplicationController. - - - - +We probably want as much as possible the same behavior for Job and +ReplicationController. diff --git a/service_accounts.md b/service_accounts.md index 445de310..2affa10e 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -40,26 +40,30 @@ Processes in Pods may need to call the Kubernetes API. For example: - scheduler - replication controller - node controller - - a map-reduce type framework which has a controller that then tries to make a dynamically determined number of workers and watch them + - a map-reduce type framework which has a controller that then tries to make a +dynamically determined number of workers and watch them - continuous build and push system - monitoring system They also may interact with services other than the Kubernetes API, such as: - - an image repository, such as docker -- both when the images are pulled to start the containers, and for writing - images in the case of pods that generate images. - - accessing other cloud services, such as blob storage, in the context of a large, integrated, cloud offering (hosted - or private). + - an image repository, such as docker -- both when the images are pulled to +start the containers, and for writing images in the case of pods that generate +images. + - accessing other cloud services, such as blob storage, in the context of a +large, integrated, cloud offering (hosted or private). 
- accessing files in an NFS volume attached to the pod ## Design Overview A service account binds together several things: - - a *name*, understood by users, and perhaps by peripheral systems, for an identity + - a *name*, understood by users, and perhaps by peripheral systems, for an +identity - a *principal* that can be authenticated and [authorized](../admin/authorization.md) - - a [security context](security_context.md), which defines the Linux Capabilities, User IDs, Groups IDs, and other - capabilities and controls on interaction with the file system and OS. - - a set of [secrets](secrets.md), which a container may use to - access various networked resources. + - a [security context](security_context.md), which defines the Linux +Capabilities, User IDs, Groups IDs, and other capabilities and controls on +interaction with the file system and OS. + - a set of [secrets](secrets.md), which a container may use to access various +networked resources. ## Design Discussion @@ -76,94 +80,119 @@ type ServiceAccount struct { } ``` -The name ServiceAccount is chosen because it is widely used already (e.g. by Kerberos and LDAP) -to refer to this type of account. Note that it has no relation to Kubernetes Service objects. +The name ServiceAccount is chosen because it is widely used already (e.g. by +Kerberos and LDAP) to refer to this type of account. Note that it has no +relation to Kubernetes Service objects. -The ServiceAccount object does not include any information that could not be defined separately: +The ServiceAccount object does not include any information that could not be +defined separately: - username can be defined however users are defined. - - securityContext and secrets are only referenced and are created using the REST API. + - securityContext and secrets are only referenced and are created using the +REST API. 
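The bindings in the overview above could be sketched as a struct like the following. The field names and types here are illustrative assumptions for exposition, not the actual API (the real `ServiceAccount` object is shown elsewhere in this document):

```go
package main

import "fmt"

// Illustrative sketch of what a service account binds together, per the
// design overview above. Field names are assumptions, not the real API.
type ServiceAccount struct {
	Name            string   // a name understood by users and peripheral systems
	Username        string   // a principal that can be authenticated and authorized
	SecurityContext string   // reference to a securityContext object (not inlined)
	Secrets         []string // references to secret objects (not inlined)
}

func main() {
	sa := ServiceAccount{
		Name:            "build-service-account",
		Username:        "build-service-account@example.com",
		SecurityContext: "restricted",
		Secrets:         []string{"registry-credentials"},
	}
	fmt.Printf("%+v\n", sa)
}
```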
The purpose of the serviceAccount object is twofold: - - to bind usernames to securityContexts and secrets, so that the username can be used to refer succinctly - in contexts where explicitly naming securityContexts and secrets would be inconvenient - - to provide an interface to simplify allocation of new securityContexts and secrets. + - to bind usernames to securityContexts and secrets, so that the username can +be used to refer succinctly in contexts where explicitly naming securityContexts +and secrets would be inconvenient + - to provide an interface to simplify allocation of new securityContexts and +secrets. + These features are explained later. ### Names -From the standpoint of the Kubernetes API, a `user` is any principal which can authenticate to Kubernetes API. -This includes a human running `kubectl` on her desktop and a container in a Pod on a Node making API calls. - -There is already a notion of a username in Kubernetes, which is populated into a request context after authentication. -However, there is no API object representing a user. While this may evolve, it is expected that in mature installations, -the canonical storage of user identifiers will be handled by a system external to Kubernetes. - -Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be -simple Unix-style short usernames, (e.g. `alice`), or may be qualified to allow for federated identity ( -`alice@example.com` vs `alice@example.org`.) Naming convention may distinguish service accounts from user -accounts (e.g. `alice@example.com` vs `build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), -but Kubernetes does not require this. - -Kubernetes also does not require that there be a distinction between human and Pod users. 
It will be possible -to setup a cluster where Alice the human talks to the Kubernetes API as username `alice` and starts pods that -also talk to the API as user `alice` and write files to NFS as user `alice`. But, this is not recommended. - -Instead, it is recommended that Pods and Humans have distinct identities, and reference implementations will -make this distinction. +From the standpoint of the Kubernetes API, a `user` is any principal which can +authenticate to the Kubernetes API. This includes a human running `kubectl` on her +desktop and a container in a Pod on a Node making API calls. + +There is already a notion of a username in Kubernetes, which is populated into a +request context after authentication. However, there is no API object +representing a user. While this may evolve, it is expected that in mature +installations, the canonical storage of user identifiers will be handled by a +system external to Kubernetes. + +Kubernetes does not dictate how to divide up the space of user identifier +strings. User names can be simple Unix-style short usernames (e.g. `alice`), or +may be qualified to allow for federated identity (`alice@example.com` vs +`alice@example.org`). Naming convention may distinguish service accounts from +user accounts (e.g. `alice@example.com` vs +`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but +Kubernetes does not require this. + +Kubernetes also does not require that there be a distinction between human and +Pod users. It will be possible to set up a cluster where Alice the human talks to +the Kubernetes API as username `alice` and starts pods that also talk to the API +as user `alice` and write files to NFS as user `alice`. But this is not +recommended. + +Instead, it is recommended that Pods and Humans have distinct identities, and +reference implementations will make this distinction. 
The distinction is useful for a number of reasons: - the requirements for humans and automated processes are different: - - Humans need a wide range of capabilities to do their daily activities. Automated processes often have more narrowly-defined activities. - - Humans may better tolerate the exceptional conditions created by expiration of a token. Remembering to handle - this in a program is more annoying. So, either long-lasting credentials or automated rotation of credentials is - needed. - - A Human typically keeps credentials on a machine that is not part of the cluster and so not subject to automatic - management. A VM with a role/service-account can have its credentials automatically managed. + - Humans need a wide range of capabilities to do their daily activities. +Automated processes often have more narrowly-defined activities. + - Humans may better tolerate the exceptional conditions created by +expiration of a token. Remembering to handle this in a program is more annoying. +So, either long-lasting credentials or automated rotation of credentials is +needed. + - A Human typically keeps credentials on a machine that is not part of the +cluster and so not subject to automatic management. A VM with a +role/service-account can have its credentials automatically managed. - the identity of a Pod cannot in general be mapped to a single human. - - If policy allows, it may be created by one human, and then updated by another, and another, until its behavior cannot be attributed to a single human. + - If policy allows, it may be created by one human, and then updated by +another, and another, until its behavior cannot be attributed to a single human. -**TODO**: consider getting rid of separate serviceAccount object and just rolling its parts into the SecurityContext or -Pod Object. +**TODO**: consider getting rid of separate serviceAccount object and just +rolling its parts into the SecurityContext or Pod Object. 
-The `secrets` field is a list of references to /secret objects that an process started as that service account should -have access to be able to assert that role. +The `secrets` field is a list of references to /secret objects that a process +started as that service account should have access to in order to assert that +role. -The secrets are not inline with the serviceAccount object. This way, most or all users can have permission to `GET /serviceAccounts` so they can remind themselves -what serviceAccounts are available for use. +The secrets are not inlined in the serviceAccount object. This way, most or +all users can have permission to `GET /serviceAccounts` so they can remind +themselves what serviceAccounts are available for use. -Nothing will prevent creation of a serviceAccount with two secrets of type `SecretTypeKubernetesAuth`, or secrets of two -different types. Kubelet and client libraries will have some behavior, TBD, to handle the case of multiple secrets of a -given type (pick first or provide all and try each in order, etc). +Nothing will prevent creation of a serviceAccount with two secrets of type +`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and +client libraries will have some behavior, TBD, to handle the case of multiple +secrets of a given type (pick first or provide all and try each in order, etc). -When a serviceAccount and a matching secret exist, then a `User.Info` for the serviceAccount and a `BearerToken` from the secret -are added to the map of tokens used by the authentication process in the apiserver, and similarly for other types. (We -might have some types that do not do anything on apiserver but just get pushed to the kubelet.) +When a serviceAccount and a matching secret exist, then a `User.Info` for the +serviceAccount and a `BearerToken` from the secret are added to the map of +tokens used by the authentication process in the apiserver, and similarly for +other types. 
(We might have some types that do not do anything on apiserver but +just get pushed to the kubelet.) ### Pods -The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If this is unset, then a -default value is chosen. If it is set, then the corresponding value of `Pods.Spec.SecurityContext` is set by the -Service Account Finalizer (see below). +The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If +this is unset, then a default value is chosen. If it is set, then the +corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account +Finalizer (see below). TBD: how policy limits which users can make pods with which service accounts. ### Authorization -Kubernetes API Authorization Policies refer to users. Pods created with a `Pods.Spec.ServiceAccountUsername` typically -get a `Secret` which allows them to authenticate to the Kubernetes APIserver as a particular user. So any -policy that is desired can be applied to them. +Kubernetes API Authorization Policies refer to users. Pods created with a +`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to +authenticate to the Kubernetes APIserver as a particular user. So any policy +that is desired can be applied to them. -A higher level workflow is needed to coordinate creation of serviceAccounts, secrets and relevant policy objects. -Users are free to extend Kubernetes to put this business logic wherever is convenient for them, though the -Service Account Finalizer is one place where this can happen (see below). +A higher level workflow is needed to coordinate creation of serviceAccounts, +secrets and relevant policy objects. Users are free to extend Kubernetes to put +this business logic wherever is convenient for them, though the Service Account +Finalizer is one place where this can happen (see below). 
### Kubelet -The kubelet will treat as "not ready to run" (needing a finalizer to act on it) any Pod which has an empty -SecurityContext. +The kubelet will treat as "not ready to run" (needing a finalizer to act on it) +any Pod which has an empty SecurityContext. -The kubelet will set a default, restrictive, security context for any pods created from non-Apiserver config -sources (http, file). +The kubelet will set a default, restrictive, security context for any pods +created from non-Apiserver config sources (http, file). Kubelet watches apiserver for secrets which are needed by pods bound to it. @@ -173,32 +202,41 @@ Kubelet watches apiserver for secrets which are needed by pods bound to it. There are several ways to use Pods with SecurityContexts and Secrets. -One way is to explicitly specify the securityContext and all secrets of a Pod when the pod is initially created, -like this: +One way is to explicitly specify the securityContext and all secrets of a Pod +when the pod is initially created, like this: **TODO**: example of pod with explicit refs. -Another way is with the *Service Account Finalizer*, a plugin process which is optional, and which handles -business logic around service accounts. +Another way is with the *Service Account Finalizer*, a plugin process which is +optional, and which handles business logic around service accounts. -The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount definitions. +The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount +definitions. -First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no `Pod.Spec.SecurityContext` set, -then it copies in the referenced securityContext and secrets references for the corresponding `serviceAccount`. +First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no +`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext +and secrets references for the corresponding `serviceAccount`. 
Second, if ServiceAccount definitions change, it may take some actions. -**TODO**: decide what actions it takes when a serviceAccount definition changes. Does it stop pods, or just -allow someone to list ones that are out of spec? In general, people may want to customize this? - -Third, if a new namespace is created, it may create a new serviceAccount for that namespace. This may include -a new username (e.g. `NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), a new -securityContext, a newly generated secret to authenticate that serviceAccount to the Kubernetes API, and default -policies for that service account. -**TODO**: more concrete example. What are typical default permissions for default service account (e.g. readonly access -to services in the same namespace and read-write access to events in that namespace?) - -Finally, it may provide an interface to automate creation of new serviceAccounts. In that case, the user may want -to GET serviceAccounts to see what has been created. + +**TODO**: decide what actions it takes when a serviceAccount definition changes. +Does it stop pods, or just allow someone to list ones that are out of spec? In +general, people may want to customize this. + +Third, if a new namespace is created, it may create a new serviceAccount for +that namespace. This may include a new username (e.g. +`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), +a new securityContext, a newly generated secret to authenticate that +serviceAccount to the Kubernetes API, and default policies for that service +account. + +**TODO**: more concrete example. What are typical default permissions for the +default service account (e.g. read-only access to services in the same namespace +and read-write access to events in that namespace?) + +Finally, it may provide an interface to automate creation of new +serviceAccounts. In that case, the user may want to GET serviceAccounts to see +what has been created. 
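The first finalizer step above (copying the referenced securityContext and secret references into a pod that names a service account) might look roughly like the following sketch; the types and field names are simplified assumptions, not the real API:

```go
package main

import "fmt"

// Simplified stand-ins for the real API objects; field names are assumptions.
type ServiceAccount struct {
	Username        string
	SecurityContext string
	Secrets         []string
}

type Pod struct {
	ServiceAccountUsername string
	SecurityContext        string
	Secrets                []string
}

// finalize sketches the Service Account Finalizer's first responsibility:
// for a pod that names a service account but has no securityContext yet,
// copy in the referenced securityContext and secret references.
func finalize(pod *Pod, accounts map[string]ServiceAccount) {
	if pod.ServiceAccountUsername == "" || pod.SecurityContext != "" {
		return // nothing to do
	}
	sa, ok := accounts[pod.ServiceAccountUsername]
	if !ok {
		return // unknown service account; pod stays "not ready to run"
	}
	pod.SecurityContext = sa.SecurityContext
	pod.Secrets = append(pod.Secrets, sa.Secrets...)
}

func main() {
	accounts := map[string]ServiceAccount{
		"builder": {Username: "builder", SecurityContext: "restricted", Secrets: []string{"registry-auth"}},
	}
	pod := Pod{ServiceAccountUsername: "builder"}
	finalize(&pod, accounts)
	fmt.Println(pod.SecurityContext, pod.Secrets) // restricted [registry-auth]
}
```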
diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 0ac77d23..eb528580 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -34,32 +34,47 @@ Documentation for other releases can be found at ## Simple rolling update -This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. +This is a lightweight design document for simple +[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. -Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information. +Complete execution flow can be found [here](#execution-details). See the +[example of rolling update](../user-guide/update-demo/) for more information. ### Lightweight rollout -Assume that we have a current replication controller named `foo` and it is running image `image:v1` +Assume that we have a current replication controller named `foo` and it is +running image `image:v1`. `kubectl rolling-update foo [foo-v2] --image=myimage:v2` -If the user doesn't specify a name for the 'next' replication controller, then the 'next' replication controller is renamed to +If the user doesn't specify a name for the 'next' replication controller, then +the 'next' replication controller is renamed to the name of the original replication controller. -Obviously there is a race here, where if you kill the client between delete foo, and creating the new version of 'foo' you might be surprised about what is there, but I think that's ok. +Obviously there is a race here: if you kill the client between deleting `foo` +and creating the new version of `foo`, you might be surprised about what is +there, but I think that's ok. 
See [Recovery](#recovery) below -If the user does specify a name for the 'next' replication controller, then the 'next' replication controller is retained with its existing name, -and the old 'foo' replication controller is deleted. For the purposes of the rollout, we add a unique-ifying label `kubernetes.io/deployment` to both the `foo` and `foo-next` replication controllers. -The value of that label is the hash of the complete JSON representation of the`foo-next` or`foo` replication controller. The name of this label can be overridden by the user with the `--deployment-label-key` flag. +If the user does specify a name for the 'next' replication controller, then the +'next' replication controller is retained with its existing name, and the old +'foo' replication controller is deleted. For the purposes of the rollout, we add +a unique-ifying label `kubernetes.io/deployment` to both the `foo` and +`foo-next` replication controllers. The value of that label is the hash of the +complete JSON representation of the `foo-next` or `foo` replication controller. +The name of this label can be overridden by the user with the +`--deployment-label-key` flag. #### Recovery -If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out. -To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the `kubernetes.io/` annotation namespace: - * `desired-replicas` The desired number of replicas for this replication controller (either N or zero) - * `update-partner` A pointer to the replication controller resource that is the other half of this update (syntax `` the namespace is assumed to be identical to the namespace of this replication controller.) +If a rollout fails or is terminated in the middle, it is important that the user +be able to resume the roll out. 
To facilitate recovery in the case of a crash of +the updating process itself, we add the following annotations to each +replication controller in the `kubernetes.io/` annotation namespace: + * `desired-replicas` The desired number of replicas for this replication +controller (either N or zero) + * `update-partner` A pointer to the replication controller resource that is +the other half of this update (syntax `` the namespace is assumed to be +identical to the namespace of this replication controller.) Recovery is achieved by issuing the same command again: @@ -67,9 +82,12 @@ Recovery is achieved by issuing the same command again: kubectl rolling-update foo [foo-v2] --image=myimage:v2 ``` -Whenever the rolling update command executes, the kubectl client looks for replication controllers called `foo` and `foo-next`, if they exist, an attempt is -made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is created, and the rollout is a new rollout. If `foo` doesn't exist, then -it is assumed that the rollout is nearly completed, and `foo-next` is renamed to `foo`. Details of the execution flow are given below. +Whenever the rolling update command executes, the kubectl client looks for +replication controllers called `foo` and `foo-next`; if they exist, an attempt +is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is +created, and the rollout is a new rollout. If `foo` doesn't exist, then it is +assumed that the rollout is nearly completed, and `foo-next` is renamed to +`foo`. Details of the execution flow are given below. 
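The client-side dispatch just described can be sketched as follows (the function and the action strings are hypothetical and descriptive only; kubectl's actual implementation differs):

```go
package main

import "fmt"

// rolloutAction sketches how the rolling-update client decides what to do
// based on which of the two replication controllers currently exist.
func rolloutAction(fooExists, fooNextExists bool) string {
	switch {
	case fooExists && fooNextExists:
		return "resume: roll foo to foo-next"
	case fooExists:
		return "new rollout: create foo-next, then roll"
	case fooNextExists:
		return "finish: rename foo-next to foo"
	default:
		return "error: specified controller does not exist"
	}
}

func main() {
	fmt.Println(rolloutAction(true, false))  // fresh rollout
	fmt.Println(rolloutAction(false, true))  // e.g. client killed after deleting foo
}
```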
### Aborting a rollout @@ -82,22 +100,28 @@ This is really just semantic sugar for: `kubectl rolling-update foo-v2 foo` -With the added detail that it moves the `desired-replicas` annotation from `foo-v2` to `foo` +With the added detail that it moves the `desired-replicas` annotation from +`foo-v2` to `foo`. ### Execution Details -For the purposes of this example, assume that we are rolling from `foo` to `foo-next` where the only change is an image update from `v1` to `v2` +For the purposes of this example, assume that we are rolling from `foo` to +`foo-next` where the only change is an image update from `v1` to `v2`. -If the user doesn't specify a `foo-next` name, then it is either discovered from the `update-partner` annotation on `foo`. If that annotation doesn't exist, -then `foo-next` is synthesized using the pattern `-` +If the user doesn't specify a `foo-next` name, then it is discovered from +the `update-partner` annotation on `foo`. If that annotation doesn't exist, +then `foo-next` is synthesized using the pattern +`-` #### Initialization * If `foo` and `foo-next` do not exist: - * Exit, and indicate an error to the user, that the specified controller doesn't exist. + * Exit, and indicate an error to the user that the specified controller +doesn't exist. * If `foo` exists, but `foo-next` does not: - * Create `foo-next` populate it with the `v2` image, set `desired-replicas` to `foo.Spec.Replicas` + * Create `foo-next`, populate it with the `v2` image, set +`desired-replicas` to `foo.Spec.Replicas` * Goto Rollout * If `foo-next` exists, but `foo` does not: * Assume that we are in the rename phase. @@ -105,7 +129,8 @@ then `foo-next` is synthesized using the pattern `-- using the matching operator . type Toleration struct { - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - // operator represents a key's relationship to the value. - // Valid operators are Exists and Equal. Defaults to Equal. 
- // Exists is equivalent to wildcard for value, so that a pod can - // tolerate all taints of a particular category. - Operator TolerationOperator `json:"operator"` - Value string `json:"value,omitempty"` - Effect TaintEffect `json:"effect"` - // TODO: For forgiveness (#1574), we'd eventually add at least a grace period - // here, and possibly an occurrence threshold and period. + Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` + // operator represents a key's relationship to the value. + // Valid operators are Exists and Equal. Defaults to Equal. + // Exists is equivalent to wildcard for value, so that a pod can + // tolerate all taints of a particular category. + Operator TolerationOperator `json:"operator"` + Value string `json:"value,omitempty"` + Effect TaintEffect `json:"effect"` + // TODO: For forgiveness (#1574), we'd eventually add at least a grace period + // here, and possibly an occurrence threshold and period. } // A toleration operator is the set of operators that can be used in a toleration. type TolerationOperator string const ( - TolerationOpExists TolerationOperator = "Exists" - TolerationOpEqual TolerationOperator = "Equal" + TolerationOpExists TolerationOperator = "Exists" + TolerationOpEqual TolerationOperator = "Equal" ) ``` @@ -169,18 +179,17 @@ const ( (See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) to understand the motivation for the various taint effects.) -We will add +We will add: ```go // Multiple tolerations with the same key are allowed. Tolerations []Toleration `json:"tolerations,omitempty"` ``` -to `PodSpec`. A pod must tolerate all of a node's taints (except taints -of type TaintEffectPreferNoSchedule) in order to be able -to schedule onto that node. +to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type +TaintEffectPreferNoSchedule) in order to be able to schedule onto that node. 
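The rule that a pod must tolerate every taint (except `PreferNoSchedule` taints) can be sketched as a predicate like the one below. This is a simplified illustration using plain strings; the real scheduler predicate and the exact operator/effect semantics differ:

```go
package main

import "fmt"

// Simplified stand-ins for the Taint and Toleration types shown above.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a taint. "Exists"
// acts as a wildcard for the value; "Equal" (the default) compares values.
func tolerates(t Toleration, taint Taint) bool {
	if t.Key != taint.Key || t.Effect != taint.Effect {
		return false
	}
	if t.Operator == "Exists" {
		return true
	}
	return t.Value == taint.Value
}

// podFits applies the rule above: every taint except PreferNoSchedule
// must be tolerated for the pod to schedule onto the node.
func podFits(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect == "PreferNoSchedule" {
			continue // a preference, not a hard requirement
		}
		matched := false
		for _, t := range tolerations {
			if tolerates(t, taint) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "special-user", Effect: "NoSchedule"}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "special-user", Effect: "NoSchedule"}}
	fmt.Println(podFits(tols, taints), podFits(nil, taints)) // true false
}
```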
-We will add +We will add: ```go // Multiple taints with the same key are not allowed. @@ -201,30 +210,32 @@ Taints and tolerations are not scoped to namespace. Using taints and tolerations to implement dedicated nodes requires these steps: 1. Add the API described above -1. Add a scheduler predicate function that respects taints and tolerations (for TaintEffectNoSchedule) -and a scheduler priority function that respects taints and tolerations (for TaintEffectPreferNoSchedule). -1. Add to the Kubelet code to implement the "no admit" behavior of TaintEffectNoScheduleNoAdmit and -TaintEffectNoScheduleNoAdmitNoExecute +1. Add a scheduler predicate function that respects taints and tolerations (for +TaintEffectNoSchedule) and a scheduler priority function that respects taints +and tolerations (for TaintEffectPreferNoSchedule). +1. Add to the Kubelet code to implement the "no admit" behavior of +TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute 1. Implement code in Kubelet that evicts a pod that no longer satisfies -TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the controllers -instead, but since taints might be used to enforce security policies, it is better -to do in kubelet because kubelet can respond quickly and can guarantee the rules will -be applied to all pods. -Eviction may need to happen under a variety of circumstances: when a taint is added, when an existing -taint is updated, when a toleration is removed from a pod, or when a toleration is modified on a pod. +TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the +controllers instead, but since taints might be used to enforce security +policies, it is better to do in kubelet because kubelet can respond quickly and +can guarantee the rules will be applied to all pods. 
Eviction may need to happen +under a variety of circumstances: when a taint is added, when an existing taint +is updated, when a toleration is removed from a pod, or when a toleration is +modified on a pod. 1. Add a new `kubectl` command that adds/removes taints to/from nodes, -1. (This is the one step is that is specific to dedicated nodes) -Implement an admission controller that adds tolerations to pods that are supposed -to be allowed to use dedicated nodes (for example, based on pod's namespace). +1. (This is the one step is that is specific to dedicated nodes) Implement an +admission controller that adds tolerations to pods that are supposed to be +allowed to use dedicated nodes (for example, based on pod's namespace). -In the future one can imagine a generic policy configuration that configures -an admission controller to apply the appropriate tolerations to the desired class of pods and -taints to Nodes upon node creation. It could be used not just for policies about dedicated nodes, -but also other uses of taints and tolerations, e.g. nodes that are restricted -due to their hardware configuration. +In the future one can imagine a generic policy configuration that configures an +admission controller to apply the appropriate tolerations to the desired class +of pods and taints to Nodes upon node creation. It could be used not just for +policies about dedicated nodes, but also other uses of taints and tolerations, +e.g. nodes that are restricted due to their hardware configuration. -The `kubectl` command to add and remove taints on nodes will be modeled after `kubectl label`. -Examples usages: +The `kubectl` command to add and remove taints on nodes will be modeled after +`kubectl label`. Examples usages: ```sh # Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'. @@ -258,36 +269,41 @@ to enumerate them by name. 
## Future work -At present, the Kubernetes security model allows any user to add and remove any taints and tolerations. -Obviously this makes it impossible to securely enforce -rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints` -field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`) -and from mutating the `Tolerations` field of their pods. #17549 is relevant. +At present, the Kubernetes security model allows any user to add and remove any +taints and tolerations. Obviously this makes it impossible to securely enforce +rules like dedicated nodes. We need some mechanism that prevents regular users +from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them +from mutating any fields of `NodeSpec`) and from mutating the `Tolerations` +field of their pods. #17549 is relevant. -Another security vulnterability arises if nodes are added to the cluster before receiving -their taint. Thus we need to ensure that a new node does not become "Ready" until it has been -configured with its taints. One way to do this is to have an admission controller that adds the taint whenever -a Node object is created. +Another security vulnerability arises if nodes are added to the cluster before +receiving their taint. Thus we need to ensure that a new node does not become +"Ready" until it has been configured with its taints. One way to do this is to +have an admission controller that adds the taint whenever a Node object is +created. A quota policy may want to treat nodes differently based on what taints, if any, -they have. For example, if a particular namespace is only allowed to access dedicated nodes, -then it may be convenient to give the namespace unlimited quota. (To use finite quota, -you'd have to size the namespace's quota to the sum of the sizes of the machines in the -dedicated node group, and update it when nodes are added/removed to/from the group.) +they have. 
For example, if a particular namespace is only allowed to access
+dedicated nodes, then it may be convenient to give the namespace unlimited
+quota. (To use finite quota, you'd have to size the namespace's quota to the sum
+of the sizes of the machines in the dedicated node group, and update it when
+nodes are added/removed to/from the group.)

-It's conceivable that taints and tolerations could be unified with [pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
-We have chosen not to do this for the reasons described in the "Future work" section of that doc.
+It's conceivable that taints and tolerations could be unified with
+[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
+We have chosen not to do this for the reasons described in the "Future work"
+section of that doc.

 ## Backward compatibility

-Old scheduler versions will ignore taints and tolerations. New scheduler versions
-will respect them.
+Old scheduler versions will ignore taints and tolerations. New scheduler
+versions will respect them.

-Users should not start using taints and tolerations until the full implementation
-has been in Kubelet and the master for enough binary versions that we
-feel comfortable that we will not need to roll back either Kubelet or
-master to a version that does not support them. Longer-term we will
-use a progamatic approach to enforcing this (#4855).
+Users should not start using taints and tolerations until the full
+implementation has been in Kubelet and the master for enough binary versions
+that we feel comfortable that we will not need to roll back either Kubelet or
+master to a version that does not support them. Longer-term we will use a
+programmatic approach to enforcing this (#4855).
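As a toy illustration of the finite-quota bookkeeping described above, one could sum the machine sizes of a dedicated node group to size the namespace's quota (the CPU counts below are made-up numbers; in practice they would come from the capacity reported on the Node objects in the group):

```shell
# Sum the CPU capacity of a dedicated node group to size a namespace's quota.
# Node sizes are illustrative only.
node_cpus="4 4 8"
total=0
for c in $node_cpus; do
  total=$((total + c))
done
echo "namespace cpu quota: ${total}"
```

Whenever a node joins or leaves the group, this sum would have to be recomputed and the quota updated, which is why the unlimited-quota shortcut mentioned above may be more convenient.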
## Related issues diff --git a/versioning.md b/versioning.md index 6e4c5d26..4d387af9 100644 --- a/versioning.md +++ b/versioning.md @@ -38,7 +38,9 @@ Reference: [Semantic Versioning](http://semver.org) Legend: -* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the major version, **Y** is the minor version, and **Z** is the patch version.) +* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. +This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the +major version, **Y** is the minor version, and **Z** is the patch version.) * **API vX[betaY]** refers to the version of the HTTP API. ## Release versioning @@ -46,43 +48,76 @@ Legend: ### Minor version scheme and timeline * Kube X.Y.0-alpha.W, W > 0 (Branch: master) - * Alpha releases are released roughly every two weeks directly from the master branch. - * No cherrypick releases. If there is a critical bugfix, a new release from master can be created ahead of schedule. + * Alpha releases are released roughly every two weeks directly from the master +branch. + * No cherrypick releases. If there is a critical bugfix, a new release from +master can be created ahead of schedule. * Kube X.Y.Z-beta.W (Branch: release-X.Y) - * When master is feature-complete for Kube X.Y, we will cut the release-X.Y branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential to X.Y. + * When master is feature-complete for Kube X.Y, we will cut the release-X.Y +branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential +to X.Y. * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. - * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, (X.Y.0-beta.W | W > 0) as necessary. + * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, +(X.Y.0-beta.W | W > 0) as necessary. 
* Kube X.Y.0 (Branch: release-X.Y) * Final release, cut from the release-X.Y branch cut two weeks prior. * X.Y.1-beta.0 will be tagged at the same commit on the same branch. * X.Y.0 occur 3 to 4 months after X.(Y-1).0. * Kube X.Y.Z, Z > 0 (Branch: release-X.Y) - * [Patch releases](#patch-releases) are released as we cherrypick commits into the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. - * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is tagged on the followup commit that updates pkg/version/base.go with the beta version. + * [Patch releases](#patch-releases) are released as we cherrypick commits into +the release-X.Y branch, (which is at X.Y.Z-beta.W,) as needed. + * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is +tagged on the followup commit that updates pkg/version/base.go with the beta +version. * Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z) - * These are special and different in that the X.Y.Z tag is branched to isolate the emergency/critical fix from all other changes that have landed on the release branch since the previous tag + * These are special and different in that the X.Y.Z tag is branched to isolate +the emergency/critical fix from all other changes that have landed on the +release branch since the previous tag * Cut release-X.Y.Z branch to hold the isolated patch release * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1) - * Branched [patch releases](#patch-releases) are rarely needed but used for emergency/critical fixes to the latest release - * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed for this kind of release to be possible. + * Branched [patch releases](#patch-releases) are rarely needed but used for +emergency/critical fixes to the latest release + * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed +for this kind of release to be possible. ### Major version timeline -There is no mandated timeline for major versions. 
They only occur when we need to start the clock on deprecating features. A given major version should be the latest major version for at least one year from its original release date.
+There is no mandated timeline for major versions. They only occur when we need
+to start the clock on deprecating features. A given major version should be the
+latest major version for at least one year from its original release date.

 ### CI and dev version scheme

-* Continuous integration versions also exist, and are versioned off of alpha and beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds that are built off of a dirty build tree, (during development, with things in the tree that are not checked it,) it will be appended with -dirty.
+* Continuous integration versions also exist, and are versioned off of alpha and
+beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an
+additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after
+X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds
+that are built off of a dirty build tree, (during development, with things in
+the tree that are not checked in,) it will be appended with -dirty.

 ### Supported releases

-We expect users to stay reasonably up-to-date with the versions of Kubernetes they use in production, but understand that it may take time to upgrade.
+We expect users to stay reasonably up-to-date with the versions of Kubernetes
+they use in production, but understand that it may take time to upgrade.

-We expect users to be running approximately the latest patch release of a given minor release; we often include critical bug fixes in [patch releases](#patch-release), and so encourage users to upgrade as soon as possible. Furthermore, we expect to "support" three minor releases at a time.
"Support" means we expect users to be running that version in production, though we may not port fixes back before the latest minor version. For example, when v1.3 comes out, v1.0 will no longer be supported: basically, that means that the reasonable response to the question "my v1.0 cluster isn't working," is, "you should probably upgrade it, (and probably should have some time ago)". With minor releases happening approximately every three months, that means a minor release is supported for approximately nine months. +We expect users to be running approximately the latest patch release of a given +minor release; we often include critical bug fixes in +[patch releases](#patch-release), and so encourage users to upgrade as soon as +possible. Furthermore, we expect to "support" three minor releases at a time. +"Support" means we expect users to be running that version in production, though +we may not port fixes back before the latest minor version. For example, when +v1.3 comes out, v1.0 will no longer be supported: basically, that means that the +reasonable response to the question "my v1.0 cluster isn't working," is, "you +should probably upgrade it, (and probably should have some time ago)". With +minor releases happening approximately every three months, that means a minor +release is supported for approximately nine months. -This does *not* mean that we expect to introduce breaking changes between v1.0 and v1.3, but it does mean that we probably won't have reasonable confidence in clusters where some components are running at v1.0 and others running at v1.3. +This does *not* mean that we expect to introduce breaking changes between v1.0 +and v1.3, but it does mean that we probably won't have reasonable confidence in +clusters where some components are running at v1.0 and others running at v1.3. -This policy is in line with [GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade). 
+This policy is in line with +[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade). ## API versioning @@ -91,33 +126,74 @@ This policy is in line with [GKE's supported upgrades policy](https://cloud.goog Here is an example major release cycle: * **Kube 1.0 should have API v1 without v1beta\* API versions** - * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have the stable v1 API. This enables you to migrate all your objects off of the beta API versions of the API and allows us to remove those beta API versions in Kube 1.0 with no effect. There will be tooling to help you detect and migrate any v1beta\* data versions or calls to v1 before you do the upgrade. + * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have +the stable v1 API. This enables you to migrate all your objects off of the beta +API versions of the API and allows us to remove those beta API versions in Kube +1.0 with no effect. There will be tooling to help you detect and migrate any +v1beta\* data versions or calls to v1 before you do the upgrade. * **Kube 1.x may have API v2beta*** - * The first incarnation of a new (backwards-incompatible) API in HEAD is v2beta1. By default this will be unregistered in apiserver, so it can change freely. Once it is available by default in apiserver (which may not happen for several minor releases), it cannot change ever again because we serialize objects in versioned form, and we always need to be able to deserialize any objects that are saved in etcd, even between alpha versions. If further changes to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x versions. -* **Kube 1.y (where y is the last version of the 1.x series) must have final API v2** - * Before Kube 2.0 is cut, API v2 must be released in 1.x. 
This enables two things: (1) users can upgrade to API v2 when running Kube 1.x and then switch over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can cleanup and remove all API v2beta\* versions because no one should have v2beta\* objects left in their database. As mentioned above, tooling will exist to make sure there are no calls or references to a given API version anywhere inside someone's kube installation before someone upgrades.
- * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. It *may* include the v1 API as well if the burden is not high - this will be determined on a per-major-version basis.
+ * The first incarnation of a new (backwards-incompatible) API in HEAD is
+ v2beta1. By default this will be unregistered in apiserver, so it can change
+ freely. Once it is available by default in apiserver (which may not happen for
+several minor releases), it cannot change ever again because we serialize
+objects in versioned form, and we always need to be able to deserialize any
+objects that are saved in etcd, even between alpha versions. If further changes
+to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x
+versions.
+* **Kube 1.y (where y is the last version of the 1.x series) must have final
+API v2**
+ * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two
+ things: (1) users can upgrade to API v2 when running Kube 1.x and then switch
+ over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can
+ clean up and remove all API v2beta\* versions because no one should have
+ v2beta\* objects left in their database. As mentioned above, tooling will exist
+ to make sure there are no calls or references to a given API version anywhere
+ inside someone's kube installation before someone upgrades.
+ * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only.
+It *may* include the v1 API as well if the burden is not high - this will be +determined on a per-major-version basis. #### Rationale for API v2 being complete before v2.0's release -It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options. +It may seem a bit strange to complete the v2 API before v2.0 is released, +but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* +APIs *is* a breaking change, which is what necessitates the major version bump. +There are other ways to do this, but having the major release be the fresh start +of that release's API without the baggage of its beta versions seems most +intuitive out of the available options. ## Patch releases -Patch releases are intended for critical bug fixes to the latest minor version, such as addressing security vulnerabilities, fixes to problems affecting a large number of users, severe problems with no workaround, and blockers for products based on Kubernetes. +Patch releases are intended for critical bug fixes to the latest minor version, +such as addressing security vulnerabilities, fixes to problems affecting a large +number of users, severe problems with no workaround, and blockers for products +based on Kubernetes. -They should not contain miscellaneous feature additions or improvements, and especially no incompatibilities should be introduced between patch versions of the same minor version (or even major version). +They should not contain miscellaneous feature additions or improvements, and +especially no incompatibilities should be introduced between patch versions of +the same minor version (or even major version). 
-Dependencies, such as Docker or Etcd, should also not be changed unless absolutely necessary, and also just to fix critical bugs (so, at most patch version changes, not new major nor minor versions). +Dependencies, such as Docker or Etcd, should also not be changed unless +absolutely necessary, and also just to fix critical bugs (so, at most patch +version changes, not new major nor minor versions). ## Upgrades -* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. (Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.) - * However, we do not recommend upgrading more than two minor releases at a time (see [Supported releases](#supported-releases)), and do not recommend running non-latest patch releases of a given minor release. +* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a +rolling upgrade across their cluster. (Rolling upgrade means being able to +upgrade the master first, then one node at a time. See #4855 for details.) + * However, we do not recommend upgrading more than two minor releases at a +time (see [Supported releases](#supported-releases)), and do not recommend +running non-latest patch releases of a given minor release. * No hard breaking changes over version boundaries. - * For example, if a user is at Kube 1.x, we may require them to upgrade to Kube 1.x+y before upgrading to Kube 2.x. In others words, an upgrade across major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone to go from 1.x to 1.x+y before they go to 2.x. - -There is a separate question of how to track the capabilities of a kubelet to facilitate rolling upgrades. That is not addressed here. + * For example, if a user is at Kube 1.x, we may require them to upgrade to +Kube 1.x+y before upgrading to Kube 2.x. 
In other words, an upgrade across
+major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as
+graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone
+to go from 1.x to 1.x+y before they go to 2.x.
+
+There is a separate question of how to track the capabilities of a kubelet to
+facilitate rolling upgrades. That is not addressed here.
--
cgit v1.2.3

From e5f80a88d3516761b2fb5d3b9c5ed64de3d1265d Mon Sep 17 00:00:00 2001
From: Isaac Hollander McCreery
Date: Tue, 3 May 2016 10:09:47 -0700
Subject: Add clarifying language about supported version skews

---
 versioning.md | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/versioning.md b/versioning.md
index 4d387af9..f6b8efaf 100644
--- a/versioning.md
+++ b/versioning.md
@@ -95,27 +95,34 @@ X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds
 that are built off of a dirty build tree, (during development, with things in
 the tree that are not checked in,) it will be appended with -dirty.

-### Supported releases
+### Supported releases and component skew

 We expect users to stay reasonably up-to-date with the versions of Kubernetes
-they use in production, but understand that it may take time to upgrade.
+they use in production, but understand that it may take time to upgrade,
+especially for production-critical components.

 We expect users to be running approximately the latest patch release of a given
 minor release; we often include critical bug fixes in
 [patch releases](#patch-release), and so encourage users to upgrade as soon as
-possible.
+
+Different components are expected to be compatible across different amounts of
+skew, all relative to the master version. Nodes may lag master components by
+up to two minor versions but should be at a version no newer than the master; a
+client should be skewed no more than one minor version from the master, but may
+lead the master by up to one minor version. For example, a v1.3 master should
+work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4
+clients.
+
+Furthermore, we expect to "support" three minor releases at a time. "Support"
+means we expect users to be running that version in production, though we may
+not port fixes back before the latest minor version. For example, when v1.3
+comes out, v1.0 will no longer be supported: basically, that means that the
 reasonable response to the question "my v1.0 cluster isn't working," is, "you
 should probably upgrade it, (and probably should have some time ago)". With
 minor releases happening approximately every three months, that means a minor
 release is supported for approximately nine months.

-This does *not* mean that we expect to introduce breaking changes between v1.0
-and v1.3, but it does mean that we probably won't have reasonable confidence in
-clusters where some components are running at v1.0 and others running at v1.3.
-
 This policy is in line with
 [GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade).
--
cgit v1.2.3

From 51fb0714bc9b4cdd70f047c2c2a6469f059c98d8 Mon Sep 17 00:00:00 2001
From: Rudi Chiarito
Date: Fri, 13 May 2016 11:42:37 -0400
Subject: Update AWS under the hood doc with ELB SSL annotations

---
 aws_under_the_hood.md | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 98d18251..13aa783c 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -139,7 +139,8 @@ pods.
ELB has some restrictions:
* ELB requires that all nodes listen on a single port,
-* ELB acts as a forwarding proxy (i.e. the source IP is not preserved).
+* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below
+on ELB annotations for pods speaking HTTP).

To work with these restrictions, in Kubernetes,
[LoadBalancer services](../user-guide/services.md#type-loadbalancer) are exposed as
@@ -162,6 +163,32 @@ services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
(`kubernetes-minion-`).

+For SSL support, starting with 1.3, two annotations can be added to a service:
+
+```
+service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
+```
+
+The first specifies which certificate to use. It can be either a
+certificate from a third party issuer that was uploaded to IAM or one created
+within AWS Certificate Manager.
+
+```
+service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
+```
+
+The second annotation specifies which protocol a pod speaks. For HTTPS and
+SSL, the ELB will expect the pod to authenticate itself over the encrypted
+connection.
+
+HTTP and HTTPS will select layer 7 proxying: the ELB will terminate
+the connection with the user, parse headers and inject the `X-Forwarded-For`
+header with the user's IP address (pods will only see the IP address of the
+ELB at the other end of its connection) when forwarding requests.
+
+TCP and SSL will select layer 4 proxying: the ELB will forward traffic without
+modifying the headers.
+
### Identity and Access Management (IAM)

kube-proxy sets up two IAM roles, one for the master called
@@ -308,6 +335,7 @@ Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually
install Kubernetes.
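Putting the two annotations together, a `LoadBalancer` Service using them might look like the following sketch (the Service name, ports, and selector are illustrative; the certificate ARN is the placeholder value from above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-https-service            # illustrative name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
    # "http" selects layer 7 proxying: the ELB terminates TLS and injects
    # X-Forwarded-For before forwarding plain HTTP to the pods.
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
spec:
  type: LoadBalancer
  selector:
    app: my-app                     # illustrative selector
  ports:
  - port: 443
    targetPort: 8080
```

With `backend-protocol: http`, the pods serve plain HTTP on the target port and rely on the `X-Forwarded-For` header for the client IP; choosing `tcp` or `ssl` instead would preserve the byte stream but lose that header.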
+ [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() -- cgit v1.2.3 From 8b2a68e257e42993082a5bc926535af0964c6e9e Mon Sep 17 00:00:00 2001 From: Avesh Agarwal Date: Mon, 21 Mar 2016 09:31:02 -0400 Subject: Downward API proposal for resources (cpu, memory) limits and requests --- downward_api_resources_limits_requests.md | 651 ++++++++++++++++++++++++++++++ 1 file changed, 651 insertions(+) create mode 100644 downward_api_resources_limits_requests.md diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md new file mode 100644 index 00000000..15f08550 --- /dev/null +++ b/downward_api_resources_limits_requests.md @@ -0,0 +1,651 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-->
+
+
+
+
+
+# Downward API for resource limits and requests
+
+## Background
+
+Currently the downward API (via environment variables and volume plugin) only
+supports exposing a Pod's name, namespace, annotations, labels and its IP
+([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This
+document explains the need and design to extend them to expose resource
+(e.g. cpu, memory) limits and requests.
+
+## Motivation
+
+Software applications require configuration to work optimally with the resources they're allowed to use.
+Exposing the requested and limited amounts of available resources inside containers will allow
+these applications to be configured more easily. Although docker already
+exposes some of this information inside containers, the downward API helps
+expose this information in a runtime-agnostic manner in Kubernetes.
+
+## Use cases
+
+As an application author, I want to be able to use cpu or memory requests and
+limits to configure the operational requirements of my applications inside containers.
+For example, Java applications expect to be made aware of the available heap size via
+a command line argument to the JVM, for example: java -Xmx:``. Similarly, an
+application may want to configure its thread pool based on available cpu resources and
+the exported value of GOMAXPROCS.
+
+## Design
+
+This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473).
+There are three approaches discussed in this document to obtain resource limits
+and requests to be exposed as environment variables and volumes inside
+containers:
+
+1. The first approach requires users to specify full json path selectors
+in which selectors are relative to the pod spec. The benefit of this
+approach is that it can specify pod-level resources, and since containers are
+also part of a pod spec, it can be used to specify container-level
+resources too.
+
+2.
The second approach requires specifying partial json path selectors
+which are relative to the container spec. This approach helps
+in retrieving container-specific resource limits and requests, and at
+the same time, it is simpler to specify than full json path selectors.
+
+3. In the third approach, users specify fixed strings (magic keys) to retrieve
+resource limits and requests and do not specify any json path
+selectors. This approach is similar to the existing downward API
+implementation approach. The advantages of this approach are that it is
+simpler to specify than the first two, and does not require any type of
+conversion between internal and versioned objects or json selectors as
+discussed below.
+
+Before discussing a bit more about the merits of each approach, here is a
+brief discussion about json path selectors and some implications related
+to their use.
+
+#### JSONpath selectors
+
+Versioned objects in kubernetes have json tags as part of their golang fields.
+Currently, objects in the internal API have json tags, but it is planned that
+these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933)
+for discussion). So for discussion in this proposal, we assume that
+internal objects do not have json tags. In the first two approaches
+(full and partial json selectors), when a user creates a pod and its
+containers, the user specifies a json path selector in the pod's
+spec to retrieve values of its limits and requests. The selector
+is composed of json tags similar to json paths used with kubectl
+([json](http://kubernetes.io/docs/user-guide/jsonpath/)). This proposal
+uses kubernetes' json path library to process the selectors to retrieve
+the values. As kubelet operates on internal objects (without json tags),
+and the selectors are part of versioned objects, retrieving values of
+the limits and requests can be handled using these two solutions:
+
+1.
By converting an internal object to a versioned object, and then using
+the json path library to retrieve the values from the versioned object
+by processing the selector.
+
+2. By converting a json selector of the versioned objects to the internal
+object's golang expression and then using the json path library to
+retrieve the values from the internal object by processing the golang
+expression. However, converting a json selector of the versioned objects
+to the internal object's golang expression will still require an instance
+of the versioned object, so it seems more work than the first solution
+unless there is another way without requiring the versioned object.
+
+So there is a one-time conversion cost associated with the first (full
+path) and second (partial path) approaches, whereas the third approach
+(magic keys) does not require any such conversion and can directly
+work on internal objects. If we want to avoid the conversion cost and to
+have implementation simplicity, my opinion is that the magic keys approach
+is the easiest to implement to expose limits and requests with the
+least impact on existing functionality.
+
+To summarize the merits/demerits of each approach:
+
+|Approach | Scope | Conversion cost | JSON selectors | Future extension|
+| ---------- | ------------------- | -------------------| ------------------- | ------------------- |
+|Full selectors | Pod/Container | Yes | Yes | Possible |
+|Partial selectors | Container | Yes | Yes | Possible |
+|Magic keys | Container | No | No | Possible|
+
+Note: Please note that pod resources can always be accessed using the existing `type ObjectFieldSelector` object
+in conjunction with the partial selectors and magic keys approaches.
+
+### API with full JSONpath selectors
+
+Full json path selectors specify the complete path to the resource
+limits and requests relative to the pod spec.
+
+#### Environment variables
+
+This table shows how selectors can be used for various requests and
+limits to be exposed as environment variables. Environment variable names
+are examples only and not necessarily as specified, and the selectors do not
+have to start with a dot.
+
+| Env Var Name | Selector |
+| ---- | ------------------- |
+| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
+| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory|
+| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
+| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory |
+
+#### Volume plugin
+
+This table shows how selectors can be used for various requests and
+limits to be exposed as volumes. The path names are examples only and
+not necessarily as specified, and the selectors do not have to start with a dot.
+
+
+| Path | Selector |
+| ---- | ------------------- |
+| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
+| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory|
+| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
+| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory|
+
+Volumes are pod scoped, so a selector must be specified with a container name.
+
+Full json path selectors will use the existing `type ObjectFieldSelector`
+to extend the current implementation for resource requests and limits.
+
+```
+// ObjectFieldSelector selects an APIVersioned field of an object.
+type ObjectFieldSelector struct {
+ APIVersion string `json:"apiVersion"`
+ // Required: Path of the field to select in the specified API version
+ FieldPath string `json:"fieldPath"`
+}
+```
+
+#### Examples
+
+These examples show how to use full selectors with environment variables and volume plugin.
+ +``` +apiVersion: v1 +kind: Pod +metadata: + name: dapi-test-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh","-c", "env" ] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + env: + - name: CPU_LIMIT + valueFrom: + fieldRef: + fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu +``` + +``` +apiVersion: v1 +kind: Pod +metadata: + name: kubernetes-downwardapi-volume-example +spec: + containers: + - name: client-container + image: gcr.io/google_containers/busybox + command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi;sleep 5; done"] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + volumeMounts: + - name: podinfo + mountPath: /etc + readOnly: false + volumes: + - name: podinfo + downwardAPI: + items: + - path: "cpu_limit" + fieldRef: + fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu +``` + +#### Validations + +For APIs with full json path selectors, verify that selectors are +valid relative to pod spec. + + +### API with partial JSONpath selectors + +Partial json path selectors specify paths to resources limits and requests +relative to the container spec. These will be implemented by introducing a +`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current +implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. + +``` +// ContainerSpecFieldSelector selects an APIVersioned field of an object. 
+type ContainerSpecFieldSelector struct {
+	APIVersion string `json:"apiVersion"`
+	// Container name
+	ContainerName string `json:"containerName,omitempty"`
+	// Required: Path of the field to select in the specified API version
+	FieldPath string `json:"fieldPath"`
+}
+
+// Represents a single file containing information from the downward API
+type DownwardAPIVolumeFile struct {
+	// Required: Path is the relative path name of the file to be created.
+	Path string `json:"path"`
+	// Selects a field of the pod: only annotations, labels, name and
+	// namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a field of the container: only resources limits and requests
+	// (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
+	// resources.requests.memory) are currently supported.
+	ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
+}
+
+// EnvVarSource represents a source for the value of an EnvVar.
+// Only one of its fields may be set.
+type EnvVarSource struct {
+	// Selects a field of the container: only resources limits and requests
+	// (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
+	// resources.requests.memory) are currently supported.
+	ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
+	// Selects a field of the pod; only name and namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a key of a ConfigMap.
+	ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
+	// Selects a key of a secret in the pod's namespace.
+	SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
+}
+```
+
+#### Environment variables
+
+This table shows how partial selectors can be used for various requests and
+limits to be exposed as environment variables.
Environment variable names
+are examples only and not necessarily as specified, and the selectors do not
+have to start with a dot.
+
+| Env Var Name | Selector |
+| -------------------- | -------------------|
+| CPU_LIMIT | resources.limits.cpu |
+| MEMORY_LIMIT | resources.limits.memory |
+| CPU_REQUEST | resources.requests.cpu |
+| MEMORY_REQUEST | resources.requests.memory |
+
+Since environment variables are container scoped, specifying the container
+name in a partial selector is optional, as the selectors are relative to the
+container spec. If the container name is not specified, it defaults to the
+current container; it can be specified explicitly to expose variables from
+other containers.
+
+#### Volume plugin
+
+This table shows volume paths and partial selectors used for the cpu and memory resources.
+Volume path names are examples only and not necessarily as specified, and the
+selectors do not have to start with a dot.
+
+| Path | Selector |
+| -------------------- | -------------------|
+| cpu_limit | resources.limits.cpu |
+| memory_limit | resources.limits.memory |
+| cpu_request | resources.requests.cpu |
+| memory_request | resources.requests.memory |
+
+Volumes are pod scoped, so the container name must be specified as part of
+`containerSpecFieldRef`.
+
+#### Examples
+
+These examples show how to use partial selectors with environment variables and the volume plugin.
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: dapi-test-pod
+spec:
+  containers:
+    - name: test-container
+      image: gcr.io/google_containers/busybox
+      command: [ "/bin/sh","-c", "env" ]
+      resources:
+        requests:
+          memory: "64Mi"
+          cpu: "250m"
+        limits:
+          memory: "128Mi"
+          cpu: "500m"
+      env:
+        - name: CPU_LIMIT
+          valueFrom:
+            containerSpecFieldRef:
+              fieldPath: resources.limits.cpu
+```
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kubernetes-downwardapi-volume-example
+spec:
+  containers:
+    - name: client-container
+      image: gcr.io/google_containers/busybox
+      command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
+      resources:
+        requests:
+          memory: "64Mi"
+          cpu: "250m"
+        limits:
+          memory: "128Mi"
+          cpu: "500m"
+      volumeMounts:
+        - name: podinfo
+          mountPath: /etc
+          readOnly: false
+  volumes:
+    - name: podinfo
+      downwardAPI:
+        items:
+          - path: "cpu_limit"
+            containerSpecFieldRef:
+              containerName: "client-container"
+              fieldPath: resources.limits.cpu
+```
+
+#### Validations
+
+For APIs with partial json path selectors, verify
+that selectors are valid relative to the container spec.
+Also verify that the container name is provided for volumes.
+
+
+### API with magic keys
+
+In this approach, users specify fixed strings (or magic keys) to retrieve resource
+limits and requests. This approach is similar to the existing downward
+API implementation. The fixed strings used for resource limits and requests
+for cpu and memory are `limits.cpu`, `limits.memory`,
+`requests.cpu` and `requests.memory`. Though these strings are the same
+as the json path selectors, they are processed as fixed strings. These will be implemented by
+introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current
+implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.
+
+The fields in ResourceFieldSelector are `containerName` to specify the name of a
+container, `resource` to specify the type of a resource (cpu or memory), and `divisor`
+to specify the output format of values of exposed resources. The default value of divisor
+is `1`, which means cores for cpu and bytes for memory. For cpu, divisor's valid
+values are `1m` (millicores) and `1` (cores). For memory, the valid values in fixed-point
+(decimal) notation are `1` (bytes), `1k` (kilobytes), `1M` (megabytes), `1G` (gigabytes),
+`1T` (terabytes), `1P` (petabytes), `1E` (exabytes), and their power-of-two equivalents
+`1Ki`, `1Mi`, `1Gi`, `1Ti`, `1Pi`, `1Ei`.
+For more information about these resource formats, [see details](resources.md).
+
+Also, the exposed values will be the `ceiling` of the actual values in the requested divisor format.
+For example, if requests.cpu is `250m` (250 millicores) and the divisor is the default `1`, then the
+exposed value will be `1` core. This is because 250 millicores converted to cores is 0.25, and
+the ceiling of 0.25 is 1.
+
+```
+type ResourceFieldSelector struct {
+	// Container name
+	ContainerName string `json:"containerName,omitempty"`
+	// Required: Resource to select
+	Resource string `json:"resource"`
+	// Specifies the output format of the exposed resources
+	Divisor resource.Quantity `json:"divisor,omitempty"`
+}
+
+// Represents a single file containing information from the downward API
+type DownwardAPIVolumeFile struct {
+	// Required: Path is the relative path name of the file to be created.
+	Path string `json:"path"`
+	// Selects a field of the pod: only annotations, labels, name and
+	// namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a resource of the container: only resources limits and requests
+	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
+	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
+}
+
+// EnvVarSource represents a source for the value of an EnvVar.
+// Only one of its fields may be set.
+type EnvVarSource struct {
+	// Selects a resource of the container: only resources limits and requests
+	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
+	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
+	// Selects a field of the pod; only name and namespace are supported.
+	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
+	// Selects a key of a ConfigMap.
+	ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
+	// Selects a key of a secret in the pod's namespace.
+	SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
+}
+```
+
+#### Environment variables
+
+This table shows environment variable names and strings used for the cpu and memory resources.
+The variable names are examples only and not necessarily as specified.
+
+| Env Var Name | Resource |
+| -------------------- | -------------------|
+| CPU_LIMIT | limits.cpu |
+| MEMORY_LIMIT | limits.memory |
+| CPU_REQUEST | requests.cpu |
+| MEMORY_REQUEST | requests.memory |
+
+Since environment variables are container scoped, specifying the container
+name in `resourceFieldRef` is optional, as the resources are relative to the
+container spec. If the container name is not specified, it defaults to the
+current container; it can be specified explicitly to expose variables from
+other containers.
+
+#### Volume plugin
+
+This table shows volume paths and strings used for the cpu and memory resources.
+Volume path names are examples only and not necessarily as specified.
+ +| Path | Resource | +| -------------------- | -------------------| +| cpu_limit | limits.cpu | +| memory_limit | limits.memory| +| cpu_request | requests.cpu | +| memory_request | requests.memory | + +Volumes are pod scoped, the container name must be specified as part of +`containerSpecFieldRef` with them. + +#### Examples + +These examples show how to use magic keys approach with environment variables and volume plugin. + +``` +apiVersion: v1 +kind: Pod +metadata: + name: dapi-test-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh","-c", "env" ] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + env: + - name: CPU_LIMIT + valueFrom: + resourceFieldRef: + resource: limits.cpu + - name: MEMORY_LIMIT + valueFrom: + resourceFieldRef: + resource: limits.memory + divisor: "1Mi" +``` + +In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively. + +``` +apiVersion: v1 +kind: Pod +metadata: + name: kubernetes-downwardapi-volume-example +spec: + containers: + - name: client-container + image: gcr.io/google_containers/busybox + command: ["sh", "-c","while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + volumeMounts: + - name: podinfo + mountPath: /etc + readOnly: false + volumes: + - name: podinfo + downwardAPI: + items: + - path: "cpu_limit" + resourceFieldRef: + containerName: client-container + resource: limits.cpu + divisor: "1m" + - path: "memory_limit" + resourceFieldRef: + containerName: client-container + resource: limits.memory +``` + +In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 500 (in millicores) and 134217728 (in bytes), respectively. 
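The exposed values quoted in these examples can be reproduced with a short sketch of the divisor arithmetic (an illustration only, not the actual `resource.Quantity` implementation; only the suffixes appearing in the examples are handled):

```python
import math
from fractions import Fraction

# Sketch of the exposed-value computation described above: parse the
# quantity and the divisor, divide, and take the ceiling of the result.

SUFFIXES = {"m": Fraction(1, 1000), "Ki": Fraction(1 << 10), "Mi": Fraction(1 << 20)}

def parse_quantity(s):
    """Parse quantities like "500m", "128Mi", or bare numbers like "1"."""
    for suffix, mult in SUFFIXES.items():
        if s.endswith(suffix):
            return Fraction(s[: -len(suffix)]) * mult
    return Fraction(s)

def exposed_value(quantity, divisor="1"):
    return math.ceil(parse_quantity(quantity) / parse_quantity(divisor))

print(exposed_value("500m", "1m"))  # 500       (cpu limit with divisor 1m)
print(exposed_value("128Mi"))       # 134217728 (memory limit with default divisor 1)
print(exposed_value("250m"))        # 1         (ceiling of 0.25 cores)
```

This reproduces both numbers from the volume example above and the earlier `250m` ceiling example.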
+
+
+#### Validations
+
+For APIs with magic keys, verify that the resource strings are valid, i.e. one
+of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`.
+Also verify that the container name is provided for volumes.
+
+## Pod-level and container-level resource access
+
+Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with the `type ObjectFieldSelector` object in
+all approaches. Container-level resources will be accessed by `type ObjectFieldSelector`
+with the full selector approach, and by `type ContainerSpecFieldSelector` and `type ResourceFieldSelector`
+with the partial and magic keys approaches, respectively. The following table
+summarizes resource access with these approaches.
+
+| Approach | Pod resources| Container resources |
+| -------------------- | -------------------|-------------------|
+| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector`|
+| Partial selectors | `ObjectFieldSelector`| `ContainerSpecFieldSelector` |
+| Magic keys | `ObjectFieldSelector`| `ResourceFieldSelector` |
+
+## Output format
+
+The output format for resource limits and requests will be the same as
+the cgroup output format, i.e. cpu in cpu shares (cores multiplied by 1024
+and rounded to an integer) and memory in bytes. For example, a memory request
+or limit of `64Mi` in the container spec will be output as `67108864`
+bytes, and a cpu request or limit of `250m` (millicores) will be output as
+`256` cpu shares.
+
+## Implementation approach
+
+The current implementation of this proposal will focus on the API with magic keys
+approach. The main reason for selecting this approach is that it might be
+easier to incorporate and extend resource-specific functionality.
+
+## Applied example
+
+Here we discuss how to use exposed resource values to set, for example, the Java
+memory size or GOMAXPROCS for your applications.
Let's say you expose a container's
+(running an application like Tomcat, for example) requested memory as a `HEAP_SIZE`
+environment variable and its requested cpu as `CPU_LIMIT` (or as GOMAXPROCS directly).
+One way to set the heap size or cpu for this application would be to wrap the binary
+in a shell script, and then export the `JAVA_OPTS` (assuming your container image supports it)
+and GOMAXPROCS environment variables inside the container image. The spec file for the
+application pod could look like:
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kubernetes-downwardapi-volume-example
+spec:
+  containers:
+    - name: test-container
+      image: gcr.io/google_containers/busybox
+      command: [ "/bin/sh","-c", "env" ]
+      resources:
+        requests:
+          memory: "64M"
+          cpu: "250m"
+        limits:
+          memory: "128M"
+          cpu: "500m"
+      env:
+        - name: HEAP_SIZE
+          valueFrom:
+            resourceFieldRef:
+              resource: requests.memory
+        - name: CPU_LIMIT
+          valueFrom:
+            resourceFieldRef:
+              resource: requests.cpu
+```
+
+Note that the value of divisor by default is `1`. Now inside the container,
+HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as:
+
+```
+export JAVA_OPTS="$JAVA_OPTS -Xmx${HEAP_SIZE}"
+
+export GOMAXPROCS="${CPU_LIMIT}"
+```
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()
+
-- 
cgit v1.2.3 

From 4851179bbd260756ba3802692d541159ba4696b7 Mon Sep 17 00:00:00 2001
From: Avesh Agarwal
Date: Fri, 20 May 2016 08:21:58 -0400
Subject: Fix a nit in the downward api proposal for resources.
--- downward_api_resources_limits_requests.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 15f08550..60ec4787 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -485,7 +485,7 @@ Volume path names are examples only and not necessarily as specified. | memory_request | requests.memory | Volumes are pod scoped, the container name must be specified as part of -`containerSpecFieldRef` with them. +`resourceFieldRef` with them. #### Examples -- cgit v1.2.3 From c90ed7e246e02dcabce874c767dee28c24bb53e4 Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Thu, 1 Oct 2015 11:57:17 -0700 Subject: Updating QoS policy to be per-pod instead of per-resource. Signed-off-by: Vishnu kannan --- resource-qos.md | 246 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 246 insertions(+) create mode 100644 resource-qos.md diff --git a/resource-qos.md b/resource-qos.md new file mode 100644 index 00000000..e5088c85 --- /dev/null +++ b/resource-qos.md @@ -0,0 +1,246 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +# Resource Quality of Service in Kubernetes + +**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar) +**Last Updated**: 5/17/2016 + +**Status**: Implemented + +*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.* + +## Introduction + +This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*. +Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee. + +Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use. +The system computes pod level requests and limits by summing up per-resource requests and limits across all containers. +When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers. +This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees. +Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes. + +## Requests and Limits + +For each resource, containers can specify a resource request and limit, `0 <= request <= [Node Allocatable](../proposals/node-allocatable.md)` & `request <= limit <= Infinity`. +If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. +Scheduling is based on `requests` and not `limits`. 
+A pod and its containers will not be allowed to exceed the specified limit.
+How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
+
+### Compressible Resource Guarantees
+
+- For now, we are only supporting CPU.
+- Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod-level cgroups will be introduced soon to achieve this goal.
+- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests 600 milli CPUs, container B requests 300 milli CPUs, and both containers are trying to use as much CPU as they can. Then, on a node with 1 CPU, the extra 100 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
+- Pods will be throttled if they exceed their limit. If a limit is unspecified, then the pods can use excess CPU when available.
+
+### Incompressible Resource Guarantees
+
+- For now, we are only supporting memory.
+- Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
+- When pods use more memory than their limit, the process using the most memory inside one of the pod's containers will be killed by the kernel.
+
+### Admission/Scheduling Policy
+
+- Pods will be admitted by the Kubelet and scheduled by the scheduler based on the sum of the requests of their containers. The scheduler and kubelet will ensure that the sum of the requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU).
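The 2:1 split described above can be sketched numerically. The sketch below is a simplified model of share-proportional distribution (it assumes a 1-CPU node and integer millicores; it is not the kernel's actual CFS accounting):

```python
# Simplified model: excess CPU beyond the sum of requests is handed out
# in proportion to each container's request, mirroring cpu.shares behavior.

def distribute_excess(requests_m, capacity_m):
    """requests_m maps container name -> CPU request in millicores."""
    total = sum(requests_m.values())
    excess = capacity_m - total
    return {name: req + excess * req // total for name, req in requests_m.items()}

# Containers A (600m) and B (300m) on a 1-CPU (1000m) node:
# the extra 100m is split 2:1 between A and B.
print(distribute_excess({"A": 600, "B": 300}, 1000))  # {'A': 666, 'B': 333}
```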
+
+## QoS Classes
+
+In an overcommitted system (where the sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
+
+The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use a (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
+
+A pod can be in one of 3 different classes:
+
+- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
+
+Examples:
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 100Mi
+```
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 100Mi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+```
+
+- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
+When `limits` are not specified, they default to the node capacity.
+
+Examples:
+
+Container `bar` has no resources specified.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+```
+
+Containers `foo` and `bar` have limits set for different resources.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        memory: 1Gi
+
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+```
+
+Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+```
+
+- If `requests` and `limits` are not set for any of the resources, across all containers, then the pod is classified as **Best-Effort**.
+
+Examples:
+
+```yaml
+containers:
+  name: foo
+    resources:
+  name: bar
+    resources:
+```
+
+Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU); instead, they will be temporarily throttled.
+
+Memory is an incompressible resource, so let's discuss the semantics of memory management a bit.
+
+- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
+These containers can use any amount of free memory in the node though.
+
+- *Guaranteed* pods are considered top-priority and are guaranteed not to be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
+
+- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
+Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
+
+### OOM Score configuration at the Nodes
+
+Pod OOM score configuration
+- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root).
Processes with higher OOM scores are killed.
+- The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ - process B's OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.
+- The final OOM score of a process is also between 0 and 1000.
+
+*Best-effort*
+ - Set OOM_SCORE_ADJ: 1000
+ - So processes in best-effort containers will have an OOM_SCORE of 1000
+
+*Guaranteed*
+ - Set OOM_SCORE_ADJ: -998
+ - So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
+
+*Burstable*
+ - If total memory request > 99.8% of available memory, set OOM_SCORE_ADJ: 2
+ - Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested)
+ - This ensures that the OOM_SCORE of a burstable pod is > 1
+ - If the memory request is `0`, OOM_SCORE_ADJ is set to `999`.
+ - So burstable pods will be killed if they conflict with guaranteed pods
+ - If a burstable pod uses less memory than requested, its OOM_SCORE will be < 1000
+ - So best-effort pods will be killed if they conflict with burstable pods using less than their requested memory
+ - If a process in a burstable pod's container uses more memory than the container requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000
+ - Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
+ - If a burstable pod's containers with multiple processes conflict, then the formula for OOM scores is a heuristic; it will not ensure "Request and Limit" guarantees
+
+*Pod infra containers* or *Special Pod init process*
+ - OOM_SCORE_ADJ: -998
+
+*Kubelet, Docker*
+ - OOM_SCORE_ADJ: -999 (won't be OOM killed)
+ - This is a hack, because these critical tasks might otherwise die if they conflict with guaranteed containers. In the future, we should place all user pods into a separate cgroup and set a limit on the memory they can consume.
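The OOM_SCORE_ADJ rules above can be condensed into a small sketch. The constants come from this document; the final clamp keeping burstable pods inside the 2-999 band is an assumption of this sketch, not a quote of the actual kubelet source:

```python
# Sketch of the per-QoS-class OOM_SCORE_ADJ assignment described above.

def oom_score_adj(qos_class, memory_request_bytes=0, node_capacity_bytes=1):
    if qos_class == "Guaranteed":
        return -998
    if qos_class == "BestEffort":
        return 1000
    # Burstable: scale inversely with the fraction of node memory requested.
    if memory_request_bytes == 0:
        return 999
    fraction = memory_request_bytes / node_capacity_bytes
    if fraction > 0.998:
        return 2
    adj = int(1000 - 10 * (100 * fraction))  # 1000 - 10 * (% of memory requested)
    # Keep burstable pods strictly between guaranteed (-998) and best-effort (1000).
    return min(max(adj, 2), 999)

cap = 8 * (1 << 30)  # assume an 8 GiB node for illustration
print(oom_score_adj("Guaranteed"))                     # -998
print(oom_score_adj("BestEffort"))                     # 1000
print(oom_score_adj("Burstable", 2 * (1 << 30), cap))  # 750
```

The 750 in the last line follows directly from the formula: a pod requesting 25% of node memory gets 1000 - 10 * 25.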
+ +## Known issues and possible improvements + +The above implementation provides for basic oversubscription with protection, but there are a few known limitations. + +#### Support for Swap + +- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior. + +## Alternative QoS Class Policy + +An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). +A strict hierarchy of user-specified numerical priorities is not desirable because: + +1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively +2. Changes to desired priority bands would require changes to all user pod configurations. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]() + -- cgit v1.2.3 From 04bc418b3115722615e43dca834e386a436c3bca Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Wed, 20 Apr 2016 19:51:46 -0400 Subject: Seccomp Proposal --- seccomp.md | 295 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 295 insertions(+) create mode 100644 seccomp.md diff --git a/seccomp.md b/seccomp.md new file mode 100644 index 00000000..7d65611e --- /dev/null +++ b/seccomp.md @@ -0,0 +1,295 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +## Abstract + +A proposal for adding **alpha** support for +[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a +system call filtering facility in the Linux kernel which lets applications +define limits on system calls they may make, and what should happen when +system calls are made. Seccomp is used to reduce the attack surface available +to applications. + +## Motivation + +Applications use seccomp to restrict the set of system calls they can make. +Recently, container runtimes have begun adding features to allow the runtime +to interact with seccomp on behalf of the application, which eliminates the +need for applications to link against libseccomp directly. Adding support in +the Kubernetes API for describing seccomp profiles will allow administrators +greater control over the security of workloads running in Kubernetes. + +Goals of this design: + +1. Describe how to reference seccomp profiles in containers that use them + +## Constraints and Assumptions + +This design should: + +* build upon previous security context work +* be container-runtime agnostic +* allow use of custom profiles +* facilitate containerized applications that link directly to libseccomp + +## Use Cases + +1. As an administrator, I want to be able to grant access to a seccomp profile + to a class of users +2. As a user, I want to run an application with a seccomp profile similar to + the default one provided by my container runtime +3. As a user, I want to run an application which is already libseccomp-aware + in a container, and for my application to manage interacting with seccomp + unmediated by Kubernetes +4. As a user, I want to be able to use a custom seccomp profile and use + it with my containers + +### Use Case: Administrator access control + +Controlling access to seccomp profiles is a cluster administrator +concern. It should be possible for an administrator to control which users +have access to which profiles. 
+ +The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893) +API extension governs the ability of users to make requests that affect pod +and container security contexts. The proposed design should deal with +required changes to control access to new functionality. + +### Use Case: Seccomp profiles similar to container runtime defaults + +Many users will want to use images that make assumptions about running in the +context of their chosen container runtime. Such images are likely to +frequently assume that they are running in the context of the container +runtime's default seccomp settings. Therefore, it should be possible to +express a seccomp profile similar to a container runtime's defaults. + +As an example, all dockerhub 'official' images are compatible with the Docker +default seccomp profile. So, any user who wanted to run one of these images +with seccomp would want the default profile to be accessible. + +### Use Case: Applications that link to libseccomp + +Some applications already link to libseccomp and control seccomp directly. It +should be possible to run these applications unmodified in Kubernetes; this +implies there should be a way to disable seccomp control in Kubernetes for +certain containers, or to run with a "no-op" or "unconfined" profile. + +Sometimes, applications that link to seccomp can use the default profile for a +container runtime, and restrict further on top of that. It is important to +note here that in this case, applications can only place _further_ +restrictions on themselves. It is not possible to re-grant the ability of a +process to make a system call once it has been removed with seccomp. + +As an example, elasticsearch manages its own seccomp filters in its code. 
+Currently, elasticsearch is capable of running in the context of the default
+Docker profile, but if, in the future, elasticsearch needed to be able to call
+`ioperm` or `iopl` (both of which are disallowed in the default profile), it
+should be possible to run elasticsearch by delegating the seccomp controls to
+the pod.
+
+### Use Case: Custom profiles
+
+Different applications have different requirements for seccomp profiles; it
+should be possible to specify an arbitrary seccomp profile and use it in a
+container. This is more of a concern for applications which need a higher
+level of privilege than what is granted by the default profile for a cluster,
+since applications that want to restrict privileges further can always make
+additional calls in their own code.
+
+An example of an application that requires the use of a syscall disallowed in
+the Docker default profile is Chrome, which needs `clone` to create a new user
+namespace. Another example would be a program which uses `ptrace` to
+implement a sandbox for user-provided code, such as
+[eval.in](https://eval.in/).
+
+## Community Work
+
+### Container runtime support for seccomp
+
+#### Docker / opencontainers
+
+Docker supports the Open Container Initiative's API for
+seccomp, which is very close to the libseccomp API. It allows full
+specification of seccomp filters, with arguments, operators, and actions.
+
+Docker allows the specification of a single seccomp filter. There are
+community requests for more flexibility.
+
+Issues:
+
+* [docker/22109](https://github.com/docker/docker/issues/22109): composable
+  seccomp filters
+* [docker/22105](https://github.com/docker/docker/issues/22105): custom
+  seccomp filters for builds
+
+#### rkt / appcontainers
+
+The `rkt` runtime delegates to systemd for seccomp support; there is an open
+issue to add support once `appc` supports it. The `appc` project has an open
+issue to be able to describe seccomp as an isolator in an appc pod.
+
+The systemd seccomp facility is based on a whitelist of system calls that can
+be made, rather than a full filter specification.
+
+Issues:
+
+* [appc/529](https://github.com/appc/spec/issues/529)
+* [rkt/1614](https://github.com/coreos/rkt/issues/1614)
+
+#### HyperContainer
+
+[HyperContainer](https://hypercontainer.io) does not support seccomp.
+
+### Other platforms and seccomp-like capabilities
+
+FreeBSD has a seccomp/capability-like facility called
+[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
+
+#### lxd
+
+[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile.
+
+Issues:
+
+* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp
+
+## Proposed Design
+
+### Seccomp API Resource?
+
+An earlier draft of this proposal described a new global API resource that
+could be used to describe seccomp profiles. After some discussion, it was
+determined that without a feedback signal from users indicating a need to
+describe new profiles in the Kubernetes API, it is not possible to know
+whether a new API resource is warranted.
+
+That being the case, we will not propose a new API resource at this time. If
+there is strong community desire for such a resource, we may consider it in
+the future.
+
+Instead of implementing a new API resource, we propose that pods be able to
+reference seccomp profiles by name. Since this is an alpha feature, we will
+use annotations instead of extending the API with new fields.
+
+### API changes?
+
+In the alpha version of this feature we will use annotations to store the
+names of seccomp profiles. The keys will be:
+
+`security.alpha.kubernetes.io/seccomp/container/<container name>`
+
+which will be used to set the seccomp profile of a container, and:
+
+`security.alpha.kubernetes.io/seccomp/pod`
+
+which will set the seccomp profile for the containers of an entire pod.
If a pod-level annotation is present, and a container-level annotation is present for a container, then the container-level profile takes precedence.
+
+The value of these keys should be container-runtime agnostic. We will
+establish a format that expresses the conventions for distinguishing between
+an unconfined profile, the container runtime's default, or a custom profile.
+Since the format of a profile is likely to be runtime-dependent, we will
+consider profiles to be opaque to Kubernetes for now.
+
+Profile names take one of the following forms:
+
+1. `runtime/default` - the default profile for the container runtime
+2. `unconfined` - unconfined profile, i.e., no seccomp sandboxing
+3. `localhost/<profile name>` - the profile installed to the node's local seccomp profile root
+
+Since seccomp profile schemes may vary between container runtimes, we will
+treat the contents of profiles as opaque for now and avoid attempting to find
+a common way to describe them. It is up to the container runtime to be
+sensitive to the annotations proposed here and to interpret instructions about
+local profiles.
+
+A new area on disk (which we will call the seccomp profile root) must be
+established to hold seccomp profiles. A field will be added to the Kubelet
+for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to
+allow admins to set it. If unset, it should default to the `seccomp`
+subdirectory of the kubelet root directory.
+
+### Pod Security Policy annotation
+
+The `PodSecurityPolicy` type should be annotated with the allowed seccomp
+profiles using the key
+`security.alpha.kubernetes.io/allowedSeccompProfileNames`. The value of this
+key should be a comma-delimited list.
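To make the `PodSecurityPolicy` annotation concrete, here is a sketch of how a policy object might carry it. This is illustrative only: the policy name, the profile names, and the `extensions/v1beta1` API group are assumptions, and the policy's `spec` fields (unrelated to seccomp) are omitted.

```yaml
apiVersion: extensions/v1beta1    # assumed API group for this sketch
kind: PodSecurityPolicy
metadata:
  name: restricted                # hypothetical policy name
  annotations:
    # pods admitted under this policy may only use these seccomp profiles
    security.alpha.kubernetes.io/allowedSeccompProfileNames: "runtime/default,localhost/example-audit-profile"
# spec fields omitted for brevity
```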
+
+## Examples
+
+### Unconfined profile
+
+Here's an example of a pod that uses the unconfined profile:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: trustworthy-pod
+  annotations:
+    security.alpha.kubernetes.io/seccomp/pod: unconfined
+spec:
+  containers:
+    - name: trustworthy-container
+      image: sotrustworthy:latest
+```
+
+### Custom profile
+
+Here's an example of a pod that uses a profile called
+`example-explorer-profile` using the container-level annotation:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: explorer
+  annotations:
+    security.alpha.kubernetes.io/seccomp/container/explorer: localhost/example-explorer-profile
+spec:
+  containers:
+    - name: explorer
+      image: gcr.io/google_containers/explorer:1.0
+      args: ["-port=8080"]
+      ports:
+        - containerPort: 8080
+          protocol: TCP
+      volumeMounts:
+        - mountPath: "/mount/test-volume"
+          name: test-volume
+  volumes:
+    - name: test-volume
+      emptyDir: {}
+```
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
--
cgit v1.2.3


From 4de0ecb8c1f8ee60ef0b0ae85149de72f175b806 Mon Sep 17 00:00:00 2001
From: Boris Mattijssen
Date: Thu, 2 Jun 2016 15:08:50 +0200
Subject: Update scheduler_extender.md

The filter call should actually return a schedulerapi.ExtenderFilterResult
with an api.NodeList in it, instead of a raw api.NodeList.
---
 scheduler_extender.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scheduler_extender.md b/scheduler_extender.md
index e8ad718f..fa1edbb4 100644
--- a/scheduler_extender.md
+++ b/scheduler_extender.md
@@ -125,7 +125,7 @@ type ExtenderArgs struct {
 }
 ```
 
-The "filter" call returns a list of nodes (api.NodeList). The "prioritize" call
+The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call
 returns priorities for each node (schedulerapi.HostPriorityList). The "filter"
 call may prune the set of nodes based on its predicates.
 Scores
--
cgit v1.2.3


From 6522639ceace700a43657f97b3b622137f8ed4e2 Mon Sep 17 00:00:00 2001
From: "Dr. Stefan Schimanski"
Date: Thu, 2 Jun 2016 15:44:57 +0200
Subject: Move /seccomp/ into domain prefix in seccomp annotations

Double slashes are not allowed in annotation keys. Moreover, using the 63
characters of the name component in an annotation key will shorten the space
for the container name.
---
 seccomp.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/seccomp.md b/seccomp.md
index 7d65611e..4a28d705 100644
--- a/seccomp.md
+++ b/seccomp.md
@@ -202,11 +202,11 @@ use annotations instead of extending the API with new fields.
 In the alpha version of this feature we will use annotations to store the
 names of seccomp profiles. The keys will be:
 
-`security.alpha.kubernetes.io/seccomp/container/<container name>`
+`container.seccomp.security.alpha.kubernetes.io/<container name>`
 
 which will be used to set the seccomp profile of a container, and:
 
-`security.alpha.kubernetes.io/seccomp/pod`
+`seccomp.security.alpha.kubernetes.io/pod`
 
 which will set the seccomp profile for the containers of an entire pod. If a
 pod-level annotation is present, and a container-level annotation present for
 a container, then the container-level profile takes precedence.
@@ -240,7 +240,7 @@ subdirectory of the kubelet root directory.
 
 The `PodSecurityPolicy` type should be annotated with the allowed seccomp
 profiles using the key
-`security.alpha.kubernetes.io/allowedSeccompProfileNames`. The value of this
+`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
 key should be a comma delimited list.
 ## Examples
 
@@ -255,7 +255,7 @@ kind: Pod
 metadata:
   name: trustworthy-pod
   annotations:
-    security.alpha.kubernetes.io/seccomp/pod: unconfined
+    seccomp.security.alpha.kubernetes.io/pod: unconfined
 spec:
   containers:
     - name: trustworthy-container
@@ -273,7 +273,7 @@ kind: Pod
 metadata:
   name: explorer
   annotations:
-    security.alpha.kubernetes.io/seccomp/container/explorer: localhost/example-explorer-profile
+    container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
 spec:
   containers:
     - name: explorer
--
cgit v1.2.3


From 2b760742fe1ed3d25e0981e5481f7074b15facea Mon Sep 17 00:00:00 2001
From: Avesh Agarwal
Date: Thu, 2 Jun 2016 11:13:41 -0400
Subject: Fix byte terminology

---
 downward_api_resources_limits_requests.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
index 60ec4787..4afb974a 100644
--- a/downward_api_resources_limits_requests.md
+++ b/downward_api_resources_limits_requests.md
@@ -408,8 +408,8 @@ to specify the output format of values of exposed resources. The default value o
 is `1` which means cores for cpu and bytes for memory. For cpu, divisor's valid
 values are `1m` (millicores), `1`(cores), and for memory, the valid values in
 fixed point integer (decimal) are `1`(bytes), `1k`(kilobytes), `1M`(megabytes), `1G`(gigabytes),
-`1T`(terabytes), `1P`(petabytes), `1E`(exabytes), and in their power-of-two equivalents `1Ki(kilobytes)`,
-`1Mi`(megabytes), `1Gi`(gigabytes), `1Ti`(terabytes), `1Pi`(petabytes), `1Ei`(exabytes).
+`1T`(terabytes), `1P`(petabytes), `1E`(exabytes), and in their power-of-two equivalents `1Ki(kibibytes)`,
+`1Mi`(mebibytes), `1Gi`(gibibytes), `1Ti`(tebibytes), `1Pi`(pebibytes), `1Ei`(exbibytes).
 For more information about these resource formats, [see details](resources.md).
 Also, the exposed values will be `ceiling` of the actual values in the requested
 format in divisor.
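The divisor rounding rule described above can be sketched in a few lines. This is only an illustration of the ceiling behavior, not the actual Kubernetes implementation; the function name and the memory-only divisor table are hypothetical.

```python
# Sketch of the downward API divisor rounding rule: the exposed value is
# ceil(actual value / divisor). Memory divisors only, for brevity.
DIVISOR_BYTES = {
    "1": 1,
    "1k": 10**3, "1M": 10**6, "1G": 10**9, "1T": 10**12,      # decimal (SI)
    "1Ki": 2**10, "1Mi": 2**20, "1Gi": 2**30, "1Ti": 2**40,   # power-of-two
}

def exposed_memory_value(limit_bytes: int, divisor: str) -> int:
    """Return the value that would be exposed for a memory limit."""
    d = DIVISOR_BYTES[divisor]
    return -(-limit_bytes // d)  # exact integer ceiling division

# A 100 MB (decimal) limit exposed with divisor 1Mi rounds up:
# exposed_memory_value(100 * 10**6, "1Mi") -> 96
```

Ceiling (rather than floor) rounding means a container never sees a value smaller than its actual limit, so sizing a runtime (e.g. a JVM heap) from the exposed value cannot exceed the real limit boundary by under-reporting.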
-- cgit v1.2.3 From bca5a57c0bdab71c3cd7c6310304a6d2afe32caa Mon Sep 17 00:00:00 2001 From: David McMahon Date: Fri, 10 Jun 2016 14:21:20 -0700 Subject: Versioning docs and examples for v1.4.0-alpha.0. --- README.md | 36 +++++----------------- access.md | 36 +++++----------------- admission_control.md | 36 +++++----------------- admission_control_limit_range.md | 44 +++++++-------------------- admission_control_resource_quota.md | 50 +++++++++---------------------- architecture.md | 36 +++++----------------- aws_under_the_hood.md | 36 +++++----------------- clustering.md | 36 +++++----------------- clustering/README.md | 36 +++++----------------- command_execution_port_forwarding.md | 36 +++++----------------- configmap.md | 36 +++++----------------- control-plane-resilience.md | 31 +++++-------------- daemon.md | 36 +++++----------------- downward_api_resources_limits_requests.md | 31 +++++-------------- enhance-pluggable-policy.md | 36 +++++----------------- event_compression.md | 40 ++++++------------------- expansion.md | 36 +++++----------------- extending-api.md | 36 +++++----------------- federated-services.md | 31 +++++-------------- federation-phase-1.md | 31 +++++-------------- horizontal-pod-autoscaler.md | 38 +++++------------------ identifiers.md | 36 +++++----------------- indexed-job.md | 36 +++++----------------- metadata-policy.md | 44 +++++++-------------------- namespaces.md | 36 +++++----------------- networking.md | 36 +++++----------------- nodeaffinity.md | 36 +++++----------------- persistent-storage.md | 36 +++++----------------- podaffinity.md | 36 +++++----------------- principles.md | 36 +++++----------------- resource-qos.md | 31 +++++-------------- resources.md | 36 +++++----------------- scheduler_extender.md | 36 +++++----------------- seccomp.md | 31 +++++-------------- secrets.md | 36 +++++----------------- security.md | 36 +++++----------------- security_context.md | 36 +++++----------------- selector-generation.md | 36 
+++++----------------- service_accounts.md | 36 +++++----------------- simple-rolling-update.md | 36 +++++----------------- taint-toleration-dedicated.md | 36 +++++----------------- versioning.md | 36 +++++----------------- 42 files changed, 312 insertions(+), 1206 deletions(-) diff --git a/README.md b/README.md index 2f1de058..e5ca4552 100644 --- a/README.md +++ b/README.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/README.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -91,6 +62,13 @@ transparent, composable manner. For more about the Kubernetes architecture, see [architecture](architecture.md). + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]() diff --git a/access.md b/access.md index 7cf1ad39..e5c729e3 100644 --- a/access.md +++ b/access.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/access.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -405,6 +376,13 @@ Improvements: performing audit or other sensitive functions. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]() diff --git a/admission_control.md b/admission_control.md index eef323b7..2a944e1e 100644 --- a/admission_control.md +++ b/admission_control.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -124,6 +95,13 @@ following: If at any step, there is an error, the request is canceled. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]() diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 8a6c751d..fe26b819 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -100,12 +71,12 @@ type LimitRange struct { TypeMeta `json:",inline"` // Standard object's metadata. // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata + // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata ObjectMeta `json:"metadata,omitempty"` // Spec defines the limits enforced. // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status + // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status Spec LimitRangeSpec `json:"spec,omitempty"` } @@ -114,12 +85,12 @@ type LimitRangeList struct { TypeMeta `json:",inline"` // Standard list metadata. // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds + // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds ListMeta `json:"metadata,omitempty"` // Items is a list of LimitRange objects. // More info: - // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md + // http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_limit_range.md Items []LimitRange `json:"items"` } ``` @@ -244,6 +215,13 @@ the following would happen. 3. If the container is later resized, it's cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4. 
+ + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index bfac66eb..199ab752 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -69,13 +40,13 @@ const ( // ResourceQuotaSpec defines the desired hard limits to enforce for Quota type ResourceQuotaSpec struct { // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` } // ResourceQuotaStatus defines the enforced hard limits and observed use type ResourceQuotaStatus struct { // Hard is the set of enforced hard limits for each named resource - Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` // Used is the current observed total usage of the resource in the namespace Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"` } @@ -83,22 +54,22 @@ type ResourceQuotaStatus struct { // 
ResourceQuota sets aggregate quota restrictions enforced per namespace type ResourceQuota struct { TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"` // Spec defines the desired quota - Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` + Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"` // Status defines the actual enforced quota and its current usage - Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` + Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"` } // ResourceQuotaList is a list of ResourceQuota items type ResourceQuotaList struct { TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"` // Items is a list of ResourceQuota objects - Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` + Items 
[]ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` } ``` @@ -244,6 +215,13 @@ services 0 5 See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() diff --git a/architecture.md b/architecture.md index b8ce990f..d6c653a9 100644 --- a/architecture.md +++ b/architecture.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -114,6 +85,13 @@ API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]() diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 13aa783c..6c55d9f3 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -336,6 +307,13 @@ install Kubernetes. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() diff --git a/clustering.md b/clustering.md index 327456b3..8a61cfa9 100644 --- a/clustering.md +++ b/clustering.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -157,6 +128,13 @@ code that can verify the signing requests via other means. ![Dynamic Sequence Diagram](clustering/dynamic.png) + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]() diff --git a/clustering/README.md b/clustering/README.md index 193f343b..49f0c901 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - This directory contains diagrams for the clustering design doc. @@ -67,6 +38,13 @@ system and automatically rebuild when files have changed. Just do a `make watch`. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]() diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 4e579f8d..78f8ea89 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -187,6 +158,13 @@ data. This can most likely be achieved via SELinux labeling and unique process contexts. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]() diff --git a/configmap.md b/configmap.md index e4b69eaa..4eb65702 100644 --- a/configmap.md +++ b/configmap.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -329,6 +300,13 @@ spec: In the future, we may add the ability to specify an init-container that can watch the volume contents for updates and respond to changes when they occur. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]() diff --git a/control-plane-resilience.md b/control-plane-resilience.md index 39110e3a..5dac8709 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -1,29 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -265,6 +241,13 @@ be automated and continuously tested. + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]() diff --git a/daemon.md b/daemon.md index 9b66e0e1..1719c327 100644 --- a/daemon.md +++ b/daemon.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -235,6 +206,13 @@ restartPolicy set to Always. - Should work similarly to [Deployment](http://issues.k8s.io/1743). + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]() diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 60ec4787..cc6540cd 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -1,29 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -646,6 +622,13 @@ and export GOMAXPROCS=$(CPU_LIMIT)" ``` + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]() diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 8f184af9..ed65b6fc 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - @@ -458,6 +429,13 @@ type LocalResourceAccessReviewResponse struct { ``` + + + + + + + [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]() diff --git a/event_compression.md b/event_compression.md index c4dfc154..43f6d52b 100644 --- a/event_compression.md +++ b/event_compression.md @@ -1,34 +1,5 @@ - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -71,7 +42,7 @@ entries.
 ## Design
 
 Instead of a single Timestamp, each event object
-[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
+[contains](http://releases.k8s.io/v1.4.0-alpha.0/pkg/api/types.go#L1111) the following
 fields:
 * `FirstTimestamp unversioned.Time`
   * The date/time of the first occurrence of the event.
@@ -132,7 +103,7 @@ of time and generates tons of unique events, the previously generated events
 cache will not grow unchecked in memory. Instead, after 4096 unique events are
 generated, the oldest events are evicted from the cache.
 * When an event is generated, the previously generated events cache is checked
-(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/v1.4.0-alpha.0/pkg/client/record/event.go)).
 * If the key for the new event matches the key for a previously generated
 event (meaning all of the above fields match between the new event and some
 previously generated event), then the event is considered to be a duplicate and
@@ -198,6 +169,13 @@ single event to optimize etcd storage.
 instead of map.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()
diff --git a/expansion.md b/expansion.md
index cf44baed..7c0fff2f 100644
--- a/expansion.md
+++ b/expansion.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -446,6 +417,13 @@ spec:
 ```
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()
diff --git a/extending-api.md b/extending-api.md
index aa1821c8..ef6cf902 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -230,6 +201,13 @@ ${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/$
 ```
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
diff --git a/federated-services.md b/federated-services.md
index 7e9933e3..2d98bdc6 100644
--- a/federated-services.md
+++ b/federated-services.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -538,6 +514,13 @@ also have the anti-entropy mechanism for reconciling ubernetes "desired
 desired" state against kubernetes "actual desired" state.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()
diff --git a/federation-phase-1.md b/federation-phase-1.md
index 53087fd8..b4a90b6f 100644
--- a/federation-phase-1.md
+++ b/federation-phase-1.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -431,6 +407,13 @@ document
 Please refer to that document for details.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index 1b0d78bd..73ea8a98 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -42,7 +13,7 @@ is responsible for dynamically controlling the number of replicas of some
 collection (e.g. the pods of a ReplicationController) to meet some objective(s),
 for example a target per-pod CPU utilization.
 
-This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).
+This design supersedes [autoscaling.md](http://releases.k8s.io/v1.4.0-alpha.0/docs/proposals/autoscaling.md).
 
 ## Overview
@@ -290,6 +261,13 @@ the same node, kill one of them. Discussed in issue [#4301](https://github.com/k
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
diff --git a/identifiers.md b/identifiers.md
index 175d25c9..bb4ac2b1 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -142,6 +113,13 @@ unique across time.
   1. This may correspond to Docker's container ID.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
diff --git a/indexed-job.md b/indexed-job.md
index 6c41bd64..a6f4eb37 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -929,6 +900,13 @@ This differs from PetSet in that PetSet uses names and not indexes. PetSet
 is intended to support ones to tens of things.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()
diff --git a/metadata-policy.md b/metadata-policy.md
index da7d5425..d51a4151 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -105,11 +76,11 @@ type PolicyAction struct {
 type MetadataPolicy struct {
   unversioned.TypeMeta `json:",inline"`
   // Standard object's metadata.
-  // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+  // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
   ObjectMeta `json:"metadata,omitempty"`
 
   // Spec defines the metadata policy that should be enforced.
-  // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+  // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
   Spec MetadataPolicySpec `json:"spec,omitempty"`
 }
@@ -117,11 +88,11 @@ type MetadataPolicy struct {
 type MetadataPolicyList struct {
   unversioned.TypeMeta `json:",inline"`
   // Standard list metadata.
-  // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+  // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
   unversioned.ListMeta `json:"metadata,omitempty"`
 
   // Items is a list of MetadataPolicy objects.
-  // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+  // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
   Items []MetadataPolicy `json:"items"`
 }
 ```
@@ -166,6 +137,13 @@ API for matching "claims" to "service classes"; matching a pod to a scheduler
 would be one use for such an API.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]()
diff --git a/namespaces.md b/namespaces.md
index d63015bc..dd1e27bd 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -399,6 +370,13 @@ At this point, all content associated with that Namespace, and the Namespace
 itself are gone.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
diff --git a/networking.md b/networking.md
index ca2527e5..2e9f5de7 100644
--- a/networking.md
+++ b/networking.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/networking.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -216,6 +187,13 @@ by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
 requests from people running Kubernetes on bare metal, though. :-)
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]()
diff --git a/nodeaffinity.md b/nodeaffinity.md
index 8c999fec..7f27bcc9 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -274,6 +245,13 @@ The main related issue is #341. Issue #367 is also related. Those issues
 reference other related issues.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]()
diff --git a/persistent-storage.md b/persistent-storage.md
index 00eb2fef..417bb4f8 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -276,6 +247,13 @@ Admins can script the recycling of released volumes. Future dynamic provisioners
 will understand how a volume should be recycled.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
diff --git a/podaffinity.md b/podaffinity.md
index 2c57ed90..db615c3d 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -697,6 +668,13 @@ This proposal is to satisfy #14816.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]()
diff --git a/principles.md b/principles.md
index 297ae923..77ddaf9d 100644
--- a/principles.md
+++ b/principles.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/principles.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -130,6 +101,13 @@ TODO
 * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
diff --git a/resource-qos.md b/resource-qos.md
index e5088c85..d0c709bd 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -241,6 +217,13 @@ A strict hierarchy of user-specified numerical priorities is not desirable becau
 1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively
 2. Changes to desired priority bands would require changes to all user pod configurations.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]()
diff --git a/resources.md b/resources.md
index 2a75c987..131b67cb 100644
--- a/resources.md
+++ b/resources.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/resources.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
 **Note: this is a design doc, which describes features that have not been
@@ -398,6 +369,13 @@ second.
 * Compressible? yes
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()
diff --git a/scheduler_extender.md b/scheduler_extender.md
index e8ad718f..bf102c8b 100644
--- a/scheduler_extender.md
+++ b/scheduler_extender.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -134,6 +105,13 @@ its priority functions) and used for final host selection.
 Multiple extenders can be configured in the scheduler policy.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
diff --git a/seccomp.md b/seccomp.md
index 7d65611e..25f43de4 100644
--- a/seccomp.md
+++ b/seccomp.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -290,6 +266,13 @@ spec:
       emptyDir: {}
 ```
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
diff --git a/secrets.md b/secrets.md
index b1b83106..28504cf7 100644
--- a/secrets.md
+++ b/secrets.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -656,6 +627,13 @@ on their filesystems:
 /etc/secret-volume/password
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]()
diff --git a/security.md b/security.md
index 06bb3979..3c321334 100644
--- a/security.md
+++ b/security.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -247,6 +218,13 @@ scheduler may need read access to user or project-container information to
 determine preferential location (underspecified at this time).
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
diff --git a/security_context.md b/security_context.md
index 2b7d8b96..077fade1 100644
--- a/security_context.md
+++ b/security_context.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -221,6 +192,13 @@ denied by default.
 In the future the admission plugin will base this decision upon configurable
 policies that reside within the [service account](http://pr.k8s.io/2297).
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
diff --git a/selector-generation.md b/selector-generation.md
index cd91615b..13f429b9 100644
--- a/selector-generation.md
+++ b/selector-generation.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -209,6 +180,13 @@ We probably want as much as possible the same behavior for Job and
 ReplicationController.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
diff --git a/service_accounts.md b/service_accounts.md
index 2affa10e..c9afb699 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -239,6 +210,13 @@ serviceAccounts.
 In that case, the user may want to GET serviceAccounts to see
 what has been created.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]()
diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index eb528580..4ce67569 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -160,6 +131,13 @@ rollout with the old version
 * Goto Rollout with `foo` and `foo-next` trading places.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]()
diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md
index dfa3f213..dd146944 100644
--- a/taint-toleration-dedicated.md
+++ b/taint-toleration-dedicated.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -317,6 +288,13 @@ Omega project at Google.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]()
diff --git a/versioning.md b/versioning.md
index f6b8efaf..2d46c46d 100644
--- a/versioning.md
+++ b/versioning.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -203,6 +174,13 @@ There is a separate question of how to track the capabilities of a kubelet to
 facilitate rolling upgrades. That is not addressed here.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
-- 
cgit v1.2.3


From d0496f5fea551e227fdaa9206b35c4ed5dc39b8f Mon Sep 17 00:00:00 2001
From: David McMahon
Date: Fri, 10 Jun 2016 14:21:20 -0700
Subject: Versioning docs and examples for v1.4.0-alpha.0.

---
 README.md                                 | 36 +++++-----------------
 access.md                                 | 36 +++++-----------------
 admission_control.md                      | 36 +++++-----------------
 admission_control_limit_range.md          | 44 +++++++--------------------
 admission_control_resource_quota.md       | 50 +++++++++----------------------
 architecture.md                           | 36 +++++-----------------
 aws_under_the_hood.md                     | 36 +++++-----------------
 clustering.md                             | 36 +++++-----------------
 clustering/README.md                      | 36 +++++-----------------
 command_execution_port_forwarding.md      | 36 +++++-----------------
 configmap.md                              | 36 +++++-----------------
 control-plane-resilience.md               | 31 +++++--------------
 daemon.md                                 | 36 +++++-----------------
 downward_api_resources_limits_requests.md | 31 +++++--------------
 enhance-pluggable-policy.md               | 36 +++++-----------------
 event_compression.md                      | 40 ++++++-------------------
 expansion.md                              | 36 +++++-----------------
 extending-api.md                          | 36 +++++-----------------
 federated-services.md                     | 31 +++++--------------
 federation-phase-1.md                     | 31 +++++--------------
 horizontal-pod-autoscaler.md              | 38 +++++------------------
 identifiers.md                            | 36 +++++-----------------
 indexed-job.md                            | 36 +++++-----------------
 metadata-policy.md                        | 44 +++++++--------------------
 namespaces.md                             | 36 +++++-----------------
 networking.md                             | 36 +++++-----------------
 nodeaffinity.md                           | 36 +++++-----------------
 persistent-storage.md                     | 36 +++++-----------------
 podaffinity.md                            | 36 +++++-----------------
 principles.md                             | 36 +++++-----------------
 resource-qos.md                           | 31 +++++--------------
 resources.md                              | 36 +++++-----------------
 scheduler_extender.md                     | 36 +++++-----------------
 seccomp.md                                | 31 +++++--------------
 secrets.md                                | 36 +++++-----------------
 security.md                               | 36 +++++-----------------
 security_context.md                       | 36 +++++-----------------
 selector-generation.md                    | 36 +++++-----------------
 service_accounts.md                       | 36 +++++-----------------
 simple-rolling-update.md                  | 36 +++++-----------------
 taint-toleration-dedicated.md             | 36 +++++-----------------
 versioning.md                             | 36 +++++-----------------
 42 files changed, 312 insertions(+), 1206 deletions(-)

diff --git a/README.md b/README.md
index 2f1de058..e5ca4552 100644
--- a/README.md
+++ b/README.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/README.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -91,6 +62,13 @@ transparent, composable manner.
 For more about the Kubernetes architecture, see [architecture](architecture.md).
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()
diff --git a/access.md b/access.md
index 7cf1ad39..e5c729e3 100644
--- a/access.md
+++ b/access.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/access.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -405,6 +376,13 @@ Improvements:
   performing audit or other sensitive functions.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
diff --git a/admission_control.md b/admission_control.md
index eef323b7..2a944e1e 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -124,6 +95,13 @@ following:
 If at any step, there is an error, the request is canceled.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()
diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index 8a6c751d..fe26b819 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -100,12 +71,12 @@ type LimitRange struct {
   TypeMeta `json:",inline"`
   // Standard object's metadata.
   // More info:
-  // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+  // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
   ObjectMeta `json:"metadata,omitempty"`
 
   // Spec defines the limits enforced.
   // More info:
-  // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+  // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
   Spec LimitRangeSpec `json:"spec,omitempty"`
 }
@@ -114,12 +85,12 @@ type LimitRangeList struct {
   TypeMeta `json:",inline"`
   // Standard list metadata.
   // More info:
-  // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+  // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
   ListMeta `json:"metadata,omitempty"`
 
   // Items is a list of LimitRange objects.
   // More info:
-  // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
+  // http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_limit_range.md
   Items []LimitRange `json:"items"`
 }
 ```
@@ -244,6 +215,13 @@ the following would happen.
 3. If the container is later resized, it's cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()
diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index bfac66eb..199ab752 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -69,13 +40,13 @@ const (
 // ResourceQuotaSpec defines the desired hard limits to enforce for Quota
 type ResourceQuotaSpec struct {
   // Hard is the set of desired hard limits for each named resource
-  Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+  Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }
 
 // ResourceQuotaStatus defines the enforced hard limits and observed use
 type ResourceQuotaStatus struct {
   // Hard is the set of enforced hard limits for each named resource
-  Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+  Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
   // Used is the current observed total usage of the resource in the namespace
   Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
 }
@@ -83,22 +54,22 @@ type ResourceQuotaStatus struct {
 // ResourceQuota sets aggregate quota restrictions enforced per namespace
 type ResourceQuota struct {
   TypeMeta `json:",inline"`
-  ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
+  ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
 
   // Spec defines the desired quota
-  Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
+  Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
 
   // Status defines the actual enforced quota and its current usage
-  Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
+  Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
 }
 
 // ResourceQuotaList is a list of ResourceQuota items
 type ResourceQuotaList struct {
   TypeMeta `json:",inline"`
-  ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
+  ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
 
   // Items is a list of ResourceQuota objects
-  Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+  Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }
 ```
@@ -244,6 +215,13 @@ services 0 5
 See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.
+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()
diff --git a/architecture.md b/architecture.md
index b8ce990f..d6c653a9 100644
--- a/architecture.md
+++ b/architecture.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -114,6 +85,13 @@ API.
 We eventually plan to port it to a generic plug-in mechanism, once one is
 implemented.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 13aa783c..6c55d9f3 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -336,6 +307,13 @@ install Kubernetes.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]()
diff --git a/clustering.md b/clustering.md
index 327456b3..8a61cfa9 100644
--- a/clustering.md
+++ b/clustering.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -157,6 +128,13 @@ code that can verify the signing requests via other means.

 ![Dynamic Sequence Diagram](clustering/dynamic.png)

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]()
diff --git a/clustering/README.md b/clustering/README.md
index 193f343b..49f0c901 100644
--- a/clustering/README.md
+++ b/clustering/README.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
 This directory contains diagrams for the clustering design doc.
@@ -67,6 +38,13 @@ system and automatically rebuild when files have changed.
 Just do a `make watch`.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()
diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md
index 4e579f8d..78f8ea89 100644
--- a/command_execution_port_forwarding.md
+++ b/command_execution_port_forwarding.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -187,6 +158,13 @@ data.
 This can most likely be achieved via SELinux labeling and unique process
 contexts.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]()
diff --git a/configmap.md b/configmap.md
index e4b69eaa..4eb65702 100644
--- a/configmap.md
+++ b/configmap.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -329,6 +300,13 @@ spec:
 In the future, we may add the ability to specify an init-container that can
 watch the volume contents for updates and respond to changes when they occur.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]()
diff --git a/control-plane-resilience.md b/control-plane-resilience.md
index 39110e3a..5dac8709 100644
--- a/control-plane-resilience.md
+++ b/control-plane-resilience.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -265,6 +241,13 @@ be automated and continuously tested.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
diff --git a/daemon.md b/daemon.md
index 9b66e0e1..1719c327 100644
--- a/daemon.md
+++ b/daemon.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -235,6 +206,13 @@ restartPolicy set to Always.
 - Should work similarly to [Deployment](http://issues.k8s.io/1743).

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]()
diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
index 60ec4787..cc6540cd 100644
--- a/downward_api_resources_limits_requests.md
+++ b/downward_api_resources_limits_requests.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -646,6 +622,13 @@ and export GOMAXPROCS=$(CPU_LIMIT)"
 ```

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()
diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md
index 8f184af9..ed65b6fc 100644
--- a/enhance-pluggable-policy.md
+++ b/enhance-pluggable-policy.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -458,6 +429,13 @@ type LocalResourceAccessReviewResponse struct {
 ```

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()
diff --git a/event_compression.md b/event_compression.md
index c4dfc154..43f6d52b 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -71,7 +42,7 @@ entries.
 ## Design

 Instead of a single Timestamp, each event object
-[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
+[contains](http://releases.k8s.io/v1.4.0-alpha.0/pkg/api/types.go#L1111) the following
 fields:
 * `FirstTimestamp unversioned.Time`
   * The date/time of the first occurrence of the event.
@@ -132,7 +103,7 @@ of time and generates tons of unique events, the
 previously generated events cache will not grow unchecked in memory. Instead,
 after 4096 unique events are generated, the oldest events are evicted from the
 cache.
 * When an event is generated, the previously generated events cache is checked
-(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/v1.4.0-alpha.0/pkg/client/record/event.go)).
 * If the key for the new event matches the key for a previously generated
 event (meaning all of the above fields match between the new event and some
 previously generated event), then the event is considered to be a duplicate and
@@ -198,6 +169,13 @@ single event to optimize etcd storage.
 instead of map.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()
diff --git a/expansion.md b/expansion.md
index cf44baed..7c0fff2f 100644
--- a/expansion.md
+++ b/expansion.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -446,6 +417,13 @@ spec:
 ```

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()
diff --git a/extending-api.md b/extending-api.md
index aa1821c8..ef6cf902 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -230,6 +201,13 @@ ${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/$
 ```

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
diff --git a/federated-services.md b/federated-services.md
index 7e9933e3..2d98bdc6 100644
--- a/federated-services.md
+++ b/federated-services.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -538,6 +514,13 @@ also have the anti-entropy mechanism for reconciling ubernetes "desired
 desired" state against kubernetes "actual desired" state.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()
diff --git a/federation-phase-1.md b/federation-phase-1.md
index 53087fd8..b4a90b6f 100644
--- a/federation-phase-1.md
+++ b/federation-phase-1.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -431,6 +407,13 @@ document
 Please refer to that document for details.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index 1b0d78bd..73ea8a98 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -42,7 +13,7 @@ is responsible for dynamically controlling the number of replicas of some
 collection (e.g. the pods of a ReplicationController) to meet some objective(s),
 for example a target per-pod CPU utilization.

-This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).
+This design supersedes [autoscaling.md](http://releases.k8s.io/v1.4.0-alpha.0/docs/proposals/autoscaling.md).

 ## Overview
@@ -290,6 +261,13 @@ the same node, kill one of them. Discussed in issue [#4301](https://github.com/k

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
diff --git a/identifiers.md b/identifiers.md
index 175d25c9..bb4ac2b1 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -142,6 +113,13 @@ unique across time.
 1. This may correspond to Docker's container ID.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
diff --git a/indexed-job.md b/indexed-job.md
index 6c41bd64..a6f4eb37 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -929,6 +900,13 @@ This differs from PetSet in that PetSet uses names and not indexes.
 PetSet is intended to support ones to tens of things.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()
diff --git a/metadata-policy.md b/metadata-policy.md
index da7d5425..d51a4151 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -105,11 +76,11 @@ type PolicyAction struct {
 type MetadataPolicy struct {
 	unversioned.TypeMeta `json:",inline"`
 	// Standard object's metadata.
-	// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+	// More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
 	ObjectMeta `json:"metadata,omitempty"`

 	// Spec defines the metadata policy that should be enforced.
-	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+	// http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
 	Spec MetadataPolicySpec `json:"spec,omitempty"`
 }

@@ -117,11 +88,11 @@ type MetadataPolicy struct {
 type MetadataPolicyList struct {
 	unversioned.TypeMeta `json:",inline"`
 	// Standard list metadata.
-	// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+	// More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
 	unversioned.ListMeta `json:"metadata,omitempty"`

 	// Items is a list of MetadataPolicy objects.
-	// More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+	// More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
 	Items []MetadataPolicy `json:"items"`
 }
 ```
@@ -166,6 +137,13 @@ API for matching "claims" to "service classes"; matching a pod to a
 scheduler would be one use for such an API.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]()
diff --git a/namespaces.md b/namespaces.md
index d63015bc..dd1e27bd 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -399,6 +370,13 @@ At this point, all content associated with that Namespace, and the Namespace
 itself are gone.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
diff --git a/networking.md b/networking.md
index ca2527e5..2e9f5de7 100644
--- a/networking.md
+++ b/networking.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/networking.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -216,6 +187,13 @@ by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
 requests from people running Kubernetes on bare metal, though. :-)

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]()
diff --git a/nodeaffinity.md b/nodeaffinity.md
index 8c999fec..7f27bcc9 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -274,6 +245,13 @@ The main related issue is #341. Issue #367 is also related. Those issues
 reference other related issues.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]()
diff --git a/persistent-storage.md b/persistent-storage.md
index 00eb2fef..417bb4f8 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -276,6 +247,13 @@ Admins can script the recycling of released volumes.
 Future dynamic provisioners will understand how a volume should be recycled.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
diff --git a/podaffinity.md b/podaffinity.md
index 2c57ed90..db615c3d 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -697,6 +668,13 @@ This proposal is to satisfy #14816.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]()
diff --git a/principles.md b/principles.md
index 297ae923..77ddaf9d 100644
--- a/principles.md
+++ b/principles.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/principles.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -130,6 +101,13 @@ TODO

 * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
diff --git a/resource-qos.md b/resource-qos.md
index e5088c85..d0c709bd 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -241,6 +217,13 @@ A strict hierarchy of user-specified numerical priorities is not desirable becau
 1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively
 2. Changes to desired priority bands would require changes to all user pod configurations.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]()
diff --git a/resources.md b/resources.md
index 2a75c987..131b67cb 100644
--- a/resources.md
+++ b/resources.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/resources.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
 **Note: this is a design doc, which describes features that have not been
@@ -398,6 +369,13 @@ second.
 * Compressible? yes

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()
diff --git a/scheduler_extender.md b/scheduler_extender.md
index e8ad718f..bf102c8b 100644
--- a/scheduler_extender.md
+++ b/scheduler_extender.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -134,6 +105,13 @@ its priority functions) and used for final host selection.
 Multiple extenders can be configured in the scheduler policy.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
diff --git a/seccomp.md b/seccomp.md
index 7d65611e..25f43de4 100644
--- a/seccomp.md
+++ b/seccomp.md
@@ -1,29 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -290,6 +266,13 @@ spec:
       emptyDir: {}
 ```

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
diff --git a/secrets.md b/secrets.md
index b1b83106..28504cf7 100644
--- a/secrets.md
+++ b/secrets.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -656,6 +627,13 @@ on their filesystems:

 /etc/secret-volume/password

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]()
diff --git a/security.md b/security.md
index 06bb3979..3c321334 100644
--- a/security.md
+++ b/security.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -247,6 +218,13 @@ scheduler may need read access to user or project-container information to
 determine preferential location (underspecified at this time).

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
diff --git a/security_context.md b/security_context.md
index 2b7d8b96..077fade1 100644
--- a/security_context.md
+++ b/security_context.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -221,6 +192,13 @@ denied by default. In the future the admission plugin will base this decision
 upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
diff --git a/selector-generation.md b/selector-generation.md
index cd91615b..13f429b9 100644
--- a/selector-generation.md
+++ b/selector-generation.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-

PLEASE NOTE: This document applies to the HEAD of the source tree

-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -209,6 +180,13 @@ We probably want as much as possible the same behavior for Job and
 ReplicationController.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()

diff --git a/service_accounts.md b/service_accounts.md
index 2affa10e..c9afb699 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -239,6 +210,13 @@ serviceAccounts. In that case, the user may want to GET serviceAccounts to see
 what has been created.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]()

diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index eb528580..4ce67569 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -160,6 +131,13 @@ rollout with the old version
 * Goto Rollout with `foo` and `foo-next` trading places.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]()

diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md
index dfa3f213..dd146944 100644
--- a/taint-toleration-dedicated.md
+++ b/taint-toleration-dedicated.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -317,6 +288,13 @@ Omega project at Google.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]()

diff --git a/versioning.md b/versioning.md
index f6b8efaf..2d46c46d 100644
--- a/versioning.md
+++ b/versioning.md
@@ -1,34 +1,5 @@
-
-
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING
-
-
-PLEASE NOTE: This document applies to the HEAD of the source tree
-
-
-If you are using a released version of Kubernetes, you should
-refer to the docs that go with that version.
-
-
-
-The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md).
-
-Documentation for other releases can be found at
-[releases.k8s.io](http://releases.k8s.io).
-
---
-
-
@@ -203,6 +174,13 @@ There is a separate question of how to track the capabilities of a
 kubelet to facilitate rolling upgrades. That is not addressed here.

+
+
+
+
+
+
+
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
--
cgit v1.2.3

From 3455d88bc3d6d401b2146175c99259e31dd20e3e Mon Sep 17 00:00:00 2001
From: Dawn Chen
Date: Fri, 10 Jun 2016 16:46:46 -0700
Subject: Revert "Versioning docs and examples for v1.4.0-alpha.0."

This reverts commit cce9db3aa9555671c5ddf69549b46ed0fd7e472a.
---
 README.md                                 | 36 +++++++++++++++++-----
 access.md                                 | 36 +++++++++++++++++-----
 admission_control.md                      | 36 +++++++++++++++++-----
 admission_control_limit_range.md          | 44 ++++++++++++++++++++-------
 admission_control_resource_quota.md       | 50 ++++++++++++++++++++----------
 architecture.md                           | 36 +++++++++++++++++-----
 aws_under_the_hood.md                     | 36 +++++++++++++++++-----
 clustering.md                             | 36 +++++++++++++++++-----
 clustering/README.md                      | 36 +++++++++++++++++-----
 command_execution_port_forwarding.md      | 36 +++++++++++++++++-----
 configmap.md                              | 36 +++++++++++++++++-----
 control-plane-resilience.md               | 31 ++++++++++++++-----
 daemon.md                                 | 36 +++++++++++++++++-----
 downward_api_resources_limits_requests.md | 31 ++++++++++++++-----
 enhance-pluggable-policy.md               | 36 +++++++++++++++++-----
 event_compression.md                      | 40 +++++++++++++++++++------
 expansion.md                              | 36 +++++++++++++++++-----
 extending-api.md                          | 36 +++++++++++++++++-----
 federated-services.md                     | 31 ++++++++++++++-----
 federation-phase-1.md                     | 31 ++++++++++++++-----
 horizontal-pod-autoscaler.md              | 38 ++++++++++++++++++-----
 identifiers.md                            | 36 +++++++++++++++++-----
 indexed-job.md                            | 36 +++++++++++++++++-----
 metadata-policy.md                        | 44 ++++++++++++++++++++-------
 namespaces.md                             | 36 +++++++++++++++++-----
 networking.md                             | 36 +++++++++++++++++-----
 nodeaffinity.md                           | 36 +++++++++++++++++-----
 persistent-storage.md                     | 36 +++++++++++++++++-----
 podaffinity.md                            | 36 +++++++++++++++++-----
 principles.md                             | 36 +++++++++++++++++-----
 resource-qos.md                           | 31 ++++++++++++++-----
 resources.md                              | 36 +++++++++++++++++-----
 scheduler_extender.md                     | 36 +++++++++++++++++-----
 seccomp.md                                | 31 ++++++++++++++-----
 secrets.md                                | 36 +++++++++++++++++-----
 security.md                               | 36 +++++++++++++++++-----
 security_context.md                       | 36 +++++++++++++++++-----
 selector-generation.md                    | 36 +++++++++++++++++-----
 service_accounts.md                       | 36 +++++++++++++++++-----
 simple-rolling-update.md                  | 36 +++++++++++++++++-----
 taint-toleration-dedicated.md             | 36 +++++++++++++++++-----
 versioning.md                             | 36 +++++++++++++++++-----
 42 files changed, 1206 insertions(+), 312 deletions(-)

diff --git a/README.md b/README.md
index e5ca4552..2f1de058 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/README.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -62,13 +91,6 @@ transparent, composable manner.

 For more about the Kubernetes architecture, see [architecture](architecture.md).

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()

diff --git a/access.md b/access.md
index e5c729e3..7cf1ad39 100644
--- a/access.md
+++ b/access.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/access.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -376,13 +405,6 @@ Improvements:
 performing audit or other sensitive functions.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()

diff --git a/admission_control.md b/admission_control.md
index 2a944e1e..eef323b7 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -95,13 +124,6 @@ following:

 If at any step, there is an error, the request is canceled.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()

diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index fe26b819..8a6c751d 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -71,12 +100,12 @@ type LimitRange struct {
 	TypeMeta `json:",inline"`
 	// Standard object's metadata.
 	// More info:
-	// http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
+	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
 	ObjectMeta `json:"metadata,omitempty"`

 	// Spec defines the limits enforced.
 	// More info:
-	// http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
+	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
 	Spec LimitRangeSpec `json:"spec,omitempty"`
 }

@@ -85,12 +114,12 @@ type LimitRangeList struct {
 	TypeMeta `json:",inline"`
 	// Standard list metadata.
 	// More info:
-	// http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
+	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
 	ListMeta `json:"metadata,omitempty"`

 	// Items is a list of LimitRange objects.
 	// More info:
-	// http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_limit_range.md
+	// http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
 	Items []LimitRange `json:"items"`
 }
 ```

@@ -215,13 +244,6 @@ the following would happen.

 3. If the container is later resized, it's cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()

diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index 199ab752..bfac66eb 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -40,13 +69,13 @@ const (

 // ResourceQuotaSpec defines the desired hard limits to enforce for Quota
 type ResourceQuotaSpec struct {
 	// Hard is the set of desired hard limits for each named resource
-	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }

 // ResourceQuotaStatus defines the enforced hard limits and observed use
 type ResourceQuotaStatus struct {
 	// Hard is the set of enforced hard limits for each named resource
-	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 	// Used is the current observed total usage of the resource in the namespace
 	Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
 }

@@ -54,22 +83,22 @@ type ResourceQuotaStatus struct {
 // ResourceQuota sets aggregate quota restrictions enforced per namespace
 type ResourceQuota struct {
 	TypeMeta `json:",inline"`
-	ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
+	ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`

 	// Spec defines the desired quota
-	Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
+	Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`

 	// Status defines the actual enforced quota and its current usage
-	Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
+	Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
 }

 // ResourceQuotaList is a list of ResourceQuota items
 type ResourceQuotaList struct {
 	TypeMeta `json:",inline"`
-	ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
+	ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`

 	// Items is a list of ResourceQuota objects
-	Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+	Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }
 ```

@@ -215,13 +244,6 @@ services 0 5

 See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()

diff --git a/architecture.md b/architecture.md
index d6c653a9..b8ce990f 100644
--- a/architecture.md
+++ b/architecture.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -85,13 +114,6 @@ API.
 We eventually plan to port it to a generic plug-in mechanism, once one is
 implemented.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 6c55d9f3..13aa783c 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -307,13 +336,6 @@ install Kubernetes.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]()

diff --git a/clustering.md b/clustering.md
index 8a61cfa9..327456b3 100644
--- a/clustering.md
+++ b/clustering.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -128,13 +157,6 @@ code that can verify the signing requests via other means.

 ![Dynamic Sequence Diagram](clustering/dynamic.png)

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]()

diff --git a/clustering/README.md b/clustering/README.md
index 49f0c901..193f343b 100644
--- a/clustering/README.md
+++ b/clustering/README.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
 This directory contains diagrams for the clustering design doc.

@@ -38,13 +67,6 @@ system and automatically rebuild when files have changed. Just do a
 `make watch`.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()

diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md
index 78f8ea89..4e579f8d 100644
--- a/command_execution_port_forwarding.md
+++ b/command_execution_port_forwarding.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -158,13 +187,6 @@ data. This can most likely be achieved via SELinux labeling and unique process
 contexts.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]()

diff --git a/configmap.md b/configmap.md
index 4eb65702..e4b69eaa 100644
--- a/configmap.md
+++ b/configmap.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -300,6 +329,6 @@ spec:

 In the future, we may add the ability to specify an init-container that can
 watch the volume contents for updates and respond to changes when they occur.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]()

diff --git a/control-plane-resilience.md b/control-plane-resilience.md
index 5dac8709..39110e3a 100644
--- a/control-plane-resilience.md
+++ b/control-plane-resilience.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -241,13 +265,6 @@ be automated and continuously tested.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()

diff --git a/daemon.md b/daemon.md
index 1719c327..9b66e0e1 100644
--- a/daemon.md
+++ b/daemon.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -206,13 +235,6 @@ restartPolicy set to Always.
 - Should work similarly to [Deployment](http://issues.k8s.io/1743).

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]()

diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
index cc6540cd..60ec4787 100644
--- a/downward_api_resources_limits_requests.md
+++ b/downward_api_resources_limits_requests.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -622,13 +646,6 @@ and
 export GOMAXPROCS=$(CPU_LIMIT)"
 ```

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()

diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md
index ed65b6fc..8f184af9 100644
--- a/enhance-pluggable-policy.md
+++ b/enhance-pluggable-policy.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -429,13 +458,6 @@ type LocalResourceAccessReviewResponse struct {
 ```

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()

diff --git a/event_compression.md b/event_compression.md
index 43f6d52b..c4dfc154 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -42,7 +71,7 @@ entries.
 ## Design

 Instead of a single Timestamp, each event object
-[contains](http://releases.k8s.io/v1.4.0-alpha.0/pkg/api/types.go#L1111) the following
+[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
 fields:
 * `FirstTimestamp unversioned.Time`
   * The date/time of the first occurrence of the event.
@@ -103,7 +132,7 @@ of time and generates tons of unique events, the previously generated events
 cache will not grow unchecked in memory. Instead, after 4096 unique events are
 generated, the oldest events are evicted from the cache.
 * When an event is generated, the previously generated events cache is checked
-(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/v1.4.0-alpha.0/pkg/client/record/event.go)).
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
 * If the key for the new event matches the key for a previously generated event
 (meaning all of the above fields match between the new event and some
 previously generated event), then the event is considered to be a duplicate and
@@ -169,13 +198,6 @@ single event to optimize etcd storage.
 instead of map.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()

diff --git a/expansion.md b/expansion.md
index 7c0fff2f..cf44baed 100644
--- a/expansion.md
+++ b/expansion.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -417,13 +446,6 @@ spec:
 ```

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()

diff --git a/extending-api.md b/extending-api.md
index ef6cf902..aa1821c8 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -201,13 +230,6 @@ ${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/$
 ```

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()

diff --git a/federated-services.md b/federated-services.md
index 2d98bdc6..7e9933e3 100644
--- a/federated-services.md
+++ b/federated-services.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -514,13 +538,6 @@ also have the anti-entropy mechanism for reconciling ubernetes "desired
 desired" state against kubernetes "actual desired" state.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()

diff --git a/federation-phase-1.md b/federation-phase-1.md
index b4a90b6f..53087fd8 100644
--- a/federation-phase-1.md
+++ b/federation-phase-1.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -407,13 +431,6 @@ document
 Please refer to that document for details.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()

diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index 73ea8a98..1b0d78bd 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -13,7 +42,7 @@ is responsible for dynamically controlling the number of replicas of some
 collection (e.g. the pods of a ReplicationController) to meet some objective(s),
 for example a target per-pod CPU utilization.

-This design supersedes [autoscaling.md](http://releases.k8s.io/v1.4.0-alpha.0/docs/proposals/autoscaling.md).
+This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).

 ## Overview

@@ -261,13 +290,6 @@ the same node, kill one of them. Discussed in issue [#4301](https://github.com/k

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()

diff --git a/identifiers.md b/identifiers.md
index bb4ac2b1..175d25c9 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -113,13 +142,6 @@ unique across time.
   1. This may correspond to Docker's container ID.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()

diff --git a/indexed-job.md b/indexed-job.md
index a6f4eb37..6c41bd64 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -900,13 +929,6 @@ This differs from PetSet in that PetSet uses names and not indexes. PetSet
 is intended to support ones to tens of things.

-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()

diff --git a/metadata-policy.md b/metadata-policy.md
index d51a4151..da7d5425 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -76,11 +105,11 @@ type PolicyAction struct { type MetadataPolicy struct { unversioned.TypeMeta `json:",inline"` // Standard object's metadata. - // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata ObjectMeta `json:"metadata,omitempty"` // Spec defines the metadata policy that should be enforced. - // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status Spec MetadataPolicySpec `json:"spec,omitempty"` } @@ -88,11 +117,11 @@ type MetadataPolicy struct { type MetadataPolicyList struct { unversioned.TypeMeta `json:",inline"` // Standard list metadata. - // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds + // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds unversioned.ListMeta `json:"metadata,omitempty"` // Items is a list of MetadataPolicy objects. - // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota + // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota Items []MetadataPolicy `json:"items"` } ``` @@ -137,13 +166,6 @@ API for matching "claims" to "service classes"; matching a pod to a scheduler would be one use for such an API. 
- - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]() diff --git a/namespaces.md b/namespaces.md index dd1e27bd..d63015bc 100644 --- a/namespaces.md +++ b/namespaces.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -370,13 +399,6 @@ At this point, all content associated with that Namespace, and the Namespace itself are gone. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]() diff --git a/networking.md b/networking.md index 2e9f5de7..ca2527e5 100644 --- a/networking.md +++ b/networking.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/networking.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -187,13 +216,6 @@ by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-) - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]() diff --git a/nodeaffinity.md b/nodeaffinity.md index 7f27bcc9..8c999fec 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -245,13 +274,6 @@ The main related issue is #341. Issue #367 is also related. Those issues reference other related issues. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]() diff --git a/persistent-storage.md b/persistent-storage.md index 417bb4f8..00eb2fef 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -247,13 +276,6 @@ Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]() diff --git a/podaffinity.md b/podaffinity.md index db615c3d..2c57ed90 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -668,13 +697,6 @@ This proposal is to satisfy #14816. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]() diff --git a/principles.md b/principles.md index 77ddaf9d..297ae923 100644 --- a/principles.md +++ b/principles.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/principles.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -101,13 +130,6 @@ TODO * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules) - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]() diff --git a/resource-qos.md b/resource-qos.md index d0c709bd..e5088c85 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -1,5 +1,29 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -217,13 +241,6 @@ A strict hierarchy of user-specified numerical priorities is not desirable becau 1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively 2. Changes to desired priority bands would require changes to all user pod configurations. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]() diff --git a/resources.md b/resources.md index 131b67cb..2a75c987 100644 --- a/resources.md +++ b/resources.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/resources.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + **Note: this is a design doc, which describes features that have not been @@ -369,13 +398,6 @@ second. * Compressible? yes - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]() diff --git a/scheduler_extender.md b/scheduler_extender.md index bf102c8b..e8ad718f 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -105,13 +134,6 @@ its priority functions) and used for final host selection. Multiple extenders can be configured in the scheduler policy. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]() diff --git a/seccomp.md b/seccomp.md index 25f43de4..7d65611e 100644 --- a/seccomp.md +++ b/seccomp.md @@ -1,5 +1,29 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -266,13 +290,6 @@ spec: emptyDir: {} ``` - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]() diff --git a/secrets.md b/secrets.md index 28504cf7..b1b83106 100644 --- a/secrets.md +++ b/secrets.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -627,13 +656,6 @@ on their filesystems: /etc/secret-volume/password - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]() diff --git a/security.md b/security.md index 3c321334..06bb3979 100644 --- a/security.md +++ b/security.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/security.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -218,13 +247,6 @@ scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time). - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() diff --git a/security_context.md b/security_context.md index 077fade1..2b7d8b96 100644 --- a/security_context.md +++ b/security_context.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -192,13 +221,6 @@ denied by default. In the future the admission plugin will base this decision upon configurable policies that reside within the [service account](http://pr.k8s.io/2297). - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]() diff --git a/selector-generation.md b/selector-generation.md index 13f429b9..cd91615b 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -180,13 +209,6 @@ We probably want as much as possible the same behavior for Job and ReplicationController. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]() diff --git a/service_accounts.md b/service_accounts.md index c9afb699..2affa10e 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -210,13 +239,6 @@ serviceAccounts. In that case, the user may want to GET serviceAccounts to see what has been created. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]() diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 4ce67569..eb528580 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -131,13 +160,6 @@ rollout with the old version * Goto Rollout with `foo` and `foo-next` trading places. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]() diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index dd146944..dfa3f213 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -288,13 +317,6 @@ Omega project at Google. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]() diff --git a/versioning.md b/versioning.md index 2d46c46d..f6b8efaf 100644 --- a/versioning.md +++ b/versioning.md @@ -1,5 +1,34 @@ + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md). + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + @@ -174,13 +203,6 @@ There is a separate question of how to track the capabilities of a kubelet to facilitate rolling upgrades. That is not addressed here. - - - - - - - [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]() -- cgit v1.2.3 From 9f30b2c79b151e7949ff47b72c00daf68633938e Mon Sep 17 00:00:00 2001 From: Piotr Szczesniak Date: Mon, 13 Jun 2016 14:18:05 +0200 Subject: Added warning to hpa design doc --- horizontal-pod-autoscaler.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index 1b0d78bd..d8d2280f 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -32,6 +32,8 @@ Documentation for other releases can be found at +

Warning! This document might be outdated.

+ # Horizontal Pod Autoscaling ## Preface -- cgit v1.2.3 From 9658b3cb208f7db4fec80f11857f224fa781c11e Mon Sep 17 00:00:00 2001 From: David McMahon Date: Mon, 13 Jun 2016 12:24:34 -0700 Subject: Updated docs and examples for release-1.3. --- README.md | 2 +- access.md | 2 +- admission_control.md | 2 +- admission_control_limit_range.md | 2 +- admission_control_resource_quota.md | 2 +- architecture.md | 2 +- aws_under_the_hood.md | 2 +- clustering.md | 2 +- clustering/README.md | 2 +- command_execution_port_forwarding.md | 2 +- configmap.md | 2 +- control-plane-resilience.md | 5 +++++ daemon.md | 2 +- downward_api_resources_limits_requests.md | 5 +++++ enhance-pluggable-policy.md | 2 +- event_compression.md | 2 +- expansion.md | 2 +- extending-api.md | 2 +- federated-services.md | 5 +++++ federation-phase-1.md | 5 +++++ horizontal-pod-autoscaler.md | 2 +- identifiers.md | 2 +- indexed-job.md | 2 +- metadata-policy.md | 2 +- namespaces.md | 2 +- networking.md | 2 +- nodeaffinity.md | 2 +- persistent-storage.md | 2 +- podaffinity.md | 2 +- principles.md | 2 +- resource-qos.md | 5 +++++ resources.md | 2 +- scheduler_extender.md | 2 +- seccomp.md | 5 +++++ secrets.md | 2 +- security.md | 2 +- security_context.md | 2 +- selector-generation.md | 2 +- service_accounts.md | 2 +- simple-rolling-update.md | 2 +- taint-toleration-dedicated.md | 2 +- versioning.md | 2 +- 42 files changed, 66 insertions(+), 36 deletions(-) diff --git a/README.md b/README.md index 2f1de058..834534a3 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/README.md). +[here](http://releases.k8s.io/release-1.3/docs/design/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/access.md b/access.md index 7cf1ad39..a19e082e 100644 --- a/access.md +++ b/access.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/access.md). +[here](http://releases.k8s.io/release-1.3/docs/design/access.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control.md b/admission_control.md index eef323b7..ae842122 100644 --- a/admission_control.md +++ b/admission_control.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md). +[here](http://releases.k8s.io/release-1.3/docs/design/admission_control.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 8a6c751d..e8afaa78 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md). +[here](http://releases.k8s.io/release-1.3/docs/design/admission_control_limit_range.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index bfac66eb..076fe588 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md). 
+[here](http://releases.k8s.io/release-1.3/docs/design/admission_control_resource_quota.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/architecture.md b/architecture.md index b8ce990f..94a14067 100644 --- a/architecture.md +++ b/architecture.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md). +[here](http://releases.k8s.io/release-1.3/docs/design/architecture.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 13aa783c..12d31701 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md). +[here](http://releases.k8s.io/release-1.3/docs/design/aws_under_the_hood.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering.md b/clustering.md index 327456b3..5ca676c4 100644 --- a/clustering.md +++ b/clustering.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md). +[here](http://releases.k8s.io/release-1.3/docs/design/clustering.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/clustering/README.md b/clustering/README.md index 193f343b..1a6bb48d 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md). 
+[here](http://releases.k8s.io/release-1.3/docs/design/clustering/README.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 4e579f8d..489f936e 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md). +[here](http://releases.k8s.io/release-1.3/docs/design/command_execution_port_forwarding.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/configmap.md b/configmap.md index e4b69eaa..a9f80f8a 100644 --- a/configmap.md +++ b/configmap.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md). +[here](http://releases.k8s.io/release-1.3/docs/design/configmap.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/control-plane-resilience.md b/control-plane-resilience.md index 39110e3a..b3e76c40 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.3/docs/design/control-plane-resilience.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/daemon.md b/daemon.md index 9b66e0e1..be78a035 100644 --- a/daemon.md +++ b/daemon.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. 
The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md). +[here](http://releases.k8s.io/release-1.3/docs/design/daemon.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 60ec4787..1443173c 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.3/docs/design/downward_api_resources_limits_requests.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 8f184af9..bd6a329f 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md). +[here](http://releases.k8s.io/release-1.3/docs/design/enhance-pluggable-policy.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/event_compression.md b/event_compression.md index c4dfc154..7ed46538 100644 --- a/event_compression.md +++ b/event_compression.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md). +[here](http://releases.k8s.io/release-1.3/docs/design/event_compression.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/expansion.md b/expansion.md index cf44baed..2c8b775a 100644 --- a/expansion.md +++ b/expansion.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md). +[here](http://releases.k8s.io/release-1.3/docs/design/expansion.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/extending-api.md b/extending-api.md index aa1821c8..4c7049af 100644 --- a/extending-api.md +++ b/extending-api.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md). +[here](http://releases.k8s.io/release-1.3/docs/design/extending-api.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/federated-services.md b/federated-services.md index 7e9933e3..5572b12f 100644 --- a/federated-services.md +++ b/federated-services.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.3/docs/design/federated-services.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/federation-phase-1.md b/federation-phase-1.md index 53087fd8..ba7386e7 100644 --- a/federation-phase-1.md +++ b/federation-phase-1.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.3/docs/design/federation-phase-1.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index 1b0d78bd..65951a76 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/horizontal-pod-autoscaler.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/identifiers.md b/identifiers.md
index 175d25c9..1966d250 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/identifiers.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/indexed-job.md b/indexed-job.md
index 6c41bd64..63dafc7b 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/indexed-job.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/metadata-policy.md b/metadata-policy.md
index da7d5425..384d5ef4 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/metadata-policy.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/namespaces.md b/namespaces.md
index d63015bc..2eff4512 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/namespaces.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/networking.md b/networking.md
index ca2527e5..93b243be 100644
--- a/networking.md
+++ b/networking.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/networking.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/networking.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/nodeaffinity.md b/nodeaffinity.md
index 8c999fec..3d9266ae 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/nodeaffinity.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/persistent-storage.md b/persistent-storage.md
index 00eb2fef..b973d2b1 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/persistent-storage.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/podaffinity.md b/podaffinity.md
index 2c57ed90..fcc5fc87 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/podaffinity.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/principles.md b/principles.md
index 297ae923..3e711986 100644
--- a/principles.md
+++ b/principles.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/principles.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/principles.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/resource-qos.md b/resource-qos.md
index e5088c85..24b966fd 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should
 refer to the docs that go with that version.
 
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.3/docs/design/resource-qos.md).
+
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/resources.md b/resources.md
index 2a75c987..d9e0bc3d 100644
--- a/resources.md
+++ b/resources.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/resources.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/resources.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/scheduler_extender.md b/scheduler_extender.md
index e8ad718f..604ed3ba 100644
--- a/scheduler_extender.md
+++ b/scheduler_extender.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/scheduler_extender.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/seccomp.md b/seccomp.md
index 7d65611e..65963882 100644
--- a/seccomp.md
+++ b/seccomp.md
@@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should
 refer to the docs that go with that version.
 
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.3/docs/design/seccomp.md).
+
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/secrets.md b/secrets.md
index b1b83106..98c8f0ce 100644
--- a/secrets.md
+++ b/secrets.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/secrets.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/security.md b/security.md
index 06bb3979..0ed8f2f0 100644
--- a/security.md
+++ b/security.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/security.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/security_context.md b/security_context.md
index 2b7d8b96..59b152ab 100644
--- a/security_context.md
+++ b/security_context.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/security_context.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/selector-generation.md b/selector-generation.md
index cd91615b..9627b2e5 100644
--- a/selector-generation.md
+++ b/selector-generation.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/selector-generation.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/service_accounts.md b/service_accounts.md
index 2affa10e..2f656228 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/service_accounts.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index eb528580..32a22ce8 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/simple-rolling-update.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md
index dfa3f213..e5a569c9 100644
--- a/taint-toleration-dedicated.md
+++ b/taint-toleration-dedicated.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/taint-toleration-dedicated.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/versioning.md b/versioning.md
index f6b8efaf..b8d83e93 100644
--- a/versioning.md
+++ b/versioning.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.
 
 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md).
+[here](http://releases.k8s.io/release-1.3/docs/design/versioning.md).
 
 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
--
cgit v1.2.3


From df7d4aa2049cfb4f07da1ab85651ee0b0fdb04db Mon Sep 17 00:00:00 2001
From: Zach Loafman
Date: Tue, 14 Jun 2016 09:01:53 -0700
Subject: Revert "Redo v1.4.0-alpha.0"

This reverts commit c7f1485e1b3491e98f102c30e7e342cb53dda818, reversing
changes made to 939ad4115a2a96f1e18758ec45b7d312bec65aa7.
---
 README.md                                 | 36 +++++++++++++++++-----
 access.md                                 | 36 +++++++++++++++++-----
 admission_control.md                      | 36 +++++++++++++++++-----
 admission_control_limit_range.md          | 44 ++++++++++++++++++++-------
 admission_control_resource_quota.md       | 50 ++++++++++++++++++++++---------
 architecture.md                           | 36 +++++++++++++++++-----
 aws_under_the_hood.md                     | 36 +++++++++++++++++-----
 clustering.md                             | 36 +++++++++++++++++-----
 clustering/README.md                      | 36 +++++++++++++++++-----
 command_execution_port_forwarding.md      | 36 +++++++++++++++++-----
 configmap.md                              | 36 +++++++++++++++++-----
 control-plane-resilience.md               | 31 ++++++++++++++-----
 daemon.md                                 | 36 +++++++++++++++++-----
 downward_api_resources_limits_requests.md | 31 ++++++++++++++-----
 enhance-pluggable-policy.md               | 36 +++++++++++++++++-----
 event_compression.md                      | 40 +++++++++++++++++++------
 expansion.md                              | 36 +++++++++++++++++-----
 extending-api.md                          | 36 +++++++++++++++++-----
 federated-services.md                     | 31 ++++++++++++++-----
 federation-phase-1.md                     | 31 ++++++++++++++-----
 horizontal-pod-autoscaler.md              | 38 ++++++++++++++++++-----
 identifiers.md                            | 36 +++++++++++++++++-----
 indexed-job.md                            | 36 +++++++++++++++++-----
 metadata-policy.md                        | 44 ++++++++++++++++++++-------
 namespaces.md                             | 36 +++++++++++++++++-----
 networking.md                             | 36 +++++++++++++++++-----
 nodeaffinity.md                           | 36 +++++++++++++++++-----
 persistent-storage.md                     | 36 +++++++++++++++++-----
 podaffinity.md                            | 36 +++++++++++++++++-----
 principles.md                             | 36 +++++++++++++++++-----
 resource-qos.md                           | 31 ++++++++++++++-----
 resources.md                              | 36 +++++++++++++++++-----
 scheduler_extender.md                     | 36 +++++++++++++++++-----
 seccomp.md                                | 31 ++++++++++++++-----
 secrets.md                                | 36 +++++++++++++++++-----
 security.md                               | 36 +++++++++++++++++-----
 security_context.md                       | 36 +++++++++++++++++-----
 selector-generation.md                    | 36 +++++++++++++++++-----
 service_accounts.md                       | 36 +++++++++++++++++-----
 simple-rolling-update.md                  | 36 +++++++++++++++++-----
 taint-toleration-dedicated.md             | 36 +++++++++++++++++-----
 versioning.md                             | 36 +++++++++++++++++-----
 42 files changed, 1206 insertions(+), 312 deletions(-)

diff --git a/README.md b/README.md
index e5ca4552..2f1de058 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/README.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -62,13 +91,6 @@ transparent, composable manner.
 
 For more about the Kubernetes architecture, see [architecture](architecture.md).
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]()
diff --git a/access.md b/access.md
index e5c729e3..7cf1ad39 100644
--- a/access.md
+++ b/access.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/access.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -376,13 +405,6 @@ Improvements:
 
 performing audit or other sensitive functions.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
diff --git a/admission_control.md b/admission_control.md
index 2a944e1e..eef323b7 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -95,13 +124,6 @@ following:
 
 If at any step, there is an error, the request is canceled.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()
diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index fe26b819..8a6c751d 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_limit_range.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -71,12 +100,12 @@ type LimitRange struct {
     TypeMeta `json:",inline"`
     // Standard object's metadata.
     // More info:
-    // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
+    // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
     ObjectMeta `json:"metadata,omitempty"`
     // Spec defines the limits enforced.
     // More info:
-    // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
+    // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
     Spec LimitRangeSpec `json:"spec,omitempty"`
 }
@@ -85,12 +114,12 @@ type LimitRangeList struct {
     TypeMeta `json:",inline"`
     // Standard list metadata.
     // More info:
-    // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
+    // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
     ListMeta `json:"metadata,omitempty"`
     // Items is a list of LimitRange objects.
     // More info:
-    // http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_limit_range.md
+    // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
     Items []LimitRange `json:"items"`
 }
 ```
@@ -215,13 +244,6 @@ the following would happen.
 
 3. If the container is later resized, it's cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()
diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index 199ab752..bfac66eb 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/admission_control_resource_quota.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -40,13 +69,13 @@ const (
 // ResourceQuotaSpec defines the desired hard limits to enforce for Quota
 type ResourceQuotaSpec struct {
     // Hard is the set of desired hard limits for each named resource
-    Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+    Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }
 
 // ResourceQuotaStatus defines the enforced hard limits and observed use
 type ResourceQuotaStatus struct {
     // Hard is the set of enforced hard limits for each named resource
-    Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+    Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
     // Used is the current observed total usage of the resource in the namespace
     Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
 }
@@ -54,22 +83,22 @@
 // ResourceQuota sets aggregate quota restrictions enforced per namespace
 type ResourceQuota struct {
     TypeMeta `json:",inline"`
-    ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
+    ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
 
     // Spec defines the desired quota
-    Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
+    Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
 
     // Status defines the actual enforced quota and its current usage
-    Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status"`
+    Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
 }
 
 // ResourceQuotaList is a list of ResourceQuota items
 type ResourceQuotaList struct {
     TypeMeta `json:",inline"`
-    ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata"`
+    ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
 
     // Items is a list of ResourceQuota objects
-    Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
+    Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
 }
 ```
@@ -215,13 +244,6 @@ services 0 5
 
 See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()
diff --git a/architecture.md b/architecture.md
index d6c653a9..b8ce990f 100644
--- a/architecture.md
+++ b/architecture.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/architecture.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -85,13 +114,6 @@ API.
 
 We eventually plan to port it to a generic plug-in mechanism, once one is implemented.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 6c55d9f3..13aa783c 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/aws_under_the_hood.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -307,13 +336,6 @@ install Kubernetes.
 
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]()
diff --git a/clustering.md b/clustering.md
index 8a61cfa9..327456b3 100644
--- a/clustering.md
+++ b/clustering.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/clustering.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -128,13 +157,6 @@ code that can verify the signing requests via other means.
 
 ![Dynamic Sequence Diagram](clustering/dynamic.png)
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]()
diff --git a/clustering/README.md b/clustering/README.md
index 49f0c901..193f343b 100644
--- a/clustering/README.md
+++ b/clustering/README.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/clustering/README.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
 This directory contains diagrams for the clustering design doc.
@@ -38,13 +67,6 @@ system and automatically rebuild when files have changed.
 
 Just do a `make watch`.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]()
diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md
index 78f8ea89..4e579f8d 100644
--- a/command_execution_port_forwarding.md
+++ b/command_execution_port_forwarding.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/command_execution_port_forwarding.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -158,13 +187,6 @@ data.
 
 This can most likely be achieved via SELinux labeling and unique process contexts.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]()
diff --git a/configmap.md b/configmap.md
index 4eb65702..e4b69eaa 100644
--- a/configmap.md
+++ b/configmap.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/configmap.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -300,13 +329,6 @@ spec:
 
 In the future, we may add the ability to specify an init-container that can watch the volume contents for updates and respond to changes when they occur.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]()
diff --git a/control-plane-resilience.md b/control-plane-resilience.md
index 5dac8709..39110e3a 100644
--- a/control-plane-resilience.md
+++ b/control-plane-resilience.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -241,13 +265,6 @@ be automated and continuously tested.
 
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
diff --git a/daemon.md b/daemon.md
index 1719c327..9b66e0e1 100644
--- a/daemon.md
+++ b/daemon.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/daemon.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -206,13 +235,6 @@ restartPolicy set to Always.
 
 - Should work similarly to [Deployment](http://issues.k8s.io/1743).
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]()
diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
index cc6540cd..60ec4787 100644
--- a/downward_api_resources_limits_requests.md
+++ b/downward_api_resources_limits_requests.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -622,13 +646,6 @@ and export GOMAXPROCS=$(CPU_LIMIT)"
 
 ```
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()
diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md
index ed65b6fc..8f184af9 100644
--- a/enhance-pluggable-policy.md
+++ b/enhance-pluggable-policy.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/enhance-pluggable-policy.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -429,13 +458,6 @@ type LocalResourceAccessReviewResponse struct {
 
 ```
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()
diff --git a/event_compression.md b/event_compression.md
index 43f6d52b..c4dfc154 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/event_compression.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -42,7 +71,7 @@ entries.
 ## Design
 
 Instead of a single Timestamp, each event object
-[contains](http://releases.k8s.io/v1.4.0-alpha.0/pkg/api/types.go#L1111) the following
+[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
 fields:
 * `FirstTimestamp unversioned.Time`
   * The date/time of the first occurrence of the event.
@@ -103,7 +132,7 @@ of time and generates tons of unique events, the previously generated events
 cache will not grow unchecked in memory. Instead, after 4096 unique events are
 generated, the oldest events are evicted from the cache.
 * When an event is generated, the previously generated events cache is checked
-(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/v1.4.0-alpha.0/pkg/client/record/event.go)).
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
 * If the key for the new event matches the key for a previously generated event
 (meaning all of the above fields match between the new event and some previously
 generated event), then the event is considered to be a duplicate and
@@ -169,13 +198,6 @@ single event to optimize etcd storage.
 
 instead of map.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]()
diff --git a/expansion.md b/expansion.md
index 7c0fff2f..cf44baed 100644
--- a/expansion.md
+++ b/expansion.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/expansion.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -417,13 +446,6 @@ spec:
 
 ```
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]()
diff --git a/extending-api.md b/extending-api.md
index ef6cf902..aa1821c8 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/extending-api.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -201,13 +230,6 @@ ${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/$
 
 ```
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
diff --git a/federated-services.md b/federated-services.md
index 2d98bdc6..7e9933e3 100644
--- a/federated-services.md
+++ b/federated-services.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -514,13 +538,6 @@ also have the anti-entropy mechanism for reconciling ubernetes "desired
 desired" state against kubernetes "actual desired" state.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()
diff --git a/federation-phase-1.md b/federation-phase-1.md
index b4a90b6f..53087fd8 100644
--- a/federation-phase-1.md
+++ b/federation-phase-1.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -407,13 +431,6 @@ document
 
 Please refer to that document for details.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index b1638e7c..d8d2280f 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+
+PLEASE NOTE: This document applies to the HEAD of the source tree
+
+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/horizontal-pod-autoscaler.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -15,7 +44,7 @@ is responsible for dynamically controlling the number of
 replicas of some collection (e.g. the pods of a ReplicationController) to meet
 some objective(s), for example a target per-pod CPU utilization.
 
-This design supersedes [autoscaling.md](http://releases.k8s.io/v1.4.0-alpha.0/docs/proposals/autoscaling.md).
+This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).
 
 ## Overview
@@ -263,13 +292,6 @@ the same node, kill one of them. Discussed in issue [#4301](https://github.com/k
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
diff --git a/identifiers.md b/identifiers.md
index bb4ac2b1..175d25c9 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/identifiers.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -113,13 +142,6 @@ unique across time.
 1. This may correspond to Docker's container ID.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
diff --git a/indexed-job.md b/indexed-job.md
index a6f4eb37..6c41bd64 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/indexed-job.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -900,13 +929,6 @@ This differs from PetSet in that PetSet uses names and not
 indexes. PetSet is intended to support ones to tens of things.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()
diff --git a/metadata-policy.md b/metadata-policy.md
index d51a4151..da7d5425 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/metadata-policy.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -76,11 +105,11 @@ type PolicyAction struct {
 type MetadataPolicy struct {
     unversioned.TypeMeta `json:",inline"`
     // Standard object's metadata.
-    // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#metadata
+    // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
     ObjectMeta `json:"metadata,omitempty"`
 
     // Spec defines the metadata policy that should be enforced.
-    // http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#spec-and-status
+    // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
     Spec MetadataPolicySpec `json:"spec,omitempty"`
 }
@@ -88,11 +117,11 @@ type MetadataPolicy struct {
 type MetadataPolicyList struct {
     unversioned.TypeMeta `json:",inline"`
     // Standard list metadata.
-    // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/devel/api-conventions.md#types-kinds
+    // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
     unversioned.ListMeta `json:"metadata,omitempty"`
 
     // Items is a list of MetadataPolicy objects.
-    // More info: http://releases.k8s.io/v1.4.0-alpha.0/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+    // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
     Items []MetadataPolicy `json:"items"`
 }
 ```
@@ -137,13 +166,6 @@ API for matching "claims" to "service classes"; matching a pod to a scheduler
 would be one use for such an API.
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]()
diff --git a/namespaces.md b/namespaces.md
index dd1e27bd..d63015bc 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/namespaces.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -370,13 +399,6 @@ At this point, all content associated with that Namespace, and
 the Namespace itself are gone.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
diff --git a/networking.md b/networking.md
index 2e9f5de7..ca2527e5 100644
--- a/networking.md
+++ b/networking.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/networking.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -187,13 +216,6 @@ by major cloud providers (e.g., AWS EC2, GCE) yet. We'd
 happily take pull requests from people running Kubernetes on bare metal,
 though. :-)
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]()
diff --git a/nodeaffinity.md b/nodeaffinity.md
index 7f27bcc9..8c999fec 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/nodeaffinity.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -245,13 +274,6 @@ The main related issue is #341. Issue #367 is also related. Those issues
 reference other related issues.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]()
diff --git a/persistent-storage.md b/persistent-storage.md
index 417bb4f8..00eb2fef 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/persistent-storage.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -247,13 +276,6 @@ Admins can script the recycling of released volumes.
 Future dynamic provisioners will understand how a volume should be recycled.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
diff --git a/podaffinity.md b/podaffinity.md
index db615c3d..2c57ed90 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -668,13 +697,6 @@ This proposal is to satisfy #14816.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]()
diff --git a/principles.md b/principles.md
index 77ddaf9d..297ae923 100644
--- a/principles.md
+++ b/principles.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/principles.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -101,13 +130,6 @@ TODO
 * [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
diff --git a/resource-qos.md b/resource-qos.md
index d0c709bd..e5088c85 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -217,13 +241,6 @@ A strict hierarchy of user-specified numerical priorities is not desirable becau
 1. Achieved behavior would be emergent based on how users assigned priorities to
 their pods. No particular SLO could be delivered by the system, and usage would
 be subject to gaming if not restricted administratively
 2. Changes to desired priority bands would require changes to all user pod
 configurations.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]()
diff --git a/resources.md b/resources.md
index 131b67cb..2a75c987 100644
--- a/resources.md
+++ b/resources.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/resources.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
 
 **Note: this is a design doc, which describes features that have not been
@@ -369,13 +398,6 @@ second.
 * Compressible? yes
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()
diff --git a/scheduler_extender.md b/scheduler_extender.md
index bf102c8b..e8ad718f 100644
--- a/scheduler_extender.md
+++ b/scheduler_extender.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/scheduler_extender.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -105,13 +134,6 @@ its priority functions) and used for final host selection.
 Multiple extenders can be configured in the scheduler policy.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]()
diff --git a/seccomp.md b/seccomp.md
index da33390b..4a28d705 100644
--- a/seccomp.md
+++ b/seccomp.md
@@ -1,5 +1,29 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -266,13 +290,6 @@ spec:
       emptyDir: {}
 ```
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
diff --git a/secrets.md b/secrets.md
index 28504cf7..b1b83106 100644
--- a/secrets.md
+++ b/secrets.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/secrets.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -627,13 +656,6 @@ on their filesystems:
 /etc/secret-volume/password
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]()
diff --git a/security.md b/security.md
index 3c321334..06bb3979 100644
--- a/security.md
+++ b/security.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/security.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -218,13 +247,6 @@ scheduler may need read access to user or project-container
 information to determine preferential location (underspecified at this time).
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
diff --git a/security_context.md b/security_context.md
index 077fade1..2b7d8b96 100644
--- a/security_context.md
+++ b/security_context.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/security_context.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -192,13 +221,6 @@ denied by default. In the future the admission plugin will base this decision
 upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
diff --git a/selector-generation.md b/selector-generation.md
index 13f429b9..cd91615b 100644
--- a/selector-generation.md
+++ b/selector-generation.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/selector-generation.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -180,13 +209,6 @@ We probably want as much as possible the same behavior for
 Job and ReplicationController.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
diff --git a/service_accounts.md b/service_accounts.md
index c9afb699..2affa10e 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/service_accounts.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -210,13 +239,6 @@ serviceAccounts. In that case, the user may want to GET serviceAccounts to see
 what has been created.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]()
diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index 4ce67569..eb528580 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/simple-rolling-update.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -131,13 +160,6 @@ rollout with the old version
 * Goto Rollout with `foo` and `foo-next` trading places.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]()
diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md
index dd146944..dfa3f213 100644
--- a/taint-toleration-dedicated.md
+++ b/taint-toleration-dedicated.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/taint-toleration-dedicated.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -288,13 +317,6 @@ Omega project at Google.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]()
diff --git a/versioning.md b/versioning.md
index 2d46c46d..f6b8efaf 100644
--- a/versioning.md
+++ b/versioning.md
@@ -1,5 +1,34 @@
+
+
+WARNING
+WARNING
+WARNING
+WARNING
+WARNING
+
+

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+
+
+The latest release of this document can be found
+[here](http://releases.k8s.io/release-1.2/docs/design/versioning.md).
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+
@@ -174,13 +203,6 @@ There is a separate question of how to track the capabilities
 of a kubelet to facilitate rolling upgrades. That is not addressed here.
 
-
-
-
-
-
-
-
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
--
cgit v1.2.3


From 0c7bb231ed88f6126458ed71ef1acb5ed68e8cbe Mon Sep 17 00:00:00 2001
From: David McMahon
Date: Thu, 2 Jun 2016 17:25:58 -0700
Subject: Remove "All rights reserved" from all the headers.

---
 clustering/Dockerfile | 2 +-
 clustering/Makefile   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/clustering/Dockerfile b/clustering/Dockerfile
index 60d258c4..e7abc753 100644
--- a/clustering/Dockerfile
+++ b/clustering/Dockerfile
@@ -1,4 +1,4 @@
-# Copyright 2016 The Kubernetes Authors All rights reserved.
+# Copyright 2016 The Kubernetes Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
diff --git a/clustering/Makefile b/clustering/Makefile
index b1743cf4..945a5f0b 100644
--- a/clustering/Makefile
+++ b/clustering/Makefile
@@ -1,4 +1,4 @@
-# Copyright 2016 The Kubernetes Authors All rights reserved.
+# Copyright 2016 The Kubernetes Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
--
cgit v1.2.3


From 179ec0d0c64cd1039ed5bec2210c9c0087865125 Mon Sep 17 00:00:00 2001
From: xiangpengzhao
Date: Tue, 28 Jun 2016 04:08:11 -0400
Subject: Add link to issues referenced in nodeaffinity.md and podaffinity.md

---
 nodeaffinity.md | 11 ++++++-----
 podaffinity.md  | 47 ++++++++++++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/nodeaffinity.md b/nodeaffinity.md
index 3d9266ae..3c29d6fe 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -211,7 +211,7 @@ Users should not start using `NodeAffinity` until the full implementation has
 been in Kubelet and the master for enough binary versions that we feel
 comfortable that we will not need to roll back either Kubelet or master to a
 version that does not support them. Longer-term we will use a programatic
-approach to enforcing this (#4855).
+approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
 
 ## Implementation plan
 
@@ -234,7 +234,7 @@ longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this co
 
 We assume Kubelet publishes labels describing the node's membership in all of
 the relevant scheduling domains (e.g. node name, rack name, availability zone
-name, etc.). See #9044.
+name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
 
 ## Extensibility
 
@@ -268,10 +268,11 @@ Are there any other fields we should convert from `map[string]string` to
 
 ## Related issues
 
-The review for this proposal is in #18261.
+The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261).
 
-The main related issue is #341. Issue #367 is also related. Those issues
-reference other related issues.
+The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341).
+Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related.
+Those issues reference other related issues.
diff --git a/podaffinity.md b/podaffinity.md
index fcc5fc87..d72a6db8 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -430,8 +430,8 @@ foreach node A of {N}
 
 In this section we discuss three issues with RequiredDuringScheduling
 anti-affinity: Denial of Service (DoS), co-existing with daemons, and
-determining which pod(s) to kill. See issue #18265 for additional discussion of
-these topics.
+determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265)
+for additional discussion of these topics.
 
 ### Denial of Service
 
@@ -501,8 +501,9 @@ A cluster administrator may wish to allow pods that express anti-affinity
 against all pods, to nonetheless co-exist with system daemon pods, such as those
 run by DaemonSet. In principle, we would like the specification for
 RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
-more other pods (see #18263 for a more detailed explanation of the toleration
-concept). There are at least two ways to accomplish this:
+more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263)
+for a more detailed explanation of the toleration concept).
+There are at least two ways to accomplish this:
 
 * Scheduler special-cases the namespace(s) where daemons live,
 in the sense that it ignores pods in those namespaces when it is
@@ -562,12 +563,12 @@ that trigger killing of P? More generally, how long should the system wait
 before declaring that P's affinity is violated? (Of course affinity is expressed
 in terms of label selectors, not for a specific pod, but the scenario is easier
 to describe using a concrete pod.) This is closely related to the concept of
-forgiveness (see issue #1574). In theory we could make this time duration be
-configurable by the user on a per-pod basis, but for the first version of this
-feature we will make it a configurable property of whichever component does the
-killing and that applies across all pods using the feature. Making it
-configurable by the user would require a nontrivial change to the API syntax
-(since the field would only apply to
+forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)).
+In theory we could make this time duration be configurable by the user on a per-pod
+basis, but for the first version of this feature we will make it a configurable
+property of whichever component does the killing and that applies across all pods
+using the feature. Making it configurable by the user would require a nontrivial
+change to the API syntax (since the field would only apply to
 RequiredDuringSchedulingRequiredDuringExecution affinity).
 
 ## Implementation plan
 
@@ -602,7 +603,7 @@ Do so in a way that addresses the "determining which pod(s) to kill" issue.
 
 We assume Kubelet publishes labels describing the node's membership in all of
 the relevant scheduling domains (e.g. node name, rack name, availability zone
-name, etc.). See #9044.
+name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
 
 ## Backward compatibility
 
@@ -612,7 +613,7 @@ Users should not start using `Affinity` until the full implementation has been
 in Kubelet and the master for enough binary versions that we feel comfortable
 that we will not need to roll back either Kubelet or master to a version that
 does not support them. Longer-term we will use a programmatic approach to
-enforcing this (#4855).
+enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
 
 ## Extensibility
 
@@ -673,23 +674,27 @@ pod to name the data rather than the node.
 
 ## Related issues
 
-The review for this proposal is in #18265.
+The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265).
 
 The topic of affinity/anti-affinity has generated a lot of discussion. The main
-issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
-#1965, and #2906 all have additional discussion and use cases.
+issue is [#367](https://github.com/kubernetes/kubernetes/issues/367)
+but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485),
+[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369),
+[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707),
+[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341),
+[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906)
+all have additional discussion and use cases.
 
 As the examples in this document have demonstrated, topological affinity is very
 useful in clusters that are spread across availability zones, e.g. to co-locate
 pods of a service in the same zone to avoid a wide-area network hop, or to
-spread pods across zones for failure tolerance. #17059, #13056, #13063, and
-#4235 are relevant.
+spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059),
+[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063),
+and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant.
 
-Issue #15675 describes connection affinity, which is vaguely related.
+Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related.
 
-This proposal is to satisfy #14816.
+This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816).
## Related work -- cgit v1.2.3 From 11aa9b27bbc2763f836e528b660d8efbf9df4b52 Mon Sep 17 00:00:00 2001 From: xiangpengzhao Date: Fri, 1 Jul 2016 21:45:30 -0400 Subject: Add issue links to taint-toleration-dedicated.md --- taint-toleration-dedicated.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index e5a569c9..e896519f 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -45,7 +45,8 @@ nodes with a particular piece of hardware could be reserved for pods that require that hardware, or a node could be marked as unschedulable when it is being drained before shutdown, or a node could trigger evictions when it experiences hardware or software problems or abnormal node configurations; see -issues #17190 and #3885 for more discussion. +issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and +[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion. ## Taints, tolerations, and dedicated nodes @@ -274,7 +275,8 @@ taints and tolerations. Obviously this makes it impossible to securely enforce rules like dedicated nodes. We need some mechanism that prevents regular users from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them from mutating any fields of `NodeSpec`) and from mutating the `Tolerations` -field of their pods. #17549 is relevant. +field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549) +is relevant. Another security vulnerability arises if nodes are added to the cluster before receiving their taint. Thus we need to ensure that a new node does not become @@ -303,14 +305,15 @@ Users should not start using taints and tolerations until the full implementation has been in Kubelet and the master for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet or master to a version that does not support them. 
Longer-term we will use a -progamatic approach to enforcing this (#4855). +progamatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). ## Related issues -This proposal is based on the discussion in #17190. There are a number of other -related issues, all of which are linked to from #17190. +This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190). +There are a number of other related issues, all of which are linked to from +[#17190](https://github.com/kubernetes/kubernetes/issues/17190). -The relationship between taints and node drains is discussed in #1574. +The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574). The concepts of taints and tolerations were originally developed as part of the Omega project at Google. -- cgit v1.2.3 From 9813b6e476becc5bebb82bfc5be4fbfa56b31cdd Mon Sep 17 00:00:00 2001 From: Quinton Hoole Date: Wed, 6 Jul 2016 15:42:56 -0700 Subject: Deprecate the term "Ubernetes" in favor of "Cluster Federation" and "Multi-AZ Clusters" --- control-plane-resilience.md | 4 ++-- federated-services.md | 55 ++++++++++++++++++++++++--------------------- federation-phase-1.md | 22 +++++++++--------- podaffinity.md | 2 +- 4 files changed, 43 insertions(+), 40 deletions(-) diff --git a/control-plane-resilience.md b/control-plane-resilience.md index b3e76c40..9e7eecae 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -32,7 +32,7 @@ Documentation for other releases can be found at -# Kubernetes/Ubernetes Control Plane Resilience +# Kubernetes and Cluster Federation Control Plane Resilience ## Long Term Design and Current Status @@ -44,7 +44,7 @@ Documentation for other releases can be found at Some amount of confusion exists around how we currently, and in future want to ensure resilience of the Kubernetes (and by implication -Ubernetes) control plane. 
This document is an attempt to capture that +Kubernetes Cluster Federation) control plane. This document is an attempt to capture that definitively. It covers areas including self-healing, high availability, bootstrapping and recovery. Most of the information in this document already exists in the form of github comments, diff --git a/federated-services.md b/federated-services.md index 5572b12f..124ff30a 100644 --- a/federated-services.md +++ b/federated-services.md @@ -32,7 +32,7 @@ Documentation for other releases can be found at -# Kubernetes Cluster Federation (a.k.a. "Ubernetes") +# Kubernetes Cluster Federation (previously nicknamed "Ubernetes") ## Cross-cluster Load Balancing and Service Discovery @@ -106,7 +106,7 @@ Documentation for other releases can be found at A Kubernetes application configuration (e.g. for a Pod, Replication Controller, Service etc) should be able to be successfully deployed -into any Kubernetes Cluster or Ubernetes Federation of Clusters, +into any Kubernetes Cluster or Federation of Clusters, without modification. More specifically, a typical configuration should work correctly (although possibly not optimally) across any of the following environments: @@ -154,7 +154,7 @@ environments. More specifically, for example: ## Component Cloud Services -Ubernetes cross-cluster load balancing is built on top of the following: +Cross-cluster Federated load balancing is built on top of the following: 1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) provide single, static global IP addresses which load balance and @@ -194,10 +194,11 @@ Ubernetes cross-cluster load balancing is built on top of the following: A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy. 
-## Ubernetes API +## Cluster Federation API -The Ubernetes API for load balancing should be compatible with the equivalent -Kubernetes API, to ease porting of clients between Ubernetes and Kubernetes. +The Cluster Federation API for load balancing should be compatible with the equivalent +Kubernetes API, to ease porting of clients between Kubernetes and +federations of Kubernetes clusters. Further details below. ## Common Client Behavior @@ -250,13 +251,13 @@ multiple) fixed server IP(s). Nothing else matters. ### General Control Plane Architecture -Each cluster hosts one or more Ubernetes master components (Ubernetes API +Each cluster hosts one or more Cluster Federation master components (Federation API servers, controller managers with leader election, and etcd quorum members. This is documented in more detail in a separate design doc: -[Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). +[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). In the description below, assume that 'n' clusters, named 'cluster-1'... -'cluster-n' have been registered against an Ubernetes Federation "federation-1", +'cluster-n' have been registered against a Cluster Federation "federation-1", each with their own set of Kubernetes API endpoints,so, "[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), [http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) @@ -264,13 +265,13 @@ each with their own set of Kubernetes API endpoints,so, ### Federated Services -Ubernetes Services are pretty straight-forward. They're comprised of multiple +Federated Services are pretty straight-forward. They're comprised of multiple equivalent underlying Kubernetes Services, each with their own external endpoint, and a load balancing mechanism across them. Let's work through how exactly that works in practice. 
-Our user creates the following Ubernetes Service (against an Ubernetes API -endpoint): +Our user creates the following Federated Service (against a Federation +API endpoint): $ kubectl create -f my-service.yaml --context="federation-1" @@ -296,7 +297,7 @@ where service.yaml contains the following: run: my-service type: LoadBalancer -Ubernetes in turn creates one equivalent service (identical config to the above) +The Cluster Federation control system in turn creates one equivalent service (identical config to the above) in each of the underlying Kubernetes clusters, each of which results in something like this: @@ -338,7 +339,7 @@ something like this: Similar services are created in `cluster-2` and `cluster-3`, each of which are allocated their own `spec.clusterIP`, and `status.loadBalancer.ingress.ip`. -In Ubernetes `federation-1`, the resulting federated service looks as follows: +In the Cluster Federation `federation-1`, the resulting federated service looks as follows: $ kubectl get -o yaml --context="federation-1" service my-service @@ -382,7 +383,7 @@ Note that the federated service: 1. has a federation-wide load balancer hostname In addition to the set of underlying Kubernetes services (one per cluster) -described above, Ubernetes has also created a DNS name (e.g. on +described above, the Cluster Federation control system has also created a DNS name (e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or [AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) which provides load balancing across all of those services. For example, in a @@ -397,7 +398,8 @@ Each of the above IP addresses (which are just the external load balancer ingress IP's of each cluster service) is of course load balanced across the pods comprising the service in each cluster. -In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes +In a more sophisticated configuration (e.g. 
on GCE or GKE), the Cluster +Federation control system automatically creates a [GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) which exposes a single, globally load-balanced IP: @@ -405,7 +407,7 @@ which exposes a single, globally load-balanced IP: $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 -Optionally, Ubernetes also configures the local DNS servers (SkyDNS) +Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) in each Kubernetes cluster to preferentially return the local clusterIP for the service in that cluster, with other clusters' external service IP's (or a global load-balanced IP) also configured @@ -416,7 +418,7 @@ for failover purposes: my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 -If Ubernetes Global Service Health Checking is enabled, multiple service health +If Cluster Federation Global Service Health Checking is enabled, multiple service health checkers running across the federated clusters collaborate to monitor the health of the service endpoints, and automatically remove unhealthy endpoints from the DNS record (e.g. 
a majority quorum is required to vote a service endpoint @@ -460,7 +462,7 @@ where `my-service-rc.yaml` contains the following: - containerPort: 2380 protocol: TCP -Ubernetes in turn creates one equivalent replication controller +The Cluster Federation control system in turn creates one equivalent replication controller (identical config to the above, except for the replica count) in each of the underlying Kubernetes clusters, each of which results in something like this: @@ -510,8 +512,8 @@ entire cluster failures, various approaches are possible, including: replicas in its cluster in response to the additional traffic diverted from the failed cluster. This saves resources and is relatively simple, but there is some delay in the autoscaling. -3. **federated replica migration**, where the Ubernetes Federation - Control Plane detects the cluster failure and automatically +3. **federated replica migration**, where the Cluster Federation + control system detects the cluster failure and automatically increases the replica count in the remainaing clusters to make up for the lost replicas in the failed cluster. This does not seem to offer any benefits relative to pod autoscaling above, and is @@ -523,23 +525,24 @@ entire cluster failures, various approaches are possible, including: The implementation approach and architecture is very similar to Kubernetes, so if you're familiar with how Kubernetes works, none of what follows will be surprising. One additional design driver not present in Kubernetes is that -Ubernetes aims to be resilient to individual cluster and availability zone +the Cluster Federation control system aims to be resilient to individual cluster and availability zone failures. So the control plane spans multiple clusters. More specifically: -+ Ubernetes runs it's own distinct set of API servers (typically one ++ Cluster Federation runs it's own distinct set of API servers (typically one or more per underlying Kubernetes cluster). 
These are completely distinct from the Kubernetes API servers for each of the underlying clusters. -+ Ubernetes runs it's own distinct quorum-based metadata store (etcd, ++ Cluster Federation runs it's own distinct quorum-based metadata store (etcd, by default). Approximately 1 quorum member runs in each underlying cluster ("approximately" because we aim for an odd number of quorum members, and typically don't want more than 5 quorum members, even if we have a larger number of federated clusters, so 2 clusters->3 quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc). -Cluster Controllers in Ubernetes watch against the Ubernetes API server/etcd +Cluster Controllers in the Federation control system watch against the +Federation API server/etcd state, and apply changes to the underlying kubernetes clusters accordingly. They -also have the anti-entropy mechanism for reconciling ubernetes "desired desired" +also have the anti-entropy mechanism for reconciling Cluster Federation "desired desired" state against kubernetes "actual desired" state. diff --git a/federation-phase-1.md b/federation-phase-1.md index ba7386e7..d93046e6 100644 --- a/federation-phase-1.md +++ b/federation-phase-1.md @@ -320,8 +320,8 @@ Below is the state transition diagram. ## Replication Controller -A global workload submitted to control plane is represented as an -Ubernetes replication controller. When a replication controller +A global workload submitted to control plane is represented as a + replication controller in the Cluster Federation control plane. When a replication controller is submitted to control plane, clients need a way to express its requirements or preferences on clusters. Depending on different use cases it may be complex. For example: @@ -377,11 +377,11 @@ some implicit scheduling restrictions. For example it defines “nodeSelector” which can only be satisfied on some particular clusters. How to handle this will be addressed after phase one. 
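The quorum-sizing rule quoted above (roughly one etcd member per cluster, rounded to an odd count and capped at five) can be stated programmatically. This sketch reproduces the table given in the text; its behavior for cluster counts outside that table (e.g. a single cluster) is an assumption:

```python
def quorum_members(num_clusters: int) -> int:
    """Odd number of quorum members, at least 3, at most 5."""
    if num_clusters < 1:
        raise ValueError("need at least one cluster")
    odd = num_clusters if num_clusters % 2 == 1 else num_clusters - 1
    return min(5, max(3, odd))

# Matches the table in the text:
# 2 clusters -> 3 members, 3 -> 3, 4 -> 3, 5 -> 5, 6 -> 5, 7 -> 5
```

Capping at five keeps write latency bounded as federations grow, while the odd count preserves a clear majority for leader election.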
-## Ubernetes Services +## Federated Services -The Service API object exposed by Ubernetes is similar to service +The Service API object exposed by the Cluster Federation is similar to service objects on Kubernetes. It defines the access to a group of pods. The -Ubernetes service controller will create corresponding Kubernetes +federation service controller will create corresponding Kubernetes service objects on underlying clusters. These are detailed in a separate design document: [Federated Services](federated-services.md). @@ -389,13 +389,13 @@ separate design document: [Federated Services](federated-services.md). In phase one we only support scheduling replication controllers. Pod scheduling will be supported in later phase. This is primarily in -order to keep the Ubernetes API compatible with the Kubernetes API. +order to keep the Cluster Federation API compatible with the Kubernetes API. ## ACTIVITY FLOWS ## Scheduling -The below diagram shows how workloads are scheduled on the Ubernetes control\ +The below diagram shows how workloads are scheduled on the Cluster Federation control\ plane: 1. A replication controller is created by the client. @@ -419,20 +419,20 @@ distribution policies. The scheduling rule is basically: There is a potential race condition here. Say at time _T1_ the control plane learns there are _m_ available resources in a K8S cluster. As the cluster is working independently it still accepts workload -requests from other K8S clients or even another Ubernetes control -plane. The Ubernetes scheduling decision is based on this data of +requests from other K8S clients or even another Cluster Federation control +plane. The Cluster Federation scheduling decision is based on this data of available resources. However when the actual RC creation happens to the cluster at time _T2_, the cluster may don’t have enough resources at that time. We will address this problem in later phases with some proposed solutions like resource reservation mechanisms. 
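The T1/T2 race described above is easy to reproduce with a toy model: a placement decision made against a stale resource snapshot can fail once the cluster has independently admitted other work in the meantime. All names below are hypothetical:

```python
class Cluster:
    """Toy model of a member cluster that admits work independently."""

    def __init__(self, capacity: int):
        self.free = capacity

    def create_rc(self, needed: int) -> None:
        # Admission is checked against resources free *now* (T2),
        # not against any earlier snapshot.
        if needed > self.free:
            raise RuntimeError("insufficient resources at creation time")
        self.free -= needed

cluster = Cluster(capacity=10)
snapshot_t1 = cluster.free   # T1: the federation control plane's view
cluster.create_rc(8)         # meanwhile, another client consumes capacity
assert snapshot_t1 >= 5      # the placement looked safe against T1 data...
try:
    cluster.create_rc(5)     # ...but fails at T2
except RuntimeError as e:
    print(e)                 # insufficient resources at creation time
```

This is why the text points toward resource reservation mechanisms: only a reservation made at T1 and honored by the cluster can close the window between observation and creation.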
-![Ubernetes Scheduling](ubernetes-scheduling.png) +![Federated Scheduling](ubernetes-scheduling.png) ## Service Discovery This part has been included in the section “Federated Service” of document -“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md))”. +“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md))”. Please refer to that document for details. diff --git a/podaffinity.md b/podaffinity.md index d72a6db8..2bba0c11 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -347,7 +347,7 @@ scheduler to not put more than one pod from S in the same zone, and thus by definition it will not put more than one pod from S on the same node, assuming each node is in one zone. This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in -[Ubernetes](../../docs/proposals/federation.md) clusters.) +[Cluster Federation](../../docs/proposals/federation.md) clusters.) * **Don't co-locate pods of this service with pods from service "evilService"**: `{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}` -- cgit v1.2.3 From fa027eea67872811a0715c7c9c9db31b3b55ad62 Mon Sep 17 00:00:00 2001 From: joe2far Date: Wed, 13 Jul 2016 15:06:24 +0100 Subject: Fixed several typos --- control-plane-resilience.md | 2 +- daemon.md | 2 +- federated-services.md | 2 +- indexed-job.md | 2 +- nodeaffinity.md | 2 +- security.md | 2 +- taint-toleration-dedicated.md | 4 ++-- 7 files changed, 8 insertions(+), 8 deletions(-) diff --git a/control-plane-resilience.md b/control-plane-resilience.md index 9e7eecae..eb5f800e 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -179,7 +179,7 @@ well-bounded time period. Multiple stateless, self-hosted, self-healing API servers behind a HA load balancer, built out by the default "kube-up" automation on GCE, AWS and basic bare metal (BBM). 
Note that the single-host approach of -hving etcd listen only on localhost to ensure that onyl API server can +having etcd listen only on localhost to ensure that only API server can connect to it will no longer work, so alternative security will be needed in the regard (either using firewall rules, SSL certs, or something else). All necessary flags are currently supported to enable diff --git a/daemon.md b/daemon.md index be78a035..a6ce5aef 100644 --- a/daemon.md +++ b/daemon.md @@ -174,7 +174,7 @@ upgradable, and more generally could not be managed through the API server interface. A third alternative is to generalize the Replication Controller. We would do -something like: if you set the `replicas` field of the ReplicationConrollerSpec +something like: if you set the `replicas` field of the ReplicationControllerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number diff --git a/federated-services.md b/federated-services.md index 124ff30a..46958146 100644 --- a/federated-services.md +++ b/federated-services.md @@ -505,7 +505,7 @@ depend on what scheduling policy is in force. In the above example, the scheduler created an equal number of replicas (2) in each of the three underlying clusters, to make up the total of 6 replicas required. To handle entire cluster failures, various approaches are possible, including: -1. **simple overprovisioing**, such that sufficient replicas remain even if a +1. **simple overprovisioning**, such that sufficient replicas remain even if a cluster fails. This wastes some resources, but is simple and reliable. 2. 
**pod autoscaling**, where the replication controller in each cluster automatically and autonomously increases the number of diff --git a/indexed-job.md b/indexed-job.md index 63dafc7b..799f6b04 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -522,7 +522,7 @@ The index-only approach: - Requires that the user keep the *per completion parameters* in a separate storage, such as a configData or networked storage. - Makes no changes to the JobSpec. -- Drawback: while in separate storage, they could be mutatated, which would have +- Drawback: while in separate storage, they could be mutated, which would have unexpected effects. - Drawback: Logic for using index to lookup parameters needs to be in the Pod. - Drawback: CLIs and UIs are limited to using the "index" as the identity of a diff --git a/nodeaffinity.md b/nodeaffinity.md index 3c29d6fe..77bc6e91 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -62,7 +62,7 @@ scheduling requirements. rather than replacing `map[string]string`, due to backward compatibility requirements.) -The affiniy specifications described above allow a pod to request various +The affinity specifications described above allow a pod to request various properties that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z." ([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes diff --git a/security.md b/security.md index 0ed8f2f0..650a1b70 100644 --- a/security.md +++ b/security.md @@ -204,7 +204,7 @@ arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history -of the custer. +of the cluster. 
As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index e896519f..c7126921 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -201,7 +201,7 @@ to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union of the taints specified by various sources. For now, the only source is the `NodeSpec` itself, but in the future one could imagine a node inheriting taints from pods (if we were to allow taints to be attached to pods), from -the node's startup coniguration, etc. The scheduler should look at the `Taints` +the node's startup configuration, etc. The scheduler should look at the `Taints` in `NodeStatus`, not in `NodeSpec`. Taints and tolerations are not scoped to namespace. @@ -305,7 +305,7 @@ Users should not start using taints and tolerations until the full implementation has been in Kubelet and the master for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet or master to a version that does not support them. Longer-term we will use a -progamatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). +programatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). 
## Related issues -- cgit v1.2.3 From 93a84b81e6956ed5a05024f55b5cd1852c32f332 Mon Sep 17 00:00:00 2001 From: joe2far Date: Fri, 15 Jul 2016 10:44:58 +0100 Subject: Fix broken warning image link in docs --- README.md | 10 +++++----- access.md | 10 +++++----- admission_control.md | 10 +++++----- admission_control_limit_range.md | 10 +++++----- admission_control_resource_quota.md | 10 +++++----- architecture.md | 10 +++++----- aws_under_the_hood.md | 10 +++++----- clustering.md | 10 +++++----- clustering/README.md | 10 +++++----- command_execution_port_forwarding.md | 10 +++++----- configmap.md | 10 +++++----- control-plane-resilience.md | 10 +++++----- daemon.md | 10 +++++----- downward_api_resources_limits_requests.md | 10 +++++----- enhance-pluggable-policy.md | 10 +++++----- event_compression.md | 10 +++++----- expansion.md | 10 +++++----- extending-api.md | 10 +++++----- federated-services.md | 10 +++++----- federation-phase-1.md | 10 +++++----- horizontal-pod-autoscaler.md | 10 +++++----- identifiers.md | 10 +++++----- indexed-job.md | 10 +++++----- metadata-policy.md | 10 +++++----- namespaces.md | 10 +++++----- networking.md | 10 +++++----- nodeaffinity.md | 10 +++++----- persistent-storage.md | 10 +++++----- podaffinity.md | 10 +++++----- principles.md | 10 +++++----- resource-qos.md | 10 +++++----- resources.md | 10 +++++----- scheduler_extender.md | 10 +++++----- seccomp.md | 10 +++++----- secrets.md | 10 +++++----- security.md | 10 +++++----- security_context.md | 10 +++++----- selector-generation.md | 10 +++++----- service_accounts.md | 10 +++++----- simple-rolling-update.md | 10 +++++----- taint-toleration-dedicated.md | 10 +++++----- versioning.md | 10 +++++----- 42 files changed, 210 insertions(+), 210 deletions(-) diff --git a/README.md b/README.md index 834534a3..2aa70c61 100644 --- a/README.md +++ b/README.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/access.md b/access.md index a19e082e..7377707f 100644 --- a/access.md +++ b/access.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/admission_control.md b/admission_control.md index ae842122..32a0907e 100644 --- a/admission_control.md +++ b/admission_control.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index e8afaa78..261b7007 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 076fe588..9ec884e0 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/architecture.md b/architecture.md index 94a14067..a78c6d45 100644 --- a/architecture.md +++ b/architecture.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 12d31701..aedab0e1 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/clustering.md b/clustering.md index 5ca676c4..904b5d44 100644 --- a/clustering.md +++ b/clustering.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/clustering/README.md b/clustering/README.md index 1a6bb48d..3692012d 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 489f936e..7da049b0 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/configmap.md b/configmap.md index a9f80f8a..79104188 100644 --- a/configmap.md +++ b/configmap.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/control-plane-resilience.md b/control-plane-resilience.md index eb5f800e..4945fc10 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/daemon.md b/daemon.md index a6ce5aef..4c1044ad 100644 --- a/daemon.md +++ b/daemon.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 11d132bd..4c1ac66a 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index bd6a329f..59f2fa91 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/event_compression.md b/event_compression.md index 7ed46538..6e3041d8 100644 --- a/event_compression.md +++ b/event_compression.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/expansion.md b/expansion.md index 2c8b775a..b14b8d4a 100644 --- a/expansion.md +++ b/expansion.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/extending-api.md b/extending-api.md index 4c7049af..241f6174 100644 --- a/extending-api.md +++ b/extending-api.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/federated-services.md b/federated-services.md index 46958146..3f0f0b1e 100644 --- a/federated-services.md +++ b/federated-services.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/federation-phase-1.md b/federation-phase-1.md index d93046e6..78c73e34 100644 --- a/federation-phase-1.md +++ b/federation-phase-1.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index b1234940..01c4e66a 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/identifiers.md b/identifiers.md index 1966d250..1f2c2dc7 100644 --- a/identifiers.md +++ b/identifiers.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/indexed-job.md b/indexed-job.md index 799f6b04..47f454cf 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/metadata-policy.md b/metadata-policy.md index 384d5ef4..da371165 100644 --- a/metadata-policy.md +++ b/metadata-policy.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/namespaces.md b/namespaces.md index 2eff4512..bc903901 100644 --- a/namespaces.md +++ b/namespaces.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/networking.md b/networking.md index 93b243be..f55ffdc1 100644 --- a/networking.md +++ b/networking.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/nodeaffinity.md b/nodeaffinity.md index 77bc6e91..f1bfec6f 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/persistent-storage.md b/persistent-storage.md index b973d2b1..7ccbe8d4 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/podaffinity.md b/podaffinity.md index 2bba0c11..016160b6 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/principles.md b/principles.md index 3e711986..eac1e432 100644 --- a/principles.md +++ b/principles.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/resource-qos.md b/resource-qos.md index 24b966fd..e7a2951e 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/resources.md b/resources.md index d9e0bc3d..e03214aa 100644 --- a/resources.md +++ b/resources.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/scheduler_extender.md b/scheduler_extender.md index 0df529b9..b3082a8e 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/seccomp.md b/seccomp.md index 3a83093e..9741d8e4 100644 --- a/seccomp.md +++ b/seccomp.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/secrets.md b/secrets.md index 98c8f0ce..a40bd88c 100644 --- a/secrets.md +++ b/secrets.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/security.md b/security.md index 650a1b70..58be04c1 100644 --- a/security.md +++ b/security.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/security_context.md b/security_context.md index 59b152ab..1ae5be90 100644 --- a/security_context.md +++ b/security_context.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/selector-generation.md b/selector-generation.md index 9627b2e5..46ed0adc 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -2,15 +2,15 @@ -WARNING -WARNING -WARNING -WARNING -WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/service_accounts.md b/service_accounts.md
index 2f656228..32b3a5c5 100644
--- a/service_accounts.md
+++ b/service_accounts.md
@@ -2,15 +2,15 @@
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/simple-rolling-update.md b/simple-rolling-update.md
index 32a22ce8..961b7b84 100644
--- a/simple-rolling-update.md
+++ b/simple-rolling-update.md
@@ -2,15 +2,15 @@
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md
index c7126921..99ba482c 100644
--- a/taint-toleration-dedicated.md
+++ b/taint-toleration-dedicated.md
@@ -2,15 +2,15 @@
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

diff --git a/versioning.md b/versioning.md
index b8d83e93..b2693100 100644
--- a/versioning.md
+++ b/versioning.md
@@ -2,15 +2,15 @@
-WARNING
-WARNING
-WARNING
-WARNING
-WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

-- cgit v1.2.3


From e3f337040ccad3e5033d2edbf32b46d0fa53aac9 Mon Sep 17 00:00:00 2001
From: Amanpreet Singh
Date: Wed, 20 Jul 2016 18:21:56 +0530
Subject: Make a link in docs clickable

- Github flavored markdown doesn't support links inside codeblocks
---
 resource-qos.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resource-qos.md b/resource-qos.md
index 24b966fd..48b41b48 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -54,7 +54,7 @@ Borg increased utilization by about 20% when it started allowing use of such non

 ## Requests and Limits

-For each resource, containers can specify a resource request and limit, `0 <= request <= [Node Allocatable](../proposals/node-allocatable.md)` & `request <= limit <= Infinity`.
+For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](../proposals/node-allocatable.md) & `request <= limit <= Infinity`.

 If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on `requests` and not `limits`. The pods and its containers will not be allowed to exceed the specified limit.
-- cgit v1.2.3


From 2a7816da397f44ba56f638dede91ad9a07fe2f57 Mon Sep 17 00:00:00 2001
From: Tong
Date: Wed, 20 Jul 2016 22:29:10 +0800
Subject: fixes a typo in example yaml

---
 resource-qos.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/resource-qos.md b/resource-qos.md
index 24b966fd..e611306d 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -120,8 +120,8 @@
 containers:
       cpu: 100m
       memory: 100Mi
     requests:
-      cpu: 10m
-      memory: 1Gi
+      cpu: 100m
+      memory: 100Mi
 ```

 - If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
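The two hunks above quote the QoS classification rule: a pod whose requests equal its limits is **Guaranteed** (which is what the yaml typo fix restores), a pod with requests set but unequal to limits is **Burstable**, and a pod with neither set is **BestEffort**. A minimal sketch of that rule, assuming a simplified single-container view (`qos_class` is a hypothetical helper, not a Kubernetes API; the real kubelet logic works per-resource across all containers in the pod):

```python
def qos_class(requests, limits):
    """Hypothetical sketch of the QoS rule quoted above.

    `requests` and `limits` map resource names (e.g. "cpu") to
    quantities (e.g. "100m"); simplified to a single container.
    """
    if not requests and not limits:
        return "BestEffort"  # nothing requested or limited at all
    if limits and requests == limits:
        return "Guaranteed"  # requests and limits set and equal
    return "Burstable"       # set for some resources, but not equal


# The corrected example above (requests == limits) is Guaranteed:
print(qos_class({"cpu": "100m", "memory": "100Mi"},
                {"cpu": "100m", "memory": "100Mi"}))  # Guaranteed
# The pre-fix values (10m/1Gi requests vs 100m/100Mi limits) were Burstable:
print(qos_class({"cpu": "10m", "memory": "1Gi"},
                {"cpu": "100m", "memory": "100Mi"}))  # Burstable
```

Under this sketch the yaml typo being fixed was not cosmetic: unequal requests and limits silently demote the example pod from Guaranteed to Burstable.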
-- cgit v1.2.3


From 1ef459e015781e6d74e11fa864119f4bf17b2b75 Mon Sep 17 00:00:00 2001
From: AdoHe
Date: Fri, 22 Jul 2016 00:16:36 -0400
Subject: doc third party resource usage more cleanly

---
 extending-api.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/extending-api.md b/extending-api.md
index 4c7049af..f80bca74 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -132,6 +132,8 @@ versions:

 Then the API server will program in the new RESTful resource path:

 * `/apis/stable.example.com/v1/namespaces/<namespace>/crontabs/...`

+**Note: It may take a while before the RESTful resource path registration happens; please
+always check this before you create resource instances.**

 Now that this schema has been created, a user can `POST`:
-- cgit v1.2.3


From e5a36cd230620ee3337ff35d5dfbe91c40ed9766 Mon Sep 17 00:00:00 2001
From: lixiaobing10051267
Date: Fri, 22 Jul 2016 15:12:06 +0800
Subject: Give the complete and correct path to client/apiserver related

---
 daemon.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/daemon.md b/daemon.md
index 4c1044ad..3ba6c618 100644
--- a/daemon.md
+++ b/daemon.md
@@ -195,15 +195,15 @@ some discussion of this topic).

 #### Client

 - Add support for DaemonSet commands to kubectl and the client. Client code was
-added to client/unversioned. The main files in Kubectl that were modified are
-kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create,
+added to pkg/client/unversioned. The main files in Kubectl that were modified are
+pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create,
 and Update, the client simply forwards the request to the backend via the REST API.
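The extending-api.md note above advises checking that the new RESTful resource path has actually been registered before creating resource instances. A caller can guard against the delay with a simple retry loop; this is an illustrative sketch only (the `check` callable — for example an HTTP GET against `/apis/stable.example.com/v1/...` that reports success — is supplied by the caller and is not part of any Kubernetes client library):

```python
import time


def wait_for_resource_path(check, attempts=30, interval=2.0, sleep=time.sleep):
    """Retry `check` until it reports the resource path as registered.

    Returns True as soon as check() is truthy, False if it never
    succeeds within `attempts` tries. `sleep` is injectable so the
    loop can be exercised without real delays.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A caller would then only `POST` resource instances once `wait_for_resource_path(...)` returns True.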
 #### Apiserver

 - Accept, parse, validate client commands
-- REST API calls are handled in registry/daemon
+- REST API calls are handled in pkg/registry/daemonset
 - In particular, the api server will add the object to etcd
 - DaemonManager listens for updates to etcd (using Framework.informer)
 - API objects for DaemonSet were created in expapi/v1/types.go and
-- cgit v1.2.3


From f555bd63e354d9d9654bf77556c0aec1feac71e6 Mon Sep 17 00:00:00 2001
From: lixiaobing10051267
Date: Tue, 26 Jul 2016 17:22:32 +0800
Subject: HyperLink not found and can't redirected

---
 indexed-job.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/indexed-job.md b/indexed-job.md
index 47f454cf..40817a50 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -866,7 +866,7 @@ list of parameters.

 However, some popular base images do not include `/bin/bash`. For example,
 `busybox` uses a compact `/bin/sh` implementation
 that does not support array syntax.

 Kubelet does support [expanding varaibles without a
-shell](http://kubernetes.io/v1.1/docs/design/expansion.html). But it does not
+shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not
 allow for recursive substitution, which is required to extract the correct
 parameter from a list based on the completion index of the pod. The syntax
 could be extended, but doing so seems complex and will be an unfamiliar syntax
-- cgit v1.2.3


From 13213ad729334c7d77c4039b822c342fe0bf5cd5 Mon Sep 17 00:00:00 2001
From: Taariq Levack
Date: Sun, 7 Aug 2016 15:46:17 +0200
Subject: Update secrets.md

Typo, environment should be prod not pod
---
 secrets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/secrets.md b/secrets.md
index a40bd88c..457bbac9 100644
--- a/secrets.md
+++ b/secrets.md
@@ -534,7 +534,7 @@ When the container's command runs, the pieces of the key will be available in:

 The container is then free to use the secret data to establish an ssh
 connection.
-### Use-Case: Pods with pod / test credentials
+### Use-Case: Pods with prod / test credentials

 This example illustrates a pod which consumes a secret containing prod
 credentials and another pod which consumes a secret with test environment
-- cgit v1.2.3


From e5e67e803d294c56a08835a97f53385b0fbcf755 Mon Sep 17 00:00:00 2001
From: Crazykev
Date: Mon, 8 Aug 2016 14:16:45 +0800
Subject: remove duplicate words in indexed-job

Signed-off-by: Crazykev
---
 indexed-job.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/indexed-job.md b/indexed-job.md
index 47f454cf..3076291f 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -103,7 +103,7 @@ and can refer to it from other resource types, such as

 Here are several examples of *work lists*: lists of command lines that the user
 wants to run, each line its own Pod. (Note that in practice, a work list may not
 ever be written out in this form, but it exists in the mind of the Job creator,
-and it is a useful way to talk about the the intent of the user when discussing
+and it is a useful way to talk about the intent of the user when discussing
 alternatives for specifying Indexed Jobs).

 Note that we will not have the user express their requirements in work list
@@ -387,7 +387,7 @@ equal to the number of work items in the work list.

 Each pod that the job controller creates is intended to complete one work item
 from the work list. Since a pod may fail, several pods may, serially, attempt to
-complete the same index. Therefore, we call it a a *completion index* (or just
+complete the same index. Therefore, we call it a *completion index* (or just
 *index*), but not a *pod index*.

 For each completion index, in the range 1 to `.job.Spec.Completions`, the job
@@ -564,7 +564,7 @@ before. They will have a new annotation, but pod are expected to tolerate
 unfamiliar annotations.
 However, if the job controller version is reverted, to a version before this
-change, the jobs whose pod specs depend on the the new annotation will fail.
+change, the jobs whose pod specs depend on the new annotation will fail.
 This is okay for a Beta resource.

 #### Job Controller Changes
-- cgit v1.2.3


From e7a57bdb6caac971cc972a9e32c7c0cbd6bd5dec Mon Sep 17 00:00:00 2001
From: lixiaobing10051267
Date: Mon, 8 Aug 2016 15:16:48 +0800
Subject: Fix several typos in control-plane-resilience.md

---
 control-plane-resilience.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/control-plane-resilience.md b/control-plane-resilience.md
index b3e76c40..3ff6d3ca 100644
--- a/control-plane-resilience.md
+++ b/control-plane-resilience.md
@@ -179,7 +179,7 @@ well-bounded time period.

 Multiple stateless, self-hosted, self-healing API servers behind a HA
 load balancer, built out by the default "kube-up" automation on GCE,
 AWS and basic bare metal (BBM). Note that the single-host approach of
-hving etcd listen only on localhost to ensure that onyl API server can
+having etcd listen only on localhost to ensure that only API server can
 connect to it will no longer work, so alternative security will be
 needed in the regard (either using firewall rules, SSL certs, or
 something else). All necessary flags are currently supported to enable
-- cgit v1.2.3


From de3f3fa1e46e54ab1640721035ec8609cd210bea Mon Sep 17 00:00:00 2001
From: Crazykev
Date: Wed, 10 Aug 2016 15:43:27 +0800
Subject: fix a typo in nodeaffinity

Signed-off-by: Crazykev
---
 nodeaffinity.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nodeaffinity.md b/nodeaffinity.md
index f1bfec6f..1459a321 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -69,7 +69,7 @@ an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."

 some of the properties that a node might publish as labels, which affinity
 expressions can match against.)
 They do *not* allow a pod to request to schedule (or not schedule) on a node
 based on what other pods are running on the node.
-That feature is called "inter-pod topological affinity/anti-afinity" and is
+That feature is called "inter-pod topological affinity/anti-affinity" and is
 described [here](https://github.com/kubernetes/kubernetes/pull/18265).

 ## API
-- cgit v1.2.3


From d418f4b9b18b25cd03078b7d5068326861152267 Mon Sep 17 00:00:00 2001
From: Crazykev
Date: Tue, 9 Aug 2016 13:22:39 +0800
Subject: typo: correct spell

Signed-off-by: Crazykev
---
 indexed-job.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/indexed-job.md b/indexed-job.md
index 2cd35e1c..432996d4 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -865,7 +865,7 @@ list of parameters.

 However, some popular base images do not include `/bin/bash`. For example,
 `busybox` uses a compact `/bin/sh` implementation
 that does not support array syntax.

-Kubelet does support [expanding varaibles without a
+Kubelet does support [expanding variables without a
 shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not
 allow for recursive substitution, which is required to extract the correct
 parameter from a list based on the completion index of the pod. The syntax
-- cgit v1.2.3


From 7c02a6a20bc67acde3e97258cf5590e19558ffd9 Mon Sep 17 00:00:00 2001
From: Crazykev
Date: Fri, 19 Aug 2016 13:36:37 +0800
Subject: correct specifies in aws_under_the_hood

Signed-off-by: Crazykev
---
 aws_under_the_hood.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index aedab0e1..3ff1c17e 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -177,7 +177,7 @@ within AWS Certificate Manager.

 ```
 service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
 ```

-The second annotation specificies which protocol a pod speaks.
For HTTPS and +The second annotation specifies which protocol a pod speaks. For HTTPS and SSL, the ELB will expect the pod to authenticate itself over the encrypted connection. -- cgit v1.2.3 From e01f2066e347d55d13c28011f3614826f455c10d Mon Sep 17 00:00:00 2001 From: Crazykev Date: Fri, 19 Aug 2016 13:53:24 +0800 Subject: correct object in downward_api_resources_limits_requests Signed-off-by: Crazykev --- downward_api_resources_limits_requests.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 4c1ac66a..6a5f660d 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -106,7 +106,7 @@ the values. As kubelet operates on internal objects (without json tags), and the selectors are part of versioned objects, retrieving values of the limits and requests can be handled using these two solutions: -1. By converting an internal object to versioned obejct, and then using +1. By converting an internal object to versioned object, and then using the json path library to retrieve the values from the versioned object by processing the selector. -- cgit v1.2.3 From eeb0c899220ac21dec0f5ee68559e7a93694cb1c Mon Sep 17 00:00:00 2001 From: Peter Miron Date: Fri, 29 Jul 2016 13:23:37 -0400 Subject: New plugin must be imported Admission control design doc doesn't mention importing the plugin to plugins.go. I was unable to get the plugin to build into my binary without it. 
---
 admission_control.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/admission_control.md b/admission_control.md
index 32a0907e..f879bfd3 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -104,6 +104,17 @@ func init() {
 }
 ```

+A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go)
+
+```go
+  // Admission policies
+  _ "k8s.io/kubernetes/plugin/pkg/admission/admit"
+  _ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages"
+  _ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity"
+  ...
+  _ ""
+```
+
 Invocation of admission control is handled by the **APIServer** and not
 individual **RESTStorage** implementations.
-- cgit v1.2.3


From 4bb3757243ec9251af4bd49803293a6c21a36d66 Mon Sep 17 00:00:00 2001
From: Vladi
Date: Wed, 24 Aug 2016 12:59:44 -0700
Subject: Doc: Explain how to use EBS storage on AWS

---
 aws_under_the_hood.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 3ff1c17e..3e4f0456 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -94,6 +94,9 @@ often faster, and historically more reliable.

 Unless you can make do with whatever space is left on your root partition, you
 must choose an instance type that provides you with sufficient instance storage
 for your needs.

+To configure Kubernetes to use EBS storage, pass the environment variable
+`KUBE_AWS_STORAGE=ebs` to kube-up.
+
 Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to track
 its state. Similar to nodes, containers are mostly run against instance
 storage, except that we repoint some important data onto the persistent volume.
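The EBS hunk above makes the storage choice a single environment-variable switch: instance storage by default, EBS when `KUBE_AWS_STORAGE=ebs` is passed to kube-up. A tiny sketch of that decision, assuming the helper name and the "instance" default label are illustrative (only the `KUBE_AWS_STORAGE=ebs` spelling comes from the patch):

```python
import os


def aws_storage_backend(environ=os.environ):
    """Hypothetical sketch of the switch documented above: EBS only
    when KUBE_AWS_STORAGE=ebs is set, instance storage otherwise."""
    return "ebs" if environ.get("KUBE_AWS_STORAGE") == "ebs" else "instance"
```

kube-up itself is shell automation; the sketch only makes the documented default-vs-override behavior explicit.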
-- cgit v1.2.3


From ac8cbd9f74170a9bd9326c898ff36a4d2ef0b2cf Mon Sep 17 00:00:00 2001
From: yuexiao-wang
Date: Tue, 23 Aug 2016 18:32:10 +0800
Subject: Add targets for PHONY in Makefile

Signed-off-by: yuexiao-wang
---
 clustering/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/clustering/Makefile b/clustering/Makefile
index 945a5f0b..d5640164 100644
--- a/clustering/Makefile
+++ b/clustering/Makefile
@@ -35,9 +35,11 @@ docker:
	docker build -t clustering-seqdiag .
	docker run --rm clustering-seqdiag | tar xvf -

+.PHONY: docker-clean
 docker-clean:
	docker rmi clustering-seqdiag || true
	docker images -q --filter "dangling=true" | xargs docker rmi

+.PHONY: fix-clock-skew
 fix-clock-skew:
	boot2docker ssh sudo date -u -D "%Y%m%d%H%M.%S" --set "$(shell date -u +%Y%m%d%H%M.%S)"
-- cgit v1.2.3


From 233b226e432da33243789a6d215b17c91d8cd126 Mon Sep 17 00:00:00 2001
From: Jeff Mendoza
Date: Mon, 15 Aug 2016 13:04:34 -0700
Subject: Removed non-md files from docs. Moved doc yamls to test/fixtures.

Most of the contents of docs/ has moved to kubernetes.github.io.
Development of the docs and accompanying files has continued there,
making the copies in this repo stale. I've removed everything but
the .md files which remain to redirect old links.

The .yaml config files in the docs were used by some tests, these
have been moved to test/fixtures/doc-yaml, and can remain there to be
used by tests or other purposes.
---
 admission_control_resource_quota.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index 9ec884e0..2fa4c5f0 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -221,9 +221,9 @@ kubectl is modified to support the **ResourceQuota** resource.
 For example:

 ```console
-$ kubectl create -f docs/admin/resourcequota/namespace.yaml
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml
 namespace "quota-example" created
-$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example
 resourcequota "quota" created
 $ kubectl describe quota quota --namespace=quota-example
 Name: quota
-- cgit v1.2.3


From 40d31aca4282817172fc90c912d4285e6945942b Mon Sep 17 00:00:00 2001
From: YuPengZTE
Date: Thu, 1 Sep 2016 17:23:21 +0800
Subject: The first letter is small

---
 resource-qos.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resource-qos.md b/resource-qos.md
index f09f7e93..c852e415 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -228,7 +228,7 @@ Pod OOM score configuration

 - OOM_SCORE_ADJ: -998
 *Kubelet, Docker*
 - OOM_SCORE_ADJ: -999 (won’t be OOM killed)
-  - Hack, because these critical tasks might die if they conflict with guaranteed containers. in the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.
+  - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.

 ## Known issues and possible improvements
-- cgit v1.2.3


From 9bff312b3dccd8de754cce77f4cda0627c0f484f Mon Sep 17 00:00:00 2001
From: David McMahon
Date: Thu, 1 Sep 2016 14:40:55 -0700
Subject: Update the latestReleaseBranch to release-1.4 in the munger.
---
 README.md                                 | 2 +-
 access.md                                 | 2 +-
 admission_control.md                      | 2 +-
 admission_control_limit_range.md          | 2 +-
 admission_control_resource_quota.md       | 2 +-
 architecture.md                           | 2 +-
 aws_under_the_hood.md                     | 2 +-
 clustering.md                             | 2 +-
 clustering/README.md                      | 2 +-
 command_execution_port_forwarding.md      | 2 +-
 configmap.md                              | 2 +-
 control-plane-resilience.md               | 2 +-
 daemon.md                                 | 2 +-
 downward_api_resources_limits_requests.md | 2 +-
 enhance-pluggable-policy.md               | 2 +-
 event_compression.md                      | 2 +-
 expansion.md                              | 2 +-
 extending-api.md                          | 2 +-
 federated-services.md                     | 2 +-
 federation-phase-1.md                     | 2 +-
 horizontal-pod-autoscaler.md              | 2 +-
 identifiers.md                            | 2 +-
 indexed-job.md                            | 2 +-
 metadata-policy.md                        | 2 +-
 namespaces.md                             | 2 +-
 networking.md                             | 2 +-
 nodeaffinity.md                           | 2 +-
 persistent-storage.md                     | 2 +-
 podaffinity.md                            | 2 +-
 principles.md                             | 2 +-
 resource-qos.md                           | 2 +-
 resources.md                              | 2 +-
 scheduler_extender.md                     | 2 +-
 seccomp.md                                | 2 +-
 secrets.md                                | 2 +-
 security.md                               | 2 +-
 security_context.md                       | 2 +-
 selector-generation.md                    | 2 +-
 service_accounts.md                       | 2 +-
 simple-rolling-update.md                  | 2 +-
 taint-toleration-dedicated.md             | 2 +-
 versioning.md                             | 2 +-
 42 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/README.md b/README.md
index 2aa70c61..1a812e2e 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/README.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/README.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/access.md b/access.md
index 7377707f..a01576e4 100644
--- a/access.md
+++ b/access.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/access.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/access.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/admission_control.md b/admission_control.md
index f879bfd3..e9f52528 100644
--- a/admission_control.md
+++ b/admission_control.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/admission_control.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/admission_control.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
index 261b7007..27637769 100644
--- a/admission_control_limit_range.md
+++ b/admission_control_limit_range.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/admission_control_limit_range.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/admission_control_limit_range.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md
index 2fa4c5f0..8265c9a9 100644
--- a/admission_control_resource_quota.md
+++ b/admission_control_resource_quota.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/admission_control_resource_quota.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/admission_control_resource_quota.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/architecture.md b/architecture.md
index a78c6d45..5e489dfa 100644
--- a/architecture.md
+++ b/architecture.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/architecture.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/architecture.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
index 3ff1c17e..9702a4fa 100644
--- a/aws_under_the_hood.md
+++ b/aws_under_the_hood.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/aws_under_the_hood.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/aws_under_the_hood.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/clustering.md b/clustering.md
index 904b5d44..c5f67a20 100644
--- a/clustering.md
+++ b/clustering.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/clustering.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/clustering.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/clustering/README.md b/clustering/README.md
index 3692012d..014b96c2 100644
--- a/clustering/README.md
+++ b/clustering/README.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/clustering/README.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/clustering/README.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md
index 7da049b0..2af98cbe 100644
--- a/command_execution_port_forwarding.md
+++ b/command_execution_port_forwarding.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/command_execution_port_forwarding.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/command_execution_port_forwarding.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/configmap.md b/configmap.md
index 79104188..9b7fa0a2 100644
--- a/configmap.md
+++ b/configmap.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/configmap.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/configmap.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/control-plane-resilience.md b/control-plane-resilience.md
index 4945fc10..9e65a1e3 100644
--- a/control-plane-resilience.md
+++ b/control-plane-resilience.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/control-plane-resilience.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/control-plane-resilience.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/daemon.md b/daemon.md
index 3ba6c618..5185f2e4 100644
--- a/daemon.md
+++ b/daemon.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/daemon.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/daemon.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
index 6a5f660d..89907d22 100644
--- a/downward_api_resources_limits_requests.md
+++ b/downward_api_resources_limits_requests.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/downward_api_resources_limits_requests.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/downward_api_resources_limits_requests.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md
index 59f2fa91..529aa588 100644
--- a/enhance-pluggable-policy.md
+++ b/enhance-pluggable-policy.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/enhance-pluggable-policy.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/enhance-pluggable-policy.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/event_compression.md b/event_compression.md
index 6e3041d8..738c3a1c 100644
--- a/event_compression.md
+++ b/event_compression.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/event_compression.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/event_compression.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/expansion.md b/expansion.md
index b14b8d4a..277e7211 100644
--- a/expansion.md
+++ b/expansion.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/expansion.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/expansion.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/extending-api.md b/extending-api.md
index ff3b6b74..2a14e08e 100644
--- a/extending-api.md
+++ b/extending-api.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/extending-api.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/extending-api.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/federated-services.md b/federated-services.md
index 3f0f0b1e..fe050da3 100644
--- a/federated-services.md
+++ b/federated-services.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/federated-services.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/federated-services.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/federation-phase-1.md b/federation-phase-1.md
index 78c73e34..a1798f6e 100644
--- a/federation-phase-1.md
+++ b/federation-phase-1.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/federation-phase-1.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/federation-phase-1.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
index 01c4e66a..f76e3ee4 100644
--- a/horizontal-pod-autoscaler.md
+++ b/horizontal-pod-autoscaler.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/horizontal-pod-autoscaler.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/horizontal-pod-autoscaler.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/identifiers.md b/identifiers.md
index 1f2c2dc7..004b6bac 100644
--- a/identifiers.md
+++ b/identifiers.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/identifiers.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/identifiers.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/indexed-job.md b/indexed-job.md
index 432996d4..28655391 100644
--- a/indexed-job.md
+++ b/indexed-job.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/indexed-job.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/indexed-job.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/metadata-policy.md b/metadata-policy.md
index da371165..4ffb0ba4 100644
--- a/metadata-policy.md
+++ b/metadata-policy.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/metadata-policy.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/metadata-policy.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/namespaces.md b/namespaces.md
index bc903901..8aa44fe9 100644
--- a/namespaces.md
+++ b/namespaces.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/namespaces.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/namespaces.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/networking.md b/networking.md
index f55ffdc1..d022169b 100644
--- a/networking.md
+++ b/networking.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/networking.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/networking.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/nodeaffinity.md b/nodeaffinity.md
index 1459a321..18a079f2 100644
--- a/nodeaffinity.md
+++ b/nodeaffinity.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/nodeaffinity.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/nodeaffinity.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/persistent-storage.md b/persistent-storage.md
index 7ccbe8d4..19706b1a 100644
--- a/persistent-storage.md
+++ b/persistent-storage.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/persistent-storage.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/persistent-storage.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/podaffinity.md b/podaffinity.md
index 016160b6..33eaf60d 100644
--- a/podaffinity.md
+++ b/podaffinity.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/podaffinity.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/podaffinity.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/principles.md b/principles.md
index eac1e432..762cae01 100644
--- a/principles.md
+++ b/principles.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/principles.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/principles.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/resource-qos.md b/resource-qos.md
index f09f7e93..af84b648 100644
--- a/resource-qos.md
+++ b/resource-qos.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/resource-qos.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/resource-qos.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).

diff --git a/resources.md b/resources.md
index e03214aa..862e8d84 100644
--- a/resources.md
+++ b/resources.md
@@ -21,7 +21,7 @@ refer to the docs that go with that version.

 The latest release of this document can be found
-[here](http://releases.k8s.io/release-1.3/docs/design/resources.md).
+[here](http://releases.k8s.io/release-1.4/docs/design/resources.md).

 Documentation for other releases can be found at
 [releases.k8s.io](http://releases.k8s.io).
diff --git a/scheduler_extender.md b/scheduler_extender.md index b3082a8e..577f5100 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/scheduler_extender.md). +[here](http://releases.k8s.io/release-1.4/docs/design/scheduler_extender.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/seccomp.md b/seccomp.md index 9741d8e4..69d121cb 100644 --- a/seccomp.md +++ b/seccomp.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/seccomp.md). +[here](http://releases.k8s.io/release-1.4/docs/design/seccomp.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/secrets.md b/secrets.md index 457bbac9..bb15c3d5 100644 --- a/secrets.md +++ b/secrets.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/secrets.md). +[here](http://releases.k8s.io/release-1.4/docs/design/secrets.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/security.md b/security.md index 58be04c1..0c2b2ac9 100644 --- a/security.md +++ b/security.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/security.md). +[here](http://releases.k8s.io/release-1.4/docs/design/security.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). 
diff --git a/security_context.md b/security_context.md index 1ae5be90..1bc654f8 100644 --- a/security_context.md +++ b/security_context.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/security_context.md). +[here](http://releases.k8s.io/release-1.4/docs/design/security_context.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/selector-generation.md b/selector-generation.md index 46ed0adc..e54897a5 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/selector-generation.md). +[here](http://releases.k8s.io/release-1.4/docs/design/selector-generation.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/service_accounts.md b/service_accounts.md index 32b3a5c5..bef22c40 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/service_accounts.md). +[here](http://releases.k8s.io/release-1.4/docs/design/service_accounts.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 961b7b84..32a1cf35 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/simple-rolling-update.md). +[here](http://releases.k8s.io/release-1.4/docs/design/simple-rolling-update.md). 
Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index 99ba482c..1a882c09 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/taint-toleration-dedicated.md). +[here](http://releases.k8s.io/release-1.4/docs/design/taint-toleration-dedicated.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). diff --git a/versioning.md b/versioning.md index b2693100..bf3183dd 100644 --- a/versioning.md +++ b/versioning.md @@ -21,7 +21,7 @@ refer to the docs that go with that version. The latest release of this document can be found -[here](http://releases.k8s.io/release-1.3/docs/design/versioning.md). +[here](http://releases.k8s.io/release-1.4/docs/design/versioning.md). Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). -- cgit v1.2.3 From 672de53afe31f18f5edbe4d0e97d88eea3c9c333 Mon Sep 17 00:00:00 2001 From: Cindy Wang Date: Thu, 1 Sep 2016 14:04:55 -0700 Subject: Snapshotting design doc minor edits, changing prefix on annotations update-munge-docs --- volume-snapshotting.md | 552 ++++++++++++++++++++++++++++++++++++++++++++++++ volume-snapshotting.png | Bin 0 -> 49261 bytes 2 files changed, 552 insertions(+) create mode 100644 volume-snapshotting.md create mode 100644 volume-snapshotting.png diff --git a/volume-snapshotting.md b/volume-snapshotting.md new file mode 100644 index 00000000..641e247d --- /dev/null +++ b/volume-snapshotting.md @@ -0,0 +1,552 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +Kubernetes Snapshotting Proposal +================================ + +**Authors:** [Cindy Wang](https://github.com/ciwang) + +## Background + +Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). + +Typical existing backup solutions offer on-demand or scheduled snapshots. + +An application developer using a storage volume may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume. + +Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves. + +## Objectives + +For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals. + +* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers. + + * Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods. + +* Goal 2: Expose standardized snapshotting operations Create and List in the Kubernetes REST API. + + * Nongoal: Support Delete and Restore snapshot operations in the API. + +* Goal 3: Implement snapshotting interface for GCE PDs. 
+ + * Nongoal: Implement snapshotting interface for non GCE PD volumes. + +### Feature Roadmap + +Major features, in order of priority (bold features are priorities for v1): + +* **On demand snapshots** + + * **API to create new snapshots and list existing snapshots** + + * API to restore a disk from a snapshot and delete old snapshots + +* Scheduled snapshots + +* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node) + +## Requirements + +### Performance + +* Time SLA from issuing a snapshot to completion: + +* The period we are interested in is the time between the scheduled snapshot time and the time the snapshot finishes uploading to its storage location. + +* This should be on the order of a few minutes. + +### Reliability + +* Data corruption + + * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons: + + * GCE and Amazon can create snapshots while the application is running. + + * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code. + + * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown. + +* Snapshot failure + + * Case: Failure during external process, such as during API call or upload + + * Log error, retry until success (indefinitely) + + * Case: Failure within Kubernetes, such as controller restarts + + * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. 
This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state. + +## Solution Overview + +Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects. + +* **Create:** + + * Key: create.snapshot.volume.alpha.kubernetes.io + + * Value: [snapshot name] + +* **List:** + + * Key: snapshot.volume.alpha.kubernetes.io/[snapshot name] + + * Value: [snapshot timestamp] + +A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotation will be populated by the controller and will identify only the snapshots created for that PVC by Kubernetes. + +The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage). + +## Detailed Design + +### API + +* Create snapshot + + * Usage: + + * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter + + * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list. + + * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually. + + * PVC control loop in master + + * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV. + + * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV. 
+ +* List existing snapshots + + * Only displayed as annotations on PVC object. + + * Only lists unique names and timestamps of snapshots taken using the Kubernetes API. + + * Usage: + + * Get the PVC object + + * Snapshots are listed as key-value pairs within the PVC annotations + +### SnapshotController + +![Snapshot Controller Diagram](volume-snapshotting.png?raw=true "Snapshot controller diagram") + +**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests. + +**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully. + +**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful. + +The controller will have a loop that does the following: + +* Fetch State + + * Fetch all PVC objects from the API server. + +* Act + + * Trigger snapshot: + + * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation. + +* Persist State + + * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC Annotations and delete the create snapshot annotation in the PVC object via the API server. + +Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the reconciler should spawn separate threads for these operations via the operation executor. + +The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor. 
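To make the annotation-driven flow above concrete, here is a sketch of what a snapshot request might look like on a claim. This is illustrative only: the annotation key is the alpha key defined in the Solution Overview, while the claim name, snapshot name, and sizes are placeholders.

```yaml
# Hypothetical PVC requesting an on-demand snapshot named "my-snap".
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
  annotations:
    # Added by the user; the controller deletes this key once the
    # snapshot operation completes.
    create.snapshot.volume.alpha.kubernetes.io: my-snap
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Once the operation succeeds, the create annotation is removed and the snapshot appears as a list entry on the same PVC, e.g. `snapshot.volume.alpha.kubernetes.io/my-snap: "2016-09-01T14:04:55Z"`.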
+ +### Create Snapshot Logic + +To create a snapshot: + +* Acquire operation lock for volume so that no other attach or detach operations can be started for volume. + + * Abort if there is already a pending operation for the specified volume (main loop will retry, if needed). + +* Spawn a new thread: + + * Execute the volume-specific logic to create a snapshot of the persistent volume referenced by the PVC. + + * For any errors, log the error, and terminate the thread (the main controller will retry as needed). + + * Once a snapshot is created successfully: + + * Make a call to the API server to delete the create snapshot annotation in the PVC object. + + * Make a call to the API server to add the new snapshot ID/timestamp to the PVC Annotations. + +*Brainstorming notes below, read at your own risk!* + +* * * + + +Open questions: + +* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API? + + * It seems that the API route is a bit more feasible in implementation and can also be fully utilized. + + * Can the API call methods on VolumePlugins? Yeah via controller + + * The scheduler gives users functionality that doesn’t already exist, but required adding an entirely new controller + +* Should the list and restore operations be part of v1? + +* Do we call them snapshots or backups? + + * From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice it is necessary, but not sufficient, when conducting a backup of a stateful application." + +* At what minimum granularity should snapshots be allowed? + +* How do we store information about the most recent snapshot in case the controller restarts? + +* In case of error, do we err on the side of fewer or more snapshots? + +Snapshot Scheduler + +1. PVC API Object + +A new field, backupSchedule, will be added to the PVC API Object. The value of this field must be a cron expression. 
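As an illustrative sketch only — the backupSchedule field is the proposal above and does not exist in the Kubernetes API; all other values are placeholders — the field might appear in a claim like this:

```yaml
# Hypothetical PVC with the proposed backupSchedule field:
# snapshot every day at 02:00 UTC (minutes field fixed at 0).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  backupSchedule: "0 2 * * *"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

Under the validation rules below, an interval expression such as `*/15 * * * *` would be rejected, since snapshots are restricted to exact, at-most-hourly times in v1.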
+ +* CRUD operations on snapshot schedules + + * Create: Specify a snapshot within a PVC spec as a [cron expression](http://crontab-generator.org/) + + * The cron expression provides flexibility to decrease the interval between snapshots in future versions + + * Read: Display snapshot schedule to user via kubectl get pvc + + * Update: Do not support changing the snapshot schedule for an existing PVC + + * Delete: Do not support deleting the snapshot schedule for an existing PVC + + * In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name + +* Validation + + * Cron expressions must have a 0 in the minutes place and use exact, not interval syntax + + * [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, and GCE PD snapshots are likewise limited to minute granularity. Therefore for v1, we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals). + + * If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user + + Objective + +Goal: Enable automatic periodic snapshotting (NOTE: A snapshot is a read-only copy of a disk.) for all Kubernetes volume plugins. + +Goal: Implement snapshotting interface for GCE PDs. + +Goal: Protect against data loss by allowing users to restore snapshots of their disks. + +Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes. + +Nongoal: Use snapshotting to provide additional features such as migration. + + Background + +Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. 
Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). + +Currently, no container orchestration software (i.e. Kubernetes and its competitors) provides snapshot scheduling for application storage. + +Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes a clear competitive advantage for users who want automatic snapshotting on their volumes, and particularly those who want to configure application-specific schedules. + + what is the value case? Who wants this? What do we enable by implementing this? + +I think it introduces a lot of complexity, so what is the pay off? That should be clear in the document. Do mesos, or swarm or our competition implement this? AWS? Just curious. + +Requirements + +Functionality + +Should this support PVs, direct volumes, or both? + +Should we support deletion? + +Should we support restores? + +Automated schedule -- times or intervals? Before major event? + +Performance + +Snapshots are supposed to provide timely state freezing. What is the SLA from issuing one to it completing? + +* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting then remounted afterwards. 
+ + * Pending = uploading to GCE + +* EBS is the same, but if the volume is the root device the instance should be stopped before snapshotting + +Reliability + +How do we ascertain that deletions happen when we want them to? + +For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes. + +We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached a certain age. At this point we do not see an immediate need to implement automatic deletion (re:Saad) but may want to revisit this. + +What happens when the snapshot fails as these are async operations? + +Retry (for some time period? indefinitely?) and log the error + +Other + +What is the UI for seeing the list of snapshots? + +In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (i.e. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots. + +Overview + +There are several design options for each layer of the implementation, as follows. + +1. **Public API:** + +Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically. There are several options for where this specification can happen. In order from most to least invasive: + + 1. New Volume API object + + 1. Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects and its details are lost upon destruction of its enclosing object. + + 2. 
We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot. + + 3. The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object. + + 4. Pros + + 1. Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule. + + 5. Cons + + 2. Heavyweight codewise; involves changing and touching a lot of existing code. + + 3. Potentially bad UX: How is the Volume API object created? + + 1. By the user independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands. + + 2. When a unique volume is specified in a pod or PV spec. + + 2. Directly in volume definition in the pod/PV object + + 6. When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. + + 7. Pros + + 4. Easy for users to implement and understand + + 8. Cons + + 5. The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots. If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used. + + 6. 
Replica sets share the same pod spec, so support needs to be added to ensure that the shared underlying volume does not get a new snapshot for each member of the set. + + 3. Only in PV object + + 9. When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. + + 10. Pros + + 7. Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property. + + 11. Cons + + 8. No support for direct volumes + + 9. Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do and the same can be achieved with a simple cron job + + 10. Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules. + + 4. Annotations: key value pairs on API object + + 12. User experience is the same as (b) + + 13. Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], "ssTimes-vol2": [2,17]} where the values are slices of integer values representing UTC hours. + + 14. Pros + + 11. Less invasive to the codebase than (a-c) + + 15. Cons + + 12. Same problems as (b-c) with non-unique resources. The only difference here is the API object representation. + +2. **Business logic:** + + 5. Does this go on the master, node, or both? + + 16. Where the snapshot is stored + + 13. GCE, Amazon: cloud storage + + 14. Others stored on volume itself (gluster) or external drive (iSCSI) + + 17. Requirements for snapshot operation + + 15. Application flush, sync, and fsfreeze before creating snapshot + + 6. Suggestion: + + 18. New SnapshotController on master + + 16. 
Controller keeps a list of active pods/volumes, schedule for each, last snapshot + + 17. If controller restarts and we miss a snapshot in the process, just skip it + + 3. Alternatively, try creating the snapshot up to the time + retryPeriod (see 5) + + 18. If snapshotting call fails, retry for an amount of time specified in retryPeriod + + 19. Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep list of snapshot times, calculate time until next snapshot, and sleep for that period + + 19. Logic to prepare the disk for snapshotting on node + + 20. Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD) + + 7. Alternatives: logic entirely on node + + 20. Problems: + + 21. If pod moves from one node to another + + 4. A different node is now in charge of snapshotting + + 5. If the volume plugin requires external memory for snapshots, we need to move the existing data + + 22. If the same pod exists on two different nodes, which node is in charge? + +3. **Volume plugin interface/internal API:** + + 8. Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin) + + 9. When logic is triggered for a snapshot by the SnapshotController, the SnapshottableVolumePlugin calls out to volume plugin API to create snapshot + + 10. Similar to volume.attach call + +4. **Other questions:** + + 11. Snapshot period + + 12. Time or period + + 13. What is our SLO around time accuracy? + + 21. Best effort, but no guarantees (depends on time or period) -- if going with time. + + 14. What if we miss a snapshot? + + 22. We will retry (assuming this means that we failed) -- take at the nearest next opportunity + + 15. Will we know when an operation has failed? How do we report that? + + 23. Get response from volume plugin API, log in kubelet log, generate Kube event in success and failure cases + + 16. 
Will we be responsible for GCing old snapshots? + + 24. Maybe this can be an explicit non-goal; in the future we can automate garbage collection + + 17. If the pod dies do we continue creating snapshots? + + 18. How to communicate errors (PD doesn’t support snapshotting, time period unsupported) + + 19. Off-schedule snapshotting, like before an application upgrade + + 20. We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this? + +Options, pros, cons, suggestion/recommendation + +Example 1b + +During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod’s associated volume. + +For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2): + +apiVersion: v1 +kind: Pod +metadata: + name: test-pd +spec: + containers: + - image: gcr.io/google_containers/test-webserver + name: test-container + volumeMounts: + - mountPath: /test-pd + name: test-volume + volumes: + - name: test-volume + # This GCE PD must already exist. + gcePersistentDisk: + pdName: my-data-disk + fsType: ext4 + +Introduce a new field into the volume spec: + +apiVersion: v1 +kind: Pod +metadata: + name: test-pd +spec: + containers: + - image: gcr.io/google_containers/test-webserver + name: test-container + volumeMounts: + - mountPath: /test-pd + name: test-volume + volumes: + - name: test-volume + # This GCE PD must already exist. + gcePersistentDisk: + pdName: my-data-disk + fsType: ext4 + +**ssTimes: [1, 5]** + + Caveats + +* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because + + * this does not provide value to the user and only adds an extra layer of indirection/complexity. + + * ? 
+ + Dependencies + +* Kubernetes + +* Persistent volume snapshot support through API + + * POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/volume-snapshotting.md?pixel)]() + diff --git a/volume-snapshotting.png b/volume-snapshotting.png new file mode 100644 index 00000000..1b1ea748 Binary files /dev/null and b/volume-snapshotting.png differ -- cgit v1.2.3 From a8dc20232cf228556e97c4e14cf0e8b10011fa1d Mon Sep 17 00:00:00 2001 From: Matt Liggett Date: Tue, 6 Sep 2016 17:41:40 -0700 Subject: re-run update-munge-docs --- volume-snapshotting.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/volume-snapshotting.md b/volume-snapshotting.md index 641e247d..ad318d59 100644 --- a/volume-snapshotting.md +++ b/volume-snapshotting.md @@ -18,6 +18,11 @@ If you are using a released version of Kubernetes, you should refer to the docs that go with that version. + + +The latest release of this document can be found +[here](http://releases.k8s.io/release-1.4/docs/design/volume-snapshotting.md). + Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io). -- cgit v1.2.3 From f75d4e6b985bac3348a453d6a278bdf4573d68c6 Mon Sep 17 00:00:00 2001 From: Jordan Liggitt Date: Thu, 8 Sep 2016 16:21:58 -0400 Subject: Doc API group suffix, add test to catch new groups --- extending-api.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/extending-api.md b/extending-api.md index 2a14e08e..6ce3159f 100644 --- a/extending-api.md +++ b/extending-api.md @@ -80,9 +80,9 @@ expected to be programmatically convertible to the name of the resource using the following conversion. Kinds are expected to be of the form ``, and the `APIVersion` for the object is expected to be `/`. 
To prevent collisions, it's expected that you'll -use a fully qualified domain name for the API group, e.g. `example.com`. +use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`. -For example `stable.example.com/v1` +For example `mygroup.example.com/v1` 'CamelCaseKind' is the specific type name. @@ -101,9 +101,9 @@ for ix := range kindName { } ``` -As a concrete example, the resource named `camel-case-kind.example.com` defines +As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines resources of Kind `CamelCaseKind`, in the APIGroup with the prefix -`example.com/...`. +`mygroup.example.com/...`. The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information. This is also the reason why `ThirdPartyResource` is @@ -120,7 +120,7 @@ For example, if a user creates: ```yaml metadata: - name: cron-tab.stable.example.com + name: cron-tab.mygroup.example.com apiVersion: extensions/v1beta1 kind: ThirdPartyResource description: "A specification of a Pod to run on a cron style schedule" @@ -130,7 +130,7 @@ versions: ``` Then the API server will program in the new RESTful resource path: - * `/apis/stable.example.com/v1/namespaces//crontabs/...` + * `/apis/mygroup.example.com/v1/namespaces//crontabs/...` **Note: This may take a while before RESTful resource path registration happen, please always check this before you create resource instances.** @@ -142,20 +142,20 @@ Now that this schema has been created, a user can `POST`: "metadata": { "name": "my-new-cron-object" }, - "apiVersion": "stable.example.com/v1", + "apiVersion": "mygroup.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", "image": "my-awesome-cron-image" } ``` -to: `/apis/stable.example.com/v1/namespaces/default/crontabs` +to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs` and the corresponding data will be stored into etcd by the APIServer, so that when the user issues: ``` -GET 
/apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object` +GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object` ``` And when they do that, they will get back the same data, but with additional @@ -164,21 +164,21 @@ Kubernetes metadata (e.g. `resourceVersion`, `createdTimestamp`) filled in. Likewise, to list all resources, a user can issue: ``` -GET /apis/stable.example.com/v1/namespaces/default/crontabs +GET /apis/mygroup.example.com/v1/namespaces/default/crontabs ``` and get back: ```json { - "apiVersion": "stable.example.com/v1", + "apiVersion": "mygroup.example.com/v1", "kind": "CronTabList", "items": [ { "metadata": { "name": "my-new-cron-object" }, - "apiVersion": "stable.example.com/v1", + "apiVersion": "mygroup.example.com/v1", "kind": "CronTab", "cronSpec": "* * * * /5", "image": "my-awesome-cron-image" -- cgit v1.2.3 From 9bbb2c90567802b95cb6b2509fc73f16bc195136 Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Wed, 7 Sep 2016 18:32:38 -0700 Subject: Fix oom-score-adj policy in kubelet. Docker daemon and kubelet needs to be protected by setting oom-score-adj to -999. Signed-off-by: Vishnu kannan --- resource-qos.md | 1 + 1 file changed, 1 insertion(+) diff --git a/resource-qos.md b/resource-qos.md index b2a43c97..6a8e8ab2 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -226,6 +226,7 @@ Pod OOM score configuration *Pod infra containers* or *Special Pod init process* - OOM_SCORE_ADJ: -998 + *Kubelet, Docker* - OOM_SCORE_ADJ: -999 (won’t be OOM killed) - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume. 
-- cgit v1.2.3 From b31fe4fe236adba7209d13ff0eda96ea9805e14a Mon Sep 17 00:00:00 2001 From: Vish Kannan Date: Thu, 15 Sep 2016 19:28:59 -0700 Subject: Revert "[kubelet] Fix oom-score-adj policy in kubelet" --- resource-qos.md | 1 - 1 file changed, 1 deletion(-) diff --git a/resource-qos.md b/resource-qos.md index 6a8e8ab2..b2a43c97 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -226,7 +226,6 @@ Pod OOM score configuration *Pod infra containers* or *Special Pod init process* - OOM_SCORE_ADJ: -998 - *Kubelet, Docker* - OOM_SCORE_ADJ: -999 (won’t be OOM killed) - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume. -- cgit v1.2.3 From e26d642d67c1cbf34e65d45634c3f43060e9a1e0 Mon Sep 17 00:00:00 2001 From: Vish Kannan Date: Fri, 16 Sep 2016 16:32:58 -0700 Subject: Revert "Revert "[kubelet] Fix oom-score-adj policy in kubelet"" --- resource-qos.md | 1 + 1 file changed, 1 insertion(+) diff --git a/resource-qos.md b/resource-qos.md index b2a43c97..6a8e8ab2 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -226,6 +226,7 @@ Pod OOM score configuration *Pod infra containers* or *Special Pod init process* - OOM_SCORE_ADJ: -998 + *Kubelet, Docker* - OOM_SCORE_ADJ: -999 (won’t be OOM killed) - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume. -- cgit v1.2.3 From 07619709a9cebb66e374866272ff714dfe9ae6ac Mon Sep 17 00:00:00 2001 From: YuPengZTE Date: Tue, 20 Sep 2016 16:21:14 +0800 Subject: ie. is should be i.e. 
Signed-off-by: YuPengZTE --- security_context.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/security_context.md b/security_context.md index 1bc654f8..db0d4390 100644 --- a/security_context.md +++ b/security_context.md @@ -86,7 +86,7 @@ shared disks. constraints to isolate containers from their host. Different use cases need different settings. * The concept of a security context should not be tied to a particular security -mechanism or platform (ie. SELinux, AppArmor) +mechanism or platform (i.e. SELinux, AppArmor) * Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for [service accounts](service_accounts.md). -- cgit v1.2.3 From 490fb6ff1e707a3665a3165af52fa68f1461bfce Mon Sep 17 00:00:00 2001 From: YuPengZTE Date: Mon, 26 Sep 2016 17:05:53 +0800 Subject: The VS and dot is seprated Signed-off-by: YuPengZTE --- access.md | 2 +- admission_control_resource_quota.md | 2 +- service_accounts.md | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/access.md b/access.md index a01576e4..6021ac37 100644 --- a/access.md +++ b/access.md @@ -254,7 +254,7 @@ In the Enterprise Profile: In the Simple Profile: - There is a single `namespace` used by the single user. -Namespaces versus userAccount vs Labels: +Namespaces versus userAccount vs. Labels: - `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s. - `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md)) diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 8265c9a9..4727dc0c 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -121,7 +121,7 @@ If a third-party wants to track additional resources, it must follow the resource naming conventions prescribed by Kubernetes. This means the resource must have a fully-qualified name (i.e. 
mycompany.org/shinynewresource) -## Resource Requirements: Requests vs Limits +## Resource Requirements: Requests vs. Limits If a resource supports the ability to distinguish between a request and a limit for a resource, the quota tracking system will only cost the request value diff --git a/service_accounts.md b/service_accounts.md index bef22c40..795f5212 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -113,9 +113,9 @@ system external to Kubernetes. Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or -may be qualified to allow for federated identity (`alice@example.com` vs +may be qualified to allow for federated identity (`alice@example.com` vs. `alice@example.org`.) Naming convention may distinguish service accounts from -user accounts (e.g. `alice@example.com` vs +user accounts (e.g. `alice@example.com` vs. `build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but Kubernetes does not require this. -- cgit v1.2.3 From 418d7c0c102b1b6c166b68724db90b1b0d220520 Mon Sep 17 00:00:00 2001 From: Paul Morie Date: Tue, 27 Sep 2016 11:19:54 -0400 Subject: Move SELinux proposal to docs/design --- selinux.md | 346 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 346 insertions(+) create mode 100644 selinux.md diff --git a/selinux.md b/selinux.md new file mode 100644 index 00000000..0b67ea4a --- /dev/null +++ b/selinux.md @@ -0,0 +1,346 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +## Abstract + +A proposal for enabling containers in a pod to share volumes using a pod level SELinux context. + +## Motivation + +Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin +authors should not have to explicitly account for SELinux except for volume types that require +special handling of the SELinux context during setup. + +Currently, each container in a pod has an SELinux context. This is not an ideal factoring for +sharing resources using SELinux. + +We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a +generic way. + +Goals of this design: + +1. Describe the problems with a container SELinux context +2. Articulate a design for generic SELinux support for volumes using a pod level SELinux context + which is backward compatible with the v1.0.0 API + +## Constraints and Assumptions + +1. We will not support securing containers within a pod from one another +2. Volume plugins should not have to handle setting SELinux context on volumes +3. We will not deal with shared storage + +## Current State Overview + +### Docker + +Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux +context of a container can be overridden with the `SecurityOpt` api that allows setting the different +parts of the SELinux context individually. + +Docker has functionality to relabel bind-mounts with a usable SElinux and supports two different +use-cases: + +1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's + SELinux context +2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's + SElinux context, but remove the MCS labels, making the volume shareable between containers + +We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container +(from an SELinux standpoint) can use the volume. 
+
+### rkt
+
+rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
+and allocates a unique MCS label per pod.
+
+### Kubernetes
+
+
+There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the
+EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
+patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the general
+problem of handling SELinux in Kubernetes rather than merging this PR.
+
+A new `PodSecurityContext` type has been added that carries information about security attributes
+that apply to the entire pod and that apply to all containers in a pod. See:
+
+1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
+1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
+
+## Use Cases
+
+1. As a cluster operator, I want to support securing pods from one another using SELinux when
+   SELinux integration is enabled in the cluster
+2. As a user, I want volume sharing to work correctly amongst containers in pods
+
+#### SELinux context: pod- or container-level?
+
+Currently, the SELinux context is specifiable only at the container level. This is an inconvenient
+factoring for sharing volumes and other SELinux-secured resources between containers because there
+is no way in SELinux to share resources between processes with different MCS labels except to
+remove MCS labels from the shared resource. This is a big security risk: _any container_ in the
+system can work with a resource which has the same SELinux context as it and no MCS labels. Since
+we are also not interested in isolating containers in a pod from one another, the SELinux context
+should be shared by all containers in a pod to facilitate isolation from the containers in other
+pods and sharing resources amongst all the containers of a pod. 
+ +#### Volumes + +Kubernetes volumes can be divided into two broad categories: + +1. Unshared storage: + 1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret, + downward api. All volumes in this category delegate to `EmptyDir` for their underlying + storage. + 2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively + by a single pod*. +2. Shared storage: + 1. `hostPath` is shared storage because it is necessarily used by a container and the host + 2. Network file systems such as NFS, Glusterfs, Cephfs, etc. + 3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because + they may be used simultaneously by multiple pods. + +For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon` +operation on the volume directory after running the volume plugin's `Setup` function. For these +volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume +plugin code. Some volume plugins may need to use the SELinux context during a mount operation in +certain cases. To account for this, our design must have a way for volume plugins to state that +a particular volume should or should not receive generic label management. + +For shared storage, the picture is murkier. Labels for existing shared storage will be managed +outside Kubernetes and administrators will have to set the SELinux context of pods correctly. +The problem of solving SELinux label management for new shared storage is outside the scope for +this proposal. + +## Analysis + +The system needs to be able to: + +1. Model correctly which volumes require SELinux label management +1. 
Relabel volumes with the correct SELinux context when required
+
+### Modeling whether a volume requires label management
+
+#### Unshared storage: volumes derived from `EmptyDir`
+
+Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
+that the ownership and SELinux context (when relevant) are set correctly for the volume to be
+usable.
+
+#### Unshared storage: network block devices
+
+Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
+as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
+volumes, permissions and ownership can be managed on the client side by the Kubelet when used
+exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
+`ReadWriteOnce` mode, they are effectively unshared storage.
+
+When used by multiple pods, there are many additional use-cases to analyze before we can be
+confident that we can support SELinux label management robustly with these file systems. The right
+design is one that makes it easy to experiment and develop support for ownership management with
+volume plugins to enable developers and cluster operators to continue exploring these issues.
+
+#### Shared storage: hostPath
+
+The `hostPath` volume should only be used by effective-root users, and the permissions of paths
+exposed into containers via hostPath volumes should always be managed by the cluster operator. If
+the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
+volume could effect changes to the state of arbitrary paths within the host's filesystem. This
+would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
+never perform ownership management for.
+
+#### Shared storage: network
+
+Ownership management of shared storage is a complex topic. 
SELinux labels for existing shared +storage will be managed externally from Kubernetes. For this case, our API should make it simple to +express whether a particular volume should have these concerns managed by Kubernetes. + +We will not attempt to address the concerns of new shared storage in this proposal. + +When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany` +modes, it is shared storage, and thus outside the scope of this proposal. + +#### API requirements + +From the above, we know that label management must be applied: + +1. To some volume types always +2. To some volume types never +3. To some volume types *sometimes* + +Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it +is desirable for other container runtime implementations to provide similar functionality. + +Relabeling should be an optional aspect of a volume plugin to accommodate: + +1. volume types for which generalized relabeling support is not sufficient +2. testing for each volume plugin individually + +## Proposed Design + +Our design should minimize code for handling SELinux labelling required in the Kubelet and volume +plugins. + +### Deferral: MCS label allocation + +Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the +primitives for higher level composition; making these automatic is a longer-term goal. Allocating +groups and MCS labels are fairly complex problems in their own right, and so our proposal will not +encompass either of these topics. There are several problems that the solution for allocation +depends on: + +1. Users and groups in Kubernetes +2. General auth policy in Kubernetes +3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893) + +### API changes + +The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823) +adds a `pod.Spec.SecurityContext.SELinuxOptions` field. 
The change to the API in this proposal is
+the addition of the semantics to this field:
+
+* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership
+management in the Kubelet have their SELinux context set from this field.
+
+```go
+package api
+
+type PodSecurityContext struct {
+    // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+    // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+    //
+    // This field will be used to set the SELinux context of volumes that support SELinux label
+    // management by the kubelet.
+    SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+The V1 API is extended with the same semantics:
+
+```go
+package v1
+
+type PodSecurityContext struct {
+    // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+    // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+    //
+    // This field will be used to set the SELinux context of volumes that support SELinux label
+    // management by the kubelet.
+    SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+#### API backward compatibility
+
+Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
+SELinux label management for their volumes. This is acceptable since old clients won't know about
+this field and won't have any expectation of their volumes being managed this way.
+
+The existing backward compatibility semantics for SELinux do not change at all with this proposal.
+
+### Kubelet changes
+
+The Kubelet should be modified to perform SELinux label management when required for a volume. The
+criteria to activate the kubelet SELinux label management for volumes are:
+
+1. SELinux integration is enabled in the cluster
+2. SELinux is enabled on the node
+3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set
+4. 
The volume plugin supports SELinux label management
+
+The `volume.Mounter` interface should have a new method added that indicates whether the plugin
+supports SELinux label management:
+
+```go
+package volume
+
+type Mounter interface {
+    // other methods omitted
+    SupportsSELinux() bool
+}
+```
+
+Individual volume plugins are responsible for correctly reporting whether they support label
+management in the kubelet. In the first round of work, only `hostPath`, `emptyDir`, and its
+derivations will be tested with SELinux label management support:
+
+| Plugin Name | SupportsSELinux |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | false |
+| `awsElasticBlockStore` | false |
+| `nfs` | false |
+| `iscsi` | false |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | false |
+| `cinder` | false |
+| `cephfs` | false |
+
+Ultimately, the matrix will theoretically look like:
+
+| Plugin Name | SupportsSELinux |
+|-------------------------|-------------------------------|
+| `hostPath` | false |
+| `emptyDir` | true |
+| `gitRepo` | true |
+| `secret` | true |
+| `downwardAPI` | true |
+| `gcePersistentDisk` | true |
+| `awsElasticBlockStore` | true |
+| `nfs` | false |
+| `iscsi` | true |
+| `glusterfs` | false |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd` | true |
+| `cinder` | false |
+| `cephfs` | false |
+
+In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a
+function of the container runtime implementations. Initially, we will modify the docker runtime
+implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish
+generic label management for docker containers. 
+ +Volume types that require SELinux context information at mount must be injected with and respect the +enablement setting for the labeling for the volume type. The proposed `VolumeConfig` mechanism +will be used to carry information about label management enablement to the volume plugins that have +to manage labels individually. + +This allows the volume plugins to determine when they do and don't want this type of support from +the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet. + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selinux.md?pixel)]() + -- cgit v1.2.3 From 7a23d7bbd0d1c1b665cc8ebcdd11314e26f27b17 Mon Sep 17 00:00:00 2001 From: Doug Davis Date: Thu, 5 May 2016 13:41:49 -0700 Subject: Change minion to node Contination of #1111 I tried to keep this PR down to just a simple search-n-replace to keep things simple. I may have gone too far in some spots but its easy to roll those back if needed. I avoided renaming `contrib/mesos/pkg/minion` because there's already a `contrib/mesos/pkg/node` dir and fixing that will require a bit of work due to a circular import chain that pops up. So I'm saving that for a follow-on PR. I rolled back some of this from a previous commit because it just got to big/messy. Will follow up with additional PRs Signed-off-by: Doug Davis --- aws_under_the_hood.md | 14 +++++++------- event_compression.md | 8 ++++---- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 9702a4fa..77b18d75 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -64,7 +64,7 @@ you manually created or configured your cluster. 
### Architecture overview
 
Kubernetes is a cluster of several machines that consists of a Kubernetes
-master and a set number of nodes (previously known as 'minions') for which the
+master and a set number of nodes for which the
master is responsible. See the [Architecture](architecture.md) topic for
more details.
 
@@ -161,7 +161,7 @@ Note that we do not automatically open NodePort services in the AWS firewall
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
-(`kubernetes-minion-`).
+(`kubernetes-node-`).
 
For SSL support, starting with 1.3, two annotations can be added to a service:
 
@@ -194,7 +194,7 @@ modifying the headers.
 
kube-proxy sets up two IAM roles, one for the master called
[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
and one for the nodes called
-[kubernetes-minion](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
+[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
 
The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
@@ -242,7 +242,7 @@ HTTP URLs are passed to instances; this is how Kubernetes code gets onto the
machines.
* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
  * `kubernetes-master` is used by the master.
-  * `kubernetes-minion` is used by nodes.
+  * `kubernetes-node` is used by nodes.
* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is the OpenSSH
key fingerprint, so that multiple users can run the script with different
keys and their keys will not collide (with near-certainty). 
It will @@ -265,7 +265,7 @@ The debate is open here, where cluster-per-AZ is discussed as more robust but cross-AZ-clusters are more convenient. * Associates the subnet to the route table * Creates security groups for the master (`kubernetes-master-`) -and the nodes (`kubernetes-minion-`). +and the nodes (`kubernetes-node-`). * Configures security groups so that masters and nodes can communicate. This includes intercommunication between masters and nodes, opening SSH publicly for both masters and nodes, and opening port 443 on the master for the HTTPS @@ -281,8 +281,8 @@ information that must be passed in this way. routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to 10.246.0.0/24). * For auto-scaling, on each nodes it creates a launch configuration and group. -The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-minion-group. The default -name is kubernetes-minion-group. The auto-scaling group has a min and max size +The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-node-group. The default +name is kubernetes-node-group. The auto-scaling group has a min and max size that are both set to NUM_NODES. You can change the size of the auto-scaling group to add or remove the total number of nodes from within the AWS API or Console. Each nodes self-configures, meaning that they come up; run Salt with diff --git a/event_compression.md b/event_compression.md index 738c3a1c..bbac945a 100644 --- a/event_compression.md +++ b/event_compression.md @@ -170,10 +170,10 @@ Sample kubectl output: ```console FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE -Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. 
-Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. 
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -- cgit v1.2.3 From 9156099ab26575afe88421fd8b9c2608b4217009 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Wed, 28 Sep 2016 16:58:55 -0500 Subject: docs/networking: update IPv6 support section --- networking.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/networking.md b/networking.md index d022169b..f1d973c5 100644 --- a/networking.md +++ b/networking.md @@ -207,10 +207,13 @@ External IP assignment would also simplify DNS support (see below). ### IPv6 -IPv6 would be a nice option, also, but we can't depend on it yet. Docker support -is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), -[Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), -[Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). +IPv6 support would be nice but requires significant internal changes in a few +areas. First pods should be able to report multiple IP addresses +[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398) +and the network plugin architecture Kubernetes uses needs to allow returning +IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245). +Kubernetes code that deals with IP addresses must then be audited and fixed to +support both IPv4 and IPv6 addresses and not assume IPv4. 
Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-) -- cgit v1.2.3 From bf5cb84fa95379462fb0934686e415d0a61745f2 Mon Sep 17 00:00:00 2001 From: Joe Beda Date: Wed, 10 Aug 2016 15:51:23 -0700 Subject: Remove support for boot2docker --- clustering/Makefile | 4 ---- clustering/README.md | 4 ---- 2 files changed, 8 deletions(-) diff --git a/clustering/Makefile b/clustering/Makefile index d5640164..e72d441e 100644 --- a/clustering/Makefile +++ b/clustering/Makefile @@ -39,7 +39,3 @@ docker: docker-clean: docker rmi clustering-seqdiag || true docker images -q --filter "dangling=true" | xargs docker rmi - -.PHONY: fix-clock-skew -fix-clock-skew: - boot2docker ssh sudo date -u -D "%Y%m%d%H%M.%S" --set "$(shell date -u +%Y%m%d%H%M.%S)" diff --git a/clustering/README.md b/clustering/README.md index 014b96c2..d662b952 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -56,10 +56,6 @@ The first run will be slow but things should be fast after that. To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`. -If you are using boot2docker and get warnings about clock skew (or if things -aren't building for some reason) then you can fix that up with -`make fix-clock-skew`. 
- 
 ## Automatically rebuild on file changes If you have the fswatch utility installed, you can have it monitor the file -- cgit v1.2.3 From 1a5d6cfb10d7b9338ac8b7d41ab171785603cd22 Mon Sep 17 00:00:00 2001 From: markturansky Date: Mon, 15 Aug 2016 10:19:15 -0400 Subject: add pvc storage to LimitRange --- admission_control_limit_range.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index 27637769..aa0134b4 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -47,6 +47,7 @@ as part of admission control. 4. Ability to specify default resource limits for a container 5. Ability to specify default resource requests for a container 6. Ability to enforce a ratio between request and limit for a resource. +7. Ability to enforce min/max storage requests for persistent volume claims ## Data Model @@ -209,6 +210,23 @@ Across all containers in pod, the following must hold true | Max | Limit (required) <= Max | | LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) | +**Type: PersistentVolumeClaim** + +Supported Resources: + +1. storage + +Supported Constraints: + +Across all claims in a namespace, the following must hold true: + +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min <= Request (required) | +| Max | Max >= Request (required) | + +Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time. 
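The min/max constraint above amounts to a simple range check on each claim's storage request. A minimal sketch of that admission check (helper name is hypothetical, and quantities are simplified to plain byte counts rather than Kubernetes `resource.Quantity` values):

```go
package main

import "fmt"

// validatePVCStorage checks a claim's storage request against LimitRange
// min/max constraints: Min <= Request <= Max must hold.
// Quantities are simplified to byte counts for illustration.
func validatePVCStorage(request, min, max int64) error {
	if request < min {
		return fmt.Errorf("storage request %d is less than minimum %d", request, min)
	}
	if max > 0 && request > max {
		return fmt.Errorf("storage request %d exceeds maximum %d", request, max)
	}
	return nil
}

func main() {
	gi := int64(1 << 30) // 1Gi in bytes
	// A 5Gi claim against a 1Gi..10Gi LimitRange is admitted.
	fmt.Println(validatePVCStorage(5*gi, gi, 10*gi) == nil)
	// A 20Gi claim against the same range is rejected.
	fmt.Println(validatePVCStorage(20*gi, gi, 10*gi))
}
```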
+ ## Run-time configuration The default ```LimitRange``` that is applied via Salt configuration will be -- cgit v1.2.3 From e66185b8628300bedfd68f7b886343a027acb44d Mon Sep 17 00:00:00 2001 From: Denis Andrejew Date: Fri, 7 Oct 2016 12:48:02 +0200 Subject: fix typo in podaffinity.md --- podaffinity.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/podaffinity.md b/podaffinity.md index 33eaf60d..d9796cd9 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -213,7 +213,7 @@ are satisfied for each node, and the node(s) with the highest weight(s) are the most preferred. In reality there are two variants of `RequiredDuringScheduling`: one suffixed -with `RequiredDuringEecution` and one suffixed with `IgnoredDuringExecution`. +with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. For the first variant, if the affinity/anti-affinity ceases to be met at some point during pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod from its node. In the second variant, the system may -- cgit v1.2.3 From a218555a5427415774a70698ab0130a94bababfd Mon Sep 17 00:00:00 2001 From: Filip Grzadkowski Date: Thu, 13 Oct 2016 22:30:14 +0200 Subject: Add monitoring architecture. --- monitoring_architecture.md | 232 ++++++++++++++++++++++++++++++++++++++++++++ monitoring_architecture.png | Bin 0 -> 76662 bytes 2 files changed, 232 insertions(+) create mode 100644 monitoring_architecture.md create mode 100644 monitoring_architecture.png diff --git a/monitoring_architecture.md b/monitoring_architecture.md new file mode 100644 index 00000000..b1fc51b9 --- /dev/null +++ b/monitoring_architecture.md @@ -0,0 +1,232 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). +
+-- + + + + + +# Kubernetes monitoring architecture + +## Executive Summary + +Monitoring is split into two pipelines: + +* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down +Heapster called metrics-server, and the API server serving the master metrics API. These +metrics are used by core system components, such as scheduling logic (e.g. scheduler and +horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components +(e.g. `kubectl top`). This pipeline is not intended for integration with third-party +monitoring systems. +* A **monitoring pipeline** used for collecting various metrics from the system and exposing +them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore +via adapters. Users can choose from many monitoring system vendors, or run none at all. In +open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options +will be easy to install. We expect that such pipelines will typically consist of a per-node +agent and a cluster-level aggregator. + +The architecture is illustrated in the diagram in the Appendix of this doc. + +## Introduction and Objectives + +This document proposes a high-level monitoring architecture for Kubernetes. It covers +a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc, +specifically focusing on an architecture (components and their interactions) that +hopefully meets the numerous requirements. We do not specify any particular timeframe +for implementing this architecture, nor any particular roadmap for getting there. + +### Terminology + +There are two types of metrics, system metrics and service metrics. System metrics are +generic metrics that are generally available from every entity that is monitored (e.g. +usage of CPU and memory by container and node). Service metrics are explicitly defined +in application code and exported (e.g. 
number of 500s served by the API server). Both +system metrics and service metrics can originate from users’ containers or from system +infrastructure components (master components like the API server, addon pods running on +the master, and addon pods running on user nodes). + +We divide system metrics into: + +* *core metrics*, which are metrics that Kubernetes understands and uses for operation +of its internal components and core utilities -- for example, metrics used for scheduling +(including the inputs to the algorithms for resource estimation, initial resources/vertical +autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics), +the kube dashboard, and “kubectl top.” As of now this would consist of cumulative CPU usage, +instantaneous memory usage, disk usage of pods, and disk usage of containers. +* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they +include the core metrics (though not necessarily in a format Kubernetes understands) plus +additional metrics. + +Service metrics can be divided into those produced by Kubernetes infrastructure components +(and thus useful for operation of the Kubernetes cluster) and those produced by user applications. +Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics. +Of course, horizontal pod autoscaling also uses core metrics. + +We consider logging to be separate from monitoring, so logging is outside the scope of +this doc. 
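A service metric of the kind described above (e.g. a 500 counter) is defined directly in application code. A minimal sketch using only Go's standard-library `expvar` package (a Prometheus client library would be the more common choice in practice):

```go
package main

import (
	"expvar"
	"fmt"
)

// A service metric: defined explicitly in application code and exported.
// When the application also runs an HTTP server, expvar publishes every
// variable created this way at /debug/vars, where an agent can scrape it.
var serverErrors = expvar.NewInt("http_500_total")

// recordFailure is called wherever the application returns a 500.
func recordFailure() {
	serverErrors.Add(1)
}

func main() {
	recordFailure()
	recordFailure()
	fmt.Println(serverErrors.Value()) // 2
}
```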
+ +### Requirements + +The monitoring architecture should + +* include a solution that is part of core Kubernetes and + * makes core system metrics about nodes, pods, and containers available via a standard + master API (today the master metrics API), such that core Kubernetes features do not + depend on non-core components + * requires Kubelet to only export a limited set of metrics, namely those required for + core Kubernetes components to correctly operate (this is related to #18770) + * can scale up to at least 5000 nodes + * is small enough that we can require that all of its components be running in all deployment + configurations +* include an out-of-the-box solution that can serve historical data, e.g. to support Initial +Resources and vertical pod autoscaling as well as cluster analytics queries, that depends +only on core Kubernetes +* allow for third-party monitoring solutions that are not part of core Kubernetes and can +be integrated with components like Horizontal Pod Autoscaler that require service metrics + +## Architecture + +We divide our description of the long-term architecture plan into the core metrics pipeline +and the monitoring pipeline. For each, it is necessary to think about how to deal with each +type of metric (core metrics, non-core metrics, and service metrics) from both the master +and minions. + +### Core metrics pipeline + +The core metrics pipeline collects a set of core system metrics. There are two sources for +these metrics + +* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that +is part of Kubelet will be slimmed down to provide only core system metrics) +* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from +Kubelet into resource estimates (values used by scheduler for a more advanced usage-based +scheduler) + +These sources are scraped by a component we call *metrics-server* which is like a slimmed-down +version of today's Heapster. 
metrics-server stores only the latest values locally and has no sinks. +metrics-server exposes the master metrics API. (The configuration described here is similar +to the current Heapster in “standalone” mode.) +[Discovery summarizer](../../docs/proposals/federated-api-servers.md) +makes the master metrics API available to external clients such that from the client’s perspective +it looks the same as talking to the API server. + +Core (system) metrics are handled as described above in all deployment environments. The only +easily replaceable part is the resource estimator, which could be replaced by power users. In +theory, metrics-server itself can also be substituted, but it’d be similar to substituting +apiserver itself or controller-manager - possible, but not recommended and not supported. + +Eventually the core metrics pipeline might also collect metrics from Kubelet and the Docker daemon +themselves (e.g. CPU usage of Kubelet), even though they do not run in containers. + +The core metrics pipeline is intentionally small and not designed for third-party integrations. +“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline +(see next section) and can run on Kubernetes without having to make changes to upstream components. +In this way we can remove the burden we have today that comes with maintaining Heapster as the +integration point for every possible metrics source, sink, and feature. + +#### Infrastore + +We will build an open-source Infrastore component (most likely reusing existing technologies) +for serving historical queries over core system metrics and events, which it will fetch from +the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries -- +this is TBD) to handle the following use cases: + +* initial resources +* vertical autoscaling +* oldtimer API +* decision-support queries for debugging, capacity planning, etc. 
+* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) + +In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes +infrastructure containers), described in the upcoming sections. + +### Monitoring pipeline + +One of the goals of building a dedicated metrics pipeline for core metrics, as described in the +previous section, is to allow for a separate monitoring pipeline that can be very flexible +because core Kubernetes components do not need to rely on it. By default we will not provide +one, but we will provide an easy way to install one (using a single command, most likely using +Helm). We describe the monitoring pipeline in this section. + +Data collected by the monitoring pipeline may contain any sub- or superset of the following groups +of metrics: + +* core system metrics +* non-core system metrics +* service metrics from user application containers +* service metrics from Kubernetes infrastructure containers; these metrics are exposed using +Prometheus instrumentation + +It is up to the monitoring solution to decide which of these are collected. + +In order to enable horizontal pod autoscaling based on custom metrics, the provider of the +monitoring pipeline would also have to create a stateless API adapter that pulls the custom +metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. This +will be a well-defined, versioned API similar to regular APIs. Details of how it will be +exposed or discovered will be covered in a detailed design doc for this component. + +The same approach applies if it is desired to make monitoring pipeline metrics available in +Infrastore. These adapters could be standalone components, libraries, or part of the monitoring +solution itself. 
+ +There are many possible combinations of node and cluster-level agents that could comprise a +monitoring pipeline, including: + +* cAdvisor + Heapster + InfluxDB (or any other sink) +* cAdvisor + collectd + Heapster +* cAdvisor + Prometheus +* snapd + Heapster +* snapd + SNAP cluster-level agent +* Sysdig + +As an example we’ll describe a potential integration with cAdvisor + Prometheus. + +Prometheus has the following metric sources on a node: +* core and non-core system metrics from cAdvisor +* service metrics exposed by containers via HTTP handler in Prometheus format +* [optional] metrics about the node itself from Node Exporter (a Prometheus component) + +All of them are polled by the Prometheus cluster-level agent. We can use the Prometheus +cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a +standalone API adapter that proxies/translates between the Prometheus Query Language endpoint +on the Prometheus cluster-level agent and an HPA-specific API. Likewise, an adapter can be +used to make the metrics from the monitoring pipeline available in Infrastore. Neither +adapter is necessary if the user does not need the corresponding feature. + +The command that installs cAdvisor+Prometheus should also automatically set up collection +of the metrics from infrastructure containers. This is possible because the names of the +infrastructure containers and metrics of interest are part of the Kubernetes control plane +configuration itself, and because the infrastructure containers export their metrics in +Prometheus format. 
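Because both the container service metrics and the infrastructure-container metrics above are exposed in the Prometheus text format, a collector only needs to scrape an HTTP endpoint and parse lines of `name{labels} value`. A deliberately simplified hand-rolled parser for illustration (real agents use the official client libraries and handle the full exposition format):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics reads a (simplified) Prometheus text-format exposition and
// returns a map from metric line (name plus labels) to value.
// Comment lines (#) are skipped; labels are kept as part of the key.
func parseMetrics(body string) map[string]float64 {
	out := make(map[string]float64)
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is the last space-separated field on the line.
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		out[line[:i]] = v
	}
	return out
}

func main() {
	scrape := `# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3`
	m := parseMetrics(scrape)
	fmt.Println(m[`http_requests_total{code="500"}`]) // 3
}
```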
+ +## Appendix: Architecture diagram + +### Open-source monitoring pipeline + +![Architecture Diagram](monitoring_architecture.png?raw=true "Architecture overview") + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/monitoring_architecture.md?pixel)]() + diff --git a/monitoring_architecture.png b/monitoring_architecture.png new file mode 100644 index 00000000..570996b7 Binary files /dev/null and b/monitoring_architecture.png differ -- cgit v1.2.3 From 23caa2c9e06de2e790a49be6cb5d297d3d1a9293 Mon Sep 17 00:00:00 2001 From: Marcin Wielgus Date: Wed, 5 Oct 2016 13:24:59 +0200 Subject: Federated replica set design doc --- federated-replicasets.md | 542 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 542 insertions(+) create mode 100644 federated-replicasets.md diff --git a/federated-replicasets.md b/federated-replicasets.md new file mode 100644 index 00000000..16db0379 --- /dev/null +++ b/federated-replicasets.md @@ -0,0 +1,542 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Federated ReplicaSets + +# Requirements & Design Document + +This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion. + +Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com) +Based on discussions with +Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com) + +## Overview + +### Summary & Vision + +When running a global application on a federation of Kubernetes +clusters, the owner currently has to start it in multiple clusters and +verify that they have enough application replicas running both +locally in each of the clusters (so that, for example, users are +handled by a nearby cluster, with low latency) and globally (so that +there is always enough capacity to handle all traffic). If one of the +clusters has issues, or does not have enough capacity to run the given set of +replicas, the replicas should be automatically moved to some other +cluster to keep the application responsive. + +In single-cluster Kubernetes there is a concept of ReplicaSet that +manages the replicas locally. We want to expand this concept to the +federation level. + +### Goals + ++ Win large enterprise customers who want to easily run applications + across multiple clusters ++ Create a reference controller implementation to facilitate bringing + other Kubernetes concepts to Federated Kubernetes. + +## Glossary + +Federation Cluster - a cluster that is a member of federation. + +Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster +that is a member of federation. 
+ +Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of the Federated K8S server. + +Federated ReplicaSet Controller (FRSC) - A controller running inside +of the Federated K8S server that controls FRS. + +## User Experience + +### Critical User Journeys + ++ [CUJ1] User wants to create a ReplicaSet in each of the federation + clusters. They create a definition of a federated ReplicaSet on the + federated master and (local) ReplicaSets are automatically created + in each of the federation clusters. The number of replicas in each + of the Local ReplicaSets is (perhaps indirectly) configurable by + the user. ++ [CUJ2] When the current number of replicas in a cluster drops below + the desired number and new replicas cannot be scheduled then they + should be started in some other cluster. + +### Features Enabling Critical User Journeys + +Feature #1 -> CUJ1: +A component which looks for newly created Federated ReplicaSets and +creates the appropriate Local ReplicaSet definitions in the federated +clusters. + +Feature #2 -> CUJ2: +A component that checks how many replicas are actually running in each +of the subclusters and whether the number matches the +FederatedReplicaSet preferences (by default spread replicas evenly +across the clusters but custom preferences are allowed - see +below). If it doesn’t and the situation is unlikely to improve soon +then the replicas should be moved to other subclusters. + +### API and CLI + +All interaction with FederatedReplicaSet will be done by issuing +kubectl commands pointing at the Federated Master API Server. All the +commands would behave in a similar way as on the regular master; +however, in the next versions (1.5+) some of the commands may give +slightly different output. For example, kubectl describe on a federated +replica set should also give some information about the subclusters. + +Moreover, for safety, some defaults will be different. For example, for +kubectl delete federatedreplicaset, cascade will be set to false. 
+ +FederatedReplicaSet would have the same object as local ReplicaSet +(although it will be accessible in a different part of the +API). Scheduling preferences (how many replicas in which cluster) will +be passed as annotations. + +### FederatedReplicaSet preferences + +The preferences are expressed by the following structure, passed as +serialized JSON inside annotations. + +``` +type FederatedReplicaSetPreferences struct { + // If set to true then already scheduled and running replicas may be moved to other clusters + // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false, + // up and running replicas will not be moved. + Rebalance bool `json:"rebalance,omitempty"` + + // Map from cluster name to preferences for that cluster. It is assumed that if a cluster + // doesn’t have a matching entry then it should not have a local replica. A cluster matches + // "*" if there is no entry with the real cluster name. + Clusters map[string]ClusterReplicaSetPreferences +} + +// Preferences regarding the number of replicas assigned to a cluster replicaset within a federated replicaset. +type ClusterReplicaSetPreferences struct { + // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default. + MinReplicas int64 `json:"minReplicas,omitempty"` + + // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default). + MaxReplicas *int64 `json:"maxReplicas,omitempty"` + + // A number expressing the preference to put an additional replica to this LocalReplicaSet. 0 by default. + Weight int64 +} +``` + +How this works in practice: + +**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config: + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ Clusters A,B,C, all have capacity. + Replica layout: A=16 B=17 C=17. 
++ Clusters A,B,C and C has capacity for 6 replicas. + Replica layout: A=22 B=22 C=6 ++ Clusters A,B,C. B and C are offline: + Replica layout: A=50 + +**Scenario 2**. I want to have only 2 replicas in each of the clusters. + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1} + } +} +``` + +Or + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 } + } + } + +``` + +Or + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2} + } +} +``` + +There is a global target for 50, however if there are 3 clusters there will be only 6 replicas running. + +**Scenario 3**. I want to have 20 replicas in each of 3 clusters. + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0} + } +} +``` + +There is a global target for 50, however the clusters require 60, so some clusters will have fewer replicas. + Replica layout: A=20 B=20 C=10. + +**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C, however don’t put more than 20 replicas in cluster C. + +``` +FederatedReplicaSetPreferences { + Rebalance : true + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ MaxReplicas: 20, Weight: 1} + } +} +``` + +Example: + ++ All have capacity. + Replica layout: A=16 B=17 C=17. ++ B is offline/has no capacity + Replica layout: A=30 B=0 C=20 ++ A and B are offline: + Replica layout: C=20 + +**Scenario 5**. I want to run my application in cluster A; however, if there is trouble, FRS can also use clusters B and C, equally. 
+ +``` +FederatedReplicaSetPreferences { + Clusters : map[string]LocalReplicaSet { + “A” : LocalReplicaSet{ Weight: 1000000} + “B” : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ All have capacity. + Replica layout: A=50 B=0 C=0. ++ A has capacity for only 40 replicas + Replica layout: A=40 B=5 C=5 + +**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters. + +``` +FederatedReplicaSetPreferences { + Clusters : map[string]LocalReplicaSet { + “A” : LocalReplicaSet{ Weight: 2} + “B” : LocalReplicaSet{ Weight: 1} + “C” : LocalReplicaSet{ Weight: 1} + } +} +``` + +**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there +are already some replicas, please do not move them. Config: + +``` +FederatedReplicaSetPreferences { + Rebalance : false + Clusters : map[string]LocalReplicaSet { + "*" : LocalReplicaSet{ Weight: 1} + } +} +``` + +Example: + ++ Clusters A,B,C, all have capacity, but A already has 20 replicas + Replica layout: A=20 B=15 C=15. ++ Clusters A,B,C and C has capacity for 6 replicas, A already has 20 replicas. + Replica layout: A=22 B=22 C=6 ++ Clusters A,B,C and C has capacity for 6 replicas, A already has 30 replicas. + Replica layout: A=30 B=14 C=6 + +## The Idea + +A new federated controller - Federated Replica Set Controller (FRSC) - +will be created inside the federated controller manager. Below are +enumerated the key idea elements: + ++ [I0] It is considered OK to have a slightly higher number of replicas + globally for some time. + ++ [I1] FRSC starts an informer on the FederatedReplicaSet that listens + for FRS being created, updated or deleted. On each create/update the + scheduling code will be started to calculate where to put the + replicas. The default behavior is to start the same number of + replicas in each of the clusters. 
While creating LocalReplicaSets + (LRS) the following errors/issues can occur: + + + [E1] Master rejects LRS creation (for known or unknown + reason). In this case another attempt to create the LRS should be + made in 1m or so. This action can be tied with + [[I5]](#heading=h.ififs95k9rng). Until the LRS is created + the situation is the same as [E5]. If this happens multiple + times all due replicas should be moved elsewhere and later moved + back once the LRS is created. + + + [E2] LRS with the same name but different configuration already + exists. The LRS is then overwritten and an appropriate event + created to explain what happened. Pods under the control of the + old LRS are left intact and the new LRS may adopt them if they + match the selector. + + + [E3] LRS is new but pods that match the selector exist. The + pods are adopted by the RS (if not owned by some other + RS). However they may have a different image, configuration + etc. Just like with a regular LRS. + ++ [I2] For each of the clusters FRSC starts a store and an informer on + LRS that will listen for status updates. These status changes are + only interesting in case of trouble. Otherwise it is assumed that + LRS runs trouble-free and there is always the right number of pods + created but possibly not scheduled. + + + + [E4] LRS is manually deleted from the local cluster. In this case + a new LRS should be created. It is the same case as + [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind + won’t be killed and will be adopted after the LRS is recreated. + + + [E5] LRS fails to create (though not necessarily schedule) the desired + number of pods due to master troubles, admission control + etc. This should be considered the same situation as replicas + being unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)). 
+ + [E6] It is impossible to tell that an informer lost connection + with a remote cluster or has some other synchronization problem, so this + should be handled by the cluster liveness probe and deletion + [[I6]](#heading=h.z90979gc2216). + ++ [I3] For each of the clusters start a store and an informer to monitor + whether the created pods are eventually scheduled and what the + current number of correctly running, ready pods is. Errors: + + + [E7] It is impossible to tell that an informer lost connection + with a remote cluster or has some other synchronization problem, so this + should be handled by the cluster liveness probe and deletion + [[I6]](#heading=h.z90979gc2216) + ++ [I4] It is assumed that an unscheduled pod is a normal situation +and can last up to X min if there is heavy traffic on the +cluster. However, if the replicas are not scheduled in that time then +FRSC should consider moving most of the unscheduled replicas +elsewhere. For that purpose FRSC will maintain a data structure +where for each FRS-controlled LRS we store a list of pods belonging +to that LRS along with their current status and status change timestamp. + ++ [I5] If a new cluster is added to the federation then it doesn’t + have an LRS and the situation is equal to + [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef). + ++ [I6] If a cluster is removed from the federation then the situation + is equal to multiple [E4]. It is assumed that if a connection with + a cluster is lost completely then the cluster is removed from + the cluster list (or marked accordingly) so + [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda) + don’t need to be handled. + ++ [I7] All ToBeChecked FRS are browsed every 1 min (configurable), + checked against the current list of clusters, and all missing LRS + are created. This will be executed in combination with [I8]. 
+ ++ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min + (configurable) to check whether some replica move between clusters + is needed or not. + ++ FRSC never moves replicas to an LRS that has unscheduled/not-running +pods or that has pods that failed to be created. + + + When FRSC notices that a number of pods are not scheduled/running + or not_even_created in one LRS for more than Y minutes it takes + most of them from the LRS, leaving a couple still waiting so that once + they are scheduled FRSC will know that it is OK to put some more + replicas in that cluster. + ++ [I9] FRS becomes ToBeChecked if: + + It is newly created + + Some replica set inside changed its status + + Some pods inside a cluster changed their status + + Some cluster is added or deleted. +> FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough). + +## (RE)Scheduling algorithm + +To calculate the (re)scheduling moves for a given FRS: + +1. For each cluster FRSC calculates the number of replicas that are placed +(not necessarily up and running) in the cluster and the number of replicas that +failed to be scheduled. Cluster capacity is the difference between +the placed and the failed-to-be-scheduled replicas. + +2. Order all clusters by their weight and hash of the name so that every time +we process the same replica-set we process the clusters in the same order. +Include the federated replica set name in the cluster name hash so that we get +slightly different ordering for different RS, so that not all RS of size 1 +end up on the same cluster. + +3. Assign the minimum preferred number of replicas to each of the clusters, if +there are enough replicas and capacity. + +4. If rebalance = false, assign the previously present replicas to the clusters, +remembering the number of extra replicas added (ER) - of course, only if there +are enough replicas and capacity. + +5. Distribute the remaining replicas with regard to weights and cluster capacity. 
+In multiple iterations, calculate how many of the replicas should end up in each cluster.
+For each cluster, cap the number of assigned replicas by the maximum number of replicas and
+the cluster capacity. If there were some extra replicas added to the cluster in step
+4, don't actually add the replicas but balance them against the ER from step 4.
+
+## Goroutines layout
+
++ [GR1] Involved in the FRS informer (see
+  [[I1]]). Whenever an FRS is created or
+  updated, it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with
+  delay 0.
+
++ [GR2_1...GR2_N] Involved in the informers/stores on LRS (see
+  [[I2]]). On all changes the FRS is put on
+  FRS_TO_CHECK_QUEUE with delay 1 min.
+
++ [GR3_1...GR3_N] Involved in the informers/stores on Pods
+  (see [[I3]] and [[I4]]). They maintain the status store
+  so that for each of the LRS we know the number of pods that are
+  actually running and ready in O(1) time. They also put the
+  corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1 min.
+
++ [GR4] Involved in the cluster informer (see
+  [[I5]] and [[I6]]). It puts all FRS on FRS_TO_CHECK_QUEUE
+  with delay 0.
+
++ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on
+  FRS_CHANNEL after the given delay (and remove them from
+  FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
+  FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
+  shorter delay is used.
+
++ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever
+  an FRS is received it is put on a work queue. The work queue has no
+  delay and makes sure that a single replica set is processed by
+  only one goroutine.
+
++ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on the
+  FRS. Multiple replica sets can be processed in parallel, but two goroutines
+  cannot process the same FRS at the same time.
+
+
+## Func DoFrsCheck
+
+The function does [[I7]] and [[I8]].
It is assumed that it is run on a
+single thread/goroutine, so we never check and evaluate the same FRS on
+multiple goroutines at once (however, if needed, the function can be
+parallelized across different FRS). It takes data only from the stores
+maintained by GR2_* and GR3_*. External communication is only required to:
+
++ Create an LRS. If an LRS doesn’t exist it is created after the
+  rescheduling, when we know how many replicas it should have.
+
++ Update LRS replica targets.
+
+If an FRS is not in the desired state then it is put on
+FRS_TO_CHECK_QUEUE with delay 1 min (possibly increasing).
+
+## Monitoring and status reporting
+
+FRSC should expose a number of metrics from the run, like:
+
++ FRSC -> LRS communication latency
++ Total time spent in various elements of DoFrsCheck
+
+FRSC should also expose the status of an FRS as an annotation on the FRS
+and as events.
+
+## Workflow
+
+Here is the sequence of tasks that need to be done in order for a
+typical FRS to be split into a number of LRS’s and to be created in
+the underlying federated clusters.
+
+Note a: the reason the workflow is helpful at this phase is that
+for every one or two steps we can create PRs accordingly to start
+the development.
+
+Note b: we assume that the federation is already in place and the
+federated clusters have been added to the federation.
+
+Step 1. The client sends an RS create request to the
+federation-apiserver.
+
+Step 2. The federation-apiserver persists an FRS into the federation etcd.
+
+Note c: the federation-apiserver populates the clusterid field in the FRS
+before persisting it into the federation etcd.
+
+Step 3. The federation-level “informer” in FRSC watches the federation
+etcd for new/modified FRS’s, with an empty clusterid or a clusterid equal
+to the federation ID, and if one is detected, it calls the scheduling code.
+
+Step 4.
+
+Note d: the scheduler populates the clusterid field in the LRS with the
+IDs of the target clusters.
+
+Note e: at this point let us assume that it only does an even
+distribution, i.e., equal weights for all of the underlying clusters.
+
+Step 5. As soon as the scheduler function returns control to FRSC,
+FRSC starts a number of cluster-level “informer”s, one for every
+target cluster, to watch changes in every target cluster's etcd
+regarding the posted LRS’s, and if any deviation from the scheduled
+number of replicas is detected the scheduling code is re-called for
+re-scheduling purposes.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]()
+
--
cgit v1.2.3


From 5e8efc72f1476a65e12ea7d9b0ced02374b8f34f Mon Sep 17 00:00:00 2001
From: Tim Hockin
Date: Tue, 25 Oct 2016 22:24:50 +0200
Subject: Remove 'this is HEAD' warning on docs

---
 README.md                                 | 34 -------------------------------
 access.md                                 | 34 -------------------------------
 admission_control.md                      | 34 -------------------------------
 admission_control_limit_range.md          | 34 -------------------------------
 admission_control_resource_quota.md       | 34 -------------------------------
 architecture.md                           | 34 -------------------------------
 aws_under_the_hood.md                     | 34 -------------------------------
 clustering.md                             | 34 -------------------------------
 clustering/README.md                      | 33 ------------------------------
 command_execution_port_forwarding.md      | 34 -------------------------------
 configmap.md                              | 34 -------------------------------
 control-plane-resilience.md               | 34 -------------------------------
 daemon.md                                 | 34 -------------------------------
 downward_api_resources_limits_requests.md | 34 -------------------------------
 enhance-pluggable-policy.md               | 34 -------------------------------
 event_compression.md                      | 34 -------------------------------
 expansion.md                              | 34 -------------------------------
 extending-api.md                          | 34 -------------------------------
federated-replicasets.md | 29 -------------------------- federated-services.md | 34 ------------------------------- federation-phase-1.md | 34 ------------------------------- horizontal-pod-autoscaler.md | 34 ------------------------------- identifiers.md | 34 ------------------------------- indexed-job.md | 34 ------------------------------- metadata-policy.md | 34 ------------------------------- monitoring_architecture.md | 29 -------------------------- namespaces.md | 34 ------------------------------- networking.md | 34 ------------------------------- nodeaffinity.md | 34 ------------------------------- persistent-storage.md | 34 ------------------------------- podaffinity.md | 34 ------------------------------- principles.md | 34 ------------------------------- resource-qos.md | 34 ------------------------------- resources.md | 33 ------------------------------ scheduler_extender.md | 34 ------------------------------- seccomp.md | 34 ------------------------------- secrets.md | 34 ------------------------------- security.md | 34 ------------------------------- security_context.md | 34 ------------------------------- selector-generation.md | 34 ------------------------------- selinux.md | 29 -------------------------- service_accounts.md | 34 ------------------------------- simple-rolling-update.md | 34 ------------------------------- taint-toleration-dedicated.md | 34 ------------------------------- versioning.md | 34 ------------------------------- volume-snapshotting.md | 34 ------------------------------- 46 files changed, 1547 deletions(-) diff --git a/README.md b/README.md index 1a812e2e..85fc8245 100644 --- a/README.md +++ b/README.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/README.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes Design Overview Kubernetes is a system for managing containerized applications across multiple diff --git a/access.md b/access.md index 6021ac37..b23e463b 100644 --- a/access.md +++ b/access.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/access.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # K8s Identity and Access Management Sketch This document suggests a direction for identity and access management in the diff --git a/admission_control.md b/admission_control.md index e9f52528..a7330104 100644 --- a/admission_control.md +++ b/admission_control.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/admission_control.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes Proposal - Admission Control **Related PR:** diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md index aa0134b4..06cce2cb 100644 --- a/admission_control_limit_range.md +++ b/admission_control_limit_range.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/admission_control_limit_range.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Admission control plugin: LimitRanger ## Background diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md index 4727dc0c..575db9a8 100644 --- a/admission_control_resource_quota.md +++ b/admission_control_resource_quota.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/admission_control_resource_quota.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Admission control plugin: ResourceQuota ## Background diff --git a/architecture.md b/architecture.md index 5e489dfa..95e3aef4 100644 --- a/architecture.md +++ b/architecture.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/architecture.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes architecture A running Kubernetes cluster contains node agents (`kubelet`) and master diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index ffbe8b39..8f2d9377 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/aws_under_the_hood.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Peeking under the hood of Kubernetes on AWS This document provides high-level insight into how Kubernetes works on AWS and diff --git a/clustering.md b/clustering.md index c5f67a20..ca42035b 100644 --- a/clustering.md +++ b/clustering.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/clustering.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Clustering in Kubernetes diff --git a/clustering/README.md b/clustering/README.md index d662b952..d7e2e2e0 100644 --- a/clustering/README.md +++ b/clustering/README.md @@ -1,36 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/clustering/README.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - This directory contains diagrams for the clustering design doc. This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md index 2af98cbe..a7175403 100644 --- a/command_execution_port_forwarding.md +++ b/command_execution_port_forwarding.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/command_execution_port_forwarding.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Container Command Execution & Port Forwarding in Kubernetes ## Abstract diff --git a/configmap.md b/configmap.md index 9b7fa0a2..658ac73b 100644 --- a/configmap.md +++ b/configmap.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/configmap.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Generic Configuration Object ## Abstract diff --git a/control-plane-resilience.md b/control-plane-resilience.md index 9e65a1e3..8193fd97 100644 --- a/control-plane-resilience.md +++ b/control-plane-resilience.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/control-plane-resilience.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes and Cluster Federation Control Plane Resilience ## Long Term Design and Current Status diff --git a/daemon.md b/daemon.md index 5185f2e4..2c306056 100644 --- a/daemon.md +++ b/daemon.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/daemon.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # DaemonSet in Kubernetes **Author**: Ananya Kumar (@AnanyaKumar) diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md index 89907d22..ab17c321 100644 --- a/downward_api_resources_limits_requests.md +++ b/downward_api_resources_limits_requests.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/downward_api_resources_limits_requests.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Downward API for resource limits and requests ## Background diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md index 529aa588..2468d3c1 100644 --- a/enhance-pluggable-policy.md +++ b/enhance-pluggable-policy.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/enhance-pluggable-policy.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Enhance Pluggable Policy While trying to develop an authorization plugin for Kubernetes, we found a few diff --git a/event_compression.md b/event_compression.md index bbac945a..7a1cbb33 100644 --- a/event_compression.md +++ b/event_compression.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/event_compression.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes Event Compression This document captures the design of event compression. diff --git a/expansion.md b/expansion.md index 277e7211..ace1faf0 100644 --- a/expansion.md +++ b/expansion.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/expansion.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Variable expansion in pod command, args, and env ## Abstract diff --git a/extending-api.md b/extending-api.md index 6ce3159f..45a07ca5 100644 --- a/extending-api.md +++ b/extending-api.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/extending-api.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Adding custom resources to the Kubernetes API server This document describes the design for implementing the storage of custom API diff --git a/federated-replicasets.md b/federated-replicasets.md index 16db0379..f1744ade 100644 --- a/federated-replicasets.md +++ b/federated-replicasets.md @@ -1,32 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Federated ReplicaSets # Requirements & Design Document diff --git a/federated-services.md b/federated-services.md index fe050da3..b9d51c43 100644 --- a/federated-services.md +++ b/federated-services.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/federated-services.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes Cluster Federation (previously nicknamed "Ubernetes") ## Cross-cluster Load Balancing and Service Discovery diff --git a/federation-phase-1.md b/federation-phase-1.md index a1798f6e..0a3a8f50 100644 --- a/federation-phase-1.md +++ b/federation-phase-1.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/federation-phase-1.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Ubernetes Design Spec (phase one) **Huawei PaaS Team** diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md index f76e3ee4..1ac9c24b 100644 --- a/horizontal-pod-autoscaler.md +++ b/horizontal-pod-autoscaler.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/horizontal-pod-autoscaler.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - -

Warning! This document might be outdated.

# Horizontal Pod Autoscaling diff --git a/identifiers.md b/identifiers.md index 004b6bac..a37411f9 100644 --- a/identifiers.md +++ b/identifiers.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/identifiers.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Identifiers and Names in Kubernetes A summarization of the goals and recommendations for identifiers in Kubernetes. diff --git a/indexed-job.md b/indexed-job.md index 28655391..13bf154e 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/indexed-job.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Design: Indexed Feature of Job object diff --git a/metadata-policy.md b/metadata-policy.md index 4ffb0ba4..57416f11 100644 --- a/metadata-policy.md +++ b/metadata-policy.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/metadata-policy.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system ## Introduction diff --git a/monitoring_architecture.md b/monitoring_architecture.md index b1fc51b9..b819eeca 100644 --- a/monitoring_architecture.md +++ b/monitoring_architecture.md @@ -1,32 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes monitoring architecture ## Executive Summary diff --git a/namespaces.md b/namespaces.md index 8aa44fe9..8a9c97c8 100644 --- a/namespaces.md +++ b/namespaces.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/namespaces.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Namespaces ## Abstract diff --git a/networking.md b/networking.md index f1d973c5..6e269481 100644 --- a/networking.md +++ b/networking.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/networking.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Networking There are 4 distinct networking problems to solve: diff --git a/nodeaffinity.md b/nodeaffinity.md index 18a079f2..61e04169 100644 --- a/nodeaffinity.md +++ b/nodeaffinity.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/nodeaffinity.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Node affinity and NodeSelector ## Introduction diff --git a/persistent-storage.md b/persistent-storage.md index 19706b1a..4e0f82dc 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/persistent-storage.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Persistent Storage This document proposes a model for managing persistent, cluster-scoped storage diff --git a/podaffinity.md b/podaffinity.md index 33eaf60d..4ff26fa0 100644 --- a/podaffinity.md +++ b/podaffinity.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/podaffinity.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Inter-pod topological affinity and anti-affinity ## Introduction diff --git a/principles.md b/principles.md index 762cae01..4e0b663c 100644 --- a/principles.md +++ b/principles.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/principles.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Design Principles Principles to follow when extending Kubernetes. diff --git a/resource-qos.md b/resource-qos.md index 6a8e8ab2..b6feaae5 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/resource-qos.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Resource Quality of Service in Kubernetes **Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar) diff --git a/resources.md b/resources.md index 862e8d84..bb66885b 100644 --- a/resources.md +++ b/resources.md @@ -1,36 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/resources.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - **Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is [here](../user-guide/compute-resources.md). The tracking issue for diff --git a/scheduler_extender.md b/scheduler_extender.md index 577f5100..1f362242 100644 --- a/scheduler_extender.md +++ b/scheduler_extender.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -

PLEASE NOTE: This document applies to the HEAD of the source tree

- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/scheduler_extender.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Scheduler extender There are three ways to add new scheduling rules (predicates and priority diff --git a/seccomp.md b/seccomp.md index 69d121cb..de00cbc0 100644 --- a/seccomp.md +++ b/seccomp.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/seccomp.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - ## Abstract A proposal for adding **alpha** support for diff --git a/secrets.md b/secrets.md index bb15c3d5..ca02c977 100644 --- a/secrets.md +++ b/secrets.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/secrets.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - ## Abstract A proposal for the distribution of [secrets](../user-guide/secrets.md) diff --git a/security.md b/security.md index 0c2b2ac9..b1aeacbd 100644 --- a/security.md +++ b/security.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/security.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Security in Kubernetes Kubernetes should define a reasonable set of security best practices that allows diff --git a/security_context.md b/security_context.md index db0d4390..76bc8ee8 100644 --- a/security_context.md +++ b/security_context.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/security_context.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Security Contexts ## Abstract diff --git a/selector-generation.md b/selector-generation.md index e54897a5..efb32cf2 100644 --- a/selector-generation.md +++ b/selector-generation.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/selector-generation.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - Design ============= diff --git a/selinux.md b/selinux.md index 0b67ea4a..ece83d44 100644 --- a/selinux.md +++ b/selinux.md @@ -1,32 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - ## Abstract A proposal for enabling containers in a pod to share volumes using a pod level SELinux context. diff --git a/service_accounts.md b/service_accounts.md index 795f5212..89a3771b 100644 --- a/service_accounts.md +++ b/service_accounts.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/service_accounts.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Service Accounts ## Motivation diff --git a/simple-rolling-update.md b/simple-rolling-update.md index 32a1cf35..c4a5f671 100644 --- a/simple-rolling-update.md +++ b/simple-rolling-update.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/simple-rolling-update.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - ## Simple rolling update This is a lightweight design document for simple diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md index 1a882c09..c523319f 100644 --- a/taint-toleration-dedicated.md +++ b/taint-toleration-dedicated.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/taint-toleration-dedicated.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Taints, Tolerations, and Dedicated Nodes ## Introduction diff --git a/versioning.md b/versioning.md index bf3183dd..ae724b12 100644 --- a/versioning.md +++ b/versioning.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/versioning.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Kubernetes API and Release Versioning Reference: [Semantic Versioning](http://semver.org) diff --git a/volume-snapshotting.md b/volume-snapshotting.md index ad318d59..e92ed3d1 100644 --- a/volume-snapshotting.md +++ b/volume-snapshotting.md @@ -1,37 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - - - -The latest release of this document can be found -[here](http://releases.k8s.io/release-1.4/docs/design/volume-snapshotting.md). - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - Kubernetes Snapshotting Proposal ================================ -- cgit v1.2.3 From 8a951390d6d0e105ccc4546a506d670b315f622a Mon Sep 17 00:00:00 2001 From: Filip Grzadkowski Date: Wed, 27 Jul 2016 03:29:11 +0200 Subject: Design for automated HA master deployment --- ha_master.md | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 265 insertions(+) create mode 100644 ha_master.md diff --git a/ha_master.md b/ha_master.md new file mode 100644 index 00000000..6f2d91d7 --- /dev/null +++ b/ha_master.md @@ -0,0 +1,265 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +


+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Automated HA master deployment + +**Author:** filipg@, jsz@ + +# Introduction + +We want to allow users to easily replicate kubernetes masters to have highly available cluster, +initially using `kube-up.sh` and `kube-down.sh`. + +This document describes technical design of this feature. It assumes that we are using aforementioned +scripts for cluster deployment. All of the ideas described in the following sections should be easy +to implement on GCE, AWS and other cloud providers. + +It is a non-goal to design a specific setup for bare-metal environment, which +might be very different. + +# Overview + +In a cluster with replicated master, we will have N VMs, each running regular master components +such as apiserver, etcd, scheduler or controller manager. These components will interact in the +following way: +* All etcd replicas will be clustered together and will be using master election + and quorum mechanism to agree on the state. All of these mechanisms are integral + parts of etcd and we will only have to configure them properly. +* All apiserver replicas will be working independently talking to an etcd on + 127.0.0.1 (i.e. local etcd replica), which if needed will forward requests to the current etcd master + (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)). +* We will introduce provider specific solutions to load balance traffic between master replicas + (see section `load balancing`) +* Controller manager, scheduler & cluster autoscaler will use lease mechanism and + only a single instance will be an active master. All other will be waiting in a standby mode. 
+* All add-on managers will work independently and each of them will try to keep add-ons in sync + +# Detailed design + +## Components + +### etcd + +``` +Note: This design for etcd clustering is quite pet-set like - each etcd +replica has its name which is explicitly used in etcd configuration etc. In +medium-term future we would like to have the ability to run masters as part of +autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas +automatically. This is pretty tricky and this design does not cover this. +It will be covered in a separate doc. +``` + +All etcd instances will be clustered together and one of them will be an elected master. +In order to commit any change quorum of the cluster will have to confirm it. Etcd will be +configured in such a way that all writes and reads will go through the master (requests +will be forwarded by the local etcd server such that it’s invisible for the user). It will +affect latency for all operations, but it should not increase by much more than the network +latency between master replicas (latency between GCE zones with a region is < 10ms). + +Currently etcd exposes port only using localhost interface. In order to allow clustering +and inter-VM communication we will also have to use public interface. To secure the +communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)). + +When generating command line for etcd we will always assume it’s part of a cluster +(initially of size 1) and list all existing kubernetes master replicas. +Based on that, we will set the following flags: +* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one) +* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one): + * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty + * `existing` if there are more than one replica, i.e. 
the list of existing master replicas is non-empty. + +This will allow us to have exactly the same logic for HA and non-HA master. List of DNS names for VMs +with master replicas will be generated in `kube-up.sh` script and passed to as a env variable +`INITIAL_ETCD_CLUSTER`. + +### apiservers + +All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact +etcd replica running on the same VM. If needed, such requests will be forwarded by etcd server to the +etcd leader. This functionality is completely hidden from the client (apiserver +in our case). + +Caching mechanism, which is implemented in apiserver, will not be affected by +replicating master because: +* GET requests go directly to etcd +* LIST requests go either directly to etcd or to cache populated via watch + (depending on the ResourceVersion in ListOptions). In the second scenario, + after a PUT/POST request, changes might not be visible in LIST response. + This is however not worse than it is with the current single master. +* WATCH does not give any guarantees when change will be delivered. + +#### load balancing + +With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud +providers have different capabilities and limitations, we will not try to find a common lowest +denominator that will work everywhere. Instead we will document various options and apply different +solution for different deployments. Below we list possible approaches: + +1. `Managed DNS` - user need to specify a domain name during cluster creation. DNS entries will be managed +automaticaly by the deployment tool that will be intergrated with solutions like Route53 (AWS) +or Google Cloud DNS (GCP). For load balancing we will have two options: + 1.1. create an L4 load balancer in front of all apiservers and update DNS name appropriately + 1.2. use round-robin DNS technique to access all apiservers directly +2. 
`Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries +will be manually managed by the user. We will provide detailed documentation for the entries we +expect. +3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static +external IP address that is later assigned to the master VM. When creating additional replicas we +will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer +instead of a single master. When removing the second-to-last replica we will reverse this operation (assign +IP address to the remaining master VM and delete load balancer). That way the user will not have to provide +a domain name and all client configurations will keep working. + +This will also impact `kubelet <-> master` communication as it should use load +balancing for it. Depending on the chosen method we will use it to properly configure +kubelet. + +#### `kubernetes` service + +Kubernetes maintains a special service called `kubernetes`. Currently it keeps a +list of IP addresses for all apiservers. As it uses a command line flag +`--apiserver-count` it is not very dynamic and would require restarting all +masters to change number of master replicas. + +To allow dynamic changes to the number of apiservers in the cluster, we will +introduce a `ConfigMap` in `kube-system` namespace, that will keep an expiration +time for each apiserver (keyed by IP). Each apiserver will do three things: + +1. periodically update expiration time for its own IP address +2. remove all the stale IP addresses from the endpoints list +3. add its own IP address if it's not on the list yet. + +That way we will not only solve the problem of dynamically changing number +of apiservers in the cluster, but also the problem of non-responsive apiservers +that should be removed from the `kubernetes` service endpoints list. + +#### Certificates + +Certificate generation will work as today.
In particular, on GCE, we will +generate it for the public IP used to access the cluster (see `load balancing` +section) and local IP of the master replica VM. + +That means that with multiple master replicas and a load balancer in front +of them, accessing one of the replicas directly (using it's ephemeral public +IP) will not work on GCE without appropriate flags: + +- `kubectl --insecure-skip-tls-verify=true` +- `curl --insecure` +- `wget --no-check-certificate` + +For other deployment tools and providers the details of certificate generation +may be different, but it must be possible to access the cluster by using either +the main cluster endpoint (DNS name or IP address) or internal service called +`kubernetes` that points directly to the apiservers. + +### controller manager, scheduler & cluster autoscaler + +Controller manager and scheduler will by default use a lease mechanism to choose an active instance +among all masters. Only one instance will be performing any operations. +All other will be waiting in standby mode. + +We will use the same configuration in non-replicated mode to simplify deployment scripts. + +### add-on manager + +All add-on managers will be working independently. Each of them will observe current state of +add-ons and will try to sync it with files on disk. As a result, due to races, a single add-on +can be updated multiple times in a row after upgrading the master. Long-term we should fix this +by using a similar mechanisms as controller manager or scheduler. However, currently add-on +manager is just a bash script and adding a master election mechanism would not be easy. + +## Adding replica + +Command to add new replica on GCE using kube-up script: + +``` +KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh +``` + +A pseudo-code for adding a new master replica using managed DNS and a loadbalancer is the following: + +``` +1. If there is no load balancer for this cluster: + 1. 
Create load balancer using ephemeral IP address + 2. Add existing apiserver to the load balancer + 3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!) + 4. Update DNS to point to the load balancer. +2. Clone existing master (create a new VM with the same configuration) including + all env variables (certificates, IP ranges etc), with the exception of + `INITIAL_ETCD_CLUSTER`. +3. SSH to an existing master and run the following command to extend etcd cluster + with the new instance: + `curl :4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://:2380"]}'` +4. Add IP address of the new apiserver to the load balancer. +``` + +A simplified algorithm for adding a new master replica and promoting master IP to the load balancer +is identical to the one when using DNS, with a different step to setup load balancer: + +``` +1. If there is no load balancer for this cluster: + 1. Unassign IP from the existing master replica + 2. Create load balancer using static IP reclaimed in the previous step + 3. Add existing apiserver to the load balancer + 4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!) +... +``` + +## Deleting replica + +Command to delete one replica on GCE using kube-up script: + +``` +KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh +``` + +A pseudo-code for deleting an existing replica for the master is the following: + +``` +1. Remove replica IP address from the load balancer or DNS configuration +2. SSH to one of the remaining masters and run the following command to remove replica from the cluster: + `curl etcd-0:4001/v2/members/ -XDELETE -L` +3. Delete replica VM +4. If load balancer has only a single target instance, then delete load balancer +5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM. 
+``` + +## Upgrades + +Upgrading replicated master will be possible by upgrading them one by one using existing tools +(e.g. upgrade.sh for GCE). This will work out of the box because: +* Requests from nodes will be correctly served by either new or old master because apiserver is backward compatible. +* Requests from scheduler (and controllers) go to a local apiserver via localhost interface, so both components +will be in the same version. +* Apiserver talks only to a local etcd replica which will be in a compatible version +* We assume we will introduce this setup after we upgrade to etcd v3 so we don't need to cover upgrading database. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/ha_master.md?pixel)]() + -- cgit v1.2.3 From 966ab0cdcedae8053348fefd9bc97da2f583c9ea Mon Sep 17 00:00:00 2001 From: Sergey Maslyakov Date: Mon, 31 Oct 2016 10:59:36 -0500 Subject: Editorial: An orphaned "which" deleted. --- aws_under_the_hood.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 8f2d9377..2c161df8 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -31,7 +31,7 @@ you manually created or configured your cluster. Kubernetes is a cluster of several machines that consists of a Kubernetes master and a set number of nodes (previously known as 'nodes') for which the -master which is responsible. See the [Architecture](architecture.md) topic for +master is responsible. See the [Architecture](architecture.md) topic for more details. 
By default on AWS: -- cgit v1.2.3 From 450248dc0b1305d0fb06f12a19f09c39a28f002e Mon Sep 17 00:00:00 2001 From: guangxuli Date: Thu, 3 Nov 2016 09:40:34 +0800 Subject: add latest docker config secret type --- secrets.md | 1 + 1 file changed, 1 insertion(+) diff --git a/secrets.md b/secrets.md index ca02c977..29d18411 100644 --- a/secrets.md +++ b/secrets.md @@ -324,6 +324,7 @@ const ( SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default) SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth + SecretTypeDockerConfigJson SecretType = "kubernetes.io/dockerconfigjson" // Latest Docker registry auth // FUTURE: other type values ) -- cgit v1.2.3 From a3c7bf92a2008f246fa9ef82cda15b1b6c097082 Mon Sep 17 00:00:00 2001 From: Jimmy Cuadra Date: Thu, 27 Oct 2016 23:16:31 -1000 Subject: Rename PetSet to StatefulSet in docs and examples. --- indexed-job.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/indexed-job.md b/indexed-job.md index 13bf154e..5a089c22 100644 --- a/indexed-job.md +++ b/indexed-job.md @@ -481,7 +481,7 @@ The multiple substitution approach: for very large jobs, the work-queue style or another type of controller, such as map-reduce or spark, may be a better fit.) - Drawback: is a form of server-side templating, which we want in Kubernetes but -have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)). +have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)). The index-only approach: @@ -874,24 +874,24 @@ admission time; it will need to understand indexes. previous container failures. - modify the job template, affecting all indexes. 
-#### Comparison to PetSets +#### Comparison to StatefulSets (previously named PetSets) -The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b. -The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more +The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b. +The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more restrictive and thus less verbose. -It would be easier for users if Indexed Job and PetSet are similar where -possible. However, PetSet differs in several key respects: +It would be easier for users if Indexed Job and StatefulSet are similar where +possible. However, StatefulSet differs in several key respects: -- PetSet is for ones to tens of instances. Indexed job should work with tens of +- StatefulSet is for ones to tens of instances. Indexed job should work with tens of thousands of instances. -- When you have few instances, you may want to given them pet names. When you -have many instances, you that many instances, integer indexes make more sense. +- When you have few instances, you may want to give them names. When you have many instances, +integer indexes make more sense. - When you have thousands of instances, storing the work-list in the JobSpec -is verbose. For PetSet, this is less of a problem. -- PetSets (apparently) need to differ in more fields than indexed Jobs. +is verbose. For StatefulSet, this is less of a problem. +- StatefulSets (apparently) need to differ in more fields than indexed Jobs. -This differs from PetSet in that PetSet uses names and not indexes. PetSet is +This differs from StatefulSet in that StatefulSet uses names and not indexes. StatefulSet is intended to support ones to tens of things. 
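To make the *Index substitution-only* option discussed in the hunk above concrete, here is a minimal sketch of substituting a completion index into a pod's command line. The `$INDEX` placeholder name and the helper function are hypothetical illustrations, not part of the proposal's API:

```python
def substitute_index(args, index):
    """Replace the hypothetical $INDEX placeholder in each argument with the
    pod's completion index (sketch of the index-only substitution idea)."""
    return [arg.replace("$INDEX", str(index)) for arg in args]

# A job with completions=3 could run three pods whose commands differ only
# by index, e.g. each processing its own input shard:
template = ["process.sh", "--input=shard-$INDEX.csv"]
for i in range(3):
    print(substitute_index(template, i))
# first iteration prints: ['process.sh', '--input=shard-0.csv']
```

Because only the integer index varies, the work list itself stays out of the JobSpec, which is what keeps this approach viable for jobs with thousands of completions.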
-- cgit v1.2.3 From 53bd6400c2484cfcace52d12e7f2370a729f7bc7 Mon Sep 17 00:00:00 2001 From: Jeff Vance Date: Thu, 15 Sep 2016 12:54:48 -0700 Subject: added details on pv-pvc matching --- persistent-storage.md | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/persistent-storage.md b/persistent-storage.md index 4e0f82dc..70bcde97 100644 --- a/persistent-storage.md +++ b/persistent-storage.md @@ -195,6 +195,51 @@ NAME LABELS STATUS VOLUME myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e ``` +A claim must request access modes and storage capacity. This is because internally PVs are +indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity. +A claim may request one of more of the following attributes to better match a PV: volume name, selectors, +and volume class (currently implemented as an annotation). + +A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which +PVC it will match. +A PV may also define labels, annotations, and a volume class (currently implemented as an +annotation) to better target PVCs. + +As of Kubernetes version 1.4, the following algorithm describes in more details how a claim is +matched to a PV: + +1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered. +"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines +the mode requested by the claim. + +1. The potential PVs above are considered in order of the closest access mode match, with the best case +being an exact match, and a worse case being more modes than requested by the claim. + +1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity +is not less than the storage being requested by the claim then this PV will bind to the claim. Done. + +1. 
Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is +skipped and will be handled by Dynamic Provisioning. + +1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a +placeholder, then the PV is skipped. + +1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the +PV is skipped. But, even if a claim has selectors which match a PV that does not guarantee a match +since capacities may differ. + +1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder +for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped. +If the annotations for the PV and PVC are empty they are treated as being equal. + +1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs, +the PV with the smallest capacity that is also equal to or greater than the claim's requested storage +is the matching PV and will be bound to the claim. Done. In the case of two or more PVs matching all +of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner. + +*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default +`StorageClass` has been defined) then a volume will be dynamically provisioned. + +#### Claim usage + +The claim holder can use their claim as a volume.
The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim -- cgit v1.2.3 From 09d8fabb877b2bbaddac6ec4317c418b5ea035f7 Mon Sep 17 00:00:00 2001 From: xiangpengzhao Date: Thu, 17 Nov 2016 00:08:13 -0500 Subject: Fix container to pod in resource-qos.md --- resource-qos.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resource-qos.md b/resource-qos.md index b6feaae5..cfbe4faf 100644 --- a/resource-qos.md +++ b/resource-qos.md @@ -51,7 +51,7 @@ The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theo Pods can be of one of 3 different classes: -- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the container is classified as **Guaranteed**. +- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**. Examples: -- cgit v1.2.3 From 75f0592c3a46fdc26a1e89258f79b38ff7eb6b0a Mon Sep 17 00:00:00 2001 From: Tim Hockin Date: Fri, 18 Nov 2016 13:28:46 -0800 Subject: Remove a few versioned-warnings that snuck in, again --- ha_master.md | 29 ----------------------------- 1 file changed, 29 deletions(-) diff --git a/ha_master.md b/ha_master.md index 6f2d91d7..d4cf26a9 100644 --- a/ha_master.md +++ b/ha_master.md @@ -1,32 +1,3 @@ - - - - -WARNING -WARNING -WARNING -WARNING -WARNING - -


- -If you are using a released version of Kubernetes, you should -refer to the docs that go with that version. - -Documentation for other releases can be found at -[releases.k8s.io](http://releases.k8s.io). - --- - - - - - # Automated HA master deployment **Author:** filipg@, jsz@ -- cgit v1.2.3 From 19c2fcbabd186da4097368393a4b1771feed31ee Mon Sep 17 00:00:00 2001 From: yupeng Date: Fri, 25 Nov 2016 17:34:31 +0800 Subject: fix the mistake type Signed-off-by: yupeng --- aws_under_the_hood.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md index 2c161df8..6e3c5afb 100644 --- a/aws_under_the_hood.md +++ b/aws_under_the_hood.md @@ -198,7 +198,7 @@ Within the AWS cloud provider logic, we filter requests to the AWS APIs to match resources with our cluster tag. By filtering the requests, we ensure that we see only our own AWS objects. -** Important: ** If you choose not to use kube-up, you must pick a unique +**Important:** If you choose not to use kube-up, you must pick a unique cluster-id value, and ensure that all AWS resources have a tag with `Name=KubernetesCluster,Value=`. 
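The cluster-tag scoping described in the hunk above can be sketched as a simple filter. The resource dictionaries below are illustrative stand-ins for real AWS API objects; the actual cloud provider applies equivalent tag filters through the AWS API:

```python
CLUSTER_TAG = "KubernetesCluster"

def owned_by_cluster(resources, cluster_id):
    """Keep only resources carrying our cluster tag, mimicking how the AWS
    cloud provider scopes API results to its own cluster (sketch only)."""
    return [
        r for r in resources
        if r.get("tags", {}).get(CLUSTER_TAG) == cluster_id
    ]

resources = [
    {"id": "i-1", "tags": {"KubernetesCluster": "prod"}},
    {"id": "i-2", "tags": {"KubernetesCluster": "staging"}},
    {"id": "vol-3", "tags": {}},
]
print([r["id"] for r in owned_by_cluster(resources, "prod")])  # ['i-1']
```

This is why a unique cluster-id matters: two clusters sharing a tag value would each "see" (and potentially mutate) the other's resources.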
-- cgit v1.2.3 From 7fcccebe88d62c02c8679fc2a1d192e25cb474ab Mon Sep 17 00:00:00 2001 From: Michelle Noorali Date: Wed, 30 Nov 2016 14:11:51 -0500 Subject: refactor: isolate docs/design --- README.md | 62 - access.md | 376 ----- admission_control.md | 106 -- admission_control_limit_range.md | 233 --- admission_control_resource_quota.md | 215 --- architecture.dia | Bin 6523 -> 0 bytes architecture.md | 85 - architecture.png | Bin 268126 -> 0 bytes architecture.svg | 1943 ---------------------- aws_under_the_hood.md | 310 ---- clustering.md | 128 -- clustering/.gitignore | 1 - clustering/Dockerfile | 26 - clustering/Makefile | 41 - clustering/README.md | 35 - clustering/dynamic.png | Bin 72373 -> 0 bytes clustering/dynamic.seqdiag | 24 - clustering/static.png | Bin 36583 -> 0 bytes clustering/static.seqdiag | 16 - command_execution_port_forwarding.md | 158 -- configmap.md | 300 ---- control-plane-resilience.md | 241 --- daemon.md | 206 --- design/README.md | 62 + design/access.md | 376 +++++ design/admission_control.md | 106 ++ design/admission_control_limit_range.md | 233 +++ design/admission_control_resource_quota.md | 215 +++ design/architecture.dia | Bin 0 -> 6523 bytes design/architecture.md | 85 + design/architecture.png | Bin 0 -> 268126 bytes design/architecture.svg | 1943 ++++++++++++++++++++++ design/aws_under_the_hood.md | 310 ++++ design/clustering.md | 128 ++ design/clustering/.gitignore | 1 + design/clustering/Dockerfile | 26 + design/clustering/Makefile | 41 + design/clustering/README.md | 35 + design/clustering/dynamic.png | Bin 0 -> 72373 bytes design/clustering/dynamic.seqdiag | 24 + design/clustering/static.png | Bin 0 -> 36583 bytes design/clustering/static.seqdiag | 16 + design/command_execution_port_forwarding.md | 158 ++ design/configmap.md | 300 ++++ design/control-plane-resilience.md | 241 +++ design/daemon.md | 206 +++ design/downward_api_resources_limits_requests.md | 622 +++++++ design/enhance-pluggable-policy.md | 429 +++++ 
design/event_compression.md | 169 ++ design/expansion.md | 417 +++++ design/extending-api.md | 203 +++ design/federated-replicasets.md | 513 ++++++ design/federated-services.md | 517 ++++++ design/federation-phase-1.md | 407 +++++ design/ha_master.md | 236 +++ design/horizontal-pod-autoscaler.md | 263 +++ design/identifiers.md | 113 ++ design/indexed-job.md | 900 ++++++++++ design/metadata-policy.md | 137 ++ design/monitoring_architecture.md | 203 +++ design/monitoring_architecture.png | Bin 0 -> 76662 bytes design/namespaces.md | 370 ++++ design/networking.md | 190 +++ design/nodeaffinity.md | 246 +++ design/persistent-storage.md | 292 ++++ design/podaffinity.md | 673 ++++++++ design/principles.md | 101 ++ design/resource-qos.md | 218 +++ design/resources.md | 370 ++++ design/scheduler_extender.md | 105 ++ design/seccomp.md | 266 +++ design/secrets.md | 628 +++++++ design/security.md | 218 +++ design/security_context.md | 192 +++ design/selector-generation.md | 180 ++ design/selinux.md | 317 ++++ design/service_accounts.md | 210 +++ design/simple-rolling-update.md | 131 ++ design/taint-toleration-dedicated.md | 291 ++++ design/ubernetes-cluster-state.png | Bin 0 -> 13824 bytes design/ubernetes-design.png | Bin 0 -> 20358 bytes design/ubernetes-scheduling.png | Bin 0 -> 39094 bytes design/versioning.md | 174 ++ design/volume-snapshotting.md | 523 ++++++ design/volume-snapshotting.png | Bin 0 -> 49261 bytes downward_api_resources_limits_requests.md | 622 ------- enhance-pluggable-policy.md | 429 ----- event_compression.md | 169 -- expansion.md | 417 ----- extending-api.md | 203 --- federated-replicasets.md | 513 ------ federated-services.md | 517 ------ federation-phase-1.md | 407 ----- ha_master.md | 236 --- horizontal-pod-autoscaler.md | 263 --- identifiers.md | 113 -- indexed-job.md | 900 ---------- metadata-policy.md | 137 -- monitoring_architecture.md | 203 --- monitoring_architecture.png | Bin 76662 -> 0 bytes namespaces.md | 370 ---- networking.md | 190 --- 
nodeaffinity.md | 246 --- persistent-storage.md | 292 ---- podaffinity.md | 673 -------- principles.md | 101 -- resource-qos.md | 218 --- resources.md | 370 ---- scheduler_extender.md | 105 -- seccomp.md | 266 --- secrets.md | 628 ------- security.md | 218 --- security_context.md | 192 --- selector-generation.md | 180 -- selinux.md | 317 ---- service_accounts.md | 210 --- simple-rolling-update.md | 131 -- taint-toleration-dedicated.md | 291 ---- ubernetes-cluster-state.png | Bin 13824 -> 0 bytes ubernetes-design.png | Bin 20358 -> 0 bytes ubernetes-scheduling.png | Bin 39094 -> 0 bytes versioning.md | 174 -- volume-snapshotting.md | 523 ------ volume-snapshotting.png | Bin 49261 -> 0 bytes 124 files changed, 15330 insertions(+), 15330 deletions(-) delete mode 100644 README.md delete mode 100644 access.md delete mode 100644 admission_control.md delete mode 100644 admission_control_limit_range.md delete mode 100644 admission_control_resource_quota.md delete mode 100644 architecture.dia delete mode 100644 architecture.md delete mode 100644 architecture.png delete mode 100644 architecture.svg delete mode 100644 aws_under_the_hood.md delete mode 100644 clustering.md delete mode 100644 clustering/.gitignore delete mode 100644 clustering/Dockerfile delete mode 100644 clustering/Makefile delete mode 100644 clustering/README.md delete mode 100644 clustering/dynamic.png delete mode 100644 clustering/dynamic.seqdiag delete mode 100644 clustering/static.png delete mode 100644 clustering/static.seqdiag delete mode 100644 command_execution_port_forwarding.md delete mode 100644 configmap.md delete mode 100644 control-plane-resilience.md delete mode 100644 daemon.md create mode 100644 design/README.md create mode 100644 design/access.md create mode 100644 design/admission_control.md create mode 100644 design/admission_control_limit_range.md create mode 100644 design/admission_control_resource_quota.md create mode 100644 design/architecture.dia create mode 100644 
design/architecture.md create mode 100644 design/architecture.png create mode 100644 design/architecture.svg create mode 100644 design/aws_under_the_hood.md create mode 100644 design/clustering.md create mode 100644 design/clustering/.gitignore create mode 100644 design/clustering/Dockerfile create mode 100644 design/clustering/Makefile create mode 100644 design/clustering/README.md create mode 100644 design/clustering/dynamic.png create mode 100644 design/clustering/dynamic.seqdiag create mode 100644 design/clustering/static.png create mode 100644 design/clustering/static.seqdiag create mode 100644 design/command_execution_port_forwarding.md create mode 100644 design/configmap.md create mode 100644 design/control-plane-resilience.md create mode 100644 design/daemon.md create mode 100644 design/downward_api_resources_limits_requests.md create mode 100644 design/enhance-pluggable-policy.md create mode 100644 design/event_compression.md create mode 100644 design/expansion.md create mode 100644 design/extending-api.md create mode 100644 design/federated-replicasets.md create mode 100644 design/federated-services.md create mode 100644 design/federation-phase-1.md create mode 100644 design/ha_master.md create mode 100644 design/horizontal-pod-autoscaler.md create mode 100644 design/identifiers.md create mode 100644 design/indexed-job.md create mode 100644 design/metadata-policy.md create mode 100644 design/monitoring_architecture.md create mode 100644 design/monitoring_architecture.png create mode 100644 design/namespaces.md create mode 100644 design/networking.md create mode 100644 design/nodeaffinity.md create mode 100644 design/persistent-storage.md create mode 100644 design/podaffinity.md create mode 100644 design/principles.md create mode 100644 design/resource-qos.md create mode 100644 design/resources.md create mode 100644 design/scheduler_extender.md create mode 100644 design/seccomp.md create mode 100644 design/secrets.md create mode 100644 design/security.md 
create mode 100644 design/security_context.md create mode 100644 design/selector-generation.md create mode 100644 design/selinux.md create mode 100644 design/service_accounts.md create mode 100644 design/simple-rolling-update.md create mode 100644 design/taint-toleration-dedicated.md create mode 100644 design/ubernetes-cluster-state.png create mode 100644 design/ubernetes-design.png create mode 100644 design/ubernetes-scheduling.png create mode 100644 design/versioning.md create mode 100644 design/volume-snapshotting.md create mode 100644 design/volume-snapshotting.png delete mode 100644 downward_api_resources_limits_requests.md delete mode 100644 enhance-pluggable-policy.md delete mode 100644 event_compression.md delete mode 100644 expansion.md delete mode 100644 extending-api.md delete mode 100644 federated-replicasets.md delete mode 100644 federated-services.md delete mode 100644 federation-phase-1.md delete mode 100644 ha_master.md delete mode 100644 horizontal-pod-autoscaler.md delete mode 100644 identifiers.md delete mode 100644 indexed-job.md delete mode 100644 metadata-policy.md delete mode 100644 monitoring_architecture.md delete mode 100644 monitoring_architecture.png delete mode 100644 namespaces.md delete mode 100644 networking.md delete mode 100644 nodeaffinity.md delete mode 100644 persistent-storage.md delete mode 100644 podaffinity.md delete mode 100644 principles.md delete mode 100644 resource-qos.md delete mode 100644 resources.md delete mode 100644 scheduler_extender.md delete mode 100644 seccomp.md delete mode 100644 secrets.md delete mode 100644 security.md delete mode 100644 security_context.md delete mode 100644 selector-generation.md delete mode 100644 selinux.md delete mode 100644 service_accounts.md delete mode 100644 simple-rolling-update.md delete mode 100644 taint-toleration-dedicated.md delete mode 100644 ubernetes-cluster-state.png delete mode 100644 ubernetes-design.png delete mode 100644 ubernetes-scheduling.png delete mode 100644 
versioning.md delete mode 100644 volume-snapshotting.md delete mode 100644 volume-snapshotting.png diff --git a/README.md b/README.md deleted file mode 100644 index 85fc8245..00000000 --- a/README.md +++ /dev/null @@ -1,62 +0,0 @@ -# Kubernetes Design Overview - -Kubernetes is a system for managing containerized applications across multiple -hosts, providing basic mechanisms for deployment, maintenance, and scaling of -applications. - -Kubernetes establishes robust declarative primitives for maintaining the desired -state requested by the user. We see these primitives as the main value added by -Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and -replicating containers require active controllers, not just imperative -orchestration. - -Kubernetes is primarily targeted at applications composed of multiple -containers, such as elastic, distributed micro-services. It is also designed to -facilitate migration of non-containerized application stacks to Kubernetes. It -therefore includes abstractions for grouping containers in both loosely coupled -and tightly coupled formations, and provides ways for containers to find and -communicate with each other in relatively familiar ways. - -Kubernetes enables users to ask a cluster to run a set of containers. The system -automatically chooses hosts to run those containers on. While Kubernetes's -scheduler is currently very simple, we expect it to grow in sophistication over -time. Scheduling is a policy-rich, topology-aware, workload-specific function -that significantly impacts availability, performance, and capacity. The -scheduler needs to take into account individual and collective resource -requirements, quality of service requirements, hardware/software/policy -constraints, affinity and anti-affinity specifications, data locality, -inter-workload interference, deadlines, and so on. Workload-specific -requirements will be exposed through the API as necessary. 
- -Kubernetes is intended to run on a number of cloud providers, as well as on -physical hosts. - -A single Kubernetes cluster is not intended to span multiple availability zones. -Instead, we recommend building a higher-level layer to replicate complete -deployments of highly available applications across multiple zones (see -[the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md) -for more details). - -Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS -platform and toolkit. Therefore, architecturally, we want Kubernetes to be built -as a collection of pluggable components and layers, with the ability to use -alternative schedulers, controllers, storage systems, and distribution -mechanisms, and we're evolving its current code in that direction. Furthermore, -we want others to be able to extend Kubernetes functionality, such as with -higher-level PaaS functionality or multi-cluster layers, without modification of -core Kubernetes source. Therefore, its API isn't just (or even necessarily -mainly) targeted at end users, but at tool and extension developers. Its APIs -are intended to serve as the foundation for an open ecosystem of tools, -automation systems, and higher-level API layers. Consequently, there are no -"internal" inter-component APIs. All APIs are visible and available, including -the APIs used by the scheduler, the node controller, the replication-controller -manager, Kubelet's API, etc. There's no glass to break -- in order to handle -more complex use cases, one can just access the lower-level APIs in a fully -transparent, composable manner. - -For more about the Kubernetes architecture, see [architecture](architecture.md). 
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]() - diff --git a/access.md b/access.md deleted file mode 100644 index b23e463b..00000000 --- a/access.md +++ /dev/null @@ -1,376 +0,0 @@ -# K8s Identity and Access Management Sketch - -This document suggests a direction for identity and access management in the -Kubernetes system. - - -## Background - -High level goals are: - - Have a plan for how identity, authentication, and authorization will fit in -to the API. - - Have a plan for partitioning resources within a cluster between independent -organizational units. - - Ease integration with existing enterprise and hosted scenarios. - -### Actors - -Each of these can act as normal users or attackers. - - External Users: People who are accessing applications running on K8s (e.g. -a web site served by webserver running in a container on K8s), but who do not -have K8s API access. - - K8s Users: People who access the K8s API (e.g. create K8s API objects like -Pods) - - K8s Project Admins: People who manage access for some K8s Users - - K8s Cluster Admins: People who control the machines, networks, or binaries -that make up a K8s cluster. - - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. - -### Threats - -Both intentional attacks and accidental use of privilege are concerns. - -For both cases it may be useful to think about these categories differently: - - Application Path - attack by sending network messages from the internet to -the IP/port of any application running on K8s. May exploit weakness in -application or misconfiguration of K8s. - - K8s API Path - attack by sending network messages to any K8s API endpoint. - - Insider Path - attack on K8s system components. Attacker may have -privileged access to networks, machines or K8s software and data. Software -errors in K8s system components and administrator error are some types of threat -in this category. 
-
-This document is primarily concerned with K8s API paths, and secondarily with
-Insider paths. The Application path also needs to be secure, but is not the
-focus of this document.
-
-### Assets to protect
-
-External User assets:
- - Personal information like private messages, or images uploaded by External Users.
- - web server logs.
-
-K8s User assets:
- - External User assets of each K8s User.
- - things private to the K8s app, like:
-   - credentials for accessing other services (docker private repos, storage services, facebook, etc)
-   - SSL certificates for web servers
-   - proprietary data and code
-
-K8s Cluster assets:
- - Assets of each K8s User.
- - Machine Certificates or secrets.
- - The value of K8s cluster computing resources (cpu, memory, etc).
-
-This document is primarily about protecting K8s User assets and K8s cluster
-assets from other K8s Users and K8s Project and Cluster Admins.
-
-### Usage environments
-
-Cluster in Small organization:
- - K8s Admins may be the same people as K8s Users.
- - Few K8s Admins.
- - Prefer ease of use to fine-grained access control/precise accounting, etc.
- - Product requirement that it be easy for a potential K8s Cluster Admin to try
-out setting up a simple cluster.
-
-Cluster in Large organization:
- - K8s Admins are typically distinct people from K8s Users. May need to divide
-K8s Cluster Admin access by roles.
- - K8s Users need to be protected from each other.
- - Auditing of K8s User and K8s Admin actions is important.
- - Flexible, accurate usage accounting and resource controls are important.
- - Lots of automated access to APIs.
- - Need to integrate with existing enterprise directory, authentication,
-accounting, auditing, and security policy infrastructure.
-
-Org-run cluster:
- - Organization that runs K8s master components is the same as the org that runs
-apps on K8s.
- - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.
-
-Hosted cluster:
- - Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
- - May already offer web services, and need to integrate with an existing customer
-account concept, and existing authentication, accounting, auditing, and security
-policy infrastructure.
- - May want to leverage K8s User accounts and accounting to manage their User
-accounts (not a priority to support this use case.)
- - Precise and accurate accounting of resources is needed. Resource controls are
-needed for hard limits (Users given a limited slice of data) and soft limits
-(Users can grow up to some limit and then be expanded).
-
-K8s ecosystem services:
- - There may be companies that want to offer their existing services (Build, CI,
-A/B-test, release automation, etc) for use with K8s. There should be some story
-for this case.
-
-Pod configs should be largely portable between Org-run and hosted
-configurations.
-
-
-# Design
-
-Related discussion:
-- http://issue.k8s.io/442
-- http://issue.k8s.io/443
-
-This doc describes two security profiles:
- - Simple profile: like single-user mode. Make it easy to evaluate K8s
-without lots of configuring of accounts and policies. Protects from unauthorized
-users, but does not partition authorized users.
- - Enterprise profile: Provide mechanisms needed for large numbers of users.
-Defense in depth. Should integrate with existing enterprise security
-infrastructure.
-
-K8s distribution should include templates of config, and documentation, for
-simple and enterprise profiles. The system should be flexible enough for
-knowledgeable users to create intermediate profiles, but K8s developers should
-only reason about those two Profiles, not a matrix.
-
-Features in this doc are divided into "Initial Features" and "Improvements".
-Initial features would be candidates for version 1.00.
-
-## Identity
-
-### userAccount
-
-K8s will have a `userAccount` API object.
-- `userAccount` has a UID which is immutable.
This is used to associate users
-with objects and to record actions in audit logs.
-- `userAccount` has a name, which is a human-readable string, unique among
-userAccounts. It is used to refer to users in Policies, to ensure that the
-Policies are human readable. It can be changed only when there are no Policy
-objects or other objects which refer to that name. An email address is a
-suggested format for this field.
-- `userAccount` is not related to the unix username of processes in Pods created
-by that userAccount.
-- `userAccount` API objects can have labels.
-
-The system may associate one or more Authentication Methods with a
-`userAccount` (but they are not formally part of the userAccount object.)
-
-In a simple deployment, the authentication method for a user might be an
-authentication token which is verified by a K8s server. In a more complex
-deployment, the authentication might be delegated to another system which is
-trusted by the K8s API to authenticate users, but where the authentication
-details are unknown to K8s.
-
-Initial Features:
-- There is no superuser `userAccount`.
-- `userAccount` objects are statically populated in the K8s API store by reading
-a config file. Only a K8s Cluster Admin can do this.
-- `userAccount` can have a default `namespace`. If an API call does not specify a
-`namespace`, the default `namespace` for that caller is assumed.
-- `userAccount` is global. A single human with access to multiple namespaces is
-recommended to have only one userAccount.
-
-Improvements:
-- Make `userAccount` part of a separate API group from core K8s objects like
-`pod`. This facilitates plugging in alternate Access Management.
-
-Simple Profile:
- - Single `userAccount`, used by all K8s Users and Project Admins. One access
-token shared by all.
-
-Enterprise Profile:
- - Every human user has their own `userAccount`.
- - `userAccount`s have labels that indicate both membership in groups, and
-ability to act in certain roles.
- - Each service using the API has its own `userAccount` too (e.g. `scheduler`,
-`repcontroller`).
- - Automated jobs to denormalize the LDAP group info into the local list of
-users kept in the K8s userAccount file.
-
-### Unix accounts
-
-A `userAccount` is not a Unix user account. The fact that a pod is started by a
-`userAccount` does not mean that the processes in that pod's containers run as a
-Unix user with a corresponding name or identity.
-
-Initially:
-- The unix accounts available in a container, and used by the processes running
-in a container, are those provided by the combination of the base
-operating system and the Docker manifest.
-- Kubernetes doesn't enforce any relation between `userAccount` and unix
-accounts.
-
-Improvements:
-- Kubelet allocates disjoint blocks of root-namespace uids for each container.
-This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
-This requires Docker to integrate user namespace support, and a decision on what
-getpwnam() does for these uids.
-- Any features that help users avoid use of privileged containers
-(http://issue.k8s.io/391).
-
-### Namespaces
-
-K8s will have a `namespace` API object. It is similar to a Google Compute
-Engine `project`. It provides a namespace for objects created by a group of
-people co-operating together, preventing name collisions with non-cooperating
-groups. It also serves as a reference point for authorization policies.
-
-Namespaces are described in [namespaces.md](namespaces.md).
-
-In the Enterprise Profile:
- - a `userAccount` may have permission to access several `namespace`s.
-
-In the Simple Profile:
- - There is a single `namespace` used by the single user.
-
-Namespaces vs. userAccounts vs. Labels:
-- `userAccount`s are intended for audit logging (both name and UID should be
-logged), and to define who has access to `namespace`s.
-
-- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md))
-should be used to distinguish pods, users, and other objects that cooperate
-towards a common goal but are different in some way, such as version or
-responsibilities.
-- `namespace`s prevent name collisions between uncoordinated groups of people,
-and provide a place to attach common policies for co-operating groups of people.
-
-
-## Authentication
-
-Goals for K8s authentication:
-- Include a built-in authentication system with no configuration required to use
-in single-user mode, little configuration required to add several user
-accounts, and no https proxy required.
-- Allow for authentication to be handled by a system external to Kubernetes, to
-allow integration with existing enterprise authorization systems. The
-Kubernetes namespace itself should avoid taking contributions of multiple
-authorization schemes. Instead, a trusted proxy in front of the apiserver can be
-used to authenticate users.
-  - For organizations whose security requirements only allow FIPS compliant
-implementations (e.g. apache) for authentication.
-  - So the proxy can terminate SSL, and isolate the CA-signed certificate from
-the less trusted, higher-touch APIserver.
-  - For organizations that already have existing SaaS web services (e.g.
-storage, VMs) and want a common authentication portal.
-- Avoid mixing authentication and authorization, so that authorization policies
-can be centrally managed, and to allow changes in authentication methods without
-affecting authorization code.
-
-Initially:
-- Tokens used to authenticate a user.
-- Long-lived tokens identify a particular `userAccount`.
-- Administrator utility generates tokens at cluster setup.
-- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
-- No scopes for tokens. Authorization happens in the API server.
-- Tokens dynamically generated by apiserver to identify pods which are making
-API calls.
-
-- Tokens checked in a module of the APIserver.
-- Authentication in apiserver can be disabled by flag, to allow testing without
-authorization enabled, and to allow use of an authenticating proxy. In this
-mode, a query parameter or header added by the proxy will identify the caller.
-
-Improvements:
-- Refresh of tokens.
-- SSH keys for access inside containers.
-
-To be considered for subsequent versions:
-- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749).
-- Scoped tokens.
-- Tokens that are bound to the channel between the client and the api server:
-  - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
-  - http://www.browserauth.net
-
-## Authorization
-
-K8s authorization should:
-- Allow for a range of maturity levels, from single-user for those test-driving
-the system, to integration with existing enterprise authorization systems.
-- Allow for centralized management of users and policies. In some
-organizations, this will mean that the definition of users and access policies
-needs to reside on a system other than k8s and encompass other web services
-(such as a storage service).
-- Allow processes running in K8s Pods to take on identity, and allow narrow
-scoping of permissions for those identities in order to limit damage from
-software faults.
-- Have Authorization Policies exposed as API objects so that a single config
-file can create or delete Pods, Replication Controllers, Services, and the
-identities and policies for those Pods and Replication Controllers.
-- Be separate as much as practical from Authentication, to allow Authentication
-methods to change over time and space, without impacting Authorization policies.
-
-K8s will implement a relatively simple
-[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
-
-The model will be described in more detail in a forthcoming document.
The model
-will:
-- Be less complex than XACML.
-- Be easily recognizable to those familiar with Amazon IAM Policies.
-- Have a subset/aliases/defaults which allow it to be used in a way comfortable
-to those users more familiar with Role-Based Access Control.
-
-Authorization policy is set by creating a set of Policy objects.
-
-The API Server will be the Enforcement Point for Policy. For each API call that
-it receives, it will construct the Attributes needed to evaluate the policy
-(what user is making the call, what resource they are accessing, what they are
-trying to do to that resource, etc) and pass those attributes to a Decision Point.
-The Decision Point code evaluates the Attributes against all the Policies and
-allows or denies the API call. The system will be modular enough that the
-Decision Point code can either be linked into the APIserver binary, or be
-another service that the apiserver calls for each Decision (with appropriate
-time-limited caching as needed for performance).
-
-Some Policy objects may be applicable only to a single namespace; K8s Project
-Admins would be able to create those as needed. Other Policy objects may be
-applicable to all namespaces; a K8s Cluster Admin might create those in order to
-authorize a new type of controller to be used by all namespaces, or to make a
-K8s User into a K8s Project Admin.
-
-## Accounting
-
-The API should have a `quota` concept (see http://issue.k8s.io/442). A quota
-object relates a namespace (and optionally a label selector) to a maximum
-quantity of resources that may be used (see the [resources design doc](resources.md)).
-
-Initially:
-- A `quota` object is immutable.
-- For hosted K8s systems that do billing, Project is the recommended level for
-billing accounts.
-- Every object that consumes resources should have a `namespace` so that
-resource usage stats are roll-up-able to `namespace`.
-- K8s Cluster Admin sets quota objects by writing a config file.
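The quota idea above can be sketched as a simple check of requested usage against a namespace's hard limits. The `Quota` type, `withinQuota` helper, and plain `int64` quantities are illustrative assumptions for this sketch, not the eventual K8s API.

```go
package main

import "fmt"

// ResourceList maps a resource name (e.g. "cpu") to a quantity.
// Plain int64 stands in for the real quantity type in this sketch.
type ResourceList map[string]int64

// Quota ties a namespace (and, eventually, an optional label selector)
// to a hard ceiling on total resource usage.
type Quota struct {
	Namespace string
	Hard      ResourceList
}

// withinQuota reports whether adding the requested usage to the current
// usage stays at or under every hard limit in the quota.
func withinQuota(q Quota, used, requested ResourceList) bool {
	for res, hard := range q.Hard {
		// Missing keys read as zero, so unspecified usage costs nothing.
		if used[res]+requested[res] > hard {
			return false
		}
	}
	return true
}

func main() {
	q := Quota{Namespace: "webserver", Hard: ResourceList{"cpu": 100}}
	used := ResourceList{"cpu": 90}
	fmt.Println(withinQuota(q, used, ResourceList{"cpu": 5}))  // true
	fmt.Println(withinQuota(q, used, ResourceList{"cpu": 20})) // false
}
```

This mirrors the "let the webserver namespace use 100 cores" example: a request fits only while the running total stays under the hard limit.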
-
-Improvements:
-- Allow one namespace to charge the quota for one or more other namespaces. This
-would be controlled by a policy which allows changing a billing_namespace=
-label on an object.
-- Allow quota to be set by namespace owners for (namespace x label) combinations
-(e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, don't
-allow the "webserver" namespace with "instance=test" to use more than 10 cores).
-- Tools to help write consistent quota config files based on number of nodes,
-historical namespace usages, QoS needs, etc.
-- A way for the K8s Cluster Admin to incrementally adjust Quota objects.
-
-Simple profile:
- - A single `namespace` with infinite resource limits.
-
-Enterprise profile:
- - Multiple namespaces, each with their own limits.
-
-Issues:
-- Need for locking or "eventual consistency" when multiple apiserver goroutines
-are accessing the object store and handling pod creations.
-
-
-## Audit Logging
-
-API actions can be logged.
-
-Initial implementation:
-- All API calls logged to nginx logs.
-
-Improvements:
-- API server does logging instead.
-- Policies to drop logging for high-rate trusted API calls, or by users
-performing audit or other sensitive functions.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
-
diff --git a/admission_control.md b/admission_control.md
deleted file mode 100644
index a7330104..00000000
--- a/admission_control.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# Kubernetes Proposal - Admission Control
-
-**Related PR:**
-
-| Topic | Link |
-| ----- | ---- |
-| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
-
-## Background
-
-High level goals:
-* Enable an easy-to-use mechanism to provide admission control to a cluster.
-* Enable a provider to support multiple admission control strategies or author
-their own.
-* Ensure any rejected request can propagate errors back to the caller with why
-the request failed.
-
-Authorization via policy is focused on answering whether a user is authorized to
-perform an action.
-
-Admission Control is focused on whether the system will accept an authorized action.
-
-Kubernetes may choose to dismiss an authorized action based on any number of
-admission control strategies.
-
-This proposal documents the basic design, and describes how any number of
-admission control plug-ins could be injected.
-
-Implementation of specific admission control strategies is handled in separate
-documents.
-
-## kube-apiserver
-
-The kube-apiserver takes the following OPTIONAL arguments to enable admission
-control:
-
-| Option | Behavior |
-| ------ | -------- |
-| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. |
-| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. |
-
-An **AdmissionControl** plug-in is an implementation of the following interface:
-
-```go
-package admission
-
-// Attributes is an interface used by a plug-in to make an admission decision
-// on an individual request.
-type Attributes interface {
-  GetNamespace() string
-  GetKind() string
-  GetOperation() string
-  GetObject() runtime.Object
-}
-
-// Interface is an abstract, pluggable interface for Admission Control decisions.
-type Interface interface {
-  // Admit makes an admission decision based on the request attributes.
-  // An error is returned if it denies the request.
-  Admit(a Attributes) (err error)
-}
-```
-
-A **plug-in** must be compiled with the binary, and is registered as an
-available option by providing a name, and an implementation of admission.Interface.
-
-```go
-func init() {
-  admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil })
-}
-```
-
-A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go)
-
-```go
-  // Admission policies
-  _ "k8s.io/kubernetes/plugin/pkg/admission/admit"
-  _ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages"
-  _ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity"
-  ...
-  _ ""
-```
-
-Invocation of admission control is handled by the **APIServer** and not
-individual **RESTStorage** implementations.
-
-This design assumes that **Issue 297** is adopted, and as a consequence, the
-general framework of the APIServer request/response flow will ensure the
-following:
-
-1. Incoming request
-2. Authenticate user
-3. Authorize user
-4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes)
-   - invoke each admission.Interface object in sequence
-5. Case on the operation:
-   - If operation=create|update, then validate(object) and persist
-   - If operation=delete, delete the object
-   - If operation=connect, exec
-
-If at any step there is an error, the request is canceled.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]()
-
diff --git a/admission_control_limit_range.md b/admission_control_limit_range.md
deleted file mode 100644
index 06cce2cb..00000000
--- a/admission_control_limit_range.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# Admission control plugin: LimitRanger
-
-## Background
-
-This document proposes a system for enforcing resource requirement constraints
-as part of admission control.
-
-## Use cases
-
-1. Ability to enumerate resource requirement constraints per namespace
-2. Ability to enumerate min/max resource constraints for a pod
-3. Ability to enumerate min/max resource constraints for a container
-4.
Ability to specify default resource limits for a container -5. Ability to specify default resource requests for a container -6. Ability to enforce a ratio between request and limit for a resource. -7. Ability to enforce min/max storage requests for persistent volume claims - -## Data Model - -The **LimitRange** resource is scoped to a **Namespace**. - -### Type - -```go -// LimitType is a type of object that is limited -type LimitType string - -const ( - // Limit that applies to all pods in a namespace - LimitTypePod LimitType = "Pod" - // Limit that applies to all containers in a namespace - LimitTypeContainer LimitType = "Container" -) - -// LimitRangeItem defines a min/max usage limit for any resource that matches -// on kind. -type LimitRangeItem struct { - // Type of resource that this limit applies to. - Type LimitType `json:"type,omitempty"` - // Max usage constraints on this kind by resource name. - Max ResourceList `json:"max,omitempty"` - // Min usage constraints on this kind by resource name. - Min ResourceList `json:"min,omitempty"` - // Default resource requirement limit value by resource name if resource limit - // is omitted. - Default ResourceList `json:"default,omitempty"` - // DefaultRequest is the default resource requirement request value by - // resource name if resource request is omitted. - DefaultRequest ResourceList `json:"defaultRequest,omitempty"` - // MaxLimitRequestRatio if specified, the named resource must have a request - // and limit that are both non-zero where limit divided by request is less - // than or equal to the enumerated value; this represents the max burst for - // the named resource. - MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` -} - -// LimitRangeSpec defines a min/max usage limit for resources that match -// on kind. -type LimitRangeSpec struct { - // Limits is the list of LimitRangeItem objects that are enforced. 
- Limits []LimitRangeItem `json:"limits"` -} - -// LimitRange sets resource usage limits for each kind of resource in a -// Namespace. -type LimitRange struct { - TypeMeta `json:",inline"` - // Standard object's metadata. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata - ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the limits enforced. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status - Spec LimitRangeSpec `json:"spec,omitempty"` -} - -// LimitRangeList is a list of LimitRange items. -type LimitRangeList struct { - TypeMeta `json:",inline"` - // Standard list metadata. - // More info: - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds - ListMeta `json:"metadata,omitempty"` - - // Items is a list of LimitRange objects. - // More info: - // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md - Items []LimitRange `json:"items"` -} -``` - -### Validation - -Validation of a **LimitRange** enforces that for a given named resource the -following rules apply: - -Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) -<= Max (if specified) - -### Default Value Behavior - -The following default value behaviors are applied to a LimitRange for a given -named resource. 
- -``` -if LimitRangeItem.Default[resourceName] is undefined - if LimitRangeItem.Max[resourceName] is defined - LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName] -``` - -``` -if LimitRangeItem.DefaultRequest[resourceName] is undefined - if LimitRangeItem.Default[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] - else if LimitRangeItem.Min[resourceName] is defined - LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] -``` - -## AdmissionControl plugin: LimitRanger - -The **LimitRanger** plug-in introspects all incoming pod requests and evaluates -the constraints defined on a LimitRange. - -If a constraint is not specified for an enumerated resource, it is not enforced -or tracked. - -To enable the plug-in and support for LimitRange, the kube-apiserver must be -configured as follows: - -```console -$ kube-apiserver --admission-control=LimitRanger -``` - -### Enforcement of constraints - -**Type: Container** - -Supported Resources: - -1. memory -2. cpu - -Supported Constraints: - -Per container, the following must hold true: - -| Constraint | Behavior | -| ---------- | -------- | -| Min | Min <= Request (required) <= Limit (optional) | -| Max | Limit (required) <= Max | -| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) | - -Supported Defaults: - -1. Default - if the named resource has no enumerated value, the Limit is equal -to the Default -2. DefaultRequest - if the named resource has no enumerated value, the Request -is equal to the DefaultRequest - -**Type: Pod** - -Supported Resources: - -1. memory -2. 
cpu
-
-Supported Constraints:
-
-Across all containers in a pod, the following must hold true:
-
-| Constraint | Behavior |
-| ---------- | -------- |
-| Min | Min <= Request (required) <= Limit (optional) |
-| Max | Limit (required) <= Max |
-| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) |
-
-**Type: PersistentVolumeClaim**
-
-Supported Resources:
-
-1. storage
-
-Supported Constraints:
-
-Across all claims in a namespace, the following must hold true:
-
-| Constraint | Behavior |
-| ---------- | -------- |
-| Min | Min <= Request (required) |
-| Max | Request (required) <= Max |
-
-Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time.
-
-## Run-time configuration
-
-The default ```LimitRange``` that is applied via Salt configuration will be
-updated as follows:
-
-```
-apiVersion: "v1"
-kind: "LimitRange"
-metadata:
-  name: "limits"
-  namespace: default
-spec:
-  limits:
-    - type: "Container"
-      defaultRequest:
-        cpu: "100m"
-```
-
-## Example
-
-An example LimitRange configuration:
-
-| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
-| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
-| Container | cpu | .1 | 1 | 500m | 250m | 4 |
-| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
-
-Assuming an incoming container that specifies no resource requirements,
-the following would happen:
-
-1. The incoming container cpu would request 250m with a limit of 500m.
-2. The incoming container memory would request 250Mi with a limit of 500Mi.
-3. If the container is later resized, its cpu would be constrained to between
-.1 and 1 and the ratio of limit to request could not exceed 4.
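A minimal sketch of the defaulting and validation behavior described above, using the cpu row from the example table. The types, helper names, and the use of plain `float64` cores in place of `resource.Quantity` are illustrative assumptions, not the actual LimitRanger plug-in code:

```go
package main

import "fmt"

// Resources maps a resource name to a quantity (cpu in cores here);
// the real plug-in uses resource.Quantity. Illustrative only.
type Resources map[string]float64

type LimitRangeItem struct {
	Min, Max, Default, DefaultRequest, MaxLimitRequestRatio Resources
}

type Container struct {
	Requests, Limits Resources
}

// applyDefaults fills a missing limit from Default and a missing request
// from DefaultRequest, mirroring the default value behavior above.
func applyDefaults(c *Container, item LimitRangeItem) {
	for name, def := range item.Default {
		if _, ok := c.Limits[name]; !ok {
			c.Limits[name] = def
		}
	}
	for name, defReq := range item.DefaultRequest {
		if _, ok := c.Requests[name]; !ok {
			c.Requests[name] = defReq
		}
	}
}

// validate enforces Min <= Request <= Limit <= Max and the maximum
// limit/request ratio for each named resource.
func validate(c Container, item LimitRangeItem) error {
	for name, limit := range c.Limits {
		req := c.Requests[name]
		if min, ok := item.Min[name]; ok && req < min {
			return fmt.Errorf("%s request %v below min %v", name, req, min)
		}
		if req > limit {
			return fmt.Errorf("%s request %v exceeds limit %v", name, req, limit)
		}
		if max, ok := item.Max[name]; ok && limit > max {
			return fmt.Errorf("%s limit %v above max %v", name, limit, max)
		}
		if ratio, ok := item.MaxLimitRequestRatio[name]; ok && req > 0 && limit/req > ratio {
			return fmt.Errorf("%s limit/request ratio %v exceeds %v", name, limit/req, ratio)
		}
	}
	return nil
}

func main() {
	// The cpu row from the example table above.
	item := LimitRangeItem{
		Min:                  Resources{"cpu": 0.1},
		Max:                  Resources{"cpu": 1},
		Default:              Resources{"cpu": 0.5},
		DefaultRequest:       Resources{"cpu": 0.25},
		MaxLimitRequestRatio: Resources{"cpu": 4},
	}
	c := Container{Requests: Resources{}, Limits: Resources{}}
	applyDefaults(&c, item)
	fmt.Printf("request=%v limit=%v admitted=%v\n",
		c.Requests["cpu"], c.Limits["cpu"], validate(c, item) == nil)
}
```

An empty container picks up the 250m request and 500m limit and passes validation, matching step 1 of the example.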
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]() - diff --git a/admission_control_resource_quota.md b/admission_control_resource_quota.md deleted file mode 100644 index 575db9a8..00000000 --- a/admission_control_resource_quota.md +++ /dev/null @@ -1,215 +0,0 @@ -# Admission control plugin: ResourceQuota - -## Background - -This document describes a system for enforcing hard resource usage limits per -namespace as part of admission control. - -## Use cases - -1. Ability to enumerate resource usage limits per namespace. -2. Ability to monitor resource usage for tracked resources. -3. Ability to reject resource usage exceeding hard quotas. - -## Data Model - -The **ResourceQuota** object is scoped to a **Namespace**. - -```go -// The following identify resource constants for Kubernetes object types -const ( - // Pods, number - ResourcePods ResourceName = "pods" - // Services, number - ResourceServices ResourceName = "services" - // ReplicationControllers, number - ResourceReplicationControllers ResourceName = "replicationcontrollers" - // ResourceQuotas, number - ResourceQuotas ResourceName = "resourcequotas" - // ResourceSecrets, number - ResourceSecrets ResourceName = "secrets" - // ResourcePersistentVolumeClaims, number - ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims" -) - -// ResourceQuotaSpec defines the desired hard limits to enforce for Quota -type ResourceQuotaSpec struct { - // Hard is the set of desired hard limits for each named resource - Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` -} - -// ResourceQuotaStatus defines the enforced hard limits and observed use -type ResourceQuotaStatus struct { - // Hard is the set of enforced hard limits for each named resource 
- Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` - // Used is the current observed total usage of the resource in the namespace - Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"` -} - -// ResourceQuota sets aggregate quota restrictions enforced per namespace -type ResourceQuota struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - - // Spec defines the desired quota - Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` - - // Status defines the actual enforced quota and its current usage - Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` -} - -// ResourceQuotaList is a list of ResourceQuota items -type ResourceQuotaList struct { - TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` - - // Items is a list of ResourceQuota objects - Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` -} -``` - -## Quota Tracked Resources - -The following resources are supported by the quota system: - -| Resource | Description | -| ------------ | ----------- | -| cpu | Total requested cpu usage | -| memory | Total requested memory usage | -| 
pods | Total number of pods in a non-terminal state (phase is Pending or Running) |
-| services | Total number of services |
-| replicationcontrollers | Total number of replication controllers |
-| resourcequotas | Total number of resource quotas |
-| secrets | Total number of secrets |
-| persistentvolumeclaims | Total number of persistent volume claims |
-
-If a third party wants to track additional resources, it must follow the
-resource naming conventions prescribed by Kubernetes. This means the resource
-must have a fully-qualified name (e.g. `mycompany.org/shinynewresource`).
-
-## Resource Requirements: Requests vs. Limits
-
-If a resource supports the ability to distinguish between a request and a limit
-for a resource, the quota tracking system will only charge the request value
-against the quota usage. If a resource is tracked by quota, and no request value
-is provided, the associated entity is rejected as part of admission.
-
-For example, consider the following scenarios relative to tracking quota on
-CPU:
-
-| Pod | Container | Request CPU | Limit CPU | Result |
-| --- | --------- | ----------- | --------- | ------ |
-| X | C1 | 100m | 500m | The quota usage is incremented 100m |
-| Y | C2 | 100m | none | The quota usage is incremented 100m |
-| Y | C2 | none | 500m | The quota usage is incremented 500m since the request defaults to the limit |
-| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
-
-The rationale for accounting for the requested amount of a resource versus the
-limit is the belief that a user should only be charged for what they are
-scheduled against in the cluster. In addition, attempting to track usage against
-actual usage, where request < actual < limit, is considered highly volatile.
-
-As a consequence of this decision, a user is able to spread their usage of a
-resource across multiple tiers of service. Let's demonstrate this via an
-example with a 4 cpu quota.
-
-The quota may be allocated as follows:
-
-| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
-| --- | --------- | ----------- | --------- | ---- | ----------- |
-| X | C1 | 1 | 4 | Burstable | 1 |
-| Y | C2 | 2 | 2 | Guaranteed | 2 |
-| Z | C3 | 1 | 3 | Burstable | 1 |
-
-It is possible that the pods may consume 9 cpu over a given time period,
-depending on the available cpu of the nodes that held pods X and Z, but since we
-scheduled X and Z relative to their requests, we only track the request value
-against their allocated quota. If one wants to restrict the ratio between the
-request and limit, it is encouraged that the user define a **LimitRange** with
-**LimitRequestRatio** to control burst behavior. This would, in effect, let
-an administrator keep the difference between request and limit more in line with
-tracked usage if desired.
-
-## Status API
-
-A REST API endpoint to update the status section of the **ResourceQuota** is
-exposed. It requires an atomic compare-and-swap in order to keep resource usage
-tracking consistent.
-
-## Resource Quota Controller
-
-A resource quota controller monitors observed usage for tracked resources in the
-**Namespace**.
-
-If there is an observed difference between the current usage stats and the
-current **ResourceQuota.Status**, the controller posts an update of the
-currently observed usage metrics to the **ResourceQuota** via the /status
-endpoint.
-
-The resource quota controller is the only component capable of monitoring and
-recording usage updates after a DELETE operation, since admission control is
-incapable of guaranteeing that a DELETE request actually succeeded.
-
-## AdmissionControl plugin: ResourceQuota
-
-The **ResourceQuota** plug-in introspects all incoming admission requests.
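The costing rule from the request-vs-limit table above can be sketched as follows. This is an illustrative helper under assumed names and plain `float64` quantities, not the plug-in's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// quotaCost implements the costing rule from the table above: charge the
// request against quota; default a missing request to the limit; reject
// when neither is enumerated for a tracked resource. Illustrative only.
func quotaCost(request, limit float64, hasRequest, hasLimit bool) (float64, error) {
	switch {
	case hasRequest:
		return request, nil
	case hasLimit:
		// Request defaults to limit, so the full limit is charged.
		return limit, nil
	default:
		return 0, errors.New("rejected: no request enumerated for tracked resource")
	}
}

func main() {
	cost, _ := quotaCost(0.1, 0.5, true, true)
	fmt.Println(cost) // the 100m request is charged, not the 500m limit
	cost, _ = quotaCost(0, 0.5, false, true)
	fmt.Println(cost) // the request defaults to the 500m limit
	_, err := quotaCost(0, 0, false, false)
	fmt.Println(err)
}
```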
-
-To enable the plug-in and support for ResourceQuota, the kube-apiserver must be
-configured as follows:
-
-```
-$ kube-apiserver --admission-control=ResourceQuota
-```
-
-It makes decisions by evaluating the incoming object against all defined
-**ResourceQuota.Status.Hard** resource limits in the request namespace. If
-acceptance of the resource would cause the total usage of a named resource to
-exceed its hard limit, the request is denied.
-
-If the incoming request does not cause the total usage to exceed any of the
-enumerated hard resource limits, the plug-in will post a
-**ResourceQuota.Status** document to the server to atomically update the
-observed usage based on the previously read **ResourceQuota.ResourceVersion**.
-This keeps incremental usage atomically consistent, but does introduce a
-bottleneck (intentionally) into the system.
-
-To optimize system performance, it is encouraged that all resource quotas are
-tracked on the same **ResourceQuota** document in a **Namespace**. As a result,
-it is encouraged to cap the number of **ResourceQuota** documents tracked in a
-**Namespace** at 1.
-
-## kubectl
-
-kubectl is modified to support the **ResourceQuota** resource.
-
-`kubectl describe` provides a human-readable output of quota.
- -For example: - -```console -$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml -namespace "quota-example" created -$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example -resourcequota "quota" created -$ kubectl describe quota quota --namespace=quota-example -Name: quota -Namespace: quota-example -Resource Used Hard --------- ---- ---- -cpu 0 20 -memory 0 1Gi -persistentvolumeclaims 0 10 -pods 0 10 -replicationcontrollers 0 20 -resourcequotas 1 1 -secrets 1 10 -services 0 5 -``` - -## More information - -See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]() - diff --git a/architecture.dia b/architecture.dia deleted file mode 100644 index 5c87409f..00000000 Binary files a/architecture.dia and /dev/null differ diff --git a/architecture.md b/architecture.md deleted file mode 100644 index 95e3aef4..00000000 --- a/architecture.md +++ /dev/null @@ -1,85 +0,0 @@ -# Kubernetes architecture - -A running Kubernetes cluster contains node agents (`kubelet`) and master -components (APIs, scheduler, etc), on top of a distributed storage solution. -This diagram shows our desired eventual state, though we're still working on a -few things, like making `kubelet` itself (all our components, really) run within -containers, and making the scheduler 100% pluggable. - -![Architecture Diagram](architecture.png?raw=true "Architecture overview") - -## The Kubernetes Node - -When looking at the architecture of the system, we'll break it down to services -that run on the worker node and services that compose the cluster-level control -plane. - -The Kubernetes node has the services necessary to run application containers and -be managed from the master systems. - -Each node runs Docker, of course. 
Docker takes care of the details of -downloading images and running containers. - -### `kubelet` - -The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their -images, their volumes, etc. - -### `kube-proxy` - -Each node also runs a simple network proxy and load balancer (see the -[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for -more details). This reflects `services` (see -[the services doc](../user-guide/services.md) for more details) as defined in -the Kubernetes API on each node and can do simple TCP and UDP stream forwarding -(round robin) across a set of backends. - -Service endpoints are currently found via [DNS](../admin/dns.md) or through -environment variables (both -[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and -Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are -supported). These variables resolve to ports managed by the service proxy. - -## The Kubernetes Control Plane - -The Kubernetes control plane is split into a set of components. Currently they -all run on a single _master_ node, but that is expected to change soon in order -to support high-availability clusters. These components work together to provide -a unified view of the cluster. - -### `etcd` - -All persistent master state is stored in an instance of `etcd`. This provides a -great way to store configuration data reliably. With `watch` support, -coordinating components can be notified very quickly of changes. - -### Kubernetes API Server - -The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a -CRUD-y server, with most/all business logic implemented in separate components -or in plug-ins. It mainly processes REST operations, validates them, and updates -the corresponding objects in `etcd` (and eventually other stores). - -### Scheduler - -The scheduler binds unscheduled pods to nodes via the `/binding` API. 
The
-scheduler is pluggable, and we expect to support multiple cluster schedulers and
-even user-provided schedulers in the future.
-
-### Kubernetes Controller Manager Server
-
-All other cluster-level functions are currently performed by the Controller
-Manager. For instance, `Endpoints` objects are created and updated by the
-endpoints controller, and nodes are discovered, managed, and monitored by the
-node controller. These could eventually be split into separate components to
-make them independently pluggable.
-
-The [`replicationcontroller`](../user-guide/replication-controller.md) is a
-mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md)
-API. We eventually plan to port it to a generic plug-in mechanism, once one is
-implemented.
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
-
diff --git a/architecture.png b/architecture.png
deleted file mode 100644
index 0ee8bceb..00000000
Binary files a/architecture.png and /dev/null differ
diff --git a/architecture.svg b/architecture.svg
deleted file mode 100644
index d6b6aab0..00000000
--- a/architecture.svg
+++ /dev/null
@@ -1,1943 +0,0 @@
[SVG markup elided: the deleted architecture diagram's text labels cover Nodes (kubelet, Proxy, docker, Pods of containers, cAdvisor), master components ("Colocated, or spread across machines, as dictated by cluster size": REST APIs for pods, services, and rep. controllers; authentication/authorization; scheduling actuator; Scheduler; controller manager), kubectl (user commands), a Firewall to the Internet, and "Distributed Watchable Storage (implemented via etcd)".]
diff --git a/aws_under_the_hood.md b/aws_under_the_hood.md
deleted file mode 100644
index 6e3c5afb..00000000
--- a/aws_under_the_hood.md
+++ /dev/null
@@ -1,310 +0,0 @@
-# Peeking under the hood of Kubernetes on AWS
-
-This document provides high-level insight into how Kubernetes works on AWS and
-maps to AWS objects. We assume that you are familiar with AWS.
-
-We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
-clusters on AWS. We recommend that you avoid manual configuration but are aware
-that sometimes it's the only option.
-
-Tip: You should open an issue and let us know what enhancements can be made to
-the scripts to better suit your needs.
-
-That said, it's also useful to know what's happening under the hood when
-Kubernetes clusters are created on AWS. This can be particularly useful if
-problems arise or in circumstances where the provided scripts are lacking and
-you manually created or configured your cluster.
-
-**Table of contents:**
- * [Architecture overview](#architecture-overview)
- * [Storage](#storage)
- * [Auto Scaling group](#auto-scaling-group)
 * [Networking](#networking)
 * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
 * [Identity and access management (IAM)](#identity-and-access-management-iam)
 * [Tagging](#tagging)
 * [AWS objects](#aws-objects)
 * [Manual infrastructure creation](#manual-infrastructure-creation)
 * [Instance boot](#instance-boot)
-
-### Architecture overview
-
-Kubernetes is a cluster of several machines that consists of a Kubernetes
-master and a set of nodes (previously known as 'minions') for which the
-master is responsible. See the [Architecture](architecture.md) topic for
-more details.
-
-By default on AWS:
-
-* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
-  modern kernel that pairs well with Docker and doesn't require a
-  reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.)
-* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
-  because this is what Google Compute Engine uses).
-
-You can override these defaults by passing different environment variables to
-kube-up.
-
-### Storage
-
-AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore).
-These can then be attached to pods that should store persistent data (e.g. if
-you're running a database).
-
-By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
-unless you create pods with persistent volumes
-[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
-containers do not have persistent storage unless you attach a persistent
-volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
-often faster, and historically more reliable.
Unless you can make do with
-whatever space is left on your root partition, you must choose an instance type
-that provides you with sufficient instance storage for your needs.
-
-To configure Kubernetes to use EBS storage, pass the environment variable
-`KUBE_AWS_STORAGE=ebs` to kube-up.
-
-Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
-track its state. Similar to nodes, containers are mostly run against instance
-storage, except that we repoint some important data onto the persistent volume.
-
-The default storage driver for Docker images is aufs. Specifying btrfs (by
-passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a
-good choice for a filesystem. btrfs is relatively reliable with Docker and has
-improved its reliability with modern kernels. It can easily span multiple
-volumes, which is particularly useful when we are using an instance type with
-multiple ephemeral instance disks.
-
-### Auto Scaling group
-
-Nodes (but not the master) are run in an
-[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
-on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
-([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
-that AWS will relaunch any nodes that are terminated.
-
-We do not currently run the master in an AutoScalingGroup, but we should
-([#11934](http://issues.k8s.io/11934)).
-
-### Networking
-
-Kubernetes uses an IP-per-pod model. This means that a node, which runs many
-pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
-routing support, so each node is assigned a /24 CIDR; the assigned CIDR is then
-configured to route to that instance in the VPC routing table.
-
-It is also possible to use overlay networking on AWS, but that is not the
-default configuration of the kube-up script.
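The per-node /24 routing scheme can be sketched as below. The cluster CIDR value and the sequential numbering are assumptions for illustration; a real allocator hands out free blocks rather than numbering nodes:

```go
package main

import (
	"fmt"
	"net"
)

// nodeCIDR derives the /24 assigned to the i-th node from a cluster-wide
// /16, mirroring the one-route-table-entry-per-node model described above.
// The 10.244.0.0/16 cluster range used in main is illustrative.
func nodeCIDR(clusterCIDR string, i int) (string, error) {
	_, ipnet, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return "", err
	}
	ip := ipnet.IP.To4()
	if ip == nil {
		return "", fmt.Errorf("IPv4 CIDR required")
	}
	// Vary the third octet to carve /24 blocks out of the /16.
	subnet := net.IPv4(ip[0], ip[1], byte(i), 0)
	return fmt.Sprintf("%s/24", subnet), nil
}

func main() {
	for i := 0; i < 3; i++ {
		cidr, _ := nodeCIDR("10.244.0.0/16", i)
		fmt.Println(cidr) // one VPC route-table entry per node
	}
}
```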
-
-### NodePort and LoadBalancer services
-
-Kubernetes on AWS integrates with [Elastic Load Balancing
-(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
-When you create a service with `Type=LoadBalancer`, Kubernetes (the
-kube-controller-manager) will create an ELB, create a security group for the
-ELB which allows access on the service ports, attach all the nodes to the ELB,
-and modify the security group for the nodes to allow traffic from the ELB to
-the nodes. This traffic reaches kube-proxy, where it is then forwarded to the
-pods.
-
-ELB has some restrictions:
-* ELB requires that all nodes listen on a single port,
-* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below
-on ELB annotations for pods speaking HTTP).
-
-To work with these restrictions, in Kubernetes, [LoadBalancer
-services](../user-guide/services.md#type-loadbalancer) are exposed as
-[NodePort services](../user-guide/services.md#type-nodeport). Then
-kube-proxy listens externally on the cluster-wide port that's assigned to
-NodePort services and forwards traffic to the corresponding pods.
-
-For example, if we configure a service of type LoadBalancer with a
-public port of 80:
-* Kubernetes will assign a NodePort to the service (e.g. port 31234).
-* ELB is configured to proxy traffic on the public port 80 to the NodePort
-assigned to the service (in this example, port 31234).
-* Then any incoming traffic that ELB forwards to the NodePort (31234)
-is recognized by kube-proxy and sent to the correct pods for that service.
-
-Note that we do not automatically open NodePort services in the AWS firewall
-(although we do open LoadBalancer services). This is because we expect that
-NodePort services are more of a building block for things like inter-cluster
-services or for LoadBalancer. To consume a NodePort service externally, you
-will likely have to open the port in the node security group
-(`kubernetes-node-`).
-
-For SSL support, starting with Kubernetes 1.3, two annotations can be added to a service:
-
-```
-service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
-```
-
-The first specifies which certificate to use. It can be either a
-certificate from a third-party issuer that was uploaded to IAM or one created
-within AWS Certificate Manager.
-
-```
-service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
-```
-
-The second annotation specifies which protocol a pod speaks. For HTTPS and
-SSL, the ELB will expect the pod to authenticate itself over the encrypted
-connection.
-
-HTTP and HTTPS will select layer 7 proxying: the ELB will terminate
-the connection with the user, parse headers, and inject the `X-Forwarded-For`
-header with the user's IP address (pods will only see the IP address of the
-ELB at the other end of its connection) when forwarding requests.
-
-TCP and SSL will select layer 4 proxying: the ELB will forward traffic without
-modifying the headers.
-
-### Identity and Access Management (IAM)
-
-kube-up sets up two IAM roles, one for the master called
-[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
-and one for the nodes called
-[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
-
-The master is responsible for creating ELBs and configuring them, as well as
-setting up advanced VPC routing. Currently it has blanket permissions on EC2,
-along with rights to create and destroy ELBs.
-
-The nodes do not need a lot of access to the AWS APIs. They need to download
-a distribution file, and then are responsible for attaching and detaching EBS
-volumes from themselves.
-
-The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
-authorization tokens, refresh them every 12 hours if needed, and fetch Docker
-images from ECR, as long as the appropriate permissions are enabled.
Those in
-[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly),
-without write access, should suffice. The master policy is probably overly
-permissive. The security-conscious may want to lock down the IAM policies
-further ([#11936](http://issues.k8s.io/11936)).
-
-We should make it easier to extend IAM permissions and also ensure that they
-are correctly configured ([#14226](http://issues.k8s.io/14226)).
-
-### Tagging
-
-All AWS resources are tagged with a tag named "KubernetesCluster", with a value
-that is the unique cluster-id. This tag is used to identify a particular
-'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
-Resources are considered to belong to the same cluster if and only if they have
-the same value in the tag named "KubernetesCluster". (The kube-up script is
-not configured to create multiple clusters in the same VPC by default, but it
-is possible to create another cluster in the same VPC.)
-
-Within the AWS cloud provider logic, we filter requests to the AWS APIs to
-match resources with our cluster tag. By filtering the requests, we ensure
-that we see only our own AWS objects.
-
-**Important:** If you choose not to use kube-up, you must pick a unique
-cluster-id value, and ensure that all AWS resources have a tag with
-`Name=KubernetesCluster,Value=`.
-
-### AWS objects
-
-The kube-up script does a number of things in AWS:
-* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
-distribution and the salt scripts into it. They are made world-readable and the
-HTTP URLs are passed to instances; this is how Kubernetes code gets onto the
-machines.
-* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
-  * `kubernetes-master` is used by the master.
-  * `kubernetes-node` is used by nodes.
-* Creates an AWS SSH key named `kubernetes-`.
Fingerprint here is
-the OpenSSH key fingerprint, so that multiple users can run the script with
-different keys and their keys will not collide (with near-certainty). It will
-use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
-one there. (With the default Ubuntu images, if you have to SSH in: the user is
-`ubuntu` and that user can `sudo`.)
-* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
-enables the `dns-support` and `dns-hostnames` options.
-* Creates an internet gateway for the VPC.
-* Creates a route table for the VPC, with the internet gateway as the default
-route.
-* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
-(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
-single AZ on AWS. There are, however, two philosophies under discussion for
-achieving high availability (HA):
-  * cluster-per-AZ: An independent cluster for each AZ, where each cluster
-is entirely separate.
-  * cross-AZ-clusters: A single cluster spans multiple AZs.
-The debate is still open: cluster-per-AZ is considered more robust, while
-cross-AZ-clusters are more convenient.
-* Associates the subnet with the route table.
-* Creates security groups for the master (`kubernetes-master-`)
-and the nodes (`kubernetes-node-`).
-* Configures security groups so that masters and nodes can communicate. This
-includes intercommunication between masters and nodes, opening SSH publicly
-for both masters and nodes, and opening port 443 on the master for the HTTPS
-API endpoints.
-* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
-`MASTER_DISK_TYPE`.
-* Launches a master with a fixed IP address (172.20.0.9) that is also
-configured for the security group and all the necessary IAM credentials. An
-instance script is used to pass vital configuration information to Salt.
Note:
-The hope is that over time we can reduce the amount of configuration
-information that must be passed in this way.
-* Once the instance is up, it attaches the EBS volume and sets up a manual
-routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
-10.246.0.0/24).
-* For auto-scaling, it creates a launch configuration and auto-scaling group
-for the nodes. The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-node-group.
-The default name is kubernetes-node-group. The auto-scaling group has min and
-max sizes that are both set to NUM_NODES. You can change the size of the
-auto-scaling group to add or remove nodes from within the AWS API or
-Console. Each node self-configures: it comes up, runs Salt with the stored
-configuration, connects to the master, and is assigned an internal CIDR;
-the master then configures the route table with the assigned CIDR. The
-kube-up script performs a health check on the nodes, but this is a self-check
-and is not required.
-
-If attempting this configuration manually, it is recommended to follow along
-with the kube-up script, being sure to tag everything with a tag with name
-`KubernetesCluster` and value set to a unique cluster-id. Also, passing the
-right configuration options to Salt when not using the script is tricky: the
-plan here is to simplify this by having Kubernetes take on more node
-configuration, and even potentially remove Salt altogether.
-
-### Manual infrastructure creation
-
-While this work is not yet complete, advanced users might choose to manually
-create certain AWS objects while still making use of the kube-up script (to
-configure Salt, for example). These objects can currently be manually created:
-* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
-* Set the `VPC_ID` environment variable to reuse an existing VPC.
-* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
-* If your route table has a matching `KubernetesCluster` tag, it will be reused. -* If your security groups are appropriately named, they will be reused. - -Currently there is no way to do the following with kube-up: -* Use an existing AWS SSH key with an arbitrary name. -* Override the IAM credentials in a sensible way -([#14226](http://issues.k8s.io/14226)). -* Use different security group permissions. -* Configure your own auto-scaling groups. - -If any of the above items apply to your situation, open an issue to request an -enhancement to the kube-up script. You should provide a complete description of -the use-case, including all the details around what you want to accomplish. - -### Instance boot - -The instance boot procedure is currently pretty complicated, primarily because -we must marshal configuration from Bash to Salt via the AWS instance script. -As we move more post-boot configuration out of Salt and into Kubernetes, we -will hopefully be able to simplify this. - -When the kube-up script launches instances, it builds an instance startup -script which includes some configuration options passed to kube-up, and -concatenates some of the scripts found in the cluster/aws/templates directory. -These scripts are responsible for mounting and formatting volumes, downloading -Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually -install Kubernetes. - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() - diff --git a/clustering.md b/clustering.md deleted file mode 100644 index ca42035b..00000000 --- a/clustering.md +++ /dev/null @@ -1,128 +0,0 @@ -# Clustering in Kubernetes - - -## Overview - -The term "clustering" refers to the process of having all members of the -Kubernetes cluster find and trust each other. There are multiple different ways -to achieve clustering with different security and usability profiles. 
This
-document attempts to lay out the user experiences for clustering that Kubernetes
-aims to address.
-
-Once a cluster is established, the following is true:
-
-1. **Master -> Node** The master needs to know which nodes can take work and
-what their current status is with respect to capacity.
-  1. **Location** The master knows the name and location of all of the nodes in
-the cluster.
-   * For the purposes of this doc, location and name should be enough
-information so that the master can open a TCP connection to the Node. Most
-probably we will make this either an IP address or a DNS name. It is going to be
-important to be consistent here (the master must be able to reach the kubelet
-on that DNS name) so that we can verify certificates appropriately.
-  2. **Target AuthN** A way to securely talk to the kubelet on that node.
-Currently we call out to the kubelet over HTTP. This should be over HTTPS and
-the master should know what CA to trust for that node.
-  3. **Caller AuthN/Z** This would be the master verifying itself (and its
-permissions) when calling the node. Currently, this is only used to collect
-statistics as authorization isn't critical. This may change in the future
-though.
-2. **Node -> Master** The nodes currently talk to the master to know which pods
-have been assigned to them and to publish events.
-  1. **Location** The nodes must know where the master is.
-  2. **Target AuthN** Since the master is assigning work to the nodes, it is
-critical that they verify whom they are talking to.
-  3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to
-the master. Ideally this authentication is specific to each node so that
-authorization can be narrowly scoped. The details of the work to run (including
-things like environment variables) might be considered sensitive and should be
-locked down also.
-
-**Note:** While the description here refers to a singular Master, in the future
-we should enable multiple Masters operating in an HA mode.
While the "Master" is
-currently the combination of the API Server, Scheduler and Controller Manager,
-we will restrict ourselves to thinking about the main API and policy engine --
-the API Server.
-
-## Current Implementation
-
-A central authority (generally the master) is responsible for determining the
-set of machines which are members of the cluster. Calls to create and remove
-worker nodes in the cluster are restricted to this single authority, and any
-other requests to add or remove worker nodes are rejected. (1.i.)
-
-Communication from the master to nodes is currently over HTTP and is not secured
-or authenticated in any way. (1.ii, 1.iii.)
-
-The location of the master is communicated out of band to the nodes. For GCE,
-this is done via Salt. Other cluster instructions/scripts use other methods.
-(2.i.)
-
-Currently most communication from the node to the master is over HTTP. When it
-is done over HTTPS, there is currently no verification of the master's
-certificate. (2.ii.)
-
-Currently, the node/kubelet is authenticated to the master via a token shared
-across all nodes. This token is distributed out of band (using Salt for GCE) and
-is optional. If it is not present then the kubelet is unable to publish events
-to the master. (2.iii.)
-
-Our current mix of out-of-band communication doesn't meet all of our needs from
-a security point of view and is difficult to set up and configure.
-
-## Proposed Solution
-
-The proposed solution will provide a range of options for setting up and
-maintaining a secure Kubernetes cluster. We want to allow both for centrally
-controlled systems (leveraging pre-existing trust and configuration systems)
-and for more ad-hoc, automagic systems that are incredibly easy to set up.
-
-The building blocks of an easier solution:
-
-* **Move to TLS** We will move to using TLS for all intra-cluster communication.
-We will explicitly identify the trust chain (the set of trusted CAs) as opposed
-to trusting the system CAs.
We will also use client certificates for all AuthN.
-* [optional] **API driven CA** Optionally, we will run a CA in the master that
-will mint certificates for the nodes/kubelets. There will be pluggable policies
-that will automatically approve certificate requests here as appropriate.
-  * **CA approval policy** This is a pluggable policy object that can
-automatically approve CA signing requests. Stock policies will include
-`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would
-be an API for evaluating and accepting/rejecting requests. Cloud providers could
-implement a policy here that verifies other out-of-band information and
-automatically approves/rejects based on other external factors.
-* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give
-a node permission to register itself.
-  * To start with, we'd have the kubelets generate a cert/account in the form of
-`kubelet:`. We would then hard-code policy such that that particular account is
-given appropriate permissions. Over time, we can make the policy
-engine more generic.
-* [optional] **Bootstrap API endpoint** This is a helper service hosted outside
-of the Kubernetes cluster that helps with initial discovery of the master.
-
-### Static Clustering
-
-In this sequence diagram, an out-of-band admin entity creates all
-certificates and distributes them. It also makes sure that the kubelets
-know where to find the master. This provides for a lot of control but is more
-difficult to set up, as lots of information must be communicated outside of
-Kubernetes.
-
-![Static Sequence Diagram](clustering/static.png)
-
-### Dynamic Clustering
-
-This diagram shows dynamic clustering using the bootstrap API endpoint. This
-endpoint is used both to find the location of the master and to communicate the
-root CA for the master.
-
-This flow has the admin manually approving the kubelet signing requests.
This is -the `queue` policy defined above. This manual intervention could be replaced by -code that can verify the signing requests via other means. - -![Dynamic Sequence Diagram](clustering/dynamic.png) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]() - diff --git a/clustering/.gitignore b/clustering/.gitignore deleted file mode 100644 index 67bcd6cb..00000000 --- a/clustering/.gitignore +++ /dev/null @@ -1 +0,0 @@ -DroidSansMono.ttf diff --git a/clustering/Dockerfile b/clustering/Dockerfile deleted file mode 100644 index e7abc753..00000000 --- a/clustering/Dockerfile +++ /dev/null @@ -1,26 +0,0 @@ -# Copyright 2016 The Kubernetes Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -FROM debian:jessie - -RUN apt-get update -RUN apt-get -qy install python-seqdiag make curl - -WORKDIR /diagrams - -RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf - -ADD . /diagrams - -CMD bash -c 'make >/dev/stderr && tar cf - *.png' \ No newline at end of file diff --git a/clustering/Makefile b/clustering/Makefile deleted file mode 100644 index e72d441e..00000000 --- a/clustering/Makefile +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright 2016 The Kubernetes Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -FONT := DroidSansMono.ttf - -PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag)) - -.PHONY: all -all: $(PNGS) - -.PHONY: watch -watch: - fswatch *.seqdiag | xargs -n 1 sh -c "make || true" - -$(FONT): - curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT) - -%.png: %.seqdiag $(FONT) - seqdiag --no-transparency -a -f '$(FONT)' $< - -# Build the stuff via a docker image -.PHONY: docker -docker: - docker build -t clustering-seqdiag . - docker run --rm clustering-seqdiag | tar xvf - - -.PHONY: docker-clean -docker-clean: - docker rmi clustering-seqdiag || true - docker images -q --filter "dangling=true" | xargs docker rmi diff --git a/clustering/README.md b/clustering/README.md deleted file mode 100644 index d7e2e2e0..00000000 --- a/clustering/README.md +++ /dev/null @@ -1,35 +0,0 @@ -This directory contains diagrams for the clustering design doc. - -This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). -Assuming you have a non-borked python install, this should be installable with: - -```sh -pip install seqdiag -``` - -Just call `make` to regenerate the diagrams. - -## Building with Docker - -If you are on a Mac or your pip install is messed up, you can easily build with -docker: - -```sh -make docker -``` - -The first run will be slow but things should be fast after that. - -To clean up the docker containers that are created (and other cruft that is left -around) you can run `make docker-clean`. 
- -## Automatically rebuild on file changes - -If you have the fswatch utility installed, you can have it monitor the file -system and automatically rebuild when files have changed. Just do a -`make watch`. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]() - diff --git a/clustering/dynamic.png b/clustering/dynamic.png deleted file mode 100644 index 92b40fee..00000000 Binary files a/clustering/dynamic.png and /dev/null differ diff --git a/clustering/dynamic.seqdiag b/clustering/dynamic.seqdiag deleted file mode 100644 index 567d5bf9..00000000 --- a/clustering/dynamic.seqdiag +++ /dev/null @@ -1,24 +0,0 @@ -seqdiag { - activation = none; - - - user[label = "Admin User"]; - bootstrap[label = "Bootstrap API\nEndpoint"]; - master; - kubelet[stacked]; - - user -> bootstrap [label="createCluster", return="cluster ID"]; - user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"]; - - user ->> master [label="start\n- bootstrap-cluster-uri"]; - master => bootstrap [label="setMaster\n- master-location\n- master-ca"]; - - user ->> kubelet [label="start\n- bootstrap-cluster-uri"]; - kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"]; - kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"]; - user => master [label="getSignRequests"]; - user => master [label="approveSignRequests"]; - kubelet <<-- master [label="returns\n- kubelet-cert"]; - - kubelet => master [label="register\n- kubelet-location"] -} diff --git a/clustering/static.png b/clustering/static.png deleted file mode 100644 index bcdeca7e..00000000 Binary files a/clustering/static.png and /dev/null differ diff --git a/clustering/static.seqdiag b/clustering/static.seqdiag deleted file mode 100644 index bdc54b76..00000000 --- a/clustering/static.seqdiag +++ /dev/null @@ -1,16 +0,0 @@ -seqdiag { - activation = none; - - admin[label = "Manual Admin"]; - ca[label = 
"Manual CA"] - master; - kubelet[stacked]; - - admin => ca [label="create\n- master-cert"]; - admin ->> master [label="start\n- ca-root\n- master-cert"]; - - admin => ca [label="create\n- kubelet-cert"]; - admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"]; - - kubelet => master [label="register\n- kubelet-location"]; -} diff --git a/command_execution_port_forwarding.md b/command_execution_port_forwarding.md deleted file mode 100644 index a7175403..00000000 --- a/command_execution_port_forwarding.md +++ /dev/null @@ -1,158 +0,0 @@ -# Container Command Execution & Port Forwarding in Kubernetes - -## Abstract - -This document describes how to use Kubernetes to execute commands in containers, -with stdin/stdout/stderr streams attached and how to implement port forwarding -to the containers. - -## Background - -See the following related issues/PRs: - -- [Support attach](http://issue.k8s.io/1521) -- [Real container ssh](http://issue.k8s.io/1513) -- [Provide easy debug network access to services](http://issue.k8s.io/1863) -- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576) - -## Motivation - -Users and administrators are accustomed to being able to access their systems -via SSH to run remote commands, get shell access, and do port forwarding. - -Supporting SSH to containers in Kubernetes is a difficult task. You must -specify a "user" and a hostname to make an SSH connection, and `sshd` requires -real users (resolvable by NSS and PAM). Because a container belongs to a pod, -and the pod belongs to a namespace, you need to specify namespace/pod/container -to uniquely identify the target container. Unfortunately, a -namespace/pod/container is not a real user as far as SSH is concerned. Also, -most Linux systems limit user names to 32 characters, which is unlikely to be -large enough to contain namespace/pod/container. 
We could devise some scheme to
-map each namespace/pod/container to a 32-character user name, adding entries to
-`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
-time. Alternatively, we could write custom NSS and PAM modules that allow the
-host to resolve a namespace/pod/container to a user without needing to keep
-files or LDAP in sync.
-
-As an alternative to SSH, we are using a multiplexed streaming protocol that
-runs on top of HTTP. There are no requirements about users being real users,
-nor is there any limitation on user name length, as the protocol is under our
-control. The only downside is that standard tooling that expects to use SSH
-won't be able to work with this mechanism, unless adapters can be written.
-
-## Constraints and Assumptions
-
-- SSH support is not currently in scope.
-- CGroup confinement is ultimately desired, but implementing that support is not
-currently in scope.
-- SELinux confinement is ultimately desired, but implementing that support is
-not currently in scope.
-
-## Use Cases
-
-- A user of a Kubernetes cluster wants to run arbitrary commands in a
-container with local stdin/stdout/stderr attached to the container.
-- A user of a Kubernetes cluster wants to connect to local ports on their
-computer and have them forwarded to ports in a container.
-
-## Process Flow
-
-### Remote Command Execution Flow
-
-1. The client connects to the Kubernetes Master to initiate a remote command
-execution request.
-2. The Master proxies the request to the Kubelet where the container lives.
-3. The Kubelet executes nsenter + the requested command and streams
-stdin/stdout/stderr back and forth between the client and the container.
-
-### Port Forwarding Flow
-
-1. The client connects to the Kubernetes Master to initiate a port forwarding
-request.
-2. The Master proxies the request to the Kubelet where the container lives.
-3. The client listens on each specified local port, awaiting local connections.
-4. The client connects to one of the local listening ports.
-5. The client notifies the Kubelet of the new connection.
-6. The Kubelet executes nsenter + socat and streams data back and forth between
-the client and the port in the container.
-
-## Design Considerations
-
-### Streaming Protocol
-
-The current multiplexed streaming protocol used is SPDY. This is not the
-long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
-we will switch to that.
-
-### Master as First Level Proxy
-
-Clients should not be allowed to communicate directly with the Kubelet for
-security reasons. Therefore, the Master is currently the only suggested entry
-point to be used for remote command execution and port forwarding. This is not
-necessarily desirable, as it means that all remote command execution and port
-forwarding traffic must travel through the Master, potentially impacting other
-API requests.
-
-In the future, it might make more sense to retrieve an authorization token from
-the Master, and then use that token to initiate a remote command execution or
-port forwarding request with a load balanced proxy service dedicated to this
-functionality. This would keep the streaming traffic out of the Master.
-
-### Kubelet as Backend Proxy
-
-The kubelet is currently responsible for handling remote command execution and
-port forwarding requests. Just like with the Master described above, this means
-that all remote command execution and port forwarding streaming traffic must
-travel through the Kubelet, which could result in a degraded ability to service
-other requests.
-
-In the future, it might make more sense to use a separate service on the node.
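To make the proxying concrete, here is a minimal sketch of the byte relay at the heart of port forwarding: a listener that accepts local connections and copies data in both directions between each client and a backend port. This is illustrative Python under simplified assumptions (raw TCP sockets, hypothetical function names); the kubelet actually multiplexes these streams over SPDY rather than opening plain sockets.

```python
import socket
import threading

def pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src reaches EOF, then half-close dst."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass  # peer may already be gone

def forward(listen_port: int, target_host: str, target_port: int) -> None:
    """Accept connections on listen_port and relay each to the target address."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", listen_port))
    server.listen(5)
    while True:
        client, _ = server.accept()
        backend = socket.create_connection((target_host, target_port))
        # One pump per direction, mirroring the bidirectional stream the
        # kubelet maintains between the client and the container port.
        threading.Thread(target=pump, args=(client, backend), daemon=True).start()
        threading.Thread(target=pump, args=(backend, client), daemon=True).start()
```

Running `forward` in a background thread and connecting to `listen_port` behaves like one hop of the flow above: bytes written locally arrive at the target port, and replies flow back.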
-
-Alternatively, we could possibly inject a process into the container that only
-listens for a single request, expose that process's listening port on the node,
-and then issue a redirect to the client such that it would connect to the first
-level proxy, which would then proxy directly to the injected process's exposed
-port. This would minimize the amount of proxying that takes place.
-
-### Scalability
-
-There are at least two ways to execute a command in a container:
-`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
-more obvious choice, it has some drawbacks.
-
-#### `docker exec`
-
-We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
-on the node), but this would require proxying from the edge and securing the
-Docker API. `docker exec` calls go through the Docker daemon, meaning that all
-stdin/stdout/stderr traffic is proxied through the daemon, adding an extra hop.
-Additionally, you can't isolate one malicious `docker exec` call from normal
-usage, meaning an attacker could initiate a denial of service or other attack
-and take down the Docker daemon, or the node itself.
-
-We expect remote command execution and port forwarding requests to be
-long-running and/or high-bandwidth operations, and routing all the streaming
-data through the Docker daemon feels like a bottleneck we can avoid.
-
-#### `nsenter`
-
-The implementation currently uses `nsenter` to run commands in containers,
-joining the appropriate container namespaces. `nsenter` runs directly on the
-node and is not proxied through any single daemon process.
-
-### Security
-
-Authentication and authorization haven't specifically been tested yet with this
-functionality. We need to make sure that users are not allowed to execute
-remote commands or do port forwarding to containers they aren't allowed to
-access.
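The check implied above, refusing streaming requests for containers a user may not access, can be sketched as a minimal allow-list. This toy policy model (a user-to-namespaces map, with hypothetical names) is purely illustrative and is not Kubernetes' actual authorization mechanism:

```python
def authorize_stream(user_namespaces: dict, request: dict) -> bool:
    """Decide whether a streaming request (exec or port-forward) may proceed.

    user_namespaces maps a user name to the set of namespaces that user may
    access; request carries 'user', 'namespace', and 'verb' keys. This is a
    deliberately simplified stand-in for a real policy engine.
    """
    if request["verb"] not in {"exec", "port-forward"}:
        return False  # only streaming verbs are handled by this check
    allowed = user_namespaces.get(request["user"], set())
    return request["namespace"] in allowed
```

A gateway would evaluate this before proxying any bytes, so an unauthorized client is rejected before a stream to the container is ever opened.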
- -Additional work is required to ensure that multiple command execution or port -forwarding connections from different clients are not able to see each other's -data. This can most likely be achieved via SELinux labeling and unique process - contexts. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]() - diff --git a/configmap.md b/configmap.md deleted file mode 100644 index 658ac73b..00000000 --- a/configmap.md +++ /dev/null @@ -1,300 +0,0 @@ -# Generic Configuration Object - -## Abstract - -The `ConfigMap` API resource stores data used for the configuration of -applications deployed on Kubernetes. - -The main focus of this resource is to: - -* Provide dynamic distribution of configuration data to deployed applications. -* Encapsulate configuration information and simplify `Kubernetes` deployments. -* Create a flexible configuration model for `Kubernetes`. - -## Motivation - -A `Secret`-like API resource is needed to store configuration data that pods can -consume. - -Goals of this design: - -1. Describe a `ConfigMap` API resource. -2. Describe the semantics of consuming `ConfigMap` as environment variables. -3. Describe the semantics of consuming `ConfigMap` as files in a volume. - -## Use Cases - -1. As a user, I want to be able to consume configuration data as environment -variables. -2. As a user, I want to be able to consume configuration data as files in a -volume. -3. As a user, I want my view of configuration data in files to be eventually -consistent with changes to the data. - -### Consuming `ConfigMap` as Environment Variables - -A series of events for consuming `ConfigMap` as environment variables: - -1. Create a `ConfigMap` object. -2. Create a pod to consume the configuration data via environment variables. -3. The pod is scheduled onto a node. -4. 
The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and -starts the container processes with the appropriate configuration data from -environment variables. - -### Consuming `ConfigMap` in Volumes - -A series of events for consuming `ConfigMap` as configuration files in a volume: - -1. Create a `ConfigMap` object. -2. Create a new pod using the `ConfigMap` via a volume plugin. -3. The pod is scheduled onto a node. -4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` -method. -5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod -and projects the appropriate configuration data into the volume. - -### Consuming `ConfigMap` Updates - -Any long-running system has configuration that is mutated over time. Changes -made to configuration data must be made visible to pods consuming data in -volumes so that they can respond to those changes. - -The `resourceVersion` of the `ConfigMap` object will be updated by the API -server every time the object is modified. After an update, modifications will be -made visible to the consumer container: - -1. Create a `ConfigMap` object. -2. Create a new pod using the `ConfigMap` via the volume plugin. -3. The pod is scheduled onto a node. -4. During the sync loop, the Kubelet creates an instance of the volume plugin -and calls its `Setup()` method. -5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod -and projects the appropriate data into the volume. -6. The `ConfigMap` referenced by the pod is updated. -7. During the next iteration of the `syncLoop`, the Kubelet creates an instance -of the volume plugin and calls its `Setup()` method. -8. The volume plugin projects the updated data into the volume atomically. - -It is the consuming pod's responsibility to make use of the updated data once it -is made visible. 
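Since updates are projected into the volume for the pod to pick up, one minimal way a consumer could respond (a hypothetical sketch, not part of the design) is to poll the mounted file and react when its content changes:

```python
import time
from pathlib import Path

def poll_for_updates(path, on_change, interval=1.0, max_polls=None):
    """Call on_change(new_text) each time the file's content changes.

    Comparing content rather than mtime also covers the case where the
    volume plugin swaps the projected data in atomically.
    """
    last = Path(path).read_text()
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        polls += 1
        try:
            current = Path(path).read_text()
        except FileNotFoundError:
            continue  # file may briefly vanish during an atomic swap; retry
        if current != last:
            last = current
            on_change(current)
```

A pod-side agent could run this in a background thread and, for example, signal the main process to reload its configuration whenever the callback fires.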
- -Because environment variables cannot be updated without restarting a container, -configuration data consumed in environment variables will not be updated. - -### Advantages - -* Easy to consume in pods; consumer-agnostic -* Configuration data is persistent and versioned -* Consumers of configuration data in volumes can respond to changes in the data - -## Proposed Design - -### API Resource - -The `ConfigMap` resource will be added to the main API: - -```go -package api - -// ConfigMap holds configuration data for pods to consume. -type ConfigMap struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - // Data contains the configuration data. Each key must be a valid - // DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN. - Data map[string]string `json:"data,omitempty"` -} - -type ConfigMapList struct { - TypeMeta `json:",inline"` - ListMeta `json:"metadata,omitempty"` - - Items []ConfigMap `json:"items"` -} -``` - -A `Registry` implementation for `ConfigMap` will be added to -`pkg/registry/configmap`. - -### Environment Variables - -The `EnvVarSource` will be extended with a new selector for `ConfigMap`: - -```go -package api - -// EnvVarSource represents a source for the value of an EnvVar. -type EnvVarSource struct { - // other fields omitted - - // Selects a key of a ConfigMap. - ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` -} - -// Selects a key from a ConfigMap. -type ConfigMapKeySelector struct { - // The ConfigMap to select from. - LocalObjectReference `json:",inline"` - // The key to select. - Key string `json:"key"` -} -``` - -### Volume Source - -A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` -object will be added to the `VolumeSource` struct in the API: - -```go -package api - -type VolumeSource struct { - // other fields omitted - ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"` -} - -// Represents a volume that holds configuration data. 
-type ConfigMapVolumeSource struct {
-	LocalObjectReference `json:",inline"`
-	// A list of keys to project into the volume.
-	// If unspecified, each key-value pair in the Data field of the
-	// referenced ConfigMap will be projected into the volume as a file whose name
-	// is the key and content is the value.
-	// If specified, the listed keys will be projected into the specified paths, and
-	// unlisted keys will not be present.
-	Items []KeyToPath `json:"items,omitempty"`
-}
-
-// Represents a mapping of a key to a relative path.
-type KeyToPath struct {
-	// The name of the key to select
-	Key string `json:"key"`
-
-	// The relative path name of the file to be created.
-	// Must not be absolute or contain the '..' path. Must be utf-8 encoded.
-	// The first item of the relative path must not start with '..'
-	Path string `json:"path"`
-}
-```
-
-**Note:** The update logic used in the downward API volume plug-in will be
-extracted and re-used in the volume plug-in for `ConfigMap`.
-
-### Changes to Secret
-
-We will update the Secret volume plugin to have a similar API to the new
-`ConfigMap` volume plugin. The secret volume plugin will also begin updating
-secret content in the volume when secrets change.
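The projection rules described in the `Items` comments above can be sketched as a small helper. This is a hypothetical illustration of the semantics, not the volume plugin's actual code:

```python
def project_config_map(data, items=None):
    """Compute the {relative_path: content} file tree a ConfigMap volume holds.

    Without items, every key in data becomes a file named after the key.
    With items, only the listed keys are projected, at the requested paths.
    """
    def validated(path):
        # Paths must be relative and must not contain a '..' segment.
        if path.startswith("/") or ".." in path.split("/"):
            raise ValueError("invalid relative path: %r" % path)
        return path

    if items is None:
        return {validated(key): value for key, value in data.items()}
    return {validated(item["path"]): data[item["key"]] for item in items}
```

For example, projecting `{"redis.conf": ...}` with an item mapping that key to `etc/redis.conf` yields a single file at `etc/redis.conf`, matching the volume example later in this document.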
- -## Examples - -#### Consuming `ConfigMap` as Environment Variables - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: etcd-env-config -data: - number-of-members: "1" - initial-cluster-state: new - initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN - discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN - discovery-url: http://etcd-discovery:2379 - etcdctl-peers: http://etcd:2379 -``` - -This pod consumes the `ConfigMap` as environment variables: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-env-example -spec: - containers: - - name: etcd - image: openshift/etcd-20-centos7 - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - env: - - name: ETCD_NUM_MEMBERS - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: number-of-members - - name: ETCD_INITIAL_CLUSTER_STATE - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: initial-cluster-state - - name: ETCD_DISCOVERY_TOKEN - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: discovery-token - - name: ETCD_DISCOVERY_URL - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: discovery-url - - name: ETCDCTL_PEERS - valueFrom: - configMapKeyRef: - name: etcd-env-config - key: etcdctl-peers -``` - -#### Consuming `ConfigMap` as Volumes - -`redis-volume-config` is intended to be used as a volume containing a config -file: - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: redis-volume-config -data: - redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n" -``` - -The following pod consumes the `redis-volume-config` in a volume: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: config-volume-example -spec: - containers: - - name: redis - image: kubernetes/redis - command: ["redis-server", "/mnt/config-map/etc/redis.conf"] - ports: - - containerPort: 6379 - volumeMounts: - - name: config-map-volume - mountPath: /mnt/config-map - volumes: - - name: config-map-volume - 
configMap: - name: redis-volume-config - items: - - path: "etc/redis.conf" - key: redis.conf -``` - -## Future Improvements - -In the future, we may add the ability to specify an init-container that can -watch the volume contents for updates and respond to changes when they occur. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]() - diff --git a/control-plane-resilience.md b/control-plane-resilience.md deleted file mode 100644 index 8193fd97..00000000 --- a/control-plane-resilience.md +++ /dev/null @@ -1,241 +0,0 @@ -# Kubernetes and Cluster Federation Control Plane Resilience - -## Long Term Design and Current Status - -### by Quinton Hoole, Mike Danese and Justin Santa-Barbara - -### December 14, 2015 - -## Summary - -Some amount of confusion exists around how we currently, and in the future, -want to ensure resilience of the Kubernetes (and by implication -Kubernetes Cluster Federation) control plane. This document is an attempt to capture that -definitively. It covers areas including self-healing, high -availability, bootstrapping and recovery. Most of the information in -this document already exists in the form of GitHub comments, -PRs/proposals, scattered documents, and corridor conversations, so this -document is primarily a consolidation and clarification of existing -ideas. - -## Terms - -* **Self-healing:** automatically restarting or replacing failed - processes and machines without human intervention -* **High availability:** continuing to be available and work correctly - even if some components are down or uncontactable. This typically - involves multiple replicas of critical services, and a reliable way - to find available replicas. Note that it's possible (but not - desirable) to have high - availability properties (e.g. multiple replicas) in the absence of - self-healing properties (e.g. if a replica fails, nothing replaces - it).
Fairly obviously, given enough time, such systems typically - become unavailable (after enough replicas have failed). -* **Bootstrapping**: creating an empty cluster from nothing -* **Recovery**: recreating a non-empty cluster after perhaps - catastrophic failure/unavailability/data corruption - -## Overall Goals - -1. **Resilience to single failures:** Kubernetes clusters constrained - to single availability zones should be resilient to individual - machine and process failures by being both self-healing and highly - available (within the context of such individual failures). -1. **Ubiquitous resilience by default:** The default cluster creation - scripts for (at least) GCE, AWS and basic bare metal should adhere - to the above (self-healing and high availability) by default (with - options available to disable these features to reduce control plane - resource requirements if so required). It is hoped that other - cloud providers will also follow the above guidelines, but the - above 3 are the primary canonical use cases. -1. **Resilience to some correlated failures:** Kubernetes clusters - which span multiple availability zones in a region should by - default be resilient to complete failure of one entire availability - zone (by similarly providing self-healing and high availability in - the default cluster creation scripts as above). -1. **Default implementation shared across cloud providers:** The - differences between the default implementations of the above for - GCE, AWS and basic bare metal should be minimized. This implies - using shared libraries across these providers in the default - scripts in preference to highly customized implementations per - cloud provider. This is not to say that highly differentiated, - customized per-cloud cluster creation processes (e.g. for GKE on - GCE, or some hosted Kubernetes provider on AWS) are discouraged. - But those fall squarely outside the basic cross-platform OSS - Kubernetes distro. -1. 
**Self-hosting:** Where possible, Kubernetes's existing mechanisms - for achieving system resilience (replication controllers, health - checking, service load balancing etc) should be used in preference - to building a separate set of mechanisms to achieve the same thing. - This implies that self hosting (the kubernetes control plane on - kubernetes) is strongly preferred, with the caveat below. -1. **Recovery from catastrophic failure:** The ability to quickly and - reliably recover a cluster from catastrophic failure is critical, - and should not be compromised by the above goal to self-host - (i.e. it goes without saying that the cluster should be quickly and - reliably recoverable, even if the cluster control plane is - broken). This implies that such catastrophic failure scenarios - should be carefully thought out, and the subject of regular - continuous integration testing, and disaster recovery exercises. - -## Relative Priorities - -1. **(Possibly manual) recovery from catastrophic failures:** having a -Kubernetes cluster, and all applications running inside it, disappear forever -perhaps is the worst possible failure mode. So it is critical that we be able to -recover the applications running inside a cluster from such failures in some -well-bounded time period. - 1. In theory a cluster can be recovered by replaying all API calls - that have ever been executed against it, in order, but most - often that state has been lost, and/or is scattered across - multiple client applications or groups. So in general it is - probably infeasible. - 1. In theory a cluster can also be recovered to some relatively - recent non-corrupt backup/snapshot of the disk(s) backing the - etcd cluster state. But we have no default consistent - backup/snapshot, verification or restoration process. 
And we - don't routinely test restoration, so even if we did routinely - perform and verify backups, we have no hard evidence that we - can in practice effectively recover from catastrophic cluster - failure or data corruption by restoring from these backups. So - there's more work to be done here. -1. **Self-healing:** Most major cloud providers provide the ability to - easily and automatically replace failed virtual machines within a - small number of minutes (e.g. GCE - [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart) - and Managed Instance Groups, - AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/) - and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This - can fairly trivially be used to reduce control-plane down-time due - to machine failure to a small number of minutes per failure - (i.e. typically around "3 nines" availability), provided that: - 1. cluster persistent state (i.e. etcd disks) is either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. using etcd [dynamic member - addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member) - or [backup and - recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)). - 1. and boot disks are either: - 1. truly persistent (i.e. remote persistent disks), or - 1. reconstructible (e.g. using boot-from-snapshot, - boot-from-pre-configured-image or - boot-from-auto-initializing image). -1. **High Availability:** This has the potential to increase - availability above the approximately "3 nines" level provided by - automated self-healing, but it's somewhat more complex, and - requires additional resources (e.g. redundant API servers and etcd - quorum members). In environments where cloud-assisted automatic - self-healing might be infeasible (e.g.
on-premise bare-metal - deployments), it also gives cluster administrators more time to - respond (e.g. replace/repair failed machines) without incurring - system downtime. - -## Design and Status (as of December 2015) - - - - - - - - - - - - - - - - - - - - - - -
Control Plane Component | Resilience Plan | Current Status
API Server - -Multiple stateless, self-hosted, self-healing API servers behind an HA -load balancer, built out by the default "kube-up" automation on GCE, -AWS and basic bare metal (BBM). Note that the single-host approach of -having etcd listen only on localhost to ensure that only the API server can -connect to it will no longer work, so alternative security will be -needed in that regard (either using firewall rules, SSL certs, or -something else). All necessary flags are currently supported to enable -SSL between API server and etcd (OpenShift runs like this out of the -box), but this needs to be woven into the "kube-up" and related -scripts. Detailed design of self-hosting and related bootstrapping -and catastrophic failure recovery will be covered in a separate -design doc. - - - -No scripted self-healing or HA on GCE, AWS or basic bare metal -currently exists in the OSS distro. To be clear, "no self-healing" -means that even if multiple e.g. API servers are provisioned for HA -purposes, if they fail, nothing replaces them, so eventually the -system will fail. Self-healing and HA can be set up -manually by following documented instructions, but this is not -currently an automated process, and it is not tested as part of -continuous integration. So it's probably safest to assume that it -doesn't actually work in practice. - -
Controller manager and scheduler - -Multiple self-hosted, self-healing, warm-standby, stateless controller -managers and schedulers with leader election and automatic failover of API -server clients, automatically installed by the default "kube-up" automation. - -As above.
etcd - -Multiple (3-5) etcd quorum members behind a load balancer with session -affinity (to prevent clients from being bounced from one to another). - -Regarding self-healing, if a node running etcd goes down, it is always necessary -to do three things: -
    -
1. allocate a new node (not necessary if running etcd as a pod, in -which case specific measures are required to prevent user pods from -interfering with system pods, for example using node selectors), as -described in -dynamic member addition. - -In the case of remote persistent disk, the etcd state can be recovered by -attaching the remote persistent disk to the replacement node, so the state is -recoverable even if all other replicas are down. - -There are also significant performance differences between local disks and remote -persistent disks. For example, the -sustained throughput of local disks in GCE is approximately 20x that of remote -disks. - -Hence we suggest that self-healing be provided by remotely mounted persistent -disks in non-performance-critical, single-zone cloud deployments. For -performance-critical installations, faster local SSDs should be used, in which -case remounting on node failure is not an option, so -etcd runtime configuration should be used to replace the failed machine. -Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so -automatic -runtime configuration is required. Similarly, basic bare metal deployments -cannot generally rely on remote persistent disks, so the same approach applies -there. 
- -Somewhat vague instructions exist on how to set some of this up manually in -a self-hosted configuration. But automatic bootstrapping and self-healing is not -described (and is not implemented for the non-PD cases). This all still needs to -be automated and continuously tested. -
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]() - diff --git a/daemon.md b/daemon.md deleted file mode 100644 index 2c306056..00000000 --- a/daemon.md +++ /dev/null @@ -1,206 +0,0 @@ -# DaemonSet in Kubernetes - -**Author**: Ananya Kumar (@AnanyaKumar) - -**Status**: Implemented. - -This document presents the design of the Kubernetes DaemonSet, describes use -cases, and gives an overview of the code. - -## Motivation - -Many users have asked for a way to run a daemon on every node in a -Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential -for use cases such as building a sharded datastore or running a logger on every -node. Enter the DaemonSet: a way to conveniently create and manage -daemon-like workloads in Kubernetes. - -## Use Cases - -The DaemonSet can be used for user-specified system services, cluster-level -applications with strong node ties, and Kubernetes node services. Below are -example use cases in each category. - -### User-Specified System Services - -Logging: Some users want a way to collect statistics about nodes in a cluster -and send those logs to an external database. For example, system administrators -might want to know if their machines are performing as expected, if they need to -add more machines to the cluster, or if they should switch cloud providers. The -DaemonSet can be used to run a data collection service (for example fluentd) on -every node and send the data to a service like ElasticSearch for analysis. - -### Cluster-Level Applications - -Datastore: Users might want to implement a sharded datastore in their cluster. A -few nodes in the cluster, labeled ‘app=datastore’, might be responsible for -storing data shards, and pods running on these nodes might serve data. This -architecture requires a way to bind pods to specific nodes, so it cannot be -achieved using a Replication Controller. 
A DaemonSet is a convenient way to -implement such a datastore. - -For other uses, see the related [feature request](https://issues.k8s.io/1518) - -## Functionality - -The DaemonSet supports standard API features: - - create - - The spec for DaemonSets has a pod template field. - - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate -over nodes that have a certain label. For example, suppose that in a cluster -some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a -datastore pod on exactly those nodes labeled ‘app=database’. - - Using the pod's nodeName field, DaemonSets can be restricted to operate on a -specified node. - - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec -used by the Replication Controller. - - The initial implementation will not guarantee that DaemonSet pods are -created on nodes before other pods. - - The initial implementation of DaemonSet does not guarantee that DaemonSet -pods show up on nodes (for example because of resource limitations of the node), -but makes a best effort to launch DaemonSet pods (like Replication Controllers -do with pods). Subsequent revisions might ensure that DaemonSet pods show up on -nodes, preempting other pods if necessary. - - The DaemonSet controller adds an annotation: -```"kubernetes.io/created-by: \"``` - - YAML example: - - ```YAML - apiVersion: extensions/v1beta1 - kind: DaemonSet - metadata: - labels: - app: datastore - name: datastore - spec: - template: - metadata: - labels: - app: datastore-shard - spec: - nodeSelector: - app: datastore-node - containers: - name: datastore-shard - image: kubernetes/sharded - ports: - - containerPort: 9042 - name: main -``` - - - commands that get info: - - get (e.g. 
kubectl get daemonsets) - - describe - - Modifiers: - - delete (if --cascade=true, then first the client turns down all the pods -controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is -unlikely to be set on any node); then it deletes the DaemonSet; then it deletes -the pods) - - label - - annotate - - update operations like patch and replace (only allowed to selector and to -nodeSelector and nodeName of pod template) - - DaemonSets have labels, so you could, for example, list all DaemonSets -with certain labels (the same way you would for a Replication Controller). - -In general, for all the supported features like get, describe, update, etc, -the DaemonSet works in a similar way to the Replication Controller. However, -note that the DaemonSet and the Replication Controller are different constructs. - -### Persisting Pods - - - Ordinary liveness probes specified in the pod template work to keep pods -created by a DaemonSet running. - - If a daemon pod is killed or stopped, the DaemonSet will create a new -replica of the daemon pod on the node. - -### Cluster Mutations - - - When a new node is added to the cluster, the DaemonSet controller starts -daemon pods on the node for DaemonSets whose pod template nodeSelectors match -the node’s labels. - - Suppose the user launches a DaemonSet that runs a logging daemon on all -nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label -to a node (that did not initially have the label), the logging daemon will -launch on the node. Additionally, if a user removes the label from a node, the -logging daemon on that node will be killed. - -## Alternatives Considered - -We considered several alternatives, that were deemed inferior to the approach of -creating a new DaemonSet abstraction. - -One alternative is to include the daemon in the machine image. 
In this case it -would run outside of Kubernetes proper, and thus not be monitored, health -checked, usable as a service endpoint, easily upgradable, etc. - -A related alternative is to package daemons as static pods. This would address -most of the problems described above, but they would still not be easily -upgradable, and more generally could not be managed through the API server -interface. - -A third alternative is to generalize the Replication Controller. We would do -something like: if you set the `replicas` field of the ReplicationControllerSpec -to -1, then it means "run exactly one replica on every node matching the -nodeSelector in the pod template." The ReplicationController would pretend -`replicas` had been set to some large number -- larger than the largest number -of nodes ever expected in the cluster -- and would use some anti-affinity -mechanism to ensure that no more than one Pod from the ReplicationController -runs on any given node. There are two downsides to this approach. First, -there would always be a large number of Pending pods in the scheduler (these -will be scheduled onto new machines when they are added to the cluster). The -second downside is more philosophical: DaemonSet and the Replication Controller -are very different concepts. We believe that having small, targeted controllers -for distinct purposes makes Kubernetes easier to understand and use, compared to -having larger multi-functional controllers (see -["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for -some discussion of this topic). - -## Design - -#### Client - -- Add support for DaemonSet commands to kubectl and the client. Client code was -added to pkg/client/unversioned. The main files in Kubectl that were modified are -pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create, -and Update, the client simply forwards the request to the backend via the REST -API. 
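As background for the components below, the core rule from the Cluster Mutations section (a node should run a DaemonSet's pod exactly when its labels satisfy the pod template's nodeSelector) can be sketched as follows. This is an illustrative sketch only; `nodeMatches` is a hypothetical helper, not the actual controller code:

```go
package main

import "fmt"

// nodeMatches reports whether a node's labels satisfy a DaemonSet's
// nodeSelector: every selector key must be present on the node with an
// equal value. An empty selector matches every node.
func nodeMatches(nodeLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{"app": "datastore-node", "zone": "a"}
	fmt.Println(nodeMatches(node, map[string]string{"app": "datastore-node"})) // true
	fmt.Println(nodeMatches(node, map[string]string{"app": "database"}))       // false
}
```

When a node is added or relabeled, the daemon manager would re-evaluate this predicate for each DaemonSet, creating the daemon pod where it newly matches and killing it where it no longer does.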
- -#### Apiserver - -- Accept, parse, validate client commands -- REST API calls are handled in pkg/registry/daemonset - - In particular, the API server will add the object to etcd - - DaemonManager listens for updates to etcd (using Framework.informer) -- API objects for DaemonSet were created in expapi/v1/types.go and -expapi/v1/register.go -- Validation code is in expapi/validation - -#### Daemon Manager - -- Creates new DaemonSets when requested. Launches the corresponding daemon pod -on all nodes with labels matching the new DaemonSet’s selector. -- Listens for addition of new nodes to the cluster by setting up a -framework.NewInformer that watches for the creation of Node API objects. When a -new node is added, the daemon manager will loop through each DaemonSet. If the -label of the node matches the selector of the DaemonSet, then the daemon manager -will create the corresponding daemon pod in the new node. -- The daemon manager creates a pod on a node by sending a command to the API -server, requesting that a pod be bound to the node (the node will be specified -via its hostname). - -#### Kubelet - -- Does not need to be modified, but health checking will occur for the daemon -pods and will revive them if they are killed (we set the pod restartPolicy to -Always). We reject DaemonSet objects with pod templates that don’t have -restartPolicy set to Always. - -## Open Issues - -- Should work similarly to [Deployment](http://issues.k8s.io/1743). - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]() - diff --git a/design/README.md b/design/README.md new file mode 100644 index 00000000..85fc8245 --- /dev/null +++ b/design/README.md @@ -0,0 +1,62 @@ +# Kubernetes Design Overview + +Kubernetes is a system for managing containerized applications across multiple +hosts, providing basic mechanisms for deployment, maintenance, and scaling of +applications. 
+ +Kubernetes establishes robust declarative primitives for maintaining the desired +state requested by the user. We see these primitives as the main value added by +Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and +replicating containers require active controllers, not just imperative +orchestration. + +Kubernetes is primarily targeted at applications composed of multiple +containers, such as elastic, distributed micro-services. It is also designed to +facilitate migration of non-containerized application stacks to Kubernetes. It +therefore includes abstractions for grouping containers in both loosely coupled +and tightly coupled formations, and provides ways for containers to find and +communicate with each other in relatively familiar ways. + +Kubernetes enables users to ask a cluster to run a set of containers. The system +automatically chooses hosts to run those containers on. While Kubernetes's +scheduler is currently very simple, we expect it to grow in sophistication over +time. Scheduling is a policy-rich, topology-aware, workload-specific function +that significantly impacts availability, performance, and capacity. The +scheduler needs to take into account individual and collective resource +requirements, quality of service requirements, hardware/software/policy +constraints, affinity and anti-affinity specifications, data locality, +inter-workload interference, deadlines, and so on. Workload-specific +requirements will be exposed through the API as necessary. + +Kubernetes is intended to run on a number of cloud providers, as well as on +physical hosts. + +A single Kubernetes cluster is not intended to span multiple availability zones. +Instead, we recommend building a higher-level layer to replicate complete +deployments of highly available applications across multiple zones (see +[the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md) +for more details). 
+ +Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS +platform and toolkit. Therefore, architecturally, we want Kubernetes to be built +as a collection of pluggable components and layers, with the ability to use +alternative schedulers, controllers, storage systems, and distribution +mechanisms, and we're evolving its current code in that direction. Furthermore, +we want others to be able to extend Kubernetes functionality, such as with +higher-level PaaS functionality or multi-cluster layers, without modification of +core Kubernetes source. Therefore, its API isn't just (or even necessarily +mainly) targeted at end users, but at tool and extension developers. Its APIs +are intended to serve as the foundation for an open ecosystem of tools, +automation systems, and higher-level API layers. Consequently, there are no +"internal" inter-component APIs. All APIs are visible and available, including +the APIs used by the scheduler, the node controller, the replication-controller +manager, Kubelet's API, etc. There's no glass to break -- in order to handle +more complex use cases, one can just access the lower-level APIs in a fully +transparent, composable manner. + +For more about the Kubernetes architecture, see [architecture](architecture.md). + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/README.md?pixel)]() + diff --git a/design/access.md b/design/access.md new file mode 100644 index 00000000..b23e463b --- /dev/null +++ b/design/access.md @@ -0,0 +1,376 @@ +# K8s Identity and Access Management Sketch + +This document suggests a direction for identity and access management in the +Kubernetes system. + + +## Background + +High level goals are: + - Have a plan for how identity, authentication, and authorization will fit in +to the API. + - Have a plan for partitioning resources within a cluster between independent +organizational units. 
+ - Ease integration with existing enterprise and hosted scenarios. + +### Actors + +Each of these can act as normal users or attackers. + - External Users: People who are accessing applications running on K8s (e.g. +a web site served by webserver running in a container on K8s), but who do not +have K8s API access. + - K8s Users: People who access the K8s API (e.g. create K8s API objects like +Pods) + - K8s Project Admins: People who manage access for some K8s Users + - K8s Cluster Admins: People who control the machines, networks, or binaries +that make up a K8s cluster. + - K8s Admin means K8s Cluster Admins and K8s Project Admins taken together. + +### Threats + +Both intentional attacks and accidental use of privilege are concerns. + +For both cases it may be useful to think about these categories differently: + - Application Path - attack by sending network messages from the internet to +the IP/port of any application running on K8s. May exploit weakness in +application or misconfiguration of K8s. + - K8s API Path - attack by sending network messages to any K8s API endpoint. + - Insider Path - attack on K8s system components. Attacker may have +privileged access to networks, machines or K8s software and data. Software +errors in K8s system components and administrator error are some types of threat +in this category. + +This document is primarily concerned with K8s API paths, and secondarily with +Internal paths. The Application path also needs to be secure, but is not the +focus of this document. + +### Assets to protect + +External User assets: + - Personal information like private messages, or images uploaded by External +Users. + - web server logs. + +K8s User assets: + - External User assets of each K8s User. 
+ - things private to the K8s app, like: + - credentials for accessing other services (docker private repos, storage +services, facebook, etc) + - SSL certificates for web servers + - proprietary data and code + +K8s Cluster assets: + - Assets of each K8s User. + - Machine Certificates or secrets. + - The value of K8s cluster computing resources (cpu, memory, etc). + +This document is primarily about protecting K8s User assets and K8s cluster +assets from other K8s Users and K8s Project and Cluster Admins. + +### Usage environments + +Cluster in Small organization: + - K8s Admins may be the same people as K8s Users. + - Few K8s Admins. + - Prefer ease of use to fine-grained access control/precise accounting, etc. + - Product requirement that it be easy for potential K8s Cluster Admin to try +out setting up a simple cluster. + +Cluster in Large organization: + - K8s Admins typically distinct people from K8s Users. May need to divide +K8s Cluster Admin access by roles. + - K8s Users need to be protected from each other. + - Auditing of K8s User and K8s Admin actions important. + - Flexible accurate usage accounting and resource controls important. + - Lots of automated access to APIs. + - Need to integrate with existing enterprise directory, authentication, +accounting, auditing, and security policy infrastructure. + +Org-run cluster: + - Organization that runs K8s master components is same as the org that runs +apps on K8s. + - Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix. + +Hosted cluster: + - Offering K8s API as a service, or offering a Paas or Saas built on K8s. + - May already offer web services, and need to integrate with existing customer +account concept, and existing authentication, accounting, auditing, and security +policy infrastructure. + - May want to leverage K8s User accounts and accounting to manage their User +accounts (not a priority to support this use case.) + - Precise and accurate accounting of resources needed. 
Resource controls +needed for hard limits (Users given limited slice of data) and soft limits +(Users can grow up to some limit and then be expanded). + +K8s ecosystem services: + - There may be companies that want to offer their existing services (Build, CI, +A/B-test, release automation, etc) for use with K8s. There should be some story +for this case. + +Pods configs should be largely portable between Org-run and hosted +configurations. + + +# Design + +Related discussion: +- http://issue.k8s.io/442 +- http://issue.k8s.io/443 + +This doc describes two security profiles: + - Simple profile: like single-user mode. Make it easy to evaluate K8s +without lots of configuring accounts and policies. Protects from unauthorized +users, but does not partition authorized users. + - Enterprise profile: Provide mechanisms needed for large numbers of users. +Defense in depth. Should integrate with existing enterprise security +infrastructure. + +K8s distribution should include templates of config, and documentation, for +simple and enterprise profiles. System should be flexible enough for +knowledgeable users to create intermediate profiles, but K8s developers should +only reason about those two Profiles, not a matrix. + +Features in this doc are divided into "Initial Feature", and "Improvements". +Initial features would be candidates for version 1.00. + +## Identity + +### userAccount + +K8s will have a `userAccount` API object. +- `userAccount` has a UID which is immutable. This is used to associate users +with objects and to record actions in audit logs. +- `userAccount` has a name which is a string and human readable and unique among +userAccounts. It is used to refer to users in Policies, to ensure that the +Policies are human readable. It can be changed only when there are no Policy +objects or other objects which refer to that name. An email address is a +suggested format for this field. 
+
+- `userAccount` is not related to the unix username of processes in Pods created
+by that userAccount.
+- `userAccount` API objects can have labels.
+
+The system may associate one or more Authentication Methods with a
+`userAccount` (but they are not formally part of the userAccount object).
+
+In a simple deployment, the authentication method for a user might be an
+authentication token which is verified by a K8s server. In a more complex
+deployment, authentication might be delegated to another system which is
+trusted by the K8s API to authenticate users, but where the authentication
+details are unknown to K8s.
+
+Initial Features:
+- There is no superuser `userAccount`.
+- `userAccount` objects are statically populated in the K8s API store by reading
+a config file. Only a K8s Cluster Admin can do this.
+- `userAccount` can have a default `namespace`. If an API call does not specify a
+`namespace`, the default `namespace` for that caller is assumed.
+- `userAccount` is global. A single human with access to multiple namespaces is
+recommended to have only one userAccount.
+
+Improvements:
+- Make `userAccount` part of a separate API group from core K8s objects like
+`pod`. This facilitates plugging in alternate Access Management.
+
+Simple Profile:
+ - A single `userAccount`, used by all K8s Users and Project Admins. One access
+token shared by all.
+
+Enterprise Profile:
+ - Every human user has their own `userAccount`.
+ - `userAccount`s have labels that indicate both membership in groups and the
+ability to act in certain roles.
+ - Each service using the API has its own `userAccount` too (e.g. `scheduler`,
+`repcontroller`).
+ - Automated jobs to denormalize the LDAP group info into the local list of
+users in the K8s userAccount file.
+
+### Unix accounts
+
+A `userAccount` is not a Unix user account.
The fact that a pod is started by a
+`userAccount` does not mean that the processes in that pod's containers run as a
+Unix user with a corresponding name or identity.
+
+Initially:
+- The unix accounts available in a container, and used by the processes running
+in a container, are those that are provided by the combination of the base
+operating system and the Docker manifest.
+- Kubernetes doesn't enforce any relation between `userAccount` and unix
+accounts.
+
+Improvements:
+- Kubelet allocates disjoint blocks of root-namespace uids for each container.
+This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
+  - Requires docker to integrate user namespace support, and deciding what
+getpwnam() does for these uids.
+- Any features that help users avoid use of privileged containers
+(http://issue.k8s.io/391)
+
+### Namespaces
+
+K8s will have a `namespace` API object. It is similar to a Google Compute
+Engine `project`. It provides a namespace for objects created by a group of
+people co-operating together, preventing name collisions with non-cooperating
+groups. It also serves as a reference point for authorization policies.
+
+Namespaces are described in [namespaces.md](namespaces.md).
+
+In the Enterprise Profile:
+ - a `userAccount` may have permission to access several `namespace`s.
+
+In the Simple Profile:
+ - There is a single `namespace` used by the single user.
+
+Namespaces vs. userAccounts vs. Labels:
+- `userAccount`s are intended for audit logging (both name and UID should be
+logged), and to define who has access to `namespace`s.
+- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md))
+should be used to distinguish pods, users, and other objects that cooperate
+towards a common goal but are different in some way, such as version or
+responsibilities.
+
+- `namespace`s prevent name collisions between uncoordinated groups of people,
+and provide a place to attach common policies for co-operating groups of people.
+
+
+## Authentication
+
+Goals for K8s authentication:
+- Include a built-in authentication system, with no configuration required to use
+in single-user mode, little configuration required to add several user
+accounts, and no https proxy required.
+- Allow for authentication to be handled by a system external to Kubernetes, to
+allow integration with existing enterprise authorization systems. The
+Kubernetes namespace itself should avoid taking contributions of multiple
+authorization schemes. Instead, a trusted proxy in front of the apiserver can be
+used to authenticate users.
+  - For organizations whose security requirements only allow FIPS compliant
+implementations (e.g. apache) for authentication.
+  - So the proxy can terminate SSL, and isolate the CA-signed certificate from
+the less trusted, higher-touch apiserver.
+  - For organizations that already have existing SaaS web services (e.g.
+storage, VMs) and want a common authentication portal.
+- Avoid mixing authentication and authorization, so that authorization policies
+can be centrally managed, and to allow changes in authentication methods without
+affecting authorization code.
+
+Initially:
+- Tokens are used to authenticate a user.
+- Long-lived tokens identify a particular `userAccount`.
+- An administrator utility generates tokens at cluster setup.
+- OAuth 2.0 Bearer Token protocol, http://tools.ietf.org/html/rfc6750
+- No scopes for tokens. Authorization happens in the API server.
+- Tokens are dynamically generated by the apiserver to identify pods which are
+making API calls.
+- Tokens are checked in a module of the apiserver.
+- Authentication in the apiserver can be disabled by a flag, to allow testing
+without authorization enabled, and to allow use of an authenticating proxy.
In this
+mode, a query parameter or header added by the proxy will identify the caller.
+
+Improvements:
+- Refresh of tokens.
+- SSH keys to access inside containers.
+
+To be considered for subsequent versions:
+- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749).
+- Scoped tokens.
+- Tokens that are bound to the channel between the client and the API server:
+  - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
+  - http://www.browserauth.net
+
+## Authorization
+
+K8s authorization should:
+- Allow for a range of maturity levels, from single-user for those test-driving
+the system, to integration with existing enterprise authorization systems.
+- Allow for centralized management of users and policies. In some
+organizations, this will mean that the definition of users and access policies
+needs to reside on a system other than K8s and encompass other web services
+(such as a storage service).
+- Allow processes running in K8s Pods to take on identity, and allow narrow
+scoping of permissions for those identities in order to limit damage from
+software faults.
+- Have Authorization Policies exposed as API objects so that a single config
+file can create or delete Pods, Replication Controllers, Services, and the
+identities and policies for those Pods and Replication Controllers.
+- Be separate as much as practical from Authentication, to allow Authentication
+methods to change over time and space, without impacting Authorization policies.
+
+K8s will implement a relatively simple
+[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
+
+The model will be described in more detail in a forthcoming document. The model
+will:
+- Be less complex than XACML.
+- Be easily recognizable to those familiar with Amazon IAM Policies.
+- Have a subset/aliases/defaults which allow it to be used in a way comfortable
+to those users more familiar with Role-Based Access Control.
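The attribute-matching at the heart of such a model can be sketched in a few lines of Go. This is a hypothetical illustration only: the types, field names, and wildcard semantics below are assumptions for the example, not the actual policy schema, which is deferred to the forthcoming document.

```go
package main

import "fmt"

// Attributes describes one API call (hypothetical shape for illustration).
type Attributes struct {
	User      string // userAccount name, e.g. an email address
	Namespace string
	Resource  string // e.g. "pods"
	Verb      string // e.g. "create"
}

// Policy grants a user a verb on a resource, optionally scoped to a namespace.
// In this sketch, an empty field matches anything.
type Policy struct {
	User, Namespace, Resource, Verb string
}

func (p Policy) matches(a Attributes) bool {
	match := func(pat, val string) bool { return pat == "" || pat == val }
	return match(p.User, a.User) && match(p.Namespace, a.Namespace) &&
		match(p.Resource, a.Resource) && match(p.Verb, a.Verb)
}

// Authorize plays the Decision Point: allow iff any policy matches.
func Authorize(policies []Policy, a Attributes) bool {
	for _, p := range policies {
		if p.matches(a) {
			return true
		}
	}
	return false
}

func main() {
	policies := []Policy{
		// alice may do anything to pods in the "web" namespace.
		{User: "alice@example.com", Namespace: "web", Resource: "pods"},
	}
	fmt.Println(Authorize(policies, Attributes{
		User: "alice@example.com", Namespace: "web", Resource: "pods", Verb: "create"})) // true
	fmt.Println(Authorize(policies, Attributes{
		User: "alice@example.com", Namespace: "db", Resource: "pods", Verb: "create"})) // false
}
```

Default-deny with explicit allow rules, as above, is what makes the model recognizable to users of Amazon IAM policies.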
+
+Authorization policy is set by creating a set of Policy objects.
+
+The API Server will be the Enforcement Point for Policy. For each API call that
+it receives, it will construct the Attributes needed to evaluate the policy
+(what user is making the call, what resource they are accessing, what they are
+trying to do to that resource, etc) and pass those attributes to a Decision
+Point. The Decision Point code evaluates the Attributes against all the Policies
+and allows or denies the API call. The system will be modular enough that the
+Decision Point code can either be linked into the APIserver binary, or be
+another service that the apiserver calls for each Decision (with appropriate
+time-limited caching as needed for performance).
+
+Some Policy objects may be applicable only to a single namespace; K8s Project
+Admins would be able to create those as needed. Other Policy objects may be
+applicable to all namespaces; a K8s Cluster Admin might create those in order
+to authorize a new type of controller to be used by all namespaces, or to make
+a K8s User into a K8s Project Admin.
+
+## Accounting
+
+The API should have a `quota` concept (see http://issue.k8s.io/442). A quota
+object relates a namespace (and optionally a label selector) to a maximum
+quantity of resources that may be used (see the [resources design doc](resources.md)).
+
+Initially:
+- A `quota` object is immutable.
+- For hosted K8s systems that do billing, Project is the recommended level for
+billing accounts.
+- Every object that consumes resources should have a `namespace` so that
+resource usage stats can be rolled up to the `namespace`.
+- The K8s Cluster Admin sets quota objects by writing a config file.
+
+Improvements:
+- Allow one namespace to charge the quota for one or more other namespaces. This
+would be controlled by a policy which allows changing a `billing_namespace=`
+label on an object.
+- Allow quota to be set by namespace owners for (namespace x label) combinations
+(e.g.
let "webserver" namespace use 100 cores, but to prevent accidents, don't
+allow "webserver" namespace and "instance=test" to use more than 10 cores).
+- Tools to help write consistent quota config files based on number of nodes,
+historical namespace usages, QoS needs, etc.
+- A way for the K8s Cluster Admin to incrementally adjust Quota objects.
+
+Simple profile:
+ - A single `namespace` with infinite resource limits.
+
+Enterprise profile:
+ - Multiple namespaces, each with their own limits.
+
+Issues:
+- Need for locking or "eventual consistency" when multiple apiserver goroutines
+are accessing the object store and handling pod creations.
+
+
+## Audit Logging
+
+API actions can be logged.
+
+Initial implementation:
+- All API calls logged to nginx logs.
+
+Improvements:
+- The API server does logging instead.
+- Policies to drop logging for high-rate trusted API calls, or by users
+performing audit or other sensitive functions.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/access.md?pixel)]()
+
diff --git a/design/admission_control.md b/design/admission_control.md
new file mode 100644
index 00000000..a7330104
--- /dev/null
+++ b/design/admission_control.md
@@ -0,0 +1,106 @@
+# Kubernetes Proposal - Admission Control
+
+**Related PR:**
+
+| Topic | Link |
+| ----- | ---- |
+| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
+
+## Background
+
+High level goals:
+* Enable an easy-to-use mechanism to provide admission control to a cluster.
+* Enable a provider to support multiple admission control strategies, or author
+their own.
+* Ensure any rejected request can propagate errors back to the caller explaining
+why the request failed.
+
+Authorization via policy is focused on answering whether a user is authorized to
+perform an action.
+
+Admission Control is focused on whether the system will accept an authorized
+action.
+
+Kubernetes may choose to dismiss an authorized action based on any number of
+admission control strategies.
+ +This proposal documents the basic design, and describes how any number of +admission control plug-ins could be injected. + +Implementation of specific admission control strategies are handled in separate +documents. + +## kube-apiserver + +The kube-apiserver takes the following OPTIONAL arguments to enable admission +control: + +| Option | Behavior | +| ------ | -------- | +| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. | +| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. | + +An **AdmissionControl** plug-in is an implementation of the following interface: + +```go +package admission + +// Attributes is an interface used by a plug-in to make an admission decision +// on a individual request. +type Attributes interface { + GetNamespace() string + GetKind() string + GetOperation() string + GetObject() runtime.Object +} + +// Interface is an abstract, pluggable interface for Admission Control decisions. +type Interface interface { + // Admit makes an admission decision based on the request attributes + // An error is returned if it denies the request. + Admit(a Attributes) (err error) +} +``` + +A **plug-in** must be compiled with the binary, and is registered as an +available option by providing a name, and implementation of admission.Interface. + +```go +func init() { + admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) { return NewAlwaysDeny(), nil }) +} +``` + +A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go) + +```go + // Admission policies + _ "k8s.io/kubernetes/plugin/pkg/admission/admit" + _ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages" + _ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity" + ... 
+ _ "" +``` + +Invocation of admission control is handled by the **APIServer** and not +individual **RESTStorage** implementations. + +This design assumes that **Issue 297** is adopted, and as a consequence, the +general framework of the APIServer request/response flow will ensure the +following: + +1. Incoming request +2. Authenticate user +3. Authorize user +4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes) + - invoke each admission.Interface object in sequence +5. Case on the operation: + - If operation=create|update, then validate(object) and persist + - If operation=delete, delete the object + - If operation=connect, exec + +If at any step, there is an error, the request is canceled. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control.md?pixel)]() + diff --git a/design/admission_control_limit_range.md b/design/admission_control_limit_range.md new file mode 100644 index 00000000..06cce2cb --- /dev/null +++ b/design/admission_control_limit_range.md @@ -0,0 +1,233 @@ +# Admission control plugin: LimitRanger + +## Background + +This document proposes a system for enforcing resource requirements constraints +as part of admission control. + +## Use cases + +1. Ability to enumerate resource requirement constraints per namespace +2. Ability to enumerate min/max resource constraints for a pod +3. Ability to enumerate min/max resource constraints for a container +4. Ability to specify default resource limits for a container +5. Ability to specify default resource requests for a container +6. Ability to enforce a ratio between request and limit for a resource. +7. Ability to enforce min/max storage requests for persistent volume claims + +## Data Model + +The **LimitRange** resource is scoped to a **Namespace**. 
+ +### Type + +```go +// LimitType is a type of object that is limited +type LimitType string + +const ( + // Limit that applies to all pods in a namespace + LimitTypePod LimitType = "Pod" + // Limit that applies to all containers in a namespace + LimitTypeContainer LimitType = "Container" +) + +// LimitRangeItem defines a min/max usage limit for any resource that matches +// on kind. +type LimitRangeItem struct { + // Type of resource that this limit applies to. + Type LimitType `json:"type,omitempty"` + // Max usage constraints on this kind by resource name. + Max ResourceList `json:"max,omitempty"` + // Min usage constraints on this kind by resource name. + Min ResourceList `json:"min,omitempty"` + // Default resource requirement limit value by resource name if resource limit + // is omitted. + Default ResourceList `json:"default,omitempty"` + // DefaultRequest is the default resource requirement request value by + // resource name if resource request is omitted. + DefaultRequest ResourceList `json:"defaultRequest,omitempty"` + // MaxLimitRequestRatio if specified, the named resource must have a request + // and limit that are both non-zero where limit divided by request is less + // than or equal to the enumerated value; this represents the max burst for + // the named resource. + MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"` +} + +// LimitRangeSpec defines a min/max usage limit for resources that match +// on kind. +type LimitRangeSpec struct { + // Limits is the list of LimitRangeItem objects that are enforced. + Limits []LimitRangeItem `json:"limits"` +} + +// LimitRange sets resource usage limits for each kind of resource in a +// Namespace. +type LimitRange struct { + TypeMeta `json:",inline"` + // Standard object's metadata. + // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata + ObjectMeta `json:"metadata,omitempty"` + + // Spec defines the limits enforced. 
+ // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status + Spec LimitRangeSpec `json:"spec,omitempty"` +} + +// LimitRangeList is a list of LimitRange items. +type LimitRangeList struct { + TypeMeta `json:",inline"` + // Standard list metadata. + // More info: + // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds + ListMeta `json:"metadata,omitempty"` + + // Items is a list of LimitRange objects. + // More info: + // http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md + Items []LimitRange `json:"items"` +} +``` + +### Validation + +Validation of a **LimitRange** enforces that for a given named resource the +following rules apply: + +Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) +<= Max (if specified) + +### Default Value Behavior + +The following default value behaviors are applied to a LimitRange for a given +named resource. + +``` +if LimitRangeItem.Default[resourceName] is undefined + if LimitRangeItem.Max[resourceName] is defined + LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName] +``` + +``` +if LimitRangeItem.DefaultRequest[resourceName] is undefined + if LimitRangeItem.Default[resourceName] is defined + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName] + else if LimitRangeItem.Min[resourceName] is defined + LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName] +``` + +## AdmissionControl plugin: LimitRanger + +The **LimitRanger** plug-in introspects all incoming pod requests and evaluates +the constraints defined on a LimitRange. + +If a constraint is not specified for an enumerated resource, it is not enforced +or tracked. 
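The default value rules quoted above translate almost mechanically into code. A minimal Go sketch (not the actual plugin code; plain string maps stand in for the real `ResourceList` quantity type):

```go
package main

import "fmt"

// ResourceList maps a resource name (e.g. "cpu") to a quantity string.
type ResourceList map[string]string

// LimitRangeItem holds the subset of fields used by the defaulting rules.
type LimitRangeItem struct {
	Max, Min, Default, DefaultRequest ResourceList
}

// applyDefaults fills Default and DefaultRequest for one resource name,
// following the rules described above: an undefined Default falls back to
// Max, and an undefined DefaultRequest falls back to Default, then Min.
func applyDefaults(item *LimitRangeItem, resourceName string) {
	if _, ok := item.Default[resourceName]; !ok {
		if max, ok := item.Max[resourceName]; ok {
			item.Default[resourceName] = max
		}
	}
	if _, ok := item.DefaultRequest[resourceName]; !ok {
		if def, ok := item.Default[resourceName]; ok {
			item.DefaultRequest[resourceName] = def
		} else if min, ok := item.Min[resourceName]; ok {
			item.DefaultRequest[resourceName] = min
		}
	}
}

func main() {
	item := LimitRangeItem{
		Max:            ResourceList{"cpu": "1"},
		Min:            ResourceList{"cpu": "100m"},
		Default:        ResourceList{},
		DefaultRequest: ResourceList{},
	}
	applyDefaults(&item, "cpu")
	fmt.Println(item.Default["cpu"], item.DefaultRequest["cpu"]) // 1 1
}
```

Note the ordering: Default must be resolved before DefaultRequest, since DefaultRequest prefers to fall back to Default.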
+ +To enable the plug-in and support for LimitRange, the kube-apiserver must be +configured as follows: + +```console +$ kube-apiserver --admission-control=LimitRanger +``` + +### Enforcement of constraints + +**Type: Container** + +Supported Resources: + +1. memory +2. cpu + +Supported Constraints: + +Per container, the following must hold true: + +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min <= Request (required) <= Limit (optional) | +| Max | Limit (required) <= Max | +| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (required, non-zero)) | + +Supported Defaults: + +1. Default - if the named resource has no enumerated value, the Limit is equal +to the Default +2. DefaultRequest - if the named resource has no enumerated value, the Request +is equal to the DefaultRequest + +**Type: Pod** + +Supported Resources: + +1. memory +2. cpu + +Supported Constraints: + +Across all containers in pod, the following must hold true + +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min <= Request (required) <= Limit (optional) | +| Max | Limit (required) <= Max | +| LimitRequestRatio | LimitRequestRatio <= ( Limit (required, non-zero) / Request (non-zero) ) | + +**Type: PersistentVolumeClaim** + +Supported Resources: + +1. storage + +Supported Constraints: + +Across all claims in a namespace, the following must hold true: + +| Constraint | Behavior | +| ---------- | -------- | +| Min | Min >= Request (required) | +| Max | Max <= Request (required) | + +Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time. 
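The per-resource Min/Max/ratio checks in the tables above can be sketched as follows. This is illustrative only: quantities are simplified to plain integers (e.g. millicores), not the real quantity type, and a zero constraint stands for "not specified, not enforced".

```go
package main

import "fmt"

// containerOK checks one resource of one container against the Min, Max, and
// LimitRequestRatio rules tabulated above. A zero min/max/ratio means the
// constraint is unspecified and therefore not enforced.
func containerOK(request, limit, min, max int64, ratio float64) error {
	if min > 0 {
		// Min requires a request, and Min <= Request.
		if request == 0 {
			return fmt.Errorf("request is required when a Min is specified")
		}
		if request < min {
			return fmt.Errorf("request %d below min %d", request, min)
		}
	}
	if max > 0 {
		// Max requires a limit, and Limit <= Max.
		if limit == 0 {
			return fmt.Errorf("limit is required when a Max is specified")
		}
		if limit > max {
			return fmt.Errorf("limit %d above max %d", limit, max)
		}
	}
	if ratio > 0 {
		// Ratio requires non-zero request and limit, and limit/request <= ratio.
		if request == 0 || limit == 0 {
			return fmt.Errorf("non-zero request and limit required for ratio check")
		}
		if float64(limit)/float64(request) > ratio {
			return fmt.Errorf("limit/request ratio exceeds %v", ratio)
		}
	}
	return nil
}

func main() {
	// cpu row of the Example table below: min=100m, max=1 core, ratio=4.
	fmt.Println(containerOK(250, 500, 100, 1000, 4))        // <nil>
	fmt.Println(containerOK(100, 500, 100, 1000, 4) != nil) // true: 500/100 > 4
}
```

The Pod-level checks are the same comparisons applied to the sums of the container values, and the PersistentVolumeClaim checks apply Min/Max to the storage request.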
+
+## Run-time configuration
+
+The default `LimitRange` that is applied via Salt configuration will be
+updated as follows:
+
+```yaml
+apiVersion: "v1"
+kind: "LimitRange"
+metadata:
+  name: "limits"
+  namespace: default
+spec:
+  limits:
+    - type: "Container"
+      defaultRequest:
+        cpu: "100m"
+```
+
+## Example
+
+An example LimitRange configuration:
+
+| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
+| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
+| Container | cpu | .1 | 1 | 500m | 250m | 4 |
+| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
+
+Assuming an incoming container that specifies no resource requirements,
+the following would happen:
+
+1. The incoming container cpu would request 250m with a limit of 500m.
+2. The incoming container memory would request 250Mi with a limit of 500Mi.
+3. If the container is later resized, its cpu would be constrained to between
+.1 and 1, and the ratio of limit to request could not exceed 4.
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_limit_range.md?pixel)]()
+
diff --git a/design/admission_control_resource_quota.md b/design/admission_control_resource_quota.md
new file mode 100644
index 00000000..575db9a8
--- /dev/null
+++ b/design/admission_control_resource_quota.md
@@ -0,0 +1,215 @@
+# Admission control plugin: ResourceQuota
+
+## Background
+
+This document describes a system for enforcing hard resource usage limits per
+namespace as part of admission control.
+
+## Use cases
+
+1. Ability to enumerate resource usage limits per namespace.
+2. Ability to monitor resource usage for tracked resources.
+3. Ability to reject resource usage exceeding hard quotas.
+
+## Data Model
+
+The **ResourceQuota** object is scoped to a **Namespace**.
+ +```go +// The following identify resource constants for Kubernetes object types +const ( + // Pods, number + ResourcePods ResourceName = "pods" + // Services, number + ResourceServices ResourceName = "services" + // ReplicationControllers, number + ResourceReplicationControllers ResourceName = "replicationcontrollers" + // ResourceQuotas, number + ResourceQuotas ResourceName = "resourcequotas" + // ResourceSecrets, number + ResourceSecrets ResourceName = "secrets" + // ResourcePersistentVolumeClaims, number + ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims" +) + +// ResourceQuotaSpec defines the desired hard limits to enforce for Quota +type ResourceQuotaSpec struct { + // Hard is the set of desired hard limits for each named resource + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` +} + +// ResourceQuotaStatus defines the enforced hard limits and observed use +type ResourceQuotaStatus struct { + // Hard is the set of enforced hard limits for each named resource + Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` + // Used is the current observed total usage of the resource in the namespace + Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"` +} + +// ResourceQuota sets aggregate quota restrictions enforced per namespace +type ResourceQuota struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + + // Spec defines the desired quota + Spec ResourceQuotaSpec 
`json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` + + // Status defines the actual enforced quota and its current usage + Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"` +} + +// ResourceQuotaList is a list of ResourceQuota items +type ResourceQuotaList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"` + + // Items is a list of ResourceQuota objects + Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"` +} +``` + +## Quota Tracked Resources + +The following resources are supported by the quota system: + +| Resource | Description | +| ------------ | ----------- | +| cpu | Total requested cpu usage | +| memory | Total requested memory usage | +| pods | Total number of active pods where phase is pending or active. | +| services | Total number of services | +| replicationcontrollers | Total number of replication controllers | +| resourcequotas | Total number of resource quotas | +| secrets | Total number of secrets | +| persistentvolumeclaims | Total number of persistent volume claims | + +If a third-party wants to track additional resources, it must follow the +resource naming conventions prescribed by Kubernetes. This means the resource +must have a fully-qualified name (i.e. mycompany.org/shinynewresource) + +## Resource Requirements: Requests vs. Limits + +If a resource supports the ability to distinguish between a request and a limit +for a resource, the quota tracking system will only cost the request value +against the quota usage. 
If a resource is tracked by quota, and no request value
+is provided, the associated entity is rejected as part of admission.
+
+For example, consider the following scenarios relative to tracking quota on
+CPU:
+
+| Pod | Container | Request CPU | Limit CPU | Result |
+| --- | --------- | ----------- | --------- | ------ |
+| X | C1 | 100m | 500m | The quota usage is incremented 100m |
+| Y | C2 | 100m | none | The quota usage is incremented 100m |
+| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
+| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
+
+The rationale for accounting for the requested amount of a resource versus the
+limit is the belief that a user should only be charged for what they are
+scheduled against in the cluster. In addition, attempting to track usage against
+actual usage, where request < actual < limit, is considered highly volatile.
+
+As a consequence of this decision, the user is able to spread their usage of a
+resource across multiple tiers of service. Let's demonstrate this via an
+example with a 4 cpu quota.
+
+The quota may be allocated as follows:
+
+| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
+| --- | --------- | ----------- | --------- | ---- | ----------- |
+| X | C1 | 1 | 4 | Burstable | 1 |
+| Y | C2 | 2 | 2 | Guaranteed | 2 |
+| Z | C3 | 1 | 3 | Burstable | 1 |
+
+It is possible that the pods may consume 9 cpu over a given time period,
+depending on the available cpu of the nodes that held pods X and Z, but since we
+scheduled X and Z relative to the request, we only track the requested value
+against their allocated quota. If one wants to restrict the ratio between the
+request and limit, it is encouraged that the user define a **LimitRange** with
+**LimitRequestRatio** to control burst-out behavior.
This would, in effect, let
+an administrator keep the difference between request and limit more in line with
+tracked usage if desired.
+
+## Status API
+
+A REST API endpoint to update the status section of the **ResourceQuota** is
+exposed. It requires an atomic compare-and-swap in order to keep resource usage
+tracking consistent.
+
+## Resource Quota Controller
+
+A resource quota controller monitors observed usage for tracked resources in the
+**Namespace**.
+
+If there is an observed difference between the current usage stats and the
+current **ResourceQuota.Status**, the controller posts an update of the
+currently observed usage metrics to the **ResourceQuota** via the /status
+endpoint.
+
+The resource quota controller is the only component capable of monitoring and
+recording usage updates after a DELETE operation, since admission control is
+incapable of guaranteeing that a DELETE request actually succeeded.
+
+## AdmissionControl plugin: ResourceQuota
+
+The **ResourceQuota** plug-in introspects all incoming admission requests.
+
+To enable the plug-in and support for ResourceQuota, the kube-apiserver must be
+configured as follows:
+
+```console
+$ kube-apiserver --admission-control=ResourceQuota
+```
+
+It makes decisions by evaluating the incoming object against all defined
+**ResourceQuota.Status.Hard** resource limits in the request namespace. If
+acceptance of the resource would cause the total usage of a named resource to
+exceed its hard limit, the request is denied.
+
+If the incoming request does not cause the total usage to exceed any of the
+enumerated hard resource limits, the plug-in will post a
+**ResourceQuota.Status** document to the server to atomically update the
+observed usage based on the previously read **ResourceQuota.ResourceVersion**.
+This keeps incremental usage atomically consistent, but does introduce a
+bottleneck (intentionally) into the system.
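The admit-then-update flow described above can be sketched as follows. These are hypothetical in-memory types for illustration; the real plug-in reads and writes ResourceQuota objects through the API, with the compare-and-swap keyed on the previously read resourceVersion.

```go
package main

import "fmt"

// quota mirrors the shape of ResourceQuotaStatus for this sketch: hard limits
// and observed usage per resource name, plus a version for compare-and-swap.
type quota struct {
	hard, used      map[string]int64
	resourceVersion int64
}

// admit checks whether adding delta usage would exceed any hard limit; if not,
// it records the usage with a compare-and-swap on resourceVersion. A version
// mismatch means another writer got there first: re-read and retry.
func admit(q *quota, delta map[string]int64, observedVersion int64) error {
	if q.resourceVersion != observedVersion {
		return fmt.Errorf("conflict: quota changed, re-read and retry")
	}
	// First pass: reject if any tracked resource would exceed its hard limit.
	for name, d := range delta {
		if hard, tracked := q.hard[name]; tracked && q.used[name]+d > hard {
			return fmt.Errorf("%s quota exceeded: used %d + requested %d > hard %d",
				name, q.used[name], d, hard)
		}
	}
	// Second pass: record usage for tracked resources and bump the version.
	for name, d := range delta {
		if _, tracked := q.hard[name]; tracked {
			q.used[name] += d
		}
	}
	q.resourceVersion++
	return nil
}

func main() {
	q := &quota{hard: map[string]int64{"pods": 10}, used: map[string]int64{"pods": 9}, resourceVersion: 7}
	fmt.Println(admit(q, map[string]int64{"pods": 1}, 7))        // <nil>: used becomes 10
	fmt.Println(admit(q, map[string]int64{"pods": 1}, 8) != nil) // true: would exceed the hard limit
}
```

The version check is what makes concurrent admissions safe: two racing requests read the same version, but only the first swap succeeds, forcing the loser to re-read current usage before retrying.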
+
+To optimize system performance, it is encouraged that all resource quotas be
+tracked in a single **ResourceQuota** document per **Namespace**; that is, cap
+the number of **ResourceQuota** documents tracked in a **Namespace** at one.
+
+## kubectl
+
+kubectl is modified to support the **ResourceQuota** resource.
+
+`kubectl describe` provides a human-readable output of quota.
+
+For example:
+
+```console
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml
+namespace "quota-example" created
+$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example
+resourcequota "quota" created
+$ kubectl describe quota quota --namespace=quota-example
+Name:                   quota
+Namespace:              quota-example
+Resource                Used    Hard
+--------                ----    ----
+cpu                     0       20
+memory                  0       1Gi
+persistentvolumeclaims  0       10
+pods                    0       10
+replicationcontrollers  0       20
+resourcequotas          1       1
+secrets                 1       10
+services                0       5
+```
+
+## More information
+
+See the [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/admission_control_resource_quota.md?pixel)]()
+
diff --git a/design/architecture.dia b/design/architecture.dia
new file mode 100644
index 00000000..5c87409f
Binary files /dev/null and b/design/architecture.dia differ
diff --git a/design/architecture.md b/design/architecture.md
new file mode 100644
index 00000000..95e3aef4
--- /dev/null
+++ b/design/architecture.md
@@ -0,0 +1,85 @@
+# Kubernetes architecture
+
+A running Kubernetes cluster contains node agents (`kubelet`) and master
+components (APIs, scheduler, etc), on top of a distributed storage solution.
+This diagram shows our desired eventual state, though we're still working on a +few things, like making `kubelet` itself (all our components, really) run within +containers, and making the scheduler 100% pluggable. + +![Architecture Diagram](architecture.png?raw=true "Architecture overview") + +## The Kubernetes Node + +When looking at the architecture of the system, we'll break it down to services +that run on the worker node and services that compose the cluster-level control +plane. + +The Kubernetes node has the services necessary to run application containers and +be managed from the master systems. + +Each node runs Docker, of course. Docker takes care of the details of +downloading images and running containers. + +### `kubelet` + +The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their +images, their volumes, etc. + +### `kube-proxy` + +Each node also runs a simple network proxy and load balancer (see the +[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for +more details). This reflects `services` (see +[the services doc](../user-guide/services.md) for more details) as defined in +the Kubernetes API on each node and can do simple TCP and UDP stream forwarding +(round robin) across a set of backends. + +Service endpoints are currently found via [DNS](../admin/dns.md) or through +environment variables (both +[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and +Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are +supported). These variables resolve to ports managed by the service proxy. + +## The Kubernetes Control Plane + +The Kubernetes control plane is split into a set of components. Currently they +all run on a single _master_ node, but that is expected to change soon in order +to support high-availability clusters. These components work together to provide +a unified view of the cluster. 
+ +### `etcd` + +All persistent master state is stored in an instance of `etcd`. This provides a +great way to store configuration data reliably. With `watch` support, +coordinating components can be notified very quickly of changes. + +### Kubernetes API Server + +The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a +CRUD-y server, with most/all business logic implemented in separate components +or in plug-ins. It mainly processes REST operations, validates them, and updates +the corresponding objects in `etcd` (and eventually other stores). + +### Scheduler + +The scheduler binds unscheduled pods to nodes via the `/binding` API. The +scheduler is pluggable, and we expect to support multiple cluster schedulers and +even user-provided schedulers in the future. + +### Kubernetes Controller Manager Server + +All other cluster-level functions are currently performed by the Controller +Manager. For instance, `Endpoints` objects are created and updated by the +endpoints controller, and nodes are discovered, managed, and monitored by the +node controller. These could eventually be split into separate components to +make them independently pluggable. + +The [`replicationcontroller`](../user-guide/replication-controller.md) is a +mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md) +API. We eventually plan to port it to a generic plug-in mechanism, once one is +implemented. 
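All of the controllers described above follow the same pattern: observe desired state, observe actual state, and act to converge the two. A minimal, purely illustrative sketch of that reconcile pattern (not the Controller Manager's actual code) looks like this:

```python
# Generic reconcile loop in the style of the controllers described above.
# Desired state: how many replicas of each pod template should exist.
# Actual state: the pods that currently exist. Reconcile converges the two.

def reconcile(desired, actual):
    """Return the create/delete actions needed to converge actual toward desired."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.extend(("create", name) for _ in range(want - have))
        elif have > want:
            actions.extend(("delete", name) for _ in range(have - want))
    # Anything running that is no longer desired at all gets deleted.
    for name, have in actual.items():
        if name not in desired:
            actions.extend(("delete", name) for _ in range(have))
    return actions

desired = {"web": 3, "db": 1}
actual = {"web": 1, "old-job": 2}
for verb, name in reconcile(desired, actual):
    print(verb, name)
```

A real controller runs this loop continuously against watched API state rather than computing a one-shot diff.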
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/architecture.md?pixel)]()
+
diff --git a/design/architecture.png b/design/architecture.png
new file mode 100644
index 00000000..0ee8bceb
Binary files /dev/null and b/design/architecture.png differ
diff --git a/design/architecture.svg b/design/architecture.svg
new file mode 100644
index 00000000..d6b6aab0
--- /dev/null
+++ b/design/architecture.svg
@@ -0,0 +1,1943 @@
+[SVG markup not recoverable from this extraction. The architecture diagram shows: two Nodes, each running kubelet, cAdvisor, a Proxy, docker, and Pods of containers; master components (colocated or spread across machines, as dictated by cluster size) comprising REST APIs for pods, services, and replication controllers, authentication/authorization, a scheduling actuator, the scheduler, and the controller manager (replication controller etc.); kubectl user commands arriving from the internet through a firewall; and distributed watchable storage implemented via etcd.]
diff --git a/design/aws_under_the_hood.md b/design/aws_under_the_hood.md
new file mode 100644
index 00000000..6e3c5afb
--- /dev/null
+++ b/design/aws_under_the_hood.md
@@ -0,0 +1,310 @@
+# Peeking under the hood of Kubernetes on AWS
+
+This document provides high-level insight into how Kubernetes works on AWS and
+maps to AWS objects. We assume that you are familiar with AWS.
+
+We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
+clusters on AWS. We recommend that you avoid manual configuration but are aware
+that sometimes it's the only option.
+
+Tip: You should open an issue and let us know what enhancements can be made to
+the scripts to better suit your needs.
+
+That said, it's also useful to know what's happening under the hood when
+Kubernetes clusters are created on AWS. This can be particularly useful if
+problems arise or in circumstances where the provided scripts are lacking and
+you manually created or configured your cluster.
+
+**Table of contents:**
+ * [Architecture overview](#architecture-overview)
+ * [Storage](#storage)
+ * [Auto Scaling group](#auto-scaling-group)
+ * [Networking](#networking)
+ * [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
+ * [Identity and access management (IAM)](#identity-and-access-management-iam)
+ * [Tagging](#tagging)
+ * [AWS objects](#aws-objects)
+ * [Manual infrastructure creation](#manual-infrastructure-creation)
+ * [Instance boot](#instance-boot)
+
+### Architecture overview
+
+Kubernetes is a cluster of several machines that consists of a Kubernetes
+master and a set number of nodes (previously known as `minions`) for which the
+master is responsible. See the [Architecture](architecture.md) topic for
+more details.
+
+By default on AWS:
+
+* Instances run Ubuntu 15.04 (the official AMI).
It includes a sufficiently + modern kernel that pairs well with Docker and doesn't require a + reboot. (The default SSH user is `ubuntu` for this and other ubuntu images.) +* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly + because this is what Google Compute Engine uses). + +You can override these defaults by passing different environment variables to +kube-up. + +### Storage + +AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore). +These can then be attached to pods that should store persistent data (e.g. if +you're running a database). + +By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html) +unless you create pods with persistent volumes +[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes +containers do not have persistent storage unless you attach a persistent +volume, and so nodes on AWS use instance storage. Instance storage is cheaper, +often faster, and historically more reliable. Unless you can make do with +whatever space is left on your root partition, you must choose an instance type +that provides you with sufficient instance storage for your needs. + +To configure Kubernetes to use EBS storage, pass the environment variable +`KUBE_AWS_STORAGE=ebs` to kube-up. + +Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to +track its state. Similar to nodes, containers are mostly run against instance +storage, except that we repoint some important data onto the persistent volume. + +The default storage driver for Docker images is aufs. Specifying btrfs (by +passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a +good choice for a filesystem. btrfs is relatively reliable with Docker and has +improved its reliability with modern kernels. 
It can easily span multiple +volumes, which is particularly useful when we are using an instance type with +multiple ephemeral instance disks. + +### Auto Scaling group + +Nodes (but not the master) are run in an +[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html) +on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled +([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means +that AWS will relaunch any nodes that are terminated. + +We do not currently run the master in an AutoScalingGroup, but we should +([#11934](http://issues.k8s.io/11934)). + +### Networking + +Kubernetes uses an IP-per-pod model. This means that a node, which runs many +pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced +routing support so each pod is assigned a /24 CIDR. The assigned CIDR is then +configured to route to an instance in the VPC routing table. + +It is also possible to use overlay networking on AWS, but that is not the +default configuration of the kube-up script. + +### NodePort and LoadBalancer services + +Kubernetes on AWS integrates with [Elastic Load Balancing +(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html). +When you create a service with `Type=LoadBalancer`, Kubernetes (the +kube-controller-manager) will create an ELB, create a security group for the +ELB which allows access on the service ports, attach all the nodes to the ELB, +and modify the security group for the nodes to allow traffic from the ELB to +the nodes. This traffic reaches kube-proxy where it is then forwarded to the +pods. + +ELB has some restrictions: +* ELB requires that all nodes listen on a single port, +* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below +on ELB annotations for pods speaking HTTP). 
+
+To work with these restrictions, in Kubernetes, [LoadBalancer
+services](../user-guide/services.md#type-loadbalancer) are exposed as
+[NodePort services](../user-guide/services.md#type-nodeport). Then
+kube-proxy listens externally on the cluster-wide port that's assigned to
+NodePort services and forwards traffic to the corresponding pods.
+
+For example, if we configure a service of Type LoadBalancer with a
+public port of 80:
+* Kubernetes will assign a NodePort to the service (e.g. port 31234).
+* ELB is configured to proxy traffic on the public port 80 to the NodePort
+assigned to the service (in this example, port 31234).
+* Any incoming traffic that ELB forwards to the NodePort (31234)
+is recognized by kube-proxy and sent to the correct pods for that service.
+
+Note that we do not automatically open NodePort services in the AWS firewall
+(although we do open LoadBalancer services). This is because we expect that
+NodePort services are more of a building block for things like inter-cluster
+services or for LoadBalancer. To consume a NodePort service externally, you
+will likely have to open the port in the node security group
+(`kubernetes-node-`).
+
+For SSL support, starting with 1.3, two annotations can be added to a service:
+
+```
+service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
+```
+
+The first specifies which certificate to use. It can be either a
+certificate from a third party issuer that was uploaded to IAM or one created
+within AWS Certificate Manager.
+
+```
+service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
+```
+
+The second annotation specifies which protocol a pod speaks. For HTTPS and
+SSL, the ELB will expect the pod to authenticate itself over the encrypted
+connection.
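Put together, a Service using both annotations might look like the following sketch. The service name, selector, and ports are placeholders, and the certificate ARN is the sample value from the annotation above, not a real certificate:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-https-service        # placeholder name
  annotations:
    # Which IAM/ACM certificate the ELB should serve (sample ARN only):
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
    # Protocol the pods speak; https/ssl means the ELB expects an encrypted backend:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: https
spec:
  type: LoadBalancer
  selector:
    app: my-app                 # placeholder selector
  ports:
  - port: 443                   # public ELB port
    targetPort: 8443            # placeholder pod port
```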
+
+HTTP and HTTPS will select layer 7 proxying: the ELB will terminate
+the connection with the user, parse headers, and inject the `X-Forwarded-For`
+header with the user's IP address (pods will only see the IP address of the
+ELB at the other end of its connection) when forwarding requests.
+
+TCP and SSL will select layer 4 proxying: the ELB will forward traffic without
+modifying the headers.
+
+### Identity and Access Management (IAM)
+
+kube-up sets up two IAM roles, one for the master called
+[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
+and one for the nodes called
+[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
+
+The master is responsible for creating ELBs and configuring them, as well as
+setting up advanced VPC routing. Currently it has blanket permissions on EC2,
+along with rights to create and destroy ELBs.
+
+The nodes do not need a lot of access to the AWS APIs. They need to download
+a distribution file, and are then responsible for attaching and detaching EBS
+volumes from themselves.
+
+The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
+authorization tokens, refresh them every 12 hours if needed, and fetch Docker
+images from ECR, as long as the appropriate permissions are enabled. Those in
+[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly),
+without write access, should suffice. The master policy is probably overly
+permissive. The security-conscious may want to lock down the IAM policies
+further ([#11936](http://issues.k8s.io/11936)).
+
+We should make it easier to extend IAM permissions and also ensure that they
+are correctly configured ([#14226](http://issues.k8s.io/14226)).
+
+### Tagging
+
+All AWS resources are tagged with a tag named "KubernetesCluster", with a value
+that is the unique cluster-id.
This tag is used to identify a particular +'instance' of Kubernetes, even if two clusters are deployed into the same VPC. +Resources are considered to belong to the same cluster if and only if they have +the same value in the tag named "KubernetesCluster". (The kube-up script is +not configured to create multiple clusters in the same VPC by default, but it +is possible to create another cluster in the same VPC.) + +Within the AWS cloud provider logic, we filter requests to the AWS APIs to +match resources with our cluster tag. By filtering the requests, we ensure +that we see only our own AWS objects. + +**Important:** If you choose not to use kube-up, you must pick a unique +cluster-id value, and ensure that all AWS resources have a tag with +`Name=KubernetesCluster,Value=`. + +### AWS objects + +The kube-up script does a number of things in AWS: +* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes +distribution and the salt scripts into it. They are made world-readable and the +HTTP URLs are passed to instances; this is how Kubernetes code gets onto the +machines. +* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/): + * `kubernetes-master` is used by the master. + * `kubernetes-node` is used by nodes. +* Creates an AWS SSH key named `kubernetes-`. Fingerprint here is +the OpenSSH key fingerprint, so that multiple users can run the script with +different keys and their keys will not collide (with near-certainty). It will +use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create +one there. (With the default Ubuntu images, if you have to SSH in: the user is +`ubuntu` and that user can `sudo`). +* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and +enables the `dns-support` and `dns-hostnames` options. +* Creates an internet gateway for the VPC. +* Creates a route table for the VPC, with the internet gateway as the default +route. 
+* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
+(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
+single AZ on AWS. There are, however, two philosophies under discussion for
+achieving High Availability (HA):
+  * cluster-per-AZ: An independent cluster for each AZ, where each cluster
+is entirely separate.
+  * cross-AZ-clusters: A single cluster spans multiple AZs.
+The debate is open: cluster-per-AZ is considered more robust, while
+cross-AZ-clusters are more convenient.
+* Associates the subnet with the route table.
+* Creates security groups for the master (`kubernetes-master-`)
+and the nodes (`kubernetes-node-`).
+* Configures security groups so that masters and nodes can communicate. This
+includes intercommunication between masters and nodes, opening SSH publicly
+for both masters and nodes, and opening port 443 on the master for the HTTPS
+API endpoints.
+* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
+`MASTER_DISK_TYPE`.
+* Launches a master with a fixed IP address (172.20.0.9) that is also
+configured for the security group and all the necessary IAM credentials. An
+instance script is used to pass vital configuration information to Salt. Note:
+The hope is that over time we can reduce the amount of configuration
+information that must be passed in this way.
+* Once the instance is up, it attaches the EBS volume and sets up a manual
+routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
+10.246.0.0/24).
+* For auto-scaling, it creates a launch configuration and an auto-scaling
+group for the nodes. The name for both is
+<*KUBE_AWS_INSTANCE_PREFIX*>-node-group. The default name is
+kubernetes-node-group. The auto-scaling group has a min and max size that are
+both set to NUM_NODES. You can change the size of the auto-scaling group to
+add or remove nodes from within the AWS API or Console.
Each node self-configures: it comes up, runs Salt with
+the stored configuration, connects to the master, and is assigned an internal
+CIDR; the master then configures the route table with the assigned CIDR. The
+kube-up script performs a health-check on the nodes, but it's a self-check that
+is not required.
+
+If attempting this configuration manually, it is recommended to follow along
+with the kube-up script, being sure to tag everything with a tag named
+`KubernetesCluster` and a value set to a unique cluster-id. Also, passing the
+right configuration options to Salt when not using the script is tricky: the
+plan here is to simplify this by having Kubernetes take on more node
+configuration, and even potentially remove Salt altogether.
+
+### Manual infrastructure creation
+
+While this work is not yet complete, advanced users might choose to manually
+create certain AWS objects while still making use of the kube-up script (to
+configure Salt, for example). These objects can currently be manually created:
+* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
+* Set the `VPC_ID` environment variable to reuse an existing VPC.
+* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
+* If your route table has a matching `KubernetesCluster` tag, it will be reused.
+* If your security groups are appropriately named, they will be reused.
+
+Currently there is no way to do the following with kube-up:
+* Use an existing AWS SSH key with an arbitrary name.
+* Override the IAM credentials in a sensible way
+([#14226](http://issues.k8s.io/14226)).
+* Use different security group permissions.
+* Configure your own auto-scaling groups.
+
+If any of the above items apply to your situation, open an issue to request an
+enhancement to the kube-up script. You should provide a complete description of
+the use-case, including all the details around what you want to accomplish.
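The cluster-id tagging and filtering described under Tagging can be sketched as follows. The resource dictionaries only mimic the shape of EC2 describe-* results; this is not the actual cloud-provider code:

```python
# Sketch of filtering AWS resources down to one cluster by the
# "KubernetesCluster" tag, as the AWS cloud-provider logic does.
# The resource dicts below only mimic EC2 describe-* output shapes.

CLUSTER_TAG = "KubernetesCluster"

def belongs_to_cluster(resource, cluster_id):
    """True if the resource carries our cluster tag with our cluster-id."""
    tags = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
    return tags.get(CLUSTER_TAG) == cluster_id

resources = [
    {"Id": "subnet-1", "Tags": [{"Key": CLUSTER_TAG, "Value": "cluster-a"}]},
    {"Id": "subnet-2", "Tags": [{"Key": CLUSTER_TAG, "Value": "cluster-b"}]},
    {"Id": "subnet-3", "Tags": []},  # untagged: never considered ours
]

mine = [r["Id"] for r in resources if belongs_to_cluster(r, "cluster-a")]
print(mine)  # ['subnet-1']
```

This is why two clusters can share a VPC: each one only ever sees resources carrying its own tag value.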
+ +### Instance boot + +The instance boot procedure is currently pretty complicated, primarily because +we must marshal configuration from Bash to Salt via the AWS instance script. +As we move more post-boot configuration out of Salt and into Kubernetes, we +will hopefully be able to simplify this. + +When the kube-up script launches instances, it builds an instance startup +script which includes some configuration options passed to kube-up, and +concatenates some of the scripts found in the cluster/aws/templates directory. +These scripts are responsible for mounting and formatting volumes, downloading +Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually +install Kubernetes. + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/aws_under_the_hood.md?pixel)]() + diff --git a/design/clustering.md b/design/clustering.md new file mode 100644 index 00000000..ca42035b --- /dev/null +++ b/design/clustering.md @@ -0,0 +1,128 @@ +# Clustering in Kubernetes + + +## Overview + +The term "clustering" refers to the process of having all members of the +Kubernetes cluster find and trust each other. There are multiple different ways +to achieve clustering with different security and usability profiles. This +document attempts to lay out the user experiences for clustering that Kubernetes +aims to address. + +Once a cluster is established, the following is true: + +1. **Master -> Node** The master needs to know which nodes can take work and +what their current status is wrt capacity. + 1. **Location** The master knows the name and location of all of the nodes in +the cluster. + * For the purposes of this doc, location and name should be enough +information so that the master can open a TCP connection to the Node. Most +probably we will make this either an IP address or a DNS name. 
It is going to be +important to be consistent here (master must be able to reach kubelet on that +DNS name) so that we can verify certificates appropriately. + 2. **Target AuthN** A way to securely talk to the kubelet on that node. +Currently we call out to the kubelet over HTTP. This should be over HTTPS and +the master should know what CA to trust for that node. + 3. **Caller AuthN/Z** This would be the master verifying itself (and +permissions) when calling the node. Currently, this is only used to collect +statistics as authorization isn't critical. This may change in the future +though. +2. **Node -> Master** The nodes currently talk to the master to know which pods +have been assigned to them and to publish events. + 1. **Location** The nodes must know where the master is at. + 2. **Target AuthN** Since the master is assigning work to the nodes, it is +critical that they verify whom they are talking to. + 3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to +the master. Ideally this authentication is specific to each node so that +authorization can be narrowly scoped. The details of the work to run (including +things like environment variables) might be considered sensitive and should be +locked down also. + +**Note:** While the description here refers to a singular Master, in the future +we should enable multiple Masters operating in an HA mode. While the "Master" is +currently the combination of the API Server, Scheduler and Controller Manager, +we will restrict ourselves to thinking about the main API and policy engine -- +the API Server. + +## Current Implementation + +A central authority (generally the master) is responsible for determining the +set of machines which are members of the cluster. Calls to create and remove +worker nodes in the cluster are restricted to this single authority, and any +other requests to add or remove worker nodes are rejected. (1.i.) 
+
+Communication from the master to nodes is currently over HTTP and is not secured
+or authenticated in any way. (1.ii, 1.iii.)
+
+The location of the master is communicated out of band to the nodes. For GCE,
+this is done via Salt. Other cluster instructions/scripts use other methods.
+(2.i.)
+
+Currently most communication from the node to the master is over HTTP. When it
+is done over HTTPS, there is currently no verification of the master's cert.
+(2.ii.)
+
+Currently, the node/kubelet is authenticated to the master via a token shared
+across all nodes. This token is distributed out of band (using Salt for GCE) and
+is optional. If it is not present then the kubelet is unable to publish events
+to the master. (2.iii.)
+
+Our current mix of out-of-band communication doesn't meet all of our needs from
+a security point of view and is difficult to set up and configure.
+
+## Proposed Solution
+
+The proposed solution will provide a range of options for setting up and
+maintaining a secure Kubernetes cluster. We want to allow both for centrally
+controlled systems (leveraging pre-existing trust and configuration systems) and
+for more ad-hoc, automagic systems that are incredibly easy to set up.
+
+The building blocks of an easier solution:
+
+* **Move to TLS** We will move to using TLS for all intra-cluster communication.
+We will explicitly identify the trust chain (the set of trusted CAs) as opposed
+to trusting the system CAs. We will also use client certificates for all AuthN.
+* [optional] **API driven CA** Optionally, we will run a CA in the master that
+will mint certificates for the nodes/kubelets. There will be pluggable policies
+that will automatically approve certificate requests here as appropriate.
+  * **CA approval policy** This is a pluggable policy object that can
+automatically approve CA signing requests. Stock policies will include
+`always-reject`, `queue` and `insecure-always-approve`.
With `queue` there would
+be an API for evaluating and accepting/rejecting requests. Cloud providers could
+implement a policy here that verifies other out-of-band information and
+automatically approves/rejects based on other external factors.
+* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give
+a node permission to register itself.
+  * To start with, we'd have the kubelets generate a cert/account in the form of
+`kubelet:`. We would then hard-code policy such that we give that
+particular account appropriate permissions. Over time, we can make the policy
+engine more generic.
+* [optional] **Bootstrap API endpoint** This is a helper service hosted outside
+of the Kubernetes cluster that helps with initial discovery of the master.
+
+### Static Clustering
+
+In this sequence diagram there is an out-of-band admin entity that creates all
+certificates and distributes them. It also makes sure that the kubelets
+know where to find the master. This provides for a lot of control but is more
+difficult to set up, as lots of information must be communicated outside of
+Kubernetes.
+
+![Static Sequence Diagram](clustering/static.png)
+
+### Dynamic Clustering
+
+This diagram shows dynamic clustering using the bootstrap API endpoint. This
+endpoint is used both to find the location of the master and to communicate the
+root CA for the master.
+
+This flow has the admin manually approving the kubelet signing requests. This is
+the `queue` policy defined above. This manual intervention could be replaced by
+code that can verify the signing requests via other means.
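The stock approval policies named above could be sketched as a pluggable interface along these lines; the class and method names are illustrative, not the real implementation:

```python
# Sketch of the pluggable CA-approval policies named above:
# always-reject, insecure-always-approve, and queue (manual approval).
# Names and interfaces are illustrative, not Kubernetes source.

class AlwaysReject:
    def decide(self, request):
        return "rejected"

class InsecureAlwaysApprove:
    def decide(self, request):
        return "approved"   # convenient for dev clusters, but trusts everyone

class Queue:
    """Hold requests for an out-of-band approver (an admin or automation)."""
    def __init__(self):
        self.pending = []

    def decide(self, request):
        self.pending.append(request)
        return "pending"

    def approve(self, request):
        # Called later by the approver, e.g. via an API for evaluating requests.
        self.pending.remove(request)
        return "approved"

policy = Queue()
req = "kubelet:node-1"             # hypothetical signing-request identifier
print(policy.decide(req))          # pending
print(policy.approve(req))         # approved
print(AlwaysReject().decide(req))  # rejected
```

A cloud provider's policy would slot in as another class whose `decide` consults external information before returning a verdict.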
+ +![Dynamic Sequence Diagram](clustering/dynamic.png) + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering.md?pixel)]() + diff --git a/design/clustering/.gitignore b/design/clustering/.gitignore new file mode 100644 index 00000000..67bcd6cb --- /dev/null +++ b/design/clustering/.gitignore @@ -0,0 +1 @@ +DroidSansMono.ttf diff --git a/design/clustering/Dockerfile b/design/clustering/Dockerfile new file mode 100644 index 00000000..e7abc753 --- /dev/null +++ b/design/clustering/Dockerfile @@ -0,0 +1,26 @@ +# Copyright 2016 The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +FROM debian:jessie + +RUN apt-get update +RUN apt-get -qy install python-seqdiag make curl + +WORKDIR /diagrams + +RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf + +ADD . /diagrams + +CMD bash -c 'make >/dev/stderr && tar cf - *.png' \ No newline at end of file diff --git a/design/clustering/Makefile b/design/clustering/Makefile new file mode 100644 index 00000000..e72d441e --- /dev/null +++ b/design/clustering/Makefile @@ -0,0 +1,41 @@ +# Copyright 2016 The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +FONT := DroidSansMono.ttf + +PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag)) + +.PHONY: all +all: $(PNGS) + +.PHONY: watch +watch: + fswatch *.seqdiag | xargs -n 1 sh -c "make || true" + +$(FONT): + curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT) + +%.png: %.seqdiag $(FONT) + seqdiag --no-transparency -a -f '$(FONT)' $< + +# Build the stuff via a docker image +.PHONY: docker +docker: + docker build -t clustering-seqdiag . + docker run --rm clustering-seqdiag | tar xvf - + +.PHONY: docker-clean +docker-clean: + docker rmi clustering-seqdiag || true + docker images -q --filter "dangling=true" | xargs docker rmi diff --git a/design/clustering/README.md b/design/clustering/README.md new file mode 100644 index 00000000..d7e2e2e0 --- /dev/null +++ b/design/clustering/README.md @@ -0,0 +1,35 @@ +This directory contains diagrams for the clustering design doc. + +This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html). +Assuming you have a non-borked python install, this should be installable with: + +```sh +pip install seqdiag +``` + +Just call `make` to regenerate the diagrams. + +## Building with Docker + +If you are on a Mac or your pip install is messed up, you can easily build with +docker: + +```sh +make docker +``` + +The first run will be slow but things should be fast after that. + +To clean up the docker containers that are created (and other cruft that is left +around) you can run `make docker-clean`. 
+ +## Automatically rebuild on file changes + +If you have the fswatch utility installed, you can have it monitor the file +system and automatically rebuild when files have changed. Just do a +`make watch`. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/clustering/README.md?pixel)]() + diff --git a/design/clustering/dynamic.png b/design/clustering/dynamic.png new file mode 100644 index 00000000..92b40fee Binary files /dev/null and b/design/clustering/dynamic.png differ diff --git a/design/clustering/dynamic.seqdiag b/design/clustering/dynamic.seqdiag new file mode 100644 index 00000000..567d5bf9 --- /dev/null +++ b/design/clustering/dynamic.seqdiag @@ -0,0 +1,24 @@ +seqdiag { + activation = none; + + + user[label = "Admin User"]; + bootstrap[label = "Bootstrap API\nEndpoint"]; + master; + kubelet[stacked]; + + user -> bootstrap [label="createCluster", return="cluster ID"]; + user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"]; + + user ->> master [label="start\n- bootstrap-cluster-uri"]; + master => bootstrap [label="setMaster\n- master-location\n- master-ca"]; + + user ->> kubelet [label="start\n- bootstrap-cluster-uri"]; + kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"]; + kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"]; + user => master [label="getSignRequests"]; + user => master [label="approveSignRequests"]; + kubelet <<-- master [label="returns\n- kubelet-cert"]; + + kubelet => master [label="register\n- kubelet-location"] +} diff --git a/design/clustering/static.png b/design/clustering/static.png new file mode 100644 index 00000000..bcdeca7e Binary files /dev/null and b/design/clustering/static.png differ diff --git a/design/clustering/static.seqdiag b/design/clustering/static.seqdiag new file mode 100644 index 00000000..bdc54b76 --- /dev/null +++ b/design/clustering/static.seqdiag @@ -0,0 +1,16 @@ +seqdiag { + 
activation = none;
+
+  admin[label = "Manual Admin"];
+  ca[label = "Manual CA"];
+  master;
+  kubelet[stacked];
+
+  admin => ca [label="create\n- master-cert"];
+  admin ->> master [label="start\n- ca-root\n- master-cert"];
+
+  admin => ca [label="create\n- kubelet-cert"];
+  admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"];
+
+  kubelet => master [label="register\n- kubelet-location"];
+}
diff --git a/design/command_execution_port_forwarding.md b/design/command_execution_port_forwarding.md
new file mode 100644
index 00000000..a7175403
--- /dev/null
+++ b/design/command_execution_port_forwarding.md
@@ -0,0 +1,158 @@
+# Container Command Execution & Port Forwarding in Kubernetes
+
+## Abstract
+
+This document describes how to use Kubernetes to execute commands in containers
+with stdin/stdout/stderr streams attached, and how to implement port forwarding
+to the containers.
+
+## Background
+
+See the following related issues/PRs:
+
+- [Support attach](http://issue.k8s.io/1521)
+- [Real container ssh](http://issue.k8s.io/1513)
+- [Provide easy debug network access to services](http://issue.k8s.io/1863)
+- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576)
+
+## Motivation
+
+Users and administrators are accustomed to being able to access their systems
+via SSH to run remote commands, get shell access, and do port forwarding.
+
+Supporting SSH to containers in Kubernetes is a difficult task. You must
+specify a "user" and a hostname to make an SSH connection, and `sshd` requires
+real users (resolvable by NSS and PAM). Because a container belongs to a pod,
+and the pod belongs to a namespace, you need to specify namespace/pod/container
+to uniquely identify the target container. Unfortunately, a
+namespace/pod/container is not a real user as far as SSH is concerned. Also,
+most Linux systems limit user names to 32 characters, which is unlikely to be
+large enough to contain namespace/pod/container.
We could devise some scheme to
+map each namespace/pod/container to a 32-character user name, adding entries to
+`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
+time. Alternatively, we could write custom NSS and PAM modules that allow the
+host to resolve a namespace/pod/container to a user without needing to keep
+files or LDAP in sync.
+
+As an alternative to SSH, we are using a multiplexed streaming protocol that
+runs on top of HTTP. There are no requirements about users being real users,
+nor is there any limitation on user name length, as the protocol is under our
+control. The only downside is that standard tooling that expects to use SSH
+won't be able to work with this mechanism, unless adapters can be written.
+
+## Constraints and Assumptions
+
+- SSH support is not currently in scope.
+- CGroup confinement is ultimately desired, but implementing that support is not
+currently in scope.
+- SELinux confinement is ultimately desired, but implementing that support is
+not currently in scope.
+
+## Use Cases
+
+- A user of a Kubernetes cluster wants to run arbitrary commands in a
+container with local stdin/stdout/stderr attached to the container.
+- A user of a Kubernetes cluster wants to connect to local ports on their
+computer and have them forwarded to ports in a container.
+
+## Process Flow
+
+### Remote Command Execution Flow
+
+1. The client connects to the Kubernetes Master to initiate a remote command
+execution request.
+2. The Master proxies the request to the Kubelet where the container lives.
+3. The Kubelet executes nsenter + the requested command and streams
+stdin/stdout/stderr back and forth between the client and the container.
+
+### Port Forwarding Flow
+
+1. The client connects to the Kubernetes Master to initiate a port forwarding
+request.
+2. The Master proxies the request to the Kubelet where the container lives.
+3. The client listens on each specified local port, awaiting local connections.
+4. The client connects to one of the local listening ports.
+5. The client notifies the Kubelet of the new connection.
+6. The Kubelet executes nsenter + socat and streams data back and forth between
+the client and the port in the container.
+
+## Design Considerations
+
+### Streaming Protocol
+
+The current multiplexed streaming protocol used is SPDY. This is not the
+long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
+we will switch to that.
+
+### Master as First Level Proxy
+
+Clients should not be allowed to communicate directly with the Kubelet for
+security reasons. Therefore, the Master is currently the only suggested entry
+point to be used for remote command execution and port forwarding. This is not
+necessarily desirable, as it means that all remote command execution and port
+forwarding traffic must travel through the Master, potentially impacting other
+API requests.
+
+In the future, it might make more sense to retrieve an authorization token from
+the Master, and then use that token to initiate a remote command execution or
+port forwarding request with a load balanced proxy service dedicated to this
+functionality. This would keep the streaming traffic out of the Master.
+
+### Kubelet as Backend Proxy
+
+The kubelet is currently responsible for handling remote command execution and
+port forwarding requests. Just like with the Master described above, this means
+that all remote command execution and port forwarding streaming traffic must
+travel through the Kubelet, which could result in a degraded ability to service
+other requests.
+
+In the future, it might make more sense to use a separate service on the node.
+
+Alternatively, we could possibly inject a process into the container that only
+listens for a single request, expose that process's listening port on the node,
+and then issue a redirect to the client such that it would connect to the first
+level proxy, which would then proxy directly to the injected process's exposed
+port. This would minimize the amount of proxying that takes place.
+
+### Scalability
+
+There are at least two different ways to execute a command in a container:
+`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
+more obvious choice, it has some drawbacks.
+
+#### `docker exec`
+
+We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
+on the node), but this would require proxying from the edge and securing the
+Docker API. `docker exec` calls go through the Docker daemon, meaning that all
+stdin/stdout/stderr traffic is proxied through the daemon, adding an extra hop.
+Additionally, you can't isolate one malicious `docker exec` call from normal
+usage, meaning an attacker could initiate a denial of service or other attack
+and take down the Docker daemon, or the node itself.
+
+We expect remote command execution and port forwarding requests to be
+long-running and/or high-bandwidth operations, and routing all the streaming
+data through the Docker daemon feels like a bottleneck we can avoid.
+
+#### `nsenter`
+
+The implementation currently uses `nsenter` to run commands in containers,
+joining the appropriate container namespaces. `nsenter` runs directly on the
+node and is not proxied through any single daemon process.
+
+### Security
+
+Authentication and authorization haven't specifically been tested yet with this
+functionality. We need to make sure that users are not allowed to execute
+remote commands or do port forwarding to containers they aren't allowed to
+access.
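A minimal sketch of the kind of check this implies, under the simplifying assumption of a user-to-namespace allow-list (the real authorizer is policy-driven, and all names here are illustrative):

```go
package main

// ExecRequest identifies the target of a remote command execution or
// port forwarding request. Field names are illustrative.
type ExecRequest struct {
	User      string
	Namespace string
	Pod       string
	Container string
}

// allowed gates a request on a user-to-namespace allow-list. This stands
// in for the real, policy-driven authorizer; the point is only that every
// exec/forward request must be checked against the target container's
// namespace before any stream is opened.
func allowed(acl map[string][]string, req ExecRequest) bool {
	for _, ns := range acl[req.User] {
		if ns == req.Namespace {
			return true
		}
	}
	return false
}
```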
+ +Additional work is required to ensure that multiple command execution or port +forwarding connections from different clients are not able to see each other's +data. This can most likely be achieved via SELinux labeling and unique process + contexts. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/command_execution_port_forwarding.md?pixel)]() + diff --git a/design/configmap.md b/design/configmap.md new file mode 100644 index 00000000..658ac73b --- /dev/null +++ b/design/configmap.md @@ -0,0 +1,300 @@ +# Generic Configuration Object + +## Abstract + +The `ConfigMap` API resource stores data used for the configuration of +applications deployed on Kubernetes. + +The main focus of this resource is to: + +* Provide dynamic distribution of configuration data to deployed applications. +* Encapsulate configuration information and simplify `Kubernetes` deployments. +* Create a flexible configuration model for `Kubernetes`. + +## Motivation + +A `Secret`-like API resource is needed to store configuration data that pods can +consume. + +Goals of this design: + +1. Describe a `ConfigMap` API resource. +2. Describe the semantics of consuming `ConfigMap` as environment variables. +3. Describe the semantics of consuming `ConfigMap` as files in a volume. + +## Use Cases + +1. As a user, I want to be able to consume configuration data as environment +variables. +2. As a user, I want to be able to consume configuration data as files in a +volume. +3. As a user, I want my view of configuration data in files to be eventually +consistent with changes to the data. + +### Consuming `ConfigMap` as Environment Variables + +A series of events for consuming `ConfigMap` as environment variables: + +1. Create a `ConfigMap` object. +2. Create a pod to consume the configuration data via environment variables. +3. The pod is scheduled onto a node. +4. 
The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and +starts the container processes with the appropriate configuration data from +environment variables. + +### Consuming `ConfigMap` in Volumes + +A series of events for consuming `ConfigMap` as configuration files in a volume: + +1. Create a `ConfigMap` object. +2. Create a new pod using the `ConfigMap` via a volume plugin. +3. The pod is scheduled onto a node. +4. The Kubelet creates an instance of the volume plugin and calls its `Setup()` +method. +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod +and projects the appropriate configuration data into the volume. + +### Consuming `ConfigMap` Updates + +Any long-running system has configuration that is mutated over time. Changes +made to configuration data must be made visible to pods consuming data in +volumes so that they can respond to those changes. + +The `resourceVersion` of the `ConfigMap` object will be updated by the API +server every time the object is modified. After an update, modifications will be +made visible to the consumer container: + +1. Create a `ConfigMap` object. +2. Create a new pod using the `ConfigMap` via the volume plugin. +3. The pod is scheduled onto a node. +4. During the sync loop, the Kubelet creates an instance of the volume plugin +and calls its `Setup()` method. +5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod +and projects the appropriate data into the volume. +6. The `ConfigMap` referenced by the pod is updated. +7. During the next iteration of the `syncLoop`, the Kubelet creates an instance +of the volume plugin and calls its `Setup()` method. +8. The volume plugin projects the updated data into the volume atomically. + +It is the consuming pod's responsibility to make use of the updated data once it +is made visible. 
+ +Because environment variables cannot be updated without restarting a container, +configuration data consumed in environment variables will not be updated. + +### Advantages + +* Easy to consume in pods; consumer-agnostic +* Configuration data is persistent and versioned +* Consumers of configuration data in volumes can respond to changes in the data + +## Proposed Design + +### API Resource + +The `ConfigMap` resource will be added to the main API: + +```go +package api + +// ConfigMap holds configuration data for pods to consume. +type ConfigMap struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + // Data contains the configuration data. Each key must be a valid + // DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN. + Data map[string]string `json:"data,omitempty"` +} + +type ConfigMapList struct { + TypeMeta `json:",inline"` + ListMeta `json:"metadata,omitempty"` + + Items []ConfigMap `json:"items"` +} +``` + +A `Registry` implementation for `ConfigMap` will be added to +`pkg/registry/configmap`. + +### Environment Variables + +The `EnvVarSource` will be extended with a new selector for `ConfigMap`: + +```go +package api + +// EnvVarSource represents a source for the value of an EnvVar. +type EnvVarSource struct { + // other fields omitted + + // Selects a key of a ConfigMap. + ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` +} + +// Selects a key from a ConfigMap. +type ConfigMapKeySelector struct { + // The ConfigMap to select from. + LocalObjectReference `json:",inline"` + // The key to select. + Key string `json:"key"` +} +``` + +### Volume Source + +A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap` +object will be added to the `VolumeSource` struct in the API: + +```go +package api + +type VolumeSource struct { + // other fields omitted + ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"` +} + +// Represents a volume that holds configuration data. 
+type ConfigMapVolumeSource struct {
+    LocalObjectReference `json:",inline"`
+    // A list of keys to project into the volume.
+    // If unspecified, each key-value pair in the Data field of the
+    // referenced ConfigMap will be projected into the volume as a file whose name
+    // is the key and content is the value.
+    // If specified, the listed keys will be projected into the specified paths,
+    // and unlisted keys will not be present.
+    Items []KeyToPath `json:"items,omitempty"`
+}
+
+// Represents a mapping of a key to a relative path.
+type KeyToPath struct {
+    // The name of the key to select
+    Key string `json:"key"`
+
+    // The relative path name of the file to be created.
+    // Must not be absolute or contain the '..' path. Must be utf-8 encoded.
+    // The first item of the relative path must not start with '..'
+    Path string `json:"path"`
+}
+```
+
+**Note:** The update logic used in the downward API volume plug-in will be
+extracted and re-used in the volume plug-in for `ConfigMap`.
+
+### Changes to Secret
+
+We will update the Secret volume plugin to have a similar API to the new
+`ConfigMap` volume plugin. The secret volume plugin will also begin updating
+secret content in the volume when secrets change.
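The path restrictions stated in the `KeyToPath` comments (relative only, no `..` elements) are easy to express as code. A minimal validity check might look like the following; the function is hypothetical, and the API server's real validation lives elsewhere:

```go
package main

import (
	"path"
	"strings"
)

// validKeyPath reports whether a KeyToPath.Path satisfies the rules in the
// API comments: it must be a relative path and must not contain a ".."
// element. Hypothetical helper, not the actual API-server validation code.
func validKeyPath(p string) bool {
	if p == "" || path.IsAbs(p) {
		return false
	}
	for _, elem := range strings.Split(p, "/") {
		if elem == ".." {
			return false
		}
	}
	return true
}
```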
+ +## Examples + +#### Consuming `ConfigMap` as Environment Variables + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: etcd-env-config +data: + number-of-members: "1" + initial-cluster-state: new + initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN + discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN + discovery-url: http://etcd-discovery:2379 + etcdctl-peers: http://etcd:2379 +``` + +This pod consumes the `ConfigMap` as environment variables: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: config-env-example +spec: + containers: + - name: etcd + image: openshift/etcd-20-centos7 + ports: + - containerPort: 2379 + protocol: TCP + - containerPort: 2380 + protocol: TCP + env: + - name: ETCD_NUM_MEMBERS + valueFrom: + configMapKeyRef: + name: etcd-env-config + key: number-of-members + - name: ETCD_INITIAL_CLUSTER_STATE + valueFrom: + configMapKeyRef: + name: etcd-env-config + key: initial-cluster-state + - name: ETCD_DISCOVERY_TOKEN + valueFrom: + configMapKeyRef: + name: etcd-env-config + key: discovery-token + - name: ETCD_DISCOVERY_URL + valueFrom: + configMapKeyRef: + name: etcd-env-config + key: discovery-url + - name: ETCDCTL_PEERS + valueFrom: + configMapKeyRef: + name: etcd-env-config + key: etcdctl-peers +``` + +#### Consuming `ConfigMap` as Volumes + +`redis-volume-config` is intended to be used as a volume containing a config +file: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: redis-volume-config +data: + redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n" +``` + +The following pod consumes the `redis-volume-config` in a volume: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: config-volume-example +spec: + containers: + - name: redis + image: kubernetes/redis + command: ["redis-server", "/mnt/config-map/etc/redis.conf"] + ports: + - containerPort: 6379 + volumeMounts: + - name: config-map-volume + mountPath: /mnt/config-map + volumes: + - name: config-map-volume + 
configMap:
+      name: redis-volume-config
+      items:
+      - path: "etc/redis.conf"
+        key: redis.conf
+```
+
+## Future Improvements
+
+In the future, we may add the ability to specify an init-container that can
+watch the volume contents for updates and respond to changes when they occur.
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/configmap.md?pixel)]()
+
diff --git a/design/control-plane-resilience.md b/design/control-plane-resilience.md
new file mode 100644
index 00000000..8193fd97
--- /dev/null
+++ b/design/control-plane-resilience.md
@@ -0,0 +1,241 @@
+# Kubernetes and Cluster Federation Control Plane Resilience
+
+## Long Term Design and Current Status
+
+### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
+
+### December 14, 2015
+
+## Summary
+
+Some amount of confusion exists around how we currently ensure, and in future
+intend to ensure, resilience of the Kubernetes (and by implication
+Kubernetes Cluster Federation) control plane. This document is an attempt to
+capture that definitively. It covers areas including self-healing, high
+availability, bootstrapping and recovery. Most of the information in
+this document already exists in the form of github comments,
+PR's/proposals, scattered documents, and corridor conversations, so this
+document is primarily a consolidation and clarification of existing
+ideas.
+
+## Terms
+
+* **Self-healing:** automatically restarting or replacing failed
+  processes and machines without human intervention
+* **High availability:** continuing to be available and work correctly
+  even if some components are down or uncontactable. This typically
+  involves multiple replicas of critical services, and a reliable way
+  to find available replicas. Note that it's possible (but not
+  desirable) to have high availability properties (e.g. multiple
+  replicas) in the absence of self-healing properties (e.g. if a
+  replica fails, nothing replaces it).
Fairly obviously, given enough time, such systems typically + become unavailable (after enough replicas have failed). +* **Bootstrapping**: creating an empty cluster from nothing +* **Recovery**: recreating a non-empty cluster after perhaps + catastrophic failure/unavailability/data corruption + +## Overall Goals + +1. **Resilience to single failures:** Kubernetes clusters constrained + to single availability zones should be resilient to individual + machine and process failures by being both self-healing and highly + available (within the context of such individual failures). +1. **Ubiquitous resilience by default:** The default cluster creation + scripts for (at least) GCE, AWS and basic bare metal should adhere + to the above (self-healing and high availability) by default (with + options available to disable these features to reduce control plane + resource requirements if so required). It is hoped that other + cloud providers will also follow the above guidelines, but the + above 3 are the primary canonical use cases. +1. **Resilience to some correlated failures:** Kubernetes clusters + which span multiple availability zones in a region should by + default be resilient to complete failure of one entire availability + zone (by similarly providing self-healing and high availability in + the default cluster creation scripts as above). +1. **Default implementation shared across cloud providers:** The + differences between the default implementations of the above for + GCE, AWS and basic bare metal should be minimized. This implies + using shared libraries across these providers in the default + scripts in preference to highly customized implementations per + cloud provider. This is not to say that highly differentiated, + customized per-cloud cluster creation processes (e.g. for GKE on + GCE, or some hosted Kubernetes provider on AWS) are discouraged. + But those fall squarely outside the basic cross-platform OSS + Kubernetes distro. +1. 
**Self-hosting:** Where possible, Kubernetes's existing mechanisms + for achieving system resilience (replication controllers, health + checking, service load balancing etc) should be used in preference + to building a separate set of mechanisms to achieve the same thing. + This implies that self hosting (the kubernetes control plane on + kubernetes) is strongly preferred, with the caveat below. +1. **Recovery from catastrophic failure:** The ability to quickly and + reliably recover a cluster from catastrophic failure is critical, + and should not be compromised by the above goal to self-host + (i.e. it goes without saying that the cluster should be quickly and + reliably recoverable, even if the cluster control plane is + broken). This implies that such catastrophic failure scenarios + should be carefully thought out, and the subject of regular + continuous integration testing, and disaster recovery exercises. + +## Relative Priorities + +1. **(Possibly manual) recovery from catastrophic failures:** having a +Kubernetes cluster, and all applications running inside it, disappear forever +perhaps is the worst possible failure mode. So it is critical that we be able to +recover the applications running inside a cluster from such failures in some +well-bounded time period. + 1. In theory a cluster can be recovered by replaying all API calls + that have ever been executed against it, in order, but most + often that state has been lost, and/or is scattered across + multiple client applications or groups. So in general it is + probably infeasible. + 1. In theory a cluster can also be recovered to some relatively + recent non-corrupt backup/snapshot of the disk(s) backing the + etcd cluster state. But we have no default consistent + backup/snapshot, verification or restoration process. 
And we
+    don't routinely test restoration, so even if we did routinely
+    perform and verify backups, we have no hard evidence that we
+    can in practice effectively recover from catastrophic cluster
+    failure or data corruption by restoring from these backups. So
+    there's more work to be done here.
+1. **Self-healing:** Most major cloud providers provide the ability to
+  easily and automatically replace failed virtual machines within a
+  small number of minutes (e.g. GCE
+  [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
+  and Managed Instance Groups,
+  AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
+  and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
+  can fairly trivially be used to reduce control-plane down-time due
+  to machine failure to a small number of minutes per failure
+  (i.e. typically around "3 nines" availability), provided that:
+    1. cluster persistent state (i.e. etcd disks) is either:
+        1. truly persistent (i.e. remote persistent disks), or
+        1. reconstructible (e.g. using etcd [dynamic member
+           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
+           or [backup and
+           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
+    1. and boot disks are either:
+        1. truly persistent (i.e. remote persistent disks), or
+        1. reconstructible (e.g. using boot-from-snapshot,
+           boot-from-pre-configured-image or
+           boot-from-auto-initializing image).
+1. **High Availability:** This has the potential to increase
+  availability above the approximately "3 nines" level provided by
+  automated self-healing, but it's somewhat more complex, and
+  requires additional resources (e.g. redundant API servers and etcd
+  quorum members). In environments where cloud-assisted automatic
+  self-healing might be infeasible (e.g.
on-premise bare-metal
+  deployments), it also gives cluster administrators more time to
+  respond (e.g. replace/repair failed machines) without incurring
+  system downtime.
+
+## Design and Status (as of December 2015)
+
+<table>
+<tr>
+<th>Control Plane Component</th>
+<th>Resilience Plan</th>
+<th>Current Status</th>
+</tr>
+<tr>
+<td>API Server</td>
+<td>
+Multiple stateless, self-hosted, self-healing API servers behind an HA
+load balancer, built out by the default "kube-up" automation on GCE,
+AWS and basic bare metal (BBM). Note that the single-host approach of
+having etcd listen only on localhost to ensure that only the API server can
+connect to it will no longer work, so alternative security will be
+needed in that regard (either using firewall rules, SSL certs, or
+something else). All necessary flags are currently supported to enable
+SSL between the API server and etcd (OpenShift runs like this out of the
+box), but this needs to be woven into the "kube-up" and related
+scripts. Detailed design of self-hosting and related bootstrapping
+and catastrophic failure recovery will be covered in a separate
+design doc.
+</td>
+<td>
+No scripted self-healing or HA on GCE, AWS or basic bare metal
+currently exists in the OSS distro. To be clear, "no self-healing"
+means that even if multiple e.g. API servers are provisioned for HA
+purposes, if they fail, nothing replaces them, so eventually the
+system will fail. Self-healing and HA can be set up
+manually by following documented instructions, but this is not
+currently an automated process, and it is not tested as part of
+continuous integration. So it's probably safest to assume that it
+doesn't actually work in practice.
+</td>
+</tr>
+<tr>
+<td>Controller manager and scheduler</td>
+<td>
+Multiple self-hosted, self-healing, warm-standby, stateless controller
+managers and schedulers with leader election and automatic failover of API
+server clients, automatically installed by the default "kube-up" automation.
+</td>
+<td>As above.</td>
+</tr>
+<tr>
+<td>etcd</td>
+<td>
+Multiple (3-5) etcd quorum members behind a load balancer with session
+affinity (to prevent clients from being bounced from one to another).
+
+Regarding self-healing, if a node running etcd goes down, it is always necessary
+to do three things:
+
+1. allocate a new node (not necessary if running etcd as a pod, in
+which case specific measures are required to prevent user pods from
+interfering with system pods, for example using node selectors as
+described in dynamic member addition).
+
+In the case of remote persistent disk, the etcd state can be recovered by
+attaching the remote persistent disk to the replacement node; thus the state is
+recoverable even if all other replicas are down.
+
+There are also significant performance differences between local disks and
+remote persistent disks. For example, the sustained throughput of local disks
+in GCE is approximately 20x that of remote disks.
+
+Hence we suggest that self-healing be provided by remotely mounted persistent
+disks in non-performance-critical, single-zone cloud deployments. For
+performance-critical installations, faster local SSDs should be used, in which
+case remounting on node failure is not an option, so etcd runtime configuration
+should be used to replace the failed machine. Similarly, for cross-zone
+self-healing, cloud persistent disks are zonal, so automatic runtime
+configuration is required. Basic bare metal deployments also cannot generally
+rely on remote persistent disks, so the same approach applies there.
+</td>
+<td>
+Somewhat vague instructions exist on how to set some of this up manually in
+a self-hosted configuration. But automatic bootstrapping and self-healing is not
+described (and is not implemented for the non-PD cases). This all still needs to
+be automated and continuously tested.
+</td>
+</tr>
+</table>
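The "3-5 etcd quorum members" figure above follows from etcd's majority-quorum rule: a cluster of n members needs floor(n/2)+1 members up to make progress, and therefore tolerates floor((n-1)/2) simultaneous failures. A small sketch of that arithmetic (function names are illustrative):

```go
package main

// quorum returns how many members of an n-member etcd cluster must be
// healthy for the cluster to accept writes (a strict majority).
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many simultaneous member failures an n-member
// cluster survives while still retaining quorum.
func tolerated(n int) int { return (n - 1) / 2 }
```

Note that an even-sized cluster buys nothing: 4 members tolerate the same single failure as 3, which is why odd sizes of 3 or 5 are recommended.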
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/control-plane-resilience.md?pixel)]()
+
diff --git a/design/daemon.md b/design/daemon.md
new file mode 100644
index 00000000..2c306056
--- /dev/null
+++ b/design/daemon.md
@@ -0,0 +1,206 @@
+# DaemonSet in Kubernetes
+
+**Author**: Ananya Kumar (@AnanyaKumar)
+
+**Status**: Implemented.
+
+This document presents the design of the Kubernetes DaemonSet, describes use
+cases, and gives an overview of the code.
+
+## Motivation
+
+Many users have requested a way to run a daemon on every node in a
+Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential
+for use cases such as building a sharded datastore, or running a logger on every
+node. In comes the DaemonSet, a way to conveniently create and manage
+daemon-like workloads in Kubernetes.
+
+## Use Cases
+
+The DaemonSet can be used for user-specified system services, cluster-level
+applications with strong node ties, and Kubernetes node services. Below are
+example use cases in each category.
+
+### User-Specified System Services
+
+Logging: Some users want a way to collect statistics about nodes in a cluster
+and send those logs to an external database. For example, system administrators
+might want to know if their machines are performing as expected, if they need to
+add more machines to the cluster, or if they should switch cloud providers. The
+DaemonSet can be used to run a data collection service (for example fluentd) on
+every node and send the data to a service like ElasticSearch for analysis.
+
+### Cluster-Level Applications
+
+Datastore: Users might want to implement a sharded datastore in their cluster. A
+few nodes in the cluster, labeled ‘app=datastore’, might be responsible for
+storing data shards, and pods running on these nodes might serve data. This
+architecture requires a way to bind pods to specific nodes, so it cannot be
+achieved using a Replication Controller.
A DaemonSet is a convenient way to
+implement such a datastore.
+
+For other uses, see the related [feature request](https://issues.k8s.io/1518).
+
+## Functionality
+
+The DaemonSet supports standard API features:
+ - create
+   - The spec for DaemonSets has a pod template field.
+   - Using the pod’s nodeSelector field, DaemonSets can be restricted to operate
+over nodes that have a certain label. For example, suppose that in a cluster
+some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a
+datastore pod on exactly those nodes labeled ‘app=database’.
+   - Using the pod's nodeName field, DaemonSets can be restricted to operate on a
+specified node.
+   - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec
+used by the Replication Controller.
+   - The initial implementation will not guarantee that DaemonSet pods are
+created on nodes before other pods.
+   - The initial implementation of DaemonSet does not guarantee that DaemonSet
+pods show up on nodes (for example because of resource limitations of the node),
+but makes a best effort to launch DaemonSet pods (like Replication Controllers
+do with pods). Subsequent revisions might ensure that DaemonSet pods show up on
+nodes, preempting other pods if necessary.
+   - The DaemonSet controller adds an annotation:
+```"kubernetes.io/created-by: \"```
+   - YAML example:
+
+```YAML
+apiVersion: extensions/v1beta1
+kind: DaemonSet
+metadata:
+  labels:
+    app: datastore
+  name: datastore
+spec:
+  template:
+    metadata:
+      labels:
+        app: datastore-shard
+    spec:
+      nodeSelector:
+        app: datastore-node
+      containers:
+      - name: datastore-shard
+        image: kubernetes/sharded
+        ports:
+        - containerPort: 9042
+          name: main
+```
+
+ - commands that get info:
+   - get (e.g.
kubectl get daemonsets)
+   - describe
+ - Modifiers:
+   - delete (if --cascade=true, then first the client turns down all the pods
+controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is
+unlikely to be set on any node); then it deletes the DaemonSet; then it deletes
+the pods)
+   - label
+   - annotate
+   - update operations like patch and replace (only allowed on the selector and
+on the nodeSelector and nodeName of the pod template)
+   - DaemonSets have labels, so you could, for example, list all DaemonSets
+with certain labels (the same way you would for a Replication Controller).
+
+In general, for all the supported features like get, describe, update, etc.,
+the DaemonSet works in a similar way to the Replication Controller. However,
+note that the DaemonSet and the Replication Controller are different constructs.
+
+### Persisting Pods
+
+ - Ordinary liveness probes specified in the pod template work to keep pods
+created by a DaemonSet running.
+ - If a daemon pod is killed or stopped, the DaemonSet will create a new
+replica of the daemon pod on the node.
+
+### Cluster Mutations
+
+ - When a new node is added to the cluster, the DaemonSet controller starts
+daemon pods on the node for DaemonSets whose pod template nodeSelectors match
+the node’s labels.
+ - Suppose the user launches a DaemonSet that runs a logging daemon on all
+nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label
+to a node (that did not initially have the label), the logging daemon will
+launch on the node. Additionally, if a user removes the label from a node, the
+logging daemon on that node will be killed.
+
+## Alternatives Considered
+
+We considered several alternatives that were deemed inferior to the approach of
+creating a new DaemonSet abstraction.
+
+One alternative is to include the daemon in the machine image.
In this case it +would run outside of Kubernetes proper, and thus not be monitored, health +checked, usable as a service endpoint, easily upgradable, etc. + +A related alternative is to package daemons as static pods. This would address +most of the problems described above, but they would still not be easily +upgradable, and more generally could not be managed through the API server +interface. + +A third alternative is to generalize the Replication Controller. We would do +something like: if you set the `replicas` field of the ReplicationControllerSpec +to -1, then it means "run exactly one replica on every node matching the +nodeSelector in the pod template." The ReplicationController would pretend +`replicas` had been set to some large number -- larger than the largest number +of nodes ever expected in the cluster -- and would use some anti-affinity +mechanism to ensure that no more than one Pod from the ReplicationController +runs on any given node. There are two downsides to this approach. First, +there would always be a large number of Pending pods in the scheduler (these +will be scheduled onto new machines when they are added to the cluster). The +second downside is more philosophical: DaemonSet and the Replication Controller +are very different concepts. We believe that having small, targeted controllers +for distinct purposes makes Kubernetes easier to understand and use, compared to +having larger multi-functional controllers (see +["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for +some discussion of this topic). + +## Design + +#### Client + +- Add support for DaemonSet commands to kubectl and the client. Client code was +added to pkg/client/unversioned. The main files in Kubectl that were modified are +pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create, +and Update, the client simply forwards the request to the backend via the REST +API. 

#### Apiserver

- Accepts, parses, and validates client commands
- REST API calls are handled in pkg/registry/daemonset
  - In particular, the API server will add the object to etcd
  - DaemonManager listens for updates to etcd (using Framework.informer)
- API objects for DaemonSet were created in expapi/v1/types.go and
expapi/v1/register.go
- Validation code is in expapi/validation

#### Daemon Manager

- Creates new DaemonSets when requested. Launches the corresponding daemon pod
on all nodes with labels matching the new DaemonSet's selector.
- Listens for the addition of new nodes to the cluster by setting up a
framework.NewInformer that watches for the creation of Node API objects. When a
new node is added, the daemon manager will loop through each DaemonSet. If the
node's labels match the selector of the DaemonSet, then the daemon manager
will create the corresponding daemon pod on the new node.
- The daemon manager creates a pod on a node by sending a command to the API
server, requesting that a pod be bound to the node (the node is specified via
its hostname).

#### Kubelet

- Does not need to be modified: health checking will keep the daemon pods
running and revive them if they are killed (we set the pod restartPolicy to
Always). We reject DaemonSet objects with pod templates that don't have
restartPolicy set to Always.

## Open Issues

- Should work similarly to [Deployment](http://issues.k8s.io/1743).
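To make the matching step described in the Daemon Manager section above concrete, here is a small, self-contained Go sketch. The types and function names are simplified stand-ins for illustration, not the actual controller code:

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy a DaemonSet's
// nodeSelector: every key/value in the selector must be present on the node.
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

// nodesNeedingPod returns the nodes on which a daemon pod should be created:
// nodes whose labels match the selector and which do not run one already.
func nodesNeedingPod(nodeLabels map[string]map[string]string, selector map[string]string, hasPod map[string]bool) []string {
	var out []string
	for name, labels := range nodeLabels {
		if matchesSelector(labels, selector) && !hasPod[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	nodes := map[string]map[string]string{
		"node-a": {"app": "database"},
		"node-b": {"app": "web"},
		"node-c": {"app": "database"},
	}
	// node-a already runs the daemon pod; only node-c still needs one.
	fmt.Println(nodesNeedingPod(nodes, map[string]string{"app": "database"}, map[string]bool{"node-a": true}))
}
```

A real controller would drive this from informer events (node added, DaemonSet added or updated) rather than recomputing over all nodes.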
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/daemon.md?pixel)]() + diff --git a/design/downward_api_resources_limits_requests.md b/design/downward_api_resources_limits_requests.md new file mode 100644 index 00000000..ab17c321 --- /dev/null +++ b/design/downward_api_resources_limits_requests.md @@ -0,0 +1,622 @@ +# Downward API for resource limits and requests + +## Background + +Currently the downward API (via environment variables and volume plugin) only +supports exposing a Pod's name, namespace, annotations, labels and its IP +([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This +document explains the need and design to extend them to expose resources +(e.g. cpu, memory) limits and requests. + +## Motivation + +Software applications require configuration to work optimally with the resources they're allowed to use. +Exposing the requested and limited amounts of available resources inside containers will allow +these applications to be configured more easily. Although docker already +exposes some of this information inside containers, the downward API helps +exposing this information in a runtime-agnostic manner in Kubernetes. + +## Use cases + +As an application author, I want to be able to use cpu or memory requests and +limits to configure the operational requirements of my applications inside containers. +For example, Java applications expect to be made aware of the available heap size via +a command line argument to the JVM, for example: java -Xmx:``. Similarly, an +application may want to configure its thread pool based on available cpu resources and +the exported value of GOMAXPROCS. + +## Design + +This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473). +There are three approaches discussed in this document to obtain resources limits +and requests to be exposed as environment variables and volumes inside +containers: + +1. 
The first approach requires users to specify full json path selectors,
where the selectors are relative to the pod spec. The benefit of this
approach is that it can select pod-level resources, and since containers are
also part of a pod spec, it can be used to select container-level
resources too.

2. The second approach requires specifying partial json path selectors,
which are relative to the container spec. This approach helps
in retrieving container-specific resource limits and requests, and at
the same time, it is simpler to specify than full json path selectors.

3. In the third approach, users specify fixed strings (magic keys) to retrieve
resources limits and requests and do not specify any json path
selectors. This approach is similar to the existing downward API
implementation approach. The advantages of this approach are that it is
simpler to specify than the first two, and does not require any type of
conversion between internal and versioned objects or json selectors, as
discussed below.

Before discussing the merits of each approach in more detail, here is a
brief discussion of json path selectors and some implications related
to their use.

#### JSONpath selectors

Versioned objects in kubernetes have json tags as part of their golang fields.
Currently, objects in the internal API have json tags, but it is planned that
these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933)
for discussion). So for discussion in this proposal, we assume that
internal objects do not have json tags. In the first two approaches
(full and partial json selectors), when a user creates a pod and its
containers, the user specifies a json path selector in the pod's
spec to retrieve values of its limits and requests. The selector
is composed of json tags similar to the json paths used with kubectl
([json](http://kubernetes.io/docs/user-guide/jsonpath/)).
This proposal
uses kubernetes' json path library to process the selectors to retrieve
the values. As kubelet operates on internal objects (without json tags),
and the selectors are part of versioned objects, retrieving values of
the limits and requests can be handled using these two solutions:

1. By converting an internal object to a versioned object, and then using
the json path library to retrieve the values from the versioned object
by processing the selector.

2. By converting a json selector of the versioned objects to the internal
object's golang expression and then using the json path library to
retrieve the values from the internal object by processing the golang
expression. However, converting a json selector of the versioned objects
to the internal object's golang expression will still require an instance
of the versioned object, so it seems like more work than the first solution,
unless there is another way that does not require the versioned object.

So there is a one-time conversion cost associated with the first (full
path) and second (partial path) approaches, whereas the third approach
(magic keys) does not require any such conversion and can directly
work on internal objects. If we want to avoid the conversion cost and
keep the implementation simple, the magic keys approach is the easiest
way to expose limits and requests with the least impact on existing
functionality.
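To make concrete what evaluating a full selector such as `spec.containers[?(@.name=="test-container")].resources.limits.cpu` involves, here is a minimal Go sketch that hand-evaluates the container filter and the field path. The types are hypothetical simplifications for illustration; the real implementation would use the Kubernetes jsonpath library against the versioned objects:

```go
package main

import "fmt"

// Hypothetical, simplified versions of the versioned API types.
type Resources struct {
	Limits   map[string]string
	Requests map[string]string
}

type Container struct {
	Name      string
	Resources Resources
}

type PodSpec struct {
	Containers []Container
}

// resolveFullSelector hand-evaluates the equivalent of
// spec.containers[?(@.name==name)].resources.limits[resource]:
// filter the container list by name, then walk the field path.
func resolveFullSelector(spec PodSpec, name, resource string) (string, bool) {
	for _, c := range spec.Containers {
		if c.Name == name {
			v, ok := c.Resources.Limits[resource]
			return v, ok
		}
	}
	return "", false
}

func main() {
	spec := PodSpec{Containers: []Container{{
		Name:      "test-container",
		Resources: Resources{Limits: map[string]string{"cpu": "500m", "memory": "128Mi"}},
	}}}
	v, _ := resolveFullSelector(spec, "test-container", "cpu")
	fmt.Println(v) // 500m
}
```

A magic-key lookup, by contrast, skips the selector parsing entirely: the key (for example `limits.cpu`) directly names the field to read from the current container's internal object.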

To summarize the merits/demerits of each approach:

| Approach | Scope | Conversion cost | JSON selectors | Future extension |
| ---------- | ------------------- | -------------------| ------------------- | ------------------- |
| Full selectors | Pod/Container | Yes | Yes | Possible |
| Partial selectors | Container | Yes | Yes | Possible |
| Magic keys | Container | No | No | Possible |

Note: pod resources can always be accessed using the existing `type ObjectFieldSelector` object
in conjunction with the partial selectors and magic keys approaches.

### API with full JSONpath selectors

Full json path selectors specify the complete path to the resources
limits and requests relative to the pod spec.

#### Environment variables

This table shows how selectors can be used for various requests and
limits to be exposed as environment variables. Environment variable names
are examples only and not necessarily as specified, and the selectors do not
have to start with a dot.

| Env Var Name | Selector |
| ---- | ------------------- |
| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu |
| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory |
| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu |
| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory |

#### Volume plugin

This table shows how selectors can be used for various requests and
limits to be exposed as volumes. The path names are examples only and
not necessarily as specified, and the selectors do not have to start with a dot.
+ + +| Path | Selector | +| ---- | ------------------- | +| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu| +| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory| +| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu| +| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory| + +Volumes are pod scoped, so a selector must be specified with a container name. + +Full json path selectors will use existing `type ObjectFieldSelector` +to extend the current implementation for resources requests and limits. + +``` +// ObjectFieldSelector selects an APIVersioned field of an object. +type ObjectFieldSelector struct { + APIVersion string `json:"apiVersion"` + // Required: Path of the field to select in the specified API version + FieldPath string `json:"fieldPath"` +} +``` + +#### Examples + +These examples show how to use full selectors with environment variables and volume plugin. 
+ +``` +apiVersion: v1 +kind: Pod +metadata: + name: dapi-test-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh","-c", "env" ] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + env: + - name: CPU_LIMIT + valueFrom: + fieldRef: + fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu +``` + +``` +apiVersion: v1 +kind: Pod +metadata: + name: kubernetes-downwardapi-volume-example +spec: + containers: + - name: client-container + image: gcr.io/google_containers/busybox + command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi;sleep 5; done"] + resources: + requests: + memory: "64Mi" + cpu: "250m" + limits: + memory: "128Mi" + cpu: "500m" + volumeMounts: + - name: podinfo + mountPath: /etc + readOnly: false + volumes: + - name: podinfo + downwardAPI: + items: + - path: "cpu_limit" + fieldRef: + fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu +``` + +#### Validations + +For APIs with full json path selectors, verify that selectors are +valid relative to pod spec. + + +### API with partial JSONpath selectors + +Partial json path selectors specify paths to resources limits and requests +relative to the container spec. These will be implemented by introducing a +`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current +implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. + +``` +// ContainerSpecFieldSelector selects an APIVersioned field of an object. 
type ContainerSpecFieldSelector struct {
  APIVersion string `json:"apiVersion"`
  // Container name
  ContainerName string `json:"containerName,omitempty"`
  // Required: Path of the field to select in the specified API version
  FieldPath string `json:"fieldPath"`
}

// Represents a single file containing information from the downward API
type DownwardAPIVolumeFile struct {
  // Required: Path is the relative path name of the file to be created.
  Path string `json:"path"`
  // Selects a field of the pod: only annotations, labels, name and
  // namespace are supported.
  FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
  // Selects a field of the container: only resources limits and requests
  // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
  // resources.requests.memory) are currently supported.
  ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
}

// EnvVarSource represents a source for the value of an EnvVar.
// Only one of its fields may be set.
type EnvVarSource struct {
  // Selects a field of the container: only resources limits and requests
  // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
  // resources.requests.memory) are currently supported.
  ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
  // Selects a field of the pod; only name and namespace are supported.
  FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
  // Selects a key of a ConfigMap.
  ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
  // Selects a key of a secret in the pod's namespace.
  SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
}
```

#### Environment variables

This table shows how partial selectors can be used for various requests and
limits to be exposed as environment variables.
Environment variable names
are examples only and not necessarily as specified, and the selectors do not
have to start with a dot.

| Env Var Name | Selector |
| -------------------- | -------------------|
| CPU_LIMIT | resources.limits.cpu |
| MEMORY_LIMIT | resources.limits.memory |
| CPU_REQUEST | resources.requests.cpu |
| MEMORY_REQUEST | resources.requests.memory |

Since environment variables are container scoped, specifying the container
name in a partial selector is optional: selectors are relative to the
container spec. If the container name is not specified, it defaults to the
current container. However, a container name can be specified to expose
variables from other containers.

#### Volume plugin

This table shows volume paths and partial selectors used for the cpu and memory resources.
Volume path names are examples only and not necessarily as specified, and the
selectors do not have to start with a dot.

| Path | Selector |
| -------------------- | -------------------|
| cpu_limit | resources.limits.cpu |
| memory_limit | resources.limits.memory |
| cpu_request | resources.requests.cpu |
| memory_request | resources.requests.memory |

Volumes are pod scoped, so the container name must be specified as part of
`containerSpecFieldRef`.

#### Examples

These examples show how to use partial selectors with environment variables and the volume plugin.

```
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: gcr.io/google_containers/busybox
    command: [ "/bin/sh", "-c", "env" ]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    env:
    - name: CPU_LIMIT
      valueFrom:
        containerSpecFieldRef:
          fieldPath: resources.limits.cpu
```

```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
  - name: client-container
    image: gcr.io/google_containers/busybox
    command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    volumeMounts:
    - name: podinfo
      mountPath: /etc
      readOnly: false
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: "cpu_limit"
        containerSpecFieldRef:
          containerName: "client-container"
          fieldPath: resources.limits.cpu
```

#### Validations

For APIs with partial json path selectors, verify
that selectors are valid relative to the container spec.
Also verify that a container name is provided with volumes.


### API with magic keys

In this approach, users specify fixed strings (or magic keys) to retrieve resources
limits and requests. This approach is similar to the existing downward
API implementation approach. The fixed strings used for resources limits and requests
for cpu and memory are `limits.cpu`, `limits.memory`,
`requests.cpu` and `requests.memory`. Although these strings look the same
as json path selectors, they are processed as fixed strings. These will be implemented by
introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current
implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.

The fields in ResourceFieldSelector are `containerName` to specify the name of a
container, `resource` to specify the type of a resource (cpu or memory), and `divisor`
to specify the output format of the values of exposed resources. The default value of divisor
is `1`, which means cores for cpu and bytes for memory. For cpu, the divisor's valid
values are `1m` (millicores) and `1` (cores); for memory, the valid values in fixed-point
integer (decimal) notation are `1` (bytes), `1k` (kilobytes), `1M` (megabytes), `1G` (gigabytes),
`1T` (terabytes), `1P` (petabytes), `1E` (exabytes), and their power-of-two equivalents
`1Ki` (kibibytes), `1Mi` (mebibytes), `1Gi` (gibibytes), `1Ti` (tebibytes), `1Pi` (pebibytes),
`1Ei` (exbibytes). For more information about these resource formats, [see details](resources.md).

Also, the exposed values will be the `ceiling` of the actual values in the format
requested by the divisor. For example, if requests.cpu is `250m` (250 millicores)
and the divisor is the default `1`, then the exposed value will be `1` core,
because 250 millicores converted to cores is 0.25, and the ceiling of 0.25 is 1.

```
type ResourceFieldSelector struct {
  // Container name
  ContainerName string `json:"containerName,omitempty"`
  // Required: Resource to select
  Resource string `json:"resource"`
  // Specifies the output format of the exposed resources
  Divisor resource.Quantity `json:"divisor,omitempty"`
}

// Represents a single file containing information from the downward API
type DownwardAPIVolumeFile struct {
  // Required: Path is the relative path name of the file to be created.
  Path string `json:"path"`
  // Selects a field of the pod: only annotations, labels, name and
  // namespace are supported.
  FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
  // Selects a resource of the container: only resources limits and requests
  // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
  ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
}

// EnvVarSource represents a source for the value of an EnvVar.
// Only one of its fields may be set.
type EnvVarSource struct {
  // Selects a resource of the container: only resources limits and requests
  // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
  ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
  // Selects a field of the pod; only name and namespace are supported.
  FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
  // Selects a key of a ConfigMap.
  ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
  // Selects a key of a secret in the pod's namespace.
  SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
}
```

#### Environment variables

This table shows environment variable names and the strings used for the cpu and memory resources.
The variable names are examples only and not necessarily as specified.

| Env Var Name | Resource |
| -------------------- | -------------------|
| CPU_LIMIT | limits.cpu |
| MEMORY_LIMIT | limits.memory |
| CPU_REQUEST | requests.cpu |
| MEMORY_REQUEST | requests.memory |

Since environment variables are container scoped, specifying the container
name in `resourceFieldRef` is optional: the resource strings are relative to
the container spec. If the container name is not specified, it defaults to
the current container. However, a container name can be specified to expose
variables from other containers.

#### Volume plugin

This table shows volume paths and the strings used for the cpu and memory resources.
Volume path names are examples only and not necessarily as specified.

| Path | Resource |
| -------------------- | -------------------|
| cpu_limit | limits.cpu |
| memory_limit | limits.memory |
| cpu_request | requests.cpu |
| memory_request | requests.memory |

Volumes are pod scoped, so the container name must be specified as part of
`resourceFieldRef`.

#### Examples

These examples show how to use the magic keys approach with environment variables and the volume plugin.

```
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: gcr.io/google_containers/busybox
    command: [ "/bin/sh", "-c", "env" ]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    env:
    - name: CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
    - name: MEMORY_LIMIT
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
          divisor: "1Mi"
```

In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively.

```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
  - name: client-container
    image: gcr.io/google_containers/busybox
    command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    volumeMounts:
    - name: podinfo
      mountPath: /etc
      readOnly: false
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: "cpu_limit"
        resourceFieldRef:
          containerName: client-container
          resource: limits.cpu
          divisor: "1m"
      - path: "memory_limit"
        resourceFieldRef:
          containerName: client-container
          resource: limits.memory
```

In the above example, the values exposed in the `cpu_limit` and `memory_limit` files will be 500 (in millicores) and 134217728 (in bytes), respectively.
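The ceiling rule used in these examples can be sketched in plain Go. This is only an illustration of the arithmetic, not the actual `resource.Quantity` implementation; both arguments are expressed in a common base unit (millicores for cpu, bytes for memory):

```go
package main

import "fmt"

// exposedValue divides a raw value by a divisor and rounds up (ceiling),
// mirroring the rounding rule the downward API applies when formatting
// resource values. Both arguments share the same base unit.
func exposedValue(raw, divisor int64) int64 {
	return (raw + divisor - 1) / divisor
}

func main() {
	// limits.cpu = 500m with divisor 1m -> 500 (millicores)
	fmt.Println(exposedValue(500, 1))
	// limits.cpu = 500m with divisor 1 core (1000m) -> ceil(0.5) = 1
	fmt.Println(exposedValue(500, 1000))
	// limits.memory = 128Mi with divisor 1 byte -> 134217728
	fmt.Println(exposedValue(128*1024*1024, 1))
	// requests.cpu = 250m with divisor 1 core -> ceil(0.25) = 1
	fmt.Println(exposedValue(250, 1000))
}
```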


#### Validations

For APIs with magic keys, verify that the resource strings are valid and are one
of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`.
Also verify that a container name is provided with volumes.

## Pod-level and container-level resource access

Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with the `type ObjectFieldSelector` object in
all approaches. Container-level resources will be accessed by `type ObjectFieldSelector`
in the full selector approach, and by `type ContainerSpecFieldRef` and `type ResourceFieldRef`
in the partial and magic keys approaches, respectively. The following table
summarizes resource access with these approaches.

| Approach | Pod resources | Container resources |
| -------------------- | -------------------|-------------------|
| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector` |
| Partial selectors | `ObjectFieldSelector` | `ContainerSpecFieldRef` |
| Magic keys | `ObjectFieldSelector` | `ResourceFieldRef` |

## Output format

The output format for resources limits and requests will be the same as
the cgroups output format, i.e. cpu in cpu shares (cores multiplied by 1024
and rounded to an integer) and memory in bytes. For example, a memory request
or limit of `64Mi` in the container spec will be output as `67108864`
bytes, and a cpu request or limit of `250m` (millicores) will be output as
`256` cpu shares.

## Implementation approach

The current implementation of this proposal will focus on the magic keys
approach. The main reason for selecting this approach is that it might be
easier to incorporate and extend resource-specific functionality.

## Applied example

Here we discuss how to use exposed resource values to set, for example, Java
memory size or GOMAXPROCS for your applications.
Let's say you expose a container's
requested memory (for a container running an application like Tomcat, for example)
as a `HEAP_SIZE` environment variable and its requested cpu as `CPU_LIMIT`
(or as GOMAXPROCS directly).
One way to set the heap size or cpu for this application would be to wrap the binary
in a shell script, and then export the `JAVA_OPTS` (assuming your container image supports it)
and GOMAXPROCS environment variables inside the container image. The spec file for the
application pod could look like:

```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
  - name: test-container
    image: gcr.io/google_containers/busybox
    command: [ "/bin/sh", "-c", "env" ]
    resources:
      requests:
        memory: "64M"
        cpu: "250m"
      limits:
        memory: "128M"
        cpu: "500m"
    env:
    - name: HEAP_SIZE
      valueFrom:
        resourceFieldRef:
          resource: requests.memory
    - name: CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          resource: requests.cpu
```

Note that the value of divisor by default is `1`. Now inside the container,
the HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as:

```
export JAVA_OPTS="$JAVA_OPTS -Xmx${HEAP_SIZE}"

export GOMAXPROCS=${CPU_LIMIT}
```


[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()

diff --git a/design/enhance-pluggable-policy.md b/design/enhance-pluggable-policy.md
new file mode 100644
index 00000000..2468d3c1
--- /dev/null
+++ b/design/enhance-pluggable-policy.md
@@ -0,0 +1,429 @@
# Enhance Pluggable Policy

While trying to develop an authorization plugin for Kubernetes, we found a few
places where API extensions would ease development and add power. There are a
few goals:
 1. Provide an authorization plugin that can evaluate a .Authorize() call based
on the full content of the request to RESTStorage.
This includes information
like the full verb, the content of creates and updates, and the names of
resources being acted upon.
 1. Provide a way to ask whether a user is permitted to take an action without
 running in process with the API Authorizer. For instance, a proxy for exec
 calls could ask whether a user can run the exec they are requesting.
 1. Provide a way to ask who can perform a given action on a given resource.
This is useful for answering questions like, "who can create replication
controllers in my namespace".

This proposal adds to and extends the existing API so that authorizers may
provide the functionality described above. It does not attempt to describe how
the policies themselves can be expressed; that is up to the authorization plugins
themselves.


## Enhancements to existing Authorization interfaces

The existing Authorization interfaces are described
[here](../admin/authorization.md). A couple of additions will allow the development
of an Authorizer that matches based on different rules than the existing
implementation.

### Request Attributes

The existing authorizer.Attributes only has 5 attributes (user, groups,
isReadOnly, kind, and namespace). If we add more detailed verbs, content, and
resource names, then Authorizer plugins will have the same level of information
available to RESTStorage components in order to express more detailed policy.
The replacement excerpt is below.

An API request has the following attributes that can be considered for
authorization:
 - user - the user string which a user was authenticated as. This is included
in the Context.
 - groups - the groups to which the user belongs. This is included in the
Context.
 - verb - a string describing the requested action. Today we have: get, list,
watch, create, update, and delete. The old `readOnly` behavior is equivalent to
allowing get, list, and watch.
 - namespace - the namespace of the object being accessed, or the empty string if
the endpoint does not support namespaced objects. This is included in the
Context.
 - resourceGroup - the API group of the resource being accessed
 - resourceVersion - the API version of the resource being accessed
 - resource - which resource is being accessed
   - applies only to the API endpoints, such as `/api/v1beta1/pods`. For
miscellaneous endpoints, like `/version`, the kind is the empty string.
 - resourceName - the name of the resource during a get, update, or delete
action.
 - subresource - which subresource is being accessed

A non-API request has 2 attributes:
 - verb - the HTTP verb of the request
 - path - the path of the URL being requested


### Authorizer Interface

The existing Authorizer interface is very simple, but there isn't a way to
provide details about allows, denies, or failures. The extended detail is useful
for UIs that want to describe why certain actions are allowed or disallowed. Not
all Authorizers will want to provide that information, but for those that do,
having that capability is useful. In addition, adding a `GetAllowedSubjects`
method that returns the users and groups that can perform a particular
action makes it possible to answer questions like, "who can see resources in my
namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down).

```go
// OLD
type Authorizer interface {
  Authorize(a Attributes) error
}
```

```go
// NEW
// Authorizer provides the ability to determine if a particular user can perform
// a particular action
type Authorizer interface {
  // Authorize takes a Context (for namespace, user, and traceability) and
  // Attributes to make a policy determination.
  // reason is an optional return value that can describe why a policy decision
  // was made.
Reasons are useful during debugging when trying to figure out + // why a user or group has access to perform a particular action. + Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error) +} + +// AuthorizerIntrospection is an optional interface that provides the ability to +// determine which users and groups can perform a particular action. This is +// useful for building caches of who can see what. For instance, "which +// namespaces can this user see". That would allow someone to see only the +// namespaces they are allowed to view instead of having to choose between +// listing them all or listing none. +type AuthorizerIntrospection interface { + // GetAllowedSubjects takes a Context (for namespace and traceability) and + // Attributes to determine which users and groups are allowed to perform the + // described action in the namespace. This API enables the ResourceBasedReview + // requests below + GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error) +} +``` + +### SubjectAccessReviews + +This set of APIs answers the question: can a user or group (use authenticated +user if none is specified) perform a given action. Given the Authorizer +interface (proposed or existing), this endpoint can be implemented generically +against any Authorizer by creating the correct Attributes and making an +.Authorize() call. + +There are three different flavors: + +1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this +checks to see if a specified user or group can perform a given action at the +cluster scope or across all namespaces. This is a highly privileged operation. +It allows a cluster-admin to inspect rights of any person across the entire +cluster and against cluster level resources. +2. 
`/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` -
+this checks to see if the current user (including their groups) can perform a
+given action at any specified scope. This is an unprivileged operation. It
+doesn't expose any information that a user couldn't discover simply by trying an
+endpoint themselves.
+3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` -
+this checks to see if a specified user or group can perform a given action in
+**this** namespace. This is a moderately privileged operation. In a multi-tenant
+environment, having a namespace scoped resource makes it very easy to reason
+about powers granted to a namespace admin. This allows a namespace admin
+(someone able to manage permissions inside one namespace, but not all
+namespaces) the power to inspect whether a given user or group can manipulate
+resources in their namespace.
+
+SubjectAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a SubjectAccessReview to this URL and receives
+a SubjectAccessReviewResponse in response. Here is an example of a call and its
+corresponding return:
+
+```
+// input
+{
+  "kind": "SubjectAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "authorizationAttributes": {
+    "verb": "create",
+    "resource": "pods",
+    "user": "Clark",
+    "groups": ["admins", "managers"]
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json
+// or
+accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject)
+
+// output
+{
+  "kind": "SubjectAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "allowed": true
+}
+```
+
+PersonalSubjectAccessReview is a runtime.Object with associated RESTStorage that
+only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL
+and receives a PersonalSubjectAccessReviewResponse in response. 
Here is an example of a call and
+its corresponding return:
+
+```
+// input
+{
+  "kind": "PersonalSubjectAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "authorizationAttributes": {
+    "verb": "create",
+    "resource": "pods",
+    "namespace": "any-ns"
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json
+// or
+accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(subjectAccessReviewObject)
+
+// output
+{
+  "kind": "PersonalSubjectAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "allowed": true
+}
+```
+
+LocalSubjectAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and
+receives a LocalSubjectAccessReviewResponse in response. Here is an example of a call and
+its corresponding return:
+
+```
+// input
+{
+  "kind": "LocalSubjectAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "namespace": "my-ns",
+  "authorizationAttributes": {
+    "verb": "create",
+    "resource": "pods",
+    "user": "Clark",
+    "groups": ["admins", "managers"]
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/localSubjectAccessReviews -d @local-subject-access-review.json
+// or
+accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject)
+
+// output
+{
+  "kind": "LocalSubjectAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "namespace": "my-ns",
+  "allowed": true
+}
+```
+
+The actual Go objects look like this:
+
+```go
+type AuthorizationAttributes struct {
+	// Namespace is the namespace of the action being requested. 
Currently, there
+	// is no distinction between no namespace and all namespaces
+	Namespace string `json:"namespace" description:"namespace of the action being requested"`
+	// Verb is one of: get, list, watch, create, update, delete
+	Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`
+	// ResourceGroup is the API group of the resource being requested
+	ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"`
+	// ResourceVersion is the API version of the resource being requested
+	ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
+	// Resource is one of the existing resource types
+	Resource string `json:"resource" description:"one of the existing resource types"`
+	// ResourceName is the name of the resource being requested for a "get" or
+	// deleted for a "delete"
+	ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
+	// Subresource is one of the existing subresource types
+	Subresource string `json:"subresource" description:"one of the existing subresources"`
+}
+
+// SubjectAccessReview is an object for requesting information about whether a
+// user or group can perform an action
+type SubjectAccessReview struct {
+	kapi.TypeMeta `json:",inline"`
+
+	// AuthorizationAttributes describes the action being tested.
+	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+	// User is optional, but at least one of User or Groups must be specified
+	User string `json:"user" description:"optional, user to check"`
+	// Groups is optional, but at least one of User or Groups must be specified
+	Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
+}
+
+// SubjectAccessReviewResponse describes whether or not a user or group can
+// perform an action
+type SubjectAccessReviewResponse struct {
+	kapi.TypeMeta
+
+	// Allowed is required. 
True if the action would be allowed, false otherwise. + Allowed bool + // Reason is optional. It indicates why a request was allowed or denied. + Reason string +} + +// PersonalSubjectAccessReview is an object for requesting information about +// whether a user or group can perform an action +type PersonalSubjectAccessReview struct { + kapi.TypeMeta `json:",inline"` + + // AuthorizationAttributes describes the action being tested. + AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` +} + +// PersonalSubjectAccessReviewResponse describes whether this user can perform +// an action +type PersonalSubjectAccessReviewResponse struct { + kapi.TypeMeta + + // Namespace is the namespace used for the access review + Namespace string + // Allowed is required. True if the action would be allowed, false otherwise. + Allowed bool + // Reason is optional. It indicates why a request was allowed or denied. + Reason string +} + +// LocalSubjectAccessReview is an object for requesting information about +// whether a user or group can perform an action +type LocalSubjectAccessReview struct { + kapi.TypeMeta `json:",inline"` + + // AuthorizationAttributes describes the action being tested. + AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` + // User is optional, but at least one of User or Groups must be specified + User string `json:"user" description:"optional, user to check"` + // Groups is optional, but at least one of User or Groups must be specified + Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` +} + +// LocalSubjectAccessReviewResponse describes whether or not a user or group can +// perform an action +type LocalSubjectAccessReviewResponse struct { + kapi.TypeMeta + + // Namespace is the namespace used for the access review + Namespace string + // Allowed is required. True if the action would be allowed, false otherwise. 
+	Allowed bool
+	// Reason is optional. It indicates why a request was allowed or denied.
+	Reason string
+}
+```
+
+### ResourceAccessReview
+
+This set of APIs answers the question: which users and groups can perform the
+specified verb on the specified resource kind. Given the Authorizer interface
+described above, this endpoint can be implemented generically against any
+Authorizer by calling the `GetAllowedSubjects()` function.
+
+There are two different flavors:
+
+1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReviews` - this
+checks to see which users and groups can perform a given action at the cluster
+scope or across all namespaces. This is a highly privileged operation. It allows
+a cluster-admin to inspect the rights of all subjects across the entire cluster and
+against cluster level resources.
+2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` -
+this checks to see which users and groups can perform a given action in **this**
+namespace. This is a moderately privileged operation. In a multi-tenant
+environment, having a namespace scoped resource makes it very easy to reason
+about powers granted to a namespace admin. This allows a namespace admin
+(someone able to manage permissions inside one namespace, but not all
+namespaces) the power to inspect which users and groups can manipulate
+resources in their namespace.
+
+ResourceAccessReview is a runtime.Object with associated RESTStorage that only
+accepts creates. The caller POSTs a ResourceAccessReview to this URL and receives
+a ResourceAccessReviewResponse in response. 
Here is an example of a call and its
+corresponding return:
+
+```
+// input
+{
+  "kind": "ResourceAccessReview",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "authorizationAttributes": {
+    "verb": "list",
+    "resource": "replicationcontrollers"
+  }
+}
+
+// POSTed like this
+curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json
+// or
+accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject)
+
+// output
+{
+  "kind": "ResourceAccessReviewResponse",
+  "apiVersion": "authorization.kubernetes.io/v1",
+  "users": ["Clark", "Hubert"],
+  "groups": ["cluster-admins"]
+}
+```
+
+The actual Go objects look like this:
+
+```go
+// ResourceAccessReview is a means to request a list of which users and groups
+// are authorized to perform the action specified by spec
+type ResourceAccessReview struct {
+	kapi.TypeMeta `json:",inline"`
+
+	// AuthorizationAttributes describes the action being tested.
+	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+}
+
+// ResourceAccessReviewResponse describes who can perform the action
+type ResourceAccessReviewResponse struct {
+	kapi.TypeMeta
+
+	// Users is the list of users who can perform the action
+	Users []string
+	// Groups is the list of groups who can perform the action
+	Groups []string
+}
+
+// LocalResourceAccessReview is a means to request a list of which users and
+// groups are authorized to perform the action specified in a specific namespace
+type LocalResourceAccessReview struct {
+	kapi.TypeMeta `json:",inline"`
+
+	// AuthorizationAttributes describes the action being tested. 
+	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
+}
+
+// LocalResourceAccessReviewResponse describes who can perform the action
+type LocalResourceAccessReviewResponse struct {
+	kapi.TypeMeta
+
+	// Namespace is the namespace used for the access review
+	Namespace string
+	// Users is the list of users who can perform the action
+	Users []string
+	// Groups is the list of groups who can perform the action
+	Groups []string
+}
+```
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()
+
diff --git a/design/event_compression.md b/design/event_compression.md
new file mode 100644
index 00000000..7a1cbb33
--- /dev/null
+++ b/design/event_compression.md
@@ -0,0 +1,169 @@
+# Kubernetes Event Compression
+
+This document captures the design of event compression.
+
+## Background
+
+Kubernetes components can get into a state where they generate tons of events.
+
+The events can be categorized in one of two ways:
+
+1. same - The event is identical to previous events except it varies only on
+timestamp.
+2. similar - The event is identical to previous events except it varies on
+timestamp and message.
+
+For example, when pulling a nonexistent image, Kubelet will repeatedly generate
+`image_not_existing` and `container_is_waiting` events until upstream components
+correct the image. When this happens, the spam from the repeated events makes
+the entire event mechanism useless. It also appears to cause memory pressure in
+etcd (see [#3853](http://issue.k8s.io/3853)).
+
+The goal is to introduce event counting to increment same events, and event
+aggregation to collapse similar events.
+
+## Proposal
+
+Each binary that generates events (for example, `kubelet`) should keep track of
+previously generated events so that it can collapse recurring events into a
+single event instead of creating a new instance for each new event. 
In addition,
+if many similar events are created, events should be aggregated into a single
+event to reduce spam.
+
+Event compression should be best effort (not guaranteed); in the worst
+case, `n` identical (minus timestamp) events may still result in `n` event
+entries.
+
+## Design
+
+Instead of a single Timestamp, each event object
+[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
+fields:
+ * `FirstTimestamp unversioned.Time`
+ * The date/time of the first occurrence of the event.
+ * `LastTimestamp unversioned.Time`
+ * The date/time of the most recent occurrence of the event.
+ * On first occurrence, this is equal to the FirstTimestamp.
+ * `Count int`
+ * The number of occurrences of this event between FirstTimestamp and
+LastTimestamp.
+ * On first occurrence, this is 1.
+
+Each binary that generates events:
+ * Maintains a historical record of previously generated events:
+ * Implemented with
+["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go)
+in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
+ * Implemented behind an `EventCorrelator` that manages two subcomponents:
+`EventAggregator` and `EventLogger`.
+ * The `EventCorrelator` observes all incoming events and lets each
+subcomponent visit and modify the event in turn.
+ * The `EventAggregator` runs an aggregation function over each event. This
+function buckets each event based on an `aggregateKey` and identifies the event
+uniquely with a `localKey` in that bucket.
+ * The default aggregation function groups similar events that differ only by
+`event.Message`. 
Its `localKey` is `event.Message` and its `aggregateKey` is
+produced by joining:
+ * `event.Source.Component`
+ * `event.Source.Host`
+ * `event.InvolvedObject.Kind`
+ * `event.InvolvedObject.Namespace`
+ * `event.InvolvedObject.Name`
+ * `event.InvolvedObject.UID`
+ * `event.InvolvedObject.APIVersion`
+ * `event.Reason`
+ * If the `EventAggregator` observes a similar event produced 10 times in a 10
+minute window, it drops the event that was provided as input and creates a new
+event that differs only on the message. The message denotes that this event is
+used to group similar events that matched on reason. This aggregated `Event` is
+then used in the event processing sequence.
+ * The `EventLogger` observes the event out of the `EventAggregator` and tracks
+the number of times it has observed that event previously by incrementing a key
+in a cache associated with that matching event.
+ * The key in the cache is generated from the event object minus
+timestamps/count/transient fields; specifically, the following event fields are
+used to construct a unique key for an event:
+ * `event.Source.Component`
+ * `event.Source.Host`
+ * `event.InvolvedObject.Kind`
+ * `event.InvolvedObject.Namespace`
+ * `event.InvolvedObject.Name`
+ * `event.InvolvedObject.UID`
+ * `event.InvolvedObject.APIVersion`
+ * `event.Reason`
+ * `event.Message`
+ * The LRU cache is capped at 4096 events for both `EventAggregator` and
+`EventLogger`. That means if a component (e.g. kubelet) runs for a long period
+of time and generates many unique events, the previously generated events
+cache will not grow unchecked in memory. Instead, after 4096 unique events are
+generated, the oldest events are evicted from the cache.
+ * When an event is generated, the previously generated events cache is checked
+(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). 
+ * If the key for the new event matches the key for a previously generated +event (meaning all of the above fields match between the new event and some +previously generated event), then the event is considered to be a duplicate and +the existing event entry is updated in etcd: + * The new PUT (update) event API is called to update the existing event +entry in etcd with the new last seen timestamp and count. + * The event is also updated in the previously generated events cache with +an incremented count, updated last seen timestamp, name, and new resource +version (all required to issue a future event update). + * If the key for the new event does not match the key for any previously +generated event (meaning none of the above fields match between the new event +and any previously generated events), then the event is considered to be +new/unique and a new event entry is created in etcd: + * The usual POST/create event API is called to create a new event entry in +etcd. + * An entry for the event is also added to the previously generated events +cache. + +## Issues/Risks + + * Compression is not guaranteed, because each component keeps track of event + history in memory + * An application restart causes event history to be cleared, meaning event +history is not preserved across application restarts and compression will not +occur across component restarts. + * Because an LRU cache is used to keep track of previously generated events, +if too many unique events are generated, old events will be evicted from the +cache, so events will only be compressed until they age out of the events cache, +at which point any new instance of the event will cause a new entry to be +created in etcd. 
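The dedup key construction described above can be sketched as follows. This is an illustrative sketch only; the `event` struct and `dedupKey` helper are simplified stand-ins for the real `api.Event` type and `events_cache.go` code:

```go
package main

import (
	"fmt"
	"strings"
)

// event carries only the fields that participate in the dedup key;
// a simplified stand-in for the real api.Event type.
type event struct {
	SourceComponent, SourceHost            string
	Kind, Namespace, Name, UID, APIVersion string
	Reason, Message                        string
}

// dedupKey joins the identity fields, deliberately omitting the
// timestamps and count so that recurring events collide in the cache
// and the existing entry's count can be incremented instead of
// creating a new entry.
func dedupKey(e event) string {
	return strings.Join([]string{
		e.SourceComponent, e.SourceHost,
		e.Kind, e.Namespace, e.Name, e.UID, e.APIVersion,
		e.Reason, e.Message,
	}, "/")
}

func main() {
	a := event{SourceComponent: "kubelet", SourceHost: "node-1",
		Kind: "Pod", Namespace: "default", Name: "web", UID: "123",
		APIVersion: "v1", Reason: "failedScheduling", Message: "no nodes"}
	b := a // the same logical event observed again later
	fmt.Println(dedupKey(a) == dedupKey(b)) // true
}
```

Because a recurrence maps onto the same cache key, only the count and last-seen timestamp of the existing entry need to be updated in etcd.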
+ +## Example + +Sample kubectl output: + +```console +FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE +Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet 
kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" +Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal +``` + +This demonstrates what would have been 20 separate entries (indicating +scheduling failure) collapsed/compressed down to 5 entries. + +## Related Pull Requests/Issues + + * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events. + * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API. + * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow +compressing multiple recurring events in to a single event. + * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a +single event to optimize etcd storage. + * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache +instead of map. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]() + diff --git a/design/expansion.md b/design/expansion.md new file mode 100644 index 00000000..ace1faf0 --- /dev/null +++ b/design/expansion.md @@ -0,0 +1,417 @@ +# Variable expansion in pod command, args, and env + +## Abstract + +A proposal for the expansion of environment variables using a simple `$(var)` +syntax. + +## Motivation + +It is extremely common for users to need to compose environment variables or +pass arguments to their commands using the values of environment variables. +Kubernetes should provide a facility for the 80% cases in order to decrease +coupling and the use of workarounds. + +## Goals + +1. Define the syntax format +2. Define the scoping and ordering of substitutions +3. Define the behavior for unmatched variables +4. 
Define the behavior for unexpected/malformed input
+
+## Constraints and Assumptions
+
+* This design should describe the simplest possible syntax to accomplish the
+use-cases.
+* Expansion syntax will not support more complicated shell-like behaviors such
+as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
+
+## Use Cases
+
+1. As a user, I want to compose new environment variables for a container using
+a substitution syntax to reference other variables in the container's
+environment and service environment variables.
+1. As a user, I want to substitute environment variables into a container's
+command.
+1. As a user, I want to do the above without requiring the container's image to
+have a shell.
+1. As a user, I want to be able to specify a default value for a service
+variable which may not exist.
+1. As a user, I want to see an event associated with the pod if an expansion
+fails (i.e., references variable names that cannot be expanded).
+
+### Use Case: Composition of environment variables
+
+Currently, containers are injected with docker-style environment variables for
+the services in their pod's namespace. There are several variables for each
+service, but users routinely need to compose URLs based on these variables
+because there is not a variable for the exact format they need. Users should be
+able to build new environment variables with the exact format they need.
+Eventually, it should also be possible to turn off the automatic injection of
+the docker-style variables into pods and let the users consume the exact
+information they need via the downward API and composition.
+
+#### Expanding expanded variables
+
+It should be possible to reference a variable which is itself the result of an
+expansion, if the referenced variable is declared in the container's environment
+prior to the one referencing it. 
Put another way -- a container's environment is +expanded in order, and expanded variables are available to subsequent +expansions. + +### Use Case: Variable expansion in command + +Users frequently need to pass the values of environment variables to a +container's command. Currently, Kubernetes does not perform any expansion of +variables. The workaround is to invoke a shell in the container's command and +have the shell perform the substitution, or to write a wrapper script that sets +up the environment and runs the command. This has a number of drawbacks: + +1. Solutions that require a shell are unfriendly to images that do not contain +a shell. +2. Wrapper scripts make it harder to use images as base images. +3. Wrapper scripts increase coupling to Kubernetes. + +Users should be able to do the 80% case of variable expansion in command without +writing a wrapper script or adding a shell invocation to their containers' +commands. + +### Use Case: Images without shells + +The current workaround for variable expansion in a container's command requires +the container's image to have a shell. This is unfriendly to images that do not +contain a shell (`scratch` images, for example). Users should be able to perform +the other use-cases in this design without regard to the content of their +images. + +### Use Case: See an event for incomplete expansions + +It is possible that a container with incorrect variable values or command line +may continue to run for a long period of time, and that the end-user would have +no visual or obvious warning of the incorrect configuration. If the kubelet +creates an event when an expansion references a variable that cannot be +expanded, it will help users quickly detect problems with expansions. + +## Design Considerations + +### What features should be supported? + +In order to limit complexity, we want to provide the right amount of +functionality so that the 80% cases can be realized and nothing more. 
We felt
+that the essentials boiled down to:
+
+1. Ability to perform direct expansion of variables in a string.
+2. Ability to specify default values via a prioritized mapping function but
+without support for defaults as a syntax-level feature.
+
+### What should the syntax be?
+
+The exact syntax for variable expansion has a large impact on how users perceive
+and relate to the feature. We considered implementing a very restrictive subset
+of the shell `${var}` syntax. This syntax is an attractive option on some level,
+because many people are familiar with it. However, this syntax also has a large
+number of lesser known features such as the ability to provide default values
+for unset variables, perform inline substitution, etc.
+
+In the interest of preventing conflation of the expansion feature in Kubernetes
+with the shell feature, we chose a different syntax similar to the one in
+Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since
+it is not required to implement the use-cases.
+
+Nested references, i.e., variable expansion within variable names, are not
+supported.
+
+#### How should unmatched references be treated?
+
+Ideally, it should be extremely clear when a variable reference couldn't be
+expanded. We decided the best experience for unmatched variable references would
+be to have the entire reference, syntax included, show up in the output. As an
+example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then
+`$(VARIABLE_NAME)` should be present in the output.
+
+#### Escaping the operator
+
+Although the `$(var)` syntax does overlap with the `$(command)` form of command
+substitution supported by many shells, because unexpanded variables are present
+verbatim in the output, we expect this will not present a problem to many users.
+If there is a collision between a variable name and command substitution syntax,
+the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate
+to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
+
+## Design
+
+This design encompasses the variable expansion syntax and specification and the
+changes needed to incorporate the expansion feature into the container's
+environment and command.
+
+### Syntax and expansion mechanics
+
+This section describes the expansion syntax, evaluation of variable values, and
+how unexpected or malformed inputs are handled.
+
+#### Syntax
+
+The inputs to the expansion feature are:
+
+1. A UTF-8 string (the input string), which may contain variable references.
+2. A function (the mapping function) that maps the name of a variable to the
+variable's value, of type `func(string) string`.
+
+Variable references in the input string are indicated exclusively with the syntax
+`$()`. The syntax tokens are:
+
+- `$`: the operator,
+- `(`: the reference opener, and
+- `)`: the reference closer.
+
+The operator has no meaning unless accompanied by the reference opener and
+closer tokens. The operator can be escaped using `$$`. One literal `$` will be
+emitted for each `$$` in the input.
+
+The reference opener and closer characters have no meaning when not part of a
+variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME`
+without a closing reference, the operator and reference opener characters are
+treated as ordinary characters without special meaning.
+
+#### Scope and ordering of substitutions
+
+The scope in which variable references are expanded is defined by the mapping
+function. Within the mapping function, any arbitrary strategy may be used to
+determine the value of a variable name. The most basic implementation of a
+mapping function is to use a `map[string]string` to look up the value of a
+variable. 
+
+In order to support default values for variables like service variables
+presented by the kubelet, which may not be bound because the service that
+provides them does not yet exist, there should be a mapping function that uses a
+list of `map[string]string` like:
+
+```go
+func MakeMappingFunc(maps ...map[string]string) func(string) string {
+	return func(input string) string {
+		for _, context := range maps {
+			val, ok := context[input]
+			if ok {
+				return val
+			}
+		}
+
+		return ""
+	}
+}
+
+// elsewhere
+containerEnv := map[string]string{
+	"FOO": "BAR",
+	"ZOO": "ZAB",
+	"SERVICE2_HOST": "some-host",
+}
+
+serviceEnv := map[string]string{
+	"SERVICE_HOST": "another-host",
+	"SERVICE_PORT": "8083",
+}
+
+// single-map variation
+mapping := MakeMappingFunc(containerEnv)
+
+// default variables not found in serviceEnv
+mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
+```
+
+### Implementation changes
+
+The necessary changes to implement this functionality are:
+
+1. Add a new interface, `ObjectEventRecorder`, which is like the
+`EventRecorder` interface, but scoped to a single object, and a function that
+returns an `ObjectEventRecorder` given an `ObjectReference` and an
+`EventRecorder`.
+2. Introduce a `third_party/golang/expansion` package that provides:
+ 1. An `Expand(string, func(string) string) string` function.
+ 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string`
+function.
+3. Make the kubelet expand environment correctly.
+4. Make the kubelet expand command correctly.
+
+#### Event Recording
+
+In order to provide an event when an expansion references undefined variables,
+the mapping function must be able to create an event. In order to facilitate
+this, we should create a new interface in the `api/client/record` package which
+is similar to `EventRecorder`, but scoped to a single object:
+
+```go
+// ObjectEventRecorder knows how to record events about a single object. 
+type ObjectEventRecorder interface { + // Event constructs an event from the given information and puts it in the queue for sending. + // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will + // be used to automate handling of events, so imagine people writing switch statements to + // handle them. You want to make that easy. + // 'message' is intended to be human readable. + // + // The resulting event will be created in the same namespace as the reference object. + Event(reason, message string) + + // Eventf is just like Event, but with Sprintf for the message field. + Eventf(reason, messageFmt string, args ...interface{}) + + // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. + PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) +} +``` + +There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object` +and an `EventRecorder`: + +```go +type objectRecorderImpl struct { + object runtime.Object + recorder EventRecorder +} + +func (r *objectRecorderImpl) Event(reason, message string) { + r.recorder.Event(r.object, reason, message) +} + +func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder { + return &objectRecorderImpl{object, recorder} +} +``` + +#### Expansion package + +The expansion package should provide two methods: + +```go +// MappingFuncFor returns a mapping function for use with Expand that +// implements the expansion semantics defined in the expansion spec; it +// returns the input string wrapped in the expansion syntax if no mapping +// for the input is found. If no expansion is found for a key, an event +// is raised on the given recorder. +func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string { + // ... 
+}
+
+// Expand replaces variable references in the input string according to
+// the expansion spec using the given mapping function to resolve the
+// values of variables.
+func Expand(input string, mapping func(string) string) string {
+	// ...
+}
+```
+
+#### Kubelet changes
+
+The Kubelet should be made to correctly expand variable references in a
+container's environment, command, and args. Changes will need to be made to:
+
+1. The `makeEnvironmentVariables` function in the kubelet; this is used by
+`GenerateRunContainerOptions`, which is used by both the docker and rkt
+container runtimes.
+2. The docker manager `setEntrypointAndCommand` func, which has to be changed to
+perform variable expansion.
+3. The rkt runtime, which should be made to support expansion in command and
+args when support for it is implemented.
+
+### Examples
+
+#### Inputs and outputs
+
+These examples are in the context of the mapping:
+
+| Name | Value |
+|-------------|------------|
+| `VAR_A` | `"A"` |
+| `VAR_B` | `"B"` |
+| `VAR_C` | `"C"` |
+| `VAR_REF` | `$(VAR_A)` |
+| `VAR_EMPTY` | `""` |
+
+No other variables are defined.
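Before walking through the table of inputs and outputs, here is a minimal, self-contained sketch of the semantics described above. This is not the proposed implementation: the event-recording hook is omitted, and the scanning strategy is only one way to satisfy the table.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// MappingFuncFor consults the given maps in order; when no map defines the
// key, it returns the reference wrapped back in the expansion syntax.
// (Event recording is omitted in this sketch.)
func MappingFuncFor(maps ...map[string]string) func(string) string {
	return func(input string) string {
		for _, m := range maps {
			if val, ok := m[input]; ok {
				return val
			}
		}
		return "$(" + input + ")"
	}
}

// Expand replaces $(VAR) references using mapping; "$$" collapses to a
// literal "$", and malformed references pass through untouched.
func Expand(input string, mapping func(string) string) string {
	var buf bytes.Buffer
	for i := 0; i < len(input); i++ {
		if input[i] == '$' && i+1 < len(input) {
			if input[i+1] == '$' { // escaped dollar sign
				buf.WriteByte('$')
				i++
				continue
			}
			if input[i+1] == '(' { // possible variable reference
				if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
					buf.WriteString(mapping(input[i+2 : i+2+end]))
					i += 2 + end
					continue
				}
			}
		}
		buf.WriteByte(input[i])
	}
	return buf.String()
}

func main() {
	mapping := MappingFuncFor(map[string]string{"VAR_A": "A", "VAR_B": "B"})
	fmt.Println(Expand("$(VAR_A)-1", mapping))         // A-1
	fmt.Println(Expand("$$(VAR_B)_$(VAR_A)", mapping)) // $(VAR_B)_A
	fmt.Println(Expand("$(VAR_DNE)", mapping))         // $(VAR_DNE)
}
```

Note how a single left-to-right pass reproduces the trickier rows in the table: unmatched references are emitted literally, and undefined variables come back wrapped in `$(...)`.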
+
+| Input | Result |
+|--------------------------------|----------------------------|
+| `"$(VAR_A)"` | `"A"` |
+| `"___$(VAR_B)___"` | `"___B___"` |
+| `"___$(VAR_C)"` | `"___C"` |
+| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` |
+| `"$(VAR_A)-1"` | `"A-1"` |
+| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` |
+| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` |
+| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` |
+| `"f000-$$VAR_A"` | `"f000-$VAR_A"` |
+| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` |
+| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` |
+| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` |
+| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` |
+| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` |
+| `"$(VAR_REF)"` | `"$(VAR_A)"` |
+| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
+| `"foo$(VAR_EMPTY)bar"` | `"foobar"` |
+| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` |
+| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` |
+| `"$?_boo_$!"` | `"$?_boo_$!"` |
+| `"$VAR_A"` | `"$VAR_A"` |
+| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` |
+| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` |
+| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` |
+| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` |
+| `"$$$$$$$(VAR_A)"` | `"$$$A"` |
+| `"$VAR_A)"` | `"$VAR_A)"` |
+| `"${VAR_A}"` | `"${VAR_A}"` |
+| `"$(VAR_B)_______$(A"` | `"B_______$(A"` |
+| `"$(VAR_C)_______$("` | `"C_______$("` |
+| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` |
+| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` |
+| `"--$($($($($--"` | `"--$($($($($--"` |
+| `"$($($($($--foo$("` | `"$($($($($--foo$("` |
+| `"foo0--$($($($("` | `"foo0--$($($($("` |
+| `"$(foo$$var)"` | `"$(foo$$var)"` |
+
+#### In a pod: building a URL
+
+Notice the `$(var)` syntax.
+ +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: expansion-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh", "-c", "env" ] + env: + - name: PUBLIC_URL + value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" + restartPolicy: Never +``` + +#### In a pod: building a URL using downward API + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: expansion-pod +spec: + containers: + - name: test-container + image: gcr.io/google_containers/busybox + command: [ "/bin/sh", "-c", "env" ] + env: + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: "metadata.namespace" + - name: PUBLIC_URL + value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" + restartPolicy: Never +``` + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]() + diff --git a/design/extending-api.md b/design/extending-api.md new file mode 100644 index 00000000..45a07ca5 --- /dev/null +++ b/design/extending-api.md @@ -0,0 +1,203 @@ +# Adding custom resources to the Kubernetes API server + +This document describes the design for implementing the storage of custom API +types in the Kubernetes API Server. + + +## Resource Model + +### The ThirdPartyResource + +The `ThirdPartyResource` resource describes the multiple versions of a custom +resource that the user wants to add to the Kubernetes API. `ThirdPartyResource` +is a non-namespaced resource; attempting to place it in a namespace will return +an error. + +Each `ThirdPartyResource` resource has the following: + * Standard Kubernetes object metadata. + * ResourceKind - The kind of the resources described by this third party +resource. + * Description - A free text description of the resource. + * APIGroup - An API group that this resource should be placed into. + * Versions - One or more `Version` objects. 
+
+### The `Version` Object
+
+The `Version` object describes a single concrete version of a custom resource.
+The `Version` object currently only specifies:
+ * The `Name` of the version.
+ * The `APIGroup` this version should belong to.
+
+## Expectations about third party objects
+
+Every object that is added to a third-party Kubernetes object store is expected
+to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata).
+This requirement enables the Kubernetes API server to provide the following
+features:
+ * Filtering lists of objects via label queries.
+ * `resourceVersion`-based optimistic concurrency via compare-and-swap.
+ * Versioned storage.
+ * Event recording.
+ * Integration with basic `kubectl` command line tooling.
+ * Watch for resource changes.
+
+The `Kind` for an instance of a third-party object (e.g. CronTab) below is
+expected to be programmatically convertible to the name of the resource using
+the following conversion. Kinds are expected to be of the form
+`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be
+`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll
+use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`.
+
+For example: `mygroup.example.com/v1`.
+
+`CamelCaseKind` is the specific type name.
+
+To convert this into the `metadata.name` for the `ThirdPartyResource` resource
+instance, the `<api-group>` is copied verbatim, and the `CamelCaseKind` is
+converted to lowercase with a '-' inserted before each capital letter
+('camel-case'); the first character is assumed to be capitalized, so no leading
+'-' is added. In pseudo code:
+
+```go
+var result []byte
+for ix := range kindName {
+	if isCapital(kindName[ix]) && ix > 0 {
+		result = append(result, '-')
+	}
+	result = append(result, toLowerCase(kindName[ix]))
+}
+```
+
+As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines
+resources of Kind `CamelCaseKind`, in the APIGroup with the prefix
+`mygroup.example.com/...`.
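A runnable version of the conversion can be sketched as follows (the helper name is illustrative, and ASCII kind names are assumed):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// resourceNameFor converts a CamelCaseKind plus its API group into the
// metadata.name of the corresponding ThirdPartyResource instance.
func resourceNameFor(kind, apiGroup string) string {
	var b strings.Builder
	for i, r := range kind {
		if unicode.IsUpper(r) {
			if i > 0 { // no leading '-' before the first capital
				b.WriteByte('-')
			}
			b.WriteRune(unicode.ToLower(r))
		} else {
			b.WriteRune(r)
		}
	}
	return b.String() + "." + apiGroup
}

func main() {
	fmt.Println(resourceNameFor("CronTab", "mygroup.example.com"))
	// cron-tab.mygroup.example.com
	fmt.Println(resourceNameFor("CamelCaseKind", "mygroup.example.com"))
	// camel-case-kind.mygroup.example.com
}
```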
+
+The reason for this is to enable rapid lookup of a `ThirdPartyResource` object
+given the kind information. This is also the reason why `ThirdPartyResource` is
+not namespaced.
+
+## Usage
+
+When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts
+by creating a new, namespaced RESTful resource path. For now, non-namespaced
+objects are not supported. As with existing built-in objects, deleting a
+namespace deletes all third party resources in that namespace.
+
+For example, if a user creates:
+
+```yaml
+metadata:
+  name: cron-tab.mygroup.example.com
+apiVersion: extensions/v1beta1
+kind: ThirdPartyResource
+description: "A specification of a Pod to run on a cron style schedule"
+versions:
+- name: v1
+- name: v2
+```
+
+Then the API server will register the new RESTful resource path:
+ * `/apis/mygroup.example.com/v1/namespaces/<namespace>/crontabs/...`
+
+**Note: Registration of the new RESTful resource path may take a while; always
+check that it exists before creating resource instances.**
+
+Now that this schema has been created, a user can `POST`:
+
+```json
+{
+  "metadata": {
+    "name": "my-new-cron-object"
+  },
+  "apiVersion": "mygroup.example.com/v1",
+  "kind": "CronTab",
+  "cronSpec": "* * * * /5",
+  "image": "my-awesome-cron-image"
+}
+```
+
+to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs`
+
+and the corresponding data will be stored into etcd by the API server, so that
+when the user issues:
+
+```
+GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object
+```
+
+they will get back the same data, but with additional Kubernetes metadata
+(e.g. `resourceVersion`, `creationTimestamp`) filled in.
+ +Likewise, to list all resources, a user can issue: + +``` +GET /apis/mygroup.example.com/v1/namespaces/default/crontabs +``` + +and get back: + +```json +{ + "apiVersion": "mygroup.example.com/v1", + "kind": "CronTabList", + "items": [ + { + "metadata": { + "name": "my-new-cron-object" + }, + "apiVersion": "mygroup.example.com/v1", + "kind": "CronTab", + "cronSpec": "* * * * /5", + "image": "my-awesome-cron-image" + } + ] +} +``` + +Because all objects are expected to contain standard Kubernetes metadata fields, +these list operations can also use label queries to filter requests down to +specific subsets. + +Likewise, clients can use watch endpoints to watch for changes to stored +objects. + +## Storage + +In order to store custom user data in a versioned fashion inside of etcd, we +need to also introduce a `Codec`-compatible object for persistent storage in +etcd. This object is `ThirdPartyResourceData` and it contains: + * Standard API Metadata. + * `Data`: The raw JSON data for this custom object. 
+
+### Storage key specification
+
+Each custom object stored by the API server needs a custom key in storage,
+described below:
+
+#### Definitions
+
+ * `resource-namespace`: the namespace of the particular resource that is
+being stored
+ * `resource-name`: the name of the particular resource being stored
+ * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource`
+resource that represents the type for the specific instance being stored
+ * `third-party-resource-name`: the name of the `ThirdPartyResource` resource
+that represents the type for the specific instance being stored
+
+#### Key
+
+Given the definitions above, the key for a specific third-party object is:
+
+```
+${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
+```
+
+Thus, listing a third-party resource can be achieved by listing the directory:
+
+```
+${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
+```
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]()
+
diff --git a/design/federated-replicasets.md b/design/federated-replicasets.md
new file mode 100644
index 00000000..f1744ade
--- /dev/null
+++ b/design/federated-replicasets.md
@@ -0,0 +1,513 @@
+# Federated ReplicaSets
+
+# Requirements & Design Document
+
+This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion.
+
+Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
+Based on discussions with
+Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)
+
+## Overview
+
+### Summary & Vision
+
+When running a global application on a federation of Kubernetes
+clusters, the owner currently has to start it in multiple clusters and
+verify both that enough application replicas are running locally in
+each of the clusters (so that, for example, users are handled by a
+nearby cluster, with low latency) and globally (so that there is
+always enough capacity to handle all traffic). If one of the clusters
+has issues, or lacks the capacity to run the given set of replicas,
+the replicas should be automatically moved to some other cluster to
+keep the application responsive.
+
+In single-cluster Kubernetes there is a concept of a ReplicaSet that
+manages the replicas locally. We want to expand this concept to the
+federation level.
+
+### Goals
+
++ Win large enterprise customers who want to easily run applications
+  across multiple clusters
++ Create a reference controller implementation to facilitate bringing
+  other Kubernetes concepts to Federated Kubernetes.
+
+## Glossary
+
+Federation Cluster - a cluster that is a member of the federation.
+
+Local ReplicaSet (LRS) - a ReplicaSet defined and running on a cluster
+that is a member of the federation.
+
+Federated ReplicaSet (FRS) - a ReplicaSet defined and running inside the Federated K8S server.
+
+Federated ReplicaSet Controller (FRSC) - a controller running inside
+the Federated K8S server that controls FRSs.
+
+## User Experience
+
+### Critical User Journeys
+
++ [CUJ1] User wants to create a ReplicaSet in each of the federation
+  clusters. They create a definition of a federated ReplicaSet on the
+  federated master and (local) ReplicaSets are automatically created
+  in each of the federation clusters.
  The number of replicas in each
+  of the Local ReplicaSets is (perhaps indirectly) configurable by
+  the user.
++ [CUJ2] When the current number of replicas in a cluster drops below
+  the desired number and new replicas cannot be scheduled, then they
+  should be started in some other cluster.
+
+### Features Enabling Critical User Journeys
+
+Feature #1 -> CUJ1:
+A component which looks for newly created Federated ReplicaSets and
+creates the appropriate Local ReplicaSet definitions in the federated
+clusters.
+
+Feature #2 -> CUJ2:
+A component that checks how many replicas are actually running in each
+of the subclusters and whether that number matches the
+FederatedReplicaSet preferences (by default, spread replicas evenly
+across the clusters, but custom preferences are allowed - see
+below). If it doesn’t, and the situation is unlikely to improve soon,
+then the replicas should be moved to other subclusters.
+
+### API and CLI
+
+All interaction with a FederatedReplicaSet will be done by issuing
+kubectl commands pointing at the Federated Master API Server. All the
+commands would behave in a similar way as on the regular master;
+however, in later versions (1.5+) some of the commands may give
+slightly different output. For example, kubectl describe on a federated
+replica set should also give some information about the subclusters.
+
+Moreover, for safety, some defaults will be different. For example, for
+kubectl delete federatedreplicaset, cascade will be set to false.
+
+FederatedReplicaSet would use the same object as a local ReplicaSet
+(although it will be accessible in a different part of the
+API). Scheduling preferences (how many replicas in which cluster) will
+be passed as annotations.
+
+### FederatedReplicaSet preferences
+
+The preferences are expressed by the following structure, passed as
+serialized JSON inside annotations.
+
+```
+type FederatedReplicaSetPreferences struct {
+	// If set to true then already scheduled and running replicas may be moved to other clusters
+	// in order to bring cluster replica sets towards a desired state. Otherwise, if set to false,
+	// up and running replicas will not be moved.
+	Rebalance bool `json:"rebalance,omitempty"`
+
+	// Map from cluster name to preferences for that cluster. It is assumed that if a cluster
+	// doesn’t have a matching entry then it should not have local replicas. A cluster matches
+	// the "*" entry if there is no entry with its real cluster name.
+	Clusters map[string]ClusterReplicaSetPreferences
+}
+
+// Preferences regarding the number of replicas assigned to a cluster replica set within a federated replica set.
+type ClusterReplicaSetPreferences struct {
+	// Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
+	MinReplicas int64 `json:"minReplicas,omitempty"`
+
+	// Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
+	MaxReplicas *int64 `json:"maxReplicas,omitempty"`
+
+	// A number expressing the preference to put an additional replica to this Local ReplicaSet. 0 by default.
+	Weight int64
+}
+```
+
+How this works in practice:
+
+**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ Weight: 1}
+	}
+}
+```
+
+Example:
+
++ Clusters A,B,C, all have capacity.
+  Replica layout: A=16 B=17 C=17.
++ Clusters A,B,C, where C has capacity for only 6 replicas.
+  Replica layout: A=22 B=22 C=6
++ Clusters A,B,C. B and C are offline:
+  Replica layout: A=50
+
+**Scenario 2**. I want to have only 2 replicas in each of the clusters.
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1}
+	}
+}
+```
+
+Or
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 }
+	}
+}
+```
+
+Or
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2}
+	}
+}
+```
+
+There is a global target of 50; however, if there are 3 clusters, only 6 replicas will be running.
+
+**Scenario 3**. I want to have 20 replicas in each of 3 clusters.
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0}
+	}
+}
+```
+
+There is a global target of 50; however, the clusters' minimums require 60, so some clusters will have fewer replicas.
+  Replica layout: A=20 B=20 C=10.
+
+**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C, but don’t put more than 20 replicas in cluster C.
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : true
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ Weight: 1}
+		"C" : LocalReplicaSet{ MaxReplicas: 20, Weight: 1}
+	}
+}
+```
+
+Example:
+
++ All have capacity.
+  Replica layout: A=16 B=17 C=17.
++ B is offline/has no capacity:
+  Replica layout: A=30 B=0 C=20
++ A and B are offline:
+  Replica layout: C=20
+
+**Scenario 5**. I want to run my application in cluster A; however, if there is trouble, FRS can also use clusters B and C, equally.
+
+```
+FederatedReplicaSetPreferences {
+	Clusters : map[string]LocalReplicaSet {
+		"A" : LocalReplicaSet{ Weight: 1000000}
+		"B" : LocalReplicaSet{ Weight: 1}
+		"C" : LocalReplicaSet{ Weight: 1}
+	}
+}
+```
+
+Example:
+
++ All have capacity.
+  Replica layout: A=50 B=0 C=0.
++ A has capacity for only 40 replicas:
+  Replica layout: A=40 B=5 C=5
+
+**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.
+
+```
+FederatedReplicaSetPreferences {
+	Clusters : map[string]LocalReplicaSet {
+		"A" : LocalReplicaSet{ Weight: 2}
+		"B" : LocalReplicaSet{ Weight: 1}
+		"C" : LocalReplicaSet{ Weight: 1}
+	}
+}
+```
+
+**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
+are already some replicas, please do not move them. Config:
+
+```
+FederatedReplicaSetPreferences {
+	Rebalance : false
+	Clusters : map[string]LocalReplicaSet {
+		"*" : LocalReplicaSet{ Weight: 1}
+	}
+}
+```
+
+Example:
+
++ Clusters A,B,C all have capacity, but A already has 20 replicas:
+  Replica layout: A=20 B=15 C=15.
++ Clusters A,B,C, where C has capacity for 6 replicas and A already has 20 replicas:
+  Replica layout: A=22 B=22 C=6
++ Clusters A,B,C, where C has capacity for 6 replicas and A already has 30 replicas:
+  Replica layout: A=30 B=14 C=6
+
+## The Idea
+
+A new federated controller - the Federated Replica Set Controller (FRSC) -
+will be created inside the federated controller manager. The key
+elements of the idea are enumerated below:
+
++ [I0] It is considered OK to have a slightly higher number of replicas
+  globally for some time.
+
++ [I1] FRSC starts an informer on FederatedReplicaSets that listens
+  for FRSs being created, updated or deleted. On each create/update the
+  scheduling code will be started to calculate where to put the
+  replicas. The default behavior is to start the same number of
+  replicas in each of the clusters. While creating Local ReplicaSets
+  (LRSs) the following errors/issues can occur:
+
+  + [E1] The master rejects LRS creation (for a known or unknown
+    reason). In this case another attempt to create the LRS should be
+    made in 1m or so. This action can be tied with
+    [[I5]](#heading=h.ififs95k9rng).
Until the LRS is created,
+    the situation is the same as [E5]. If this happens multiple
+    times, all due replicas should be moved elsewhere and later moved
+    back once the LRS is created.
+
+  + [E2] An LRS with the same name but a different configuration already
+    exists. The LRS is then overwritten and an appropriate event
+    created to explain what happened. Pods under the control of the
+    old LRS are left intact and the new LRS may adopt them if they
+    match the selector.
+
+  + [E3] The LRS is new but pods that match the selector exist. The
+    pods are adopted by the RS (if not owned by some other
+    RS). However, they may have a different image, configuration,
+    etc., just like with a regular LRS.
+
++ [I2] For each of the clusters, FRSC starts a store and an informer on
+  LRSs that will listen for status updates. These status changes are
+  only interesting in case of trouble. Otherwise it is assumed that
+  the LRS runs trouble-free and there is always the right number of
+  pods created, though possibly not scheduled.
+
+  + [E4] An LRS is manually deleted from the local cluster. In this case
+    a new LRS should be created. It is the same case as
+    [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind
+    won’t be killed and will be adopted after the LRS is recreated.
+
+  + [E5] The LRS fails to create (not necessarily schedule) the desired
+    number of pods due to master troubles, admission control,
+    etc. This should be considered the same situation as replicas
+    being unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)).
+
+  + [E6] It is impossible to tell that an informer lost its connection
+    with a remote cluster or has other synchronization problems, so it
+    should be handled by the cluster liveness probe and deletion
+    [[I6]](#heading=h.z90979gc2216).
+
++ [I3] For each of the clusters, start a store and informer to monitor
+  whether the created pods are eventually scheduled and what the
+  current number of correctly running, ready pods is.
Errors:
+
+  + [E7] It is impossible to tell that an informer lost its connection
+    with a remote cluster or has other synchronization problems, so it
+    should be handled by the cluster liveness probe and deletion
+    [[I6]](#heading=h.z90979gc2216)
+
++ [I4] It is assumed that an unscheduled pod is a normal situation
+  and can last up to X min if there is huge traffic on the
+  cluster. However, if the replicas are not scheduled in that time,
+  then FRSC should consider moving most of the unscheduled replicas
+  elsewhere. For that purpose FRSC will maintain a data structure
+  where for each FRS-controlled LRS we store a list of pods belonging
+  to that LRS along with their current status and status change timestamp.
+
++ [I5] If a new cluster is added to the federation then it doesn’t
+  have an LRS and the situation is equal to
+  [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef).
+
++ [I6] If a cluster is removed from the federation then the situation
+  is equal to multiple [E4]. It is assumed that if the connection with
+  a cluster is lost completely then the cluster is removed from the
+  cluster list (or marked accordingly), so
+  [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda)
+  don’t need to be handled.
+
++ [I7] All ToBeChecked FRSs are browsed every 1 min (configurable),
+  checked against the current list of clusters, and all missing LRSs
+  are created. This is executed in combination with [I8].
+
++ [I8] All pods from ToBeChecked FRSs/LRSs are browsed every 1 min
+  (configurable) to check whether some replica move between clusters
+  is needed or not.
+
++ FRSC never moves replicas to an LRS that has unscheduled or
+  non-running pods, or that has pods that failed to be created.
+
+  + When FRSC notices that a number of pods are not scheduled/running,
+    or not even created, in one LRS for more than Y minutes, it takes
+    most of them from the LRS, leaving a couple still waiting, so that
+    once they are scheduled FRSC will know that it is OK to put some
+    more replicas in that cluster.
+
++ [I9] An FRS becomes ToBeChecked if:
+  + It is newly created
+  + Some replica set inside changed its status
+  + Some pods inside a cluster changed their status
+  + Some cluster is added or deleted.
+> An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
+
+## (Re)scheduling algorithm
+
+To calculate the (re)scheduling moves for a given FRS:
+
+1. For each cluster, FRSC calculates the number of replicas that are placed
+(not necessarily up and running) in the cluster and the number of replicas that
+failed to be scheduled. Cluster capacity is the difference between
+the placed replicas and those that failed to be scheduled.
+
+2. Order all clusters by their weight and a hash of their name, so that every
+time we process the same replica set we process the clusters in the same order.
+Include the federated replica set name in the cluster name hash so that we get
+slightly different orderings for different RSs, so that not all RSs of size 1
+end up on the same cluster.
+
+3. Assign the minimum preferred number of replicas to each of the clusters, if
+there are enough replicas and capacity.
+
+4. If rebalance = false, assign the previously present replicas to the clusters
+and remember the number of extra replicas added (ER), again only if there
+are enough replicas and capacity.
+
+5. Distribute the remaining replicas with regard to weights and cluster capacity.
+In multiple iterations, calculate how many of the replicas should end up in each cluster.
+For each cluster, cap the number of assigned replicas by the max number of replicas and
+the cluster capacity. If there were extra replicas added to the cluster in step
+4, don't actually add the replicas but balance them against the ER from step 4.
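The steps above can be sketched in Go. This is a deliberately simplified illustration, not the proposed implementation: the cluster order from step 2 is assumed to be precomputed, and rebalance, capacity feedback, and the ER bookkeeping from step 4 are omitted.

```go
package main

import "fmt"

// clusterPref holds simplified per-cluster scheduling preferences.
type clusterPref struct {
	min    int
	max    int // -1 means unbounded
	weight int
}

// distribute assigns total replicas across clusters in the given order:
// minimums first (step 3), then weighted rounds capped by max (step 5).
func distribute(total int, order []string, prefs map[string]clusterPref) map[string]int {
	out := make(map[string]int)
	remaining := total

	// Step 3: satisfy each cluster's minimum first.
	for _, c := range order {
		n := prefs[c].min
		if n > remaining {
			n = remaining
		}
		out[c] = n
		remaining -= n
	}

	// Step 5: hand out the rest in weighted rounds, capped by max.
	for remaining > 0 {
		progress := false
		for _, c := range order {
			p := prefs[c]
			give := p.weight
			if give > remaining {
				give = remaining
			}
			if p.max >= 0 && out[c]+give > p.max {
				give = p.max - out[c]
			}
			if give > 0 {
				out[c] += give
				remaining -= give
				progress = true
			}
			if remaining == 0 {
				break
			}
		}
		if !progress {
			break // every cluster is capped; the remainder stays unassigned
		}
	}
	return out
}

func main() {
	even := map[string]clusterPref{
		"A": {min: 0, max: -1, weight: 1},
		"B": {min: 0, max: -1, weight: 1},
		"C": {min: 0, max: -1, weight: 1},
	}
	fmt.Println(distribute(50, []string{"A", "B", "C"}, even)) // near-even split summing to 50

	capped := map[string]clusterPref{
		"A": {min: 0, max: 2, weight: 1},
		"B": {min: 0, max: 2, weight: 1},
		"C": {min: 0, max: 2, weight: 1},
	}
	fmt.Println(distribute(50, []string{"A", "B", "C"}, capped)) // 2 replicas each, as in Scenario 2
}
```

With equal weights the first call approximates Scenario 1, and with `max: 2` the second call reproduces Scenario 2's outcome of only 6 replicas running.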
+
+## Goroutines layout
+
++ [GR1] Involved in the FRS informer (see
+  [[I1]]). Whenever an FRS is created or
+  updated, it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with
+  delay 0.
+
++ [GR2_1...GR2_N] Involved in informers/stores on LRSs (see
+  [[I2]]). On all changes the FRS is put on
+  FRS_TO_CHECK_QUEUE with delay 1min.
+
++ [GR3_1...GR3_N] Involved in informers/stores on Pods
+  (see [[I3]] and [[I4]]). They maintain the status store
+  so that for each of the LRSs we know the number of pods that are
+  actually running and ready in O(1) time. They also put the
+  corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min.
+
++ [GR4] Involved in the cluster informer (see
+  [[I5]] and [[I6]]). It puts all FRSs on FRS_TO_CHECK_QUEUE
+  with delay 0.
+
++ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put an FRS on
+  FRS_CHANNEL after the given delay (and remove it from
+  FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
+  FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
+  shorter delay is used.
+
++ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever
+  an FRS is received it is put on a work queue. The work queue has no
+  delay and makes sure that a single replica set is processed by only
+  one goroutine.
+
++ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on
+  the FRS. Multiple replica sets can be processed in parallel, but two
+  goroutines cannot process the same FRS at the same time.
+
+
+## Func DoFrsCheck
+
+The function does [[I7]] and [[I8]]. It is assumed to run on a
+single thread/goroutine, so we never check and evaluate the same FRS on
+many goroutines at once (however, if needed, the function can be
+parallelized for different FRSs). It takes data only from the stores
+maintained by GR2_* and GR3_*. External communication is only required to:
+
++ Create an LRS. If an LRS doesn’t exist it is created after the
+  rescheduling, when we know how many replicas it should have.
+
++ Update LRS replica targets.
+
+If the FRS is not in the desired state, it is put on
+FRS_TO_CHECK_QUEUE with a delay of 1min (possibly increasing).
+
+## Monitoring and status reporting
+
+FRSC should expose a number of metrics from the run, like:
+
++ FRSC -> LRS communication latency
++ Total time spent in various elements of DoFrsCheck
+
+FRSC should also expose the status of an FRS as an annotation on the
+FRS and as events.
+
+## Workflow
+
+Here is the sequence of tasks that need to be done in order for a
+typical FRS to be split into a number of LRSs and to be created in
+the underlying federated clusters.
+
+Note a: the reason the workflow is helpful at this phase is that for
+every one or two steps we can create corresponding PRs to start the
+development.
+
+Note b: we assume that the federation is already in place and the
+federated clusters are added to the federation.
+
+Step 1. The client sends an RS create request to the
+federation-apiserver.
+
+Step 2. The federation-apiserver persists an FRS into the federation etcd.
+
+Note c: the federation-apiserver populates the clusterid field in the FRS
+before persisting it into the federation etcd.
+
+Step 3. The federation-level “informer” in FRSC watches the federation
+etcd for new/modified FRSs, with an empty clusterid or a clusterid equal
+to the federation ID, and if one is detected, it calls the scheduling code.
+
+Step 4.
+
+Note d: the scheduler populates the clusterid field in the LRS with the
+IDs of the target clusters.
+
+Note e: at this point let us assume that it only does the even
+distribution, i.e., equal weights for all of the underlying clusters.
+
+Step 5. As soon as the scheduler function returns control to FRSC,
+the FRSC starts a number of cluster-level “informer”s, one per
+target cluster, to watch changes in every target cluster etcd
+regarding the posted LRSs, and if any violation of the scheduled
+number of replicas is detected, the scheduling code is re-called for
+re-scheduling purposes.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]()
+
diff --git a/design/federated-services.md b/design/federated-services.md
new file mode 100644
index 00000000..b9d51c43
--- /dev/null
+++ b/design/federated-services.md
@@ -0,0 +1,517 @@
+# Kubernetes Cluster Federation (previously nicknamed "Ubernetes")
+
+## Cross-cluster Load Balancing and Service Discovery
+
+### Requirements and System Design
+
+### by Quinton Hoole, Dec 3 2015
+
+## Requirements
+
+### Discovery, Load-balancing and Failover
+
+1. **Internal discovery and connection**: Pods/containers (running in
+   a Kubernetes cluster) must be able to easily discover and connect
+   to endpoints for Kubernetes services on which they depend in a
+   consistent way, irrespective of whether those services exist in a
+   different Kubernetes cluster within the same cluster federation.
+   Henceforth these are referred to as "cluster-internal clients", or
+   simply "internal clients".
+1. **External discovery and connection**: External clients (running
+   outside a Kubernetes cluster) must be able to discover and connect
+   to endpoints for Kubernetes services on which they depend.
+   1. **External clients predominantly speak HTTP(S)**: External
+      clients are most often, but not always, web browsers, or at
+      least speak HTTP(S) - notable exceptions include Enterprise
+      Message Buses (Java, TLS), DNS servers (UDP),
+      SIP servers and databases.
+1. **Find the "best" endpoint:** Upon initial discovery and
+   connection, both internal and external clients should ideally find
+   "the best" endpoint if multiple eligible endpoints exist. "Best"
+   in this context implies the closest (by network topology) endpoint
+   that is both operational (as defined by some positive health check)
+   and not overloaded (by some published load metric). For example:
+   1. 
An internal client should find an endpoint which is local to its + own cluster if one exists, in preference to one in a remote + cluster (if both are operational and non-overloaded). + Similarly, one in a nearby cluster (e.g. in the same zone or + region) is preferable to one further afield. + 1. An external client (e.g. in New York City) should find an + endpoint in a nearby cluster (e.g. U.S. East Coast) in + preference to one further away (e.g. Japan). +1. **Easy fail-over:** If the endpoint to which a client is connected + becomes unavailable (no network response/disconnected) or + overloaded, the client should reconnect to a better endpoint, + somehow. + 1. In the case where there exist one or more connection-terminating + load balancers between the client and the serving Pod, failover + might be completely automatic (i.e. the client's end of the + connection remains intact, and the client is completely + oblivious of the fail-over). This approach incurs network speed + and cost penalties (by traversing possibly multiple load + balancers), but requires zero smarts in clients, DNS libraries, + recursing DNS servers etc, as the IP address of the endpoint + remains constant over time. + 1. In a scenario where clients need to choose between multiple load + balancer endpoints (e.g. one per cluster), multiple DNS A + records associated with a single DNS name enable even relatively + dumb clients to try the next IP address in the list of returned + A records (without even necessarily re-issuing a DNS resolution + request). For example, all major web browsers will try all A + records in sequence until a working one is found (TBD: justify + this claim with details for Chrome, IE, Safari, Firefox). + 1. 
In a slightly more sophisticated scenario, upon disconnection, a + smarter client might re-issue a DNS resolution query, and + (modulo DNS record TTL's which can typically be set as low as 3 + minutes, and buggy DNS resolvers, caches and libraries which + have been known to completely ignore TTL's), receive updated A + records specifying a new set of IP addresses to which to + connect. + +### Portability + +A Kubernetes application configuration (e.g. for a Pod, Replication +Controller, Service etc) should be able to be successfully deployed +into any Kubernetes Cluster or Federation of Clusters, +without modification. More specifically, a typical configuration +should work correctly (although possibly not optimally) across any of +the following environments: + +1. A single Kubernetes Cluster on one cloud provider (e.g. Google + Compute Engine, GCE). +1. A single Kubernetes Cluster on a different cloud provider + (e.g. Amazon Web Services, AWS). +1. A single Kubernetes Cluster on a non-cloud, on-premise data center +1. A Federation of Kubernetes Clusters all on the same cloud provider + (e.g. GCE). +1. A Federation of Kubernetes Clusters across multiple different cloud + providers and/or on-premise data centers (e.g. one cluster on + GCE/GKE, one on AWS, and one on-premise). + +### Trading Portability for Optimization + +It should be possible to explicitly opt out of portability across some +subset of the above environments in order to take advantage of +non-portable load balancing and DNS features of one or more +environments. More specifically, for example: + +1. For HTTP(S) applications running on GCE-only Federations, + [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + should be usable. These provide single, static global IP addresses + which load balance and fail over globally (i.e. across both regions + and zones). 
These allow for really dumb clients, but they only + work on GCE, and only for HTTP(S) traffic. +1. For non-HTTP(S) applications running on GCE-only Federations within + a single region, + [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + should be usable. These provide TCP (i.e. both HTTP/S and + non-HTTP/S) load balancing and failover, but only on GCE, and only + within a single region. + [Google Cloud DNS](https://cloud.google.com/dns) can be used to + route traffic between regions (and between different cloud + providers and on-premise clusters, as it's plain DNS, IP only). +1. For applications running on AWS-only Federations, + [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) + should be usable. These provide both L7 (HTTP(S)) and L4 load + balancing, but only within a single region, and only on AWS + ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be + used to load balance and fail over across multiple regions, and is + also capable of resolving to non-AWS endpoints). + +## Component Cloud Services + +Cross-cluster Federated load balancing is built on top of the following: + +1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) + provide single, static global IP addresses which load balance and + fail over globally (i.e. across both regions and zones). These + allow for really dumb clients, but they only work on GCE, and only + for HTTP(S) traffic. +1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) + provide both HTTP(S) and non-HTTP(S) load balancing and failover, + but only on GCE, and only within a single region. +1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) + provide both L7 (HTTP(S)) and L4 load balancing, but only within a + single region, and only on AWS. +1. 
[Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
   doesn't provide any built-in geo-DNS, latency-based routing, health
   checking, weighted round robin or other advanced capabilities.
   It's plain old DNS. We would need to build all the aforementioned
   on top of it. It can provide internal DNS services (i.e. serve RFC
   1918 addresses).
   1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
      be used to load balance and fail over across regions, and is also
      capable of routing to non-AWS endpoints. It provides built-in
      geo-DNS, latency-based routing, health checking, weighted
      round robin and optional tight integration with some other
      AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
   and a
   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
   service IP which is load-balanced (currently simple round-robin)
   across the healthy pods comprising a service within a single
   Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html):
A generic wrapper around cloud-provided L4 and L7 load balancing services, and
roll-your-own load balancers run in pods, e.g. HA Proxy.

## Cluster Federation API

The Cluster Federation API for load balancing should be compatible with the equivalent
Kubernetes API, to ease porting of clients between Kubernetes and
federations of Kubernetes clusters.
Further details below.

## Common Client Behavior

To be useful, our load balancing solution needs to work properly with real
client applications.
There are a few different classes of those...

### Browsers

These are the most common external clients. These are all well-written. See below.

### Well-written clients

1. Do a DNS resolution every time they connect.
1. Don't cache beyond the TTL (although a small percentage of the DNS
   servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.

Examples:

+ all common browsers (except for SRV records)
+ ...

### Dumb clients

1. Don't do a DNS resolution every time they connect (or do cache beyond the
TTL).
1. Do try multiple A records.

Examples:

+ ...

### Dumber clients

1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.

Examples:

+ ...

### Dumbest clients

1. Never do a DNS lookup - are pre-configured with a single (or possibly
multiple) fixed server IP(s). Nothing else matters.

## Architecture and Implementation

### General Control Plane Architecture

Each cluster hosts one or more Cluster Federation master components (Federation API
servers, controller managers with leader election, and etcd quorum members). This
is documented in more detail in a separate design doc:
[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).

In the description below, assume that 'n' clusters, named 'cluster-1'...
'cluster-n', have been registered against a Cluster Federation "federation-1",
each with their own set of Kubernetes API endpoints, so:
"[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n) .

### Federated Services

Federated Services are pretty straightforward.
They're comprised of multiple
equivalent underlying Kubernetes Services, each with their own external
endpoint, and a load balancing mechanism across them. Let's work through how
exactly that works in practice.

Our user creates the following Federated Service (against a Federation
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

The Cluster Federation control system in turn creates one equivalent service (identical config to the above)
in each of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of which is
allocated its own `spec.clusterIP` and `status.loadBalancer.ingress.ip`.
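The fan-out of one federated service into identical per-cluster services can be sketched as a simple reconciliation loop (illustrative Python only; `cluster_clients`, `get_service` and `create_service` are hypothetical stand-ins for per-cluster API clients, not a real Kubernetes client library):

```python
def reconcile_federated_service(service_spec, cluster_clients):
    """Ensure each underlying cluster has a Service equivalent to the
    federated one. Returns the names of clusters where a Service had to
    be created on this pass."""
    created = []
    name = service_spec["metadata"]["name"]
    namespace = service_spec["metadata"]["namespace"]
    for cluster, client in sorted(cluster_clients.items()):
        if client.get_service(namespace, name) is None:
            # Identical config in every cluster; each cluster then allocates
            # its own clusterIP and external load balancer ingress IP.
            client.create_service(namespace, service_spec)
            created.append(cluster)
    return created
```

Repeated invocations are idempotent: once every cluster has the service, the loop creates nothing, which is the behavior a watch-driven controller would rely on.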
+ +In the Cluster Federation `federation-1`, the resulting federated service looks as follows: + + $ kubectl get -o yaml --context="federation-1" service my-service + + apiVersion: v1 + kind: Service + metadata: + creationTimestamp: 2015-11-25T23:35:23Z + labels: + run: my-service + name: my-service + namespace: my-namespace + resourceVersion: "157333" + selfLink: /api/v1/namespaces/my-namespace/services/my-service + uid: 33bfc927-93cd-11e5-a38c-42010af00007 + spec: + clusterIP: + ports: + - name: client + nodePort: 31333 + port: 2379 + protocol: TCP + targetPort: 2379 + - name: peer + nodePort: 31086 + port: 2380 + protocol: TCP + targetPort: 2380 + selector: + run: my-service + sessionAffinity: None + type: LoadBalancer + status: + loadBalancer: + ingress: + - hostname: my-service.my-namespace.my-federation.my-domain.com + +Note that the federated service: + +1. Is API-compatible with a vanilla Kubernetes service. +1. has no clusterIP (as it is cluster-independent) +1. has a federation-wide load balancer hostname + +In addition to the set of underlying Kubernetes services (one per cluster) +described above, the Cluster Federation control system has also created a DNS name (e.g. on +[Google Cloud DNS](https://cloud.google.com/dns) or +[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) +which provides load balancing across all of those services. For example, in a +very basic configuration: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +Each of the above IP addresses (which are just the external load balancer +ingress IP's of each cluster service) is of course load balanced across the pods +comprising the service in each cluster. + +In a more sophisticated configuration (e.g. 
on GCE or GKE), the Cluster +Federation control system +automatically creates a +[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) +which exposes a single, globally load-balanced IP: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 + +Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) +in each Kubernetes cluster to preferentially return the local +clusterIP for the service in that cluster, with other clusters' +external service IP's (or a global load-balanced IP) also configured +for failover purposes: + + $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com + my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 + my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 + +If Cluster Federation Global Service Health Checking is enabled, multiple service health +checkers running across the federated clusters collaborate to monitor the health +of the service endpoints, and automatically remove unhealthy endpoints from the +DNS record (e.g. a majority quorum is required to vote a service endpoint +unhealthy, to avoid false positives due to individual health checker network +isolation). + +### Federated Replication Controllers + +So far we have a federated service defined, with a resolvable load balancer +hostname by which clients can reach it, but no pods serving traffic directed +there. So now we need a Federated Replication Controller. These are also fairly +straight-forward, being comprised of multiple underlying Kubernetes Replication +Controllers which do the hard work of keeping the desired number of Pod replicas +alive in each Kubernetes cluster. 

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

The Cluster Federation control system in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml rc my-service --context="cluster-1"

    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will of course
depend on what scheduling policy is in force. In the above example, the
scheduler created an equal number of replicas (2) in each of the three
underlying clusters, to make up the total of 6 replicas required. To handle
entire cluster failures, various approaches are possible, including:

1. **simple overprovisioning**, such that sufficient replicas remain even if a
   cluster fails.
This wastes some resources, but is simple and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster. This saves resources and is relatively
   simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Cluster Federation
   control system detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster. This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.

### Implementation Details

The implementation approach and architecture is very similar to Kubernetes, so
if you're familiar with how Kubernetes works, none of what follows will be
surprising. One additional design driver not present in Kubernetes is that
the Cluster Federation control system aims to be resilient to individual cluster and availability zone
failures. So the control plane spans multiple clusters. More specifically:

+ Cluster Federation runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Cluster Federation runs its own distinct quorum-based metadata store (etcd,
  by default). Approximately 1 quorum member runs in each underlying
  cluster ("approximately" because we aim for an odd number of quorum
  members, and typically don't want more than 5 quorum members, even
  if we have a larger number of federated clusters, so 2 clusters->3
  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
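The quorum sizing rule in the last bullet can be written down as a small function (a sketch matching the examples in the text; sizing for a single-cluster federation is not specified here, so it is rejected):

```python
def etcd_quorum_members(num_clusters):
    """Number of etcd quorum members for a federation of the given size:
    an odd count, at least 3, and at most 5, per the examples in the text
    (2->3, 3->3, 4->3, 5->5, 6->5, 7->5)."""
    if num_clusters < 2:
        raise ValueError("sizing for single-cluster federations is not specified")
    # Largest odd number not exceeding the cluster count...
    odd = num_clusters if num_clusters % 2 == 1 else num_clusters - 1
    # ...clamped to the [3, 5] range.
    return max(3, min(5, odd))
```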

Cluster Controllers in the Federation control system watch the
Federation API server/etcd state, and apply changes to the underlying
kubernetes clusters accordingly. They also implement an anti-entropy
mechanism that reconciles the Cluster Federation "desired desired"
state against the kubernetes "actual desired" state.



[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()

diff --git a/design/federation-phase-1.md b/design/federation-phase-1.md
new file mode 100644
index 00000000..0a3a8f50
--- /dev/null
+++ b/design/federation-phase-1.md
@@ -0,0 +1,407 @@
# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate the K8S federation work; these use cases drive
the design and also serve to validate it. We then summarize the
functionality requirements from these use cases, and define the “in
scope” functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and walk through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster.
But if the capacity of the cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads on different cloud providers, and can easily increase or
  decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
a limited size. While the community is actively improving it, it can
be expected that cluster size will be a problem if K8S is used for
large workloads or public PaaS infrastructure. While we can separate
different tenants into different clusters, it would be good to have a
unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and deregister
clusters.
+ Workloads should be spread to different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It’s difficult to produce in a single pass a perfect design that implements
all of the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the whole work.
In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the ubernetes API, and from the ubernetes control
  plane to the underlying kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown below:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the kubernetes key-value stores (etcd) in the underlying
kubernetes clusters. We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters.
For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller for one or more underlying clusters,
and posts them back to the `etcd` storage.

One subtlety worth noting here is that the scheduling decision is arrived at by
combining the application-specific request from the user (which might
include, for example, placement constraints) and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").

## Ubernetes Cluster Controller

The cluster controller
performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S clusters, and updates them as the object status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on the
control plane, and creates corresponding K8S services on each involved K8S
cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design.
For each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata, like the version of the cluster

$version.clusterSpec

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
| --- | --- | --- | --- | --- |
| Phase | the most recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of the cluster | yes | any | |
| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just as we do for nodes in K8S. In phase one it
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In the future we will improve
this with more efficient mechanisms, such as leveraging Heapster, and more
metrics will be supported. Similar to node phases in K8S, the “phase”
field includes the following values:

+ pending: newly registered clusters or clusters suspended by an admin
  for various reasons. They are not eligible for accepting workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters that are temporarily down or not reachable.
+ terminated: clusters removed from the federation.

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as a
replication controller in the Cluster Federation control plane. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use
case this may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).
In phase one we only introduce a
_clusterSelector_ field to filter acceptable clusters. By default
there is no such selector, which means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed on these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
split workloads, the desired ratio of workloads spread across different
clusters, etc.

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example it may define a
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Federated Services

The Service API object exposed by the Cluster Federation is similar to service
objects in Kubernetes. It defines the access to a group of pods. The
federation service controller will create corresponding Kubernetes
service objects on the underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Cluster Federation API compatible with the Kubernetes API.
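The phase-one even-distribution rule (“workloads will be evenly distributed on these acceptable clusters”) can be sketched as follows (illustrative Python only; how a non-divisible remainder is assigned is an assumption here, not specified by the design):

```python
def split_replicas_evenly(total_replicas, clusters):
    """Split a replication controller's replica count evenly across the
    acceptable clusters (phase-one policy). As an assumption, any
    remainder goes to the first clusters in sorted-name order."""
    clusters = sorted(clusters)
    base, extra = divmod(total_replicas, len(clusters))
    return {c: base + (1 if i < extra else 0)
            for i, c in enumerate(clusters)}
```

For example, 6 replicas over three clusters yields 2 per cluster, matching the federated replication controller example in the Federated Services design; 5 replicas over two clusters necessarily leaves one cluster with an extra replica.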

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Cluster Federation control
plane:

1. A replication controller is created by the client.
1. The APIServer persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller is watching the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
API Server to create the RC. In phase one we don’t support complex
distribution policies. The scheduling rule is basically:
   1. If an RC does not specify any nodeSelector, it will be scheduled
      to the least loaded K8S cluster(s) that has enough available
      resources.
   1. If an RC specifies _N_ acceptable clusters in the
      clusterSelector, all replicas will be evenly distributed among
      these clusters.

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently it still accepts workload
requests from other K8S clients or even another Cluster Federation control
plane. The Cluster Federation scheduling decision is based on this data of
available resources. However, when the actual RC creation happens in
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with
proposed solutions like resource reservation mechanisms.

![Federated Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part has been included in the section “Federated Service” of the
document
“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
+Please refer to that document for details.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()
+
diff --git a/design/ha_master.md b/design/ha_master.md
new file mode 100644
index 00000000..d4cf26a9
--- /dev/null
+++ b/design/ha_master.md
@@ -0,0 +1,236 @@
+# Automated HA master deployment
+
+**Authors:** filipg@, jsz@
+
+# Introduction
+
+We want to allow users to easily replicate kubernetes masters to have a highly available cluster,
+initially using `kube-up.sh` and `kube-down.sh`.
+
+This document describes the technical design of this feature. It assumes that we are using the aforementioned
+scripts for cluster deployment. All of the ideas described in the following sections should be easy
+to implement on GCE, AWS and other cloud providers.
+
+It is a non-goal to design a specific setup for a bare-metal environment, which
+might be very different.
+
+# Overview
+
+In a cluster with a replicated master, we will have N VMs, each running regular master components
+such as apiserver, etcd, scheduler or controller manager. These components will interact in the
+following way:
+* All etcd replicas will be clustered together and will use master election
+  and a quorum mechanism to agree on the state. All of these mechanisms are integral
+  parts of etcd and we will only have to configure them properly.
+* All apiserver replicas will work independently, talking to an etcd on
+  127.0.0.1 (i.e. the local etcd replica), which if needed will forward requests to the current etcd master
+  (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
+* We will introduce provider-specific solutions to load balance traffic between master replicas
+  (see the `load balancing` section).
+* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
+  only a single instance will be the active master. All others will wait in standby mode. 
+* All add-on managers will work independently and each of them will try to keep add-ons in sync.
+
+# Detailed design
+
+## Components
+
+### etcd
+
+```
+Note: This design for etcd clustering is quite pet-set like - each etcd
+replica has its name which is explicitly used in etcd configuration etc. In
+the medium-term future we would like to have the ability to run masters as part of
+an autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas
+automatically. This is pretty tricky and this design does not cover it.
+It will be covered in a separate doc.
+```
+
+All etcd instances will be clustered together and one of them will be an elected master.
+In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
+configured in such a way that all writes and reads go through the master (requests
+will be forwarded by the local etcd server such that it’s invisible to the user). This will
+affect latency for all operations, but it should not increase by much more than the network
+latency between master replicas (latency between GCE zones within a region is < 10ms).
+
+Currently etcd exposes its port only on the localhost interface. In order to allow clustering
+and inter-VM communication we will also have to use the public interface. To secure the
+communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).
+
+When generating the command line for etcd we will always assume it’s part of a cluster
+(initially of size 1) and list all existing kubernetes master replicas.
+Based on that, we will set the following flags:
+* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
+* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
+  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
+  * `existing` if there is more than one replica, i.e. 
the list of existing master replicas is non-empty.
+
+This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
+with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
+`INITIAL_ETCD_CLUSTER`.
+
+### apiservers
+
+All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
+the etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
+etcd leader. This functionality is completely hidden from the client (the apiserver
+in our case).
+
+The caching mechanism implemented in the apiserver will not be affected by
+replicating the master because:
+* GET requests go directly to etcd
+* LIST requests go either directly to etcd or to a cache populated via watch
+  (depending on the ResourceVersion in ListOptions). In the second scenario,
+  after a PUT/POST request, changes might not be visible in the LIST response.
+  This is however not worse than it is with the current single master.
+* WATCH does not give any guarantees on when a change will be delivered.
+
+#### load balancing
+
+With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
+providers have different capabilities and limitations, we will not try to find a common lowest
+denominator that will work everywhere. Instead we will document various options and apply different
+solutions for different deployments. Below we list possible approaches:
+
+1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
+automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
+or Google Cloud DNS (GCP). For load balancing we will have two options:
+    1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
+    1.2. use a round-robin DNS technique to access all apiservers directly
+2. 
`Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
+will be manually managed by the user. We will provide detailed documentation for the entries we
+expect.
+3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
+external IP address that is later assigned to the master VM. When creating additional replicas we
+will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
+instead of a single master. When removing the second to last replica we will reverse this operation (assign the
+IP address to the remaining master VM and delete the load balancer). That way the user will not have to provide
+a domain name and all client configurations will keep working.
+
+This will also impact `kubelet <-> master` communication, which should use load
+balancing as well. Depending on the method chosen, we will configure the
+kubelet accordingly.
+
+#### `kubernetes` service
+
+Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
+list of IP addresses for all apiservers. As it uses a command line flag
+`--apiserver-count`, it is not very dynamic and would require restarting all
+masters to change the number of master replicas.
+
+To allow dynamic changes to the number of apiservers in the cluster, we will
+introduce a `ConfigMap` in the `kube-system` namespace that will keep an expiration
+time for each apiserver (keyed by IP). Each apiserver will do three things:
+
+1. periodically update the expiration time for its own IP address
+2. remove all stale IP addresses from the endpoints list
+3. add its own IP address if it's not on the list yet.
+
+That way we will not only solve the problem of a dynamically changing number
+of apiservers in the cluster, but also the problem of non-responsive apiservers
+that should be removed from the `kubernetes` service endpoints list.
+
+#### Certificates
+
+Certificate generation will work as today. 
In particular, on GCE, we will
+generate it for the public IP used to access the cluster (see the `load balancing`
+section) and the local IP of the master replica VM.
+
+That means that with multiple master replicas and a load balancer in front
+of them, accessing one of the replicas directly (using its ephemeral public
+IP) will not work on GCE without appropriate flags:
+
+- `kubectl --insecure-skip-tls-verify=true`
+- `curl --insecure`
+- `wget --no-check-certificate`
+
+For other deployment tools and providers the details of certificate generation
+may be different, but it must be possible to access the cluster by using either
+the main cluster endpoint (DNS name or IP address) or the internal service called
+`kubernetes` that points directly to the apiservers.
+
+### controller manager, scheduler & cluster autoscaler
+
+The controller manager and scheduler will by default use a lease mechanism to choose an active instance
+among all masters. Only one instance will be performing any operations.
+All others will wait in standby mode.
+
+We will use the same configuration in non-replicated mode to simplify deployment scripts.
+
+### add-on manager
+
+All add-on managers will work independently. Each of them will observe the current state of
+add-ons and will try to sync it with the files on disk. As a result, due to races, a single add-on
+can be updated multiple times in a row after upgrading the master. Long-term we should fix this
+by using a similar mechanism to the controller manager or scheduler. However, currently the add-on
+manager is just a bash script and adding a master election mechanism would not be easy.
+
+## Adding replica
+
+Command to add a new replica on GCE using the kube-up script:
+
+```
+KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
+```
+
+The pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:
+
+```
+1. If there is no load balancer for this cluster:
+  1. 
Create load balancer using ephemeral IP address
+  2. Add existing apiserver to the load balancer
+  3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
+  4. Update DNS to point to the load balancer.
+2. Clone existing master (create a new VM with the same configuration) including
+   all env variables (certificates, IP ranges etc), with the exception of
+   `INITIAL_ETCD_CLUSTER`.
+3. SSH to an existing master and run the following command to extend the etcd cluster
+   with the new instance:
+   `curl :4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://:2380"]}'`
+4. Add IP address of the new apiserver to the load balancer.
+```
+
+A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
+is identical to the one when using DNS, with a different step to set up the load balancer:
+
+```
+1. If there is no load balancer for this cluster:
+  1. Unassign IP from the existing master replica
+  2. Create load balancer using static IP reclaimed in the previous step
+  3. Add existing apiserver to the load balancer
+  4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
+...
+```
+
+## Deleting replica
+
+Command to delete one replica on GCE using the kube-down script:
+
+```
+KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
+```
+
+The pseudo-code for deleting an existing master replica is the following:
+
+```
+1. Remove replica IP address from the load balancer or DNS configuration
+2. SSH to one of the remaining masters and run the following command to remove the replica from the cluster:
+   `curl etcd-0:4001/v2/members/ -XDELETE -L`
+3. Delete replica VM
+4. If load balancer has only a single target instance, then delete load balancer
+5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM. 
+```
+
+## Upgrades
+
+Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
+(e.g. upgrade.sh for GCE). This will work out of the box because:
+* Requests from nodes will be correctly served by either a new or an old master because the apiserver is backward compatible.
+* Requests from the scheduler (and controllers) go to a local apiserver via the localhost interface, so both components
+will be in the same version.
+* The apiserver talks only to a local etcd replica, which will be in a compatible version.
+* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/ha_master.md?pixel)]()
+
diff --git a/design/horizontal-pod-autoscaler.md b/design/horizontal-pod-autoscaler.md
new file mode 100644
index 00000000..1ac9c24b
--- /dev/null
+++ b/design/horizontal-pod-autoscaler.md
@@ -0,0 +1,263 @@
+

Warning! This document might be outdated.

+ +# Horizontal Pod Autoscaling + +## Preface + +This document briefly describes the design of the horizontal autoscaler for +pods. The autoscaler (implemented as a Kubernetes API resource and controller) +is responsible for dynamically controlling the number of replicas of some +collection (e.g. the pods of a ReplicationController) to meet some objective(s), +for example a target per-pod CPU utilization. + +This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md). + +## Overview + +The resource usage of a serving application usually varies over time: sometimes +the demand for the application rises, and sometimes it drops. In Kubernetes +version 1.0, a user can only manually set the number of serving pods. Our aim is +to provide a mechanism for the automatic adjustment of the number of pods based +on CPU utilization statistics (a future version will allow autoscaling based on +other resources/metrics). + +## Scale Subresource + +In Kubernetes version 1.1, we are introducing Scale subresource and implementing +horizontal autoscaling of pods based on it. Scale subresource is supported for +replication controllers and deployments. Scale subresource is a Virtual Resource +(does not correspond to an object stored in etcd). It is only present in the API +as an interface that a controller (in this case the HorizontalPodAutoscaler) can +use to dynamically scale the number of replicas controlled by some other API +object (currently ReplicationController and Deployment) and to learn the current +number of replicas. Scale is a subresource of the API object that it serves as +the interface for. The Scale subresource is useful because whenever we introduce +another type we want to autoscale, we just need to implement the Scale +subresource for it. The wider discussion regarding Scale took place in issue +[#1629](https://github.com/kubernetes/kubernetes/issues/1629). 
+ +Scale subresource is in API for replication controller or deployment under the +following paths: + +`apis/extensions/v1beta1/replicationcontrollers/myrc/scale` + +`apis/extensions/v1beta1/deployments/mydeployment/scale` + +It has the following structure: + +```go +// represents a scaling request for a resource. +type Scale struct { + unversioned.TypeMeta + api.ObjectMeta + + // defines the behavior of the scale. + Spec ScaleSpec + + // current status of the scale. + Status ScaleStatus +} + +// describes the attributes of a scale subresource +type ScaleSpec struct { + // desired number of instances for the scaled object. + Replicas int `json:"replicas,omitempty"` +} + +// represents the current status of a scale subresource. +type ScaleStatus struct { + // actual number of observed instances of the scaled object. + Replicas int `json:"replicas"` + + // label query over pods that should match the replicas count. + Selector map[string]string `json:"selector,omitempty"` +} +``` + +Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment +associated with the given Scale subresource. `ScaleStatus.Replicas` reports how +many pods are currently running in the replication controller/deployment, and +`ScaleStatus.Selector` returns selector for the pods. + +## HorizontalPodAutoscaler Object + +In Kubernetes version 1.1, we are introducing HorizontalPodAutoscaler object. It +is accessible under: + +`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler` + +It has the following structure: + +```go +// configuration of a horizontal pod autoscaler. +type HorizontalPodAutoscaler struct { + unversioned.TypeMeta + api.ObjectMeta + + // behavior of autoscaler. + Spec HorizontalPodAutoscalerSpec + + // current information about the autoscaler. + Status HorizontalPodAutoscalerStatus +} + +// specification of a horizontal pod autoscaler. 
+type HorizontalPodAutoscalerSpec struct {
+    // reference to Scale subresource; horizontal pod autoscaler will learn the current resource
+    // consumption from its status, and will set the desired number of pods by modifying its spec.
+    ScaleRef SubresourceReference
+    // lower limit for the number of pods that can be set by the autoscaler, default 1.
+    MinReplicas *int
+    // upper limit for the number of pods that can be set by the autoscaler.
+    // It cannot be smaller than MinReplicas.
+    MaxReplicas int
+    // target average CPU utilization (represented as a percentage of requested CPU) over all the pods;
+    // if not specified it defaults to the target CPU utilization at 80% of the requested resources.
+    CPUUtilization *CPUTargetUtilization
+}
+
+type CPUTargetUtilization struct {
+    // fraction of the requested CPU that should be utilized/used,
+    // e.g. 70 means that 70% of the requested CPU should be in use.
+    TargetPercentage int
+}
+
+// current status of a horizontal pod autoscaler
+type HorizontalPodAutoscalerStatus struct {
+    // most recent generation observed by this autoscaler.
+    ObservedGeneration *int64
+
+    // last time the HorizontalPodAutoscaler scaled the number of pods;
+    // used by the autoscaler to control how often the number of pods is changed.
+    LastScaleTime *unversioned.Time
+
+    // current number of replicas of pods managed by this autoscaler.
+    CurrentReplicas int
+
+    // desired number of replicas of pods managed by this autoscaler.
+    DesiredReplicas int
+
+    // current average CPU utilization over all pods, represented as a percentage of requested CPU,
+    // e.g. 70 means that an average pod is now using 70% of its requested CPU.
+    CurrentCPUUtilizationPercentage *int
+}
+```
+
+`ScaleRef` is a reference to the Scale subresource.
+`MinReplicas`, `MaxReplicas` and `CPUUtilization` define the autoscaler
+configuration. 
We are also introducing the HorizontalPodAutoscalerList object to
+enable listing all autoscalers in a namespace:
+
+```go
+// list of horizontal pod autoscaler objects.
+type HorizontalPodAutoscalerList struct {
+    unversioned.TypeMeta
+    unversioned.ListMeta
+
+    // list of horizontal pod autoscaler objects.
+    Items []HorizontalPodAutoscaler
+}
+```
+
+## Autoscaling Algorithm
+
+The autoscaler is implemented as a control loop. It periodically queries the pods
+described by `Status.Selector` of the Scale subresource, and collects their CPU
+utilization. Then, it compares the arithmetic mean of the pods' CPU utilization
+with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of
+the Scale if needed to match the target (preserving the condition: MinReplicas <=
+Replicas <= MaxReplicas).
+
+The period of the autoscaler is controlled by the
+`--horizontal-pod-autoscaler-sync-period` flag of the controller manager. The
+default value is 30 seconds.
+
+
+CPU utilization is the recent CPU usage of a pod (averaged over the last 1
+minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU
+usage is taken directly from Heapster. In the future, there will be an API on the master
+for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).
+
+The target number of pods is calculated from the following formula:
+
+```
+TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
+```
+
+Starting and stopping pods may introduce noise to the metric (for instance,
+starting may temporarily increase CPU). So, after each action, the autoscaler
+should wait some time for reliable data. Scale-up can only happen if there was
+no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from
+the last rescaling. Moreover, any scaling will only be made if
+`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1
+(10% tolerance). 
This approach has two benefits:
+
+* The autoscaler works in a conservative way. If new user load appears, it is
+important for us to rapidly increase the number of pods, so that user requests
+will not be rejected. Lowering the number of pods is not that urgent.
+
+* The autoscaler avoids thrashing, i.e. it prevents the rapid execution of conflicting
+decisions if the load is not stable.
+
+## Relative vs. absolute metrics
+
+We chose the values of the target metric to be relative (e.g. 90% of requested CPU
+resource) rather than absolute (e.g. 0.6 core) for the following reason. If we
+chose an absolute metric, the user would need to guarantee that the target is lower
+than the request. Otherwise, overloaded pods may not be able to consume more
+than the autoscaler's absolute target utilization, thereby preventing the
+autoscaler from seeing high enough utilization to trigger it to scale up. This
+may be especially troublesome when the user changes the requested resources for a pod,
+because they would then also need to change the autoscaler utilization threshold.
+Therefore, we decided to choose a relative metric. For the user, it is enough to set
+it to a value smaller than 100%, and further changes of requested resources will
+not invalidate it.
+
+## Support in kubectl
+
+To make manipulation of the HorizontalPodAutoscaler object simpler, we added support
+for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In
+addition, in the future, we are planning to add kubectl support for the following
+use-cases:
+* When creating a replication controller or deployment with
+`kubectl create [-f]`, there should be a possibility to specify an additional
+autoscaler object. (This should work out-of-the-box when creation of an autoscaler
+is supported by kubectl, as we may include multiple objects in the same config
+file.)
+* *[future]* When running an image with `kubectl run`, there should be an
+additional option to create an autoscaler for it. 
+* *[future]* We will add a new command `kubectl autoscale` that will allow for
+easy creation of an autoscaler object for an already existing replication
+controller/deployment.
+
+## Next steps
+
+We list here some features that are not supported in Kubernetes version 1.1.
+However, we want to keep them in mind, as they will most probably be needed in
+the future.
+Our design is in general compatible with them.
+* *[future]* **Autoscale pods based on metrics different than CPU** (e.g.
+memory, network traffic, qps). This includes scaling based on a custom/application metric.
+* *[future]* **Autoscale pods based on an aggregate metric.** The autoscaler,
+instead of computing the average for a target metric across pods, will use a single,
+external metric (e.g. a qps metric from a load balancer). The metric will be
+aggregated while the target will remain per-pod (e.g. when observing 100 qps on
+the load balancer while the target is 20 qps per pod, the autoscaler will set the number
+of replicas to 5).
+* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers
+of pods for different metrics are different, choose the largest target number of
+pods.
+* *[future]* **Scale the number of pods starting from 0.** All pods can be
+turned off, and then turned on when there is a demand for them. When a request
+to a service with no pods arrives, kube-proxy will generate an event for the
+autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
+* *[future]* **When scaling down, make a more educated decision about which pods to
+kill.** E.g.: if two or more pods from the same replication controller are on
+the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301). 
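The arithmetic of the Autoscaling Algorithm section (sum-over-target formula, 10% tolerance band, and min/max clamping) can be condensed into a short sketch. The function name and shape are illustrative, not the controller's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the autoscaler arithmetic: sum the per-pod CPU
// utilizations, divide by the target, round up, skip rescaling inside the
// 10% tolerance band, and clamp to [minReplicas, maxReplicas].
// Illustrative only; not the real controller code.
func desiredReplicas(current, minReplicas, maxReplicas int, podUtilization []float64, target float64) int {
	sum := 0.0
	for _, u := range podUtilization {
		sum += u
	}
	avg := sum / float64(len(podUtilization))
	// Within the 10% tolerance band, keep the current replica count.
	if ratio := avg / target; ratio > 0.9 && ratio < 1.1 {
		return current
	}
	// TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
	desired := int(math.Ceil(sum / target))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// 3 pods averaging 90% utilization against a 50% target -> scale up.
	fmt.Println(desiredReplicas(3, 1, 10, []float64{90, 95, 85}, 50)) // prints 6
}
```

Note that this sketch omits the 3-minute/5-minute cool-down windows, which gate when the computation runs rather than what it computes.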
+
+
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]()
+
diff --git a/design/identifiers.md b/design/identifiers.md
new file mode 100644
index 00000000..a37411f9
--- /dev/null
+++ b/design/identifiers.md
@@ -0,0 +1,113 @@
+# Identifiers and Names in Kubernetes
+
+A summary of the goals and recommendations for identifiers in Kubernetes.
+Described in GitHub issue [#199](http://issue.k8s.io/199).
+
+
+## Definitions
+
+`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time
+and space; intended to distinguish between historical occurrences of similar
+entities.
+
+`Name`: A non-empty string guaranteed to be unique within a given scope at a
+particular time; used in resource URLs; provided by clients at creation time and
+encouraged to be human friendly; intended to facilitate creation idempotence and
+space-uniqueness of singleton objects, distinguish distinct entities, and
+reference particular entities across operations.
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL):
+An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters,
+with the '-' character allowed anywhere except the first or last character,
+suitable for use as a hostname or segment in a domain name.
+
+[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN):
+One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum
+length of 253 characters.
+
+[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID):
+A 128-bit generated value that is extremely unlikely to collide across time and
+space and requires no central coordination. 
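The DNS_LABEL and DNS_SUBDOMAIN formats above can be restated as executable checks. This is only an illustration of the definitions; it is not Kubernetes' actual validation code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dnsLabelRE restates the DNS_LABEL rule: lowercase alphanumerics with '-'
// allowed anywhere except the first or last character.
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel checks the DNS_LABEL definition (max 63 characters).
func isDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRE.MatchString(s)
}

// isDNSSubdomain checks the DNS_SUBDOMAIN definition: one or more labels
// separated by '.', max 253 characters overall.
func isDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, part := range strings.Split(s, ".") {
		if !isDNSLabel(part) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isDNSLabel("backend-x4eb1"), isDNSSubdomain("guestbook.user")) // prints true true
}
```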
+
+[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME):
+An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters,
+with the '-' character allowed anywhere except the first or the last character
+or adjacent to another '-' character; it must contain at least one (a-z)
+character.
+
+## Objectives for names and UIDs
+
+1. Uniquely identify (via a UID) an object across space and time.
+2. Uniquely name (via a name) an object across space.
+3. Provide human-friendly names in API operations and/or configuration files.
+4. Allow idempotent creation of API resources (#148) and enforcement of
+space-uniqueness of singleton objects.
+5. Allow DNS names to be automatically generated for some objects.
+
+
+## General design
+
+1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must
+be specified. Name must be non-empty and unique within the apiserver. This
+enables idempotent and space-unique creation operations. Parts of the system
+(e.g. replication controller) may join strings (e.g. a base name and a random
+suffix) to create a unique Name. For situations where generating a name is
+impractical, some or all objects may support a param to auto-generate a name.
+Generating random names will defeat idempotency.
+  * Examples: "guestbook.user", "backend-x4eb1"
+2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN?
+format TBD via #1114) may be specified. Depending on the API receiver,
+namespaces might be validated (e.g. apiserver might ensure that the namespace
+actually exists). If a namespace is not specified, one will be assigned by the
+API receiver. This assignment policy might vary across API receivers (e.g.
+apiserver might have a default, kubelet might generate something semi-random).
+  * Example: "api.k8s.example.com"
+3. Upon acceptance of an object via an API, the object is assigned a UID
+(a UUID). UID must be non-empty and unique across space and time. 
+  * Example: "01234567-89ab-cdef-0123-456789abcdef"
+
+## Case study: Scheduling a pod
+
+Pods can be placed onto a particular node in a number of ways. This case study
+demonstrates how the above design can be applied to satisfy the objectives.
+
+### A pod scheduled by a user through the apiserver
+
+1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
+2. The apiserver validates the input.
+   1. A default Namespace is assigned.
+   2. The pod name must be space-unique within the Namespace.
+   3. Each container within the pod has a name which must be space-unique within
+the pod.
+3. The pod is accepted.
+   1. A new UID is assigned.
+4. The pod is bound to a node.
+   1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
+5. Kubelet validates the input.
+6. Kubelet runs the pod.
+   1. Each container is started up with enough metadata to distinguish the pod
+from whence it came.
+   2. Each attempt to run a container is assigned a UID (a string) that is
+unique across time.
+      1. This may correspond to Docker's container ID.
+
+### A pod placed by a config file on the node
+
+1. A config file is stored on the node, containing a pod with UID="",
+Namespace="", and Name="cadvisor".
+2. Kubelet validates the input.
+   1. Since UID is not provided, kubelet generates one.
+   2. Since Namespace is not provided, kubelet generates one.
+      1. The generated namespace should be deterministic and cluster-unique for
+the source, such as a hash of the hostname and file path.
+         * E.g. Namespace="file-f4231812554558a718a01ca942782d81"
+3. Kubelet runs the pod.
+   1. Each container is started up with enough metadata to distinguish the pod
+from whence it came.
+   2. Each attempt to run a container is assigned a UID (a string) that is
+unique across time.
+      1. This may correspond to Docker's container ID. 
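The deterministic namespace generation in step 2.2 above could, for example, look like the sketch below. The `file-` prefix matches the example value, but the choice of MD5 and the exact input format are assumptions; the design only requires the result to be deterministic and cluster-unique for the source:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// fileSourceNamespace sketches deterministic namespace generation for a
// config-file pod source: hashing the hostname and file path means the same
// source always yields the same namespace. MD5 and the "file-" prefix are
// illustrative assumptions, not mandated by the design.
func fileSourceNamespace(hostname, path string) string {
	sum := md5.Sum([]byte(hostname + ":" + path))
	return fmt.Sprintf("file-%x", sum)
}

func main() {
	// The same hostname and path always produce the same namespace.
	fmt.Println(fileSourceNamespace("node-1", "/etc/kubernetes/manifests/cadvisor.yaml"))
}
```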
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]()
+
diff --git a/design/indexed-job.md b/design/indexed-job.md
new file mode 100644
index 00000000..5a089c22
--- /dev/null
+++ b/design/indexed-job.md
@@ -0,0 +1,900 @@
+# Design: Indexed Feature of Job object
+
+
+## Summary
+
+This design extends kubernetes with user-friendly support for
+running embarrassingly parallel jobs.
+
+Here, *parallel* means on multiple nodes, which means multiple pods.
+By *embarrassingly parallel*, it is meant that the pods
+have no dependencies between each other. In particular, neither
+ordering between pods nor gang scheduling are supported.
+
+Users already have two other options for running embarrassingly parallel
+Jobs (described in the next section), but both have ease-of-use issues.
+
+Therefore, this document proposes extending the Job resource type to support
+a third way to run embarrassingly parallel programs, with a focus on
+ease of use.
+
+This new style of Job is called an *indexed job*, because each Pod of the Job
+is specialized to work on a particular *index* from a fixed length array of work
+items.
+
+## Background
+
+The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
+the embarrassingly parallel use case through *workqueue jobs*.
+While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
+flexible, they can be difficult to use. They: (1) typically require running a
+message queue or other database service, (2) typically require modifications
+to existing binaries and images, and (3) are prone to subtle race conditions
+that are easy to overlook.
+
+Users also have another option for parallel jobs: creating [multiple Job objects
+from a template](docs/design/indexed-job.md#job-patterns). For small numbers of
+Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
+objects at once. 
But, that approach also has its drawbacks: (1) for large levels
+of parallelism (hundreds or thousands of pods) this approach means that listing
+all jobs presents too much information, (2) users want a single source of
+information about the success or failure of what the user views as a single
+logical process.
+
+Indexed job provides a third option with better ease-of-use for common
+use cases.
+
+## Requirements
+
+### User Requirements
+
+- Users want an easy way to run a Pod to completion *for each* item within a
+[work list](#example-use-cases).
+
+- Users want to run these pods in parallel for speed, but to vary the level of
+parallelism as needed, independent of the number of work items.
+
+- Users want to do this without requiring changes to existing images,
+or source-to-image pipelines.
+
+- Users want a single object that encompasses the lifetime of the parallel
+program. Deleting it should delete all dependent objects. It should report the
+status of the overall process. Users should be able to wait for it to complete,
+and can refer to it from other resource types, such as
+[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).
+
+
+### Example Use Cases
+
+Here are several examples of *work lists*: lists of command lines that the user
+wants to run, each line its own Pod. (Note that in practice, a work list may not
+ever be written out in this form, but it exists in the mind of the Job creator,
+and it is a useful way to talk about the intent of the user when discussing
+alternatives for specifying Indexed Jobs.)
+
+Note that we will not have the user express their requirements in work list
+form; it is just a format for presenting use cases. Subsequent discussion will
+reference these work lists. 
+
+#### Work List 1
+
+Process several files with the same program:
+
+```
+/usr/local/bin/process_file 12342.dat
+/usr/local/bin/process_file 97283.dat
+/usr/local/bin/process_file 38732.dat
+```
+
+#### Work List 2
+
+Process a matrix (or image, etc) in rectangular blocks:
+
+```
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+#### Work List 3
+
+Build a program at several different git commits:
+
+```
+HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH
+HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH
+HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH
+```
+
+#### Work List 4
+
+Render several frames of a movie:
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
+```
+
+#### Work List 5
+
+Render several blocks of frames (render blocks to avoid Pod startup overhead for
+every frame):
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
+./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
+```
+
+## Design Discussion
+
+### Converting Work Lists into Indexed Jobs.
+
+Given a work list like those in the [example use cases](#example-use-cases),
+the information from the work list needs to get into each Pod of the Job.
+
+Users will typically not want to create a new image for each job they
+run. They will want to use existing images. So, the image is not the place
+for the work list.
+
+A work list can be stored on networked storage, and mounted by pods of the job.
+Also, as a shortcut, for small work lists, it can be included in an annotation on
+the Job object, which is then exposed as a volume in the pod via the downward
+API.
+
+### What Varies Between Pods of a Job
+
+Pods need to differ in some way to do something different. (They do not differ
+in the work-queue style of Job, but that style has ease-of-use issues.)
+
+A general approach would be to allow pods to differ from each other in arbitrary
+ways. For example, the Job object could have a list of PodSpecs to run.
+However, this is so general that it provides little value. It would:
+
+- make the Job Spec very verbose, especially for jobs with thousands of work
+items
+- make Job such a vague concept that it is hard to explain to users
+- not match practice: we do not see cases where many pods differ across many
+fields of their specs and need to run as a group with no ordering constraints
+- require CLIs and UIs to support more options for creating a Job
+- complicate aggregation: it is useful for monitoring and accounting databases
+to aggregate data for pods with the same controller, but pods with very
+different Specs may not make sense to aggregate
+- mean that profiling, debugging, accounting, auditing and monitoring tools
+cannot assume common images/files, behaviors, provenance and so on between Pods
+of a Job
+
+Also, variety has another cost. Pods which differ in ways that affect scheduling
+(node constraints, resource requirements, labels) prevent the scheduler from
+treating them as fungible, which is an important optimization for the scheduler.
+
+Therefore, we will not allow Pods from the same Job to differ arbitrarily
+(anyway, users can use multiple Job objects for that case). We will try to
+allow as little as possible to differ between pods of the same Job, while still
+allowing users to express common parallel patterns easily.
Users who need to
+run jobs which differ in other ways can create multiple Jobs, and manage
+them as a group using labels.
+
+From the above work lists, we see a need for Pods which differ in their command
+lines, and in their environment variables. These work lists do not require the
+pods to differ in other ways.
+
+Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
+has shown this model to be applicable to a very broad range of problems, despite
+this restriction.
+
+Therefore we will allow pods in the same Job to differ **only** in the following
+aspects:
+- command line
+- environment variables
+
+### Composition of existing images
+
+The docker image that is used in a job may not be maintained by the person
+running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
+If we require people to specify the complete command line to use Indexed Job,
+then they will not automatically pick up changes in the default
+command or args.
+
+This needs more thought.
+
+### Running Ad-Hoc Jobs using kubectl
+
+A user should be able to easily start an Indexed Job using `kubectl`. For
+example, to run [work list 1](#work-list-1), a user should be able to type
+something simple like:
+
+```
+kubectl run process-files --image=myfileprocessor \
+    --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+    --restart=OnFailure \
+    -- \
+    /usr/local/bin/process_file '$F'
+```
+
+In the above example:
+
+- `--restart=OnFailure` implies creating a job instead of a replicationController.
+- Each pod's command line is `/usr/local/bin/process_file $F`.
+- `--per-completion-env=` implies that the job's `.spec.completions` is set to the
+length of the argument array (3 in the example).
+- `--per-completion-env=F=` causes an env var named `F` to be available in
+the environment when the command line is evaluated.
+
+How exactly this happens is discussed later in the doc: this is a sketch of the
+user experience.
+
+In practice, the list of files might be much longer and stored in a file on the
+user's local host, like:
+
+```
+$ cat files-to-process.txt
+12342.dat
+97283.dat
+38732.dat
+...
+```
+
+So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
+
+However, `kubectl` should also support a format like
+`--per-completion-env=F=@files-to-process.txt`.
+That allows `kubectl` to parse the file, point out any syntax errors, and would
+not run up against command line length limits (2MB is common; as low as 4kB is
+POSIX compliant).
+
+One case we do not try to handle is where the file of work is stored on a cloud
+filesystem, and not accessible from the user's local host. Then we cannot easily
+use indexed job, because we do not know the number of completions. The user
+needs to copy the file locally first or use the Work-Queue style of Job (already
+supported).
+
+Another case we do not try to handle is where the input file does not exist yet
+because this Job is to be run at a future time, or depends on another job. The
+workflow and scheduled job proposals need to consider this case. For that case,
+you could use an indexed job which runs a program which shards the input file
+(map-reduce-style).
+
+#### Multiple parameters
+
+The user may also have multiple parameters, like in [work list 2](#work-list-2).
+
+One way is to just list all the command lines, already expanded, one per line in
+a file, like this:
+
+```
+$ cat matrix-commandlines.txt
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
+/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
+/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
+```
+
+and run the Job like this:
+
+```
+kubectl run process-matrix --image=my/matrix \
+    --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
+    --restart=OnFailure \
+    -- \
+    'eval "$COMMAND_LINE"'
+```
+
+However, this may have some subtleties with shell escaping. Also, it depends on
+the user knowing all the correct arguments to the docker image being used (more
+on this later).
+
+Instead, kubectl should support multiple instances of the `--per-completion-env`
+flag. For example, to implement work list 2, a user could do:
+
+```
+kubectl run process-matrix --image=my/matrix \
+    --per-completion-env=SR="0 16 0 16" \
+    --per-completion-env=ER="15 31 15 31" \
+    --per-completion-env=SC="0 0 16 16" \
+    --per-completion-env=EC="15 15 31 31" \
+    --restart=OnFailure \
+    -- \
+    /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC
+```
+
+### Composition With Workflows and ScheduledJob
+
+A user should be able to create a job (Indexed or not) which runs at one or more
+specific times. For example:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+    --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+    --restart=OnFailure \
+    --runAt=2015-07-21T14:00:00Z \
+    -- \
+    /usr/local/bin/process_file '$F'
+created "scheduledJob/process-files-37dt3"
+```
+
+Kubectl should build the same JobSpec, and then put it into a ScheduledJob
+(#11980) and create that.
+
+For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
+complete workflow from a single command line would be messy, because of the need
+to specify all the arguments multiple times.
+
+For that use case, the user could create a workflow message by hand. Or the user
+could create job templates, and then make a workflow from the templates,
+perhaps like this:
+
+```
+$ kubectl run process-files --image=myfileprocessor \
+    --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
+    --restart=OnFailure \
+    --asTemplate \
+    -- \
+    /usr/local/bin/process_file '$F'
+created "jobTemplate/process-files"
+$ kubectl run merge-files --image=mymerger \
+    --restart=OnFailure \
+    --asTemplate \
+    -- \
+    /usr/local/bin/mergefiles 12342.out 97283.out 38732.out
+created "jobTemplate/merge-files"
+$ kubectl create-workflow process-and-merge \
+    --job=jobTemplate/process-files \
+    --job=jobTemplate/merge-files \
+    --dependency=process-files:merge-files
+created "workflow/process-and-merge"
+```
+
+### Completion Indexes
+
+A JobSpec specifies the number of times a pod needs to complete successfully,
+through the `job.Spec.Completions` field. The number of completions will be
+equal to the number of work items in the work list.
+
+Each pod that the job controller creates is intended to complete one work item
+from the work list. Since a pod may fail, several pods may, serially, attempt to
+complete the same index. Therefore, we call it a *completion index* (or just
+*index*), but not a *pod index*.
+
+For each completion index, in the range 0 to `.spec.completions - 1`, the job
+controller will create a pod with that index, and keep creating one on failure,
+until each index is completed.
+
+A dense integer index, rather than a sparse string index (e.g. using just
+`metadata.generate-name`), makes it easy to use the index to look up parameters
+in, for example, an array in shared storage.
+
+### Pod Identity and Template Substitution in Job Controller
+
+The JobSpec contains a single pod template. When the job controller creates a
+particular pod, it copies the pod template and modifies it in some way to make
+that pod distinctive. Whatever is distinctive about that pod is its *identity*.
+
+We consider several options.
+
+#### Index Substitution Only
+
+The job controller substitutes only the *completion index* of the pod into the
+pod template when creating it. The JSON it POSTs differs only in a single
+field.
+
+We would put the completion index, as a stringified integer, into an annotation
+of the pod. The user can extract it from the annotation into an env var via the
+downward API, or put it in a file via a Downward API volume, and parse it
+themselves.
+
+Once it is an environment variable in the pod (say `$INDEX`), then one of two
+things can happen.
+
+First, the main program can know how to map from an integer index to what it
+needs to do. For example, from Work List 4 above:
+
+```
+./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
+```
+
+Second, a shell script can be prepended to the original command line which maps
+the index to one or more string parameters. For example, to implement Work List
+5 above, you could do:
+
+```
+. /vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
+```
+
+In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
+and exports `$START_FRAME` and `$END_FRAME` (it is sourced so that the exports
+are visible to the main command).
+
+The shell script could be part of the image, but more usefully, it could be
+generated by a program and stuffed in an annotation or a configMap, and from
+there added to a volume.
+
+The first approach may require the user to modify an existing image (see next
+section) to be able to accept an `$INDEX` env var or argument. The second
+approach requires that the image have a shell.
We think that together these two
+options cover a wide range of use cases (though not all).
+
+#### Multiple Substitution
+
+In this option, the JobSpec is extended to include a list of values to
+substitute, and which fields to substitute them into. For example, a worklist
+like this:
+
+```
+FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
+FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
+FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
+```
+
+can be broken down into a template like this, with three parameters:
+
+```
+<1>; process-fruit -a -b -c <2> <3>
+```
+
+and a list of parameter tuples, like this:
+
+```
+("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
+("FRUIT_COLOR=yellow", "-f banana.txt", "")
+("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
+```
+
+The JobSpec can be extended to hold a list of parameter tuples (which are more
+easily expressed as a list of lists of individual parameters). For example:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+  completions: 3
+  ...
+  template:
+    ...
+  perCompletionArgs:
+    container: 0
+    -
+      - "-f apple.txt"
+      - "-f banana.txt"
+      - "-f cherry.txt"
+    -
+      - "--remove-seeds"
+      - ""
+      - "--remove-pit"
+  perCompletionEnvVars:
+    - name: "FRUIT_COLOR"
+      - "green"
+      - "yellow"
+      - "red"
+```
+
+However, just providing custom env vars, and not arguments, is sufficient for
+many use cases: parameters can be put into env vars, and then substituted on the
+command line.
+
+#### Comparison
+
+The multiple substitution approach:
+
+- keeps the *per completion parameters* in the JobSpec.
+- Drawback: makes the job spec large for jobs with thousands of completions. (But
+for very large jobs, the work-queue style or another type of controller, such as
+map-reduce or spark, may be a better fit.)
+- Drawback: is a form of server-side templating, which we want in Kubernetes but
+have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
+
+The index-only approach:
+
+- Requires that the user keep the *per completion parameters* in separate
+storage, such as a configData or networked storage.
+- Makes no changes to the JobSpec.
+- Drawback: while in separate storage, the parameters could be mutated, which
+would have unexpected effects.
+- Drawback: logic for using the index to look up parameters needs to be in the
+Pod.
+- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
+pod from a job. They cannot easily say, for example, `repeated failures on the
+pod processing banana.txt`.
+
+The index-only approach relies on at least one of the following being true:
+
+1. The image contains a shell and certain shell commands (not all images have
+these).
+1. The user's program directly consumes the index from annotations (file or env
+var) and maps it to the specific behavior in the main program.
+
+Also, using the index-only approach from non-kubectl clients requires that they
+mimic the script-generation step, or only use the second style.
+
+#### Decision
+
+It is decided to implement the Index-only approach now. Once the server-side
+templating design is complete for Kubernetes, and we have feedback from users,
+we can consider adding Multiple Substitution.
+
+## Detailed Design
+
+#### Job Resource Schema Changes
+
+No changes are made to the JobSpec.
+
+
+The JobStatus is also not changed. The user can gauge the progress of the job by
+the `.status.succeeded` count.
+
+
+#### Job Spec Compatibility
+
+A job spec written before this change will work exactly the same as before with
+the new controller. The Pods it creates will have the same environment as
+before. They will have a new annotation, but pods are expected to tolerate
+unfamiliar annotations.
+
+However, if the job controller version is reverted to a version before this
+change, jobs whose pod specs depend on the new annotation will fail.
+This is okay for a Beta resource.
+
+#### Job Controller Changes
+
+The Job controller will maintain for each Job a data structure which
+indicates the status of each completion index. We call this the
+*scoreboard* for short. It is an array of length `.spec.completions`.
+Elements of the array are of `enum` type, with possible values including
+`complete`, `running`, and `notStarted`.
+
+The scoreboard is stored in Job Controller memory for efficiency. It can be
+reconstructed from watching pods of the job (such as on
+a controller manager restart). The index of the pods can be extracted from the
+pod annotation.
+
+When the Job controller sees that the number of running pods is less than the
+desired parallelism of the job, it finds the first index in the scoreboard with
+value `notStarted`. It creates a pod with this completion index.
+
+When it creates a pod with completion index `i`, it makes a copy of the
+`.spec.template`, and sets
+`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
+`i`. It does this in both the index-only and multiple-substitutions options.
+
+Then it creates the pod.
+
+When the controller notices that a pod has completed or is running or failed,
+it updates the scoreboard.
+
+When all entries in the scoreboard are `complete`, then the job is complete.
+
+
+#### Downward API Changes
+
+The downward API is changed to support extracting specific key names into a
+single environment variable. So, the following would be supported:
+
+```
+kind: Pod
+version: v1
+spec:
+  containers:
+  - name: foo
+    env:
+    - name: MY_INDEX
+      valueFrom:
+        fieldRef:
+          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+This requires kubelet changes.
+
+Users who fail to upgrade their kubelets at the same time as they upgrade their
+controller manager will see a failure for pods to run when they are created by
+the controller. The Kubelet will send an event about failure to create the pod.
+`kubectl describe job` will show many failed pods.
+
+
+#### Kubectl Interface Changes
+
+The `--completions` and `--completion-index-var-name` flags are added to
+kubectl.
+
+For example, this command:
+
+```
+kubectl run say-number --image=busybox \
+    --completions=3 \
+    --completion-index-var-name=I \
+    -- \
+    sh -c 'echo "My index is $I" && sleep 5'
+```
+
+will run 3 pods to completion, each printing one of the following lines:
+
+```
+My index is 1
+My index is 2
+My index is 0
+```
+
+Kubectl would create the following pod:
+
+
+
+Kubectl will also support the `--per-completion-env` flag, as described
+previously. For example, this command:
+
+```
+kubectl run say-fruit --image=busybox \
+    --per-completion-env=FRUIT="apple banana cherry" \
+    --per-completion-env=COLOR="green yellow red" \
+    -- \
+    sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or equivalently:
+
+```
+echo "apple banana cherry" > fruits.txt
+echo "green yellow red" > colors.txt
+
+kubectl run say-fruit --image=busybox \
+    --per-completion-env=FRUIT="$(cat fruits.txt)" \
+    --per-completion-env=COLOR="$(cat colors.txt)" \
+    -- \
+    sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+or similarly:
+
+```
+kubectl run say-fruit --image=busybox \
+    --per-completion-env=FRUIT=@fruits.txt \
+    --per-completion-env=COLOR=@colors.txt \
+    -- \
+    sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+will all run 3 pods in parallel. The index 0 pod will log:
+
+```
+Have a nice green apple
+```
+
+and so on.
+
+
+Notes:
+
+- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
+quoted space separated list or `@` and the name of a text file containing a
+list.
+- `--per-completion-env=` can be specified several times, but all must have the
+same length list.
+- `--completions=N`, with `N` equal to the list length, is implied.
+- The flag `--completions=3` sets `job.spec.completions=3`.
+- The flag `--completion-index-var-name=I` causes an env var named `I` to be
+created in each pod, with the index in it.
+- The flag `--restart=OnFailure` is implied by `--completions` or any
+job-specific arguments. The user can also specify `--restart=Never` if they
+desire, but may not specify `--restart=Always` with job-related flags.
+- Setting any of these flags in turn tells kubectl to create a Job, not a
+replicationController.
+
+#### How Kubectl Creates Job Specs.
+
+To pass in the parameters, kubectl will generate a shell script which
+can:
+- parse the index from the annotation
+- hold all the parameter lists
+- look up the correct index in each parameter list and set an env var.
+
+For example, consider this command:
+
+```
+kubectl run say-fruit --image=busybox \
+    --per-completion-env=FRUIT="apple banana cherry" \
+    --per-completion-env=COLOR="green yellow red" \
+    -- \
+    sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
+```
+
+First, kubectl generates the PodSpec as it normally does for `kubectl run`.
+
+But, then it will generate this script:
+
+```sh
+#!/bin/sh
+# Generated by kubectl run ...
+# Check for needed commands
+if ! type cat > /dev/null
+then
+  echo "$0: Image does not include required command: cat"
+  exit 2
+fi
+if ! type grep > /dev/null
+then
+  echo "$0: Image does not include required command: grep"
+  exit 2
+fi
+# Check that annotations are mounted from downward API
+if [ ! -e /etc/annotations ]
+then
+  echo "$0: Cannot find /etc/annotations"
+  exit 2
+fi
+# Get our index from annotations file
+I=$(grep 'kubernetes.io/job/completion-index' /etc/annotations | cut -f 2 -d '"') || echo "$0: failed to extract index"
+export I
+
+# Our parameter lists are stored inline in this script.
+FRUIT_0="apple"
+FRUIT_1="banana"
+FRUIT_2="cherry"
+# Extract the right parameter value based on our index.
+# This works on any Bourne-based shell.
+FRUIT=$(eval echo \$"FRUIT_$I")
+export FRUIT
+
+COLOR_0="green"
+COLOR_1="yellow"
+COLOR_2="red"
+
+COLOR=$(eval echo \$"COLOR_$I")
+export COLOR
+```
+
+Then it POSTs this script, encoded, inside a ConfigData.
+It attaches this volume to the PodSpec.
+
+Then it will edit the command line of the Pod to run this script before the rest of
+the command line.
+
+Then it appends a DownwardAPI volume to the pod spec to get the annotations in a
+file, and it appends the Secret (later configData) volume with the script in it.
+
+So, the Pod template that kubectl creates (inside the job template) looks like this:
+
+```
+apiVersion: extensions/v1beta1
+kind: Job
+...
+spec:
+  ...
+  template:
+    ...
+    spec:
+      containers:
+      - name: c
+        image: gcr.io/google_containers/busybox
+        command:
+        - 'sh'
+        - '-c'
+        - '/opt/job/job-params.sh; echo "this is the rest of the command"'
+        volumeMounts:
+        - name: annotations
+          mountPath: /etc
+        - name: script
+          mountPath: /opt/job
+      volumes:
+      - name: annotations
+        downwardAPI:
+          items:
+            - path: "annotations"
+              fieldRef:
+                fieldPath: metadata.annotations
+      - name: script
+        secret:
+          secretName: jobparams-abc123
+```
+
+###### Alternatives
+
+Kubectl could append a `valueFrom` line like this to
+get the index into the environment:
+
+```yaml
+apiVersion: extensions/v1beta1
+kind: Job
+metadata:
+  ...
+spec:
+  ...
+  template:
+    ...
+    spec:
+      containers:
+      - name: foo
+        ...
+        env:
+        # following block added:
+        - name: I
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
+```
+
+However, in order to inject other env vars from the parameter lists,
+kubectl still needs to edit the command line.
+
+Parameter lists could be passed via a configData volume instead of a secret.
+Kubectl can be changed to work that way once the configData implementation is
+complete.
+ +Parameter lists could be passed inside an EnvVar. This would have length +limitations, would pollute the output of `kubectl describe pods` and `kubectl +get pods -o json`. + +Parameter lists could be passed inside an annotation. This would have length +limitations, would pollute the output of `kubectl describe pods` and `kubectl +get pods -o json`. Also, currently annotations can only be extracted into a +single file. Complex logic is then needed to filter out exactly the desired +annotation data. + +Bash array variables could simplify extraction of a particular parameter from a +list of parameters. However, some popular base images do not include +`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation +that does not support array syntax. + +Kubelet does support [expanding variables without a +shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not +allow for recursive substitution, which is required to extract the correct +parameter from a list based on the completion index of the pod. The syntax +could be extended, but doing so seems complex and will be an unfamiliar syntax +for users. + +Putting all the command line editing into a script and running that causes +the least pollution to the original command line, and it allows +for complex error handling. + +Kubectl could store the script in an [Inline Volume]( +https://github.com/kubernetes/kubernetes/issues/13610) if that proposal +is approved. That would remove the need to manage the lifetime of the +configData/secret, and prevent the case where someone changes the +configData mid-job, and breaks things in a hard-to-debug way. + + +## Interactions with other features + +#### Supporting Work Queue Jobs too + +For Work Queue Jobs, completions has no meaning. Parallelism should be allowed +to be greater than it, and pods have no identity. So, the job controller should +not create a scoreboard in the JobStatus, just a count. 
Therefore, we need to +add one of the following to JobSpec: + +- allow unset `.spec.completions` to indicate no scoreboard, and no index for +tasks (identical tasks). +- allow `.spec.completions=-1` to indicate the same. +- add `.spec.indexed` to job to indicate need for scoreboard. + +#### Interaction with vertical autoscaling + +Since pods of the same job will not be created with different resources, +a vertical autoscaler will need to: + +- if it has index-specific initial resource suggestions, suggest those at +admission time; it will need to understand indexes. +- mutate resource requests on already created pods based on usage trend or +previous container failures. +- modify the job template, affecting all indexes. + +#### Comparison to StatefulSets (previously named PetSets) + +The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b. +The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more +restrictive and thus less verbose. + +It would be easier for users if Indexed Job and StatefulSet are similar where +possible. However, StatefulSet differs in several key respects: + +- StatefulSet is for ones to tens of instances. Indexed job should work with tens of +thousands of instances. +- When you have few instances, you may want to give them names. When you have many instances, +integer indexes make more sense. +- When you have thousands of instances, storing the work-list in the JobSpec +is verbose. For StatefulSet, this is less of a problem. +- StatefulSets (apparently) need to differ in more fields than indexed Jobs. + +This differs from StatefulSet in that StatefulSet uses names and not indexes. StatefulSet is +intended to support ones to tens of things. 
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]() + diff --git a/design/metadata-policy.md b/design/metadata-policy.md new file mode 100644 index 00000000..57416f11 --- /dev/null +++ b/design/metadata-policy.md @@ -0,0 +1,137 @@ +# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system + +## Introduction + +This document describes a new API resource, `MetadataPolicy`, that configures an +admission controller to take one or more actions based on an object's metadata. +Initially the metadata fields that the predicates can examine are labels and +annotations, and the actions are to add one or more labels and/or annotations, +or to reject creation/update of the object. In the future other actions might be +supported, such as applying an initializer. + +The first use of `MetadataPolicy` will be to decide which scheduler should +schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md) +Kubernetes system. In particular, the policy will add the scheduler name +annotation to a pod based on an annotation that is already on the pod that +indicates the QoS of the pod. (That annotation was presumably set by a simpler +admission controller that uses code, rather than configuration, to map the +resource requests and limits of a pod to QoS, and attaches the corresponding +annotation.) + +We anticipate a number of other uses for `MetadataPolicy`, such as defaulting +for labels and annotations, prohibiting/requiring particular labels or +annotations, or choosing a scheduling policy within a scheduler. We do not +discuss them in this doc. + + +## API + +```go +// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource. +// Every rule is applied, in an unspecified order, but if the action for any rule +// that matches is to reject the object, then the object is rejected without being mutated. 
+type MetadataPolicySpec struct {
+    Rules []MetadataPolicyRule `json:"rules,omitempty"`
+}
+
+// If the PolicyPredicate is met, then the PolicyAction is applied.
+// Example rules:
+// reject object if label with key X is present (i.e. require X)
+// reject object if label with key X is not present (i.e. forbid X)
+// add label X=Y if label with key X is not present (i.e. default X)
+// add annotation A=B if object has annotation C=D or E=F
+type MetadataPolicyRule struct {
+    PolicyPredicate PolicyPredicate `json:"policyPredicate"`
+    PolicyAction    PolicyAction    `json:"policyAction"`
+}
+
+// All criteria must be met for the PolicyPredicate to be considered met.
+type PolicyPredicate struct {
+    // Note that Namespace is not listed here because MetadataPolicy is per-Namespace.
+    LabelSelector      *LabelSelector `json:"labelSelector,omitempty"`
+    AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"`
+}
+
+// Apply the indicated Labels and/or Annotations (if present), unless Reject is set
+// to true, in which case reject the object without mutating it.
+type PolicyAction struct {
+    // If true, the object will be rejected and not mutated.
+    Reject bool `json:"reject"`
+    // The labels to add or update, if any.
+    UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"`
+    // The annotations to add or update, if any.
+    UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"`
+}
+
+// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying
+// policies that should be applied to objects based on the objects' metadata. All MetadataPolicy
+// objects are applied to all objects in the namespace; the order of evaluation is not guaranteed,
+// but if any of the matching policies have an action of rejecting the object, then the object
+// will be rejected without being mutated.
+type MetadataPolicy struct {
+    unversioned.TypeMeta `json:",inline"`
+    // Standard object's metadata.
+	// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
+	ObjectMeta `json:"metadata,omitempty"`
+
+	// Spec defines the metadata policy that should be enforced.
+	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
+	Spec MetadataPolicySpec `json:"spec,omitempty"`
+}
+
+// MetadataPolicyList is a list of MetadataPolicy items.
+type MetadataPolicyList struct {
+	unversioned.TypeMeta `json:",inline"`
+	// Standard list metadata.
+	// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
+	unversioned.ListMeta `json:"metadata,omitempty"`
+
+	// Items is a list of MetadataPolicy objects.
+	// More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
+	Items []MetadataPolicy `json:"items"`
+}
+```
+
+## Implementation plan
+
+1. Create the `MetadataPolicy` API resource
+1. Create an admission controller that implements policies defined in
+`MetadataPolicy`
+1. Create an admission controller that sets the annotation
+`scheduler.alpha.kubernetes.io/qos: <QOS>`
+(where `<QOS>` is one of `Guaranteed, Burstable, BestEffort`)
+based on the pod's resource requests and limits.
+
+## Future work
+
+Longer-term we will have QoS be set on create and update by the registry,
+similar to the `Pending` phase today, instead of having an admission controller
+(that runs before the one that takes `MetadataPolicy` as input) do it.
+
+We plan to eventually move from having an admission controller set the scheduler
+name as a pod annotation, to using the initializer concept. In particular, the
+scheduler will be an initializer, and the admission controller that decides
+which scheduler to use will add the scheduler's name to the list of initializers
+for the pod (presumably the scheduler will be the last initializer to run on
+each pod).
The admission controller would still be configured using the
+`MetadataPolicy` described here; only the mechanism the admission controller
+uses to record its decision of which scheduler to use would change.
+
+## Related issues
+
+The main issue for multiple schedulers is #11793. There was also a lot of
+discussion in PRs #17197 and #17865.
+
+We could use the approach described here to choose a scheduling policy within a
+single scheduler, as opposed to choosing a scheduler, a desire mentioned in
+#9920. Issue #17097 describes a scenario unrelated to scheduler-choosing where
+`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized
+API for matching "claims" to "service classes"; matching a pod to a scheduler
+would be one use for such an API.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]()
+
diff --git a/design/monitoring_architecture.md b/design/monitoring_architecture.md
new file mode 100644
index 00000000..b819eeca
--- /dev/null
+++ b/design/monitoring_architecture.md
@@ -0,0 +1,203 @@
+# Kubernetes monitoring architecture
+
+## Executive Summary
+
+Monitoring is split into two pipelines:
+
+* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down
+Heapster called metrics-server, and the API server serving the master metrics API. These
+metrics are used by core system components, such as scheduling logic (e.g. scheduler and
+horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components
+(e.g. `kubectl top`). This pipeline is not intended for integration with third-party
+monitoring systems.
+* A **monitoring pipeline** used for collecting various metrics from the system and exposing
+them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore
+via adapters. Users can choose from many monitoring system vendors, or run none at all.
In +open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options +will be easy to install. We expect that such pipelines will typically consist of a per-node +agent and a cluster-level aggregator. + +The architecture is illustrated in the diagram in the Appendix of this doc. + +## Introduction and Objectives + +This document proposes a high-level monitoring architecture for Kubernetes. It covers +a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc, +specifically focusing on an architecture (components and their interactions) that +hopefully meets the numerous requirements. We do not specify any particular timeframe +for implementing this architecture, nor any particular roadmap for getting there. + +### Terminology + +There are two types of metrics, system metrics and service metrics. System metrics are +generic metrics that are generally available from every entity that is monitored (e.g. +usage of CPU and memory by container and node). Service metrics are explicitly defined +in application code and exported (e.g. number of 500s served by the API server). Both +system metrics and service metrics can originate from users’ containers or from system +infrastructure components (master components like the API server, addon pods running on +the master, and addon pods running on user nodes). 
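The 500s example above can be made concrete: a service metric is just a counter that application code defines and exports itself. A minimal Go sketch (this uses the standard library's `expvar` purely for illustration — the pipelines described here expect Prometheus-format instrumentation — and the metric and helper names are made up):

```go
package main

import (
	"expvar"
	"fmt"
)

// A service metric: defined explicitly in application code and exported by the
// process itself (expvar serves it at /debug/vars when net/http is running).
// The name "api_500s_served" is illustrative, not a real Kubernetes metric.
var apiServed500s = expvar.NewInt("api_500s_served")

// record500 bumps the counter, as an HTTP handler would on an internal error,
// and returns the current count.
func record500() int64 {
	apiServed500s.Add(1)
	return apiServed500s.Value()
}

func main() {
	fmt.Println(record500()) // 1
}
```

System metrics, by contrast, need no such in-app definition: they are collected generically for every container and node by the infrastructure.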
+
+We divide system metrics into:
+
+* *core metrics*, which are metrics that Kubernetes understands and uses for operation
+of its internal components and core utilities -- for example, metrics used for scheduling
+(including the inputs to the algorithms for resource estimation, initial resources/vertical
+autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics),
+the kube dashboard, and “kubectl top.” As of now this would consist of cpu cumulative usage,
+memory instantaneous usage, disk usage of pods, and disk usage of containers.
+* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they
+include the core metrics (though not necessarily in a format Kubernetes understands) plus
+additional metrics.
+
+Service metrics can be divided into those produced by Kubernetes infrastructure components
+(and thus useful for operation of the Kubernetes cluster) and those produced by user applications.
+Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics.
+Of course horizontal pod autoscaling also uses core metrics.
+
+We consider logging to be separate from monitoring, so logging is outside the scope of
+this doc.
+
+### Requirements
+
+The monitoring architecture should:
+
+* include a solution that is part of core Kubernetes and
+  * makes core system metrics about nodes, pods, and containers available via a standard
+    master API (today the master metrics API), such that core Kubernetes features do not
+    depend on non-core components
+  * requires Kubelet to only export a limited set of metrics, namely those required for
+    core Kubernetes components to correctly operate (this is related to #18770)
+  * can scale up to at least 5000 nodes
+  * is small enough that we can require that all of its components be running in all deployment
+    configurations
+* include an out-of-the-box solution that can serve historical data, e.g.
to support Initial
+Resources and vertical pod autoscaling as well as cluster analytics queries, that depends
+only on core Kubernetes
+* allow for third-party monitoring solutions that are not part of core Kubernetes and can
+be integrated with components like the Horizontal Pod Autoscaler that require service metrics
+
+## Architecture
+
+We divide our description of the long-term architecture plan into the core metrics pipeline
+and the monitoring pipeline. For each, it is necessary to think about how to deal with each
+type of metric (core metrics, non-core metrics, and service metrics) from both the master
+and minions.
+
+### Core metrics pipeline
+
+The core metrics pipeline collects a set of core system metrics. There are two sources for
+these metrics:
+
+* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that
+is part of Kubelet will be slimmed down to provide only core system metrics)
+* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from
+Kubelet into resource estimates (values used by the scheduler for a more advanced usage-based
+scheduler)
+
+These sources are scraped by a component we call *metrics-server*, which is like a slimmed-down
+version of today's Heapster. metrics-server stores only the latest values locally and has no sinks.
+metrics-server exposes the master metrics API. (The configuration described here is similar
+to the current Heapster in “standalone” mode.)
+[Discovery summarizer](../../docs/proposals/federated-api-servers.md)
+makes the master metrics API available to external clients such that from the client’s perspective
+it looks the same as talking to the API server.
+
+Core (system) metrics are handled as described above in all deployment environments. The only
+easily replaceable part is the resource estimator, which power users can swap out.
In
+theory, metrics-server itself can also be substituted, but it’d be similar to substituting
+the apiserver itself or the controller-manager -- possible, but not recommended and not supported.
+
+Eventually the core metrics pipeline might also collect metrics from Kubelet and the Docker
+daemon themselves (e.g. CPU usage of Kubelet), even though they do not run in containers.
+
+The core metrics pipeline is intentionally small and not designed for third-party integrations.
+“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline
+(see next section) and can run on Kubernetes without having to make changes to upstream components.
+In this way we can remove the burden we have today that comes with maintaining Heapster as the
+integration point for every possible metrics source, sink, and feature.
+
+#### Infrastore
+
+We will build an open-source Infrastore component (most likely reusing existing technologies)
+for serving historical queries over core system metrics and events, which it will fetch from
+the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries --
+this is TBD) to handle the following use cases:
+
+* initial resources
+* vertical autoscaling
+* oldtimer API
+* decision-support queries for debugging, capacity planning, etc.
+* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
+
+In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes
+infrastructure containers), described in the upcoming sections.
+
+### Monitoring pipeline
+
+One of the goals of building a dedicated metrics pipeline for core metrics, as described in the
+previous section, is to allow for a separate monitoring pipeline that can be very flexible
+because core Kubernetes components do not need to rely on it. By default we will not provide
+one, but we will provide an easy way to install one (using a single command, most likely using
+Helm).
We describe the monitoring pipeline in this section.
+
+Data collected by the monitoring pipeline may contain any sub- or superset of the following groups
+of metrics:
+
+* core system metrics
+* non-core system metrics
+* service metrics from user application containers
+* service metrics from Kubernetes infrastructure containers; these metrics are exposed using
+Prometheus instrumentation
+
+It is up to the monitoring solution to decide which of these are collected.
+
+In order to enable horizontal pod autoscaling based on custom metrics, the provider of the
+monitoring pipeline would also have to create a stateless API adapter that pulls the custom
+metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. Such
+an API will be a well-defined, versioned API similar to regular APIs. Details of how it will be
+exposed or discovered will be covered in a detailed design doc for this component.
+
+The same approach applies if it is desired to make monitoring pipeline metrics available in
+Infrastore. These adapters could be standalone components, libraries, or part of the monitoring
+solution itself.
+
+There are many possible combinations of node and cluster-level agents that could comprise a
+monitoring pipeline, including:
+
+* cAdvisor + Heapster + InfluxDB (or any other sink)
+* cAdvisor + collectd + Heapster
+* cAdvisor + Prometheus
+* snapd + Heapster
+* snapd + SNAP cluster-level agent
+* Sysdig
+
+As an example, we’ll describe a potential integration with cAdvisor + Prometheus.
+
+Prometheus has the following metric sources on a node:
+
+* core and non-core system metrics from cAdvisor
+* service metrics exposed by containers via HTTP handler in Prometheus format
+* [optional] metrics about the node itself from Node Exporter (a Prometheus component)
+
+All of them are polled by the Prometheus cluster-level agent.
We can use the Prometheus +cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a +standalone API adapter that proxies/translates between the Prometheus Query Language endpoint +on the Prometheus cluster-level agent and an HPA-specific API. Likewise an adapter can be +used to make the metrics from the monitoring pipeline available in Infrastore. Neither +adapter is necessary if the user does not need the corresponding feature. + +The command that installs cAdvisor+Prometheus should also automatically set up collection +of the metrics from infrastructure containers. This is possible because the names of the +infrastructure containers and metrics of interest are part of the Kubernetes control plane +configuration itself, and because the infrastructure containers export their metrics in +Prometheus format. + +## Appendix: Architecture diagram + +### Open-source monitoring pipeline + +![Architecture Diagram](monitoring_architecture.png?raw=true "Architecture overview") + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/monitoring_architecture.md?pixel)]() + diff --git a/design/monitoring_architecture.png b/design/monitoring_architecture.png new file mode 100644 index 00000000..570996b7 Binary files /dev/null and b/design/monitoring_architecture.png differ diff --git a/design/namespaces.md b/design/namespaces.md new file mode 100644 index 00000000..8a9c97c8 --- /dev/null +++ b/design/namespaces.md @@ -0,0 +1,370 @@ +# Namespaces + +## Abstract + +A Namespace is a mechanism to partition resources created by users into +a logically named group. + +## Motivation + +A single cluster should be able to satisfy the needs of multiple user +communities. + +Each user community wants to be able to work in isolation from other +communities. + +Each user community has its own: + +1. resources (pods, services, replication controllers, etc.) +2. 
policies (who can or cannot perform actions in their community) +3. constraints (this community is allowed this much quota, etc.) + +A cluster operator may create a Namespace for each unique user community. + +The Namespace provides a unique scope for: + +1. named resources (to avoid basic naming collisions) +2. delegated management authority to trusted users +3. ability to limit community resource consumption + +## Use cases + +1. As a cluster operator, I want to support multiple user communities on a +single cluster. +2. As a cluster operator, I want to delegate authority to partitions of the +cluster to trusted users in those communities. +3. As a cluster operator, I want to limit the amount of resources each +community can consume in order to limit the impact to other communities using +the cluster. +4. As a cluster user, I want to interact with resources that are pertinent to +my user community in isolation of what other user communities are doing on the +cluster. + +## Design + +### Data Model + +A *Namespace* defines a logically named group for multiple *Kind*s of resources. + +```go +type Namespace struct { + TypeMeta `json:",inline"` + ObjectMeta `json:"metadata,omitempty"` + + Spec NamespaceSpec `json:"spec,omitempty"` + Status NamespaceStatus `json:"status,omitempty"` +} +``` + +A *Namespace* name is a DNS compatible label. + +A *Namespace* must exist prior to associating content with it. + +A *Namespace* must not be deleted if there is content associated with it. + +To associate a resource with a *Namespace* the following conditions must be +satisfied: + +1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with +the server +2. The resource's *TypeMeta.Namespace* field must have a value that references +an existing *Namespace* + +The *Name* of a resource associated with a *Namespace* is unique to that *Kind* +in that *Namespace*. 
+
+It is intended to be used in resource URLs; it is provided by clients at
+creation time and encouraged to be human friendly; and it is intended to
+facilitate idempotent creation, space-uniqueness of singleton objects,
+distinguishing of distinct entities, and referencing of particular entities
+across operations.
+
+### Authorization
+
+A *Namespace* provides an authorization scope for accessing content associated
+with the *Namespace*.
+
+See [Authorization plugins](../admin/authorization.md)
+
+### Limit Resource Consumption
+
+A *Namespace* provides a scope to limit resource consumption.
+
+A *LimitRange* defines min/max constraints on the amount of resources a single
+entity can consume in a *Namespace*.
+
+See [Admission control: Limit Range](admission_control_limit_range.md)
+
+A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and
+allows cluster operators to define *Hard* resource usage limits that a
+*Namespace* may consume.
+
+See [Admission control: Resource Quota](admission_control_resource_quota.md)
+
+### Finalizers
+
+Upon creation of a *Namespace*, the creator may provide a list of *Finalizer*
+objects.
+
+```go
+type FinalizerName string
+
+// These are internal finalizers to Kubernetes; they must be qualified names unless defined here
+const (
+	FinalizerKubernetes FinalizerName = "kubernetes"
+)
+
+// NamespaceSpec describes the attributes on a Namespace
+type NamespaceSpec struct {
+	// Finalizers is an opaque list of values that must be empty to permanently remove object from storage
+	Finalizers []FinalizerName
+}
+```
+
+A *FinalizerName* is a qualified name.
+
+The API Server enforces that a *Namespace* can be deleted from storage only if
+its *Namespace.Spec.Finalizers* list is empty.
+
+A *finalize* operation is the only mechanism to modify the
+*Namespace.Spec.Finalizers* field post creation.
+
+By default, each *Namespace* is created with *kubernetes* as an item in its
+initial *Namespace.Spec.Finalizers* list.
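The deletion rule above can be sketched in a few lines of Go (a hypothetical helper for illustration, not actual API server code; only the types mirror the spec above):

```go
package main

import "fmt"

// FinalizerName and NamespaceSpec mirror the API sketch above.
type FinalizerName string

type NamespaceSpec struct {
	Finalizers []FinalizerName
}

// deletableFromStorage reports whether permanent removal would be permitted:
// the Finalizers list must be empty. (Hypothetical helper name.)
func deletableFromStorage(spec NamespaceSpec) bool {
	return len(spec.Finalizers) == 0
}

// finalize removes one finalizer name, which is what a *finalize* operation
// does on behalf of the agent that owns that finalizer.
func finalize(spec NamespaceSpec, name FinalizerName) NamespaceSpec {
	kept := spec.Finalizers[:0:0]
	for _, f := range spec.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	spec.Finalizers = kept
	return spec
}

func main() {
	spec := NamespaceSpec{Finalizers: []FinalizerName{"kubernetes"}}
	fmt.Println(deletableFromStorage(spec))                         // false
	fmt.Println(deletableFromStorage(finalize(spec, "kubernetes"))) // true
}
```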
+
+### Phases
+
+A *Namespace* may exist in the following phases.
+
+```go
+type NamespacePhase string
+const (
+	NamespaceActive NamespacePhase = "Active"
+	NamespaceTerminating NamespacePhase = "Terminating"
+)
+
+type NamespaceStatus struct {
+	...
+	Phase NamespacePhase
+}
+```
+
+A *Namespace* is in the **Active** phase if it does not have an
+*ObjectMeta.DeletionTimestamp*.
+
+A *Namespace* is in the **Terminating** phase if it has an
+*ObjectMeta.DeletionTimestamp*.
+
+**Active**
+
+Upon creation, a *Namespace* enters the *Active* phase. This means that content
+may be associated with the namespace, and all normal interactions with the
+namespace are allowed to occur in the cluster.
+
+If a DELETE request occurs for a *Namespace*, the
+*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
+*namespace controller* observes the change, and sets the
+*Namespace.Status.Phase* to *Terminating*.
+
+**Terminating**
+
+A *namespace controller* watches for *Namespace* objects that have a
+*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
+initiate graceful termination of the content associated with the *Namespace*
+that is known to the cluster.
+
+The *namespace controller* enumerates each known resource type in that namespace
+and deletes them one by one.
+
+Admission control blocks creation of new resources in that namespace in order to
+prevent a race condition where the controller could believe all of a given
+resource type had been deleted from the namespace, when in fact some other rogue
+client agent had created new objects. Using admission control in this scenario
+allows each of the registry implementations for the individual objects to avoid
+having to take Namespace life-cycle into account.
+
+Once all objects known to the *namespace controller* have been deleted, the
+*namespace controller* executes a *finalize* operation on the namespace that
+removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.
+ +If the *namespace controller* sees a *Namespace* whose +*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers* +list is empty, it will signal the server to permanently remove the *Namespace* +from storage by sending a final DELETE action to the API server. + +### REST API + +To interact with the Namespace API: + +| Action | HTTP Verb | Path | Description | +| ------ | --------- | ---- | ----------- | +| CREATE | POST | /api/{version}/namespaces | Create a namespace | +| LIST | GET | /api/{version}/namespaces | List all namespaces | +| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} | +| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} | +| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} | +| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces | + +This specification reserves the name *finalize* as a sub-resource to namespace. + +As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*. 
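The reservation of *finalize* falls out of the path layout: a namespace-scoped resourceType literally named `finalize` would occupy the same URL as the sub-resource. A small Go sketch (the helper names here are made up for illustration):

```go
package main

import "fmt"

// collectionPath builds the namespace-scoped collection path for a resource
// type; finalizePath builds the reserved sub-resource path for a namespace.
// Both helpers are hypothetical, named here only for illustration.
func collectionPath(version, namespace, resourceType string) string {
	return fmt.Sprintf("/api/%s/namespaces/%s/%s", version, namespace, resourceType)
}

func finalizePath(version, namespace string) string {
	return fmt.Sprintf("/api/%s/namespaces/%s/finalize", version, namespace)
}

func main() {
	fmt.Println(collectionPath("v1", "development", "pods"))
	// A resourceType named "finalize" would collide with the reserved
	// sub-resource path -- which is why the specification forbids it.
	fmt.Println(collectionPath("v1", "development", "finalize") == finalizePath("v1", "development")) // true
}
```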
+
+To interact with content associated with a Namespace:
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
+| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
+| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
+| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
+| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
+| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
+| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
+| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
+
+The API server verifies that the *Namespace* on resource creation matches the
+*{namespace}* on the path.
+
+The API server will associate a resource with a *Namespace* if not populated by
+the end-user, based on the *Namespace* context of the incoming request. If the
+*Namespace* of the resource being created or updated does not match the
+*Namespace* on the request, then the API server will reject the request.
+
+### Storage
+
+A namespace provides a unique identifier space and therefore must be in the
+storage path of a resource.
+
+In etcd, we want to continue to support efficient WATCH across namespaces.
+
+Resources that persist content in etcd will have storage paths as follows:
+
+/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
+
+This enables consumers to WATCH /registry/{resourceType} for changes across
+namespaces of a particular {resourceType}.
+
+### Kubelet
+
+The kubelet will register pods it sources from a file or HTTP source under a
+namespace associated with the *cluster-id*.
+
+### Example: OpenShift Origin managing a Kubernetes Namespace
+
+In this example, we demonstrate how the design allows for agents built on top of
+Kubernetes that manage their own set of resource types associated with a
+*Namespace* to take part in Namespace termination.
+
+OpenShift creates a Namespace in Kubernetes:
+
+```json
+{
+  "apiVersion":"v1",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "labels": {
+      "name": "development"
+    }
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin", "kubernetes"]
+  },
+  "status": {
+    "phase": "Active"
+  }
+}
+```
+
+OpenShift then goes and creates a set of resources (pods, services, etc)
+associated with the "development" namespace. It also creates its own set of
+resources in its own storage associated with the "development" namespace,
+unknown to Kubernetes.
+
+The user deletes the Namespace in Kubernetes, and the Namespace now has the
+following state:
+
+```json
+{
+  "apiVersion":"v1",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "...",
+    "labels": {
+      "name": "development"
+    }
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin", "kubernetes"]
+  },
+  "status": {
+    "phase": "Terminating"
+  }
+}
+```
+
+The Kubernetes *namespace controller* observes that the namespace has a
+*deletionTimestamp* and begins to terminate all of the content in the namespace
+that it knows about.
Upon success, it executes
+a *finalize* action that modifies the *Namespace* by removing *kubernetes* from
+the list of finalizers:
+
+```json
+{
+  "apiVersion":"v1",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "...",
+    "labels": {
+      "name": "development"
+    }
+  },
+  "spec": {
+    "finalizers": ["openshift.com/origin"]
+  },
+  "status": {
+    "phase": "Terminating"
+  }
+}
+```
+
+OpenShift Origin has its own *namespace controller* that is observing cluster
+state, and it observes that the same namespace had a *deletionTimestamp*
+assigned to it. It too will go and purge resources from its own storage that it
+manages associated with that namespace. Upon completion, it executes a
+*finalize* action and removes the reference to "openshift.com/origin" from the
+list of finalizers.
+
+This results in the following state:
+
+```json
+{
+  "apiVersion":"v1",
+  "kind": "Namespace",
+  "metadata": {
+    "name": "development",
+    "deletionTimestamp": "...",
+    "labels": {
+      "name": "development"
+    }
+  },
+  "spec": {
+    "finalizers": []
+  },
+  "status": {
+    "phase": "Terminating"
+  }
+}
+```
+
+At this point, the Kubernetes *namespace controller* in its sync loop will see
+that the namespace has a deletion timestamp and that its list of finalizers is
+empty. As a result, it knows all content associated with that namespace has been
+purged. It performs a final DELETE action to remove that Namespace from storage.
+
+At this point, all content associated with that Namespace, and the Namespace
+itself, are gone.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]()
+
diff --git a/design/networking.md b/design/networking.md
new file mode 100644
index 00000000..6e269481
--- /dev/null
+++ b/design/networking.md
@@ -0,0 +1,190 @@
+# Networking
+
+There are 4 distinct networking problems to solve:
+
+1. Highly-coupled container-to-container communications
+2. Pod-to-Pod communications
+3. 
Pod-to-Service communications
+4. External-to-internal communications
+
+## Model and motivation
+
+Kubernetes deviates from the default Docker networking model (though as of
+Docker 1.8 their network plugins are getting closer). The goal is for each pod
+to have an IP in a flat shared networking namespace that has full communication
+with other physical computers and containers across the network. IP-per-pod
+creates a clean, backward-compatible model where pods can be treated much like
+VMs or physical hosts from the perspectives of port allocation, networking,
+naming, service discovery, load balancing, application configuration, and
+migration.
+
+Dynamic port allocation, on the other hand, requires supporting both static
+ports (e.g., for externally accessible services) and dynamically allocated
+ports, requires partitioning centrally allocated and locally acquired dynamic
+ports, complicates scheduling (since ports are a scarce resource), is
+inconvenient for users, complicates application configuration, is plagued by
+port conflicts and reuse and exhaustion, requires non-standard approaches to
+naming (e.g. consul or etcd rather than DNS), requires proxies and/or
+redirection for programs using standard naming/addressing mechanisms (e.g. web
+browsers), requires watching and cache invalidation for address/port changes
+for instances in addition to watching group membership changes, and obstructs
+container/pod migration (e.g. using CRIU). NAT introduces additional complexity
+by fragmenting the addressing space, which breaks self-registration mechanisms,
+among other problems.
+
+## Container to container
+
+All containers within a pod behave as if they are on the same host with regard
+to networking. They can all reach each other’s ports on localhost. This offers
+simplicity (static ports known a priori), security (ports bound to localhost
+are visible within the pod but never outside it), and performance.
This also
+reduces friction for applications moving from the world of uncontainerized apps
+on physical or virtual hosts. People running application stacks together on
+the same host have already figured out how to make ports not conflict and have
+arranged for clients to find them.
+
+The approach does reduce isolation between containers within a pod —
+ports could conflict, and there can be no container-private ports, but these
+seem to be relatively minor issues with plausible future workarounds. Besides,
+the premise of pods is that containers within a pod share some resources
+(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
+Additionally, the user can control what containers belong to the same pod
+whereas, in general, they don't control what pods land together on a host.
+
+## Pod to pod
+
+Because every pod gets a "real" (not machine-private) IP address, pods can
+communicate without proxies or translations. The pod can use well-known port
+numbers and can avoid the use of higher-level service discovery systems like
+DNS-SD, Consul, or Etcd.
+
+When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
+it sees the same IP that any peer container would see its traffic coming from —
+each pod has its own IP address that other pods can know. By making IP addresses
+and ports the same both inside and outside the pods, we create a NAT-less, flat
+address space. Running "ip addr show" should work as expected. This would enable
+all existing naming/discovery mechanisms to work out of the box, including
+self-registration mechanisms and applications that distribute IP addresses. We
+should be optimizing for inter-pod network communication. Within a pod,
+containers are more likely to use communication through volumes (e.g., tmpfs) or
+IPC.
+
+This is different from the standard Docker model. In that model, each container
+gets an IP in the 172-dot space and would only see that 172-dot address from
+SIOCGIFADDR.
If these containers connect to another container the peer would see +the connect coming from a different IP than the container itself knows. In short +— you can never self-register anything from a container, because a +container can not be reached on its private IP. + +An alternative we considered was an additional layer of addressing: pod-centric +IP per container. Each container would have its own local IP address, visible +only within that pod. This would perhaps make it easier for containerized +applications to move from physical/virtual hosts to pods, but would be more +complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) +and to reason about, due to the additional layer of address translation, and +would break self-registration and IP distribution mechanisms. + +Like Docker, ports can still be published to the host node's interface(s), but +the need for this is radically diminished. + +## Implementation + +For the Google Compute Engine cluster configuration scripts, we use [advanced +routing rules](https://developers.google.com/compute/docs/networking#routing) +and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that +get routed to it. This is in addition to the 'main' IP address assigned to the +VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to +differentiate it from `docker0`) is set up outside of Docker proper. + +Example of GCE's advanced routing rules: + +```sh +gcloud compute routes add "${NODE_NAMES[$i]}" \ + --project "${PROJECT}" \ + --destination-range "${NODE_IP_RANGES[$i]}" \ + --network "${NETWORK}" \ + --next-hop-instance "${NODE_NAMES[$i]}" \ + --next-hop-instance-zone "${ZONE}" & +``` + +GCE itself does not know anything about these IPs, though. This means that when +a pod tries to egress beyond GCE's project the packets must be SNAT'ed +(masqueraded) to the VM's IP, which GCE recognizes and allows. 
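The SNAT mentioned above amounts to a masquerade rule on each VM. An illustrative, much-simplified fragment — the private range and interface name here are assumptions, and the real cluster scripts install a more careful rule set:

```sh
# Masquerade traffic from local pods that is NOT destined for the private
# range (assumed 10.0.0.0/8 here) so that it egresses with the VM's own
# NAT-ed IP, which GCE recognizes and allows.
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
```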
+
+### Other implementations
+
+With the primary aim of providing the IP-per-pod model, other implementations
+exist to serve this purpose outside of GCE.
+ - [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md)
+ - [Flannel](https://github.com/coreos/flannel#flannel)
+ - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
+   ("With Linux Bridge devices" section)
+ - [Weave](https://github.com/zettio/weave) is yet another way to build an
+   overlay network, primarily aiming at Docker integration.
+ - [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
+   container IPs.
+
+## Pod to service
+
+The [service](../user-guide/services.md) abstraction provides a way to group pods under a
+common access policy (e.g. load-balanced). The implementation of this creates a
+virtual IP which clients can access and which is transparently proxied to the
+pods in a Service. Each node runs a kube-proxy process which programs
+`iptables` rules to trap access to service IPs and redirect them to the correct
+backends. This provides a highly-available load-balancing solution with low
+performance overhead by balancing client traffic from a node on that same node.
+
+## External to internal
+
+So far the discussion has been about how to access a pod or service from within
+the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
+want to offer highly-available, high-performance load balancing to target
+Kubernetes Services. Most public cloud providers are simply not flexible enough
+yet.
+
+The way this is generally implemented is to set up external load balancers (e.g.
+GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
+traffic arrives at a node it is recognized as being part of a particular Service
+and routed to an appropriate backend Pod. This does mean that some traffic will
+get double-bounced on the network. Once cloud providers have better offerings
+we can take advantage of those. 
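The per-node balancing decision described above can be sketched as a simple round-robin choice among a service's backends. This is an illustration of the idea, not kube-proxy's actual code, and the endpoint addresses are made up:

```go
package main

import "fmt"

// roundRobin sketches what happens after traffic to a service's virtual IP is
// trapped on a node: the connection is forwarded to the service's backend
// pods in turn.
type roundRobin struct {
	backends []string
	next     int
}

// pick returns the next backend endpoint in rotation.
func (r *roundRobin) pick() string {
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}

func main() {
	svc := &roundRobin{backends: []string{"10.244.1.5:80", "10.244.2.7:80"}}
	for i := 0; i < 3; i++ {
		fmt.Println("redirect to", svc.pick())
	}
}
```

Because the decision is made on the same node that received the traffic, there is no extra hop for intra-cluster clients.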
+
+## Challenges and future work
+
+### Docker API
+
+Right now, `docker inspect` doesn't show the networking configuration of the
+containers, since they derive it from another container. That information should
+be exposed somehow.
+
+### External IP assignment
+
+We want to be able to assign IP addresses externally from Docker
+[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need
+to statically allocate fixed-size IP ranges to each node, so that IP addresses
+can be made stable across pod infra container restarts
+([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
+pod migration. Right now, if the pod infra container dies, all the user
+containers must be stopped and restarted because the netns of the pod infra
+container will change on restart, and any subsequent user container restart
+will join that new netns, thereby not being able to see its peers.
+Additionally, a change in IP address would encounter DNS caching/TTL problems.
+External IP assignment would also simplify DNS support (see below).
+
+### IPv6
+
+IPv6 support would be nice but requires significant internal changes in a few
+areas. First, pods should be able to report multiple IP addresses
+[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398),
+and the network plugin architecture Kubernetes uses needs to allow returning
+IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245).
+Kubernetes code that deals with IP addresses must then be audited and fixed to
+support both IPv4 and IPv6 addresses and not assume IPv4.
+Additionally, direct IPv6 assignment to instances doesn't appear to be supported
+by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
+requests from people running Kubernetes on bare metal, though. 
:-) + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]() + diff --git a/design/nodeaffinity.md b/design/nodeaffinity.md new file mode 100644 index 00000000..61e04169 --- /dev/null +++ b/design/nodeaffinity.md @@ -0,0 +1,246 @@ +# Node affinity and NodeSelector + +## Introduction + +This document proposes a new label selector representation, called +`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit +more flexible and is intended to be used only for selecting nodes. + +In addition, we propose to replace the `map[string]string` in `PodSpec` that the +scheduler currently uses as part of restricting the set of nodes onto which a +pod is eligible to schedule, with a field of type `Affinity` that contains one +or more affinity specifications. In this document we discuss `NodeAffinity`, +which contains one or more of the following: +* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be +represented by a `NodeSelector`, and thus generalizes the scheduling behavior of +the current `map[string]string` but still serves the purpose of restricting +the set of nodes onto which the pod can schedule. In addition, unlike the +behavior of the current `map[string]string`, when it becomes violated the system +will try to eventually evict the pod from its node. +* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is +identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the +system may or may not try to eventually evict the pod from its node. +* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that +specifies which nodes are preferred for scheduling among those that meet all +scheduling requirements. + +(In practice, as discussed later, we will actually *add* the `Affinity` field +rather than replacing `map[string]string`, due to backward compatibility +requirements.) 
+
+The affinity specifications described above allow a pod to request various
+properties that are inherent to nodes, for example "run this pod on a node with
+an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
+([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
+some of the properties that a node might publish as labels, which affinity
+expressions can match against.) They do *not* allow a pod to request to schedule
+(or not schedule) on a node based on what other pods are running on the node.
+That feature is called "inter-pod topological affinity/anti-affinity" and is
+described [here](https://github.com/kubernetes/kubernetes/pull/18265).
+
+## API
+
+### NodeSelector
+
+```go
+// A node selector represents the union of the results of one or more label queries
+// over a set of nodes; that is, it represents the OR of the selectors represented
+// by the nodeSelectorTerms.
+type NodeSelector struct {
+  // nodeSelectorTerms is a list of node selector terms. The terms are ORed.
+  NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
+}
+
+// An empty node selector term matches all objects. A null node selector term
+// matches no objects.
+type NodeSelectorTerm struct {
+  // matchExpressions is a list of node selector requirements. The requirements are ANDed.
+  MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
+}
+
+// A node selector requirement is a selector that contains values, a key, and an operator
+// that relates the key and values.
+type NodeSelectorRequirement struct {
+  // key is the label key that the selector applies to.
+  Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+  // operator represents a key's relationship to a set of values.
+  // Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
+  Operator NodeSelectorOperator `json:"operator"`
+  // values is an array of string values. 
If the operator is In or NotIn, + // the values array must be non-empty. If the operator is Exists or DoesNotExist, + // the values array must be empty. If the operator is Gt or Lt, the values + // array must have a single element, which will be interpreted as an integer. + // This array is replaced during a strategic merge patch. + Values []string `json:"values,omitempty"` +} + +// A node selector operator is the set of operators that can be used in +// a node selector requirement. +type NodeSelectorOperator string + +const ( + NodeSelectorOpIn NodeSelectorOperator = "In" + NodeSelectorOpNotIn NodeSelectorOperator = "NotIn" + NodeSelectorOpExists NodeSelectorOperator = "Exists" + NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist" + NodeSelectorOpGt NodeSelectorOperator = "Gt" + NodeSelectorOpLt NodeSelectorOperator = "Lt" +) +``` + +### NodeAffinity + +We will add one field to `PodSpec` + +```go +Affinity *Affinity `json:"affinity,omitempty"` +``` + +The `Affinity` type is defined as follows + +```go +type Affinity struct { + NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"` +} + +type NodeAffinity struct { + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a node label update), + // the system will try to eventually evict the pod from its node. + RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. 
due to a node label update), + // the system may or may not try to eventually evict the pod from its node. + RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. + PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` +} + +// An empty preferred scheduling term matches all objects with implicit weight 0 +// (i.e. it's a no-op). A null preferred scheduling term matches no objects. +type PreferredSchedulingTerm struct { + // weight is in the range 1-100 + Weight int `json:"weight"` + // matchExpressions is a list of node selector requirements. The requirements are ANDed. + MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"` +} +``` + +Unfortunately, the name of the existing `map[string]string` field in PodSpec is +`NodeSelector` and we can't change it since this name is part of the API. +Hopefully this won't cause too much confusion. 
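The OR-of-ANDs semantics described above (terms ORed, requirements within a term ANDed) can be illustrated with a small, self-contained sketch. This is not the actual scheduler code; the `Gt`/`Lt` operators are omitted for brevity, and the reading of `NotIn` for a node that lacks the label is one plausible interpretation:

```go
package main

import "fmt"

// Requirement mirrors the shape of NodeSelectorRequirement for illustration.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", "DoesNotExist"
	Values   []string
}

// Term mirrors NodeSelectorTerm: its requirements are ANDed.
type Term struct {
	MatchExpressions []Requirement
}

// requirementMatches evaluates a single requirement against a node's labels.
func requirementMatches(r Requirement, labels map[string]string) bool {
	value, present := labels[r.Key]
	switch r.Operator {
	case "Exists":
		return present
	case "DoesNotExist":
		return !present
	case "In":
		if !present {
			return false
		}
		for _, v := range r.Values {
			if v == value {
				return true
			}
		}
		return false
	case "NotIn":
		// Assumption: a node without the label also satisfies NotIn.
		if !present {
			return true
		}
		for _, v := range r.Values {
			if v == value {
				return false
			}
		}
		return true
	}
	return false
}

// selectorMatches ORs the terms; each term ANDs its requirements.
func selectorMatches(terms []Term, labels map[string]string) bool {
	for _, t := range terms {
		all := true
		for _, r := range t.MatchExpressions {
			if !requirementMatches(r, labels) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	node := map[string]string{"cpu": "intel", "zone": "us-central1-a"}
	terms := []Term{{MatchExpressions: []Requirement{
		{Key: "cpu", Operator: "In", Values: []string{"intel", "amd"}},
		{Key: "zone", Operator: "Exists"},
	}}}
	fmt.Println(selectorMatches(terms, node)) // true
}
```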
+
+## Examples
+
+** TODO: fill in this section **
+
+* Run this pod on a node with an Intel or AMD CPU
+
+* Run this pod on a node in availability zone Z
+
+
+## Backward compatibility
+
+When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
+current field in PodSpec
+
+```go
+NodeSelector map[string]string `json:"nodeSelector,omitempty"`
+```
+
+Old versions of the scheduler will ignore the `Affinity` field. New versions of
+the scheduler will apply their scheduling predicates to both `Affinity` and
+`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
+of requirements. We will not attempt to convert between `Affinity` and
+`nodeSelector`.
+
+Old versions of non-scheduling clients will not know how to do anything
+semantically meaningful with `Affinity`, but we don't expect that this will
+cause a problem.
+
+See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
+for more discussion.
+
+Users should not start using `NodeAffinity` until the full implementation has
+been in Kubelet and the master for enough binary versions that we feel
+comfortable that we will not need to roll back either Kubelet or master to a
+version that does not support them. Longer-term we will use a programmatic
+approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
+
+## Implementation plan
+
+1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
+`PreferredDuringSchedulingIgnoredDuringExecution`, and
+`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
+2. Implement a scheduler predicate that takes
+`RequiredDuringSchedulingIgnoredDuringExecution` into account.
+3. Implement a scheduler priority function that takes
+`PreferredDuringSchedulingIgnoredDuringExecution` into account.
+4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
+marked as deprecated.
+5. 
Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
+6. Modify the scheduler predicate from step 2 to also take
+`RequiredDuringSchedulingRequiredDuringExecution` into account.
+7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
+decision.
+8. Implement code in Kubelet *or* the controllers that evicts a pod that no
+longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
+
+We assume Kubelet publishes labels describing the node's membership in all of
+the relevant scheduling domains (e.g. node name, rack name, availability zone
+name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
+
+## Extensibility
+
+The design described here is the result of careful analysis of use cases, a
+decade of experience with Borg at Google, and a review of similar features in
+other open-source container orchestration systems. We believe that it properly
+balances the goal of expressiveness against the goals of simplicity and
+efficiency of implementation. However, we recognize that use cases may arise in
+the future that cannot be expressed using the syntax described here. Although we
+are not implementing an affinity-specific extensibility mechanism for a variety
+of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
+for Kubernetes users to get a consistent experience, etc.), the regular
+Kubernetes annotation mechanism can be used to add or replace affinity rules.
+The way this would work is:
+
+1. Define one or more annotations to describe the new affinity rule(s)
+1. User (or an admission controller) attaches the annotation(s) to pods to
+request the desired scheduling behavior. 
If the new rule(s) *replace* one or +more fields of `Affinity` then the user would omit those fields from `Affinity`; +if they are *additional rules*, then the user would fill in `Affinity` as well +as the annotation(s). +1. Scheduler takes the annotation(s) into account when scheduling. + +If some particular new syntax becomes popular, we would consider upstreaming it +by integrating it into the standard `Affinity`. + +## Future work + +Are there any other fields we should convert from `map[string]string` to +`NodeSelector`? + +## Related issues + +The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261). + +The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341). +Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related. +Those issues reference other related issues. + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]() + diff --git a/design/persistent-storage.md b/design/persistent-storage.md new file mode 100644 index 00000000..70bcde97 --- /dev/null +++ b/design/persistent-storage.md @@ -0,0 +1,292 @@ +# Persistent Storage + +This document proposes a model for managing persistent, cluster-scoped storage +for applications requiring long lived data. + +### Abstract + +Two new API kinds: + +A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. +It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) +for how to use it. + +A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to +use in a pod. It is analogous to a pod. + +One new system component: + +`PersistentVolumeClaimBinder` is a singleton running in master that watches all +PersistentVolumeClaims in the system and binds them to the closest matching +available PersistentVolume. The volume manager watches the API for newly created +volumes to manage. 
+
+One new volume:
+
+`PersistentVolumeClaimVolumeSource` references the user's PVC in the same
+namespace. This volume finds the bound PV and mounts that volume for the pod. A
+`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another
+type of volume that is owned by someone else (the system).
+
+Kubernetes makes no guarantees at runtime that the underlying storage exists or
+is available. High availability is left to the storage provider.
+
+### Goals
+
+* Allow administrators to describe available storage.
+* Allow pod authors to discover and request persistent volumes to use with pods.
+* Enforce security through access control lists and securing storage to the same
+namespace as the pod volume.
+* Enforce quotas through admission control.
+* Enforce scheduler rules by resource counting.
+* Ensure developers can rely on storage being available without being closely
+bound to a particular disk, server, network, or storage device.
+
+#### Describe available storage
+
+Cluster administrators use the API to manage *PersistentVolumes*. A custom store
+`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
+storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
+storage and binds them to an available volume by matching the volume's
+characteristics (AccessModes and storage size) to the user's request.
+
+PVs are system objects and, thus, have no namespace.
+
+Many means of dynamic provisioning will eventually be implemented for various
+storage types. 
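The matching behavior described above (filter by access modes, then choose by capacity) can be sketched as follows. The types are illustrative stand-ins, not the real `NewPersistentVolumeOrderedIndex` or API objects:

```go
package main

import (
	"fmt"
	"sort"
)

// volume is an illustrative stand-in for a PersistentVolume.
type volume struct {
	name     string
	modes    map[string]bool // e.g. "ReadWriteOnce" -> true
	capacity int64           // bytes
	bound    bool
}

// matchClaim filters out bound or too-small volumes and volumes whose access
// modes do not cover the claim's request, then returns the smallest volume
// that still satisfies the request, to minimize wasted capacity.
func matchClaim(volumes []volume, wantModes []string, wantCapacity int64) (string, bool) {
	var candidates []volume
	for _, v := range volumes {
		if v.bound || v.capacity < wantCapacity {
			continue
		}
		covered := true
		for _, m := range wantModes {
			if !v.modes[m] {
				covered = false
				break
			}
		}
		if covered {
			candidates = append(candidates, v)
		}
	}
	if len(candidates) == 0 {
		return "", false // the claim goes unfulfilled for now
	}
	// Prefer the smallest adequate volume.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].capacity < candidates[j].capacity
	})
	return candidates[0].name, true
}

func main() {
	rwo := map[string]bool{"ReadWriteOnce": true}
	pvs := []volume{
		{name: "pv-big", modes: rwo, capacity: 10 << 30},
		{name: "pv-small", modes: rwo, capacity: 5 << 30},
	}
	name, ok := matchClaim(pvs, []string{"ReadWriteOnce"}, 3<<30)
	fmt.Println(name, ok) // pv-small true
}
```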
+
+
+##### PersistentVolume API
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
+| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
+| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
+| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
+| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
+| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
+
+
+#### Request Storage
+
+Kubernetes users request persistent storage for their pod by creating a
+```PersistentVolumeClaim```. Their request for storage is described by their
+requirements for resources and mount capabilities.
+
+Requests for volumes are bound to available volumes by the volume manager, if a
+suitable match is found. Requests for resources can go unfulfilled.
+
+Users attach their claim to their pod using a new
+```PersistentVolumeClaimVolumeSource``` volume source. 
+
+
+##### PersistentVolumeClaim API
+
+
+| Action | HTTP Verb | Path | Description |
+| ---- | ---- | ---- | ---- |
+| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
+| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
+| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
+| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
+
+
+
+#### Scheduling constraints
+
+Scheduling constraints are to be handled similarly to pod resource constraints.
+Pods will need to be annotated or decorated with the number of resources they
+require on a node. Similarly, a node will need to list how many it has used or
+available.
+
+TBD
+
+
+#### Events
+
+The implementation of persistent storage will not require events to communicate
+to the user the state of their claim. The CLI for bound claims contains a
+reference to the backing persistent volume. This is always present in the API
+and CLI, making an event to communicate the same unnecessary.
+
+Events that communicate the state of a mounted volume are left to the volume
+plugins.
+
+### Example
+
+#### Admin provisions storage
+
+An administrator provisions storage by posting PVs to the API. Various ways to
+automate this task can be scripted. Dynamic provisioning is a future feature
+that can maintain levels of PVs. 
+
+```yaml
+POST:
+
+kind: PersistentVolume
+apiVersion: v1
+metadata:
+  name: pv0001
+spec:
+  capacity:
+    storage: 10
+  persistentDisk:
+    pdName: "abc123"
+    fsType: "ext4"
+```
+
+```console
+$ kubectl get pv
+
+NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
+pv0001 map[] 10737418240 RWO Pending
+```
+
+#### Users request storage
+
+A user requests storage by posting a PVC to the API. Their request contains the
+AccessModes they wish their volume to have and the minimum size needed.
+
+The user must be within a namespace to create PVCs.
+
+```yaml
+POST:
+
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: myclaim-1
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 3
+```
+
+```console
+$ kubectl get pvc
+
+NAME LABELS STATUS VOLUME
+myclaim-1 map[] pending
+```
+
+
+#### Matching and binding
+
+The ```PersistentVolumeClaimBinder``` attempts to find an available volume that
+most closely matches the user's request. If one exists, they are bound by
+putting a reference on the PV to the PVC. Requests can go unfulfilled if a
+suitable match is not found.
+
+```console
+$ kubectl get pv
+
+NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
+pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
+
+
+kubectl get pvc
+
+NAME LABELS STATUS VOLUME
+myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
+```
+
+A claim must request access modes and storage capacity. This is because internally PVs are
+indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity.
+A claim may request one or more of the following attributes to better match a PV: volume name, selectors,
+and volume class (currently implemented as an annotation).
+
+A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which
+PVC it will match. 
+
+A PV may also define labels, annotations, and a volume class (currently implemented as an
+annotation) to better target PVCs.
+
+As of Kubernetes version 1.4, the following algorithm describes in more detail how a claim is
+matched to a PV:
+
+1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered.
+"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines
+the mode requested by the claim.
+
+1. The potential PVs above are considered in order of the closest access mode match, with the best case
+being an exact match, and a worse case being more modes than requested by the claim.
+
+1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity
+is not less than the storage being requested by the claim, then this PV will bind to the claim. Done.
+
+1. Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is
+skipped and will be handled by Dynamic Provisioning.
+
+1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a
+placeholder, then the PV is skipped.
+
+1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the
+PV is skipped. But, even if a claim has selectors which match a PV, that does not guarantee a match
+since capacities may differ.
+
+1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder
+for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped.
+If the annotations for the PV and PVC are empty they are treated as being equal.
+
+1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs,
+the PV with the smallest capacity that is also equal to or greater than the claim's requested storage
+is the matching PV and will be bound to the claim. Done. 
In the case of two or more PVs matching all
+of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner.
+
+*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default
+`StorageClass` has been defined) then a volume will be dynamically provisioned.
+
+#### Claim usage
+
+The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim
+and mount its volume for a pod.
+
+The claim holder owns the claim and its data for as long as the claim exists.
+The pod using the claim can be deleted, but the claim remains in the user's
+namespace. It can be used again and again by many pods.
+
+```yaml
+POST:
+
+kind: Pod
+apiVersion: v1
+metadata:
+  name: mypod
+spec:
+  containers:
+    - image: nginx
+      name: myfrontend
+      volumeMounts:
+      - mountPath: "/var/www/html"
+        name: mypd
+  volumes:
+    - name: mypd
+      source:
+        persistentVolumeClaim:
+          accessMode: ReadWriteOnce
+          claimRef:
+            name: myclaim-1
+```
+
+#### Releasing a claim and Recycling a volume
+
+When a claim holder is finished with their data, they can delete their claim.
+
+```console
+$ kubectl delete pvc myclaim-1
+```
+
+The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
+reference from the PV and changing the PV's status to 'Released'.
+
+Admins can script the recycling of released volumes. Future dynamic provisioners
+will understand how a volume should be recycled.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
+
diff --git a/design/podaffinity.md b/design/podaffinity.md
new file mode 100644
index 00000000..9291b8b9
--- /dev/null
+++ b/design/podaffinity.md
@@ -0,0 +1,673 @@
+# Inter-pod topological affinity and anti-affinity
+
+## Introduction
+
+NOTE: It is useful to read about [node affinity](nodeaffinity.md) first. 
+ +This document describes a proposal for specifying and implementing inter-pod +topological affinity and anti-affinity. By that we mean: rules that specify that +certain pods should be placed in the same topological domain (e.g. same node, +same rack, same zone, same power domain, etc.) as some other pods, or, +conversely, should *not* be placed in the same topological domain as some other +pods. + +Here are a few example rules; we explain how to express them using the API +described in this doc later, in the section "Examples." +* Affinity + * Co-locate the pods from a particular service or Job in the same availability +zone, without specifying which zone that should be. + * Co-locate the pods from service S1 with pods from service S2 because S1 uses +S2 and thus it is useful to minimize the network latency between them. +Co-location might mean same nodes and/or same availability zone. +* Anti-affinity + * Spread the pods of a service across nodes and/or availability zones, e.g. to +reduce correlated failures. + * Give a pod "exclusive" access to a node to guarantee resource isolation -- +it must never share the node with other pods. + * Don't schedule the pods of a particular service on the same nodes as pods of +another service that are known to interfere with the performance of the pods of +the first service. + +For both affinity and anti-affinity, there are three variants. Two variants have +the property of requiring the affinity/anti-affinity to be satisfied for the pod +to be allowed to schedule onto a node; the difference between them is that if +the condition ceases to be met later on at runtime, for one of them the system +will try to eventually evict the pod, while for the other the system may not try +to do so. The third variant simply provides scheduling-time *hints* that the +scheduler will try to satisfy but may not be able to. These three variants are +directly analogous to the three variants of [node affinity](nodeaffinity.md). 
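To make "topological domain" concrete, here is a small sketch of evaluating a required affinity term with `TopologyKey` "zone": the new pod may only land in zones that already run at least one pod matching the term's selector. This is an illustration, not scheduler code; it assumes an equality-based selector and made-up pod and zone names:

```go
package main

import "fmt"

// existingPod pairs a running pod's labels with the zone of its node
// (i.e. the value of the node's "zone" topology label).
type existingPod struct {
	labels map[string]string
	zone   string
}

// eligibleZones returns the zones containing at least one pod whose labels
// include all of the selector's key/value pairs.
func eligibleZones(pods []existingPod, selector map[string]string) map[string]bool {
	zones := map[string]bool{}
	for _, p := range pods {
		matches := true
		for k, v := range selector {
			if p.labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			zones[p.zone] = true
		}
	}
	return zones
}

func main() {
	pods := []existingPod{
		{labels: map[string]string{"service": "S2"}, zone: "zone-a"},
		{labels: map[string]string{"service": "S3"}, zone: "zone-b"},
	}
	// Required affinity: co-locate with pods of service S2, at zone granularity.
	fmt.Println(eligibleZones(pods, map[string]string{"service": "S2"}))
}
```

Anti-affinity inverts the result: the matching zones (or nodes, racks, etc.) become the ones to avoid.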
+ +Note that this proposal is only about *inter-pod* topological affinity and +anti-affinity. There are other forms of topological affinity and anti-affinity. +For example, you can use [node affinity](nodeaffinity.md) to require (prefer) +that a set of pods all be scheduled in some specific zone Z. Node affinity is +not capable of expressing inter-pod dependencies, and conversely the API we +describe in this document is not capable of expressing node affinity rules. For +simplicity, we will use the terms "affinity" and "anti-affinity" to mean +"inter-pod topological affinity" and "inter-pod topological anti-affinity," +respectively, in the remainder of this document. + +## API + +We will add one field to `PodSpec` + +```go +Affinity *Affinity `json:"affinity,omitempty"` +``` + +The `Affinity` type is defined as follows + +```go +type Affinity struct { + PodAffinity *PodAffinity `json:"podAffinity,omitempty"` + PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"` +} + +type PodAffinity struct { + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system will try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. 
due to a pod label update), the + // system may or may not try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. + PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` +} + +type PodAntiAffinity struct { + // If the anti-affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the anti-affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system will try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. 
+ RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` + // If the anti-affinity requirements specified by this field are not met at + // scheduling time, the pod will not be scheduled onto the node. + // If the anti-affinity requirements specified by this field cease to be met + // at some point during pod execution (e.g. due to a pod label update), the + // system may or may not try to eventually evict the pod from its node. + // When there are multiple elements, the lists of nodes corresponding to each + // PodAffinityTerm are intersected, i.e. all terms must be satisfied. + RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` + // The scheduler will prefer to schedule pods to nodes that satisfy + // the anti-affinity expressions specified by this field, but it may choose + // a node that violates one or more of the expressions. The node that is + // most preferred is the one with the greatest sum of weights, i.e. + // for each node that meets all of the scheduling requirements (resource + // request, RequiredDuringScheduling anti-affinity expressions, etc.), + // compute a sum by iterating through the elements of this field and adding + // "weight" to the sum if the node matches the corresponding MatchExpressions; the + // node(s) with the highest sum are the most preferred. 
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
    // weight is in the range 1-100
    Weight int `json:"weight"`
    PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"`
}

type PodAffinityTerm struct {
    LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
    // namespaces specifies which namespaces the LabelSelector applies to (matches against);
    // nil list means "this pod's namespace," empty list means "all namespaces"
    // The json tag here is not "omitempty" since we need to distinguish nil and empty.
    // See https://golang.org/pkg/encoding/json/#Marshal for more details.
    Namespaces []api.Namespace `json:"namespaces"`
    // empty topology key is interpreted by the scheduler as "all topologies"
    TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because a normal `LabelSelector`
is scoped to the pod's namespace, but we need to be able to match against all
pods globally.

To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):

```go
PodAffinity {
    RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
    PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
}
PodAntiAffinity {
    RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
    PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
+(Assumes all nodes have a label with key `zone` and value specifying their +zone.) +* Cannot schedule P onto any racks that are running pods that satisfy `P3`. +(Assumes all nodes have a label with key `rack` and value specifying their rack +name.) +* Should try not to schedule P onto any power domains that are running pods that +satisfy `P4`. (Assumes all nodes have a label with key `power` and value +specifying their power domain.) + +When `RequiredDuringScheduling` has multiple elements, the requirements are +ANDed. For `PreferredDuringScheduling` the weights are added for the terms that +are satisfied for each node, and the node(s) with the highest weight(s) are the +most preferred. + +In reality there are two variants of `RequiredDuringScheduling`: one suffixed +with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. +For the first variant, if the affinity/anti-affinity ceases to be met at some +point during pod execution (e.g. due to a pod label update), the system will try +to eventually evict the pod from its node. In the second variant, the system may +or may not try to eventually evict the pod from its node. + +## A comment on symmetry + +One thing that makes affinity and anti-affinity tricky is symmetry. + +Imagine a cluster that is running pods from two services, S1 and S2. Imagine +that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not +run me on nodes that are running pods from S2." It is not sufficient just to +check that there are no S2 pods on a node when you are scheduling a S1 pod. You +also need to ensure that there are no S1 pods on a node when you are scheduling +a S2 pod, *even though the S2 pod does not have any anti-affinity rules*. +Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's +RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving +S2 pod. 
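
This two-direction check can be sketched as follows. This is a simplified model, not the actual scheduler code: the `Pod` type, `matches`, and `feasibleOnNode` names are hypothetical, selectors are reduced to exact label matching, and only node-level topology is considered.

```go
package main

import "fmt"

// Pod is a pared-down stand-in for the real API object: just its labels plus
// the (simplified) selectors of its RequiredDuringScheduling anti-affinity terms.
type Pod struct {
	Labels       map[string]string
	AntiAffinity []map[string]string // each entry: labels the pod refuses to share a node with
}

// matches reports whether labels satisfy a simplified exact-match selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// feasibleOnNode applies node-level anti-affinity in BOTH directions:
// the incoming pod must not object to any resident pod, and (symmetry)
// no resident pod may object to the incoming pod.
func feasibleOnNode(incoming Pod, residents []Pod) bool {
	for _, r := range residents {
		for _, sel := range incoming.AntiAffinity {
			if matches(sel, r.Labels) {
				return false // incoming pod's own rule violated
			}
		}
		for _, sel := range r.AntiAffinity {
			if matches(sel, incoming.Labels) {
				return false // symmetric direction: a resident's rule violated
			}
		}
	}
	return true
}

func main() {
	s1 := Pod{
		Labels:       map[string]string{"service": "S1"},
		AntiAffinity: []map[string]string{{"service": "S2"}}, // "do not run me with S2"
	}
	s2 := Pod{Labels: map[string]string{"service": "S2"}} // no rules of its own

	fmt.Println(feasibleOnNode(s2, []Pod{s1})) // false: S1's rule keeps S2 off the node
	fmt.Println(feasibleOnNode(s2, nil))       // true: empty node is feasible
}
```

Running the sketch shows an S2 pod kept off a node purely by the resident S1 pod's rule, mirroring the S1/S2 scenario above.
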
More specifically, if S1 has the aforementioned RequiredDuringScheduling +anti-affinity rule, then: +* if a node is empty, you can schedule S1 or S2 onto the node +* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node + +Note that while RequiredDuringScheduling anti-affinity is symmetric, +RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 +have a RequiredDuringScheduling affinity rule "run me on nodes that are running +pods from S2," it is not required that there be S1 pods on a node in order to +schedule a S2 pod onto that node. More specifically, if S1 has the +aforementioned RequiredDuringScheduling affinity rule, then: +* if a node is empty, you can schedule S2 onto the node +* if a node is empty, you cannot schedule S1 onto the node +* if a node is running S2, you can schedule S1 onto the node +* if a node is running S1+S2 and S1 terminates, S2 continues running +* if a node is running S1+S2 and S2 terminates, the system terminates S1 +(eventually) + +However, although RequiredDuringScheduling affinity is not symmetric, there is +an implicit PreferredDuringScheduling affinity rule corresponding to every +RequiredDuringScheduling affinity rule: if the pods of S1 have a +RequiredDuringScheduling affinity rule "run me on nodes that are running pods +from S2" then it is not required that there be S1 pods on a node in order to +schedule a S2 pod onto that node, but it would be better if there are. + +PreferredDuringScheduling is symmetric. If the pods of S1 had a +PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that +are running pods from S2" then we would prefer to keep a S1 pod that we are +scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that +we are scheduling off of nodes that are running S1 pods. 
Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.

## Examples

Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling time and ignore execution time. Finally,
some of the examples use "zone" and some use "node," just to make the examples
more interesting; any of the examples with "zone" will also work for "node" if
you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here.
For this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`

This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler.
The RequiredDuringScheduling rule +`{LabelSelector: "service" In "S", TopologyKey: "zone"}` +only "works" once one pod from service S has been scheduled. But if all pods in +service S have this RequiredDuringScheduling rule in their PodSpec, then the +RequiredDuringScheduling rule will block the first pod of the service from ever +scheduling, since it is only allowed to run in a zone with another pod from the +same service. And of course that means none of the pods of the service will be +able to schedule. This problem *only* applies to RequiredDuringScheduling +affinity, not PreferredDuringScheduling affinity or any variant of +anti-affinity. There are at least three ways to solve this problem: +* **short-term**: have the scheduler use a rule that if the +RequiredDuringScheduling affinity requirement matches a pod's own labels, and +there are no other such pods anywhere, then disregard the requirement. This +approach has a corner case when running parallel schedulers that are allowed to +schedule pods from the same replicated set (e.g. a single PodTemplate): both +schedulers may try to schedule pods from the set at the same time and think +there are no other pods from that set scheduled yet (e.g. they are trying to +schedule the first two pods from the set), but by the time the second binding is +committed, the first one has already been committed, leaving you with two pods +running that do not respect their RequiredDuringScheduling affinity. There is no +simple way to detect this "conflict" at scheduling time given the current system +implementation. +* **longer-term**: when a controller creates pods from a PodTemplate, for +exactly *one* of those pods, it should omit any RequiredDuringScheduling +affinity rules that select the pods of that PodTemplate. +* **very long-term/speculative**: controllers could present the scheduler with a +group of pods from the same PodTemplate as a single unit. 
This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.

### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Cluster Federation](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
`{LabelSelector: <selector that matches evilService's pods>, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods including pods of this service**:
`{LabelSelector: empty, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."

## Algorithm

An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned with scheduling time, and thus makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringScheduling pod affinity" and
"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1.
**

```
Z = the pod you are scheduling
{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z
// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
    P = {all pods in the system that match H.LabelSelector}
    M map[string]int // topology value -> number of pods running on nodes with that topology value
    foreach pod Q of {P}
        L = {labels of the node on which Q is running, represented as a map from label key to label value}
        M[L[H.TopologyKey]]++
    {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
    P = {all pods in the system that match H.LabelSelector}
    M map[string]int // topology value -> number of pods running on nodes with that topology value
    foreach pod Q of {P}
        L = {labels of the node on which Q is running, represented as a map from label key to label value}
        M[L[H.TopologyKey]]++
    {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
    foreach pod B that is bound to A
        if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
// At this point, all nodes in {N} are feasible for Z.
+// Step 3a: Soft version of Step 1a +Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node +Initialize the keys of Y to all of the nodes in {N}, and the values to 0 +X = {Z's PodSpec's SoftPodAffinity} +Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" +// Step 3b: Soft version of Step 1b +X = {Z's PodSpec's SoftPodAntiAffinity} +Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" +// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft +foreach node A of {N} + foreach pod B that is bound to A + increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A +// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is +// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with +// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better). +``` + +## Special considerations for RequiredDuringScheduling anti-affinity + +In this section we discuss three issues with RequiredDuringScheduling +anti-affinity: Denial of Service (DoS), co-existing with daemons, and +determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265) +for additional discussion of these topics. + +### Denial of Service + +Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity +can intentionally or unintentionally cause various problems for other pods, due +to the symmetry property of anti-affinity. 
+ +The most notable danger is the ability for a pod that arrives first to some +topology domain, to block all other pods from scheduling there by stating a +conflict with all other pods. The standard approach to preventing resource +hogging is quota, but simple resource quota cannot prevent this scenario because +the pod may request very little resources. Addressing this using quota requires +a quota scheme that charges based on "opportunity cost" rather than based simply +on requested resources. For example, when handling a pod that expresses +RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` +(i.e. exclusive access to a node), it could charge for the resources of the +average or largest node in the cluster. Likewise if a pod expresses +RequiredDuringScheduling anti-affinity for all pods using a "cluster" +`TopologyKey`, it could charge for the resources of the entire cluster. If node +affinity is used to constrain the pod to a particular topology domain, then the +admission-time quota charging should take that into account (e.g. not charge for +the average/largest machine if the PodSpec constrains the pod to a specific +machine with a known size; instead charge for the size of the actual machine +that the pod was constrained to). In all cases once the pod is scheduled, the +quota charge should be adjusted down to the actual amount of resources allocated +(e.g. the size of the actual machine that was assigned, not the +average/largest). If a cluster administrator wants to overcommit quota, for +example to allow more than N pods across all users to request exclusive node +access in a cluster with N nodes, then a priority/preemption scheme should be +added so that the most important pods run when resource demand exceeds supply. + +An alternative approach, which is a bit of a blunt hammer, is to use a +capability mechanism to restrict use of RequiredDuringScheduling anti-affinity +to trusted users. 
A more complex capability mechanism might only restrict it +when using a non-"node" TopologyKey. + +Our initial implementation will use a variant of the capability approach, which +requires no configuration: we will simply reject ALL requests, regardless of +user, that specify "all namespaces" with non-"node" TopologyKey for +RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use +case while prohibiting the more dangerous ones. + +A weaker variant of the problem described in the previous paragraph is a pod's +ability to use anti-affinity to degrade the scheduling quality of another pod, +but not completely block it from scheduling. For example, a set of pods S1 could +use node affinity to request to schedule onto a set of nodes that some other set +of pods S2 prefers to schedule onto. If the pods in S1 have +RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for +S2, then due to the symmetry property of anti-affinity, they can prevent the +pods in S2 from scheduling onto their preferred nodes if they arrive first (for +sure in the RequiredDuringScheduling case, and with some probability that +depends on the weighting scheme for the PreferredDuringScheduling case). A very +sophisticated priority and/or quota scheme could mitigate this, or alternatively +we could eliminate the symmetry property of the implementation of +PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling +anti-affinity could affect scheduling quality of another pod, and as we +described in the previous paragraph, such pods could be charged quota for the +full topology domain, thereby reducing the potential for abuse. + +We won't try to address this issue in our initial implementation; we can +consider one of the approaches mentioned above if it turns out to be a problem +in practice. 
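
The rejection rule of the initial implementation described above can be sketched as a small admission check. The `AntiAffinityTerm` type and `admit` function below are hypothetical names, not the actual admission-controller code; the namespace semantics follow the nil-vs-empty convention of the `Namespaces` field.

```go
package main

import "fmt"

// AntiAffinityTerm is a pared-down stand-in for a PodAffinityTerm used in
// RequiredDuringScheduling anti-affinity.
type AntiAffinityTerm struct {
	Namespaces  []string // nil = "this pod's namespace"; non-nil empty = "all namespaces"
	TopologyKey string
}

// admit enforces the rule from the text: reject any RequiredDuringScheduling
// anti-affinity term that matches all namespaces with a topology broader than
// a single node, while still allowing the "exclusive node" use case.
func admit(terms []AntiAffinityTerm) error {
	for _, t := range terms {
		allNamespaces := t.Namespaces != nil && len(t.Namespaces) == 0
		if allNamespaces && t.TopologyKey != "node" {
			return fmt.Errorf("anti-affinity across all namespaces is only allowed with TopologyKey \"node\", got %q", t.TopologyKey)
		}
	}
	return nil
}

func main() {
	// "Exclusive node" request: admitted.
	err := admit([]AntiAffinityTerm{{Namespaces: []string{}, TopologyKey: "node"}})
	fmt.Println(err == nil) // true
	// Claiming a whole zone against all namespaces: rejected.
	err = admit([]AntiAffinityTerm{{Namespaces: []string{}, TopologyKey: "zone"}})
	fmt.Println(err == nil) // false
}
```

A term with a nil namespace list is scoped to the pod's own namespace and passes regardless of topology key, which is why the check must distinguish nil from empty.
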

### Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity
against all pods to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263)
for a more detailed explanation of the toleration concept).
There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the
  sense that it ignores pods in those namespaces when it is
  determining feasibility for pods with anti-affinity. The name(s) of
  the special namespace(s) could be a scheduler configuration
  parameter, and default to `kube-system`. We could allow
  multiple namespaces to be specified if we want cluster admins to be
  able to give their own daemons this special power (they would add
  their namespace to the list in the scheduler configuration). And of
  course this would be symmetric, so daemons could schedule onto a node
  that is already running a pod with anti-affinity.

* We could add an explicit "toleration" concept/field to allow the
  user to specify namespaces that are excluded when they use
  RequiredDuringScheduling anti-affinity, and use an admission
  controller/defaulter to ensure these namespaces are always listed.

Our initial implementation will use the first approach.

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules.
In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stems from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation of an anti-affinity rule.) However, affinity does have a
different question related to killing: how long should the system wait before
declaring that RequiredDuringSchedulingRequiredDuringExecution affinity is no
longer met at runtime?
For example, if a pod P has such an affinity for a pod Q and pod Q +is temporarily killed so that it can be updated to a new binary version, should +that trigger killing of P? More generally, how long should the system wait +before declaring that P's affinity is violated? (Of course affinity is expressed +in terms of label selectors, not for a specific pod, but the scenario is easier +to describe using a concrete pod.) This is closely related to the concept of +forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)). +In theory we could make this time duration be configurable by the user on a per-pod +basis, but for the first version of this feature we will make it a configurable +property of whichever component does the killing and that applies across all pods +using the feature. Making it configurable by the user would require a nontrivial +change to the API syntax (since the field would only apply to +RequiredDuringSchedulingRequiredDuringExecution affinity). + +## Implementation plan + +1. Add the `Affinity` field to PodSpec and the `PodAffinity` and +`PodAntiAffinity` types to the API along with all of their descendant types. +2. Implement a scheduler predicate that takes +`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into +account. Include a workaround for the issue described at the end of the Affinity +section of the Examples section (can't schedule first pod). +3. Implement a scheduler priority function that takes +`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity +into account. +4. Implement admission controller that rejects requests that specify "all +namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` +anti-affinity. This admission controller should be enabled by default. +5. Implement the recommended solution to the "co-existing with daemons" issue +6. At this point, the feature can be deployed. +7. 
Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If it is
implemented in Kubelet, then only for "node" `TopologyKey`; if in a controller,
then potentially for all `TopologyKey`s (see
[this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.

We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).

## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support them. Longer-term we will use a programmatic approach to
enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).

## Extensibility

The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems.
We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.

If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.

## Future work and non-work

One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.
+ +It is likely that the specification and implementation of pod anti-affinity +can be unified with [taints and tolerations](taint-toleration-dedicated.md), +and likewise that the specification and implementation of pod affinity +can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod +labels would be "inherited" by the node, and pods would only be able to specify +affinity and anti-affinity for a node's labels. Our main motivation for not +unifying taints and tolerations with pod anti-affinity is that we foresee taints +and tolerations as being a concept that only cluster administrators need to +understand (and indeed in some setups taints and tolerations wouldn't even be +directly manipulated by a cluster administrator, instead they would only be set +by an admission controller that is implementing the administrator's high-level +policy about different classes of special machines and the users who belong to +the groups allowed to access them). Moreover, the concept of nodes "inheriting" +labels from pods seems complicated; it seems conceptually simpler to separate +rules involving relatively static properties of nodes from rules involving which +other pods are running on the same node or larger topology domain. + +Data/storage affinity is related to pod affinity, and is likely to draw on some +of the ideas we have used for pod affinity. Today, data/storage affinity is +expressed using node affinity, on the assumption that the pod knows which +node(s) store(s) the data it wants. But a more flexible approach would allow the +pod to name the data rather than the node. + +## Related issues + +The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265). + +The topic of affinity/anti-affinity has generated a lot of discussion. 
The main +issue is [#367](https://github.com/kubernetes/kubernetes/issues/367) +but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485), +[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369), +[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707), +[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341), +[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906) +all have additional discussion and use cases. + +As the examples in this document have demonstrated, topological affinity is very +useful in clusters that are spread across availability zones, e.g. to co-locate +pods of a service in the same zone to avoid a wide-area network hop, or to +spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059), +[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063), +and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant. + +Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related. + +This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816). + +## Related work + +** TODO: cite references ** + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]() + diff --git a/design/principles.md b/design/principles.md new file mode 100644 index 00000000..4e0b663c --- /dev/null +++ b/design/principles.md @@ -0,0 +1,101 @@ +# Design Principles + +Principles to follow when extending Kubernetes. 
+
+## API
+
+See also the [API conventions](../devel/api-conventions.md).
+
+* All APIs should be declarative.
+* API objects should be complementary and composable, not opaque wrappers.
+* The control plane should be transparent -- there are no hidden internal APIs.
+* The cost of API operations should be proportional to the number of objects
+intentionally operated upon. Therefore, common filtered lookups must be indexed.
+Beware of patterns of multiple API calls that would incur quadratic behavior.
+* Object status must be 100% reconstructable by observation. Any history kept
+must be just an optimization and not required for correct operation.
+* Cluster-wide invariants are difficult to enforce correctly. Try not to add
+them. If you must have them, don't enforce them atomically in master components;
+that is contention-prone and doesn't provide a recovery path in the case of a
+bug allowing the invariant to be violated. Instead, provide a series of checks
+to reduce the probability of a violation, and make every component involved able
+to recover from an invariant violation.
+* Low-level APIs should be designed for control by higher-level systems.
+Higher-level APIs should be intent-oriented (think SLOs) rather than
+implementation-oriented (think control knobs).
+
+## Control logic
+
+* Functionality must be *level-based*, meaning the system must operate correctly
+given the desired state and the current/observed state, regardless of how many
+intermediate state updates may have been missed. Edge-triggered behavior must be
+just an optimization.
+* Assume an open world: continually verify assumptions and gracefully adapt to
+external events and/or actors. Example: we allow users to kill pods under
+control of a replication controller; it just replaces them.
+* Do not define comprehensive state machines for objects with behaviors
+associated with state transitions and/or "assumed" states that cannot be
+ascertained by observation.
+
+* Don't assume a component's decisions will not be overridden or rejected, nor
+that the component will always understand why. For example, etcd may reject writes.
+Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry,
+but back off and/or make alternative decisions.
+* Components should be self-healing. For example, if you must keep some state
+(e.g., a cache), the content needs to be periodically refreshed, so that if an item
+does get erroneously stored or a deletion event is missed, etc., it will soon be
+fixed, ideally on timescales that are shorter than what will attract attention
+from humans.
+* Component behavior should degrade gracefully. Prioritize actions so that the
+most important activities can continue to function even when overloaded and/or
+in states of partial failure.
+
+## Architecture
+
+* Only the apiserver should communicate with etcd/store, and not other
+components (scheduler, kubelet, etc.).
+* Compromising a single node shouldn't compromise the cluster.
+* Components should continue to do what they were last told in the absence of
+new instructions (e.g., due to network partition or component outage).
+* All components should keep all relevant state in memory all the time. The
+apiserver should write through to etcd/store, other components should write
+through to the apiserver, and they should watch for updates made by other
+clients.
+* Watch is preferred over polling.
+
+## Extensibility
+
+TODO: pluggability
+
+## Bootstrapping
+
+* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
+* Minimize the number of dependencies, particularly those required for
+steady-state operation.
+* Stratify the dependencies that remain via principled layering.
+* Break any circular dependencies by converting hard dependencies to soft
+dependencies.
+
+  * Also accept data from other components from another source, such as
+local files, which can be manually populated at bootstrap time and then
+continuously updated once those other components are available.
+  * State should be rediscoverable and/or reconstructable.
+  * Make it easy to run temporary, bootstrap instances of all components in
+order to create the runtime state needed to run the components in the steady
+state; use a lock (master election for distributed components, file lock for
+local components like Kubelet) to coordinate handoff. We call this technique
+"pivoting".
+  * Have a solution to restart dead components. For distributed components,
+replication works well. For local components such as Kubelet, a process manager
+or even a simple shell loop works.
+
+## Availability
+
+TODO
+
+## General principles
+
+* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]()
+
diff --git a/design/resource-qos.md b/design/resource-qos.md
new file mode 100644
index 00000000..cfbe4faf
--- /dev/null
+++ b/design/resource-qos.md
@@ -0,0 +1,218 @@
+# Resource Quality of Service in Kubernetes
+
+**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar)
+**Last Updated**: 5/17/2016
+
+**Status**: Implemented
+
+*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.*
+
+## Introduction
+
+This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*.
+Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
+
+Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit, which is the maximum amount that the system will allow the container to use.
+The system computes pod level requests and limits by summing up per-resource requests and limits across all containers.
+When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers.
+This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees.
+Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
+
+## Requests and Limits
+
+For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](../proposals/node-allocatable.md) & `request <= limit <= Infinity`.
+If a pod is successfully scheduled, the container is guaranteed the amount of resources requested.
+Scheduling is based on `requests` and not `limits`.
+The pod and its containers will not be allowed to exceed the specified limit.
+How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
+
+### Compressible Resource Guarantees
+
+- For now, we are only supporting CPU.
+- Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
+- Excess CPU resources will be distributed based on the amount of CPU requested.
For example, suppose container A requests 600 milli CPUs, and container B requests 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 100 milli CPUs (assuming a single-core node) will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
+- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
+
+### Incompressible Resource Guarantees
+
+- For now, we are only supporting memory.
+- Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
+- When Pods use more memory than their limit, the process using the most memory inside one of the pod's containers will be killed by the kernel.
+
+### Admission/Scheduling Policy
+
+- Pods will be admitted by Kubelet & scheduled by the scheduler based on the sum of requests of its containers. The scheduler & kubelet will ensure that the sum of requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU).
+
+## QoS Classes
+
+In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
+
+The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container.
Hypothetically, users could use a (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
+
+Pods can be of one of 3 different classes:
+
+- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
+
+Examples:
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 100Mi
+```
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+        memory: 100Mi
+      requests:
+        cpu: 100m
+        memory: 100Mi
+```
+
+- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
+When `limits` are not specified, they default to the node capacity.
+
+Examples:
+
+Container `bar` has no resources specified.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        cpu: 10m
+        memory: 1Gi
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+```
+
+Containers `foo` and `bar` have limits set for different resources.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      limits:
+        memory: 1Gi
+
+  name: bar
+    resources:
+      limits:
+        cpu: 100m
+```
+
+Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
+
+```yaml
+containers:
+  name: foo
+    resources:
+      requests:
+        cpu: 10m
+        memory: 1Gi
+
+  name: bar
+```
+
+- If `requests` and `limits` are not set for all of the resources, across all containers, then the pod is classified as **Best-Effort**.
+
+Examples:
+
+```yaml
+containers:
+  name: foo
+    resources:
+  name: bar
+    resources:
+```
+
+Pods will not be killed if CPU guarantees cannot be met (for example, if system tasks or daemons take up lots of CPU); they will be temporarily throttled.
+
+Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
+
+- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
+These containers can use any amount of free memory in the node though.
+
+- *Guaranteed* pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
+
+- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
+Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
+
+### OOM Score configuration at the Nodes
+
+Pod OOM score configuration
+- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed first.
+- The base OOM score is between 0 and 1000, so if process A’s OOM_SCORE_ADJ - process B’s OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.
+- The final OOM score of a process is also between 0 and 1000 + +*Best-effort* + - Set OOM_SCORE_ADJ: 1000 + - So processes in best-effort containers will have an OOM_SCORE of 1000 + +*Guaranteed* + - Set OOM_SCORE_ADJ: -998 + - So processes in guaranteed containers will have an OOM_SCORE of 0 or 1 + +*Burstable* + - If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2 + - Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested) + - This ensures that the OOM_SCORE of burstable pod is > 1 + - If memory request is `0`, OOM_SCORE_ADJ is set to `999`. + - So burstable pods will be killed if they conflict with guaranteed pods + - If a burstable pod uses less memory than requested, its OOM_SCORE < 1000 + - So best-effort pods will be killed if they conflict with burstable pods using less than requested memory + - If a process in burstable pod's container uses more memory than what the container had requested, its OOM_SCORE will be 1000, if not its OOM_SCORE will be < 1000 + - Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed + - If burstable pod's containers with multiple processes conflict, then the formula for OOM scores is a heuristic, it will not ensure "Request and Limit" guarantees. + +*Pod infra containers* or *Special Pod init process* + - OOM_SCORE_ADJ: -998 + +*Kubelet, Docker* + - OOM_SCORE_ADJ: -999 (won’t be OOM killed) + - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume. + +## Known issues and possible improvements + +The above implementation provides for basic oversubscription with protection, but there are a few known limitations. 
+ +#### Support for Swap + +- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior. + +## Alternative QoS Class Policy + +An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). +A strict hierarchy of user-specified numerical priorities is not desirable because: + +1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively +2. Changes to desired priority bands would require changes to all user pod configurations. + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]() + diff --git a/design/resources.md b/design/resources.md new file mode 100644 index 00000000..bb66885b --- /dev/null +++ b/design/resources.md @@ -0,0 +1,370 @@ +**Note: this is a design doc, which describes features that have not been +completely implemented. User documentation of the current state is +[here](../user-guide/compute-resources.md). The tracking issue for +implementation of this model is [#168](http://issue.k8s.io/168). Currently, both +limits and requests of memory and cpu on containers (not pods) are supported. +"memory" is in bytes and "cpu" is in milli-cores.** + +# The Kubernetes resource model + +To do good pod placement, Kubernetes needs to know how big pods are, as well as +the sizes of the nodes onto which they are being placed. 
The definition of "how +big" is given by the Kubernetes resource model — the subject of this +document. + +The resource model aims to be: +* simple, for common cases; +* extensible, to accommodate future growth; +* regular, with few special cases; and +* precise, to avoid misunderstandings and promote pod portability. + +## The resource model + +A Kubernetes _resource_ is something that can be requested by, allocated to, or +consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, +and network bandwidth. + +Once resources on a node have been allocated to one pod, they should not be +allocated to another until that pod is removed or exits. This means that +Kubernetes schedulers should ensure that the sum of the resources allocated +(requested and granted) to its pods never exceeds the usable capacity of the +node. Testing whether a pod will fit on a node is called _feasibility checking_. + +Note that the resource model currently prohibits over-committing resources; we +will want to relax that restriction later. + +### Resource types + +All resources have a _type_ that is identified by their _typename_ (a string, +e.g., "memory"). Several resource types are predefined by Kubernetes (a full +list is below), although only two will be supported at first: CPU and memory. +Users and system administrators can define their own resource types if they wish +(e.g., Hadoop slots). + +A fully-qualified resource typename is constructed from a DNS-style _subdomain_, +followed by a slash `/`, followed by a name. +* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) +(e.g., `kubernetes.io`, `example.com`). +* The name must be not more than 63 characters, consisting of upper- or +lower-case alphanumeric characters, with the `-`, `_`, and `.` characters +allowed anywhere except the first or last character. 
+* As a shorthand, any resource typename that does not start with a subdomain and
+a slash will automatically be prefixed with the built-in Kubernetes _namespace_,
+`kubernetes.io/` in order to fully-qualify it. This namespace is reserved for
+code in the open source Kubernetes repository; as a result, all user typenames
+MUST be fully qualified, and cannot be created in this namespace.
+
+Some example typenames include `memory` (which will be fully-qualified as
+`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
+
+For future reference, note that some resources, such as CPU and network
+bandwidth, are _compressible_, which means that their usage can potentially be
+throttled in a relatively benign manner. All other resources are
+_incompressible_, which means that any attempt to throttle them is likely to
+cause grief. This distinction will be important if a Kubernetes implementation
+supports over-committing of resources.
+
+### Resource quantities
+
+Initially, all Kubernetes resource types are _quantitative_, and have an
+associated _unit_ for quantities of the associated resource (e.g., bytes for
+memory, bytes per second for bandwidth, instances for software licences). The
+units will always be a resource type's natural base units (e.g., bytes, not MB),
+to avoid confusion between binary and decimal multipliers and the underlying
+unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
+
+Resource quantities can be added and subtracted: for example, a node has a fixed
+quantity of each resource type that can be allocated to pods/containers; once
+such an allocation has been made, the allocated resources cannot be made
+available to other pods/containers without over-committing the resources.
+
+To make life easier for people, quantities can be represented externally as
+unadorned integers, or as fixed-point integers with one of these SI suffixes
+(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi,
+ Ki).
For example, the following represent roughly the same value: 128974848,
+"129e6", "129M", "123Mi". Small quantities can be represented directly as
+decimals (e.g., 0.3), or using milli-units (e.g., "300m").
+  * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML
+resource specifications that might be generated or read by people.
+  * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI
+suffix. There are no power-of-two equivalents for SI suffixes that represent
+multipliers less than 1.
+  * These conventions only apply to resource quantities, not arbitrary values.
+
+Internally (i.e., everywhere else), Kubernetes will represent resource
+quantities as integers so it can avoid problems with rounding errors, and will
+not use strings to represent numeric values. To achieve this, quantities that
+naturally have fractional parts (e.g., CPU seconds/second) will be scaled to
+integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in.
+Internal APIs, data structures, and protobufs will use these scaled integer
+units. Raw measurement data such as usage may still need to be tracked and
+calculated using floating point values, but internally they should be rescaled
+to avoid some values being in milli-units and some not.
+  * Note that reading in a resource quantity and writing it out again may change
+the way its values are represented, and truncate precision (e.g., 1.0001 may
+become 1.000), so comparison and difference operations (e.g., by an updater)
+must be done on the internal representations.
+  * Avoiding milli-units in external representations has advantages for people
+who will use Kubernetes, but runs the risk of developers forgetting to rescale
+or accidentally using floating-point representations. That seems like the right
+choice. We will try to reduce the risk by providing libraries that automatically
+do the quantization for JSON/YAML inputs.
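As an illustration of the external quantity syntax above, a small parser might look like the following. This is illustrative Python only, not the actual Kubernetes implementation; the function and table names are made up.

```python
# Multipliers for the SI suffixes and their power-of-two equivalents
# described above. Case is significant: "m" is milli, "M" is mega.
SUFFIXES = {
    "m": 1e-3, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15, "E": 1e18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

def parse_quantity(s: str) -> float:
    """Return the value of an external quantity string in base units."""
    # Try two-character suffixes ("Mi") before one-character ones ("M").
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * SUFFIXES[suffix]
    return float(s)  # plain integers and scientific notation ("129e6")

# The document's examples are all roughly the same value:
assert parse_quantity("129M") == 129e6
assert parse_quantity("123Mi") == 128974848  # 123 * 2**20
```

Note how the binary suffix `Mi` and the decimal suffix `M` differ by a few percent, which is exactly the confusion the "natural base units" rule above is meant to avoid.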
+ +### Resource specifications + +Both users and a number of system components, such as schedulers, (horizontal) +auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers +need to reason about resource requirements of workloads, resource capacities of +nodes, and resource usage. Kubernetes divides specifications of *desired state*, +aka the Spec, and representations of *current state*, aka the Status. Resource +requirements and total node capacity fall into the specification category, while +resource usage, characterizations derived from usage (e.g., maximum usage, +histograms), and other resource demand signals (e.g., CPU load) clearly fall +into the status category and are discussed in the Appendix for now. + +Resource requirements for a container or pod should have the following form: + +```yaml +resourceRequirementSpec: [ + request: [ cpu: 2.5, memory: "40Mi" ], + limit: [ cpu: 4.0, memory: "99Mi" ], +] +``` + +Where: +* _request_ [optional]: the amount of resources being requested, or that were +requested and have been allocated. Scheduler algorithms will use these +quantities to test feasibility (whether a pod will fit onto a node). +If a container (or pod) tries to use more resources than its _request_, any +associated SLOs are voided — e.g., the program it is running may be +throttled (compressible resource types), or the attempt may be denied. If +_request_ is omitted for a container, it defaults to _limit_ if that is +explicitly specified, otherwise to an implementation-defined value; this will +always be 0 for a user-defined resource type. If _request_ is omitted for a pod, +it defaults to the sum of the (explicit or implicit) _request_ values for the +containers it encloses. + +* _limit_ [optional]: an upper bound or cap on the maximum amount of resources +that will be made available to a container or pod; if a container or pod uses +more resources than its _limit_, it may be terminated. 
The _limit_ defaults to
+"unbounded"; in practice, this probably means the capacity of an enclosing
+container, pod, or node, but may result in non-deterministic behavior,
+especially for memory.
+
+Total capacity for a node should have a similar structure:
+
+```yaml
+resourceCapacitySpec: [
+  total: [ cpu: 12, memory: "128Gi" ]
+]
+```
+
+Where:
+* _total_: the total allocatable resources of a node. Initially, the resources
+at a given scope will bound the resources of the sum of inner scopes.
+
+#### Notes
+
+ * It is an error to specify the same resource type more than once in each
+list.
+
+ * It is an error for the _request_ or _limit_ values for a pod to be less than
+the sum of the (explicit or defaulted) values for the containers it encloses.
+(We may relax this later.)
+
+ * If multiple pods are running on the same node and attempting to use more
+resources than they have requested, the result is implementation-defined. For
+example: unallocated or unused resources might be spread equally across
+claimants, or the assignment might be weighted by the size of the original
+request, or as a function of limits, or priority, or the phase of the moon,
+perhaps modulated by the direction of the tide. Thus, although it's not
+mandatory to provide a _request_, it's probably a good idea. (Note that the
+_request_ could be filled in by an automated system that is observing actual
+usage and/or historical data.)
+
+ * Internally, the Kubernetes master can decide the defaulting behavior and the
+kubelet implementation may expect an absolute specification. For example, if
+the master decided that "the default is unbounded" it would pass 2^64 to the
+kubelet.
+
+
+## Kubernetes-defined resource types
+
+The following resource types are predefined ("reserved") by Kubernetes in the
+`kubernetes.io` namespace, and so cannot be used for user-defined resources.
+Note that the syntax of all resource types in the resource spec is deliberately
+similar, but some resource types (e.g., CPU) may receive significantly more
+support than simply tracking quantities in the schedulers and/or the Kubelet.
+
+### Processor cycles
+
+ * Name: `cpu` (or `kubernetes.io/cpu`)
+ * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to
+a canonical "Kubernetes CPU")
+ * Internal representation: milli-KCUs
+ * Compressible? yes
+ * Qualities: this is a placeholder for the kind of thing that may be supported
+in the future — see [#147](http://issue.k8s.io/147)
+ * [future] `schedulingLatency`: as per lmctfy
+ * [future] `cpuConversionFactor`: property of a node: the speed of a CPU
+core on the node's processor divided by the speed of the canonical Kubernetes
+CPU (a floating point value; default = 1.0).
+
+To reduce performance portability problems for pods, and to avoid worst-case
+provisioning behavior, the units of CPU will be normalized to a canonical
+"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be
+equivalent to a single CPU hyperthreaded core for some recent x86 processor. The
+normalization may be implementation-defined, although some reasonable defaults
+will be provided in the open-source Kubernetes code.
+
+Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
+be allocated — control of aspects like this will be handled by resource
+_qualities_ (a future feature).
+
+
+### Memory
+
+ * Name: `memory` (or `kubernetes.io/memory`)
+ * Units: bytes
+ * Compressible? no (at least initially)
+
+The precise meaning of "memory" is implementation dependent, but the
+basic idea is to rely on the underlying `memcg` mechanisms, support, and
+definitions.
+
+Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory
+quantities rather than decimal ones: "64MiB" rather than "64MB".
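Tying together the request/limit defaulting rules and the node capacity structure described above, feasibility checking can be sketched as follows. This is illustrative Python only, not an actual Kubernetes API; the dict layout informally mirrors the YAML specs.

```python
# Sketch of feasibility checking: a pod fits on a node if, for each
# resource type, the sum of its containers' requests is within the node's
# total capacity. A container's omitted request defaults to its limit
# when one is given, else to 0.
def pod_request(containers):
    totals = {}
    for c in containers:
        limits = c.get("limit", {})
        requests = c.get("request", {})
        for res in set(limits) | set(requests):
            value = requests.get(res, limits.get(res, 0))
            totals[res] = totals.get(res, 0) + value
    return totals

def fits(containers, node_total):
    req = pod_request(containers)
    return all(req[res] <= node_total.get(res, 0) for res in req)

# Two containers on a node with 12 CPUs and 128Gi of memory. The first
# container's memory request and both of the second container's requests
# default to the corresponding limits.
containers = [
    {"request": {"cpu": 2.5}, "limit": {"cpu": 4.0, "memory": 99 * 2**20}},
    {"limit": {"cpu": 4.0, "memory": 40 * 2**20}},
]
node = {"cpu": 12, "memory": 128 * 2**30}
assert fits(containers, node)
```

A real scheduler would also have to account for the resources already allocated to other pods on the node, which this sketch omits.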
+
+
+## Resource metadata
+
+A resource type may have an associated read-only ResourceType structure that
+contains metadata about the type. For example:
+
+```yaml
+resourceTypes: [
+  "kubernetes.io/memory": [
+    isCompressible: false, ...
+  ]
+  "kubernetes.io/cpu": [
+    isCompressible: true,
+    internalScaleExponent: 3, ...
+  ]
+  "kubernetes.io/disk-space": [ ... ]
+]
+```
+
+Kubernetes will provide ResourceType metadata for its predefined types. If no
+resource metadata can be found for a resource type, Kubernetes will assume that
+it is a quantified, incompressible resource that is not specified in
+milli-units, and has no default value.
+
+The defined properties are as follows:
+
+| field name | type | contents |
+| ---------- | ---- | -------- |
+| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
+| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
+| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
+| isCompressible | bool, default=false | true if the resource type is compressible |
+| defaultRequest | string, default=none | in the same format as a user-supplied value |
+| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
+
+
+# Appendix: future extensions
+
+The following are planned future extensions to the resource model, included here
+to encourage comments.

## Usage data

Because resource usage and related metrics change continuously, need to be
tracked over time (i.e., historically), can be characterized in a variety of
ways, and are fairly voluminous, we will not include usage in core API objects,
such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs
for accessing and managing that data. See the Appendix for possible
representations of usage data, but the representation we'll use is TBD.

Singleton values for observed and predicted future usage will rapidly prove
inadequate, so we will support the following structure for extended usage
information:

```yaml
resourceStatus: [
  usage: [ cpu: <CPU-info>, memory: <memory-info> ],
  maxusage: [ cpu: <CPU-info>, memory: <memory-info> ],
  predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
]
```

where a `<CPU-info>` or `<memory-info>` structure looks like this:

```yaml
{
  mean: <value>   # arithmetic mean
  max: <value>    # maximum value
  min: <value>    # minimum value
  count: <value>  # number of data points
  percentiles: [  # map from %iles to values
    "10": <10th-percentile-value>,
    "50": <50th-percentile-value>,
    "99": <99th-percentile-value>,
    "99.9": <99.9th-percentile-value>,
    ...
  ]
}
```

All parts of this structure are optional, although we strongly encourage
including quantities for the 50, 90, 95, 99, 99.5, and 99.9 percentiles.
_[In practice, it will be important to include additional info such as the
length of the time window over which the averages are calculated, the
confidence level, and information-quality metrics such as the number of dropped
or discarded data points.]_

## Future resource types

### _[future] Network bandwidth_

 * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`)
 * Units: bytes per second
 * Compressible? yes

### _[future] Network operations_

 * Name: "network-iops" (or `kubernetes.io/network-iops`)
 * Units: operations (messages) per second
 * Compressible? yes

### _[future] Storage space_

 * Name: "storage-space" (or `kubernetes.io/storage-space`)
 * Units: bytes
 * Compressible? no

The amount of secondary storage space available to a container. The main target
is local disk drives and SSDs, although this could also be used to qualify
remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a
disk array, or a file system fronting any of these, is left for future work.

### _[future] Storage time_

 * Name: "storage-time" (or `kubernetes.io/storage-time`)
 * Units: seconds per second of disk time
 * Internal representation: milli-units
 * Compressible? yes

This is the amount of time a container spends accessing disk, including actuator
and transfer time. A standard disk drive provides 1.0 diskTime seconds per
second.

### _[future] Storage operations_

 * Name: "storage-iops" (or `kubernetes.io/storage-iops`)
 * Units: operations per second
 * Compressible? yes



[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]()

diff --git a/design/scheduler_extender.md b/design/scheduler_extender.md
new file mode 100644
index 00000000..1f362242
--- /dev/null
+++ b/design/scheduler_extender.md
@@ -0,0 +1,105 @@
# Scheduler extender

There are three ways to add new scheduling rules (predicates and priority
functions) to Kubernetes: (1) by adding these rules to the scheduler and
recompiling (described here:
https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md),
(2) by implementing your own scheduler process that runs instead of, or
alongside of, the standard Kubernetes scheduler, or (3) by implementing a
"scheduler extender" process that the standard Kubernetes scheduler calls out
to as a final pass when making scheduling decisions.

This document describes the third approach.
This approach is needed for use
cases where scheduling decisions need to be made on resources not directly
managed by the standard Kubernetes scheduler. The extender helps make scheduling
decisions based on such resources. (Note that the three approaches are not
mutually exclusive.)

When scheduling a pod, the extender allows an external process to filter and
prioritize nodes. Two separate http/https calls are issued to the extender, one
for "filter" and one for "prioritize" actions. To use the extender, you must
create a scheduler policy configuration file. The configuration specifies how to
reach the extender, whether to use http or https, and the timeout.

```go
// ExtenderConfig holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
// it is assumed that the extender chose not to provide that extension.
type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string `json:"urlPrefix"`
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to the extender.
	FilterVerb string `json:"filterVerb,omitempty"`
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to the extender.
	PrioritizeVerb string `json:"prioritizeVerb,omitempty"`
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer.
	Weight int `json:"weight,omitempty"`
	// EnableHttps specifies whether https should be used to communicate with the extender
	EnableHttps bool `json:"enableHttps,omitempty"`
	// TLSConfig specifies the transport layer security config
	TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"`
	// HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the scheduling of the pod. A prioritize
	// timeout is ignored; k8s/other extenders' priorities are used to select the node.
	HTTPTimeout time.Duration `json:"httpTimeout,omitempty"`
}
```

A sample scheduler policy file with extender configuration:

```json
{
  "predicates": [
    {
      "name": "HostName"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "PodFitsResources"
    }
  ],
  "priorities": [
    {
      "name": "LeastRequestedPriority",
      "weight": 1
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
      "filterVerb": "filter",
      "enableHttps": false
    }
  ]
}
```

Arguments passed to the FilterVerb endpoint on the extender are the set of nodes
filtered through the k8s predicates, plus the pod. Arguments passed to the
PrioritizeVerb endpoint on the extender are the set of nodes filtered through
the k8s predicates and the extender predicates, plus the pod.

```go
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod api.Pod `json:"pod"`
	// List of candidate nodes where the pod can be scheduled
	Nodes api.NodeList `json:"nodes"`
}
```

The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call
returns priorities for each node (schedulerapi.HostPriorityList).

The "filter" call may prune the set of nodes based on its predicates. Scores
returned by the "prioritize" call are added to the k8s scores (computed through
its priority functions) and used for final host selection.

Multiple extenders can be configured in the scheduler policy.
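To make the protocol concrete, here is a minimal sketch of an extender's "filter" endpoint. The type definitions are simplified stand-ins (the real extender exchanges `api.Pod`/`api.NodeList` and returns a `schedulerapi.ExtenderFilterResult`, which carry many more fields), and the `gpu-` naming rule is a made-up placeholder for whatever external resource the extender tracks:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Simplified stand-ins for the scheduler API types; illustrative only.
type ExtenderArgs struct {
	Pod   string   `json:"pod"`
	Nodes []string `json:"nodes"`
}

type ExtenderFilterResult struct {
	Nodes []string `json:"nodes"`
}

// filterNodes applies a predicate the default scheduler knows nothing about.
// The "gpu-" prefix check stands in for "node has the external resource this
// extender manages".
func filterNodes(args ExtenderArgs) ExtenderFilterResult {
	var result ExtenderFilterResult
	for _, node := range args.Nodes {
		if strings.HasPrefix(node, "gpu-") {
			result.Nodes = append(result.Nodes, node)
		}
	}
	return result
}

// filterHandler serves the POST the scheduler issues at urlPrefix + "/" + filterVerb.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(filterNodes(args))
}

func main() {
	// Exercise the handler the way the scheduler would.
	srv := httptest.NewServer(http.HandlerFunc(filterHandler))
	defer srv.Close()

	body := strings.NewReader(`{"pod":"my-pod","nodes":["node1","gpu-node2"]}`)
	resp, err := http.Post(srv.URL+"/filter", "application/json", body)
	if err != nil {
		panic(err)
	}
	var result ExtenderFilterResult
	json.NewDecoder(resp.Body).Decode(&result)
	fmt.Println(result.Nodes) // prints [gpu-node2]
}
```

A prioritize endpoint would follow the same pattern, returning host/score pairs that the scheduler multiplies by the configured weight and adds to its own scores.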
+ + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]() + diff --git a/design/seccomp.md b/design/seccomp.md new file mode 100644 index 00000000..de00cbc0 --- /dev/null +++ b/design/seccomp.md @@ -0,0 +1,266 @@ +## Abstract + +A proposal for adding **alpha** support for +[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a +system call filtering facility in the Linux kernel which lets applications +define limits on system calls they may make, and what should happen when +system calls are made. Seccomp is used to reduce the attack surface available +to applications. + +## Motivation + +Applications use seccomp to restrict the set of system calls they can make. +Recently, container runtimes have begun adding features to allow the runtime +to interact with seccomp on behalf of the application, which eliminates the +need for applications to link against libseccomp directly. Adding support in +the Kubernetes API for describing seccomp profiles will allow administrators +greater control over the security of workloads running in Kubernetes. + +Goals of this design: + +1. Describe how to reference seccomp profiles in containers that use them + +## Constraints and Assumptions + +This design should: + +* build upon previous security context work +* be container-runtime agnostic +* allow use of custom profiles +* facilitate containerized applications that link directly to libseccomp + +## Use Cases + +1. As an administrator, I want to be able to grant access to a seccomp profile + to a class of users +2. As a user, I want to run an application with a seccomp profile similar to + the default one provided by my container runtime +3. As a user, I want to run an application which is already libseccomp-aware + in a container, and for my application to manage interacting with seccomp + unmediated by Kubernetes +4. 
As a user, I want to be able to use a custom seccomp profile and use + it with my containers + +### Use Case: Administrator access control + +Controlling access to seccomp profiles is a cluster administrator +concern. It should be possible for an administrator to control which users +have access to which profiles. + +The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893) +API extension governs the ability of users to make requests that affect pod +and container security contexts. The proposed design should deal with +required changes to control access to new functionality. + +### Use Case: Seccomp profiles similar to container runtime defaults + +Many users will want to use images that make assumptions about running in the +context of their chosen container runtime. Such images are likely to +frequently assume that they are running in the context of the container +runtime's default seccomp settings. Therefore, it should be possible to +express a seccomp profile similar to a container runtime's defaults. + +As an example, all dockerhub 'official' images are compatible with the Docker +default seccomp profile. So, any user who wanted to run one of these images +with seccomp would want the default profile to be accessible. + +### Use Case: Applications that link to libseccomp + +Some applications already link to libseccomp and control seccomp directly. It +should be possible to run these applications unmodified in Kubernetes; this +implies there should be a way to disable seccomp control in Kubernetes for +certain containers, or to run with a "no-op" or "unconfined" profile. + +Sometimes, applications that link to seccomp can use the default profile for a +container runtime, and restrict further on top of that. It is important to +note here that in this case, applications can only place _further_ +restrictions on themselves. It is not possible to re-grant the ability of a +process to make a system call once it has been removed with seccomp. 

As an example, elasticsearch manages its own seccomp filters in its code.
Currently, elasticsearch is capable of running in the context of the default
Docker profile, but if in the future, elasticsearch needed to be able to call
`ioperm` or `iopl` (both of which are disallowed in the default profile), it
should be possible to run elasticsearch by delegating the seccomp controls to
the pod.

### Use Case: Custom profiles

Different applications have different requirements for seccomp profiles; it
should be possible to specify an arbitrary seccomp profile and use it in a
container. This is more of a concern for applications which need a higher
level of privilege than what is granted by the default profile for a cluster,
since applications that want to restrict privileges further can always make
additional calls in their own code.

An example of an application that requires the use of a syscall disallowed in
the Docker default profile is Chrome, which needs `clone` to create a new user
namespace. Another example would be a program which uses `ptrace` to
implement a sandbox for user-provided code, such as
[eval.in](https://eval.in/).

## Community Work

### Container runtime support for seccomp

#### Docker / opencontainers

Docker supports the open container initiative's API for
seccomp, which is very close to the libseccomp API. It allows full
specification of seccomp filters, with arguments, operators, and actions.

Docker allows the specification of a single seccomp filter. There are open
community requests for:

* [docker/22109](https://github.com/docker/docker/issues/22109): composable
  seccomp filters
* [docker/22105](https://github.com/docker/docker/issues/22105): custom
  seccomp filters for builds

#### rkt / appcontainers

The `rkt` runtime delegates to systemd for seccomp support; there is an open
issue to add support once `appc` supports it.
The `appc` project has an open +issue to be able to describe seccomp as an isolator in an appc pod. + +The systemd seccomp facility is based on a whitelist of system calls that can +be made, rather than a full filter specification. + +Issues: + +* [appc/529](https://github.com/appc/spec/issues/529) +* [rkt/1614](https://github.com/coreos/rkt/issues/1614) + +#### HyperContainer + +[HyperContainer](https://hypercontainer.io) does not support seccomp. + +### Other platforms and seccomp-like capabilities + +FreeBSD has a seccomp/capability-like facility called +[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4). + +#### lxd + +[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile. + +Issues: + +* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp + +## Proposed Design + +### Seccomp API Resource? + +An earlier draft of this proposal described a new global API resource that +could be used to describe seccomp profiles. After some discussion, it was +determined that without a feedback signal from users indicating a need to +describe new profiles in the Kubernetes API, it is not possible to know +whether a new API resource is warranted. + +That being the case, we will not propose a new API resource at this time. If +there is strong community desire for such a resource, we may consider it in +the future. + +Instead of implementing a new API resource, we propose that pods be able to +reference seccomp profiles by name. Since this is an alpha feature, we will +use annotations instead of extending the API with new fields. + +### API changes? + +In the alpha version of this feature we will use annotations to store the +names of seccomp profiles. 
The keys will be:

`container.seccomp.security.alpha.kubernetes.io/<container_name>`

which will be used to set the seccomp profile of a container, and:

`seccomp.security.alpha.kubernetes.io/pod`

which will set the seccomp profile for the containers of an entire pod. If both
a pod-level annotation and a container-level annotation are present for a
container, then the container-level profile takes precedence.

The value of these keys should be container-runtime agnostic. We will
establish a format that expresses the conventions for distinguishing between
an unconfined profile, the container runtime's default, or a custom profile.
Since the format of a profile is likely to be runtime-dependent, we will
consider profiles to be opaque to Kubernetes for now.

The annotation values are scoped as follows:

1. `runtime/default` - the default profile for the container runtime
2. `unconfined` - unconfined profile, i.e., no seccomp sandboxing
3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root

Since seccomp profile schemes may vary between container runtimes, we will
treat the contents of profiles as opaque for now and avoid attempting to find
a common way to describe them. It is up to the container runtime to be
sensitive to the annotations proposed here and to interpret instructions about
local profiles.

A new area on disk (which we will call the seccomp profile root) must be
established to hold seccomp profiles. A field will be added to the Kubelet
for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to
allow admins to set it. If unset, it should default to the `seccomp`
subdirectory of the kubelet root directory.

### Pod Security Policy annotation

The `PodSecurityPolicy` type should be annotated with the allowed seccomp
profiles using the key
`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
key should be a comma-delimited list.
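For illustration (the policy name and profile list are invented, and the API group shown is an assumption), such an annotation might look like:

```yaml
# Hypothetical PodSecurityPolicy restricting pods to two seccomp profiles.
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: "runtime/default,localhost/example-explorer-profile"
# remaining policy fields omitted
```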

## Examples

### Unconfined profile

Here's an example of a pod that uses the unconfined profile:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trustworthy-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: unconfined
spec:
  containers:
    - name: trustworthy-container
      image: sotrustworthy:latest
```

### Custom profile

Here's an example of a pod that uses a profile called
`example-explorer-profile` using the container-level annotation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: explorer
  annotations:
    container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
spec:
  containers:
    - name: explorer
      image: gcr.io/google_containers/explorer:1.0
      args: ["-port=8080"]
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: "/mount/test-volume"
          name: test-volume
  volumes:
    - name: test-volume
      emptyDir: {}
```


[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()

diff --git a/design/secrets.md b/design/secrets.md
new file mode 100644
index 00000000..29d18411
--- /dev/null
+++ b/design/secrets.md
@@ -0,0 +1,628 @@
## Abstract

A proposal for the distribution of [secrets](../user-guide/secrets.md)
(passwords, keys, etc) to the Kubelet and to containers inside Kubernetes using
a custom [volume](../user-guide/volumes.md#secrets) type. See the
[secrets example](../user-guide/secrets/) for more information.

## Motivation

Secrets are needed in containers to access internal resources like the
Kubernetes master or external resources such as git repositories, databases,
etc. Users may also want behaviors in the kubelet that depend on secret data
(credentials for image pull from a docker registry) associated with pods.

Goals of this design:

1. Describe a secret resource
2. Define the various challenges attendant to managing secrets on the node
3. Define a mechanism for consuming secrets in containers without modification

## Constraints and Assumptions

* This design does not prescribe a method for storing secrets; storage of
secrets should be pluggable to accommodate different use-cases
* Encryption of secret data and node security are orthogonal concerns
* It is assumed that node and master are secure and that compromising their
security could also compromise secrets:
  * If a node is compromised, the only secrets that could potentially be
exposed should be the secrets belonging to containers scheduled onto it
  * If the master is compromised, all secrets in the cluster may be exposed
* Secret rotation is an orthogonal concern, but it should be facilitated by
this proposal
* A user who can consume a secret in a container can know the value of the
secret; secrets must be provisioned judiciously

## Use Cases

1. As a user, I want to store secret artifacts for my applications and consume
them securely in containers, so that I can keep the configuration for my
applications separate from the images that use them:
   1. As a cluster operator, I want to allow a pod to access the Kubernetes
master using a custom `.kubeconfig` file, so that I can securely reach the
master
   2. As a cluster operator, I want to allow a pod to access a Docker registry
using credentials from a `.dockercfg` file, so that containers can push images
   3. As a cluster operator, I want to allow a pod to access a git repository
using SSH keys, so that I can push to and fetch from the repository
2. As a user, I want to allow containers to consume supplemental information
about services such as username and password which should be kept secret, so
that I can share secrets about a service amongst the containers in my
application securely
3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a
secret and have the kubelet implement some reserved behaviors based on the types
of secrets the service account consumes:
   1. Use credentials for a docker registry to pull the pod's docker image
   2. Present Kubernetes auth token to the pod or transparently decorate
traffic between the pod and master service
4. As a user, I want to be able to indicate that a secret expires and for that
secret's value to be rotated once it expires, so that the system can help me
follow good practices

### Use-Case: Configuration artifacts

Many configuration files contain secrets intermixed with other configuration
information. For example, a user's application may contain a properties file
that contains database credentials, SaaS API tokens, etc. Users should be able
to consume configuration artifacts in their containers and be able to control
the path on the container's filesystem where the artifact will be presented.

### Use-Case: Metadata about services

Most pieces of information about how to use a service are secrets. For example,
a service that provides a MySQL database needs to provide the username,
password, and database name to consumers so that they can authenticate and use
the correct database. Containers in pods consuming the MySQL service would also
consume the secrets associated with the MySQL service.

### Use-Case: Secrets associated with service accounts

[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple
capabilities and security contexts from individual human users. A
`ServiceAccount` contains references to some number of secrets. A `Pod` can
specify that it is associated with a `ServiceAccount`. Secrets should have a
`Type` field to allow the Kubelet and other system components to take action
based on the secret's type.
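For instance, a secret carrying a Kubernetes auth token might be declared as follows (an illustrative sketch: the serialized field names and the token value are made up, not part of this proposal):

```yaml
# Hypothetical serialization of a typed secret. The type tells the kubelet
# this secret is a Kubernetes auth token rather than opaque user data.
apiVersion: v1
kind: Secret
metadata:
  name: robot-token
type: kubernetes.io/service-account-token
data:
  token: ZXhhbXBsZS10b2tlbg==  # base64-encoded value
```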

#### Example: service account consumes auth token secret

As an example, the service account proposal discusses service accounts consuming
secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod
associated with a service account which consumes this type of secret, the
Kubelet may take a number of actions:

1. Expose the secret in a `.kubernetes_auth` file in a well-known location in
the container's file system
2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod
to the `kubernetes-master` service with the auth token, e.g., by adding a header
to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)

#### Example: service account consumes docker registry credentials

Another example use case is where a pod is associated with a secret containing
docker registry credentials. The Kubelet could use these credentials for the
docker pull to retrieve the image.

### Use-Case: Secret expiry and rotation

Rotation is considered a good practice for many types of secret data. It should
be possible to express that a secret has an expiry date; this would make it
possible to implement a system component that could regenerate expired secrets.
As an example, consider a component that rotates expired secrets. The rotator
could periodically regenerate the values for expired secrets of common types and
update their expiry dates.

## Deferral: Consuming secrets as environment variables

Some images will expect to receive configuration items as environment variables
instead of files. We should consider what the best way to allow this is; there
are a few different options:

1. Force the user to adapt files into environment variables. Users can store
secrets that need to be presented as environment variables in a format that is
easy to consume from a shell:

        $ cat /etc/secrets/my-secret.txt
        export MY_SECRET_ENV=MY_SECRET_VALUE

   The user could `source` the file at `/etc/secrets/my-secret.txt` prior to
executing the command for the image either inline in the command or in an init
script.

2. Give secrets an attribute that allows users to express the intent that the
platform should generate the above syntax in the file used to present a secret.
The user could consume these files in the same manner as the above option.

3. Give secrets attributes that allow the user to express that the secret
should be presented to the container as an environment variable. The container's
environment would contain the desired values and the software in the container
could use them without changes to the command or setup script.

For our initial work, we will treat all secrets as files to narrow the problem
space. There will be a future proposal that handles exposing secrets as
environment variables.

## Flow analysis of secret data with respect to the API server

There are two fundamentally different use-cases for access to secrets:

1. CRUD operations on secrets by their owners
2. Read-only access to the secrets needed for a particular node by the kubelet

### Use-Case: CRUD operations by owners

In use cases for CRUD operations, the user experience for secrets should be no
different than for other API resources.

#### Data store backing the REST API

The data store backing the REST API should be pluggable because different
cluster operators will have different preferences for the central store of
secret data. Some possibilities for storage:

1. An etcd collection alongside the storage for other API resources
2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
3. A secrets server like [Vault](https://www.vaultproject.io/) or
[Keywhiz](https://square.github.io/keywhiz/)
4. An external datastore such as an external etcd, RDBMS, etc.

#### Size limit for secrets

There should be a size limit for secrets in order to:

1. Prevent DoS attacks against the API server
2. Allow kubelet implementations that prevent secret data from touching the
node's filesystem

The size limit should satisfy the following conditions:

1. Large enough to store common artifact types (encryption keypairs,
certificates, small configuration files)
2. Small enough to avoid large impact on node resource consumption (storage,
RAM for tmpfs, etc)

To begin discussion, we propose an initial value for this size limit of **1MB**.

#### Other limitations on secrets

Defining a policy for limitations on how a secret may be referenced by another
API resource and how constraints should be applied throughout the cluster is
tricky due to the number of variables involved:

1. Should there be a maximum number of secrets a pod can reference via a
volume?
2. Should there be a maximum number of secrets a service account can reference?
3. Should there be a total maximum number of secrets a pod can reference via
its own spec and its associated service account?
4. Should there be a total size limit on the amount of secret data consumed by
a pod?
5. How will cluster operators want to be able to configure these limits?
6. How will these limits impact API server validations?
7. How will these limits affect scheduling?

For now, we will not implement validations around these limits. Cluster
operators will decide how much node storage is allocated to secrets. It will be
the operator's responsibility to ensure that the allocated storage is sufficient
for the workload scheduled onto a node.

For now, kubelets will only attach secrets to api-sourced pods, and not file-
or http-sourced ones.
Doing so would:
 - confuse the secrets admission controller in the case of mirror pods.
 - create an apiserver-liveness dependency -- avoiding this dependency is a
main reason to use non-api-source pods.

### Use-Case: Kubelet read of secrets for node

The use-case where the kubelet reads secrets has several additional requirements:

1. Kubelets should only be able to receive secret data which is required by
pods scheduled onto the kubelet's node
2. Kubelets should have read-only access to secret data
3. Secret data should not be transmitted over the wire insecurely
4. Kubelets must ensure pods do not have access to each other's secrets

#### Read of secret data by the Kubelet

The Kubelet should only be allowed to read secrets which are consumed by pods
scheduled onto that Kubelet's node and their associated service accounts.
Authorization of the Kubelet to read this data would be delegated to an
authorization plugin and associated policy rule.

#### Secret data on the node: data at rest

Consideration must be given to whether secret data should be allowed to be at
rest on the node:

1. If secret data is not allowed to be at rest, the size of secret data becomes
another draw on the node's RAM - should it affect scheduling?
2. If secret data is allowed to be at rest, should it be encrypted?
   1. If so, how should this be done?
   2. If not, what threats exist? What types of secret are appropriate to
store this way?

For the sake of limiting complexity, we propose that initially secret data
should not be allowed to be at rest on a node; secret data should be stored on a
node-level tmpfs filesystem. This filesystem can be subdivided into directories
for use by the kubelet and by the volume plugin.

#### Secret data on the node: resource consumption

The Kubelet will be responsible for creating the per-node tmpfs file system for
secret storage.
It is hard to make a prescriptive declaration about how much +storage is appropriate to reserve for secrets because different installations +will vary widely in available resources, desired pod to node density, overcommit +policy, and other operation dimensions. That being the case, we propose for +simplicity that the amount of secret storage be controlled by a new parameter to +the kubelet with a default value of **64MB**. It is the cluster operator's +responsibility to handle choosing the right storage size for their installation +and configuring their Kubelets correctly. + +Configuring each Kubelet is not the ideal story for operator experience; it is +more intuitive that the cluster-wide storage size be readable from a central +configuration store like the one proposed in [#1553](http://issue.k8s.io/1553). +When such a store exists, the Kubelet could be modified to read this +configuration item from the store. + +When the Kubelet is modified to advertise node resources (as proposed in +[#4441](http://issue.k8s.io/4441)), the capacity calculation +for available memory should factor in the potential size of the node-level tmpfs +in order to avoid memory overcommit on the node. + +#### Secret data on the node: isolation + +Every pod will have a [security context](security_context.md). +Secret data on the node should be isolated according to the security context of +the container. The Kubelet volume plugin API will be changed so that a volume +plugin receives the security context of a volume along with the volume spec. +This will allow volume plugins to implement setting the security context of +volumes they manage. + +## Community work + +Several proposals / upstream patches are notable as background for this +proposal: + +1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) +2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) +3. 
[Kubernetes service account proposal](service_accounts.md) +4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) +5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) + +## Proposed Design + +We propose a new `Secret` resource which is mounted into containers with a new +volume type. Secret volumes will be handled by a volume plugin that does the +actual work of fetching the secret and storing it. Secrets contain multiple +pieces of data that are presented as different files within the secret volume +(example: SSH key pair). + +In order to remove the burden from the end user in specifying every file that a +secret consists of, it should be possible to mount all files provided by a +secret with a single `VolumeMount` entry in the container specification. + +### Secret API Resource + +A new resource for secrets will be added to the API: + +```go +type Secret struct { + TypeMeta + ObjectMeta + + // Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN. + // The serialized form of the secret data is a base64 encoded string, + // representing the arbitrary (possibly non-string) data value here. + Data map[string][]byte `json:"data,omitempty"` + + // Used to facilitate programmatic handling of secret data. + Type SecretType `json:"type,omitempty"` +} + +type SecretType string + +const ( + SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default) + SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token + SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth + SecretTypeDockerConfigJson SecretType = "kubernetes.io/dockerconfigjson" // Latest Docker registry auth + // FUTURE: other type values +) + +const MaxSecretSize = 1 * 1024 * 1024 +``` + +A Secret can declare a type in order to provide type information to system +components that work with secrets. 
The default type is `Opaque`, which
+represents arbitrary user-owned data.
+
+Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must
+be valid DNS subdomains.
+
+A new REST API and registry interface will be added to accompany the `Secret`
+resource. The default implementation of the registry will store `Secret`
+information in etcd. Future registry implementations could store the `TypeMeta`
+and `ObjectMeta` fields in etcd and store the secret data in another data store
+entirely, or store the whole object in another data store.
+
+#### Other validations related to secrets
+
+Initially there will be no validations for the number of secrets a pod
+references, or the number of secrets that can be associated with a service
+account. These may be added in the future as the finer points of secrets and
+resource allocation are fleshed out.
+
+### Secret Volume Source
+
+A new `SecretSource` type of volume source will be added to the `VolumeSource`
+struct in the API:
+
+```go
+type VolumeSource struct {
+    // Other fields omitted
+
+    // SecretSource represents a secret that should be presented in a volume
+    SecretSource *SecretSource `json:"secret"`
+}
+
+type SecretSource struct {
+    Target ObjectReference
+}
+```
+
+Secret volume sources are validated to ensure that the specified object
+reference actually points to an object of type `Secret`.
+
+In the future, the `SecretSource` will be extended to allow:
+
+1. Fine-grained control over which pieces of secret data are exposed in the
+volume
+2. The paths and filenames for how secret data are exposed
+
+### Secret Volume Plugin
+
+A new Kubelet volume plugin will be added to handle volumes with a secret
+source.
This plugin will require access to the API server to retrieve secret +data and therefore the volume `Host` interface will have to change to expose a +client interface: + +```go +type Host interface { + // Other methods omitted + + // GetKubeClient returns a client interface + GetKubeClient() client.Interface +} +``` + +The secret volume plugin will be responsible for: + +1. Returning a `volume.Mounter` implementation from `NewMounter` that: + 1. Retrieves the secret data for the volume from the API server + 2. Places the secret data onto the container's filesystem + 3. Sets the correct security attributes for the volume based on the pod's +`SecurityContext` +2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that +cleans the volume from the container's filesystem + +### Kubelet: Node-level secret storage + +The Kubelet must be modified to accept a new parameter for the secret storage +size and to create a tmpfs file system of that size to store secret data. Rough +accounting of specific changes: + +1. The Kubelet should have a new field added called `secretStorageSize`; units +are megabytes +2. `NewMainKubelet` should accept a value for secret storage size +3. The Kubelet server should have a new flag added for secret storage size +4. The Kubelet's `setupDataDirs` method should be changed to create the secret +storage + +### Kubelet: New behaviors for secrets associated with service accounts + +For use-cases where the Kubelet's behavior is affected by the secrets associated +with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example, +if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the +Kubelet will need to be changed to accommodate this. Subsequent proposals can +address this on a type-by-type basis. + +## Examples + +For clarity, let's examine some detailed examples of some common use-cases in +terms of the suggested changes. 
All of these examples are assumed to be created +in a namespace called `example`. + +### Use-Case: Pod with ssh keys + +To create a pod that uses an ssh key stored as a secret, we first need to create +a secret: + +```json +{ + "kind": "Secret", + "apiVersion": "v1", + "metadata": { + "name": "ssh-key-secret" + }, + "data": { + "id-rsa": "dmFsdWUtMg0KDQo=", + "id-rsa.pub": "dmFsdWUtMQ0K" + } +} +``` + +**Note:** The serialized JSON and YAML values of secret data are encoded as +base64 strings. Newlines are not valid within these strings and must be +omitted. + +Now we can create a pod which references the secret with the ssh key and +consumes it in a volume: + +```json +{ + "kind": "Pod", + "apiVersion": "v1", + "metadata": { + "name": "secret-test-pod", + "labels": { + "name": "secret-test" + } + }, + "spec": { + "volumes": [ + { + "name": "secret-volume", + "secret": { + "secretName": "ssh-key-secret" + } + } + ], + "containers": [ + { + "name": "ssh-test-container", + "image": "mySshImage", + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" + } + ] + } + ] + } +} +``` + +When the container's command runs, the pieces of the key will be available in: + + /etc/secret-volume/id-rsa.pub + /etc/secret-volume/id-rsa + +The container is then free to use the secret data to establish an ssh +connection. + +### Use-Case: Pods with prod / test credentials + +This example illustrates a pod which consumes a secret containing prod +credentials and another pod which consumes a secret with test environment +credentials. 
+ +The secrets: + +```json +{ + "apiVersion": "v1", + "kind": "List", + "items": + [{ + "kind": "Secret", + "apiVersion": "v1", + "metadata": { + "name": "prod-db-secret" + }, + "data": { + "password": "dmFsdWUtMg0KDQo=", + "username": "dmFsdWUtMQ0K" + } + }, + { + "kind": "Secret", + "apiVersion": "v1", + "metadata": { + "name": "test-db-secret" + }, + "data": { + "password": "dmFsdWUtMg0KDQo=", + "username": "dmFsdWUtMQ0K" + } + }] +} +``` + +The pods: + +```json +{ + "apiVersion": "v1", + "kind": "List", + "items": + [{ + "kind": "Pod", + "apiVersion": "v1", + "metadata": { + "name": "prod-db-client-pod", + "labels": { + "name": "prod-db-client" + } + }, + "spec": { + "volumes": [ + { + "name": "secret-volume", + "secret": { + "secretName": "prod-db-secret" + } + } + ], + "containers": [ + { + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" + } + ] + } + ] + } + }, + { + "kind": "Pod", + "apiVersion": "v1", + "metadata": { + "name": "test-db-client-pod", + "labels": { + "name": "test-db-client" + } + }, + "spec": { + "volumes": [ + { + "name": "secret-volume", + "secret": { + "secretName": "test-db-secret" + } + } + ], + "containers": [ + { + "name": "db-client-container", + "image": "myClientImage", + "volumeMounts": [ + { + "name": "secret-volume", + "readOnly": true, + "mountPath": "/etc/secret-volume" + } + ] + } + ] + } + }] +} +``` + +The specs for the two pods differ only in the value of the object referred to by +the secret volume source. 
Both containers will have the following files present +on their filesystems: + + /etc/secret-volume/username + /etc/secret-volume/password + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]() + diff --git a/design/security.md b/design/security.md new file mode 100644 index 00000000..b1aeacbd --- /dev/null +++ b/design/security.md @@ -0,0 +1,218 @@ +# Security in Kubernetes + +Kubernetes should define a reasonable set of security best practices that allows +processes to be isolated from each other, from the cluster infrastructure, and +which preserves important boundaries between those who manage the cluster, and +those who use the cluster. + +While Kubernetes today is not primarily a multi-tenant system, the long term +evolution of Kubernetes will increasingly rely on proper boundaries between +users and administrators. The code running on the cluster must be appropriately +isolated and secured to prevent malicious parties from affecting the entire +cluster. + + +## High Level Goals + +1. Ensure a clear isolation between the container and the underlying host it +runs on +2. Limit the ability of the container to negatively impact the infrastructure +or other containers +3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - +ensure components are only authorized to perform the actions they need, and +limit the scope of a compromise by limiting the capabilities of individual +components +4. Reduce the number of systems that have to be hardened and secured by +defining clear boundaries between components +5. Allow users of the system to be cleanly separated from administrators +6. Allow administrative functions to be delegated to users where necessary +7. Allow applications to be run on the cluster that have "secret" data (keys, +certs, passwords) which is properly abstracted from "public" data. 
+
+## Use cases
+
+### Roles
+
+We define "user" as a unique identity accessing the Kubernetes API server, which
+may be a human or an automated process. Human users fall into the following
+categories:
+
+1. k8s admin - administers a Kubernetes cluster and has access to the underlying
+components of the system
+2. k8s project administrator - administers the security of a small subset of
+the cluster
+3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster
+resources
+
+Automated process users fall into the following categories:
+
+1. k8s container user - a user that processes running inside a container (on the
+cluster) can use to access other cluster resources independent of the human
+users attached to a project
+2. k8s infrastructure user - the user that Kubernetes infrastructure components
+use to perform cluster functions with clearly defined roles
+
+### Description of roles
+
+* Developers:
+  * write pod specs.
+  * make some of their own images, and use some "community" docker images
+  * know which pods need to talk to which other pods
+  * decide which pods should share files with other pods, and which should not.
+  * reason about application level security, such as containing the effects of a
+local-file-read exploit in a webserver pod.
+  * do not often reason about operating system or organizational security.
+  * are not necessarily comfortable reasoning about the security properties of a
+system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
+
+* Project Admins:
+  * allocate identity and roles within a namespace
+  * reason about organizational security within a namespace
+  * don't give a developer permissions that are not needed for their role.
+  * protect files on shared storage from unnecessary cross-team access
+  * are less focused on application security
+
+* Administrators:
+  * are less focused on application security, and more focused on operating system
+security.
+  * protect the node from bad actors in containers, and properly-configured
+innocent containers from bad actors in other containers.
+  * are comfortable reasoning about the security properties of a system at the level
+of detail of Linux Capabilities, SELinux, AppArmor, etc.
+  * decide who can use which Linux Capabilities, run privileged containers, use
+hostPath, etc.
+  * e.g. a team that manages Ceph or a mysql server might be trusted to have
+raw access to storage devices in some organizations, but teams that develop the
+applications at higher layers would not.
+
+
+## Proposed Design
+
+A pod runs in a *security context* under a *service account* that is defined by
+an administrator or project administrator, and the *secrets* a pod has access to
+are limited by that *service account*.
+
+
+1. The API should authenticate and authorize user actions [authn and authz](access.md)
+2. All infrastructure components (kubelets, kube-proxies, controllers,
+scheduler) should have an infrastructure user that they can authenticate with
+and be authorized to perform only the functions they require against the API.
+3. Most infrastructure components should use the API as a way of exchanging data
+and changing the system, and only the API should have access to the underlying
+data store (etcd)
+4. When containers run on the cluster and need to talk to other containers or
+the API server, they should be identified and authorized clearly as an
+autonomous process via a [service account](service_accounts.md)
+    1. If the user who started a long-lived process is removed from access to
+the cluster, the process should be able to continue without interruption
+    2. If the users who started processes are removed from the cluster,
+administrators may wish to terminate their processes in bulk
+    3. When containers run with a service account, the user that created /
+triggered the service account behavior must be associated with the container's
+action
+5.
When container processes run on the cluster, they should run in a
+[security context](security_context.md) that isolates those processes via Linux
+user security, user namespaces, and permissions.
+    1. Administrators should be able to configure the cluster to automatically
+confine all container processes to run as a non-root, randomly assigned UID
+    2. Administrators should be able to ensure that container processes within
+the same namespace are all assigned the same Unix UID
+    3. Administrators should be able to limit which developers and project
+administrators have access to higher privilege actions
+    4. Project administrators should be able to run pods within a namespace
+under different security contexts, and developers must be able to specify which
+of the available security contexts they may use
+    5. Developers should be able to run their own images or images from the
+community and expect those images to run correctly
+    6. Developers may need to ensure their images work within higher security
+requirements specified by administrators
+    7. When available, Linux kernel user namespaces can be used to ensure 5.2
+and 5.4 are met.
+    8. When application developers want to share filesystem data via distributed
+filesystems, the Unix user ids on those filesystems must be consistent across
+different container processes
+6. Developers should be able to define [secrets](secrets.md) that are
+automatically added to the containers when pods are run
+    1. Secrets are files injected into the container whose values should not be
+displayed within a pod. Examples:
+        1. An SSH private key for git cloning remote data
+        2. A client certificate for accessing a remote system
+        3. A private key and certificate for a web server
+        4. A .kubeconfig file with embedded cert / token data for accessing the
+Kubernetes master
+        5. A .dockercfg file for pulling images from a protected registry
+    2.
Developers should be able to define the pod spec so that a secret lands
+in a specific location
+    3. Project administrators should be able to limit developers within a
+namespace from viewing or modifying secrets (anyone who can launch an arbitrary
+pod can view secrets)
+    4. Secrets are generally not copied from one namespace to another when a
+developer's application definitions are copied
+
+
+### Related design discussion
+
+* [Authorization and authentication](access.md)
+* [Secret distribution via files](http://pr.k8s.io/2030)
+* [Docker secrets](https://github.com/docker/docker/pull/6697)
+* [Docker vault](https://github.com/docker/docker/issues/10310)
+* [Service Accounts](service_accounts.md)
+* [Secret volumes](http://pr.k8s.io/4126)
+
+## Specific Design Points
+
+### TODO: authorization, authentication
+
+### Isolate the data store from the nodes and supporting infrastructure
+
+Access to the central data store (etcd) in Kubernetes allows an attacker to run
+arbitrary containers on hosts, to gain access to any protected information
+stored in either volumes or in pods (such as access tokens or shared secrets
+provided as environment variables), to intercept and redirect traffic from
+running services by inserting middlemen, or to simply delete the entire history
+of the cluster.
+
+As a general principle, access to the central data store should be restricted to
+the components that need full control over the system and which can apply
+appropriate authorization and authentication of change requests. In the future,
+etcd may offer granular access control, but that granularity will require an
+administrator to understand the schema of the data to properly apply security.
+An administrator must be able to properly secure Kubernetes at a policy level,
+rather than at an implementation level, and schema changes over time should not
+risk unintended security leaks.
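A toy illustration of what policy-level (rather than implementation-level) control looks like: each infrastructure component is allowed a small, fixed set of verbs on a fixed set of resources. The rule shape and component names below are invented for illustration; Kubernetes' real authorizer plugins are more expressive:

```go
package main

import "fmt"

// rule is a minimal, hypothetical policy entry: a component identity is
// allowed one verb on one resource type. Real authorization plugins would
// also scope rules by namespace, node, and object.
type rule struct {
	user     string
	verb     string
	resource string
}

// policy grants each infrastructure component only what its role requires.
var policy = []rule{
	{"kubelet", "watch", "pods"},        // the pods bound to its node
	{"kube-proxy", "watch", "services"}, // services to load balance
	{"kube-proxy", "watch", "endpoints"},
}

// allowed reports whether the request matches any policy rule;
// everything not explicitly granted is denied.
func allowed(user, verb, resource string) bool {
	for _, r := range policy {
		if r.user == user && r.verb == verb && r.resource == resource {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(allowed("kubelet", "watch", "pods"))   // true
	fmt.Println(allowed("kubelet", "delete", "nodes")) // false
}
```

The point of the sketch is the default-deny posture: components never touch etcd directly, and the API layer mediates every access.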
+
+Both the Kubelet and Kube Proxy need information related to their specific roles -
+for the Kubelet, the set of pods it should be running, and for the Proxy, the
+set of services and endpoints to load balance. The Kubelet also needs to provide
+information about running pods and historical termination data. The access
+pattern for both Kubelet and Proxy to load their configuration is an efficient
+"wait for changes" request over HTTP. It should be possible to limit the Kubelet
+and Proxy to only access the information they need to perform their roles and no
+more.
+
+The controller manager for Replication Controllers and other future controllers
+act on behalf of a user via delegation to perform automated maintenance on
+Kubernetes resources. Their ability to access or modify resource state should be
+strictly limited to their intended duties and they should be prevented from
+accessing information not pertinent to their role. For example, a replication
+controller needs only to create a copy of a known pod configuration, to
+determine the running state of an existing pod, or to delete an existing pod
+that it created - it does not need to know the contents or current state of a
+pod, nor have access to any data in the pod's attached volumes.
+
+The Kubernetes pod scheduler is responsible for reading data from the pod to fit
+it onto a node in the cluster. At a minimum, it needs access to view the ID of a
+pod (to craft the binding), its current state, any resource information
+necessary to identify placement, and other data relevant to concerns like
+anti-affinity, zone or region preference, or custom logic. It does not need the
+ability to modify pods or see other resources, only to create bindings. It
+should not need the ability to delete bindings unless the scheduler takes
+control of relocating components on failed hosts (which could be implemented by
+a separate component that can delete bindings but not create them).
The +scheduler may need read access to user or project-container information to +determine preferential location (underspecified at this time). + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() + diff --git a/design/security_context.md b/design/security_context.md new file mode 100644 index 00000000..76bc8ee8 --- /dev/null +++ b/design/security_context.md @@ -0,0 +1,192 @@ +# Security Contexts + +## Abstract + +A security context is a set of constraints that are applied to a container in +order to achieve the following goals (from [security design](security.md)): + +1. Ensure a clear isolation between container and the underlying host it runs +on +2. Limit the ability of the container to negatively impact the infrastructure +or other containers + +## Background + +The problem of securing containers in Kubernetes has come up +[before](http://issue.k8s.io/398) and the potential problems with container +security are [well known](http://opensource.com/business/14/7/docker-security-selinux). +Although it is not possible to completely isolate Docker containers from their +hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) +make it possible to greatly reduce the attack surface. + +## Motivation + +### Container isolation + +In order to improve container isolation from host and other containers running +on the host, containers should only be granted the access they need to perform +their work. To this end it should be possible to take advantage of Docker +features such as the ability to +[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) +and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) +to the container process. 
+
+Support for user namespaces has recently been
+[merged](https://github.com/docker/libcontainer/pull/304) into Docker's
+libcontainer project and should soon surface in Docker itself. It will make it
+possible to assign a range of unprivileged uids and gids from the host to each
+container, improving the isolation between host and container and between
+containers.
+
+### External integration with shared storage
+
+In order to support external integration with shared storage, processes running
+in a Kubernetes cluster should be able to be uniquely identified by their Unix
+UID, such that a chain of ownership can be established. Processes in pods will
+need to have consistent UID/GID/SELinux category labels in order to access
+shared disks.
+
+## Constraints and Assumptions
+
+* It is out of the scope of this document to prescribe a specific set of
+constraints to isolate containers from their host. Different use cases need
+different settings.
+* The concept of a security context should not be tied to a particular security
+mechanism or platform (e.g. SELinux, AppArmor)
+* Applying a different security context to a scope (namespace or pod) requires
+a solution such as the one proposed for [service accounts](service_accounts.md).
+
+## Use Cases
+
+In order of increasing complexity, the following are example use cases that would
+be addressed with security contexts:
+
+1. Kubernetes is used to run a single cloud application. In order to protect
+nodes from containers:
+    * All containers run as a single non-root user
+    * Privileged containers are disabled
+    * All containers run with a particular MCS label
+    * Kernel capabilities like CHOWN and MKNOD are removed from containers
+
+2. Just like case #1, except that I have more than one application running on
+the Kubernetes cluster.
+    * Each application is run in its own namespace to avoid name collisions
+    * For each application a different uid and MCS label is used
+
+3.
Kubernetes is used as the base for a PaaS with multiple projects, each
+project represented by a namespace.
+    * Each namespace is associated with a range of uids/gids on the node that
+are mapped to uids/gids on containers using Linux user namespaces.
+    * Certain pods in each namespace have special privileges to perform system
+actions such as talking back to the server for deployment, running docker builds,
+etc.
+    * External NFS storage is assigned to each namespace and permissions set
+using the range of uids/gids assigned to that namespace.
+
+## Proposed Design
+
+### Overview
+
+A *security context* consists of a set of constraints that determine how a
+container is secured before getting created and run. A security context resides
+on the container and represents the runtime parameters that will be used to
+create and run the container via container APIs. A *security context provider*
+is passed to the Kubelet so it can have a chance to mutate Docker API calls in
+order to apply the security context.
+
+It is recommended that this design be implemented in two phases:
+
+1. Implement the security context provider extension point in the Kubelet
+so that a default security context can be applied on container run and creation.
+2. Implement a security context structure that is part of a service account. The
+default context provider can then be used to apply a security context based on
+the service account associated with the pod.
+
+### Security Context Provider
+
+The Kubelet will have an interface that points to a `SecurityContextProvider`.
+The `SecurityContextProvider` is invoked before creating and running a given
+container:
+
+```go
+type SecurityContextProvider interface {
+    // ModifyContainerConfig is called before the Docker createContainer call.
+    // The security context provider can make changes to the Config with which
+    // the container is created.
+    // An error is returned if it's not possible to secure the container as
+    // requested with a security context.
+    ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) error
+
+    // ModifyHostConfig is called before the Docker runContainer call.
+    // The security context provider can make changes to the HostConfig, affecting
+    // security options, whether the container is privileged, volume binds, etc.
+    // An error is returned if it's not possible to secure the container as requested
+    // with a security context.
+    ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) error
+}
+```
+
+If the value of the SecurityContextProvider field on the Kubelet is nil, the
+kubelet will create and run the container as it does today.
+
+### Security Context
+
+A security context resides on the container and represents the runtime
+parameters that will be used to create and run the container via container APIs.
+Following is an example of an initial implementation:
+
+```go
+type Container struct {
+    ... other fields omitted ...
+    // Optional: SecurityContext defines the security options the pod should be run with
+    SecurityContext *SecurityContext
+}
+
+// SecurityContext holds security configuration that will be applied to a container. SecurityContext
+// contains duplication of some existing fields from the Container resource. These duplicate fields
+// will be populated based on the Container configuration if they are not set. Defining them on
+// both the Container AND the SecurityContext will result in an error.
+type SecurityContext struct {
+    // Capabilities are the capabilities to add/drop when running the container
+    Capabilities *Capabilities
+
+    // Run the container in privileged mode
+    Privileged *bool
+
+    // SELinuxOptions are the labels to be applied to the container
+    // and volumes
+    SELinuxOptions *SELinuxOptions
+
+    // RunAsUser is the UID to run the entrypoint of the container process.
+    RunAsUser *int64
+}
+
+// SELinuxOptions are the labels to be applied to the container.
+type SELinuxOptions struct {
+    // SELinux user label
+    User string
+
+    // SELinux role label
+    Role string
+
+    // SELinux type label
+    Type string
+
+    // SELinux level label.
+    Level string
+}
+```
+
+### Admission
+
+It is up to an admission plugin to determine if the security context is
+acceptable or not. At the time of writing, the admission control plugin for
+security contexts will only allow a context that defines capabilities or
+privileged mode. Contexts that attempt to define a UID or SELinux options will be
+denied by default. In the future the admission plugin will base this decision
+upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
+
diff --git a/design/selector-generation.md b/design/selector-generation.md
new file mode 100644
index 00000000..efb32cf2
--- /dev/null
+++ b/design/selector-generation.md
@@ -0,0 +1,180 @@
+Design
+=============
+
+# Goals
+
+Make it really hard to accidentally create a job which has an overlapping
+selector, while still making it possible to choose an arbitrary selector, and
+without adding complex constraint solving to the API server.
+
+# Use Cases
+
+1. user can leave all label and selector fields blank and system will fill in
+reasonable ones: non-overlappingness guaranteed.
+2. user can put on the pod template some labels that are useful to the user,
+without reasoning about non-overlappingness. System adds an additional label to
+assure non-overlap.
+3. If user wants to reparent pods to new job (very rare case) and knows what
+they are doing, they can completely disable this behavior and specify explicit
+selector.
+4. If a controller that makes jobs, like scheduled job, wants to use different
+labels, such as the time and date of the run, it can do that.
+5.
If User reads v1beta1 documentation or reuses v1beta1 Job definitions and
+just changes the API group, the user should not automatically be allowed to
+specify a selector, since this is very rarely what people want to do and is
+error prone.
+6. If User downloads an existing job definition, e.g. with
+`kubectl get jobs/old -o yaml` and tries to modify and post it, he should not
+create an overlapping job.
+7. If User downloads an existing job definition, e.g. with
+`kubectl get jobs/old -o yaml` and tries to modify and post it, and he
+accidentally copies the uniquifying label from the old one, then he should not
+get an error from a label-key conflict, nor get erratic behavior.
+8. If user reads swagger docs and sees the selector field, he should not be able
+to set it without realizing the risks.
+9. (Deferred requirement:) If user wants to specify a preferred name for the
+non-overlappingness key, they can pick a name.
+
+# Proposed changes
+
+## API
+
+`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as
+follows.
+
+Field `job.spec.manualSelector` is added. It controls whether selectors are
+automatically generated. In automatic mode, user cannot make the mistake of
+creating non-unique selectors. In manual mode, certain rare use cases are
+supported.
+
+Validation is not changed. A selector must be provided, and it must select the
+pod template.
+
+Defaulting changes. Defaulting happens in one of two modes:
+
+### Automatic Mode
+
+- User does not specify `job.spec.selector`.
+- User is probably unaware of the `job.spec.manualSelector` field and does not
+think about it.
+- User optionally puts labels on the pod template. User does not think
+about uniqueness, just labeling for user's own reasons.
+- Defaulting logic sets `job.spec.selector` to
+`matchLabels["controller-uid"]="$UIDOFJOB"`
+- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
+    - The first label is controller-uid=$UIDOFJOB.
+  - The second label is "job-name=$NAMEOFJOB".
+
+### Manual Mode
+
+- User means User or Controller for the rest of this list.
+- User does specify `job.spec.selector`.
+- User does specify `job.spec.manualSelector=true`.
+- User puts a unique label or label(s) on the pod template (required). User does
+think carefully about uniqueness.
+- No defaulting of pod labels or the selector happens.
+
+### Rationale
+
+UID is better than Name in that:
+- it allows cross-namespace control someday if we need it.
+- it is unique across all kinds. `controller-name=foo` does not ensure
+uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a
+problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the
+latter cannot use label `job-name=foo`, though there is a temptation to do so.
+- it uniquely identifies the controller across time. This prevents the case
+where, for example, someone deletes a job via the REST API or client
+(where cascade=false), leaving pods around. We don't want those to be picked up
+unintentionally. It also prevents the case where a user looks at an old job that
+finished but is not deleted, and tries to select its pods, and gets the wrong
+impression that it is still running.
+
+Job name is more user-friendly and self-documenting.
+
+Commands like `kubectl get pods -l job-name=myjob` should do exactly what is
+wanted 99.9% of the time. Automated control loops should still use the
+`controller-uid` label.
+
+Using both gets the benefits of both, at the cost of some label verbosity.
+
+The `manualSelector` field is a `*bool`. Since false is expected to be much more
+common, and since the feature is complex, it is better to leave it unspecified
+so that users looking at a stored job spec do not need to be aware of this
+field.
+
+### Overriding Unique Labels
+
+If the user does specify `job.spec.selector`, then the user must also specify
+`job.spec.manualSelector`. This ensures the user knows that what he is doing is
+not the normal thing to do.
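The automatic-mode defaulting and the manual-mode opt-in check described above can be sketched as follows. This is an illustrative sketch only, not apiserver code: `JobSpec`, `defaultAndValidate`, and the field shapes are simplified stand-ins for the real `batch/v1` types.

```go
package main

import "fmt"

// JobSpec is a pared-down, hypothetical stand-in for the real JobSpec; only
// the fields relevant to selector generation are shown.
type JobSpec struct {
	Selector       map[string]string // matchLabels portion of the selector
	ManualSelector *bool
	TemplateLabels map[string]string // .spec.template.metadata.labels
}

// defaultAndValidate sketches the proposed behavior: in automatic mode the
// selector and uniquifying pod labels are filled in from the job's UID; a
// user-supplied selector without manualSelector=true is rejected.
func defaultAndValidate(spec *JobSpec, uid, name string) error {
	if spec.Selector == nil {
		// Automatic mode: generate a non-overlapping selector from the UID.
		spec.Selector = map[string]string{"controller-uid": uid}
		if spec.TemplateLabels == nil {
			spec.TemplateLabels = map[string]string{}
		}
		spec.TemplateLabels["controller-uid"] = uid
		spec.TemplateLabels["job-name"] = name
		return nil
	}
	// Manual mode: require explicit opt-in.
	if spec.ManualSelector == nil || !*spec.ManualSelector {
		return fmt.Errorf("spec.selector is set but spec.manualSelector is not true")
	}
	return nil
}

func main() {
	spec := &JobSpec{}
	if err := defaultAndValidate(spec, "abc-123", "myjob"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(spec.Selector["controller-uid"], spec.TemplateLabels["job-name"])
}
```

A selector copied from an old job without `manualSelector: true` would fail this check, which is exactly use case 6 above.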
+
+To prevent users from copying the `job.spec.manualSelector` flag from existing
+jobs, it will be optional and default to false, which means that when you GET an
+existing job back that didn't use this feature, you don't even see the
+`job.spec.manualSelector` flag, so you are not tempted to wonder if you should
+fiddle with it.
+
+## Job Controller
+
+No changes.
+
+## Kubectl
+
+No required changes. Suggest moving SELECTOR to the wide output of `kubectl get
+jobs`, since users do not write the selector.
+
+## Docs
+
+Remove examples that use a selector, and remove labels from pod templates.
+Recommend `kubectl get pods -l job-name=name` as the way to find the pods of a
+job.
+
+# Conversion
+
+The following applies to Job, as well as to other types that adopt this pattern:
+
+- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`.
+- Both the internal type and the `batch/v1` type will get
+`job.spec.manualSelector`.
+- The fields `manualSelector` and `autoSelector` have opposite meanings.
+- Each field defaults to false when unset, and so v1beta1 has a different
+default than v1 and internal. This is intentional: we want new uses to default
+to the less error-prone behavior, and we do not want to change the behavior of
+v1beta1.
+
+*Note*: since the internal default is changing, client library consumers that
+create Jobs may need to add "job.spec.manualSelector=true" to keep working, or
+switch to auto selectors.
+
+Conversion is as follows:
+- `extensions/__internal` to `extensions/v1beta1`: the value of
+`__internal.Spec.ManualSelector` is defaulted to false if nil, negated,
+defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
+- `extensions/v1beta1` to `extensions/__internal`: the value of
+`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to
+nil if false, and written to `__internal.Spec.ManualSelector`.
+
+This conversion gives the following properties.
+
+1. 
Users that previously used v1beta1 do not start seeing a new field when they
+get back objects.
+2. The distinction between originally unset versus explicitly set to false is
+not preserved (it would have been nice to do so, but that requires a more
+complicated solution).
+3. Users who only created v1beta1 examples or v1 examples will never see the
+existence of either field.
+4. Since v1beta1 is convertible to/from v1, the storage location (path in etcd)
+does not need to change, allowing scriptable rollforward/rollback.
+
+# Future Work
+
+Follow this pattern for Deployments, ReplicaSet, and DaemonSet when going to v1,
+if it works well for Job.
+
+Docs will be edited to show examples without a `job.spec.selector`.
+
+We probably want the behavior of Job and ReplicationController to be as similar
+as possible.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
+
diff --git a/design/selinux.md b/design/selinux.md
new file mode 100644
index 00000000..ece83d44
--- /dev/null
+++ b/design/selinux.md
@@ -0,0 +1,317 @@
+## Abstract
+
+A proposal for enabling containers in a pod to share volumes using a pod-level SELinux context.
+
+## Motivation
+
+Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin
+authors should not have to explicitly account for SELinux except for volume types that require
+special handling of the SELinux context during setup.
+
+Currently, each container in a pod has an SELinux context. This is not an ideal factoring for
+sharing resources using SELinux.
+
+We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a
+generic way.
+
+Goals of this design:
+
+1. Describe the problems with a container SELinux context
+2. Articulate a design for generic SELinux support for volumes using a pod-level SELinux context
+   which is backward compatible with the v1.0.0 API
+
+## Constraints and Assumptions
+
+1. 
We will not support securing containers within a pod from one another
+2. Volume plugins should not have to handle setting the SELinux context on volumes
+3. We will not deal with shared storage
+
+## Current State Overview
+
+### Docker
+
+Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux
+context of a container can be overridden with the `SecurityOpt` API, which allows setting the
+different parts of the SELinux context individually.
+
+Docker has functionality to relabel bind-mounts with a usable SELinux context, and supports two
+different use-cases:
+
+1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+   SELinux context
+2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
+   SELinux context, but remove the MCS labels, making the volume shareable between containers
+
+We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container
+(from an SELinux standpoint) can use the volume.
+
+### rkt
+
+rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
+and allocates a unique MCS label per pod.
+
+### Kubernetes
+
+
+There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the
+EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
+patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the general
+problem of handling SELinux in Kubernetes rather than to merge this PR.
+
+A new `PodSecurityContext` type has been added that carries information about security attributes
+that apply to the entire pod and that apply to all containers in a pod. See:
+
+1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
+1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
+
+## Use Cases
+
+1. 
As a cluster operator, I want to support securing pods from one another using SELinux when
+   SELinux integration is enabled in the cluster
+2. As a user, I want volume sharing to work correctly amongst containers in pods
+
+#### SELinux context: pod- or container-level?
+
+Currently, the SELinux context is specifiable only at the container level. This is an inconvenient
+factoring for sharing volumes and other SELinux-secured resources between containers because there
+is no way in SELinux to share resources between processes with different MCS labels except to
+remove MCS labels from the shared resource. This is a big security risk: _any container_ in the
+system can work with a resource which has the same SELinux context as it and no MCS labels. Since
+we are also not interested in isolating containers in a pod from one another, the SELinux context
+should be shared by all containers in a pod. This facilitates isolation from the containers in
+other pods and sharing of resources amongst all the containers of a pod.
+
+#### Volumes
+
+Kubernetes volumes can be divided into two broad categories:
+
+1. Unshared storage:
+   1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
+      downward API. All volumes in this category delegate to `EmptyDir` for their underlying
+      storage.
+   2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc., *when used exclusively
+      by a single pod*.
+2. Shared storage:
+   1. `hostPath` is shared storage because it is necessarily used by a container and the host
+   2. Network file systems such as NFS, Glusterfs, Cephfs, etc.
+   3. Block-device-based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
+      they may be used simultaneously by multiple pods.
+
+For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon`
+operation on the volume directory after running the volume plugin's `Setup` function.
For these
+volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume
+plugin code. Some volume plugins may need to use the SELinux context during a mount operation in
+certain cases. To account for this, our design must have a way for volume plugins to state that
+a particular volume should or should not receive generic label management.
+
+For shared storage, the picture is murkier. Labels for existing shared storage will be managed
+outside Kubernetes, and administrators will have to set the SELinux context of pods correctly.
+The problem of solving SELinux label management for new shared storage is outside the scope of
+this proposal.
+
+## Analysis
+
+The system needs to be able to:
+
+1. Model correctly which volumes require SELinux label management
+1. Relabel volumes with the correct SELinux context when required
+
+### Modeling whether a volume requires label management
+
+#### Unshared storage: volumes derived from `EmptyDir`
+
+Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
+that the ownership and SELinux context (when relevant) are set correctly for the volume to be
+usable.
+
+#### Unshared storage: network block devices
+
+Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
+as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
+volumes, permissions and ownership can be managed on the client side by the Kubelet when used
+exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
+`ReadWriteOnce` mode, they are effectively unshared storage.
+
+When used by multiple pods, there are many additional use-cases to analyze before we can be
+confident that we can support SELinux label management robustly with these file systems.
The right
+design is one that makes it easy to experiment with and develop support for ownership management
+in volume plugins, to enable developers and cluster operators to continue exploring these issues.
+
+#### Shared storage: hostPath
+
+The `hostPath` volume should only be used by effective-root users, and the permissions of paths
+exposed into containers via hostPath volumes should always be managed by the cluster operator. If
+the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
+volume could effect changes to the state of arbitrary paths within the host's filesystem. This
+would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
+never perform ownership management for.
+
+#### Shared storage: network
+
+Ownership management of shared storage is a complex topic. SELinux labels for existing shared
+storage will be managed externally from Kubernetes. For this case, our API should make it simple to
+express whether a particular volume should have these concerns managed by Kubernetes.
+
+We will not attempt to address the concerns of new shared storage in this proposal.
+
+When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
+modes, it is shared storage, and thus outside the scope of this proposal.
+
+#### API requirements
+
+From the above, we know that label management must be applied:
+
+1. To some volume types always
+2. To some volume types never
+3. To some volume types *sometimes*
+
+Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it
+is desirable for other container runtime implementations to provide similar functionality.
+
+Relabeling should be an optional aspect of a volume plugin to accommodate:
+
+1. volume types for which generalized relabeling support is not sufficient
+2. 
testing for each volume plugin individually
+
+## Proposed Design
+
+Our design should minimize the code for handling SELinux labeling required in the Kubelet and
+volume plugins.
+
+### Deferral: MCS label allocation
+
+Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the
+primitives for higher-level composition; making these automatic is a longer-term goal. Allocating
+groups and allocating MCS labels are fairly complex problems in their own right, and so our
+proposal will not encompass either of these topics. There are several problems that the solution
+for allocation depends on:
+
+1. Users and groups in Kubernetes
+2. General auth policy in Kubernetes
+3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893)
+
+### API changes
+
+The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823)
+adds a `pod.Spec.SecurityContext.SELinuxOptions` field. The change to the API in this proposal is
+the addition of these semantics to the field:
+
+* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership
+management in the Kubelet have their SELinux context set from this field.
+
+```go
+package api
+
+type PodSecurityContext struct {
+	// SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+	// SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+	//
+	// This field will be used to set the SELinux context of volumes that support SELinux label
+	// management by the kubelet.
+	SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+The V1 API is extended with the same semantics:
+
+```go
+package v1
+
+type PodSecurityContext struct {
+	// SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
+	// SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
+	//
+	// This field will be used to set the SELinux context of volumes that support SELinux label
+	// management by the kubelet.
+	SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
+}
+```
+
+#### API backward compatibility
+
+Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
+SELinux label management for their volumes. This is acceptable since old clients won't know about
+this field and won't have any expectation of their volumes being managed this way.
+
+The existing backward compatibility semantics for SELinux do not change at all with this proposal.
+
+### Kubelet changes
+
+The Kubelet should be modified to perform SELinux label management when required for a volume. The
+criteria to activate the kubelet SELinux label management for volumes are:
+
+1. SELinux integration is enabled in the cluster
+2. SELinux is enabled on the node
+3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set
+4. The volume plugin supports SELinux label management
+
+The `volume.Mounter` interface should have a new method added that indicates whether the plugin
+supports SELinux label management:
+
+```go
+package volume
+
+type Mounter interface {
+	// other methods omitted
+	SupportsSELinux() bool
+}
+```
+
+Individual volume plugins are responsible for correctly reporting whether they support label
+management in the kubelet.
In the first round of work, only `hostPath` and `emptyDir` and its
+derivations will be tested with ownership management support:
+
+| Plugin Name             | SupportsOwnershipManagement   |
+|-------------------------|-------------------------------|
+| `hostPath`              | false                         |
+| `emptyDir`              | true                          |
+| `gitRepo`               | true                          |
+| `secret`                | true                          |
+| `downwardAPI`           | true                          |
+| `gcePersistentDisk`     | false                         |
+| `awsElasticBlockStore`  | false                         |
+| `nfs`                   | false                         |
+| `iscsi`                 | false                         |
+| `glusterfs`             | false                         |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd`                   | false                         |
+| `cinder`                | false                         |
+| `cephfs`                | false                         |
+
+Ultimately, the matrix will theoretically look like:
+
+| Plugin Name             | SupportsOwnershipManagement   |
+|-------------------------|-------------------------------|
+| `hostPath`              | false                         |
+| `emptyDir`              | true                          |
+| `gitRepo`               | true                          |
+| `secret`                | true                          |
+| `downwardAPI`           | true                          |
+| `gcePersistentDisk`     | true                          |
+| `awsElasticBlockStore`  | true                          |
+| `nfs`                   | false                         |
+| `iscsi`                 | true                          |
+| `glusterfs`             | false                         |
+| `persistentVolumeClaim` | depends on underlying volume and PV mode |
+| `rbd`                   | true                          |
+| `cinder`                | false                         |
+| `cephfs`                | false                         |
+
+In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be
+a function of the container runtime implementations. Initially, we will modify the Docker runtime
+implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish
+generic label management for Docker containers.
+
+Volume types that require SELinux context information at mount time must be injected with, and
+must respect, the enablement setting for labeling for that volume type. The proposed
+`VolumeConfig` mechanism will be used to carry information about label management enablement to
+the volume plugins that have to manage labels individually.
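The per-plugin reporting above amounts to a boolean per plugin name. A minimal sketch of the first-round matrix as a lookup (illustrative only: the real plugins each implement `SupportsSELinux()` themselves, and `persistentVolumeClaim` is omitted because its answer depends on the underlying volume):

```go
package main

import "fmt"

// firstRoundSupport mirrors the first-round support matrix above.
// persistentVolumeClaim is omitted: its answer depends on the underlying
// volume and the PV access mode.
var firstRoundSupport = map[string]bool{
	"hostPath":             false,
	"emptyDir":             true,
	"gitRepo":              true,
	"secret":               true,
	"downwardAPI":          true,
	"gcePersistentDisk":    false,
	"awsElasticBlockStore": false,
	"nfs":                  false,
	"iscsi":                false,
	"glusterfs":            false,
	"rbd":                  false,
	"cinder":               false,
	"cephfs":               false,
}

// supportsSELinux reports whether the kubelet should perform generic SELinux
// label management for the named plugin; unknown plugins default to false.
func supportsSELinux(plugin string) bool {
	return firstRoundSupport[plugin]
}

func main() {
	fmt.Println(supportsSELinux("emptyDir"), supportsSELinux("hostPath"))
}
```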
+ +This allows the volume plugins to determine when they do and don't want this type of support from +the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet. + + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selinux.md?pixel)]() + diff --git a/design/service_accounts.md b/design/service_accounts.md new file mode 100644 index 00000000..89a3771b --- /dev/null +++ b/design/service_accounts.md @@ -0,0 +1,210 @@ +# Service Accounts + +## Motivation + +Processes in Pods may need to call the Kubernetes API. For example: + - scheduler + - replication controller + - node controller + - a map-reduce type framework which has a controller that then tries to make a +dynamically determined number of workers and watch them + - continuous build and push system + - monitoring system + +They also may interact with services other than the Kubernetes API, such as: + - an image repository, such as docker -- both when the images are pulled to +start the containers, and for writing images in the case of pods that generate +images. + - accessing other cloud services, such as blob storage, in the context of a +large, integrated, cloud offering (hosted or private). + - accessing files in an NFS volume attached to the pod + +## Design Overview + +A service account binds together several things: + - a *name*, understood by users, and perhaps by peripheral systems, for an +identity + - a *principal* that can be authenticated and [authorized](../admin/authorization.md) + - a [security context](security_context.md), which defines the Linux +Capabilities, User IDs, Groups IDs, and other capabilities and controls on +interaction with the file system and OS. + - a set of [secrets](secrets.md), which a container may use to access various +networked resources. 
+
+## Design Discussion
+
+A new object Kind is added:
+
+```go
+type ServiceAccount struct {
+	TypeMeta   `json:",inline" yaml:",inline"`
+	ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"`
+
+	username        string
+	securityContext ObjectReference   // (reference to a securityContext object)
+	secrets         []ObjectReference // (references to secret objects)
+}
+```
+
+The name ServiceAccount is chosen because it is widely used already (e.g. by
+Kerberos and LDAP) to refer to this type of account. Note that it has no
+relation to Kubernetes Service objects.
+
+The ServiceAccount object does not include any information that could not be
+defined separately:
+  - username can be defined however users are defined.
+  - securityContext and secrets are only referenced and are created using the
+REST API.
+
+The purpose of the serviceAccount object is twofold:
+  - to bind usernames to securityContexts and secrets, so that the username can
+be used to refer to them succinctly in contexts where explicitly naming
+securityContexts and secrets would be inconvenient
+  - to provide an interface to simplify allocation of new securityContexts and
+secrets.
+
+These features are explained later.
+
+### Names
+
+From the standpoint of the Kubernetes API, a `user` is any principal which can
+authenticate to the Kubernetes API. This includes a human running `kubectl` on
+her desktop and a container in a Pod on a Node making API calls.
+
+There is already a notion of a username in Kubernetes, which is populated into a
+request context after authentication. However, there is no API object
+representing a user. While this may evolve, it is expected that in mature
+installations, the canonical storage of user identifiers will be handled by a
+system external to Kubernetes.
+
+Kubernetes does not dictate how to divide up the space of user identifier
+strings. User names can be simple Unix-style short usernames (e.g. 
`alice`), or
+may be qualified to allow for federated identity (`alice@example.com` vs.
+`alice@example.org`). A naming convention may distinguish service accounts from
+user accounts (e.g. `alice@example.com` vs.
+`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but
+Kubernetes does not require this.
+
+Kubernetes also does not require that there be a distinction between human and
+Pod users. It will be possible to set up a cluster where Alice the human talks
+to the Kubernetes API as username `alice` and starts pods that also talk to the
+API as user `alice` and write files to NFS as user `alice`. But this is not
+recommended.
+
+Instead, it is recommended that Pods and Humans have distinct identities, and
+reference implementations will make this distinction.
+
+The distinction is useful for a number of reasons:
+  - the requirements for humans and automated processes are different:
+    - Humans need a wide range of capabilities to do their daily activities.
+Automated processes often have more narrowly-defined activities.
+    - Humans may better tolerate the exceptional conditions created by
+expiration of a token. Remembering to handle this in a program is more annoying.
+So, either long-lasting credentials or automated rotation of credentials is
+needed.
+    - A Human typically keeps credentials on a machine that is not part of the
+cluster and so not subject to automatic management. A VM with a
+role/service-account can have its credentials automatically managed.
+  - the identity of a Pod cannot in general be mapped to a single human.
+    - If policy allows, it may be created by one human, and then updated by
+another, and another, until its behavior cannot be attributed to a single human.
+
+**TODO**: consider getting rid of the separate serviceAccount object and just
+rolling its parts into the SecurityContext or Pod object.
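The naming convention sketched earlier in this section can be made concrete. Both the `service-accounts` subdomain and the helper name below are hypothetical, since Kubernetes requires no particular convention:

```go
package main

import (
	"fmt"
	"strings"
)

// isServiceAccountName sketches one possible convention: usernames whose
// domain contains a "service-accounts" subdomain denote service accounts.
// Unqualified short names (e.g. "alice") are treated as human users.
func isServiceAccountName(username string) bool {
	at := strings.LastIndex(username, "@")
	if at < 0 {
		return false
	}
	domain := username[at+1:]
	for _, part := range strings.Split(domain, ".") {
		if part == "service-accounts" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isServiceAccountName("alice@example.com"))
	fmt.Println(isServiceAccountName("build-service-account-a3b7f0@foo-namespace.service-accounts.example.com"))
}
```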
+
+The `secrets` field is a list of references to /secret objects that a process
+started as that service account should have access to in order to assert that
+role.
+
+The secrets are not inline with the serviceAccount object. This way, most or
+all users can have permission to `GET /serviceAccounts` so they can remind
+themselves what serviceAccounts are available for use.
+
+Nothing will prevent creation of a serviceAccount with two secrets of type
+`SecretTypeKubernetesAuth`, or secrets of two different types. The Kubelet and
+client libraries will have some behavior, TBD, to handle the case of multiple
+secrets of a given type (pick first, or provide all and try each in order,
+etc.).
+
+When a serviceAccount and a matching secret exist, then a `User.Info` for the
+serviceAccount and a `BearerToken` from the secret are added to the map of
+tokens used by the authentication process in the apiserver, and similarly for
+other types. (We might have some types that do not do anything on the apiserver
+but just get pushed to the kubelet.)
+
+### Pods
+
+The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If
+this is unset, then a default value is chosen. If it is set, then the
+corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account
+Finalizer (see below).
+
+TBD: how policy limits which users can make pods with which service accounts.
+
+### Authorization
+
+Kubernetes API Authorization Policies refer to users. Pods created with a
+`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to
+authenticate to the Kubernetes API server as a particular user. So any policy
+that is desired can be applied to them.
+
+A higher-level workflow is needed to coordinate creation of serviceAccounts,
+secrets, and relevant policy objects. Users are free to extend Kubernetes to put
+this business logic wherever it is convenient for them, though the Service
+Account Finalizer is one place where this can happen (see below).
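The "pick first" option for multiple secrets of one type could look like the following sketch. The `Secret` shape and `SecretTypeKubernetesAuth` constant are simplified stand-ins, and the actual behavior is explicitly TBD above:

```go
package main

import "fmt"

// Secret is a pared-down stand-in for the real secret object.
type Secret struct {
	Name string
	Type string
}

// SecretTypeKubernetesAuth is an illustrative constant for the secret type
// discussed in the text.
const SecretTypeKubernetesAuth = "kubernetes-auth"

// firstSecretOfType implements the "pick first" strategy: return the first
// secret of the requested type, and whether one was found at all.
func firstSecretOfType(secrets []Secret, secretType string) (Secret, bool) {
	for _, s := range secrets {
		if s.Type == secretType {
			return s, true
		}
	}
	return Secret{}, false
}

func main() {
	secrets := []Secret{
		{Name: "token-a", Type: SecretTypeKubernetesAuth},
		{Name: "token-b", Type: SecretTypeKubernetesAuth},
	}
	s, ok := firstSecretOfType(secrets, SecretTypeKubernetesAuth)
	fmt.Println(s.Name, ok)
}
```

The alternative ("provide all and try each in order") would simply return the filtered slice instead of the first element.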
+ +### Kubelet + +The kubelet will treat as "not ready to run" (needing a finalizer to act on it) +any Pod which has an empty SecurityContext. + +The kubelet will set a default, restrictive, security context for any pods +created from non-Apiserver config sources (http, file). + +Kubelet watches apiserver for secrets which are needed by pods bound to it. + +**TODO**: how to only let kubelet see secrets it needs to know. + +### The service account finalizer + +There are several ways to use Pods with SecurityContexts and Secrets. + +One way is to explicitly specify the securityContext and all secrets of a Pod +when the pod is initially created, like this: + +**TODO**: example of pod with explicit refs. + +Another way is with the *Service Account Finalizer*, a plugin process which is +optional, and which handles business logic around service accounts. + +The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount +definitions. + +First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no +`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext +and secrets references for the corresponding `serviceAccount`. + +Second, if ServiceAccount definitions change, it may take some actions. + +**TODO**: decide what actions it takes when a serviceAccount definition changes. +Does it stop pods, or just allow someone to list ones that are out of spec? In +general, people may want to customize this? + +Third, if a new namespace is created, it may create a new serviceAccount for +that namespace. This may include a new username (e.g. +`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), +a new securityContext, a newly generated secret to authenticate that +serviceAccount to the Kubernetes API, and default policies for that service +account. + +**TODO**: more concrete example. What are typical default permissions for +default service account (e.g. 
read-only access to services in the same namespace
+and read-write access to events in that namespace?)
+
+Finally, it may provide an interface to automate creation of new
+serviceAccounts. In that case, the user may want to GET serviceAccounts to see
+what has been created.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]()
+
diff --git a/design/simple-rolling-update.md b/design/simple-rolling-update.md
new file mode 100644
index 00000000..c4a5f671
--- /dev/null
+++ b/design/simple-rolling-update.md
@@ -0,0 +1,131 @@
+## Simple rolling update
+
+This is a lightweight design document for simple
+[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`.
+
+The complete execution flow can be found [here](#execution-details). See the
+[example of rolling update](../user-guide/update-demo/) for more information.
+
+### Lightweight rollout
+
+Assume that we have a current replication controller named `foo` and it is
+running image `image:v1`.
+
+`kubectl rolling-update foo [foo-v2] --image=myimage:v2`
+
+If the user doesn't specify a name for the 'next' replication controller, then
+the 'next' replication controller is renamed to
+the name of the original replication controller.
+
+Obviously there is a race here, where if you kill the client between deleting
+`foo` and creating the new version of 'foo', you might be surprised about what
+is there, but I think that's ok. See [Recovery](#recovery) below.
+
+If the user does specify a name for the 'next' replication controller, then the
+'next' replication controller is retained with its existing name, and the old
+'foo' replication controller is deleted. For the purposes of the rollout, we add
+a unique-ifying label `kubernetes.io/deployment` to both the `foo` and
+`foo-next` replication controllers. The value of that label is the hash of the
+complete JSON representation of the `foo-next` or `foo` replication controller.
+
+The name of this label can be overridden by the user with the
+`--deployment-label-key` flag.
+
+#### Recovery
+
+If a rollout fails or is terminated in the middle, it is important that the user
+be able to resume the rollout. To facilitate recovery in the case of a crash of
+the updating process itself, we add the following annotations to each
+replication controller in the `kubernetes.io/` annotation namespace:
+ * `desired-replicas` The desired number of replicas for this replication
+controller (either N or zero)
+ * `update-partner` A pointer to the replication controller resource that is
+the other half of this update (syntax `` the namespace is assumed to be
+identical to the namespace of this replication controller.)
+
+Recovery is achieved by issuing the same command again:
+
+```sh
+kubectl rolling-update foo [foo-v2] --image=myimage:v2
+```
+
+Whenever the rolling update command executes, the kubectl client looks for
+replication controllers called `foo` and `foo-next`; if they exist, an attempt
+is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is
+created, and the rollout is a new rollout. If `foo` doesn't exist, then it is
+assumed that the rollout is nearly completed, and `foo-next` is renamed to
+`foo`. Details of the execution flow are given below.
+
+
+### Aborting a rollout
+
+Abort is assumed to want to reverse a rollout in progress.
+
+`kubectl rolling-update foo [foo-v2] --rollback`
+
+This is really just syntactic sugar for:
+
+`kubectl rolling-update foo-v2 foo`
+
+With the added detail that it moves the `desired-replicas` annotation from
+`foo-v2` to `foo`.
+
+
+### Execution Details
+
+For the purposes of this example, assume that we are rolling from `foo` to
+`foo-next`, where the only change is an image update from `v1` to `v2`.
+
+If the user doesn't specify a `foo-next` name, then it is discovered from
+the `update-partner` annotation on `foo`. 
If that annotation doesn't exist,
+then `foo-next` is synthesized using the pattern
+`-`
+
+#### Initialization
+
+ * If `foo` and `foo-next` do not exist:
+   * Exit, and indicate an error to the user that the specified controller
+doesn't exist.
+ * If `foo` exists, but `foo-next` does not:
+   * Create `foo-next`, populate it with the `v2` image, and set
+`desired-replicas` to `foo.Spec.Replicas`
+   * Goto Rollout
+ * If `foo-next` exists, but `foo` does not:
+   * Assume that we are in the rename phase.
+   * Goto Rename
+ * If both `foo` and `foo-next` exist:
+   * Assume that we are in a partial rollout
+   * If `foo-next` is missing the `desired-replicas` annotation
+     * Populate the `desired-replicas` annotation on `foo-next` using the
+current size of `foo`
+   * Goto Rollout
+
+#### Rollout
+
+ * While size of `foo-next` < `desired-replicas` annotation on `foo-next`
+   * increase size of `foo-next`
+   * if size of `foo` > 0
+     decrease size of `foo`
+ * Goto Rename
+
+#### Rename
+
+ * delete `foo`
+ * create `foo` that is identical to `foo-next`
+ * delete `foo-next`
+
+#### Abort
+
+ * If `foo-next` doesn't exist
+   * Exit and indicate to the user that they may want to simply do a new
+rollout with the old version
+ * If `foo` doesn't exist
+   * Exit and indicate not found to the user
+ * Otherwise, `foo-next` and `foo` both exist
+   * Set `desired-replicas` annotation on `foo` to match the annotation on
+`foo-next`
+   * Goto Rollout with `foo` and `foo-next` trading places.
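The Rollout loop above can be simulated with simple resize arithmetic. This is a sketch only; the real client issues resize calls against the API server, and the function name is ours:

```go
package main

import "fmt"

// rollout simulates the Rollout loop: grow foo-next toward desiredReplicas,
// shrinking foo by one for each replica added, never below zero.
func rollout(fooSize, fooNextSize, desiredReplicas int) (int, int) {
	for fooNextSize < desiredReplicas {
		fooNextSize++ // increase size of foo-next
		if fooSize > 0 {
			fooSize-- // decrease size of foo
		}
	}
	return fooSize, fooNextSize
}

func main() {
	// Rolling a 3-replica foo to foo-next: foo drains as foo-next grows.
	foo, fooNext := rollout(3, 0, 3)
	fmt.Println(foo, fooNext)
}
```

Note that the loop is idempotent over partial progress: re-running it from any intermediate sizes (as Recovery does after a crash) converges to the same final state.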
+ + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]() + diff --git a/design/taint-toleration-dedicated.md b/design/taint-toleration-dedicated.md new file mode 100644 index 00000000..c523319f --- /dev/null +++ b/design/taint-toleration-dedicated.md @@ -0,0 +1,291 @@ +# Taints, Tolerations, and Dedicated Nodes + +## Introduction + +This document describes *taints* and *tolerations*, which constitute a generic +mechanism for restricting the set of pods that can use a node. We also describe +one concrete use case for the mechanism, namely to limit the set of users (or +more generally, authorization domains) who can access a set of nodes (a feature +we call *dedicated nodes*). There are many other uses--for example, a set of +nodes with a particular piece of hardware could be reserved for pods that +require that hardware, or a node could be marked as unschedulable when it is +being drained before shutdown, or a node could trigger evictions when it +experiences hardware or software problems or abnormal node configurations; see +issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and +[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion. + +## Taints, tolerations, and dedicated nodes + +A *taint* is a new type that is part of the `NodeSpec`; when present, it +prevents pods from scheduling onto the node unless the pod *tolerates* the taint +(tolerations are listed in the `PodSpec`). Note that there are actually multiple +flavors of taints: taints that prevent scheduling on a node, taints that cause +the scheduler to try to avoid scheduling on a node but do not prevent it, taints +that prevent a pod from starting on Kubelet even if the pod's `NodeName` was +written directly (i.e. pod did not go through the scheduler), and taints that +evict already-running pods. 
+[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) +has more background on these different scenarios. We will focus on the first +kind of taint in this doc, since it is the kind required for the "dedicated +nodes" use case. + +Implementing dedicated nodes using taints and tolerations is straightforward: in +essence, a node that is dedicated to group A gets taint `dedicated=A` and the +pods belonging to group A get toleration `dedicated=A`. (The exact syntax and +semantics of taints and tolerations are described later in this doc.) This keeps +all pods except those belonging to group A off of the nodes. This approach +easily generalizes to pods that are allowed to schedule into multiple dedicated +node groups, and nodes that are a member of multiple dedicated node groups. + +Note that because tolerations are at the granularity of pods, the mechanism is +very flexible -- any policy can be used to determine which tolerations should be +placed on a pod. So the "group A" mentioned above could be all pods from a +particular namespace or set of namespaces, or all pods with some other arbitrary +characteristic in common. We expect that any real-world usage of taints and +tolerations will employ an admission controller to apply the tolerations. For +example, to give all pods from namespace A access to dedicated node group A, an +admission controller would add the corresponding toleration to all pods from +namespace A. Or to give all pods that require GPUs access to GPU nodes, an +admission controller would add the toleration for GPU taints to pods that +request the GPU resource. + +Everything that can be expressed using taints and tolerations can be expressed +using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. +in the example in the previous paragraph, you could put a label `dedicated=A` on +the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not* +belonging to group A. 
But it is cumbersome to express exclusion policies using
+node affinity because every time you add a new type of restricted node, all pods
+that aren't allowed to use those nodes need to start avoiding those nodes using
+node affinity. This means the node affinity list can get quite long in clusters
+with lots of different groups of special nodes (lots of dedicated node groups,
+lots of different kinds of special hardware, etc.). Moreover, you also need to
+update any Pending pods when you add new types of special nodes. In contrast,
+with taints and tolerations, when you add a new type of special node, "regular"
+pods are unaffected, and you just need to add the necessary toleration to the
+pods you subsequently create that need to use the new type of special nodes. To
+put it another way, with taints and tolerations, only pods that use a set of
+special nodes need to know about those special nodes; with the node affinity
+approach, pods that have no interest in those special nodes need to know about
+all of the groups of special nodes.
+
+One final comment: in practice, it is often desirable to not only keep "regular"
+pods off of special nodes, but also to keep "special" pods off of regular nodes.
+An example in the dedicated nodes case is to not only keep regular users off of
+dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
+nodes. In this case, the "non-dedicated" nodes can be modeled as their own
+dedicated node group (for example, tainted as `dedicated=shared`), and pods that
+are not given access to any dedicated nodes ("regular" pods) would be given a
+toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations
+will be added by an admission controller.) In this case taints/tolerations are
+still better than node affinity because with taints/tolerations each pod only
+needs one special "marking", versus in the node affinity case where every time
+you add a dedicated node group (i.e.
a new `dedicated=` value), you need to add
+a new node affinity rule to all pods (including pending pods) except the ones
+allowed to use that new dedicated node group.
+
+## API
+
+```go
+// The node this Taint is attached to has the effect "effect" on
+// any pod that does not tolerate the Taint.
+type Taint struct {
+ Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
+ Value string `json:"value,omitempty"`
+ Effect TaintEffect `json:"effect"`
+}
+
+type TaintEffect string
+
+const (
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // but allow all pods submitted to Kubelet without going through the scheduler
+ // to start, and allow all already-running pods to continue running.
+ // Enforced by the scheduler.
+ TaintEffectNoSchedule TaintEffect = "NoSchedule"
+ // Like TaintEffectNoSchedule, but the scheduler tries not to schedule
+ // new pods onto the node, rather than prohibiting new pods from scheduling
+ // onto the node. Enforced by the scheduler.
+ TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // do not allow pods to start on Kubelet unless they tolerate the taint,
+ // but allow all already-running pods to continue running.
+ // Enforced by the scheduler and Kubelet.
+ TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
+ // Do not allow new pods to schedule unless they tolerate the taint,
+ // do not allow pods to start on Kubelet unless they tolerate the taint,
+ // and try to eventually evict any already-running pods that do not tolerate the taint.
+ // Enforced by the scheduler and Kubelet.
+ TaintEffectNoScheduleNoAdmitNoExecute = "NoScheduleNoAdmitNoExecute"
+)
+
+// The pod this Toleration is attached to tolerates any taint that matches
+// the triple using the matching operator .
+type Toleration struct { + Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` + // operator represents a key's relationship to the value. + // Valid operators are Exists and Equal. Defaults to Equal. + // Exists is equivalent to wildcard for value, so that a pod can + // tolerate all taints of a particular category. + Operator TolerationOperator `json:"operator"` + Value string `json:"value,omitempty"` + Effect TaintEffect `json:"effect"` + // TODO: For forgiveness (#1574), we'd eventually add at least a grace period + // here, and possibly an occurrence threshold and period. +} + +// A toleration operator is the set of operators that can be used in a toleration. +type TolerationOperator string + +const ( + TolerationOpExists TolerationOperator = "Exists" + TolerationOpEqual TolerationOperator = "Equal" +) + +``` + +(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) +to understand the motivation for the various taint effects.) + +We will add: + +```go + // Multiple tolerations with the same key are allowed. + Tolerations []Toleration `json:"tolerations,omitempty"` +``` + +to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type +TaintEffectPreferNoSchedule) in order to be able to schedule onto that node. + +We will add: + +```go + // Multiple taints with the same key are not allowed. + Taints []Taint `json:"taints,omitempty"` +``` + +to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union +of the taints specified by various sources. For now, the only source is +the `NodeSpec` itself, but in the future one could imagine a node inheriting +taints from pods (if we were to allow taints to be attached to pods), from +the node's startup configuration, etc. The scheduler should look at the `Taints` +in `NodeStatus`, not in `NodeSpec`. + +Taints and tolerations are not scoped to namespace. 
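Given the types above, the scheduler predicate this proposal describes (a pod must tolerate every taint except those with `TaintEffectPreferNoSchedule`) could be sketched as follows. The matching logic is our reading of the proposal, not final code; the field names mirror the API types above:

```go
package main

import "fmt"

type TaintEffect string

const (
	TaintEffectNoSchedule       TaintEffect = "NoSchedule"
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
)

type Taint struct {
	Key    string
	Value  string
	Effect TaintEffect
}

type TolerationOperator string

const (
	TolerationOpExists TolerationOperator = "Exists"
	TolerationOpEqual  TolerationOperator = "Equal"
)

type Toleration struct {
	Key      string
	Operator TolerationOperator
	Value    string
	Effect   TaintEffect
}

// tolerates reports whether a single toleration matches a taint:
// key and effect must match, and Exists acts as a wildcard for the value.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Key != taint.Key || tol.Effect != taint.Effect {
		return false
	}
	return tol.Operator == TolerationOpExists || tol.Value == taint.Value
}

// podFitsTaints is the predicate: every taint except PreferNoSchedule
// must be tolerated by at least one of the pod's tolerations.
func podFitsTaints(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect == TaintEffectPreferNoSchedule {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "banana", Effect: TaintEffectNoSchedule}}
	pod := []Toleration{{Key: "dedicated", Operator: TolerationOpEqual, Value: "banana", Effect: TaintEffectNoSchedule}}
	fmt.Println(podFitsTaints(pod, taints), podFitsTaints(nil, taints))
}
```

A priority function for `TaintEffectPreferNoSchedule` would use the same matching helper but score nodes down instead of filtering them out.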
+
+## Implementation plan: taints, tolerations, and dedicated nodes
+
+Using taints and tolerations to implement dedicated nodes requires these steps:
+
+1. Add the API described above
+1. Add a scheduler predicate function that respects taints and tolerations (for
+TaintEffectNoSchedule) and a scheduler priority function that respects taints
+and tolerations (for TaintEffectPreferNoSchedule).
+1. Add to the Kubelet code to implement the "no admit" behavior of
+TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute
+1. Implement code in Kubelet that evicts a pod that no longer satisfies
+TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the
+controllers instead, but since taints might be used to enforce security
+policies, it is better to do it in the kubelet because the kubelet can respond
+quickly and can guarantee the rules will be applied to all pods. Eviction may
+need to happen under a variety of circumstances: when a taint is added, when an
+existing taint is updated, when a toleration is removed from a pod, or when a
+toleration is modified on a pod.
+1. Add a new `kubectl` command that adds/removes taints to/from nodes.
+1. (This is the one step that is specific to dedicated nodes) Implement an
+admission controller that adds tolerations to pods that are supposed to be
+allowed to use dedicated nodes (for example, based on the pod's namespace).
+
+In the future one can imagine a generic policy configuration that configures an
+admission controller to apply the appropriate tolerations to the desired class
+of pods and taints to Nodes upon node creation. It could be used not just for
+policies about dedicated nodes, but also other uses of taints and tolerations,
+e.g. nodes that are restricted due to their hardware configuration.
+
+The `kubectl` command to add and remove taints on nodes will be modeled after
+`kubectl label`.
Example usages:
+
+```sh
+# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
+# If a taint with that key already exists, its value and effect are replaced as specified.
+$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute
+
+# Remove from node 'foo' the taint with key 'dedicated' if one exists.
+$ kubectl taint nodes foo dedicated-
+```
+
+## Example: implementing a dedicated nodes policy
+
+Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
+only to pods in a particular namespace `banana`. First the administrator does
+
+```sh
+$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
+$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute
+```
+
+(assuming they want to evict pods that are already running on those nodes if those
+pods don't already tolerate the new taint)
+
+Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
+a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
+
+In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
+to enumerate them by name.
+
+## Future work
+
+At present, the Kubernetes security model allows any user to add and remove any
+taints and tolerations. Obviously this makes it impossible to securely enforce
+rules like dedicated nodes. We need some mechanism that prevents regular users
+from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them
+from mutating any fields of `NodeSpec`) and from mutating the `Tolerations`
+field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549)
+is relevant.
+
+Another security vulnerability arises if nodes are added to the cluster before
+receiving their taint.
Thus we need to ensure that a new node does not become
+"Ready" until it has been configured with its taints. One way to do this is to
+have an admission controller that adds the taint whenever a Node object is
+created.
+
+A quota policy may want to treat nodes differently based on what taints, if any,
+they have. For example, if a particular namespace is only allowed to access
+dedicated nodes, then it may be convenient to give the namespace unlimited
+quota. (To use finite quota, you'd have to size the namespace's quota to the sum
+of the sizes of the machines in the dedicated node group, and update it when
+nodes are added/removed to/from the group.)
+
+It's conceivable that taints and tolerations could be unified with
+[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
+We have chosen not to do this for the reasons described in the "Future work"
+section of that doc.
+
+## Backward compatibility
+
+Old scheduler versions will ignore taints and tolerations. New scheduler
+versions will respect them.
+
+Users should not start using taints and tolerations until the full
+implementation has been in Kubelet and the master for enough binary versions
+that we feel comfortable that we will not need to roll back either Kubelet or
+master to a version that does not support them. Longer-term we will use a
+programmatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
+
+## Related issues
+
+This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190).
+There are a number of other related issues, all of which are linked to from
+[#17190](https://github.com/kubernetes/kubernetes/issues/17190).
+
+The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574).
+
+The concepts of taints and tolerations were originally developed as part of the
+Omega project at Google.
+ + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]() + diff --git a/design/ubernetes-cluster-state.png b/design/ubernetes-cluster-state.png new file mode 100644 index 00000000..56ec2df8 Binary files /dev/null and b/design/ubernetes-cluster-state.png differ diff --git a/design/ubernetes-design.png b/design/ubernetes-design.png new file mode 100644 index 00000000..44924846 Binary files /dev/null and b/design/ubernetes-design.png differ diff --git a/design/ubernetes-scheduling.png b/design/ubernetes-scheduling.png new file mode 100644 index 00000000..01774882 Binary files /dev/null and b/design/ubernetes-scheduling.png differ diff --git a/design/versioning.md b/design/versioning.md new file mode 100644 index 00000000..ae724b12 --- /dev/null +++ b/design/versioning.md @@ -0,0 +1,174 @@ +# Kubernetes API and Release Versioning + +Reference: [Semantic Versioning](http://semver.org) + +Legend: + +* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. +This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the +major version, **Y** is the minor version, and **Z** is the patch version.) +* **API vX[betaY]** refers to the version of the HTTP API. + +## Release versioning + +### Minor version scheme and timeline + +* Kube X.Y.0-alpha.W, W > 0 (Branch: master) + * Alpha releases are released roughly every two weeks directly from the master +branch. + * No cherrypick releases. If there is a critical bugfix, a new release from +master can be created ahead of schedule. +* Kube X.Y.Z-beta.W (Branch: release-X.Y) + * When master is feature-complete for Kube X.Y, we will cut the release-X.Y +branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential +to X.Y. + * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. 
+ * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases
+(X.Y.0-beta.W | W > 0) as necessary.
+* Kube X.Y.0 (Branch: release-X.Y)
+ * Final release, cut from the release-X.Y branch that was cut two weeks prior.
+ * X.Y.1-beta.0 will be tagged at the same commit on the same branch.
+ * X.Y.0 occurs 3 to 4 months after X.(Y-1).0.
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y)
+ * [Patch releases](#patch-releases) are released as we cherrypick commits into
+the release-X.Y branch (which is at X.Y.Z-beta.W) as needed.
+ * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is
+tagged on the followup commit that updates pkg/version/base.go with the beta
+version.
+* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z)
+ * These are special and different in that the X.Y.Z tag is branched to isolate
+the emergency/critical fix from all other changes that have landed on the
+release branch since the previous tag
+ * Cut release-X.Y.Z branch to hold the isolated patch release
+ * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1)
+ * Branched [patch releases](#patch-releases) are rarely needed but used for
+emergency/critical fixes to the latest release
+ * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed
+for this kind of release to be possible.
+
+### Major version timeline
+
+There is no mandated timeline for major versions. They only occur when we need
+to start the clock on deprecating features. A given major version should be the
+latest major version for at least one year from its original release date.
+
+### CI and dev version scheme
+
+* Continuous integration versions also exist, and are versioned off of alpha and
+beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an
+additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after
+X.Y.Z-beta.W, with an additional +bbbb build suffix added.
Furthermore, builds
+that are built off of a dirty build tree (during development, with things in
+the tree that are not checked in) will have -dirty appended.
+
+### Supported releases and component skew
+
+We expect users to stay reasonably up-to-date with the versions of Kubernetes
+they use in production, but understand that it may take time to upgrade,
+especially for production-critical components.
+
+We expect users to be running approximately the latest patch release of a given
+minor release; we often include critical bug fixes in
+[patch releases](#patch-releases), and so encourage users to upgrade as soon as
+possible.
+
+Different components are expected to be compatible across different amounts of
+skew, all relative to the master version. Nodes may lag master components by
+up to two minor versions but should be at a version no newer than the master; a
+client should be skewed no more than one minor version from the master, but may
+lead the master by up to one minor version. For example, a v1.3 master should
+work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4
+clients.
+
+Furthermore, we expect to "support" three minor releases at a time. "Support"
+means we expect users to be running that version in production, though we may
+not port fixes back before the latest minor version. For example, when v1.3
+comes out, v1.0 will no longer be supported: basically, that means that the
+reasonable response to the question "my v1.0 cluster isn't working" is, "you
+should probably upgrade it (and probably should have some time ago)". With
+minor releases happening approximately every three months, that means a minor
+release is supported for approximately nine months.
+
+This policy is in line with
+[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade).
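The skew rules in this section can be expressed as a small check. This is a sketch: it compares minor versions only (within one major version), and the function names are ours, not part of any Kubernetes API:

```go
package main

import "fmt"

// nodeSkewOK: nodes may lag the master by up to two minor versions,
// but must not be newer than the master.
func nodeSkewOK(masterMinor, nodeMinor int) bool {
	return nodeMinor <= masterMinor && masterMinor-nodeMinor <= 2
}

// clientSkewOK: clients may be at most one minor version away from
// the master, in either direction.
func clientSkewOK(masterMinor, clientMinor int) bool {
	d := masterMinor - clientMinor
	if d < 0 {
		d = -d
	}
	return d <= 1
}

func main() {
	// v1.3 master: v1.1 nodes are fine, v1.4 nodes are not.
	fmt.Println(nodeSkewOK(3, 1), nodeSkewOK(3, 4))
	// v1.3 master: v1.4 clients are fine, v1.1 clients are not.
	fmt.Println(clientSkewOK(3, 4), clientSkewOK(3, 1))
}
```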
+ +## API versioning + +### Release versions as related to API versions + +Here is an example major release cycle: + +* **Kube 1.0 should have API v1 without v1beta\* API versions** + * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have +the stable v1 API. This enables you to migrate all your objects off of the beta +API versions of the API and allows us to remove those beta API versions in Kube +1.0 with no effect. There will be tooling to help you detect and migrate any +v1beta\* data versions or calls to v1 before you do the upgrade. +* **Kube 1.x may have API v2beta*** + * The first incarnation of a new (backwards-incompatible) API in HEAD is + v2beta1. By default this will be unregistered in apiserver, so it can change + freely. Once it is available by default in apiserver (which may not happen for +several minor releases), it cannot change ever again because we serialize +objects in versioned form, and we always need to be able to deserialize any +objects that are saved in etcd, even between alpha versions. If further changes +to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x +versions. +* **Kube 1.y (where y is the last version of the 1.x series) must have final +API v2** + * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two + things: (1) users can upgrade to API v2 when running Kube 1.x and then switch + over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can + cleanup and remove all API v2beta\* versions because no one should have + v2beta\* objects left in their database. As mentioned above, tooling will exist + to make sure there are no calls or references to a given API version anywhere + inside someone's kube installation before someone upgrades. + * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. +It *may* include the v1 API as well if the burden is not high - this will be +determined on a per-major-version basis. 
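The progression described above (v1 → v2beta1 → v2beta2 → v2) implies an ordering over API versions, which could be sketched as follows. This is illustrative only, not how Kubernetes actually parses or compares API versions:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Matches versions like "v1", "v2beta1", "v2alpha3".
var apiVersionRE = regexp.MustCompile(`^v(\d+)(?:(alpha|beta)(\d+))?$`)

// parse splits an API version into its major number, stage
// ("ga", "beta", or "alpha"), and stage number.
func parse(v string) (major int, stage string, stageNum int, ok bool) {
	m := apiVersionRE.FindStringSubmatch(v)
	if m == nil {
		return 0, "", 0, false
	}
	major, _ = strconv.Atoi(m[1])
	if m[2] == "" {
		return major, "ga", 0, true
	}
	stageNum, _ = strconv.Atoi(m[3])
	return major, m[2], stageNum, true
}

// newer reports whether a is a more mature version than b:
// higher major wins; within a major, ga > beta > alpha; then stage number.
func newer(a, b string) bool {
	rank := map[string]int{"alpha": 0, "beta": 1, "ga": 2}
	am, as, an, _ := parse(a)
	bm, bs, bn, _ := parse(b)
	if am != bm {
		return am > bm
	}
	if rank[as] != rank[bs] {
		return rank[as] > rank[bs]
	}
	return an > bn
}

func main() {
	fmt.Println(newer("v2", "v2beta2"), newer("v2beta2", "v2beta1"), newer("v2beta1", "v1"))
}
```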
+
+#### Rationale for API v2 being complete before v2.0's release
+
+It may seem a bit strange to complete the v2 API before v2.0 is released,
+but *adding* a v2 API is not a breaking change. *Removing* the v2beta\*
+APIs *is* a breaking change, which is what necessitates the major version bump.
+There are other ways to do this, but having the major release be the fresh start
+of that release's API without the baggage of its beta versions seems most
+intuitive out of the available options.
+
+## Patch releases
+
+Patch releases are intended for critical bug fixes to the latest minor version,
+such as addressing security vulnerabilities, fixes to problems affecting a large
+number of users, severe problems with no workaround, and blockers for products
+based on Kubernetes.
+
+They should not contain miscellaneous feature additions or improvements, and
+especially no incompatibilities should be introduced between patch versions of
+the same minor version (or even major version).
+
+Dependencies, such as Docker or Etcd, should also not be changed unless
+absolutely necessary, and then only to fix critical bugs (so, at most patch
+version changes, not new major nor minor versions).
+
+## Upgrades
+
+* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a
+rolling upgrade across their cluster. (Rolling upgrade means being able to
+upgrade the master first, then one node at a time. See #4855 for details.)
+ * However, we do not recommend upgrading more than two minor releases at a
+time (see [Supported releases](#supported-releases-and-component-skew)), and do
+not recommend running non-latest patch releases of a given minor release.
+* No hard breaking changes over version boundaries.
+ * For example, if a user is at Kube 1.x, we may require them to upgrade to
+Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across
+major versions (e.g.
Kube 1.x to Kube 2.x) should effectively be a no-op and as
+graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone
+to go from 1.x to 1.x+y before they go to 2.x.
+
+There is a separate question of how to track the capabilities of a kubelet to
+facilitate rolling upgrades. That is not addressed here.
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]()
+
diff --git a/design/volume-snapshotting.md b/design/volume-snapshotting.md
new file mode 100644
index 00000000..e92ed3d1
--- /dev/null
+++ b/design/volume-snapshotting.md
@@ -0,0 +1,523 @@
+Kubernetes Snapshotting Proposal
+================================
+
+**Authors:** [Cindy Wang](https://github.com/ciwang)
+
+## Background
+
+Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
+
+Typical existing backup solutions offer on demand or scheduled snapshots.
+
+An application developer using a storage system may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume.
+
+Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves.
+
+## Objectives
+
+For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals.
+
+* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers.
+
+ * Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods.
+
+* Goal 2: Expose standardized snapshotting operations Create and List in Kubernetes REST API.
+
+ * Nongoal: Support Delete and Restore snapshot operations in API.
+
+* Goal 3: Implement snapshotting interface for GCE PDs.
+
+ * Nongoal: Implement snapshotting interface for non GCE PD volumes.
+
+### Feature Roadmap
+
+Major features, in order of priority (bold features are priorities for v1):
+
+* **On demand snapshots**
+
+ * **API to create new snapshots and list existing snapshots**
+
+ * API to restore a disk from a snapshot and delete old snapshots
+
+* Scheduled snapshots
+
+* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node)
+
+## Requirements
+
+### Performance
+
+* Time SLA from issuing a snapshot to completion:
+
+ * The period we are interested in is the time between the scheduled snapshot time and the time the snapshot finishes uploading to its storage location.
+
+ * This should be on the order of a few minutes.
+
+### Reliability
+
+* Data corruption
+
+ * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons:
+
+  * GCE and Amazon can create snapshots while the application is running.
+
+  * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code.
+
+  * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown.
+ +* Snapshot failure + + * Case: Failure during external process, such as during API call or upload + + * Log error, retry until success (indefinitely) + + * Case: Failure within Kubernetes, such as controller restarts + + * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state. + +## Solution Overview + +Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects. + +* **Create:** + + * Key: create.snapshot.volume.alpha.kubernetes.io + + * Value: [snapshot name] + +* **List:** + + * Key: snapshot.volume.alpha.kubernetes.io/[snapshot name] + + * Value: [snapshot timestamp] + +A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotation will be populated by the controller and only identify all snapshots created for that PVC by Kubernetes. + +The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage). + +## Detailed Design + +### API + +* Create snapshot + + * Usage: + + * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter + + * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list. 
+ + * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually. + + * PVC control loop in master + + * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV. + + * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV. + +* List existing snapshots + + * Only displayed as annotations on PVC object. + + * Only lists unique names and timestamps of snapshots taken using the Kubernetes API. + + * Usage: + + * Get the PVC object + + * Snapshots are listed as key-value pairs within the PVC annotations + +### SnapshotController + +![Snapshot Controller Diagram](volume-snapshotting.png?raw=true "Snapshot controller diagram") + +**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests. + +**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully. + +**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful. + +The controller will have a loop that does the following: + +* Fetch State + + * Fetch all PVC objects from the API server. + +* Act + + * Trigger snapshot: + + * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation. 
+
+* Persist State
+
+    * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC annotations and delete the create snapshot annotation in the PVC object via the API server.
+
+Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead, the reconciler should spawn separate threads for these operations via the operation executor.
+
+The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor.
+
+### Create Snapshot Logic
+
+To create a snapshot:
+
+* Acquire the operation lock for the volume so that no other attach or detach operations can be started for it.
+
+    * Abort if there is already a pending operation for the specified volume (the main loop will retry, if needed).
+
+* Spawn a new thread:
+
+    * Execute the volume-specific logic to create a snapshot of the persistent volume referenced by the PVC.
+
+    * For any errors, log the error and terminate the thread (the main controller will retry as needed).
+
+    * Once a snapshot is created successfully:
+
+        * Make a call to the API server to delete the create snapshot annotation in the PVC object.
+
+        * Make a call to the API server to add the new snapshot ID/timestamp to the PVC annotations.
+
+*Brainstorming notes below, read at your own risk!*
+
+* * *
+
+
+Open questions:
+
+* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API?
+
+    * It seems that the API route is a bit more feasible in implementation and can also be fully utilized.
+
+    * Can the API call methods on VolumePlugins? Yes, via the controller.
+
+    * The scheduler gives users functionality that doesn’t already exist, but requires adding an entirely new controller.
+
+* Should the list and restore operations be part of v1?
+
+* Do we call them snapshots or backups?
+
+    * From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice it is necessary, but not sufficient, when conducting a backup of a stateful application."
+
+* At what minimum granularity should snapshots be allowed?
+
+* How do we store information about the most recent snapshot in case the controller restarts?
+
+* In case of error, do we err on the side of fewer or more snapshots?
+
+Snapshot Scheduler
+
+1. PVC API Object
+
+A new field, backupSchedule, will be added to the PVC API object. The value of this field must be a cron expression.
+
+* CRUD operations on snapshot schedules
+
+    * Create: Specify a snapshot schedule within a PVC spec as a [cron expression](http://crontab-generator.org/)
+
+        * The cron expression provides flexibility to decrease the interval between snapshots in future versions
+
+    * Read: Display the snapshot schedule to the user via kubectl get pvc
+
+    * Update: Do not support changing the snapshot schedule for an existing PVC
+
+    * Delete: Do not support deleting the snapshot schedule for an existing PVC
+
+        * In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name
+
+* Validation
+
+    * Cron expressions must have a 0 in the minutes place and use exact, not interval, syntax
+
+        * [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, and a GCE PD snapshot likewise takes at most minutes. Therefore, for v1 we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals).
+
+    * If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user
+
+Objective
+
+Goal: Enable automatic periodic snapshotting (NOTE: a snapshot is a read-only copy of a disk) for all Kubernetes volume plugins.
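The validation rules sketched earlier (a standard five-field cron expression, a literal 0 in the minutes place, and exact rather than interval syntax) could be checked roughly as follows; this is an illustration, not the proposed validation code, and the function name is invented:

```go
package main

import (
	"fmt"
	"strings"
)

// validateSnapshotCron applies the v1 restrictions: a standard
// five-field cron expression, a literal 0 in the minutes place (so
// snapshots happen at most hourly), and no "/" interval syntax.
func validateSnapshotCron(expr string) error {
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return fmt.Errorf("expected 5 cron fields, got %d", len(fields))
	}
	if fields[0] != "0" {
		return fmt.Errorf("minutes field must be 0, got %q", fields[0])
	}
	for _, f := range fields {
		if strings.Contains(f, "/") {
			return fmt.Errorf("interval syntax not allowed: %q", f)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateSnapshotCron("0 3 * * *"))   // daily at 03:00 -> accepted
	fmt.Println(validateSnapshotCron("*/5 * * * *")) // interval syntax -> rejected
}
```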
+
+Goal: Implement the snapshotting interface for GCE PDs.
+
+Goal: Protect against data loss by allowing users to restore snapshots of their disks.
+
+Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes.
+
+Nongoal: Use snapshotting to provide additional features such as migration.
+
+Background
+
+Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
+
+Currently, no container orchestration software (Kubernetes or its competitors) provides snapshot scheduling for application storage.
+
+Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes a clear competitive advantage for users who want automatic snapshotting on their volumes, particularly those who want to configure application-specific schedules.
+
+What is the value case? Who wants this? What do we enable by implementing this?
+
+I think it introduces a lot of complexity, so what is the payoff? That should be clear in the document. Do Mesos, or Swarm, or our other competition implement this? AWS? Just curious.
+
+Requirements
+
+Functionality
+
+Should this support PVs, direct volumes, or both?
+
+Should we support deletion?
+
+Should we support restores?
+
+Automated schedule -- times or intervals? Before a major event?
+
+Performance
+
+Snapshots are supposed to provide timely state freezing.
What is the SLA from issuing one to it completing?
+
+* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting and then remounted afterwards.
+
+    * Pending = uploading to GCE
+
+* EBS is the same, but if the volume is the root device, the instance should be stopped before snapshotting.
+
+Reliability
+
+How do we ascertain that deletions happen when we want them to?
+
+For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes.
+
+We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached a certain age. At this point we do not see an immediate need to implement automatic deletion (re: Saad) but may want to revisit this.
+
+What happens when a snapshot fails, given that these are async operations?
+
+Retry (for some time period? indefinitely?) and log the error.
+
+Other
+
+What is the UI for seeing the list of snapshots?
+
+In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (e.g. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots.
+
+Overview
+
+There are several options for the design of each layer of the implementation, as follows.
+
+1. **Public API:**
+
+Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically.
There are several options for where this specification can happen. In order from most to least invasive:
+
+    1. New Volume API object
+
+        * Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects, and its details are lost upon destruction of its enclosing object.
+
+        * We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot.
+
+        * The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object.
+
+        * Pros
+
+            * Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule.
+
+        * Cons
+
+            * Heavyweight codewise; involves changing and touching a lot of existing code.
+
+            * Potentially bad UX: How is the Volume API object created?
+
+                * By the user, independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands.
+
+                * When a unique volume is specified in a pod or PV spec.
+
+    2. Directly in the volume definition in the pod/PV object
+
+        * When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
+
+        * Pros
+
+            * Easy for users to implement and understand
+
+        * Cons
+
+            * The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots.
If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used.
+
+            * Replica sets share the same pod spec, so support needs to be added so that the underlying volume does not create new snapshots for each member of the set.
+
+    3. Only in the PV object
+
+        * When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
+
+        * Pros
+
+            * Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property.
+
+        * Cons
+
+            * No support for direct volumes
+
+            * Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do, and the same can be achieved with a simple cron job
+
+            * Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules.
+
+    4. Annotations: key-value pairs on the API object
+
+        * User experience is the same as (b)
+
+        * Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], "ssTimes-vol2": [2,17]} where the values are slices of integer values representing UTC hours.
+
+        * Pros
+
+            * Less invasive to the codebase than (a-c)
+
+        * Cons
+
+            * Same problems as (b-c) with non-unique resources. The only difference here is the API object representation.
+
+2. **Business logic:**
+
+    * Does this go on the master, the node, or both?
+
+        * Where the snapshot is stored
+
+            * GCE, Amazon: cloud storage
+
+            *
Others stored on the volume itself (gluster) or an external drive (iSCSI)
+
+        * Requirements for the snapshot operation
+
+            * Application flush, sync, and fsfreeze before creating the snapshot
+
+    * Suggestion:
+
+        * New SnapshotController on the master
+
+            * Controller keeps a list of active pods/volumes, the schedule for each, and the last snapshot
+
+            * If the controller restarts and we miss a snapshot in the process, just skip it
+
+                * Alternatively, try creating the snapshot up to the time + retryPeriod (see 5)
+
+            * If the snapshotting call fails, retry for an amount of time specified in retryPeriod
+
+            * Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep a list of snapshot times, calculate the time until the next snapshot, and sleep for that period
+
+        * Logic to prepare the disk for snapshotting on the node
+
+            * Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD)
+
+    * Alternative: logic entirely on the node
+
+        * Problems:
+
+            * If the pod moves from one node to another
+
+                * A different node is now in charge of snapshotting
+
+                * If the volume plugin requires external memory for snapshots, we need to move the existing data
+
+            * If the same pod exists on two different nodes, which node is in charge?
+
+3. **Volume plugin interface/internal API:**
+
+    * Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin)
+
+    * When snapshot logic is triggered by the SnapshotController, the SnapshottableVolumePlugin calls out to the volume plugin API to create the snapshot
+
+    * Similar to the volume.attach call
+
+4. **Other questions:**
+
+    * Snapshot period
+
+    * Time or period
+
+    * What is our SLO around time accuracy?
+
+        * Best effort, but no guarantees (depends on time or period) -- if going with time.
+
+    * What if we miss a snapshot?
+
+        *
We will retry (assuming this means that we failed) -- take it at the nearest next opportunity
+
+    * Will we know when an operation has failed? How do we report that?
+
+        * Get the response from the volume plugin API, log it in the kubelet log, and generate a Kube event in both success and failure cases
+
+    * Will we be responsible for GCing old snapshots?
+
+        * Maybe this can be an explicit non-goal; in the future we can automate garbage collection
+
+    * If the pod dies, do we continue creating snapshots?
+
+    * How to communicate errors (PD doesn’t support snapshotting, time period unsupported)
+
+    * Off-schedule snapshotting, like before an application upgrade
+
+    * We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this?
+
+Options, pros, cons, suggestion/recommendation
+
+Example 1b
+
+During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod’s associated volume.
+
+For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2):
+
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pd
+spec:
+  containers:
+  - image: gcr.io/google_containers/test-webserver
+    name: test-container
+    volumeMounts:
+    - mountPath: /test-pd
+      name: test-volume
+  volumes:
+  - name: test-volume
+    # This GCE PD must already exist.
+    gcePersistentDisk:
+      pdName: my-data-disk
+      fsType: ext4
+
+Introduce a new field into the volume spec:
+
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pd
+spec:
+  containers:
+  - image: gcr.io/google_containers/test-webserver
+    name: test-container
+    volumeMounts:
+    - mountPath: /test-pd
+      name: test-volume
+  volumes:
+  - name: test-volume
+    # This GCE PD must already exist.
+    gcePersistentDisk:
+      pdName: my-data-disk
+      fsType: ext4
+
+**ssTimes: [1, 5]**
+
+Caveats
+
+* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because
+
+    * this does not provide value to the user and only adds an extra layer of indirection/complexity.
+
+    * ?
+
+Dependencies
+
+* Kubernetes
+
+* Persistent volume snapshot support through the API
+
+    * POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot
+
+
+
+
+[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/volume-snapshotting.md?pixel)]()
+
diff --git a/design/volume-snapshotting.png b/design/volume-snapshotting.png
new file mode 100644
index 00000000..1b1ea748
Binary files /dev/null and b/design/volume-snapshotting.png differ
diff --git a/downward_api_resources_limits_requests.md b/downward_api_resources_limits_requests.md
deleted file mode 100644
index ab17c321..00000000
--- a/downward_api_resources_limits_requests.md
+++ /dev/null
@@ -1,622 +0,0 @@
-# Downward API for resource limits and requests
-
-## Background
-
-Currently the downward API (via environment variables and volume plugin) only
-supports exposing a Pod's name, namespace, annotations, labels and its IP
-([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This
-document explains the need and design to extend them to expose resources
-(e.g. cpu, memory) limits and requests.
-
-## Motivation
-
-Software applications require configuration to work optimally with the resources they're allowed to use.
-Exposing the requested and limited amounts of available resources inside containers will allow
-these applications to be configured more easily. Although docker already
-exposes some of this information inside containers, the downward API helps
-exposing this information in a runtime-agnostic manner in Kubernetes.
- -## Use cases - -As an application author, I want to be able to use cpu or memory requests and -limits to configure the operational requirements of my applications inside containers. -For example, Java applications expect to be made aware of the available heap size via -a command line argument to the JVM, for example: java -Xmx:``. Similarly, an -application may want to configure its thread pool based on available cpu resources and -the exported value of GOMAXPROCS. - -## Design - -This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473). -There are three approaches discussed in this document to obtain resources limits -and requests to be exposed as environment variables and volumes inside -containers: - -1. The first approach requires users to specify full json path selectors -in which selectors are relative to the pod spec. The benefit of this -approach is to specify pod-level resources, and since containers are -also part of a pod spec, it can be used to specify container-level -resources too. - -2. The second approach requires specifying partial json path selectors -which are relative to the container spec. This approach helps -in retrieving a container specific resource limits and requests, and at -the same time, it is simpler to specify than full json path selectors. - -3. In the third approach, users specify fixed strings (magic keys) to retrieve -resources limits and requests and do not specify any json path -selectors. This approach is similar to the existing downward API -implementation approach. The advantages of this approach are that it is -simpler to specify that the first two, and does not require any type of -conversion between internal and versioned objects or json selectors as -discussed below. - -Before discussing a bit more about merits of each approach, here is a -brief discussion about json path selectors and some implications related -to their use. 
- -#### JSONpath selectors - -Versioned objects in kubernetes have json tags as part of their golang fields. -Currently, objects in the internal API have json tags, but it is planned that -these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933) -for discussion). So for discussion in this proposal, we assume that -internal objects do not have json tags. In the first two approaches -(full and partial json selectors), when a user creates a pod and its -containers, the user specifies a json path selector in the pod's -spec to retrieve values of its limits and requests. The selector -is composed of json tags similar to json paths used with kubectl -([json](http://kubernetes.io/docs/user-guide/jsonpath/)). This proposal -uses kubernetes' json path library to process the selectors to retrieve -the values. As kubelet operates on internal objects (without json tags), -and the selectors are part of versioned objects, retrieving values of -the limits and requests can be handled using these two solutions: - -1. By converting an internal object to versioned object, and then using -the json path library to retrieve the values from the versioned object -by processing the selector. - -2. By converting a json selector of the versioned objects to internal -object's golang expression and then using the json path library to -retrieve the values from the internal object by processing the golang -expression. However, converting a json selector of the versioned objects -to internal object's golang expression will still require an instance -of the versioned object, so it seems more work from the first solution -unless there is another way without requiring the versioned object. - -So there is a one time conversion cost associated with the first (full -path) and second (partial path) approaches, whereas the third approach -(magic keys) does not require any such conversion and can directly -work on internal objects. 
If we want to avoid conversion cost and to -have implementation simplicity, my opinion is that magic keys approach -is relatively easiest to implement to expose limits and requests with -least impact on existing functionality. - -To summarize merits/demerits of each approach: - -|Approach | Scope | Conversion cost | JSON selectors | Future extension| -| ---------- | ------------------- | -------------------| ------------------- | ------------------- | -|Full selectors | Pod/Container | Yes | Yes | Possible | -|Partial selectors | Container | Yes | Yes | Possible | -|Magic keys | Container | No | No | Possible| - -Note: Please note that pod resources can always be accessed using existing `type ObjectFieldSelector` object -in conjunction with partial selectors and magic keys approaches. - -### API with full JSONpath selectors - -Full json path selectors specify the complete path to the resources -limits and requests relative to pod spec. - -#### Environment variables - -This table shows how selectors can be used for various requests and -limits to be exposed as environment variables. Environment variable names -are examples only and not necessarily as specified, and the selectors do not -have to start with dot. - -| Env Var Name | Selector | -| ---- | ------------------- | -| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu| -| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory| -| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu| -| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory | - -#### Volume plugin - -This table shows how selectors can be used for various requests and -limits to be exposed as volumes. The path names are examples only and -not necessarily as specified, and the selectors do not have to start with dot. 
- - -| Path | Selector | -| ---- | ------------------- | -| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu| -| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory| -| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu| -| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory| - -Volumes are pod scoped, so a selector must be specified with a container name. - -Full json path selectors will use existing `type ObjectFieldSelector` -to extend the current implementation for resources requests and limits. - -``` -// ObjectFieldSelector selects an APIVersioned field of an object. -type ObjectFieldSelector struct { - APIVersion string `json:"apiVersion"` - // Required: Path of the field to select in the specified API version - FieldPath string `json:"fieldPath"` -} -``` - -#### Examples - -These examples show how to use full selectors with environment variables and volume plugin. 
- -``` -apiVersion: v1 -kind: Pod -metadata: - name: dapi-test-pod -spec: - containers: - - name: test-container - image: gcr.io/google_containers/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - env: - - name: CPU_LIMIT - valueFrom: - fieldRef: - fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu -``` - -``` -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: client-container - image: gcr.io/google_containers/busybox - command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi;sleep 5; done"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - volumeMounts: - - name: podinfo - mountPath: /etc - readOnly: false - volumes: - - name: podinfo - downwardAPI: - items: - - path: "cpu_limit" - fieldRef: - fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu -``` - -#### Validations - -For APIs with full json path selectors, verify that selectors are -valid relative to pod spec. - - -### API with partial JSONpath selectors - -Partial json path selectors specify paths to resources limits and requests -relative to the container spec. These will be implemented by introducing a -`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current -implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. - -``` -// ContainerSpecFieldSelector selects an APIVersioned field of an object. 
-type ContainerSpecFieldSelector struct { - APIVersion string `json:"apiVersion"` - // Container name - ContainerName string `json:"containerName,omitempty"` - // Required: Path of the field to select in the specified API version - FieldPath string `json:"fieldPath"` -} - -// Represents a single file containing information from the downward API -type DownwardAPIVolumeFile struct { - // Required: Path is the relative path name of the file to be created. - Path string `json:"path"` - // Selects a field of the pod: only annotations, labels, name and - // namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef, omitempty"` - // Selects a field of the container: only resources limits and requests - // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu, - // resources.requests.memory) are currently supported. - ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"` -} - -// EnvVarSource represents a source for the value of an EnvVar. -// Only one of its fields may be set. -type EnvVarSource struct { - // Selects a field of the container: only resources limits and requests - // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu, - // resources.requests.memory) are currently supported. - ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"` - // Selects a field of the pod; only name and namespace are supported. - FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"` - // Selects a key of a ConfigMap. - ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"` - // Selects a key of a secret in the pod's namespace. - SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"` -} -``` - -#### Environment variables - -This table shows how partial selectors can be used for various requests and -limits to be exposed as environment variables. 
Environment variable names -are examples only and not necessarily as specified, and the selectors do not -have to start with dot. - -| Env Var Name | Selector | -| -------------------- | -------------------| -| CPU_LIMIT | resources.limits.cpu | -| MEMORY_LIMIT | resources.limits.memory | -| CPU_REQUEST | resources.requests.cpu | -| MEMORY_REQUEST | resources.requests.memory | - -Since environment variables are container scoped, it is optional -to specify container name as part of the partial selectors as they are -relative to container spec. If container name is not specified, then -it defaults to current container. However, container name could be specified -to expose variables from other containers. - -#### Volume plugin - -This table shows volume paths and partial selectors used for resources cpu and memory. -Volume path names are examples only and not necessarily as specified, and the -selectors do not have to start with dot. - -| Path | Selector | -| -------------------- | -------------------| -| cpu_limit | resources.limits.cpu | -| memory_limit | resources.limits.memory | -| cpu_request | resources.requests.cpu | -| memory_request | resources.requests.memory | - -Volumes are pod scoped, the container name must be specified as part of -`containerSpecFieldRef` with them. - -#### Examples - -These examples show how to use partial selectors with environment variables and volume plugin. 
- -``` -apiVersion: v1 -kind: Pod -metadata: - name: dapi-test-pod -spec: - containers: - - name: test-container - image: gcr.io/google_containers/busybox - command: [ "/bin/sh","-c", "env" ] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - env: - - name: CPU_LIMIT - valueFrom: - containerSpecFieldRef: - fieldPath: resources.limits.cpu -``` - -``` -apiVersion: v1 -kind: Pod -metadata: - name: kubernetes-downwardapi-volume-example -spec: - containers: - - name: client-container - image: gcr.io/google_containers/busybox - command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"] - resources: - requests: - memory: "64Mi" - cpu: "250m" - limits: - memory: "128Mi" - cpu: "500m" - volumeMounts: - - name: podinfo - mountPath: /etc - readOnly: false - volumes: - - name: podinfo - downwardAPI: - items: - - path: "cpu_limit" - containerSpecFieldRef: - containerName: "client-container" - fieldPath: resources.limits.cpu -``` - -#### Validations - -For APIs with partial json path selectors, verify -that selectors are valid relative to container spec. -Also verify that container name is provided with volumes. - - -### API with magic keys - -In this approach, users specify fixed strings (or magic keys) to retrieve resources -limits and requests. This approach is similar to the existing downward -API implementation approach. The fixed string used for resources limits and requests -for cpu and memory are `limits.cpu`, `limits.memory`, -`requests.cpu` and `requests.memory`. Though these strings are same -as json path selectors but are processed as fixed strings. These will be implemented by -introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current -implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`. 
-
-The fields in ResourceFieldSelector are `containerName` to specify the name of a
-container, `resource` to specify the type of a resource (cpu or memory), and `divisor`
-to specify the output format of the values of exposed resources. The default value of divisor
-is `1`, which means cores for cpu and bytes for memory. For cpu, divisor's valid
-values are `1m` (millicores) and `1` (cores); for memory, the valid values in fixed point integer
-(decimal) are `1` (bytes), `1k` (kilobytes), `1M` (megabytes), `1G` (gigabytes),
-`1T` (terabytes), `1P` (petabytes), `1E` (exabytes), and their power-of-two equivalents `1Ki` (kibibytes),
-`1Mi` (mebibytes), `1Gi` (gibibytes), `1Ti` (tebibytes), `1Pi` (pebibytes), `1Ei` (exbibytes).
-For more information about these resource formats, [see details](resources.md).
-
-Also, the exposed values will be the ceiling of the actual values in the format requested by the divisor.
-For example, if requests.cpu is `250m` (250 millicores) and the divisor defaults to `1`, then the
-exposed value will be `1` core. This is because 250 millicores converted to cores is 0.25, and
-the ceiling of 0.25 is 1.
-
-```
-type ResourceFieldSelector struct {
-	// Container name
-	ContainerName string `json:"containerName,omitempty"`
-	// Required: Resource to select
-	Resource string `json:"resource"`
-	// Specifies the output format of the exposed resources
-	Divisor resource.Quantity `json:"divisor,omitempty"`
-}
-
-// Represents a single file containing information from the downward API
-type DownwardAPIVolumeFile struct {
-	// Required: Path is the relative path name of the file to be created.
-	Path string `json:"path"`
-	// Selects a field of the pod: only annotations, labels, name and
-	// namespace are supported.
-	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
-	// Selects a resource of the container: only resources limits and requests
-	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
-	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
-}
-
-// EnvVarSource represents a source for the value of an EnvVar.
-// Only one of its fields may be set.
-type EnvVarSource struct {
-	// Selects a resource of the container: only resources limits and requests
-	// (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
-	ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
-	// Selects a field of the pod; only name and namespace are supported.
-	FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
-	// Selects a key of a ConfigMap.
-	ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
-	// Selects a key of a secret in the pod's namespace.
-	SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
-}
-```
-
-#### Environment variables
-
-This table shows environment variable names and the strings used for the cpu and memory resources.
-The variable names are examples only and not necessarily as specified.
-
-| Env Var Name | Resource |
-| -------------------- | ------------------- |
-| CPU_LIMIT | limits.cpu |
-| MEMORY_LIMIT | limits.memory |
-| CPU_REQUEST | requests.cpu |
-| MEMORY_REQUEST | requests.memory |
-
-Since environment variables are container scoped, it is optional to specify the
-container name in `resourceFieldRef`, since it is relative to the container
-spec. If the container name is not specified, it defaults to the current
-container. However, a container name may be specified to expose variables from
-other containers.
-
-#### Volume plugin
-
-This table shows volume paths and the strings used for the cpu and memory resources.
-Volume path names are examples only and not necessarily as specified.
-
-| Path | Resource |
-| -------------------- | ------------------- |
-| cpu_limit | limits.cpu |
-| memory_limit | limits.memory |
-| cpu_request | requests.cpu |
-| memory_request | requests.memory |
-
-Since volumes are pod scoped, the container name must be specified as part of
-`resourceFieldRef` when using them.
-
-#### Examples
-
-These examples show how to use the magic keys approach with environment variables and the volume plugin.
-
-```
-apiVersion: v1
-kind: Pod
-metadata:
-  name: dapi-test-pod
-spec:
-  containers:
-    - name: test-container
-      image: gcr.io/google_containers/busybox
-      command: [ "/bin/sh","-c", "env" ]
-      resources:
-        requests:
-          memory: "64Mi"
-          cpu: "250m"
-        limits:
-          memory: "128Mi"
-          cpu: "500m"
-      env:
-        - name: CPU_LIMIT
-          valueFrom:
-            resourceFieldRef:
-              resource: limits.cpu
-        - name: MEMORY_LIMIT
-          valueFrom:
-            resourceFieldRef:
-              resource: limits.memory
-              divisor: "1Mi"
-```
-
-In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively.
-
-```
-apiVersion: v1
-kind: Pod
-metadata:
-  name: kubernetes-downwardapi-volume-example
-spec:
-  containers:
-    - name: client-container
-      image: gcr.io/google_containers/busybox
-      command: ["sh", "-c","while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
-      resources:
-        requests:
-          memory: "64Mi"
-          cpu: "250m"
-        limits:
-          memory: "128Mi"
-          cpu: "500m"
-      volumeMounts:
-        - name: podinfo
-          mountPath: /etc
-          readOnly: false
-  volumes:
-    - name: podinfo
-      downwardAPI:
-        items:
-          - path: "cpu_limit"
-            resourceFieldRef:
-              containerName: client-container
-              resource: limits.cpu
-              divisor: "1m"
-          - path: "memory_limit"
-            resourceFieldRef:
-              containerName: client-container
-              resource: limits.memory
-```
-
-In the above example, the values exposed in the `cpu_limit` and `memory_limit` files will be 500 (in millicores) and 134217728 (in bytes), respectively.
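The divisor behavior in both examples reduces to a ceiling division once quantities are normalized to their smallest unit (millicores for cpu, bytes for memory). A minimal sketch of that arithmetic, using plain integers rather than the real `resource.Quantity` type:

```go
package main

import "fmt"

// ceilDiv returns ceil(value/divisor) for positive integers; quantities are
// kept in their smallest unit (millicores for cpu, bytes for memory) so the
// ceiling can be computed without floating point.
func ceilDiv(value, divisor int64) int64 {
	return (value + divisor - 1) / divisor
}

func main() {
	const mi = int64(1024 * 1024) // bytes per Mi

	// limits.cpu = 500m exposed with divisor "1m" -> 500 (millicores)
	fmt.Println(ceilDiv(500, 1))
	// requests.cpu = 250m exposed with the default divisor "1" (= 1000m) -> ceil(0.25) = 1 core
	fmt.Println(ceilDiv(250, 1000))
	// limits.memory = 128Mi exposed with the default divisor "1" -> 134217728 bytes
	fmt.Println(ceilDiv(128*mi, 1))
	// limits.memory = 128Mi exposed with divisor "1Mi" -> 128
	fmt.Println(ceilDiv(128*mi, mi))
}
```

The four printed values match the exposed values quoted after the examples above.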
-
-
-#### Validations
-
-For APIs with magic keys, verify that each resource string is valid and is one
-of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`.
-Also verify that a container name is provided with volumes.
-
-## Pod-level and container-level resource access
-
-Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with a `type ObjectFieldSelector` object in
-all approaches. Container-level resources will be accessed by `type ObjectFieldSelector`
-with the full selector approach, and by `type ContainerSpecFieldRef` and `type ResourceFieldRef`
-with the partial and magic keys approaches, respectively. The following table
-summarizes resource access with these approaches.
-
-| Approach | Pod resources | Container resources |
-| -------------------- | ------------------- | ------------------- |
-| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector` |
-| Partial selectors | `ObjectFieldSelector` | `ContainerSpecFieldRef` |
-| Magic keys | `ObjectFieldSelector` | `ResourceFieldRef` |
-
-## Output format
-
-The output format for resource limits and requests will be the same as
-the cgroup output format, i.e. cpu in cpu shares (cores multiplied by 1024
-and rounded to an integer) and memory in bytes. For example, a memory request
-or limit of `64Mi` in the container spec will be output as `67108864`
-bytes, and a cpu request or limit of `250m` (millicores) will be output as
-`256` cpu shares.
-
-## Implementation approach
-
-The current implementation of this proposal will focus on the API with magic keys
-approach. The main reason for selecting this approach is that it might be
-easier to incorporate and extend resource specific functionality.
-
-## Applied example
-
-Here we discuss how to use exposed resource values to set, for example, Java
-memory size or GOMAXPROCS for your applications.
Let's say you expose a container's
-(running an application like Tomcat, for example) requested memory as a `HEAP_SIZE`
-environment variable and its requested cpu as `CPU_LIMIT` (or as `GOMAXPROCS` directly).
-One way to set the heap size or cpu for this application would be to wrap the binary
-in a shell script, and then export the `JAVA_OPTS` (assuming your container image supports it)
-and `GOMAXPROCS` environment variables inside the container image. The spec file for the
-application pod could look like:
-
-```
-apiVersion: v1
-kind: Pod
-metadata:
-  name: kubernetes-downwardapi-volume-example
-spec:
-  containers:
-    - name: test-container
-      image: gcr.io/google_containers/busybox
-      command: [ "/bin/sh","-c", "env" ]
-      resources:
-        requests:
-          memory: "64M"
-          cpu: "250m"
-        limits:
-          memory: "128M"
-          cpu: "500m"
-      env:
-        - name: HEAP_SIZE
-          valueFrom:
-            resourceFieldRef:
-              resource: requests.memory
-        - name: CPU_LIMIT
-          valueFrom:
-            resourceFieldRef:
-              resource: requests.cpu
-```
-
-Note that the value of divisor by default is `1`. Now inside the container,
-the HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as:
-
-```
-export JAVA_OPTS="$JAVA_OPTS -Xmx${HEAP_SIZE}"
-
-and
-
-export GOMAXPROCS=${CPU_LIMIT}
-```
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/downward_api_resources_limits_requests.md?pixel)]()
-
diff --git a/enhance-pluggable-policy.md b/enhance-pluggable-policy.md
deleted file mode 100644
index 2468d3c1..00000000
--- a/enhance-pluggable-policy.md
+++ /dev/null
@@ -1,429 +0,0 @@
-# Enhance Pluggable Policy
-
-While trying to develop an authorization plugin for Kubernetes, we found a few
-places where API extensions would ease development and add power. There are a
-few goals:
- 1. Provide an authorization plugin that can evaluate a .Authorize() call based
on the full content of the request to RESTStorage. This includes information
-like the full verb, the content of creates and updates, and the names of
-resources being acted upon.
- 1. Provide a way to ask whether a user is permitted to take an action without
-    running in process with the API Authorizer. For instance, a proxy for exec
-    calls could ask whether a user can run the exec they are requesting.
- 1. Provide a way to ask who can perform a given action on a given resource.
-This is useful for answering questions like, "who can create replication
-controllers in my namespace".
-
-This proposal adds to and extends the existing API so that authorizers may
-provide the functionality described above. It does not attempt to describe how
-the policies themselves can be expressed; that is up to the authorization plugins
-themselves.
-
-
-## Enhancements to existing Authorization interfaces
-
-The existing Authorization interfaces are described
-[here](../admin/authorization.md). A couple of additions will allow the development
-of an Authorizer that matches based on different rules than the existing
-implementation.
-
-### Request Attributes
-
-The existing authorizer.Attributes only has 5 attributes (user, groups,
-isReadOnly, kind, and namespace). If we add more detailed verbs, content, and
-resource names, then Authorizer plugins will have the same level of information
-available to RESTStorage components in order to express more detailed policy.
-The replacement excerpt is below.
-
-An API request has the following attributes that can be considered for
-authorization:
- - user - the user-string which a user was authenticated as. This is included
-in the Context.
- - groups - the groups to which the user belongs. This is included in the
-Context.
- - verb - a string describing the requested action. Today we have: get, list,
-watch, create, update, and delete. The old `readOnly` behavior is equivalent to
-allowing get, list, and watch.
 - namespace - the namespace of the object being accessed, or the empty string if
-the endpoint does not support namespaced objects. This is included in the
-Context.
- - resourceGroup - the API group of the resource being accessed
- - resourceVersion - the API version of the resource being accessed
- - resource - which resource is being accessed
-   - applies only to the API endpoints, such as `/api/v1beta1/pods`. For
-miscellaneous endpoints, like `/version`, the resource is the empty string.
- - resourceName - the name of the resource during a get, update, or delete
-action.
- - subresource - which subresource is being accessed
-
-A non-API request has 2 attributes:
- - verb - the HTTP verb of the request
- - path - the path of the URL being requested
-
-
-### Authorizer Interface
-
-The existing Authorizer interface is very simple, but there isn't a way to
-provide details about allows, denies, or failures. The extended detail is useful
-for UIs that want to describe why certain actions are allowed or disallowed. Not
-all Authorizers will want to provide that information, but for those that do,
-having that capability is useful. In addition, adding a `GetAllowedSubjects`
-method that returns the users and groups that can perform a particular
-action makes it possible to answer questions like, "who can see resources in my
-namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down).
-
-```go
-// OLD
-type Authorizer interface {
-	Authorize(a Attributes) error
-}
-```
-
-```go
-// NEW
-// Authorizer provides the ability to determine if a particular user can perform
-// a particular action
-type Authorizer interface {
-	// Authorize takes a Context (for namespace, user, and traceability) and
-	// Attributes to make a policy determination.
-	// reason is an optional return value that can describe why a policy decision
-	// was made.
Reasons are useful during debugging when trying to figure out
-	// why a user or group has access to perform a particular action.
-	Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error)
-}
-
-// AuthorizerIntrospection is an optional interface that provides the ability to
-// determine which users and groups can perform a particular action. This is
-// useful for building caches of who can see what. For instance, "which
-// namespaces can this user see". That would allow someone to see only the
-// namespaces they are allowed to view instead of having to choose between
-// listing them all or listing none.
-type AuthorizerIntrospection interface {
-	// GetAllowedSubjects takes a Context (for namespace and traceability) and
-	// Attributes to determine which users and groups are allowed to perform the
-	// described action in the namespace. This API enables the ResourceAccessReview
-	// requests below.
-	GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error)
-}
-```
-
-### SubjectAccessReviews
-
-This set of APIs answers the question: can a user or group (the authenticated
-user if none is specified) perform a given action. Given the Authorizer
-interface (proposed or existing), this endpoint can be implemented generically
-against any Authorizer by creating the correct Attributes and making an
-.Authorize() call.
-
-There are three different flavors:
-
-1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this
-checks to see if a specified user or group can perform a given action at the
-cluster scope or across all namespaces. This is a highly privileged operation.
-It allows a cluster-admin to inspect rights of any person across the entire
-cluster and against cluster level resources.
-2. 
`/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` -
-this checks to see if the current user (including his groups) can perform a
-given action at any specified scope. This is an unprivileged operation. It
-doesn't expose any information that a user couldn't discover simply by trying an
-endpoint themselves.
-3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` -
-this checks to see if a specified user or group can perform a given action in
-**this** namespace. This is a moderately privileged operation. In a multi-tenant
-environment, having a namespace scoped resource makes it very easy to reason
-about powers granted to a namespace admin. This allows a namespace admin
-(someone able to manage permissions inside of one namespace, but not all
-namespaces) the power to inspect whether a given user or group can manipulate
-resources in his namespace.
-
-SubjectAccessReview is a runtime.Object with associated RESTStorage that only
-accepts creates. The caller POSTs a SubjectAccessReview to this URL and gets
-a SubjectAccessReviewResponse back. Here is an example of a call and its
-corresponding return:
-
-```
-// input
-{
-  "kind": "SubjectAccessReview",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "authorizationAttributes": {
-    "verb": "create",
-    "resource": "pods",
-    "user": "Clark",
-    "groups": ["admins", "managers"]
-  }
-}
-
-// POSTed like this
-curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json
-// or
-accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject)
-
-// output
-{
-  "kind": "SubjectAccessReviewResponse",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "allowed": true
-}
-```
-
-PersonalSubjectAccessReview is a runtime.Object with associated RESTStorage that
-only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL
-and gets a PersonalSubjectAccessReviewResponse back.
Here is an example of a call and
-its corresponding return:
-
-```
-// input
-{
-  "kind": "PersonalSubjectAccessReview",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "authorizationAttributes": {
-    "verb": "create",
-    "resource": "pods",
-    "namespace": "any-ns"
-  }
-}
-
-// POSTed like this
-curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json
-// or
-accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(subjectAccessReviewObject)
-
-// output
-{
-  "kind": "PersonalSubjectAccessReviewResponse",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "allowed": true
-}
-```
-
-LocalSubjectAccessReview is a runtime.Object with associated RESTStorage that only
-accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and
-gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and
-its corresponding return:
-
-```
-// input
-{
-  "kind": "LocalSubjectAccessReview",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "namespace": "my-ns",
-  "authorizationAttributes": {
-    "verb": "create",
-    "resource": "pods",
-    "user": "Clark",
-    "groups": ["admins", "managers"]
-  }
-}
-
-// POSTed like this
-curl -X POST /apis/authorization.kubernetes.io/{version}/localSubjectAccessReviews -d @local-subject-access-review.json
-// or
-accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject)
-
-// output
-{
-  "kind": "LocalSubjectAccessReviewResponse",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "namespace": "my-ns",
-  "allowed": true
-}
-```
-
-The actual Go objects look like this:
-
-```go
-type AuthorizationAttributes struct {
-	// Namespace is the namespace of the action being requested.
Currently, there
-	// is no distinction between no namespace and all namespaces
-	Namespace string `json:"namespace" description:"namespace of the action being requested"`
-	// Verb is one of: get, list, watch, create, update, delete
-	Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`
-	// ResourceGroup is the API group of the resource being requested
-	ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"`
-	// ResourceVersion is the API version of the resource being requested
-	ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
-	// Resource is one of the existing resource types
-	Resource string `json:"resource" description:"one of the existing resource types"`
-	// ResourceName is the name of the resource being requested for a "get" or
-	// deleted for a "delete"
-	ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
-	// Subresource is one of the existing subresource types
-	Subresource string `json:"subresource" description:"one of the existing subresources"`
-}
-
-// SubjectAccessReview is an object for requesting information about whether a
-// user or group can perform an action
-type SubjectAccessReview struct {
-	kapi.TypeMeta `json:",inline"`
-
-	// AuthorizationAttributes describes the action being tested.
-	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
-	// User is optional, but at least one of User or Groups must be specified
-	User string `json:"user" description:"optional, user to check"`
-	// Groups is optional, but at least one of User or Groups must be specified
-	Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
-}
-
-// SubjectAccessReviewResponse describes whether or not a user or group can
-// perform an action
-type SubjectAccessReviewResponse struct {
-	kapi.TypeMeta
-
-	// Allowed is required. 
True if the action would be allowed, false otherwise. - Allowed bool - // Reason is optional. It indicates why a request was allowed or denied. - Reason string -} - -// PersonalSubjectAccessReview is an object for requesting information about -// whether a user or group can perform an action -type PersonalSubjectAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. - AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` -} - -// PersonalSubjectAccessReviewResponse describes whether this user can perform -// an action -type PersonalSubjectAccessReviewResponse struct { - kapi.TypeMeta - - // Namespace is the namespace used for the access review - Namespace string - // Allowed is required. True if the action would be allowed, false otherwise. - Allowed bool - // Reason is optional. It indicates why a request was allowed or denied. - Reason string -} - -// LocalSubjectAccessReview is an object for requesting information about -// whether a user or group can perform an action -type LocalSubjectAccessReview struct { - kapi.TypeMeta `json:",inline"` - - // AuthorizationAttributes describes the action being tested. - AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"` - // User is optional, but at least one of User or Groups must be specified - User string `json:"user" description:"optional, user to check"` - // Groups is optional, but at least one of User or Groups must be specified - Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"` -} - -// LocalSubjectAccessReviewResponse describes whether or not a user or group can -// perform an action -type LocalSubjectAccessReviewResponse struct { - kapi.TypeMeta - - // Namespace is the namespace used for the access review - Namespace string - // Allowed is required. True if the action would be allowed, false otherwise. 
-	Allowed bool
-	// Reason is optional. It indicates why a request was allowed or denied.
-	Reason string
-}
-```
-
-### ResourceAccessReview
-
-This set of APIs answers the question: which users and groups can perform the
-specified verb on the specified resourceKind. Given the Authorizer interface
-described above, this endpoint can be implemented generically against any
-Authorizer by calling the .GetAllowedSubjects() function.
-
-There are two different flavors:
-
-1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReview` - this
-checks to see which users and groups can perform a given action at the cluster
-scope or across all namespaces. This is a highly privileged operation. It allows
-a cluster-admin to inspect rights of all subjects across the entire cluster and
-against cluster level resources.
-2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` -
-this checks to see which users and groups can perform a given action in **this**
-namespace. This is a moderately privileged operation. In a multi-tenant
-environment, having a namespace scoped resource makes it very easy to reason
-about powers granted to a namespace admin. This allows a namespace admin
-(someone able to manage permissions inside of one namespace, but not all
-namespaces) the power to inspect which users and groups can manipulate
-resources in his namespace.
-
-ResourceAccessReview is a runtime.Object with associated RESTStorage that only
-accepts creates. The caller POSTs a ResourceAccessReview to this URL and gets
-a ResourceAccessReviewResponse back.
Here is an example of a call and its
-corresponding return:
-
-```
-// input
-{
-  "kind": "ResourceAccessReview",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "authorizationAttributes": {
-    "verb": "list",
-    "resource": "replicationcontrollers"
-  }
-}
-
-// POSTed like this
-curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json
-// or
-accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject)
-
-// output
-{
-  "kind": "ResourceAccessReviewResponse",
-  "apiVersion": "authorization.kubernetes.io/v1",
-  "namespace": "default",
-  "users": ["Clark", "Hubert"],
-  "groups": ["cluster-admins"]
-}
-```
-
-The actual Go objects look like this:
-
-```go
-// ResourceAccessReview is a means to request a list of which users and groups
-// are authorized to perform the action specified by spec
-type ResourceAccessReview struct {
-	kapi.TypeMeta `json:",inline"`
-
-	// AuthorizationAttributes describes the action being tested.
-	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
-}
-
-// ResourceAccessReviewResponse describes who can perform the action
-type ResourceAccessReviewResponse struct {
-	kapi.TypeMeta
-
-	// Users is the list of users who can perform the action
-	Users []string
-	// Groups is the list of groups who can perform the action
-	Groups []string
-}
-
-// LocalResourceAccessReview is a means to request a list of which users and
-// groups are authorized to perform the action specified in a specific namespace
-type LocalResourceAccessReview struct {
-	kapi.TypeMeta `json:",inline"`
-
-	// AuthorizationAttributes describes the action being tested.
-	AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
-}
-
-// LocalResourceAccessReviewResponse describes who can perform the action
-type LocalResourceAccessReviewResponse struct {
-	kapi.TypeMeta
-
-	// Namespace is the namespace used for the access review
-	Namespace string
-	// Users is the list of users who can perform the action
-	Users []string
-	// Groups is the list of groups who can perform the action
-	Groups []string
-}
-```
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/enhance-pluggable-policy.md?pixel)]()
-
diff --git a/event_compression.md b/event_compression.md
deleted file mode 100644
index 7a1cbb33..00000000
--- a/event_compression.md
+++ /dev/null
@@ -1,169 +0,0 @@
-# Kubernetes Event Compression
-
-This document captures the design of event compression.
-
-## Background
-
-Kubernetes components can get into a state where they generate an enormous number of events.
-
-The events can be categorized in one of two ways:
-
-1. same - The event is identical to previous events except it varies only on
-timestamp.
-2. similar - The event is identical to previous events except it varies on
-timestamp and message.
-
-For example, when pulling a non-existent image, Kubelet will repeatedly generate
-`image_not_existing` and `container_is_waiting` events until upstream components
-correct the image. When this happens, the spam from the repeated events makes
-the entire event mechanism useless. It also appears to cause memory pressure in
-etcd (see [#3853](http://issue.k8s.io/3853)).
-
-The goal is to introduce event counting to increment same events, and event
-aggregation to collapse similar events.
-
-## Proposal
-
-Each binary that generates events (for example, `kubelet`) should keep track of
-previously generated events so that it can collapse recurring events into a
-single event instead of creating a new instance for each new event. 
In addition, -if many similar events are created, events should be aggregated into a single -event to reduce spam. - -Event compression should be best effort (not guaranteed). Meaning, in the worst -case, `n` identical (minus timestamp) events may still result in `n` event -entries. - -## Design - -Instead of a single Timestamp, each event object -[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following -fields: - * `FirstTimestamp unversioned.Time` - * The date/time of the first occurrence of the event. - * `LastTimestamp unversioned.Time` - * The date/time of the most recent occurrence of the event. - * On first occurrence, this is equal to the FirstTimestamp. - * `Count int` - * The number of occurrences of this event between FirstTimestamp and -LastTimestamp. - * On first occurrence, this is 1. - -Each binary that generates events: - * Maintains a historical record of previously generated events: - * Implemented with -["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) -in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go). - * Implemented behind an `EventCorrelator` that manages two subcomponents: -`EventAggregator` and `EventLogger`. - * The `EventCorrelator` observes all incoming events and lets each -subcomponent visit and modify the event in turn. - * The `EventAggregator` runs an aggregation function over each event. This -function buckets each event based on an `aggregateKey` and identifies the event -uniquely with a `localKey` in that bucket. - * The default aggregation function groups similar events that differ only by -`event.Message`. 
Its `localKey` is `event.Message` and its aggregate key is -produced by joining: - * `event.Source.Component` - * `event.Source.Host` - * `event.InvolvedObject.Kind` - * `event.InvolvedObject.Namespace` - * `event.InvolvedObject.Name` - * `event.InvolvedObject.UID` - * `event.InvolvedObject.APIVersion` - * `event.Reason` - * If the `EventAggregator` observes a similar event produced 10 times in a 10 -minute window, it drops the event that was provided as input and creates a new -event that differs only on the message. The message denotes that this event is -used to group similar events that matched on reason. This aggregated `Event` is -then used in the event processing sequence. - * The `EventLogger` observes the event out of `EventAggregation` and tracks -the number of times it has observed that event previously by incrementing a key -in a cache associated with that matching event. - * The key in the cache is generated from the event object minus -timestamps/count/transient fields, specifically the following events fields are -used to construct a unique key for an event: - * `event.Source.Component` - * `event.Source.Host` - * `event.InvolvedObject.Kind` - * `event.InvolvedObject.Namespace` - * `event.InvolvedObject.Name` - * `event.InvolvedObject.UID` - * `event.InvolvedObject.APIVersion` - * `event.Reason` - * `event.Message` - * The LRU cache is capped at 4096 events for both `EventAggregator` and -`EventLogger`. That means if a component (e.g. kubelet) runs for a long period -of time and generates tons of unique events, the previously generated events -cache will not grow unchecked in memory. Instead, after 4096 unique events are -generated, the oldest events are evicted from the cache. - * When an event is generated, the previously generated events cache is checked -(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)). 
- * If the key for the new event matches the key for a previously generated -event (meaning all of the above fields match between the new event and some -previously generated event), then the event is considered to be a duplicate and -the existing event entry is updated in etcd: - * The new PUT (update) event API is called to update the existing event -entry in etcd with the new last seen timestamp and count. - * The event is also updated in the previously generated events cache with -an incremented count, updated last seen timestamp, name, and new resource -version (all required to issue a future event update). - * If the key for the new event does not match the key for any previously -generated event (meaning none of the above fields match between the new event -and any previously generated events), then the event is considered to be -new/unique and a new event entry is created in etcd: - * The usual POST/create event API is called to create a new event entry in -etcd. - * An entry for the event is also added to the previously generated events -cache. - -## Issues/Risks - - * Compression is not guaranteed, because each component keeps track of event - history in memory - * An application restart causes event history to be cleared, meaning event -history is not preserved across application restarts and compression will not -occur across component restarts. - * Because an LRU cache is used to keep track of previously generated events, -if too many unique events are generated, old events will be evicted from the -cache, so events will only be compressed until they age out of the events cache, -at which point any new instance of the event will cause a new entry to be -created in etcd. 
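The dedup path described above (build a key from the identifying fields, then increment a count in a capped LRU) can be sketched as follows. This is an illustrative stand-in, not the real `events_cache.go`: the `Event` struct carries only a few of the listed fields, and the cache is capped at 2 here (4096 in the real implementation) so the eviction risk is visible.

```go
package main

import (
	"container/list"
	"fmt"
	"strings"
)

// Event holds a subset of the key fields listed above; the real API object
// also carries UID, APIVersion, timestamps, and a count.
type Event struct {
	Component, Host               string
	Kind, Namespace, Name, Reason string
	Message                       string
}

// eventKey joins the identifying fields, mirroring the key construction
// described above (timestamps and counts are deliberately excluded).
func eventKey(e Event) string {
	return strings.Join([]string{e.Component, e.Host, e.Kind, e.Namespace, e.Name, e.Reason, e.Message}, "/")
}

type entry struct {
	key   string
	count int
}

// lruCache is a tiny capped LRU mapping event key -> observation count.
type lruCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding *entry
}

func newLRU(capacity int) *lruCache {
	return &lruCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// observe increments the count for an event's key, evicting the least
// recently used key when the cache is full. A return of 1 corresponds to a
// POST (new entry); >1 corresponds to a PUT updating the existing entry.
func (c *lruCache) observe(e Event) int {
	k := eventKey(e)
	if el, ok := c.items[k]; ok {
		c.order.MoveToFront(el)
		en := el.Value.(*entry)
		en.count++
		return en.count
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[k] = c.order.PushFront(&entry{key: k, count: 1})
	return 1
}

func main() {
	c := newLRU(2)
	pull := Event{Component: "kubelet", Host: "node-1", Kind: "Pod",
		Namespace: "default", Name: "web", Reason: "failedScheduling",
		Message: "no nodes available"}
	fmt.Println(c.observe(pull)) // 1: new entry would be created
	fmt.Println(c.observe(pull)) // 2: duplicate, entry updated in place

	other, third := pull, pull
	other.Name, third.Name = "db", "cache"
	c.observe(other)
	c.observe(third)             // evicts the entry for "web"
	fmt.Println(c.observe(pull)) // 1: after eviction, counting starts over
}
```

The final observation shows the risk called out above: once an event ages out of the LRU, its next occurrence is treated as new and compression restarts.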
- -## Example - -Sample kubectl output: - -```console -FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE -Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet. -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet 
kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest" -Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal -``` - -This demonstrates what would have been 20 separate entries (indicating -scheduling failure) collapsed/compressed down to 5 entries. - -## Related Pull Requests/Issues - - * Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events. - * PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API. - * PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow -compressing multiple recurring events in to a single event. - * PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a -single event to optimize etcd storage. - * PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache -instead of map. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/event_compression.md?pixel)]() - diff --git a/expansion.md b/expansion.md deleted file mode 100644 index ace1faf0..00000000 --- a/expansion.md +++ /dev/null @@ -1,417 +0,0 @@ -# Variable expansion in pod command, args, and env - -## Abstract - -A proposal for the expansion of environment variables using a simple `$(var)` -syntax. - -## Motivation - -It is extremely common for users to need to compose environment variables or -pass arguments to their commands using the values of environment variables. -Kubernetes should provide a facility for the 80% cases in order to decrease -coupling and the use of workarounds. - -## Goals - -1. Define the syntax format -2. Define the scoping and ordering of substitutions -3. Define the behavior for unmatched variables -4. 
Define the behavior for unexpected/malformed input - -## Constraints and Assumptions - -* This design should describe the simplest possible syntax to accomplish the -use-cases. -* Expansion syntax will not support more complicated shell-like behaviors such -as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc. - -## Use Cases - -1. As a user, I want to compose new environment variables for a container using -a substitution syntax to reference other variables in the container's -environment and service environment variables. -1. As a user, I want to substitute environment variables into a container's -command. -1. As a user, I want to do the above without requiring the container's image to -have a shell. -1. As a user, I want to be able to specify a default value for a service -variable which may not exist. -1. As a user, I want to see an event associated with the pod if an expansion -fails (ie, references variable names that cannot be expanded). - -### Use Case: Composition of environment variables - -Currently, containers are injected with docker-style environment variables for -the services in their pod's namespace. There are several variables for each -service, but users routinely need to compose URLs based on these variables -because there is not a variable for the exact format they need. Users should be -able to build new environment variables with the exact format they need. -Eventually, it should also be possible to turn off the automatic injection of -the docker-style variables into pods and let the users consume the exact -information they need via the downward API and composition. - -#### Expanding expanded variables - -It should be possible to reference a variable which is itself the result of an -expansion, if the referenced variable is declared in the container's environment -prior to the one referencing it.
Put another way -- a container's environment is -expanded in order, and expanded variables are available to subsequent -expansions. - -### Use Case: Variable expansion in command - -Users frequently need to pass the values of environment variables to a -container's command. Currently, Kubernetes does not perform any expansion of -variables. The workaround is to invoke a shell in the container's command and -have the shell perform the substitution, or to write a wrapper script that sets -up the environment and runs the command. This has a number of drawbacks: - -1. Solutions that require a shell are unfriendly to images that do not contain -a shell. -2. Wrapper scripts make it harder to use images as base images. -3. Wrapper scripts increase coupling to Kubernetes. - -Users should be able to do the 80% case of variable expansion in command without -writing a wrapper script or adding a shell invocation to their containers' -commands. - -### Use Case: Images without shells - -The current workaround for variable expansion in a container's command requires -the container's image to have a shell. This is unfriendly to images that do not -contain a shell (`scratch` images, for example). Users should be able to perform -the other use-cases in this design without regard to the content of their -images. - -### Use Case: See an event for incomplete expansions - -It is possible that a container with incorrect variable values or command line -may continue to run for a long period of time, and that the end-user would have -no visual or obvious warning of the incorrect configuration. If the kubelet -creates an event when an expansion references a variable that cannot be -expanded, it will help users quickly detect problems with expansions. - -## Design Considerations - -### What features should be supported? - -In order to limit complexity, we want to provide the right amount of -functionality so that the 80% cases can be realized and nothing more. 
We felt -that the essentials boiled down to: - -1. Ability to perform direct expansion of variables in a string. -2. Ability to specify default values via a prioritized mapping function but -without support for defaults as a syntax-level feature. - -### What should the syntax be? - -The exact syntax for variable expansion has a large impact on how users perceive -and relate to the feature. We considered implementing a very restrictive subset -of the shell `${var}` syntax. This syntax is an attractive option on some level, -because many people are familiar with it. However, this syntax also has a large -number of lesser known features such as the ability to provide default values -for unset variables, perform inline substitution, etc. - -In the interest of preventing conflation of the expansion feature in Kubernetes -with the shell feature, we chose a different syntax similar to the one in -Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since -it is not required to implement the required use-cases. - -Nested references, ie, variable expansion within variable names, are not -supported. - -#### How should unmatched references be treated? - -Ideally, it should be extremely clear when a variable reference couldn't be -expanded. We decided the best experience for unmatched variable references would -be to have the entire reference, syntax included, show up in the output. As an -example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then -`$(VARIABLE_NAME)` should be present in the output. - -#### Escaping the operator - -Although the `$(var)` syntax does overlap with the `$(command)` form of command -substitution supported by many shells, because unexpanded variables are present -verbatim in the output, we expect this will not present a problem to many users.
-If there is a collision between a variable name and command substitution syntax, -the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate -to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not. - -## Design - -This design encompasses the variable expansion syntax and specification and the -changes needed to incorporate the expansion feature into the container's -environment and command. - -### Syntax and expansion mechanics - -This section describes the expansion syntax, evaluation of variable values, and -how unexpected or malformed inputs are handled. - -#### Syntax - -The inputs to the expansion feature are: - -1. A UTF-8 string (the input string) which may contain variable references. -2. A function (the mapping function) that maps the name of a variable to the -variable's value, of type `func(string) string`. - -Variable references in the input string are indicated exclusively with the syntax -`$(<variable-name>)`. The syntax tokens are: - -- `$`: the operator, -- `(`: the reference opener, and -- `)`: the reference closer. - -The operator has no meaning unless accompanied by the reference opener and -closer tokens. The operator can be escaped using `$$`. One literal `$` will be -emitted for each `$$` in the input. - -The reference opener and closer characters have no meaning when not part of a -variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME` -without a closing expression, the operator and expression opening characters are -treated as ordinary characters without special meanings. - -#### Scope and ordering of substitutions - -The scope in which variable references are expanded is defined by the mapping -function. Within the mapping function, any arbitrary strategy may be used to -determine the value of a variable name. The most basic implementation of a -mapping function is to use a `map[string]string` to look up the value of a -variable.
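As a sketch of these mechanics, the following is a minimal, illustrative implementation of the scanning rules above (not the actual expansion package code), together with a basic map-backed mapping function that leaves unmatched references verbatim:

```go
package main

import (
	"fmt"
	"strings"
)

// Expand is a sketch of the syntax rules above: $(VAR) is resolved via the
// mapping function, each $$ emits one literal $, and a malformed reference
// (no closing parenthesis) is passed through as ordinary characters.
func Expand(input string, mapping func(string) string) string {
	var buf strings.Builder
	for i := 0; i < len(input); i++ {
		if input[i] != '$' {
			buf.WriteByte(input[i])
			continue
		}
		// $$ escapes the operator: emit one literal $.
		if i+1 < len(input) && input[i+1] == '$' {
			buf.WriteByte('$')
			i++
			continue
		}
		// $( opens a reference; resolve it if a closer exists.
		if i+1 < len(input) && input[i+1] == '(' {
			if end := strings.IndexByte(input[i+2:], ')'); end >= 0 {
				buf.WriteString(mapping(input[i+2 : i+2+end]))
				i += 2 + end
				continue
			}
		}
		// A lone or malformed $ is ordinary text.
		buf.WriteByte(input[i])
	}
	return buf.String()
}

func main() {
	vars := map[string]string{"VAR_A": "A"}
	mapping := func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$(" + name + ")" // unmatched references stay verbatim
	}
	fmt.Println(Expand("$$(VAR_B)_$(VAR_A)", mapping)) // prints: $(VAR_B)_A
}
```

Because the mapping function wraps unmatched names back in the `$(...)` syntax, even inputs that look nested come out verbatim, which is the behavior the examples later in this document expect.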
- -In order to support default values for variables like service variables -presented by the kubelet, which may not be bound because the service that -provides them does not yet exist, there should be a mapping function that uses a -list of `map[string]string` like: - -```go -func MakeMappingFunc(maps ...map[string]string) func(string) string { - return func(input string) string { - for _, context := range maps { - val, ok := context[input] - if ok { - return val - } - } - - return "" - } -} - -// elsewhere -containerEnv := map[string]string{ - "FOO": "BAR", - "ZOO": "ZAB", - "SERVICE2_HOST": "some-host", -} - -serviceEnv := map[string]string{ - "SERVICE_HOST": "another-host", - "SERVICE_PORT": "8083", -} - -// single-map variation -mapping := MakeMappingFunc(containerEnv) - -// default variables not found in serviceEnv -mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv) -``` - -### Implementation changes - -The necessary changes to implement this functionality are: - -1. Add a new interface, `ObjectEventRecorder`, which is like the -`EventRecorder` interface, but scoped to a single object, and a function that -returns an `ObjectEventRecorder` given an `ObjectReference` and an -`EventRecorder`. -2. Introduce `third_party/golang/expansion` package that provides: - 1. An `Expand(string, func(string) string) string` function. - 2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string` -function. -3. Make the kubelet expand environment correctly. -4. Make the kubelet expand command correctly. - -#### Event Recording - -In order to provide an event when an expansion references undefined variables, -the mapping function must be able to create an event. In order to facilitate -this, we should create a new interface in the `api/client/record` package which -is similar to `EventRecorder`, but scoped to a single object: - -```go -// ObjectEventRecorder knows how to record events about a single object.
-type ObjectEventRecorder interface { - // Event constructs an event from the given information and puts it in the queue for sending. - // 'reason' is the reason this event is generated. 'reason' should be short and unique; it will - // be used to automate handling of events, so imagine people writing switch statements to - // handle them. You want to make that easy. - // 'message' is intended to be human readable. - // - // The resulting event will be created in the same namespace as the reference object. - Event(reason, message string) - - // Eventf is just like Event, but with Sprintf for the message field. - Eventf(reason, messageFmt string, args ...interface{}) - - // PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field. - PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) -} -``` - -There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object` -and an `EventRecorder`: - -```go -type objectRecorderImpl struct { - object runtime.Object - recorder EventRecorder -} - -func (r *objectRecorderImpl) Event(reason, message string) { - r.recorder.Event(r.object, reason, message) -} - -func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder { - return &objectRecorderImpl{object, recorder} -} -``` - -#### Expansion package - -The expansion package should provide two methods: - -```go -// MappingFuncFor returns a mapping function for use with Expand that -// implements the expansion semantics defined in the expansion spec; it -// returns the input string wrapped in the expansion syntax if no mapping -// for the input is found. If no expansion is found for a key, an event -// is raised on the given recorder. -func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string { - // ... 
-} - -// Expand replaces variable references in the input string according to -// the expansion spec using the given mapping function to resolve the -// values of variables. -func Expand(input string, mapping func(string) string) string { - // ... -} -``` - -#### Kubelet changes - -The Kubelet should be made to correctly expand variables references in a -container's environment, command, and args. Changes will need to be made to: - -1. The `makeEnvironmentVariables` function in the kubelet; this is used by -`GenerateRunContainerOptions`, which is used by both the docker and rkt -container runtimes. -2. The docker manager `setEntrypointAndCommand` func has to be changed to -perform variable expansion. -3. The rkt runtime should be made to support expansion in command and args -when support for it is implemented. - -### Examples - -#### Inputs and outputs - -These examples are in the context of the mapping: - -| Name | Value | -|-------------|------------| -| `VAR_A` | `"A"` | -| `VAR_B` | `"B"` | -| `VAR_C` | `"C"` | -| `VAR_REF` | `$(VAR_A)` | -| `VAR_EMPTY` | `""` | - -No other variables are defined. 
- -| Input | Result | -|--------------------------------|----------------------------| -| `"$(VAR_A)"` | `"A"` | -| `"___$(VAR_B)___"` | `"___B___"` | -| `"___$(VAR_C)"` | `"___C"` | -| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` | -| `"$(VAR_A)-1"` | `"A-1"` | -| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` | -| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` | -| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` | -| `"f000-$$VAR_A"` | `"f000-$VAR_A"` | -| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` | -| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` | -| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` | -| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` | -| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` | -| `"$(VAR_REF)"` | `"$(VAR_A)"` | -| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` | -| `"foo$(VAR_EMPTY)bar"` | `"foobar"` | -| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` | -| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` | -| `"$?_boo_$!"` | `"$?_boo_$!"` | -| `"$VAR_A"` | `"$VAR_A"` | -| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` | -| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` | -| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` | -| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` | -| `"$$$$$$$(VAR_A)"` | `"$$$A"` | -| `"$VAR_A)"` | `"$VAR_A)"` | -| `"${VAR_A}"` | `"${VAR_A}"` | -| `"$(VAR_B)_______$(A"` | `"B_______$(A"` | -| `"$(VAR_C)_______$("` | `"C_______$("` | -| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` | -| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` | -| `"--$($($($($--"` | `"--$($($($($--"` | -| `"$($($($($--foo$("` | `"$($($($($--foo$("` | -| `"foo0--$($($($("` | `"foo0--$($($($("` | -| `"$(foo$$var)"` | `"$(foo$$var)"` | - -#### In a pod: building a URL - -Notice the `$(var)` syntax.
- -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: expansion-pod -spec: - containers: - - name: test-container - image: gcr.io/google_containers/busybox - command: [ "/bin/sh", "-c", "env" ] - env: - - name: PUBLIC_URL - value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)" - restartPolicy: Never -``` - -#### In a pod: building a URL using downward API - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: expansion-pod -spec: - containers: - - name: test-container - image: gcr.io/google_containers/busybox - command: [ "/bin/sh", "-c", "env" ] - env: - - name: POD_NAMESPACE - valueFrom: - fieldRef: - fieldPath: "metadata.namespace" - - name: PUBLIC_URL - value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)" - restartPolicy: Never -``` - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/expansion.md?pixel)]() - diff --git a/extending-api.md b/extending-api.md deleted file mode 100644 index 45a07ca5..00000000 --- a/extending-api.md +++ /dev/null @@ -1,203 +0,0 @@ -# Adding custom resources to the Kubernetes API server - -This document describes the design for implementing the storage of custom API -types in the Kubernetes API Server. - - -## Resource Model - -### The ThirdPartyResource - -The `ThirdPartyResource` resource describes the multiple versions of a custom -resource that the user wants to add to the Kubernetes API. `ThirdPartyResource` -is a non-namespaced resource; attempting to place it in a namespace will return -an error. - -Each `ThirdPartyResource` resource has the following: - * Standard Kubernetes object metadata. - * ResourceKind - The kind of the resources described by this third party -resource. - * Description - A free text description of the resource. - * APIGroup - An API group that this resource should be placed into. - * Versions - One or more `Version` objects. - -### The `Version` Object - -The `Version` object describes a single concrete version of a custom resource. 
-The `Version` object currently only specifies: - * The `Name` of the version. - * The `APIGroup` this version should belong to. - -## Expectations about third party objects - -Every object that is added to a third-party Kubernetes object store is expected -to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata). -This requirement enables the Kubernetes API server to provide the following -features: - * Filtering lists of objects via label queries. - * `resourceVersion`-based optimistic concurrency via compare-and-swap. - * Versioned storage. - * Event recording. - * Integration with basic `kubectl` command line tooling. - * Watch for resource changes. - -The `Kind` for an instance of a third-party object (e.g. CronTab) below is -expected to be programmatically convertible to the name of the resource using -the following conversion. Kinds are expected to be of the form -`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be -`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll -use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`. - -For example `mygroup.example.com/v1` - -'CamelCaseKind' is the specific type name. - -To convert this into the `metadata.name` for the `ThirdPartyResource` resource -instance, the `<api-group>` is copied verbatim, the `CamelCaseKind` is then -converted using '-' instead of capitalization ('camel-case'), with the first -character being assumed to be capitalized. In pseudo code: - -```go -var result string -for ix := range kindName { - if ix > 0 && isCapital(kindName[ix]) { - result = append(result, '-') - } - result = append(result, toLowerCase(kindName[ix])) -} -``` - -As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines -resources of Kind `CamelCaseKind`, in the APIGroup with the prefix -`mygroup.example.com/...`. - -The reason for this is to enable rapid lookup of a `ThirdPartyResource` object -given the kind information.
This is also the reason why `ThirdPartyResource` is -not namespaced. - -## Usage - -When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts -by creating a new, namespaced RESTful resource path. For now, non-namespaced -objects are not supported. As with existing built-in objects, deleting a -namespace deletes all third party resources in that namespace. - -For example, if a user creates: - -```yaml -metadata: - name: cron-tab.mygroup.example.com -apiVersion: extensions/v1beta1 -kind: ThirdPartyResource -description: "A specification of a Pod to run on a cron style schedule" -versions: -- name: v1 -- name: v2 -``` - -Then the API server will program in the new RESTful resource path: - * `/apis/mygroup.example.com/v1/namespaces/<namespace>/crontabs/...` - -**Note: It may take a while before the RESTful resource path registration happens; please -always check this before you create resource instances.** - -Now that this schema has been created, a user can `POST`: - -```json -{ - "metadata": { - "name": "my-new-cron-object" - }, - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTab", - "cronSpec": "* * * * /5", - "image": "my-awesome-cron-image" -} -``` - -to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs` - -and the corresponding data will be stored into etcd by the API server. When the user then issues: - -``` -GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object -``` - -they will get back the same data, but with additional -Kubernetes metadata (e.g. `resourceVersion`, `creationTimestamp`) filled in.
- -Likewise, to list all resources, a user can issue: - -``` -GET /apis/mygroup.example.com/v1/namespaces/default/crontabs -``` - -and get back: - -```json -{ - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTabList", - "items": [ - { - "metadata": { - "name": "my-new-cron-object" - }, - "apiVersion": "mygroup.example.com/v1", - "kind": "CronTab", - "cronSpec": "* * * * /5", - "image": "my-awesome-cron-image" - } - ] -} -``` - -Because all objects are expected to contain standard Kubernetes metadata fields, -these list operations can also use label queries to filter requests down to -specific subsets. - -Likewise, clients can use watch endpoints to watch for changes to stored -objects. - -## Storage - -In order to store custom user data in a versioned fashion inside of etcd, we -need to also introduce a `Codec`-compatible object for persistent storage in -etcd. This object is `ThirdPartyResourceData` and it contains: - * Standard API Metadata. - * `Data`: The raw JSON data for this custom object. 
- -### Storage key specification - -Each custom object stored by the API server needs a custom key in storage, this -is described below: - -#### Definitions - - * `resource-namespace`: the namespace of the particular resource that is -being stored - * `resource-name`: the name of the particular resource being stored - * `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` -resource that represents the type for the specific instance being stored - * `third-party-resource-name`: the name of the `ThirdPartyResource` resource -that represents the type for the specific instance being stored - -#### Key - -Given the definitions above, the key for a specific third-party object is: - -``` -${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name} -``` - -Thus, listing a third-party resource can be achieved by listing the directory: - -``` -${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/ -``` - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/extending-api.md?pixel)]() - diff --git a/federated-replicasets.md b/federated-replicasets.md deleted file mode 100644 index f1744ade..00000000 --- a/federated-replicasets.md +++ /dev/null @@ -1,513 +0,0 @@ -# Federated ReplicaSets - -# Requirements & Design Document - -This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion. 
Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com) -Based on discussions with -Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com) - -## Overview - -### Summary & Vision - -When running a global application on a federation of Kubernetes -clusters, the owner currently has to start it in multiple clusters and -control whether they have enough application replicas running both -locally in each of the clusters (so that, for example, users are -handled by a nearby cluster, with low latency) and globally (so that -there is always enough capacity to handle all traffic). If one of the -clusters has issues or does not have enough capacity to run the given set of -replicas, the replicas should be automatically moved to some other -cluster to keep the application responsive. - -In single-cluster Kubernetes there is a concept of ReplicaSet that -manages the replicas locally. We want to expand this concept to the -federation level. - -### Goals - -+ Win large enterprise customers who want to easily run applications - across multiple clusters -+ Create a reference controller implementation to facilitate bringing - other Kubernetes concepts to Federated Kubernetes. - -## Glossary - -Federation Cluster - a cluster that is a member of federation. - -Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster -that is a member of federation. - -Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of Federated K8S server. - -Federated ReplicaSet Controller (FRSC) - A controller running inside -of Federated K8S server that controls FRS. - -## User Experience - -### Critical User Journeys - -+ [CUJ1] User wants to create a ReplicaSet in each of the federation - clusters. They create a definition of a federated ReplicaSet on the - federated master and (local) ReplicaSets are automatically created - in each of the federation clusters.
The number of replicas in each - of the Local ReplicaSets is (perhaps indirectly) configurable by - the user. -+ [CUJ2] When the current number of replicas in a cluster drops below - the desired number and new replicas cannot be scheduled then they - should be started in some other cluster. - -### Features Enabling Critical User Journeys - -Feature #1 -> CUJ1: -A component which looks for newly created Federated ReplicaSets and -creates the appropriate Local ReplicaSet definitions in the federated -clusters. - -Feature #2 -> CUJ2: -A component that checks how many replicas are actually running in each -of the subclusters and whether the number matches the -FederatedReplicaSet preferences (by default spread replicas evenly -across the clusters but custom preferences are allowed - see -below). If it doesn't and the situation is unlikely to improve soon, -then the replicas should be moved to other subclusters. - -### API and CLI - -All interaction with FederatedReplicaSet will be done by issuing -kubectl commands pointing at the Federated Master API Server. All the -commands would behave in a similar way as on the regular master; -however, in the next versions (1.5+) some of the commands may give -slightly different output. For example, kubectl describe on a federated -replica set should also give some information about the subclusters. - -Moreover, for safety, some defaults will be different. For example, for -kubectl delete federatedreplicaset, cascade will be set to false. - -FederatedReplicaSet would have the same object as local ReplicaSet -(although it will be accessible in a different part of the -API). Scheduling preferences (how many replicas in which cluster) will -be passed as annotations. - -### FederatedReplicaSet preferences - -The preferences are expressed by the following structure, passed as -serialized JSON inside annotations.
- -``` -type FederatedReplicaSetPreferences struct { - // If set to true then already scheduled and running replicas may be moved to other clusters - // in order to bring cluster replicasets towards a desired state. Otherwise, if set to false, - // up and running replicas will not be moved. - Rebalance bool `json:"rebalance,omitempty"` - - // Map from cluster name to preferences for that cluster. It is assumed that if a cluster - // doesn't have a matching entry then it should not have a local replica. The cluster matches - // to "*" if there is no entry with the real cluster name. - Clusters map[string]ClusterReplicaSetPreferences -} - -// Preferences regarding number of replicas assigned to a cluster replicaset within a federated replicaset. -type ClusterReplicaSetPreferences struct { - // Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default. - MinReplicas int64 `json:"minReplicas,omitempty"` - - // Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default). - MaxReplicas *int64 `json:"maxReplicas,omitempty"` - - // A number expressing the preference to place an additional replica in this Local ReplicaSet. 0 by default. - Weight int64 -} -``` - -How this works in practice: - -**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config: - -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]LocalReplicaSet { - "*" : LocalReplicaSet{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity. - Replica layout: A=16 B=17 C=17. -+ Clusters A,B,C and C has capacity for 6 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C. B and C are offline: - Replica layout: A=50 - -**Scenario 2**. I want to have only 2 replicas in each of the clusters.
- -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ MaxReplicas: 2, Weight: 1} - } -} -``` - -Or - -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, Weight: 0 } - } -} -``` - -Or - -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ MinReplicas: 2, MaxReplicas: 2} - } -} -``` - -There is a global target of 50, however if there are 3 clusters there will be only 6 replicas running. - -**Scenario 3**. I want to have 20 replicas in each of 3 clusters. - -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ MinReplicas: 20, Weight: 0} - } -} -``` - -There is a global target of 50, however the clusters request 60, so some clusters will have fewer replicas. - Replica layout: A=20 B=20 C=10. - -**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C, but no more than 20 replicas in cluster C. - -``` -FederatedReplicaSetPreferences { - Rebalance : true - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ Weight: 1} - "C" : ClusterReplicaSetPreferences{ MaxReplicas: 20, Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=16 B=17 C=17. -+ B is offline/has no capacity - Replica layout: A=30 B=0 C=20 -+ A and B are offline: - Replica layout: C=20 - -**Scenario 5**. I want to run my application in cluster A, but if there is trouble the FRS can also use clusters B and C, equally. - -``` -FederatedReplicaSetPreferences { - Clusters : map[string]ClusterReplicaSetPreferences { - "A" : ClusterReplicaSetPreferences{ Weight: 1000000} - "B" : ClusterReplicaSetPreferences{ Weight: 1} - "C" : ClusterReplicaSetPreferences{ Weight: 1} - } -} -``` - -Example: - -+ All have capacity. - Replica layout: A=50 B=0 C=0.
-+ A has capacity for only 40 replicas - Replica layout: A=40 B=5 C=5 - -**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A receives twice the QPS of the other clusters. - -``` -FederatedReplicaSetPreferences { - Clusters : map[string]ClusterReplicaSetPreferences { - "A" : ClusterReplicaSetPreferences{ Weight: 2} - "B" : ClusterReplicaSetPreferences{ Weight: 1} - "C" : ClusterReplicaSetPreferences{ Weight: 1} - } -} -``` - -**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there -are already some replicas, please do not move them. Config: - -``` -FederatedReplicaSetPreferences { - Rebalance : false - Clusters : map[string]ClusterReplicaSetPreferences { - "*" : ClusterReplicaSetPreferences{ Weight: 1} - } -} -``` - -Example: - -+ Clusters A,B,C, all have capacity, but A already has 20 replicas - Replica layout: A=20 B=15 C=15. -+ Clusters A,B,C, where C has capacity for 6 replicas and A already has 20 replicas. - Replica layout: A=22 B=22 C=6 -+ Clusters A,B,C, where C has capacity for 6 replicas and A already has 30 replicas. - Replica layout: A=30 B=14 C=6 - -## The Idea - -A new federated controller - the Federated Replica Set Controller (FRSC) - -will be created inside the federated controller manager. The key elements -of the design are enumerated below: - -+ [I0] It is considered OK to have a slightly higher number of replicas - globally for some time. - -+ [I1] FRSC starts an informer on the FederatedReplicaSet that listens - for FRS being created, updated or deleted. On each create/update the - scheduling code will be started to calculate where to put the - replicas. The default behavior is to start the same number of - replicas in each of the clusters. While creating LocalReplicaSets - (LRS) the following errors/issues can occur: - - + [E1] The master rejects LRS creation (for a known or unknown - reason). In this case another attempt to create the LRS should be - made in 1m or so. This action can be tied with - [[I5]](#heading=h.ififs95k9rng).
Until the LRS is created - the situation is the same as [E5]. If this happens multiple - times all due replicas should be moved elsewhere and later moved - back once the LRS is created. - - + [E2] An LRS with the same name but a different configuration already - exists. The LRS is then overwritten and an appropriate event is - created to explain what happened. Pods under the control of the - old LRS are left intact and the new LRS may adopt them if they - match the selector. - - + [E3] The LRS is new but pods that match the selector already exist. The - pods are adopted by the RS (if not owned by some other - RS). However, they may have a different image, configuration, - etc., just like with a regular LRS. - -+ [I2] For each of the clusters FRSC starts a store and an informer on - LRS that listen for status updates. These status changes are - only interesting in case of trouble. Otherwise it is assumed that - the LRS runs trouble-free and there is always the right number of pods - created, though possibly not scheduled. - - - + [E4] The LRS is manually deleted from the local cluster. In this case - a new LRS should be created. It is the same case as - [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind - won’t be killed and will be adopted after the LRS is recreated. - - + [E5] The LRS fails to create (though not necessarily schedule) the desired - number of pods due to master troubles, admission control, - etc. This should be treated as the same situation as replicas - being unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)). - - + [E6] It is impossible to tell whether an informer lost its connection - with a remote cluster or has some other synchronization problem, so this - should be handled by the cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216). - -+ [I3] For each of the clusters FRSC starts a store and informer to monitor - whether the created pods are eventually scheduled and what the - current number of correctly running, ready pods is.
Errors: - - + [E7] It is impossible to tell whether an informer lost its connection - with a remote cluster or has some other synchronization problem, so this - should be handled by the cluster liveness probe and deletion - [[I6]](#heading=h.z90979gc2216) - -+ [I4] It is assumed that an unscheduled pod is a normal situation -and can last up to X min if there is heavy traffic on the -cluster. However, if the replicas are not scheduled within that time then -FRSC should consider moving most of the unscheduled replicas -elsewhere. For that purpose FRSC will maintain a data structure -where, for each FRS-controlled LRS, we store a list of pods belonging -to that LRS along with their current status and status change timestamp. - -+ [I5] If a new cluster is added to the federation then it doesn’t - have an LRS and the situation is the same as - [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef). - -+ [I6] If a cluster is removed from the federation then the situation - is equal to multiple [E4]. It is assumed that if the connection with - a cluster is lost completely then the cluster is removed from - the cluster list (or marked accordingly) so - [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda) - don’t need to be handled. - -+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable), - checked against the current list of clusters, and all missing LRS - are created. This will be executed in combination with [I8]. - -+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min - (configurable) to check whether replicas need to be moved between clusters. - -+ FRSC never moves replicas to an LRS that has unscheduled or non-running -pods, or pods that failed to be created.
- - + When FRSC notices that a number of pods are not scheduled/running, - or were never created, in one LRS for more than Y minutes it takes - most of them away from that LRS, leaving a couple still waiting so that once - they are scheduled FRSC will know that it is OK to put more - replicas in that cluster. - -+ [I9] An FRS becomes ToBeChecked if: - + It is newly created - + Some replica set inside changed its status - + Some pods inside a cluster changed their status - + Some cluster is added or deleted. -> An FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough). - -## (Re)scheduling algorithm - -To calculate the (re)scheduling moves for a given FRS: - -1. For each cluster FRSC calculates the number of replicas that are placed -(not necessarily up and running) in the cluster and the number of replicas that -failed to be scheduled. Cluster capacity is the difference between -the placed replicas and those that failed to be scheduled. - -2. Order all clusters by their weight and a hash of the name so that every time -we process the same replica set we process the clusters in the same order. -Include the federated replica set name in the cluster name hash so that we get -slightly different ordering for different RS, so that not all RS of size 1 -end up on the same cluster. - -3. Assign the minimum preferred number of replicas to each of the clusters, if -there are enough replicas and capacity. - -4. If rebalance = false, assign the previously present replicas to the clusters, -remembering the number of extra replicas added (ER), again subject to -available replicas and capacity. - -5. Distribute the remaining replicas with regard to weights and cluster capacity. -In multiple iterations calculate how many of the replicas should end up in each cluster. -For each cluster, cap the number of assigned replicas by the maximum number of replicas and -the cluster capacity. If extra replicas were added to a cluster in step -4, don't actually add new replicas but balance them against the ER from step 4.
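The five steps above can be sketched in Go. This is a simplified, illustrative reading covering only the rebalance=true path (the ER bookkeeping of step 4 is omitted), and all type and function names here are invented for illustration, not taken from the actual controller:

```go
package main

import (
	"fmt"
	"sort"
)

// clusterPref is a hypothetical flattened view of one cluster's preferences.
type clusterPref struct {
	name     string
	weight   int64
	min      int64 // MinReplicas
	max      int64 // MaxReplicas; -1 means unbounded
	capacity int64 // placed minus failed-to-schedule replicas (step 1)
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// room is how many more replicas a cluster can still accept, given its
// current assignment, its MaxReplicas cap, and its capacity.
func room(p clusterPref, assigned int64) int64 {
	r := p.capacity - assigned
	if p.max >= 0 && p.max-assigned < r {
		r = p.max - assigned
	}
	if r < 0 {
		r = 0
	}
	return r
}

// distribute sketches steps 2, 3 and 5: order clusters deterministically,
// satisfy minimums, then hand out the remainder proportionally to weight,
// capped by MaxReplicas and cluster capacity.
func distribute(total int64, prefs []clusterPref) map[string]int64 {
	// Step 2: deterministic ordering (name stands in for the name hash).
	sort.Slice(prefs, func(i, j int) bool {
		if prefs[i].weight != prefs[j].weight {
			return prefs[i].weight > prefs[j].weight
		}
		return prefs[i].name < prefs[j].name
	})
	out := map[string]int64{}
	// Step 3: minimum preferred replicas first, while replicas and capacity last.
	for _, p := range prefs {
		n := min64(p.min, min64(total, p.capacity))
		out[p.name] = n
		total -= n
	}
	// Step 5: spread the remainder by weight in passes until nothing fits.
	for total > 0 {
		var weightSum int64
		for _, p := range prefs {
			if room(p, out[p.name]) > 0 {
				weightSum += p.weight
			}
		}
		if weightSum == 0 {
			break // every cluster is at MaxReplicas or out of capacity
		}
		rem := total
		for _, p := range prefs {
			r := room(p, out[p.name])
			if r == 0 || total == 0 {
				continue
			}
			share := rem * p.weight / weightSum
			if share == 0 {
				share = 1 // guarantee progress for low-weight clusters
			}
			n := min64(share, min64(r, total))
			out[p.name] += n
			total -= n
		}
	}
	return out
}

func main() {
	// Second example of Scenario 1: equal weights, C has capacity for only 6.
	layout := distribute(50, []clusterPref{
		{name: "A", weight: 1, max: -1, capacity: 100},
		{name: "B", weight: 1, max: -1, capacity: 100},
		{name: "C", weight: 1, max: -1, capacity: 6},
	})
	fmt.Println(layout) // map[A:22 B:22 C:6]
}
```

Run against the second example of Scenario 1, this sketch reproduces the documented layout A=22 B=22 C=6.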
- -## Goroutines layout - -+ [GR1] Involved in the FRS informer (see - [[I1]]). Whenever an FRS is created or - updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with - delay 0. - -+ [GR2_1...GR2_N] Involved in informers/stores on LRS (see - [[I2]]). On all changes the FRS is put on - FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR3_1...GR3_N] Involved in informers/stores on Pods - (see [[I3]] and [[I4]]). They maintain the status store - so that for each of the LRS we know the number of pods that are - actually running and ready in O(1) time. They also put the - corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min. - -+ [GR4] Involved in the cluster informer (see - [[I5]] and [[I6]]). It puts all FRS on FRS_TO_CHECK_QUEUE - with delay 0. - -+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on - FRS_CHANNEL after the given delay (and remove them from - FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to - FRS_TO_CHECK_QUEUE the delays are compared and updated so that the - shorter delay is used. - -+ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever - an FRS is received it is put on a work queue. The work queue has no delay - and makes sure that a single replica set is processed by - only one goroutine. - -+ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on the FRS. - Multiple replica sets can be processed in parallel, but two goroutines cannot - process the same FRS at the same time. - - -## Func DoFrsCheck - -The function does [[I7]] and [[I8]]. It is assumed to run on a -single thread/goroutine so the same FRS is never checked and evaluated on multiple -goroutines at once (however, if needed, the function can be parallelized across -different FRS). It takes data only from the stores maintained by GR2_* and -GR3_*. External communication is only required to: - -+ Create LRS. If an LRS doesn’t exist it is created after the - rescheduling, when we know how many replicas it should have. - -+ Update LRS replica targets.
- -If an FRS is not in the desired state then it is put on -FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing). - -## Monitoring and status reporting - -FRSC should expose a number of metrics from its run, such as: - -+ FRSC -> LRS communication latency -+ Total times spent in various elements of DoFrsCheck - -FRSC should also expose the status of FRS as an annotation on FRS and -as events. - -## Workflow - -Here is the sequence of tasks that need to be done in order for a -typical FRS to be split into a number of LRS’s and to be created in -the underlying federated clusters. - -Note a: the workflow is helpful at this phase because we can create -PRs for every one or two steps to start the development. - -Note b: we assume that the federation is already in place and the -federated clusters have been added to the federation. - -Step 1. The client sends an RS create request to the -federation-apiserver. - -Step 2. The federation-apiserver persists an FRS into the federation etcd. - -Note c: the federation-apiserver populates the clusterid field in the FRS -before persisting it into the federation etcd. - -Step 3. The federation-level “informer” in FRSC watches the federation -etcd for new/modified FRS’s with an empty clusterid or a clusterid equal -to the federation ID, and if one is detected, it calls the scheduling code. - -Step 4. - -Note d: the scheduler populates the clusterid field in the LRS with the -IDs of the target clusters. - -Note e: at this point let us assume that it only does even -distribution, i.e., equal weights for all of the underlying clusters. - -Step 5. As soon as the scheduler function returns control to FRSC, -FRSC starts a number of cluster-level “informer”s, one per -target cluster, to watch changes in every target cluster’s etcd -regarding the posted LRS’s, and if any deviation from the scheduled -number of replicas is detected the scheduling code is re-invoked for -re-scheduling purposes.
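The dispatch rule of Step 3 (hand an FRS to the scheduling code only when its clusterid is empty or equal to the federation ID) can be sketched as below. The event type and field names are assumptions based on the workflow text, not the real API objects:

```go
package main

import "fmt"

// frsEvent is a hypothetical, stripped-down view of an FRS change event
// as seen by the federation-level informer.
type frsEvent struct {
	name      string
	clusterID string
}

// shouldSchedule applies the Step 3 filter: schedule only FRS's with an
// empty clusterid or a clusterid equal to the federation's own ID.
func shouldSchedule(ev frsEvent, federationID string) bool {
	return ev.clusterID == "" || ev.clusterID == federationID
}

func main() {
	events := []frsEvent{
		{name: "frs-a", clusterID: ""},             // new FRS, not yet claimed
		{name: "frs-b", clusterID: "federation-1"}, // owned by this federation
		{name: "frs-c", clusterID: "cluster-7"},    // cluster-local object, skip
	}
	for _, ev := range events {
		fmt.Println(ev.name, shouldSchedule(ev, "federation-1"))
	}
	// Prints: frs-a true, frs-b true, frs-c false (one per line).
}
```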
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-replicasets.md?pixel)]() - diff --git a/federated-services.md b/federated-services.md deleted file mode 100644 index b9d51c43..00000000 --- a/federated-services.md +++ /dev/null @@ -1,517 +0,0 @@ -# Kubernetes Cluster Federation (previously nicknamed "Ubernetes") - -## Cross-cluster Load Balancing and Service Discovery - -### Requirements and System Design - -### by Quinton Hoole, Dec 3 2015 - -## Requirements - -### Discovery, Load-balancing and Failover - -1. **Internal discovery and connection**: Pods/containers (running in - a Kubernetes cluster) must be able to easily discover and connect - to endpoints for Kubernetes services on which they depend in a - consistent way, irrespective of whether those services exist in a - different Kubernetes cluster within the same cluster federation. - Henceforth referred to as "cluster-internal clients", or simply - "internal clients". -1. **External discovery and connection**: External clients (running - outside a Kubernetes cluster) must be able to discover and connect - to endpoints for Kubernetes services on which they depend. - 1. **External clients predominantly speak HTTP(S)**: External - clients are most often, but not always, web browsers, or at - least speak HTTP(S) - notable exceptions include Enterprise - Message Buses (Java, TLS), DNS servers (UDP), - SIP servers and databases. -1. **Find the "best" endpoint:** Upon initial discovery and - connection, both internal and external clients should ideally find - "the best" endpoint if multiple eligible endpoints exist. "Best" - in this context implies the closest (by network topology) endpoint - that is both operational (as defined by some positive health check) - and not overloaded (by some published load metric). For example: - 1.
An internal client should find an endpoint which is local to its - own cluster if one exists, in preference to one in a remote - cluster (if both are operational and non-overloaded). - Similarly, one in a nearby cluster (e.g. in the same zone or - region) is preferable to one further afield. - 1. An external client (e.g. in New York City) should find an - endpoint in a nearby cluster (e.g. U.S. East Coast) in - preference to one further away (e.g. Japan). -1. **Easy fail-over:** If the endpoint to which a client is connected - becomes unavailable (no network response/disconnected) or - overloaded, the client should reconnect to a better endpoint, - somehow. - 1. In the case where there exist one or more connection-terminating - load balancers between the client and the serving Pod, failover - might be completely automatic (i.e. the client's end of the - connection remains intact, and the client is completely - oblivious of the fail-over). This approach incurs network speed - and cost penalties (by traversing possibly multiple load - balancers), but requires zero smarts in clients, DNS libraries, - recursing DNS servers etc, as the IP address of the endpoint - remains constant over time. - 1. In a scenario where clients need to choose between multiple load - balancer endpoints (e.g. one per cluster), multiple DNS A - records associated with a single DNS name enable even relatively - dumb clients to try the next IP address in the list of returned - A records (without even necessarily re-issuing a DNS resolution - request). For example, all major web browsers will try all A - records in sequence until a working one is found (TBD: justify - this claim with details for Chrome, IE, Safari, Firefox). - 1. 
In a slightly more sophisticated scenario, upon disconnection, a - smarter client might re-issue a DNS resolution query, and - (modulo DNS record TTL's which can typically be set as low as 3 - minutes, and buggy DNS resolvers, caches and libraries which - have been known to completely ignore TTL's), receive updated A - records specifying a new set of IP addresses to which to - connect. - -### Portability - -A Kubernetes application configuration (e.g. for a Pod, Replication -Controller, Service etc) should be able to be successfully deployed -into any Kubernetes Cluster or Federation of Clusters, -without modification. More specifically, a typical configuration -should work correctly (although possibly not optimally) across any of -the following environments: - -1. A single Kubernetes Cluster on one cloud provider (e.g. Google - Compute Engine, GCE). -1. A single Kubernetes Cluster on a different cloud provider - (e.g. Amazon Web Services, AWS). -1. A single Kubernetes Cluster on a non-cloud, on-premise data center -1. A Federation of Kubernetes Clusters all on the same cloud provider - (e.g. GCE). -1. A Federation of Kubernetes Clusters across multiple different cloud - providers and/or on-premise data centers (e.g. one cluster on - GCE/GKE, one on AWS, and one on-premise). - -### Trading Portability for Optimization - -It should be possible to explicitly opt out of portability across some -subset of the above environments in order to take advantage of -non-portable load balancing and DNS features of one or more -environments. More specifically, for example: - -1. For HTTP(S) applications running on GCE-only Federations, - [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - should be usable. These provide single, static global IP addresses - which load balance and fail over globally (i.e. across both regions - and zones). 
These allow for really dumb clients, but they only - work on GCE, and only for HTTP(S) traffic. -1. For non-HTTP(S) applications running on GCE-only Federations within - a single region, - [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - should be usable. These provide TCP (i.e. both HTTP/S and - non-HTTP/S) load balancing and failover, but only on GCE, and only - within a single region. - [Google Cloud DNS](https://cloud.google.com/dns) can be used to - route traffic between regions (and between different cloud - providers and on-premise clusters, as it's plain DNS, IP only). -1. For applications running on AWS-only Federations, - [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - should be usable. These provide both L7 (HTTP(S)) and L4 load - balancing, but only within a single region, and only on AWS - ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be - used to load balance and fail over across multiple regions, and is - also capable of resolving to non-AWS endpoints). - -## Component Cloud Services - -Cross-cluster Federated load balancing is built on top of the following: - -1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) - provide single, static global IP addresses which load balance and - fail over globally (i.e. across both regions and zones). These - allow for really dumb clients, but they only work on GCE, and only - for HTTP(S) traffic. -1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/) - provide both HTTP(S) and non-HTTP(S) load balancing and failover, - but only on GCE, and only within a single region. -1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/) - provide both L7 (HTTP(S)) and L4 load balancing, but only within a - single region, and only on AWS. -1. 
[Google Cloud DNS](https://cloud.google.com/dns) (or any other - programmable DNS service, like - [CloudFlare](http://www.cloudflare.com)) can be used to route - traffic between regions (and between different cloud providers and - on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS - doesn't provide any built-in geo-DNS, latency-based routing, health - checking, weighted round robin or other advanced capabilities. - It's plain old DNS. We would need to build all the aforementioned - on top of it. It can provide internal DNS services (i.e. serve RFC - 1918 addresses). - 1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can - be used to load balance and fail over across regions, and is also - capable of routing to non-AWS endpoints. It provides built-in - geo-DNS, latency-based routing, health checking, weighted - round robin and optional tight integration with some other - AWS services (e.g. Elastic Load Balancers). -1. Kubernetes L4 Service Load Balancing: This provides both a - [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies) - and a - [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer) - service IP which is load-balanced (currently simple round-robin) - across the healthy pods comprising a service within a single - Kubernetes cluster. -1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): -A generic wrapper around cloud-provided L4 and L7 load balancing services, and -roll-your-own load balancers run in pods, e.g. HAProxy. - -## Cluster Federation API - -The Cluster Federation API for load balancing should be compatible with the equivalent -Kubernetes API, to ease porting of clients between Kubernetes and -federations of Kubernetes clusters. -Further details below. - -## Common Client Behavior - -To be useful, our load balancing solution needs to work properly with real -client applications.
There are a few different classes of those... - -### Browsers - -These are the most common external clients. They are all well-written. See below. - -### Well-written clients - -1. Do a DNS resolution every time they connect. -1. Don't cache beyond the TTL (although a small percentage of the DNS - servers on which they rely might). -1. Do try multiple A records (in order) to connect. -1. (in an ideal world) Do use SRV records rather than hard-coded port numbers. - -Examples: - -+ all common browsers (except for SRV records) -+ ... - -### Dumb clients - -1. Don't do a DNS resolution every time they connect (or do cache beyond the -TTL). -1. Do try multiple A records - -Examples: - -+ ... - -### Dumber clients - -1. Only do a DNS lookup once on startup. -1. Only try the first returned DNS A record. - -Examples: - -+ ... - -### Dumbest clients - -1. Never do a DNS lookup - are pre-configured with a single (or possibly -multiple) fixed server IP(s). Nothing else matters. - -## Architecture and Implementation - -### General Control Plane Architecture - -Each cluster hosts one or more Cluster Federation master components (Federation API -servers, controller managers with leader election, and etcd quorum members). This -is documented in more detail in a separate design doc: -[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#). - -In the description below, assume that 'n' clusters, named 'cluster-1'... -'cluster-n', have been registered against a Cluster Federation "federation-1", -each with its own set of Kubernetes API endpoints, e.g. -[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1), -[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1) -... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n). - -### Federated Services - -Federated Services are pretty straightforward.
They're composed of multiple -equivalent underlying Kubernetes Services, each with its own external -endpoint, and a load balancing mechanism across them. Let's work through how -exactly that works in practice. - -Our user creates the following Federated Service (against a Federation -API endpoint): - - $ kubectl create -f my-service.yaml --context="federation-1" - -where my-service.yaml contains the following: - - kind: Service - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - ports: - - port: 2379 - protocol: TCP - targetPort: 2379 - name: client - - port: 2380 - protocol: TCP - targetPort: 2380 - name: peer - selector: - run: my-service - type: LoadBalancer - -The Cluster Federation control system in turn creates one equivalent service (identical config to the above) -in each of the underlying Kubernetes clusters, each of which results in -something like this: - - $ kubectl get -o yaml --context="cluster-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:25Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "147365" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00002 - spec: - clusterIP: 10.0.153.185 - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - ip: 104.197.117.10 - -Similar services are created in `cluster-2` and `cluster-3`, each of which is -allocated its own `spec.clusterIP` and `status.loadBalancer.ingress.ip`.
- -In the Cluster Federation `federation-1`, the resulting federated service looks as follows: - - $ kubectl get -o yaml --context="federation-1" service my-service - - apiVersion: v1 - kind: Service - metadata: - creationTimestamp: 2015-11-25T23:35:23Z - labels: - run: my-service - name: my-service - namespace: my-namespace - resourceVersion: "157333" - selfLink: /api/v1/namespaces/my-namespace/services/my-service - uid: 33bfc927-93cd-11e5-a38c-42010af00007 - spec: - clusterIP: - ports: - - name: client - nodePort: 31333 - port: 2379 - protocol: TCP - targetPort: 2379 - - name: peer - nodePort: 31086 - port: 2380 - protocol: TCP - targetPort: 2380 - selector: - run: my-service - sessionAffinity: None - type: LoadBalancer - status: - loadBalancer: - ingress: - - hostname: my-service.my-namespace.my-federation.my-domain.com - -Note that the federated service: - -1. Is API-compatible with a vanilla Kubernetes service. -1. has no clusterIP (as it is cluster-independent) -1. has a federation-wide load balancer hostname - -In addition to the set of underlying Kubernetes services (one per cluster) -described above, the Cluster Federation control system has also created a DNS name (e.g. on -[Google Cloud DNS](https://cloud.google.com/dns) or -[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration) -which provides load balancing across all of those services. For example, in a -very basic configuration: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -Each of the above IP addresses (which are just the external load balancer -ingress IP's of each cluster service) is of course load balanced across the pods -comprising the service in each cluster. - -In a more sophisticated configuration (e.g. 
on GCE or GKE), the Cluster -Federation control system -automatically creates a -[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules) -which exposes a single, globally load-balanced IP: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44 - -Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS) -in each Kubernetes cluster to preferentially return the local -clusterIP for the service in that cluster, with other clusters' -external service IP's (or a global load-balanced IP) also configured -for failover purposes: - - $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com - my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77 - my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157 - -If Cluster Federation Global Service Health Checking is enabled, multiple service health -checkers running across the federated clusters collaborate to monitor the health -of the service endpoints, and automatically remove unhealthy endpoints from the -DNS record (e.g. a majority quorum is required to vote a service endpoint -unhealthy, to avoid false positives due to individual health checker network -isolation). - -### Federated Replication Controllers - -So far we have a federated service defined, with a resolvable load balancer -hostname by which clients can reach it, but no pods serving traffic directed -there. So now we need a Federated Replication Controller. These are also fairly -straight-forward, being comprised of multiple underlying Kubernetes Replication -Controllers which do the hard work of keeping the desired number of Pod replicas -alive in each Kubernetes cluster. 
- - $ kubectl create -f my-service-rc.yaml --context="federation-1" - -where `my-service-rc.yaml` contains the following: - - kind: ReplicationController - metadata: - labels: - run: my-service - name: my-service - namespace: my-namespace - spec: - replicas: 6 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - -The Cluster Federation control system in turn creates one equivalent replication controller -(identical config to the above, except for the replica count) in each -of the underlying Kubernetes clusters, each of which results in -something like this: - - $ ./kubectl get -o yaml rc my-service --context="cluster-1" - kind: ReplicationController - metadata: - creationTimestamp: 2015-12-02T23:00:47Z - labels: - run: my-service - name: my-service - namespace: my-namespace - selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service - uid: 86542109-9948-11e5-a38c-42010af00002 - spec: - replicas: 2 - selector: - run: my-service - template: - metadata: - labels: - run: my-service - spec: - containers: - - image: gcr.io/google_samples/my-service:v1 - name: my-service - ports: - - containerPort: 2379 - protocol: TCP - - containerPort: 2380 - protocol: TCP - resources: {} - dnsPolicy: ClusterFirst - restartPolicy: Always - status: - replicas: 2 - -The exact number of replicas created in each underlying cluster will of course -depend on what scheduling policy is in force. In the above example, the -scheduler created an equal number of replicas (2) in each of the three -underlying clusters, to make up the total of 6 replicas required. To handle -entire cluster failures, various approaches are possible, including: -1. **simple overprovisioning**, such that sufficient replicas remain even if a - cluster fails.
This wastes some resources, but is simple and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster. This saves resources and is relatively
   simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Cluster Federation
   control system detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster. This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.

### Implementation Details

The implementation approach and architecture is very similar to Kubernetes, so
if you're familiar with how Kubernetes works, none of what follows will be
surprising. One additional design driver not present in Kubernetes is that
the Cluster Federation control system aims to be resilient to individual cluster and availability zone
failures. So the control plane spans multiple clusters. More specifically:

+ Cluster Federation runs its own distinct set of API servers (typically one
  or more per underlying Kubernetes cluster). These are completely
  distinct from the Kubernetes API servers for each of the underlying
  clusters.
+ Cluster Federation runs its own distinct quorum-based metadata store (etcd,
  by default). Approximately one quorum member runs in each underlying
  cluster ("approximately" because we aim for an odd number of quorum
  members, and typically don't want more than 5 quorum members, even
  if we have a larger number of federated clusters, so 2 clusters->3
  quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
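The quorum sizing rule just described (an odd member count, capped at 5 regardless of federation size) can be sketched as a small helper. This is an illustrative sketch only; the function name is an assumption, not part of any actual federation code:

```go
package main

import "fmt"

// quorumSize returns the number of etcd quorum members to run for a
// given number of federated clusters: always odd, at most 5, so that
// 2 clusters -> 3 members, 3 -> 3, 4 -> 3, 5 -> 5, 6 -> 5, 7 -> 5, etc.
func quorumSize(clusters int) int {
	if clusters <= 1 {
		return 1 // a single cluster needs no cross-cluster quorum
	}
	if clusters <= 4 {
		return 3
	}
	return 5
}

func main() {
	for n := 1; n <= 7; n++ {
		fmt.Printf("%d clusters -> %d quorum members\n", n, quorumSize(n))
	}
}
```

With more than four clusters, some clusters simply do not host a quorum member, which is why the text says "approximately" one member per cluster.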
Cluster Controllers in the Federation control system watch the Federation API server/etcd
state, and apply changes to the underlying Kubernetes clusters accordingly. They
also implement an anti-entropy mechanism for reconciling Cluster Federation "desired desired"
state against Kubernetes "actual desired" state.

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federated-services.md?pixel)]()

diff --git a/federation-phase-1.md b/federation-phase-1.md
deleted file mode 100644
index 0a3a8f50..00000000
--- a/federation-phase-1.md
+++ /dev/null
@@ -1,407 +0,0 @@
# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the "Control Plane" of
Kubernetes (K8S) federation (a.k.a. "Ubernetes"). For background on
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list scenarios
and use cases that motivate the K8S federation work. These use cases drive
the design, and they also verify it. We summarize the
functionality requirements from these use cases, and define the "in
scope" functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and also go through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
  a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
  cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
  primary cluster.
But if the capacity of the cluster is not
  sufficient, workloads should be automatically distributed to other
  clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
  workloads across different cloud providers, and to be able to easily
  increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
a limited size. While the community is actively improving it, it can
be expected that cluster size will be a problem if K8S is used for
large workloads or public PaaS infrastructure. While we can separate
different tenants into different clusters, it would be good to have a
unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and deregister
clusters.
+ Workloads should be spread across different clusters according to the
  workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be, strictly
  speaking, balanced).
+ The control plane needs to know when a cluster is down, and migrate
  the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
  activities.

## SCOPE

It's difficult to produce in one shot a perfect design that implements
all of the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the whole work.
In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes control
  plane to the underlying Kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown as follows:

![Ubernetes Architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
customers' learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don't need to learn and manage a different
storage system, although it is envisaged that other storage systems
(consul, zookeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters.
For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller for one or more underlying clusters,
and posts them back to `etcd` storage.

One subtlety worth noting here is that the scheduling decision is arrived at by
combining the application-specific request from the user (which might
include, for example, placement constraints) and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").

## Ubernetes Cluster Controller

The cluster controller
performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and then creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S cluster, and updates them as object status of the
   `cluster` API object. An alternative design might be to run a pod
   in each underlying cluster that reports metrics for that cluster to
   the Ubernetes control plane. Which approach is better remains an
   open topic of discussion.

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on the
control plane, and creates corresponding K8S services on each involved K8S
cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design.
For each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains the necessary properties, like the K8S cluster address and
credentials. The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata, like the version of the cluster

$version.clusterSpec

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It's used for system routines (not on behalf of users) | yes | string | |

$version.clusterStatus

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |
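By analogy with other Kubernetes API objects (and the Go types used later in the Horizontal Pod Autoscaler design), the clusterSpec and clusterStatus fields above might be expressed as Go types roughly as follows. All field and type names here are illustrative assumptions, not actual Ubernetes code:

```go
package main

import "fmt"

// ClusterSpec mirrors $version.clusterSpec: how the control plane
// reaches and authenticates to one underlying K8S cluster.
type ClusterSpec struct {
	Address    string // address of the cluster
	Credential string // type and data of the credential used to access the cluster
}

// ClusterPhase mirrors the "phase" enum described below:
// pending, running, offline or terminated.
type ClusterPhase string

// ClusterStatus mirrors $version.clusterStatus.
type ClusterStatus struct {
	Phase    ClusterPhase      // recently observed lifecycle phase
	Capacity map[string]string // available resources, e.g. "cpu", "memory"
	Meta     string            // other cluster metadata, like the version
}

// Cluster is the first-class API object registered under /api/{$version}/clusters.
type Cluster struct {
	Name   string
	Spec   ClusterSpec
	Status ClusterStatus
}

func main() {
	c := Cluster{
		Name:   "foo",
		Spec:   ClusterSpec{Address: "https://10.0.0.1", Credential: "bearer-token"},
		Status: ClusterStatus{Phase: "running", Capacity: map[string]string{"cpu": "64", "memory": "256Gi"}},
	}
	fmt.Println(c.Name, c.Status.Phase)
}
```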
**For simplicity we didn't introduce a separate "cluster metrics" API
object here**. The cluster resource metrics are stored in the cluster
status section, just like what we did for nodes in K8S. In phase one it
only contains available CPU resources and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In the future we will improve
this with more efficient approaches, like leveraging Heapster, and more
metrics will also be supported. Similar to node phases in K8S, the "phase"
field includes the following values:

+ pending: newly registered clusters or clusters suspended by the admin
  for various reasons. They are not eligible for accepting workloads
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from the federation

Below is the state transition diagram.

![Cluster State Transition Diagram](ubernetes-cluster-state.png)

## Replication Controller

A global workload submitted to the control plane is represented as a
replication controller in the Cluster Federation control plane. When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case this
may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
  scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
  capacity on cluster Foo, it's OK for it to be scheduled to cluster Bar
  (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
  and thirty percent should be scheduled to cluster Bar (use case:
  vendor lock-in avoidance).
In phase one, we only introduce a
_clusterSelector_ field to filter acceptable clusters. By default
there is no such selector, which means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, the desired number of
split workloads, the desired ratio of workloads spread across different
clusters, etc.

Besides this explicit "clusterSelector" filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
"nodeSelector" which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Federated Services

The Service API object exposed by the Cluster Federation is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
federation service controller will create corresponding Kubernetes
service objects on the underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Cluster Federation API compatible with the Kubernetes API.
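The phase-one rule that replicas are spread evenly across the clusters matched by the clusterSelector can be sketched as a small helper. This is an illustrative sketch under the assumption that any remainder is spread one-per-cluster; the function name is not actual Ubernetes code:

```go
package main

import "fmt"

// distributeReplicas splits `replicas` as evenly as possible across the
// acceptable clusters, giving earlier clusters the remainder.
// E.g. 5 replicas over (Foo, Bar) -> Foo: 3, Bar: 2.
func distributeReplicas(replicas int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	n := len(clusters)
	if n == 0 {
		return out
	}
	base, rem := replicas/n, replicas%n
	for i, c := range clusters {
		out[c] = base
		if i < rem {
			out[c]++ // spread the remainder over the first `rem` clusters
		}
	}
	return out
}

func main() {
	// The sample RC above: 5 replicas, clusterSelector "name in (Foo, Bar)".
	fmt.Println(distributeReplicas(5, []string{"Foo", "Bar"}))
}
```

Each per-cluster count would then become the `replicas` field of the sub replication controller created in that cluster.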
## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Cluster Federation control
plane:

1. A replication controller is created by the client.
1. The APIServer persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
API Server to create the RC. In phase one we don't support complex
distribution policies. The scheduling rule is basically:
    1. If an RC does not specify any nodeSelector, it will be scheduled
       to the least loaded K8S cluster(s) that has enough available
       resources.
    1. If an RC specifies _N_ acceptable clusters in the
       clusterSelector, all replicas will be evenly distributed among
       these clusters.

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently it still accepts workload
requests from other K8S clients, or even from another Cluster Federation control
plane. The Cluster Federation scheduling decision is based on this picture of
available resources. However, when the actual RC creation happens in
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with
proposed solutions like resource reservation mechanisms.

![Federated Scheduling](ubernetes-scheduling.png)

## Service Discovery

This part has been included in the section "Federated Service" of the
document
"[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)".
Please refer to that document for details.

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/federation-phase-1.md?pixel)]()

diff --git a/ha_master.md b/ha_master.md
deleted file mode 100644
index d4cf26a9..00000000
--- a/ha_master.md
+++ /dev/null
@@ -1,236 +0,0 @@
# Automated HA master deployment

**Authors:** filipg@, jsz@

# Introduction

We want to allow users to easily replicate Kubernetes masters to get a highly available cluster,
initially using `kube-up.sh` and `kube-down.sh`.

This document describes the technical design of this feature. It assumes that we are using the aforementioned
scripts for cluster deployment. All of the ideas described in the following sections should be easy
to implement on GCE, AWS and other cloud providers.

It is a non-goal to design a specific setup for bare-metal environments, which
might be very different.

# Overview

In a cluster with a replicated master, we will have N VMs, each running the regular master components
such as apiserver, etcd, scheduler or controller manager. These components will interact in the
following way:
* All etcd replicas will be clustered together and will use master election
  and quorum mechanisms to agree on the state. All of these mechanisms are integral
  parts of etcd and we will only have to configure them properly.
* All apiserver replicas will work independently, talking to an etcd on
  127.0.0.1 (i.e. the local etcd replica), which if needed will forward requests to the current etcd master
  (as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
* We will introduce provider-specific solutions to load balance traffic between master replicas
  (see the `load balancing` section).
* Controller manager, scheduler & cluster autoscaler will use a lease mechanism, and
  only a single instance will be the active master. All others will wait in standby mode.
* All add-on managers will work independently and each of them will try to keep add-ons in sync.

# Detailed design

## Components

### etcd

```
Note: This design for etcd clustering is quite pet-set like - each etcd
replica has its name which is explicitly used in etcd configuration etc. In
the medium-term future we would like to have the ability to run masters as part
of an autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove
replicas automatically. This is pretty tricky and this design does not cover
it. It will be covered in a separate doc.
```

All etcd instances will be clustered together and one of them will be an elected master.
In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
configured in such a way that all writes and reads go through the master (requests
will be forwarded by the local etcd server such that this is invisible to the user). This will
affect latency for all operations, but latency should not increase by much more than the network
latency between master replicas (latency between GCE zones within a region is < 10ms).

Currently etcd exposes its port only on the localhost interface. In order to allow clustering
and inter-VM communication we will also have to use the public interface. To secure the
communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).

When generating the command line for etcd we will always assume it's part of a cluster
(initially of size 1) and list all existing Kubernetes master replicas.
Based on that, we will set the following flags:
* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
  * `existing` if there is more than one replica, i.e.
the list of existing master replicas is non-empty.

This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
with master replicas will be generated in the `kube-up.sh` script and passed on as an env variable
`INITIAL_ETCD_CLUSTER`.

### apiservers

All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact the
etcd replica running on the same VM. If needed, such requests will be forwarded by the etcd server to the
etcd leader. This functionality is completely hidden from the client (the apiserver
in our case).

The caching mechanism, which is implemented in the apiserver, will not be affected by
replicating the master because:
* GET requests go directly to etcd
* LIST requests go either directly to etcd or to a cache populated via watch
  (depending on the ResourceVersion in ListOptions). In the second scenario,
  after a PUT/POST request, changes might not be visible in the LIST response.
  This is however not worse than it is with the current single master.
* WATCH does not give any guarantees about when a change will be delivered.

#### load balancing

With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
providers have different capabilities and limitations, we will not try to find a common lowest
denominator that will work everywhere. Instead we will document various options and apply different
solutions for different deployments. Below we list possible approaches:

1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
or Google Cloud DNS (GCP). For load balancing we will have two options:
   1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
   1.2. use the round-robin DNS technique to access all apiservers directly
2.
`Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
will be manually managed by the user. We will provide detailed documentation for the entries we
expect.
3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
external IP address that is later assigned to the master VM. When creating additional replicas we
will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
instead of a single master. When removing the second-to-last replica we will reverse this operation (assign the
IP address to the remaining master VM and delete the load balancer). That way the user will not have to provide
a domain name and all client configurations will keep working.

This will also impact `kubelet <-> master` communication, which should use load
balancing as well. Depending on the chosen method we will use it to properly configure the
kubelet.

#### `kubernetes` service

Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses a command line flag
`--apiserver-count`, it is not very dynamic and changing the number of master
replicas would require restarting all masters.

To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` in the `kube-system` namespace that will keep an expiration
time for each apiserver (keyed by IP). Each apiserver will do three things:

1. periodically update the expiration time for its own IP address
2. remove all the stale IP addresses from the endpoints list
3. add its own IP address if it is not on the list yet.

That way we will not only solve the problem of a dynamically changing number
of apiservers in the cluster, but also the problem of non-responsive apiservers
that should be removed from the `kubernetes` service endpoints list.

#### Certificates

Certificate generation will work as it does today.
In particular, on GCE, we will
generate it for the public IP used to access the cluster (see the `load balancing`
section) and the local IP of the master replica VM.

That means that with multiple master replicas and a load balancer in front
of them, accessing one of the replicas directly (using its ephemeral public
IP) will not work on GCE without appropriate flags:

- `kubectl --insecure-skip-tls-verify=true`
- `curl --insecure`
- `wget --no-check-certificate`

For other deployment tools and providers the details of certificate generation
may be different, but it must be possible to access the cluster by using either
the main cluster endpoint (DNS name or IP address) or the internal service called
`kubernetes` that points directly to the apiservers.

### controller manager, scheduler & cluster autoscaler

Controller manager and scheduler will by default use a lease mechanism to choose an active instance
among all masters. Only one instance will be performing any operations;
all others will be waiting in standby mode.

We will use the same configuration in non-replicated mode to simplify deployment scripts.

### add-on manager

All add-on managers will work independently. Each of them will observe the current state of
add-ons and will try to sync it with the files on disk. As a result, due to races, a single add-on
can be updated multiple times in a row after upgrading the master. Long-term we should fix this
by using a mechanism similar to that of the controller manager or scheduler. However, currently the add-on
manager is just a bash script and adding a master election mechanism would not be easy.

## Adding replica

Command to add a new replica on GCE using the kube-up script:

```
KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
```

Pseudo-code for adding a new master replica using managed DNS and a load balancer is the following:

```
1. If there is no load balancer for this cluster:
   1.
Create load balancer using ephemeral IP address
   2. Add existing apiserver to the load balancer
   3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
   4. Update DNS to point to the load balancer.
2. Clone existing master (create a new VM with the same configuration) including
   all env variables (certificates, IP ranges etc), with the exception of
   `INITIAL_ETCD_CLUSTER`.
3. SSH to an existing master and run the following command to extend the etcd cluster
   with the new instance:
   `curl :4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://:2380"]}'`
4. Add IP address of the new apiserver to the load balancer.
```

A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
is identical to the one using DNS, with a different step to set up the load balancer:

```
1. If there is no load balancer for this cluster:
   1. Unassign the IP from the existing master replica
   2. Create load balancer using the static IP reclaimed in the previous step
   3. Add existing apiserver to the load balancer
   4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
...
```

## Deleting replica

Command to delete one replica on GCE using the kube-up script:

```
KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
```

Pseudo-code for deleting an existing master replica is the following:

```
1. Remove the replica IP address from the load balancer or DNS configuration
2. SSH to one of the remaining masters and run the following command to remove the replica from the cluster:
   `curl etcd-0:4001/v2/members/ -XDELETE -L`
3. Delete the replica VM
4. If the load balancer has only a single target instance, then delete the load balancer
5. Update DNS to point to the remaining master replica, or [on GCE] assign the static IP back to the master VM.
```

## Upgrades

Upgrading a replicated master will be possible by upgrading the replicas one by one using existing tools
(e.g. upgrade.sh for GCE). This will work out of the box because:
* Requests from nodes will be correctly served by either a new or an old master, because the apiserver is backward compatible.
* Requests from the scheduler (and controllers) go to a local apiserver via the localhost interface, so both components
will be in the same version.
* The apiserver talks only to a local etcd replica, which will be in a compatible version.
* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/ha_master.md?pixel)]()

diff --git a/horizontal-pod-autoscaler.md b/horizontal-pod-autoscaler.md
deleted file mode 100644
index 1ac9c24b..00000000
--- a/horizontal-pod-autoscaler.md
+++ /dev/null
@@ -1,263 +0,0 @@
**Warning! This document might be outdated.**

# Horizontal Pod Autoscaling

## Preface

This document briefly describes the design of the horizontal autoscaler for
pods. The autoscaler (implemented as a Kubernetes API resource and controller)
is responsible for dynamically controlling the number of replicas of some
collection (e.g. the pods of a ReplicationController) to meet some objective(s),
for example a target per-pod CPU utilization.

This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).

## Overview

The resource usage of a serving application usually varies over time: sometimes
the demand for the application rises, and sometimes it drops. In Kubernetes
version 1.0, a user can only manually set the number of serving pods. Our aim is
to provide a mechanism for the automatic adjustment of the number of pods based
on CPU utilization statistics (a future version will allow autoscaling based on
other resources/metrics).

## Scale Subresource

In Kubernetes version 1.1, we are introducing the Scale subresource and implementing
horizontal autoscaling of pods based on it. The Scale subresource is supported for
replication controllers and deployments. The Scale subresource is a Virtual Resource
(it does not correspond to an object stored in etcd). It is only present in the API
as an interface that a controller (in this case the HorizontalPodAutoscaler) can
use to dynamically scale the number of replicas controlled by some other API
object (currently ReplicationController and Deployment) and to learn the current
number of replicas. Scale is a subresource of the API object that it serves as
the interface for. The Scale subresource is useful because whenever we introduce
another type we want to autoscale, we just need to implement the Scale
subresource for it. The wider discussion regarding Scale took place in issue
[#1629](https://github.com/kubernetes/kubernetes/issues/1629).
The Scale subresource is available in the API for replication controllers and deployments under the
following paths:

`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`

`apis/extensions/v1beta1/deployments/mydeployment/scale`

It has the following structure:

```go
// represents a scaling request for a resource.
type Scale struct {
    unversioned.TypeMeta
    api.ObjectMeta

    // defines the behavior of the scale.
    Spec ScaleSpec

    // current status of the scale.
    Status ScaleStatus
}

// describes the attributes of a scale subresource.
type ScaleSpec struct {
    // desired number of instances for the scaled object.
    Replicas int `json:"replicas,omitempty"`
}

// represents the current status of a scale subresource.
type ScaleStatus struct {
    // actual number of observed instances of the scaled object.
    Replicas int `json:"replicas"`

    // label query over pods that should match the replicas count.
    Selector map[string]string `json:"selector,omitempty"`
}
```

Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment
associated with the given Scale subresource. `ScaleStatus.Replicas` reports how
many pods are currently running in the replication controller/deployment, and
`ScaleStatus.Selector` returns the selector for those pods.

## HorizontalPodAutoscaler Object

In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It
is accessible under:

`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`

It has the following structure:

```go
// configuration of a horizontal pod autoscaler.
type HorizontalPodAutoscaler struct {
    unversioned.TypeMeta
    api.ObjectMeta

    // behavior of autoscaler.
    Spec HorizontalPodAutoscalerSpec

    // current information about the autoscaler.
    Status HorizontalPodAutoscalerStatus
}

// specification of a horizontal pod autoscaler.
-type HorizontalPodAutoscalerSpec struct { - // reference to Scale subresource; horizontal pod autoscaler will learn the current resource - // consumption from its status, and will set the desired number of pods by modifying its spec. - ScaleRef SubresourceReference - // lower limit for the number of pods that can be set by the autoscaler, default 1. - MinReplicas *int - // upper limit for the number of pods that can be set by the autoscaler. - // It cannot be smaller than MinReplicas. - MaxReplicas int - // target average CPU utilization (represented as a percentage of requested CPU) over all the pods; - // if not specified, it defaults to 80% of the requested resources. - CPUUtilization *CPUTargetUtilization -} - -type CPUTargetUtilization struct { - // fraction of the requested CPU that should be utilized/used, - // e.g. 70 means that 70% of the requested CPU should be in use. - TargetPercentage int -} - -// current status of a horizontal pod autoscaler -type HorizontalPodAutoscalerStatus struct { - // most recent generation observed by this autoscaler. - ObservedGeneration *int64 - - // last time the HorizontalPodAutoscaler scaled the number of pods; - // used by the autoscaler to control how often the number of pods is changed. - LastScaleTime *unversioned.Time - - // current number of replicas of pods managed by this autoscaler. - CurrentReplicas int - - // desired number of replicas of pods managed by this autoscaler. - DesiredReplicas int - - // current average CPU utilization over all pods, represented as a percentage of requested CPU, - // e.g. 70 means that an average pod is now using 70% of its requested CPU. - CurrentCPUUtilizationPercentage *int -} -``` - -`ScaleRef` is a reference to the Scale subresource. -`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler -configuration.
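For illustration only, an autoscaler object with these fields might be written as follows in YAML. This is our sketch based on the Go definitions above; the `scaleRef` subfields in particular are assumptions, not the authoritative v1beta1 serialization:

```yaml
apiVersion: extensions/v1beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myautoscaler
spec:
  scaleRef:                        # points at the Scale subresource of the target
    kind: ReplicationController
    name: myrc
    subresource: scale
  minReplicas: 2                   # lower bound for the autoscaler
  maxReplicas: 10                  # upper bound for the autoscaler
  cpuUtilization:
    targetPercentage: 70           # target average CPU utilization, % of request
```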
We are also introducing the HorizontalPodAutoscalerList object to -enable listing all autoscalers in a namespace: - -```go -// list of horizontal pod autoscaler objects. -type HorizontalPodAutoscalerList struct { - unversioned.TypeMeta - unversioned.ListMeta - - // list of horizontal pod autoscaler objects. - Items []HorizontalPodAutoscaler -} -``` - -## Autoscaling Algorithm - -The autoscaler is implemented as a control loop. It periodically queries pods -described by `Status.Selector` of the Scale subresource, and collects their CPU -utilization. Then, it compares the arithmetic mean of the pods' CPU utilization -with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of -the Scale if needed to match the target (preserving condition: MinReplicas <= -Replicas <= MaxReplicas). - -The period of the autoscaler is controlled by the -`--horizontal-pod-autoscaler-sync-period` flag of the controller manager. The -default value is 30 seconds. - - -CPU utilization is the recent CPU usage of a pod (average across the last 1 -minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU -usage is taken directly from Heapster. In the future, there will be an API on the master -for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)). - -The target number of pods is calculated from the following formula: - -``` -TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target) -``` - -Starting and stopping pods may introduce noise to the metric (for instance, -starting may temporarily increase CPU). So, after each action, the autoscaler -should wait some time for reliable data. Scale-up can only happen if there was -no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from -the last rescaling. Moreover, any scaling will only be made if: -`avg(CurrentPodsCPUUtilization) / Target` drops below 0.9 or increases above 1.1 -(10% tolerance).
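Putting the formula, the 10% tolerance band, and the MinReplicas/MaxReplicas bounds together, the core computation can be sketched in Go. This is our simplification of the loop described above, not the controller's actual code (it omits the cooldown timers):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches one iteration of the autoscaling computation:
// TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target),
// skipped when the average utilization is within the 10% tolerance band,
// and clamped to [minReplicas, maxReplicas]. Utilizations and the target
// are percentages of requested CPU.
func desiredReplicas(podUtilization []int, target, minReplicas, maxReplicas int) int {
	current := len(podUtilization)
	if current == 0 {
		return minReplicas
	}
	sum := 0
	for _, u := range podUtilization {
		sum += u
	}
	ratio := float64(sum) / float64(current) / float64(target)
	if ratio > 0.9 && ratio < 1.1 {
		return current // within 10% tolerance: leave the replica count alone
	}
	n := int(math.Ceil(float64(sum) / float64(target)))
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// Three pods at 90% utilization against a 60% target: ceil(270/60) = 5.
	fmt.Println(desiredReplicas([]int{90, 90, 90}, 60, 1, 10))
}
```

Note how two pods averaging exactly the target (e.g. 62% and 58% against 60%) produce no change, because the ratio falls inside the tolerance band.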
This approach has two benefits: - -* The autoscaler works in a conservative way. If new user load appears, it is -important for us to rapidly increase the number of pods, so that user requests -will not be rejected. Lowering the number of pods is not that urgent. - -* The autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting -decisions if the load is not stable. - -## Relative vs. absolute metrics - -We chose values of the target metric to be relative (e.g. 90% of requested CPU -resource) rather than absolute (e.g. 0.6 core) for the following reason. If we -chose an absolute metric, the user would need to guarantee that the target is lower -than the request. Otherwise, overloaded pods may not be able to consume more -than the autoscaler's absolute target utilization, thereby preventing the -autoscaler from seeing high enough utilization to trigger it to scale up. This -may be especially troublesome when the user changes requested resources for a pod, -because they would also need to change the autoscaler utilization threshold. -Therefore, we decided to choose a relative metric. For the user, it is enough to set -it to a value smaller than 100%, and further changes of requested resources will -not invalidate it. - -## Support in kubectl - -To make manipulation of HorizontalPodAutoscaler objects simpler, we added support -for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In -addition, in the future, we are planning to add kubectl support for the following -use-cases: -* When creating a replication controller or deployment with -`kubectl create [-f]`, it should be possible to specify an additional -autoscaler object. (This should work out-of-the-box once creation of autoscalers -is supported by kubectl, as we may include multiple objects in the same config -file). -* *[future]* When running an image with `kubectl run`, there should be an -additional option to create an autoscaler for it.
-* *[future]* We will add a new command `kubectl autoscale` that will allow for -easy creation of an autoscaler object for an already existing replication -controller/deployment. - -## Next steps - -We list here some features that are not supported in Kubernetes version 1.1. -However, we want to keep them in mind, as they will most probably be needed in -the future. -Our design is in general compatible with them. -* *[future]* **Autoscale pods based on metrics other than CPU** (e.g. -memory, network traffic, qps). This includes scaling based on a custom/application metric. -* *[future]* **Autoscale pods based on an aggregate metric.** The autoscaler, -instead of computing the average of a target metric across pods, will use a single, -external metric (e.g. a qps metric from a load balancer). The metric will be -aggregated while the target will remain per-pod (e.g. when observing 100 qps on -the load balancer while the target is 20 qps per pod, the autoscaler will set the number -of replicas to 5). -* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers -of pods for different metrics are different, choose the largest target number of -pods. -* *[future]* **Scale the number of pods starting from 0.** All pods can be -turned off, and then turned on when there is a demand for them. When a request -to a service with no pods arrives, kube-proxy will generate an event for the -autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247). -* *[future]* **When scaling down, make a more educated decision about which pods to -kill.** E.g.: if two or more pods from the same replication controller are on -the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301).
- - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/horizontal-pod-autoscaler.md?pixel)]() - diff --git a/identifiers.md b/identifiers.md deleted file mode 100644 index a37411f9..00000000 --- a/identifiers.md +++ /dev/null @@ -1,113 +0,0 @@ -# Identifiers and Names in Kubernetes - -A summarization of the goals and recommendations for identifiers in Kubernetes. -Described in GitHub issue [#199](http://issue.k8s.io/199). - - -## Definitions - -`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time -and space; intended to distinguish between historical occurrences of similar -entities. - -`Name`: A non-empty string guaranteed to be unique within a given scope at a -particular time; used in resource URLs; provided by clients at creation time and -encouraged to be human friendly; intended to facilitate creation idempotence and -space-uniqueness of singleton objects, distinguish distinct entities, and -reference particular entities across operations. - -[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL): -An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, -with the '-' character allowed anywhere except the first or last character, -suitable for use as a hostname or segment in a domain name. - -[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN): -One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum -length of 253 characters. - -[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID): -A 128 bit generated value that is extremely unlikely to collide across time and -space and requires no central coordination. 
- -[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME): -An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, -with the '-' character allowed anywhere except the first or the last character -or adjacent to another '-' character; it must contain at least one (a-z) -character. - -## Objectives for names and UIDs - -1. Uniquely identify (via a UID) an object across space and time. -2. Uniquely name (via a name) an object across space. -3. Provide human-friendly names in API operations and/or configuration files. -4. Allow idempotent creation of API resources (#148) and enforcement of -space-uniqueness of singleton objects. -5. Allow DNS names to be automatically generated for some objects. - - -## General design - -1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must -be specified. Name must be non-empty and unique within the apiserver. This -enables idempotent and space-unique creation operations. Parts of the system -(e.g. replication controller) may join strings (e.g. a base name and a random -suffix) to create a unique Name. For situations where generating a name is -impractical, some or all objects may support a param to auto-generate a name. -Generating random names will defeat idempotency. - * Examples: "guestbook.user", "backend-x4eb1" -2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? -format TBD via #1114) may be specified. Depending on the API receiver, -namespaces might be validated (e.g. apiserver might ensure that the namespace -actually exists). If a namespace is not specified, one will be assigned by the -API receiver. This assignment policy might vary across API receivers (e.g. -apiserver might have a default, kubelet might generate something semi-random). - * Example: "api.k8s.example.com" -3. Upon acceptance of an object via an API, the object is assigned a UID -(a UUID). UID must be non-empty and unique across space and time.
- * Example: "01234567-89ab-cdef-0123-456789abcdef" - -## Case study: Scheduling a pod - -Pods can be placed onto a particular node in a number of ways. This case study -demonstrates how the above design can be applied to satisfy the objectives. - -### A pod scheduled by a user through the apiserver - -1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver. -2. The apiserver validates the input. - 1. A default Namespace is assigned. - 2. The pod name must be space-unique within the Namespace. - 3. Each container within the pod has a name which must be space-unique within -the pod. -3. The pod is accepted. - 1. A new UID is assigned. -4. The pod is bound to a node. - 1. The kubelet on the node is passed the pod's UID, Namespace, and Name. -5. Kubelet validates the input. -6. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod -from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is -unique across time. * This may correspond to Docker's container ID. - -### A pod placed by a config file on the node - -1. A config file is stored on the node, containing a pod with UID="", -Namespace="", and Name="cadvisor". -2. Kubelet validates the input. - 1. Since UID is not provided, kubelet generates one. - 2. Since Namespace is not provided, kubelet generates one. - 1. The generated namespace should be deterministic and cluster-unique for -the source, such as a hash of the hostname and file path. - * E.g. Namespace="file-f4231812554558a718a01ca942782d81" -3. Kubelet runs the pod. - 1. Each container is started up with enough metadata to distinguish the pod -from whence it came. - 2. Each attempt to run a container is assigned a UID (a string) that is -unique across time. - 1. This may correspond to Docker's container ID. 
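The DNS_LABEL and DNS_SUBDOMAIN formats defined at the top of this document lend themselves to mechanical validation. A sketch in Go follows; the regular expression is our rendering of the rfc1035/rfc1123 rules quoted above, not code from the Kubernetes tree:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Our rendering of the rfc1035/rfc1123 label rule quoted above: lowercase
// alphanumerics, with '-' allowed anywhere except the first or last
// character. Length (1-63 chars) is checked separately.
var labelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel reports whether s is a valid DNS_LABEL.
func isDNSLabel(s string) bool {
	return len(s) <= 63 && labelRE.MatchString(s)
}

// isDNSSubdomain reports whether s is a valid DNS_SUBDOMAIN: one or more
// labels separated by '.', with a maximum total length of 253 characters.
func isDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, part := range strings.Split(s, ".") {
		if !isDNSLabel(part) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isDNSLabel("backend-x4eb1"))      // true
	fmt.Println(isDNSLabel("-bad"))               // false: leading '-'
	fmt.Println(isDNSSubdomain("guestbook.user")) // true
}
```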
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/identifiers.md?pixel)]() - diff --git a/indexed-job.md b/indexed-job.md deleted file mode 100644 index 5a089c22..00000000 --- a/indexed-job.md +++ /dev/null @@ -1,900 +0,0 @@ -# Design: Indexed Feature of Job object - - -## Summary - -This design extends Kubernetes with user-friendly support for -running embarrassingly parallel jobs. - -Here, *parallel* means on multiple nodes, which means multiple pods. -By *embarrassingly parallel*, it is meant that the pods -have no dependencies between each other. In particular, neither -ordering between pods nor gang scheduling are supported. - -Users already have two other options for running embarrassingly parallel -Jobs (described in the next section), but both have ease-of-use issues. - -Therefore, this document proposes extending the Job resource type to support -a third way to run embarrassingly parallel programs, with a focus on -ease of use. - -This new style of Job is called an *indexed job*, because each Pod of the Job -is specialized to work on a particular *index* from a fixed length array of work -items. - -## Background - -The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports -the embarrassingly parallel use case through *workqueue jobs*. -While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very -flexible, they can be difficult to use. They: (1) typically require running a -message queue or other database service, (2) typically require modifications -to existing binaries and images, and (3) are prone to subtle race conditions that are easy to -overlook. - -Users also have another option for parallel jobs: creating [multiple Job objects -from a template](docs/design/indexed-job.md#job-patterns). For small numbers of -Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job -objects at once.
But, that approach also has its drawbacks: (1) for large levels -of parallelism (hundreds or thousands of pods) this approach means that listing -all jobs presents too much information, (2) there is no single source of -information about the success or failure of what the user views as a single -logical process. - -Indexed job provides a third option with better ease-of-use for common -use cases. - -## Requirements - -### User Requirements - -- Users want an easy way to run a Pod to completion *for each* item within a -[work list](#example-use-cases). - -- Users want to run these pods in parallel for speed, but to vary the level of -parallelism as needed, independent of the number of work items. - -- Users want to do this without requiring changes to existing images, -or source-to-image pipelines. - -- Users want a single object that encompasses the lifetime of the parallel -program. Deleting it should delete all dependent objects. It should report the -status of the overall process. Users should be able to wait for it to complete, -and to refer to it from other resource types, such as -[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980). - - -### Example Use Cases - -Here are several examples of *work lists*: lists of command lines that the user -wants to run, each line its own Pod. (Note that in practice, a work list may not -ever be written out in this form, but it exists in the mind of the Job creator, -and it is a useful way to talk about the intent of the user when discussing -alternatives for specifying Indexed Jobs). - -Note that we will not have the user express their requirements in work list -form; it is just a format for presenting use cases. Subsequent discussion will -reference these work lists.
- -#### Work List 1 - -Process several files with the same program: - -``` -/usr/local/bin/process_file 12342.dat -/usr/local/bin/process_file 97283.dat -/usr/local/bin/process_file 38732.dat -``` - -#### Work List 2 - -Process a matrix (or image, etc) in rectangular blocks: - -``` -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31 -``` - -#### Work List 3 - -Build a program at several different git commits: - -``` -HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH -HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH -HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH -``` - -#### Work List 4 - -Render several frames of a movie: - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1 -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2 -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3 -``` - -#### Work List 5 - -Render several blocks of frames (Render blocks to avoid Pod startup overhead for -every frame): - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100 -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200 -./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300 -``` - -## Design Discussion - -### Converting Work Lists into Indexed Jobs. - -Given a work list, like in the [work list examples](#work-list-examples), -the information from the work list needs to get into each Pod of the Job. - -Users will typically not want to create a new image for each job they -run. They will want to use existing images. So, the image is not the place -for the work list. 
- -A work list can be stored on networked storage, and mounted by pods of the job. -Also, as a shortcut, for small work lists, it can be included in an annotation on -the Job object, which is then exposed as a volume in the pod via the downward -API. - -### What Varies Between Pods of a Job - -Pods need to differ in some way to do something different. (They do not differ -in the work-queue style of Job, but that style has ease-of-use issues). - -A general approach would be to allow pods to differ from each other in arbitrary -ways. For example, the Job object could have a list of PodSpecs to run. -However, this is so general that it provides little value. It would: - -- make the Job Spec very verbose, especially for jobs with thousands of work -items -- make Job such a vague concept that it is hard to explain to users -- not match practice: we do not see cases where many pods differ across many -fields of their specs yet need to run as a group with no ordering constraints -- require CLIs and UIs to support more options for creating Jobs -- hinder monitoring and accounting databases, which want to aggregate data -for pods with the same controller; pods with very different Specs may -not make sense to aggregate -- mean that profiling, debugging, accounting, auditing and monitoring tools cannot assume -common images/files, behaviors, provenance and so on between Pods of a Job. - -Also, variety has another cost. Pods which differ in ways that affect scheduling -(node constraints, resource requirements, labels) prevent the scheduler from -treating them as fungible, which is an important optimization for the scheduler. - -Therefore, we will not allow Pods from the same Job to differ arbitrarily -(anyway, users can use multiple Job objects for that case). We will try to -allow as little as possible to differ between pods of the same Job, while still -allowing users to express common parallel patterns easily.
Users who need to -run jobs which differ in other ways can create multiple Jobs, and manage -them as a group using labels. - -From the above work lists, we see a need for Pods which differ in their command -lines, and in their environment variables. These work lists do not require the -pods to differ in other ways. - -Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf) -has shown this model to be applicable to a very broad range of problems, despite -this restriction. - -Therefore we will allow pods in the same Job to differ **only** in the following -aspects: -- command line -- environment variables - -### Composition of existing images - -The docker image that is used in a job may not be maintained by the person -running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD. -If we require people to specify the complete command line to use an Indexed Job, -then they will not automatically pick up changes in the default -command or args. - -This needs more thought. - -### Running Ad-Hoc Jobs using kubectl - -A user should be able to easily start an Indexed Job using `kubectl`. For -example, to run [work list 1](#work-list-1), a user should be able to type -something simple like: - -``` -kubectl run process-files --image=myfileprocessor \ - --per-completion-env=F="12342.dat 97283.dat 38732.dat" \ - --restart=OnFailure \ - -- \ - /usr/local/bin/process_file '$F' -``` - -In the above example: - -- `--restart=OnFailure` implies creating a job instead of a replicationController. -- Each pod's command line is `/usr/local/bin/process_file $F`. -- `--per-completion-env=` implies the job's `.spec.completions` is set to the -length of the argument array (3 in the example). -- `--per-completion-env=F=` causes an env var named `F` to be available in -the environment when the command line is evaluated.
- -How exactly this happens is discussed later in the doc: this is a sketch of the -user experience. - -In practice, the list of files might be much longer and stored in a file on the -user's local host, like: - -``` -$ cat files-to-process.txt -12342.dat -97283.dat -38732.dat -... -``` - -So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`. - -However, `kubectl` should also support a format like: - `--per-completion-env=F=@files-to-process.txt`. -That allows `kubectl` to parse the file, point out any syntax errors, and avoid -running up against command-line length limits (2MB is common; as low as 4kB is -POSIX-compliant). - -One case we do not try to handle is where the file of work is stored on a cloud -filesystem, and not accessible from the user's local host. Then we cannot easily -use indexed job, because we do not know the number of completions. The user -needs to copy the file locally first or use the Work-Queue style of Job (already -supported). - -Another case we do not try to handle is where the input file does not exist yet -because this Job is to be run at a future time, or depends on another job. The -workflow and scheduled job proposals need to consider this case. For that case, -you could use an indexed job which runs a program which shards the input file -(map-reduce-style). - -#### Multiple parameters - -The user may also have multiple parameters, like in [work list 2](#work-list-2).
-One way is to just list all the command lines already expanded, one per line, in -a file, like this: - -``` -$ cat matrix-commandlines.txt -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15 -/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31 -/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31 -``` - -and run the Job like this: - -``` -kubectl run process-matrix --image=my/matrix \ - --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \ - --restart=OnFailure \ - -- \ - 'eval "$COMMAND_LINE"' -``` - -However, this may have some subtleties with shell escaping. Also, it depends on -the user knowing all the correct arguments to the docker image being used (more -on this later). - -Instead, kubectl should support multiple instances of the `--per-completion-env` -flag. For example, to implement work list 2, a user could do: - -``` -kubectl run process-matrix --image=my/matrix \ - --per-completion-env=SR="0 16 0 16" \ - --per-completion-env=ER="15 31 15 31" \ - --per-completion-env=SC="0 0 16 16" \ - --per-completion-env=EC="15 15 31 31" \ - --restart=OnFailure \ - -- \ - /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC -```
- -For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a -complete workflow from a single command line would be messy, because of the need -to specify all the arguments multiple times. - -For that use case, the user could create a workflow message by hand. Or the user -could create a job template, and then make a workflow from the templates, -perhaps like this: - -``` -$ kubectl run process-files --image=myfileprocessor \ - --per-completion-env=F="12342.dat 97283.dat 38732.dat" \ - --restart=OnFailure \ - --asTemplate \ - -- \ - /usr/local/bin/process_file '$F' -created "jobTemplate/process-files" -$ kubectl run merge-files --image=mymerger \ - --restart=OnFailure \ - --asTemplate \ - -- \ - /usr/local/bin/mergefiles 12342.out 97283.out 38732.out \ -created "jobTemplate/merge-files" -$ kubectl create-workflow process-and-merge \ - --job=jobTemplate/process-files - --job=jobTemplate/merge-files - --dependency=process-files:merge-files -created "workflow/process-and-merge" -``` - -### Completion Indexes - -A JobSpec specifies the number of times a pod needs to complete successfully, -through the `job.Spec.Completions` field. The number of completions will be -equal to the number of work items in the work list. - -Each pod that the job controller creates is intended to complete one work item -from the work list. Since a pod may fail, several pods may, serially, attempt to -complete the same index. Therefore, we call it a *completion index* (or just -*index*), but not a *pod index*. - -For each completion index, in the range 1 to `.job.Spec.Completions`, the job -controller will create a pod with that index, and keep creating them on failure, -until each index is completed. - -A dense integer index, rather than a sparse string index (e.g. using just -`metadata.generateName`) makes it easy to use the index to look up parameters -in, for example, an array in shared storage.
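For example, a pod can select its work item from Work List 1 with plain shell, given a dense index. The sketch below assumes the index is exposed to the container as `$INDEX` (zero-based) by one of the mechanisms discussed next; the default to 0 is only so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch: use the completion index to pick one work item from Work List 1.
# $INDEX is assumed to hold this pod's completion index (zero-based).
: "${INDEX:=0}"
set -- 12342.dat 97283.dat 38732.dat   # the work list as positional parameters
shift "$INDEX"                         # discard the first $INDEX items
echo "would run: /usr/local/bin/process_file $1"
```

A pod with index 1 would select `97283.dat`, and so on.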
- -### Pod Identity and Template Substitution in Job Controller - -The JobSpec contains a single pod template. When the job controller creates a -particular pod, it copies the pod template and modifies it in some way to make -that pod distinctive. Whatever is distinctive about that pod is its *identity*. - -We consider several options. - -#### Index Substitution Only - -The job controller substitutes only the *completion index* of the pod into the -pod template when creating it. The JSON it POSTs differs only in a single -field. - -We would put the completion index, as a stringified integer, into an annotation -of the pod. The user can extract it from the annotation into an env var via the -downward API, or put it in a file via a Downward API volume, and parse it -themselves. - -Once it is an environment variable in the pod (say `$INDEX`), then one of two -things can happen. - -First, the main program can know how to map from an integer index to what it -needs to do. For example, from Work List 4 above: - -``` -./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX -``` - -Second, a shell script can be prepended to the original command line which maps -the index to one or more string parameters. For example, to implement Work List -5 above, you could do: - -``` -/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME -``` - -In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX` -and exports `$START_FRAME` and `$END_FRAME`. - -The shell script could be part of the image, but more usefully, it could be generated -by a program and stuffed in an annotation or a configMap, and from there added -to a volume. - -The first approach may require the user to modify an existing image (see next -section) to be able to accept an `$INDEX` env var or argument. The second -approach requires that the image have a shell.
We think that together these two -options cover a wide range of use cases (though not all). - -#### Multiple Substitution - -In this option, the JobSpec is extended to include a list of values to -substitute, and which fields to substitute them into. For example, a work list -like this: - -``` -FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds -FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt -FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit -``` - -Can be broken down into a template like this, with three parameters: - -``` -<ENV>; process-fruit -a -b -c <FILE> <EXTRA> -``` - -and a list of parameter tuples, like this: - -``` -("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds") -("FRUIT_COLOR=yellow", "-f banana.txt", "") -("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit") -``` - -The JobSpec can be extended to hold a list of parameter tuples (which are more -easily expressed as a list of lists of individual parameters). For example: - -``` -apiVersion: extensions/v1beta1 -kind: Job -... -spec: - completions: 3 - ... - template: - ... - perCompletionArgs: - container: 0 - - - - "-f apple.txt" - - "-f banana.txt" - - "-f cherry.txt" - - - - "--remove-seeds" - - "" - - "--remove-pit" - perCompletionEnvVars: - - name: "FRUIT_COLOR" - - "green" - - "yellow" - - "red" -``` - -However, just providing custom env vars, and not arguments, is sufficient for -many use cases: parameters can be put into env vars and then substituted on the -command line. - -#### Comparison - -The multiple substitution approach: - -- keeps the *per completion parameters* in the JobSpec. -- Drawback: makes the job spec large for a job with thousands of completions. (But -for very large jobs, the work-queue style or another type of controller, such as -map-reduce or spark, may be a better fit.)
-- Drawback: is a form of server-side templating, which we want in Kubernetes but -have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)). - -The index-only approach: - -- Requires that the user keep the *per completion parameters* in separate -storage, such as configData or networked storage. -- Makes no changes to the JobSpec. -- Drawback: while in separate storage, they could be mutated, which would have -unexpected effects. -- Drawback: Logic for using the index to look up parameters needs to be in the Pod. -- Drawback: CLIs and UIs are limited to using the "index" as the identity of a -pod from a job. They cannot easily say, for example `repeated failures on the -pod processing banana.txt`. - -The index-only approach relies on at least one of the following being true: - -1. The image contains a shell and certain shell commands (not all images do). -1. The user directly consumes the index from the annotation (file or env var) and -maps it to specific behavior in the main program. - -Also, using the index-only approach from non-kubectl clients requires that they -mimic the script-generation step, or only use the second style. - -#### Decision - -We have decided to implement the index-only approach now. Once the server-side -templating design is complete for Kubernetes, and we have feedback from users, -we can consider adding Multiple Substitution. - -## Detailed Design - -#### Job Resource Schema Changes - -No changes are made to the JobSpec. - - -The JobStatus is also not changed. The user can gauge the progress of the job by -the `.status.succeeded` count. - - -#### Job Spec Compatibility - -A job spec written before this change will work exactly the same as before with -the new controller. The Pods it creates will have the same environment as -before. They will have a new annotation, but pods are expected to tolerate -unfamiliar annotations.
- -However, if the job controller version is reverted to a version before this -change, the jobs whose pod specs depend on the new annotation will fail. -This is okay for a Beta resource. - -#### Job Controller Changes - -The Job controller will maintain for each Job a data structure which -indicates the status of each completion index. We call this the -*scoreboard* for short. It is an array of length `.spec.completions`. -Elements of the array are `enum` type with possible values including -`complete`, `running`, and `notStarted`. - -The scoreboard is stored in Job Controller memory for efficiency. In either -case, the status can be reconstructed from watching pods of the job (such as on -a controller manager restart). The index of the pods can be extracted from the -pod annotation. - -When the Job controller sees that the number of running pods is less than the -desired parallelism of the job, it finds the first index in the scoreboard with -value `notStarted`. It creates a pod with this completion index. - -When it creates a pod with completion index `i`, it makes a copy of the -`.spec.template`, and sets -`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to -`i`. It does this in both the index-only and multiple-substitution options. - -Then it creates the pod. - -When the controller notices that a pod is running, has completed, or has failed, -it updates the scoreboard. - -When all entries in the scoreboard are `complete`, then the job is complete. - - -#### Downward API Changes - -The downward API is changed to support extracting specific key names into a -single environment variable. So, the following would be supported: - -``` -apiVersion: v1 -kind: Pod -spec: - containers: - - name: foo - env: - - name: MY_INDEX - valueFrom: - fieldRef: - fieldPath: metadata.annotations[kubernetes.io/job/completion-index] -``` - -This requires kubelet changes.
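The kubelet-side extraction can be pictured as a small lookup: given the pod's annotations and a fieldPath of the form `metadata.annotations[<key>]`, return the single value to inject. The following is an illustrative sketch only (the function name `resolveAnnotationField` is invented here), not the kubelet's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveAnnotationField returns the annotation value selected by a
// fieldPath of the form metadata.annotations[<key>]. Illustrative
// sketch; not the kubelet's actual code.
func resolveAnnotationField(fieldPath string, annotations map[string]string) (string, bool) {
	const prefix = "metadata.annotations["
	if !strings.HasPrefix(fieldPath, prefix) || !strings.HasSuffix(fieldPath, "]") {
		return "", false
	}
	// Strip the prefix and the trailing bracket to get the annotation key.
	key := fieldPath[len(prefix) : len(fieldPath)-1]
	v, ok := annotations[key]
	return v, ok
}

func main() {
	ann := map[string]string{"kubernetes.io/job/completion-index": "2"}
	v, ok := resolveAnnotationField("metadata.annotations[kubernetes.io/job/completion-index]", ann)
	fmt.Println(v, ok) // 2 true
}
```

A fieldPath that does not use the bracketed annotation form would fall through to the existing single-field handling.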
- -Users who fail to upgrade their kubelets at the same time as they upgrade their -controller manager will see pods created by the controller fail to run. -The Kubelet will send an event about the failure to create the pod. -The output of `kubectl describe job` will show many failed pods. - - -#### Kubectl Interface Changes - -The `--completions` and `--completion-index-var-name` flags are added to -kubectl. - -For example, this command: - -``` -kubectl run say-number --image=busybox \ - --completions=3 \ - --completion-index-var-name=I \ - -- \ - sh -c 'echo "My index is $I" && sleep 5' -``` - -will run 3 pods to completion, each printing one of the following lines: - -``` -My index is 1 -My index is 2 -My index is 0 -``` - -Kubectl would create the following pod: - - - -Kubectl will also support the `--per-completion-env` flag, as described -previously. For example, this command: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="apple banana cherry" \ - --per-completion-env=COLOR="green yellow red" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -or equivalently: - -``` -echo "apple banana cherry" > fruits.txt -echo "green yellow red" > colors.txt - -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="$(cat fruits.txt)" \ - --per-completion-env=COLOR="$(cat colors.txt)" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -or similarly: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT=@fruits.txt \ - --per-completion-env=COLOR=@colors.txt \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -will all run 3 pods in parallel. Index 0 pod will log: - -``` -Have a nice green apple -``` - -and so on. - - -Notes: - -- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a -quoted space-separated list or `@` and the name of a text file containing a -list.
-- `--per-completion-env=` can be specified several times, but all lists must have the -same length. -- `--completions=N` with `N` equal to the list length is implied. -- The flag `--completions=3` sets `job.spec.completions=3`. -- The flag `--completion-index-var-name=I` causes an env var named `I` to be created -in each pod, with the index in it. -- The flag `--restart=OnFailure` is implied by `--completions` or any -job-specific arguments. The user can also specify `--restart=Never` if they -desire but may not specify `--restart=Always` with job-related flags. -- Setting any of these flags in turn tells kubectl to create a Job, not a -replicationController. - -#### How Kubectl Creates Job Specs - -To pass in the parameters, kubectl will generate a shell script which -can: -- parse the index from the annotation -- hold all the parameter lists -- look up the correct index in each parameter list and set an env var. - -For example, consider this command: - -``` -kubectl run say-fruit --image=busybox \ - --per-completion-env=FRUIT="apple banana cherry" \ - --per-completion-env=COLOR="green yellow red" \ - -- \ - sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' -``` - -First, kubectl generates the PodSpec as it normally does for `kubectl run`. - -But then it will generate this script: - -```sh -#!/bin/sh -# Generated by kubectl run ... -# Check for needed commands -if ! type cat > /dev/null 2>&1 -then - echo "$0: Image does not include required command: cat" - exit 2 -fi -if ! type grep > /dev/null 2>&1 -then - echo "$0: Image does not include required command: grep" - exit 2 -fi -# Check that annotations are mounted from downward API -if [ ! -e /etc/annotations ] -then - echo "$0: Cannot find /etc/annotations" - exit 2 -fi -# Get our index from annotations file -I=$(grep 'kubernetes.io/job/completion-index' /etc/annotations | cut -f 2 -d '"') || echo "$0: failed to extract index" -export I - -# Our parameter lists are stored inline in this script.
-FRUIT_0="apple" -FRUIT_1="banana" -FRUIT_2="cherry" -# Extract the right parameter value based on our index. -# This works on any Bourne-based shell. -FRUIT=$(eval echo \$"FRUIT_$I") -export FRUIT - -COLOR_0="green" -COLOR_1="yellow" -COLOR_2="red" - -COLOR=$(eval echo \$"COLOR_$I") -export COLOR -``` - -Then it POSTs this script, encoded, inside a Secret (later, a ConfigData). -It attaches this volume to the PodSpec. - -Then it will edit the command line of the Pod to run this script before the rest of -the command line. - -Then it appends a DownwardAPI volume to the pod spec to get the annotations in a file. -It also appends the Secret (later configData) volume with the script in it. - -So, the Pod template that kubectl creates (inside the job template) looks like this: - -``` -apiVersion: extensions/v1beta1 -kind: Job -... -spec: - ... - template: - ... - spec: - containers: - - name: c - image: gcr.io/google_containers/busybox - command: - - 'sh' - - '-c' - - '/etc/job-params.sh; echo "this is the rest of the command"' - volumeMounts: - - name: annotations - mountPath: /etc - - name: script - mountPath: /etc - volumes: - - name: annotations - downwardAPI: - items: - - path: "annotations" - fieldRef: - fieldPath: metadata.annotations - - name: script - secret: - secretName: jobparams-abc123 -``` - -###### Alternatives - -Kubectl could append a `valueFrom` line like this to -get the index into the environment: - -```yaml -apiVersion: extensions/v1beta1 -kind: Job -metadata: - ... -spec: - ... - template: - ... - spec: - containers: - - name: foo - ... - env: - # following block added: - - name: I - valueFrom: - fieldRef: - fieldPath: metadata.annotations[kubernetes.io/job/completion-index] -``` - -However, in order to inject other env vars from parameter lists, -kubectl still needs to edit the command line. - -Parameter lists could be passed via a configData volume instead of a secret. -Kubectl can be changed to work that way once the configData implementation is -complete.
- -Parameter lists could be passed inside an EnvVar. This would have length -limitations and would pollute the output of `kubectl describe pods` and `kubectl -get pods -o json`. - -Parameter lists could be passed inside an annotation. This would have length -limitations and would pollute the output of `kubectl describe pods` and `kubectl -get pods -o json`. Also, currently annotations can only be extracted into a -single file. Complex logic is then needed to filter out exactly the desired -annotation data. - -Bash array variables could simplify extraction of a particular parameter from a -list of parameters. However, some popular base images do not include -`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation -that does not support array syntax. - -Kubelet does support [expanding variables without a -shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not -allow for recursive substitution, which is required to extract the correct -parameter from a list based on the completion index of the pod. The syntax -could be extended, but doing so seems complex and would be an unfamiliar syntax -for users. - -Putting all the command line editing into a script and running that causes -the least pollution to the original command line, and it allows -for complex error handling. - -Kubectl could store the script in an [Inline Volume]( -https://github.com/kubernetes/kubernetes/issues/13610) if that proposal -is approved. That would remove the need to manage the lifetime of the -configData/secret, and prevent the case where someone changes the -configData mid-job, and breaks things in a hard-to-debug way. - - -## Interactions with other features - -#### Supporting Work Queue Jobs too - -For Work Queue Jobs, completions has no meaning. Parallelism should be allowed -to exceed it, and pods have no identity. So, the job controller should -not create a scoreboard in the JobStatus, just a count.
Therefore, we need to -add one of the following to JobSpec: - -- allow unset `.spec.completions` to indicate no scoreboard, and no index for -tasks (identical tasks). -- allow `.spec.completions=-1` to indicate the same. -- add `.spec.indexed` to job to indicate need for scoreboard. - -#### Interaction with vertical autoscaling - -Since pods of the same job will not be created with different resources, -a vertical autoscaler will need to: - -- if it has index-specific initial resource suggestions, suggest those at -admission time; it will need to understand indexes. -- mutate resource requests on already created pods based on usage trends or -previous container failures. -- modify the job template, affecting all indexes. - -#### Comparison to StatefulSets (previously named PetSets) - -The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b. -The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more -restrictive and thus less verbose. - -It would be easier for users if Indexed Job and StatefulSet were similar where -possible. However, StatefulSet differs in several key respects: - -- StatefulSet is for ones to tens of instances. Indexed Job should work with tens of -thousands of instances. -- When you have few instances, you may want to give them names. When you have many instances, -integer indexes make more sense. -- When you have thousands of instances, storing the work-list in the JobSpec -is verbose. For StatefulSet, this is less of a problem. -- StatefulSets (apparently) need to differ in more fields than indexed Jobs.
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]() - diff --git a/metadata-policy.md b/metadata-policy.md deleted file mode 100644 index 57416f11..00000000 --- a/metadata-policy.md +++ /dev/null @@ -1,137 +0,0 @@ -# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system - -## Introduction - -This document describes a new API resource, `MetadataPolicy`, that configures an -admission controller to take one or more actions based on an object's metadata. -Initially the metadata fields that the predicates can examine are labels and -annotations, and the actions are to add one or more labels and/or annotations, -or to reject creation/update of the object. In the future other actions might be -supported, such as applying an initializer. - -The first use of `MetadataPolicy` will be to decide which scheduler should -schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md) -Kubernetes system. In particular, the policy will add the scheduler name -annotation to a pod based on an annotation that is already on the pod that -indicates the QoS of the pod. (That annotation was presumably set by a simpler -admission controller that uses code, rather than configuration, to map the -resource requests and limits of a pod to QoS, and attaches the corresponding -annotation.) - -We anticipate a number of other uses for `MetadataPolicy`, such as defaulting -for labels and annotations, prohibiting/requiring particular labels or -annotations, or choosing a scheduling policy within a scheduler. We do not -discuss them in this doc. - - -## API - -```go -// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource. -// Every rule is applied, in an unspecified order, but if the action for any rule -// that matches is to reject the object, then the object is rejected without being mutated. 
-type MetadataPolicySpec struct { - Rules []MetadataPolicyRule `json:"rules,omitempty"` -} - -// If the PolicyPredicate is met, then the PolicyAction is applied. -// Example rules: -// reject object if label with key X is not present (i.e. require X) -// reject object if label with key X is present (i.e. forbid X) -// add label X=Y if label with key X is not present (i.e. default X) -// add annotation A=B if object has annotation C=D or E=F -type MetadataPolicyRule struct { - PolicyPredicate PolicyPredicate `json:"policyPredicate"` - PolicyAction PolicyAction `json:"policyAction"` -} - -// All criteria must be met for the PolicyPredicate to be considered met. -type PolicyPredicate struct { - // Note that Namespace is not listed here because MetadataPolicy is per-Namespace. - LabelSelector *LabelSelector `json:"labelSelector,omitempty"` - AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"` -} - -// Apply the indicated Labels and/or Annotations (if present), unless Reject is set -// to true, in which case reject the object without mutating it. -type PolicyAction struct { - // If true, the object will be rejected and not mutated. - Reject bool `json:"reject"` - // The labels to add or update, if any. - UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"` - // The annotations to add or update, if any. - UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"` -} - -// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying -// policies that should be applied to objects based on the objects' metadata. All MetadataPolicies -// are applied to all objects in the namespace; the order of evaluation is not guaranteed, -// but if any of the matching policies have an action of rejecting the object, then the object -// will be rejected without being mutated. -type MetadataPolicy struct { - unversioned.TypeMeta `json:",inline"` - // Standard object's metadata.
- // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata - ObjectMeta `json:"metadata,omitempty"` - - // Spec defines the metadata policy that should be enforced. - // http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status - Spec MetadataPolicySpec `json:"spec,omitempty"` -} - -// MetadataPolicyList is a list of MetadataPolicy items. -type MetadataPolicyList struct { - unversioned.TypeMeta `json:",inline"` - // Standard list metadata. - // More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds - unversioned.ListMeta `json:"metadata,omitempty"` - - // Items is a list of MetadataPolicy objects. - // More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota - Items []MetadataPolicy `json:"items"` -} -``` - -## Implementation plan - -1. Create the `MetadataPolicy` API resource -1. Create an admission controller that implements policies defined in -`MetadataPolicy` -1. Create an admission controller that sets the annotation -`scheduler.alpha.kubernetes.io/qos: <QOS>` -(where `<QOS>` is one of `Guaranteed, Burstable, BestEffort`) -based on the pod's resource request and limit. - -## Future work - -Longer-term we will have QoS be set on create and update by the registry, -similar to the `Pending` phase today, instead of having an admission controller -(that runs before the one that takes `MetadataPolicy` as input) do it. - -We plan to eventually move from having an admission controller set the scheduler -name as a pod annotation, to using the initializer concept. In particular, the -scheduler will be an initializer, and the admission controller that decides -which scheduler to use will add the scheduler's name to the list of initializers -for the pod (presumably the scheduler will be the last initializer to run on -each pod).
The admission controller would still be configured using the -`MetadataPolicy` described here, only the mechanism the admission controller -uses to record its decision of which scheduler to use would change. - -## Related issues - -The main issue for multiple schedulers is #11793. There was also a lot of -discussion in PRs #17197 and #17865. - -We could use the approach described here to choose a scheduling policy within a -single scheduler, as opposed to choosing a scheduler, a desire mentioned in -#9920. Issue #17097 describes a scenario unrelated to scheduler-choosing where -`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized -API for matching "claims" to "service classes"; matching a pod to a scheduler -would be one use for such an API. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/metadata-policy.md?pixel)]() - diff --git a/monitoring_architecture.md b/monitoring_architecture.md deleted file mode 100644 index b819eeca..00000000 --- a/monitoring_architecture.md +++ /dev/null @@ -1,203 +0,0 @@ -# Kubernetes monitoring architecture - -## Executive Summary - -Monitoring is split into two pipelines: - -* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down -Heapster called metrics-server, and the API server serving the master metrics API. These -metrics are used by core system components, such as scheduling logic (e.g. scheduler and -horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components -(e.g. `kubectl top`). This pipeline is not intended for integration with third-party -monitoring systems. -* A **monitoring pipeline** used for collecting various metrics from the system and exposing -them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore -via adapters. Users can choose from many monitoring system vendors, or run none at all.
In -open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options -will be easy to install. We expect that such pipelines will typically consist of a per-node -agent and a cluster-level aggregator. - -The architecture is illustrated in the diagram in the Appendix of this doc. - -## Introduction and Objectives - -This document proposes a high-level monitoring architecture for Kubernetes. It covers -a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc, -specifically focusing on an architecture (components and their interactions) that -hopefully meets the numerous requirements. We do not specify any particular timeframe -for implementing this architecture, nor any particular roadmap for getting there. - -### Terminology - -There are two types of metrics, system metrics and service metrics. System metrics are -generic metrics that are generally available from every entity that is monitored (e.g. -usage of CPU and memory by container and node). Service metrics are explicitly defined -in application code and exported (e.g. number of 500s served by the API server). Both -system metrics and service metrics can originate from users’ containers or from system -infrastructure components (master components like the API server, addon pods running on -the master, and addon pods running on user nodes). 
- -We divide system metrics into - -* *core metrics*, which are metrics that Kubernetes understands and uses for operation -of its internal components and core utilities -- for example, metrics used for scheduling -(including the inputs to the algorithms for resource estimation, initial resources/vertical -autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics), -the kube dashboard, and “kubectl top.” As of now this would consist of cpu cumulative usage, -memory instantaneous usage, disk usage of pods, disk usage of containers -* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they -include the core metrics (though not necessarily in a format Kubernetes understands) plus -additional metrics. - -Service metrics can be divided into those produced by Kubernetes infrastructure components -(and thus useful for operation of the Kubernetes cluster) and those produced by user applications. -Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics. -Of course horizontal pod autoscaling also uses core metrics. - -We consider logging to be separate from monitoring, so logging is outside the scope of -this doc. - -### Requirements - -The monitoring architecture should - -* include a solution that is part of core Kubernetes and - * makes core system metrics about nodes, pods, and containers available via a standard - master API (today the master metrics API), such that core Kubernetes features do not - depend on non-core components - * requires Kubelet to only export a limited set of metrics, namely those required for - core Kubernetes components to correctly operate (this is related to #18770) - * can scale up to at least 5000 nodes - * is small enough that we can require that all of its components be running in all deployment - configurations -* include an out-of-the-box solution that can serve historical data, e.g. 
to support Initial -Resources and vertical pod autoscaling as well as cluster analytics queries, that depends -only on core Kubernetes -* allow for third-party monitoring solutions that are not part of core Kubernetes and can -be integrated with components like Horizontal Pod Autoscaler that require service metrics - -## Architecture - -We divide our description of the long-term architecture plan into the core metrics pipeline -and the monitoring pipeline. For each, it is necessary to think about how to deal with each -type of metric (core metrics, non-core metrics, and service metrics) from both the master -and minions. - -### Core metrics pipeline - -The core metrics pipeline collects a set of core system metrics. There are two sources for -these metrics - -* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that -is part of Kubelet will be slimmed down to provide only core system metrics) -* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from -Kubelet into resource estimates (values used by scheduler for a more advanced usage-based -scheduler) - -These sources are scraped by a component we call *metrics-server* which is like a slimmed-down -version of today's Heapster. metrics-server stores locally only latest values and has no sinks. -metrics-server exposes the master metrics API. (The configuration described here is similar -to the current Heapster in “standalone” mode.) -[Discovery summarizer](../../docs/proposals/federated-api-servers.md) -makes the master metrics API available to external clients such that from the client’s perspective -it looks the same as talking to the API server. - -Core (system) metrics are handled as described above in all deployment environments. The only -easily replaceable part is resource estimator, which could be replaced by power users. 
In -theory, metrics-server itself can also be substituted, but it'd be similar to substituting -the apiserver itself or the controller-manager - possible, but not recommended and not supported. - -Eventually the core metrics pipeline might also collect metrics from Kubelet and Docker daemon -themselves (e.g. CPU usage of Kubelet), even though they do not run in containers. - -The core metrics pipeline is intentionally small and not designed for third-party integrations. -“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline -(see next section) and can run on Kubernetes without having to make changes to upstream components. -In this way we can remove the burden we have today that comes with maintaining Heapster as the -integration point for every possible metrics source, sink, and feature. - -#### Infrastore - -We will build an open-source Infrastore component (most likely reusing existing technologies) -for serving historical queries over core system metrics and events, which it will fetch from -the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries -- -this is TBD) to handle the following use cases: - -* initial resources -* vertical autoscaling -* oldtimer API -* decision-support queries for debugging, capacity planning, etc. -* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) - -In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes -infrastructure containers), described in the upcoming sections. - -### Monitoring pipeline - -One of the goals of building a dedicated metrics pipeline for core metrics, as described in the -previous section, is to allow for a separate monitoring pipeline that can be very flexible -because core Kubernetes components do not need to rely on it. By default we will not provide -one, but we will provide an easy way to install one (using a single command, most likely using -Helm).
We describe the monitoring pipeline in this section. - -Data collected by the monitoring pipeline may contain any sub- or superset of the following groups -of metrics: - -* core system metrics -* non-core system metrics -* service metrics from user application containers -* service metrics from Kubernetes infrastructure containers; these metrics are exposed using -Prometheus instrumentation - -It is up to the monitoring solution to decide which of these are collected. - -In order to enable horizontal pod autoscaling based on custom metrics, the provider of the -monitoring pipeline would also have to create a stateless API adapter that pulls the custom -metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. Such -an API will be a well-defined, versioned API similar to regular APIs. Details of how it will be -exposed or discovered will be covered in a detailed design doc for this component. - -The same approach applies if it is desired to make monitoring pipeline metrics available in -Infrastore. These adapters could be standalone components, libraries, or part of the monitoring -solution itself. - -There are many possible combinations of node and cluster-level agents that could comprise a -monitoring pipeline, including: - -* cAdvisor + Heapster + InfluxDB (or any other sink) -* cAdvisor + collectd + Heapster -* cAdvisor + Prometheus -* snapd + Heapster -* snapd + SNAP cluster-level agent -* Sysdig - -As an example we’ll describe a potential integration with cAdvisor + Prometheus. - -Prometheus has the following metric sources on a node: -* core and non-core system metrics from cAdvisor -* service metrics exposed by containers via HTTP handler in Prometheus format -* [optional] metrics about the node itself from Node Exporter (a Prometheus component) - -All of them are polled by the Prometheus cluster-level agent.
We can use the Prometheus -cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a -standalone API adapter that proxies/translates between the Prometheus Query Language endpoint -on the Prometheus cluster-level agent and an HPA-specific API. Likewise an adapter can be -used to make the metrics from the monitoring pipeline available in Infrastore. Neither -adapter is necessary if the user does not need the corresponding feature. - -The command that installs cAdvisor+Prometheus should also automatically set up collection -of the metrics from infrastructure containers. This is possible because the names of the -infrastructure containers and metrics of interest are part of the Kubernetes control plane -configuration itself, and because the infrastructure containers export their metrics in -Prometheus format. - -## Appendix: Architecture diagram - -### Open-source monitoring pipeline - -![Architecture Diagram](monitoring_architecture.png?raw=true "Architecture overview") - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/monitoring_architecture.md?pixel)]() - diff --git a/monitoring_architecture.png b/monitoring_architecture.png deleted file mode 100644 index 570996b7..00000000 Binary files a/monitoring_architecture.png and /dev/null differ diff --git a/namespaces.md b/namespaces.md deleted file mode 100644 index 8a9c97c8..00000000 --- a/namespaces.md +++ /dev/null @@ -1,370 +0,0 @@ -# Namespaces - -## Abstract - -A Namespace is a mechanism to partition resources created by users into -a logically named group. - -## Motivation - -A single cluster should be able to satisfy the needs of multiple user -communities. - -Each user community wants to be able to work in isolation from other -communities. - -Each user community has its own: - -1. resources (pods, services, replication controllers, etc.) -2. policies (who can or cannot perform actions in their community) -3. 
constraints (this community is allowed this much quota, etc.) - -A cluster operator may create a Namespace for each unique user community. - -The Namespace provides a unique scope for: - -1. named resources (to avoid basic naming collisions) -2. delegated management authority to trusted users -3. ability to limit community resource consumption - -## Use cases - -1. As a cluster operator, I want to support multiple user communities on a -single cluster. -2. As a cluster operator, I want to delegate authority to partitions of the -cluster to trusted users in those communities. -3. As a cluster operator, I want to limit the amount of resources each -community can consume in order to limit the impact to other communities using -the cluster. -4. As a cluster user, I want to interact with resources that are pertinent to -my user community in isolation of what other user communities are doing on the -cluster. - -## Design - -### Data Model - -A *Namespace* defines a logically named group for multiple *Kind*s of resources. - -```go -type Namespace struct { - TypeMeta `json:",inline"` - ObjectMeta `json:"metadata,omitempty"` - - Spec NamespaceSpec `json:"spec,omitempty"` - Status NamespaceStatus `json:"status,omitempty"` -} -``` - -A *Namespace* name is a DNS compatible label. - -A *Namespace* must exist prior to associating content with it. - -A *Namespace* must not be deleted if there is content associated with it. - -To associate a resource with a *Namespace* the following conditions must be -satisfied: - -1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with -the server -2. The resource's *TypeMeta.Namespace* field must have a value that references -an existing *Namespace* - -The *Name* of a resource associated with a *Namespace* is unique to that *Kind* -in that *Namespace*. 
- -It is intended to be used in resource URLs, is provided by clients at creation -time, and is encouraged to be human friendly; it is intended to facilitate idempotent -creation, space-uniqueness of singleton objects, distinguishing distinct entities, -and referencing particular entities across operations. - -### Authorization - -A *Namespace* provides an authorization scope for accessing content associated -with the *Namespace*. - -See [Authorization plugins](../admin/authorization.md) - -### Limit Resource Consumption - -A *Namespace* provides a scope to limit resource consumption. - -A *LimitRange* defines min/max constraints on the amount of resources a single -entity can consume in a *Namespace*. - -See [Admission control: Limit Range](admission_control_limit_range.md) - -A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and -allows cluster operators to define *Hard* resource usage limits that a -*Namespace* may consume. - -See [Admission control: Resource Quota](admission_control_resource_quota.md) - -### Finalizers - -Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* -objects. - -```go -type FinalizerName string - -// These are internal finalizers to Kubernetes; a finalizer must be a qualified name unless defined here -const ( - FinalizerKubernetes FinalizerName = "kubernetes" -) - -// NamespaceSpec describes the attributes on a Namespace -type NamespaceSpec struct { - // Finalizers is an opaque list of values that must be empty to permanently remove object from storage - Finalizers []FinalizerName -} -``` - -A *FinalizerName* is a qualified name. - -The API Server enforces that a *Namespace* can be deleted from storage if -and only if its *Namespace.Spec.Finalizers* list is empty. - -A *finalize* operation is the only mechanism to modify the -*Namespace.Spec.Finalizers* field post creation. - -Each *Namespace* is created with *kubernetes* as an item in its initial -*Namespace.Spec.Finalizers* list by default.
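The deletion rule above can be illustrated with a small sketch. The types are abbreviated from the design; `finalize` and `deletable` are hypothetical helper names modeling the *finalize* operation and the API Server's storage-deletion check:

```go
package main

import "fmt"

// Abbreviated from the design above; finalize and deletable are
// hypothetical names used for illustration only.
type FinalizerName string

const FinalizerKubernetes FinalizerName = "kubernetes"

type NamespaceSpec struct {
	Finalizers []FinalizerName
}

// finalize returns a copy of spec with the given finalizer removed, as
// the namespace controller does for "kubernetes" once all namespaced
// content has been deleted.
func finalize(spec NamespaceSpec, name FinalizerName) NamespaceSpec {
	var kept []FinalizerName
	for _, f := range spec.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return NamespaceSpec{Finalizers: kept}
}

// deletable reports whether the API Server would permit permanent
// removal of the Namespace from storage.
func deletable(spec NamespaceSpec) bool {
	return len(spec.Finalizers) == 0
}

func main() {
	spec := NamespaceSpec{Finalizers: []FinalizerName{FinalizerKubernetes}}
	fmt.Println(deletable(spec))                                // false
	fmt.Println(deletable(finalize(spec, FinalizerKubernetes))) // true
}
```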
-
-### Phases
-
-A *Namespace* may exist in the following phases.
-
-```go
-type NamespacePhase string
-const (
-  NamespaceActive NamespacePhase = "Active"
-  NamespaceTerminating NamespacePhase = "Terminating"
-)
-
-type NamespaceStatus struct {
-  ...
-  Phase NamespacePhase
-}
-```
-
-A *Namespace* is in the **Active** phase if it does not have an
-*ObjectMeta.DeletionTimestamp*.
-
-A *Namespace* is in the **Terminating** phase if it has an
-*ObjectMeta.DeletionTimestamp*.
-
-**Active**
-
-Upon creation, a *Namespace* enters the *Active* phase. This means that content
-may be associated with a namespace, and all normal interactions with the
-namespace are allowed to occur in the cluster.
-
-If a DELETE request occurs for a *Namespace*, the
-*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
-*namespace controller* observes the change, and sets the
-*Namespace.Status.Phase* to *Terminating*.
-
-**Terminating**
-
-A *namespace controller* watches for *Namespace* objects that have a
-*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
-initiate graceful termination of the content associated with the *Namespace*
-that is known to the cluster.
-
-The *namespace controller* enumerates each known resource type in that namespace
-and deletes it one by one.
-
-Admission control blocks creation of new resources in that namespace in order to
-prevent a race condition where the controller could believe all of a given
-resource type had been deleted from the namespace, when in fact some other rogue
-client agent had created new objects. Using admission control in this scenario
-allows the registry implementations for the individual objects to avoid having
-to take Namespace life-cycle into account.
-
-Once all objects known to the *namespace controller* have been deleted, the
-*namespace controller* executes a *finalize* operation on the namespace that
-removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.
- -If the *namespace controller* sees a *Namespace* whose -*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers* -list is empty, it will signal the server to permanently remove the *Namespace* -from storage by sending a final DELETE action to the API server. - -### REST API - -To interact with the Namespace API: - -| Action | HTTP Verb | Path | Description | -| ------ | --------- | ---- | ----------- | -| CREATE | POST | /api/{version}/namespaces | Create a namespace | -| LIST | GET | /api/{version}/namespaces | List all namespaces | -| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} | -| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} | -| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} | -| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces | - -This specification reserves the name *finalize* as a sub-resource to namespace. - -As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*. 
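Putting the lifecycle rules above together, the decision the *namespace controller* makes on each sync can be sketched as follows (illustrative types and strings, not the real controller code):

```go
package main

import "fmt"

// namespace is a simplified stand-in for the real Namespace object.
type namespace struct {
	deletionTimestampSet bool
	finalizers           []string
	contentRemaining     int // resources in the namespace the controller knows about
}

func contains(ss []string, s string) bool {
	for _, x := range ss {
		if x == s {
			return true
		}
	}
	return false
}

// syncAction returns what the namespace controller would do next.
func syncAction(ns namespace) string {
	if !ns.deletionTimestampSet {
		return "ensure phase is Active"
	}
	if ns.contentRemaining > 0 {
		return "delete remaining content" // admission control blocks new creations meanwhile
	}
	if contains(ns.finalizers, "kubernetes") {
		return "finalize: remove the kubernetes finalizer"
	}
	if len(ns.finalizers) > 0 {
		return "wait for other agents to finalize"
	}
	return "send final DELETE to remove the namespace from storage"
}

func main() {
	ns := namespace{deletionTimestampSet: true, contentRemaining: 3,
		finalizers: []string{"kubernetes"}}
	fmt.Println(syncAction(ns)) // delete remaining content
}
```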
-
-To interact with content associated with a Namespace:
-
-| Action | HTTP Verb | Path | Description |
-| ---- | ---- | ---- | ---- |
-| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
-| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
-| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
-| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
-| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
-| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
-| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
-| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
-
-The API server verifies the *Namespace* on resource creation matches the
-*{namespace}* on the path.
-
-The API server will associate a resource with a *Namespace* if not populated by
-the end-user based on the *Namespace* context of the incoming request. If the
-*Namespace* of the resource being created or updated does not match the
-*Namespace* on the request, then the API server will reject the request.
-
-### Storage
-
-A namespace provides a unique identifier space and therefore must be in the
-storage path of a resource.
-
-In etcd, we want to continue to support efficient WATCH across namespaces.
-
-Resources that persist content in etcd will have storage paths as follows:
-
-/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
-
-This enables consumers to WATCH /registry/{resourceType} for changes across
-namespaces for a particular {resourceType}.
-
-### Kubelet
-
-The kubelet will register pods it sources from a file or HTTP source under a
-namespace associated with the *cluster-id*.
-
-### Example: OpenShift Origin managing a Kubernetes Namespace
-
-In this example, we demonstrate how the design allows for agents built on top of
-Kubernetes that manage their own set of resource types associated with a
-*Namespace* to take part in Namespace termination.
-
-OpenShift creates a Namespace in Kubernetes
-
-```json
-{
-  "apiVersion":"v1",
-  "kind": "Namespace",
-  "metadata": {
-    "name": "development",
-    "labels": {
-      "name": "development"
-    }
-  },
-  "spec": {
-    "finalizers": ["openshift.com/origin", "kubernetes"]
-  },
-  "status": {
-    "phase": "Active"
-  }
-}
-```
-
-OpenShift then goes and creates a set of resources (pods, services, etc)
-associated with the "development" namespace. It also creates its own set of
-resources in its own storage associated with the "development" namespace unknown
-to Kubernetes.
-
-The user deletes the Namespace in Kubernetes, and the Namespace now has the
-following state:
-
-```json
-{
-  "apiVersion":"v1",
-  "kind": "Namespace",
-  "metadata": {
-    "name": "development",
-    "deletionTimestamp": "...",
-    "labels": {
-      "name": "development"
-    }
-  },
-  "spec": {
-    "finalizers": ["openshift.com/origin", "kubernetes"]
-  },
-  "status": {
-    "phase": "Terminating"
-  }
-}
-```
-
-The Kubernetes *namespace controller* observes the namespace has a
-*deletionTimestamp* and begins to terminate all of the content in the namespace
-that it knows about.
Upon success, it executes a *finalize* action that modifies -the *Namespace* by removing *kubernetes* from the list of finalizers: - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "deletionTimestamp": "...", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": ["openshift.com/origin"] - }, - "status": { - "phase": "Terminating" - } -} -``` - -OpenShift Origin has its own *namespace controller* that is observing cluster -state, and it observes the same namespace had a *deletionTimestamp* assigned to -it. It too will go and purge resources from its own storage that it manages -associated with that namespace. Upon completion, it executes a *finalize* action -and removes the reference to "openshift.com/origin" from the list of finalizers. - -This results in the following state: - -```json -{ - "apiVersion":"v1", - "kind": "Namespace", - "metadata": { - "name": "development", - "deletionTimestamp": "...", - "labels": { - "name": "development" - } - }, - "spec": { - "finalizers": [] - }, - "status": { - "phase": "Terminating" - } -} -``` - -At this point, the Kubernetes *namespace controller* in its sync loop will see -that the namespace has a deletion timestamp and that its list of finalizers is -empty. As a result, it knows all content associated from that namespace has been -purged. It performs a final DELETE action to remove that Namespace from the -storage. - -At this point, all content associated with that Namespace, and the Namespace -itself are gone. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/namespaces.md?pixel)]() - diff --git a/networking.md b/networking.md deleted file mode 100644 index 6e269481..00000000 --- a/networking.md +++ /dev/null @@ -1,190 +0,0 @@ -# Networking - -There are 4 distinct networking problems to solve: - -1. Highly-coupled container-to-container communications -2. Pod-to-Pod communications -3. 
Pod-to-Service communications
-4. External-to-internal communications
-
-## Model and motivation
-
-Kubernetes deviates from the default Docker networking model (though as of
-Docker 1.8 their network plugins are getting closer). The goal is for each pod
-to have an IP in a flat shared networking namespace that has full communication
-with other physical computers and containers across the network. IP-per-pod
-creates a clean, backward-compatible model where pods can be treated much like
-VMs or physical hosts from the perspectives of port allocation, networking,
-naming, service discovery, load balancing, application configuration, and
-migration.
-
-Dynamic port allocation, on the other hand, requires supporting both static
-ports (e.g., for externally accessible services) and dynamically allocated
-ports, requires partitioning centrally allocated and locally acquired dynamic
-ports, complicates scheduling (since ports are a scarce resource), is
-inconvenient for users, complicates application configuration, is plagued by
-port conflicts and reuse and exhaustion, requires non-standard approaches to
-naming (e.g. consul or etcd rather than DNS), requires proxies and/or
-redirection for programs using standard naming/addressing mechanisms (e.g. web
-browsers), requires watching and cache invalidation for address/port changes
-for instances in addition to watching group membership changes, and obstructs
-container/pod migration (e.g. using CRIU). NAT introduces additional complexity
-by fragmenting the addressing space, which breaks self-registration mechanisms,
-among other problems.
-
-## Container to container
-
-All containers within a pod behave as if they are on the same host with regard
-to networking. They can all reach each other’s ports on localhost. This offers
-simplicity (static ports known a priori), security (ports bound to localhost
-are visible within the pod but never outside it), and performance.
This also -reduces friction for applications moving from the world of uncontainerized apps -on physical or virtual hosts. People running application stacks together on -the same host have already figured out how to make ports not conflict and have -arranged for clients to find them. - -The approach does reduce isolation between containers within a pod — -ports could conflict, and there can be no container-private ports, but these -seem to be relatively minor issues with plausible future workarounds. Besides, -the premise of pods is that containers within a pod share some resources -(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. -Additionally, the user can control what containers belong to the same pod -whereas, in general, they don't control what pods land together on a host. - -## Pod to pod - -Because every pod gets a "real" (not machine-private) IP address, pods can -communicate without proxies or translations. The pod can use well-known port -numbers and can avoid the use of higher-level service discovery systems like -DNS-SD, Consul, or Etcd. - -When any container calls ioctl(SIOCGIFADDR) (get the address of an interface), -it sees the same IP that any peer container would see them coming from — -each pod has its own IP address that other pods can know. By making IP addresses -and ports the same both inside and outside the pods, we create a NAT-less, flat -address space. Running "ip addr show" should work as expected. This would enable -all existing naming/discovery mechanisms to work out of the box, including -self-registration mechanisms and applications that distribute IP addresses. We -should be optimizing for inter-pod network communication. Within a pod, -containers are more likely to use communication through volumes (e.g., tmpfs) or -IPC. - -This is different from the standard Docker model. In that mode, each container -gets an IP in the 172-dot space and would only see that 172-dot address from -SIOCGIFADDR. 
If these containers connect to another container the peer would see -the connect coming from a different IP than the container itself knows. In short -— you can never self-register anything from a container, because a -container can not be reached on its private IP. - -An alternative we considered was an additional layer of addressing: pod-centric -IP per container. Each container would have its own local IP address, visible -only within that pod. This would perhaps make it easier for containerized -applications to move from physical/virtual hosts to pods, but would be more -complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) -and to reason about, due to the additional layer of address translation, and -would break self-registration and IP distribution mechanisms. - -Like Docker, ports can still be published to the host node's interface(s), but -the need for this is radically diminished. - -## Implementation - -For the Google Compute Engine cluster configuration scripts, we use [advanced -routing rules](https://developers.google.com/compute/docs/networking#routing) -and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that -get routed to it. This is in addition to the 'main' IP address assigned to the -VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to -differentiate it from `docker0`) is set up outside of Docker proper. - -Example of GCE's advanced routing rules: - -```sh -gcloud compute routes add "${NODE_NAMES[$i]}" \ - --project "${PROJECT}" \ - --destination-range "${NODE_IP_RANGES[$i]}" \ - --network "${NETWORK}" \ - --next-hop-instance "${NODE_NAMES[$i]}" \ - --next-hop-instance-zone "${ZONE}" & -``` - -GCE itself does not know anything about these IPs, though. This means that when -a pod tries to egress beyond GCE's project the packets must be SNAT'ed -(masqueraded) to the VM's IP, which GCE recognizes and allows. 
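A toy model of the routing these rules set up: any destination inside a node's 256-address pod range is forwarded to that node, and anything else leaves via the VM's NAT-ed primary IP. The node names and CIDRs below are made up for illustration:

```go
package main

import (
	"fmt"
	"net"
)

// nextHop returns which node should receive a packet for dst, or "" when the
// destination is outside every pod range and must be SNAT'ed (masqueraded)
// to the VM's primary IP instead.
func nextHop(dst string, podRanges map[string]string) string {
	ip := net.ParseIP(dst)
	for node, cidr := range podRanges {
		_, ipnet, err := net.ParseCIDR(cidr)
		if err != nil {
			continue
		}
		if ipnet.Contains(ip) {
			return node
		}
	}
	return ""
}

func main() {
	ranges := map[string]string{
		"node-1": "10.244.1.0/24", // 256 pod IPs routed to node-1
		"node-2": "10.244.2.0/24",
	}
	fmt.Println(nextHop("10.244.2.17", ranges)) // node-2
	fmt.Println(nextHop("8.8.8.8", ranges) == "") // true: masquerade out via VM IP
}
```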
- -### Other implementations - -With the primary aim of providing IP-per-pod-model, other implementations exist -to serve the purpose outside of GCE. - - [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md) - - [Flannel](https://github.com/coreos/flannel#flannel) - - [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) - ("With Linux Bridge devices" section) - - [Weave](https://github.com/zettio/weave) is yet another way to build an - overlay network, primarily aiming at Docker integration. - - [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real - container IPs. - -## Pod to service - -The [service](../user-guide/services.md) abstraction provides a way to group pods under a -common access policy (e.g. load-balanced). The implementation of this creates a -virtual IP which clients can access and which is transparently proxied to the -pods in a Service. Each node runs a kube-proxy process which programs -`iptables` rules to trap access to service IPs and redirect them to the correct -backends. This provides a highly-available load-balancing solution with low -performance overhead by balancing client traffic from a node on that same node. - -## External to internal - -So far the discussion has been about how to access a pod or service from within -the cluster. Accessing a pod from outside the cluster is a bit more tricky. We -want to offer highly-available, high-performance load balancing to target -Kubernetes Services. Most public cloud providers are simply not flexible enough -yet. - -The way this is generally implemented is to set up external load balancers (e.g. -GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When -traffic arrives at a node it is recognized as being part of a particular Service -and routed to an appropriate backend Pod. This does mean that some traffic will -get double-bounced on the network. Once cloud providers have better offerings -we can take advantage of those. 
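Both of the paths above reduce to the same operation: recognize traffic aimed at a Service and hand it to one of the Service's backend pods. A caricature of that balancing step, as a simple round-robin picker (the real kube-proxy programs `iptables` rules rather than running code like this per connection; addresses are illustrative):

```go
package main

import "fmt"

// vipBalancer picks backends for a service's virtual IP in round-robin
// order, roughly the effect of kube-proxy's redirect rules.
type vipBalancer struct {
	backends []string
	next     int
}

func (b *vipBalancer) pick() string {
	if len(b.backends) == 0 {
		return "" // no endpoints: traffic to the VIP has nowhere to go
	}
	be := b.backends[b.next%len(b.backends)]
	b.next++
	return be
}

func main() {
	b := &vipBalancer{backends: []string{"10.244.1.5:8080", "10.244.2.9:8080"}}
	fmt.Println(b.pick()) // 10.244.1.5:8080
	fmt.Println(b.pick()) // 10.244.2.9:8080
	fmt.Println(b.pick()) // 10.244.1.5:8080 again
}
```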
- -## Challenges and future work - -### Docker API - -Right now, docker inspect doesn't show the networking configuration of the -containers, since they derive it from another container. That information should -be exposed somehow. - -### External IP assignment - -We want to be able to assign IP addresses externally from Docker -[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need -to statically allocate fixed-size IP ranges to each node, so that IP addresses -can be made stable across pod infra container restarts -([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate -pod migration. Right now, if the pod infra container dies, all the user -containers must be stopped and restarted because the netns of the pod infra -container will change on restart, and any subsequent user container restart -will join that new netns, thereby not being able to see its peers. -Additionally, a change in IP address would encounter DNS caching/TTL problems. -External IP assignment would also simplify DNS support (see below). - -### IPv6 - -IPv6 support would be nice but requires significant internal changes in a few -areas. First pods should be able to report multiple IP addresses -[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398) -and the network plugin architecture Kubernetes uses needs to allow returning -IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245). -Kubernetes code that deals with IP addresses must then be audited and fixed to -support both IPv4 and IPv6 addresses and not assume IPv4. -Additionally, direct ipv6 assignment to instances doesn't appear to be supported -by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull -requests from people running Kubernetes on bare metal, though. 
:-) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/networking.md?pixel)]() - diff --git a/nodeaffinity.md b/nodeaffinity.md deleted file mode 100644 index 61e04169..00000000 --- a/nodeaffinity.md +++ /dev/null @@ -1,246 +0,0 @@ -# Node affinity and NodeSelector - -## Introduction - -This document proposes a new label selector representation, called -`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit -more flexible and is intended to be used only for selecting nodes. - -In addition, we propose to replace the `map[string]string` in `PodSpec` that the -scheduler currently uses as part of restricting the set of nodes onto which a -pod is eligible to schedule, with a field of type `Affinity` that contains one -or more affinity specifications. In this document we discuss `NodeAffinity`, -which contains one or more of the following: -* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be -represented by a `NodeSelector`, and thus generalizes the scheduling behavior of -the current `map[string]string` but still serves the purpose of restricting -the set of nodes onto which the pod can schedule. In addition, unlike the -behavior of the current `map[string]string`, when it becomes violated the system -will try to eventually evict the pod from its node. -* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is -identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the -system may or may not try to eventually evict the pod from its node. -* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that -specifies which nodes are preferred for scheduling among those that meet all -scheduling requirements. - -(In practice, as discussed later, we will actually *add* the `Affinity` field -rather than replacing `map[string]string`, due to backward compatibility -requirements.) 
-
-The affinity specifications described above allow a pod to request various
-properties that are inherent to nodes, for example "run this pod on a node with
-an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
-([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
-some of the properties that a node might publish as labels, which affinity
-expressions can match against.) They do *not* allow a pod to request to schedule
-(or not schedule) on a node based on what other pods are running on the node.
-That feature is called "inter-pod topological affinity/anti-affinity" and is
-described [here](https://github.com/kubernetes/kubernetes/pull/18265).
-
-## API
-
-### NodeSelector
-
-```go
-// A node selector represents the union of the results of one or more label queries
-// over a set of nodes; that is, it represents the OR of the selectors represented
-// by the nodeSelectorTerms.
-type NodeSelector struct {
-  // nodeSelectorTerms is a list of node selector terms. The terms are ORed.
-  NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
-}
-
-// An empty node selector term matches all objects. A null node selector term
-// matches no objects.
-type NodeSelectorTerm struct {
-  // matchExpressions is a list of node selector requirements. The requirements are ANDed.
-  MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
-}
-
-// A node selector requirement is a selector that contains values, a key, and an operator
-// that relates the key and values.
-type NodeSelectorRequirement struct {
-  // key is the label key that the selector applies to.
-  Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
-  // operator represents a key's relationship to a set of values.
-  // Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
-  Operator NodeSelectorOperator `json:"operator"`
-  // values is an array of string values.
If the operator is In or NotIn, - // the values array must be non-empty. If the operator is Exists or DoesNotExist, - // the values array must be empty. If the operator is Gt or Lt, the values - // array must have a single element, which will be interpreted as an integer. - // This array is replaced during a strategic merge patch. - Values []string `json:"values,omitempty"` -} - -// A node selector operator is the set of operators that can be used in -// a node selector requirement. -type NodeSelectorOperator string - -const ( - NodeSelectorOpIn NodeSelectorOperator = "In" - NodeSelectorOpNotIn NodeSelectorOperator = "NotIn" - NodeSelectorOpExists NodeSelectorOperator = "Exists" - NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist" - NodeSelectorOpGt NodeSelectorOperator = "Gt" - NodeSelectorOpLt NodeSelectorOperator = "Lt" -) -``` - -### NodeAffinity - -We will add one field to `PodSpec` - -```go -Affinity *Affinity `json:"affinity,omitempty"` -``` - -The `Affinity` type is defined as follows - -```go -type Affinity struct { - NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"` -} - -type NodeAffinity struct { - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a node label update), - // the system will try to eventually evict the pod from its node. - RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. 
due to a node label update), - // the system may or may not try to eventually evict the pod from its node. - RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -// An empty preferred scheduling term matches all objects with implicit weight 0 -// (i.e. it's a no-op). A null preferred scheduling term matches no objects. -type PreferredSchedulingTerm struct { - // weight is in the range 1-100 - Weight int `json:"weight"` - // matchExpressions is a list of node selector requirements. The requirements are ANDed. - MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"` -} -``` - -Unfortunately, the name of the existing `map[string]string` field in PodSpec is -`NodeSelector` and we can't change it since this name is part of the API. -Hopefully this won't cause too much confusion. 
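To make the selector semantics concrete (terms are ORed, the expressions within a term are ANDed), here is a minimal matcher over a subset of the operators. The label key `cpu-vendor` and the local `requirement` type are illustrative, not part of the proposed API:

```go
package main

import "fmt"

// requirement is a simplified stand-in for NodeSelectorRequirement,
// covering only the set-based operators.
type requirement struct {
	key      string
	operator string // "In", "NotIn", "Exists", "DoesNotExist"
	values   []string
}

func containsStr(ss []string, s string) bool {
	for _, x := range ss {
		if x == s {
			return true
		}
	}
	return false
}

// matchRequirement checks one requirement against a node's labels.
func matchRequirement(labels map[string]string, r requirement) bool {
	v, ok := labels[r.key]
	switch r.operator {
	case "Exists":
		return ok
	case "DoesNotExist":
		return !ok
	case "In":
		return ok && containsStr(r.values, v)
	case "NotIn":
		return !ok || !containsStr(r.values, v)
	}
	return false
}

// matchSelector: the outer terms are ORed; the requirements inside each
// term are ANDed, mirroring NodeSelector/NodeSelectorTerm.
func matchSelector(labels map[string]string, terms [][]requirement) bool {
	for _, term := range terms {
		all := true
		for _, r := range term {
			if !matchRequirement(labels, r) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	node := map[string]string{"cpu-vendor": "intel"}
	// "Run this pod on a node with an Intel or AMD CPU": one term, one In expression.
	sel := [][]requirement{{{key: "cpu-vendor", operator: "In", values: []string{"intel", "amd"}}}}
	fmt.Println(matchSelector(node, sel)) // true
}
```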
-
-## Examples
-
-** TODO: fill in this section **
-
-* Run this pod on a node with an Intel or AMD CPU
-
-* Run this pod on a node in availability zone Z
-
-
-## Backward compatibility
-
-When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
-current field in PodSpec
-
-```go
-NodeSelector map[string]string `json:"nodeSelector,omitempty"`
-```
-
-Old versions of the scheduler will ignore the `Affinity` field. New versions of
-the scheduler will apply their scheduling predicates to both `Affinity` and
-`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
-of requirements. We will not attempt to convert between `Affinity` and
-`nodeSelector`.
-
-Old versions of non-scheduling clients will not know how to do anything
-semantically meaningful with `Affinity`, but we don't expect that this will
-cause a problem.
-
-See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
-for more discussion.
-
-Users should not start using `NodeAffinity` until the full implementation has
-been in Kubelet and the master for enough binary versions that we feel
-comfortable that we will not need to roll back either Kubelet or master to a
-version that does not support them. Longer-term we will use a programmatic
-approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
-
-## Implementation plan
-
-1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
-`PreferredDuringSchedulingIgnoredDuringExecution`, and
-`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
-2. Implement a scheduler predicate that takes
-`RequiredDuringSchedulingIgnoredDuringExecution` into account.
-3. Implement a scheduler priority function that takes
-`PreferredDuringSchedulingIgnoredDuringExecution` into account.
-4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
-marked as deprecated.
-5.
Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
-6. Modify the scheduler predicate from step 2 to also take
-`RequiredDuringSchedulingRequiredDuringExecution` into account.
-7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
-decision.
-8. Implement code in Kubelet *or* the controllers that evicts a pod that no
-longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
-
-We assume Kubelet publishes labels describing the node's membership in all of
-the relevant scheduling domains (e.g. node name, rack name, availability zone
-name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
-
-## Extensibility
-
-The design described here is the result of careful analysis of use cases, a
-decade of experience with Borg at Google, and a review of similar features in
-other open-source container orchestration systems. We believe that it properly
-balances the goal of expressiveness against the goals of simplicity and
-efficiency of implementation. However, we recognize that use cases may arise in
-the future that cannot be expressed using the syntax described here. Although we
-are not implementing an affinity-specific extensibility mechanism for a variety
-of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
-for Kubernetes users to get a consistent experience, etc.), the regular
-Kubernetes annotation mechanism can be used to add or replace affinity rules.
-The way this would work is:
-
-1. Define one or more annotations to describe the new affinity rule(s)
-1. User (or an admission controller) attaches the annotation(s) to pods to
-request the desired scheduling behavior.
If the new rule(s) *replace* one or -more fields of `Affinity` then the user would omit those fields from `Affinity`; -if they are *additional rules*, then the user would fill in `Affinity` as well -as the annotation(s). -1. Scheduler takes the annotation(s) into account when scheduling. - -If some particular new syntax becomes popular, we would consider upstreaming it -by integrating it into the standard `Affinity`. - -## Future work - -Are there any other fields we should convert from `map[string]string` to -`NodeSelector`? - -## Related issues - -The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261). - -The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341). -Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related. -Those issues reference other related issues. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/nodeaffinity.md?pixel)]() - diff --git a/persistent-storage.md b/persistent-storage.md deleted file mode 100644 index 70bcde97..00000000 --- a/persistent-storage.md +++ /dev/null @@ -1,292 +0,0 @@ -# Persistent Storage - -This document proposes a model for managing persistent, cluster-scoped storage -for applications requiring long lived data. - -### Abstract - -Two new API kinds: - -A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. -It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) -for how to use it. - -A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to -use in a pod. It is analogous to a pod. - -One new system component: - -`PersistentVolumeClaimBinder` is a singleton running in master that watches all -PersistentVolumeClaims in the system and binds them to the closest matching -available PersistentVolume. The volume manager watches the API for newly created -volumes to manage. 
-
-One new volume:
-
-`PersistentVolumeClaimVolumeSource` references the user's PVC in the same
-namespace. This volume finds the bound PV and mounts that volume for the pod. A
-`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another
-type of volume that is owned by someone else (the system).
-
-Kubernetes makes no guarantees at runtime that the underlying storage exists or
-is available. High availability is left to the storage provider.
-
-### Goals
-
-* Allow administrators to describe available storage.
-* Allow pod authors to discover and request persistent volumes to use with pods.
-* Enforce security through access control lists and securing storage to the same
-namespace as the pod volume.
-* Enforce quotas through admission control.
-* Enforce scheduler rules by resource counting.
-* Ensure developers can rely on storage being available without being closely
-bound to a particular disk, server, network, or storage device.
-
-#### Describe available storage
-
-Cluster administrators use the API to manage *PersistentVolumes*. A custom store
-`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
-storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
-storage and binds them to an available volume by matching the volume's
-characteristics (AccessModes and storage size) to the user's request.
-
-PVs are system objects and, thus, have no namespace.
-
-Many means of dynamic provisioning will eventually be implemented for various
-storage types.
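The binder's matching step described above can be sketched as follows: keep volumes ordered by capacity and bind a claim to the smallest unbound volume that satisfies it (access-mode matching is elided here; the `pv` type and names are illustrative, not the real API objects):

```go
package main

import (
	"fmt"
	"sort"
)

// pv is a simplified stand-in for a PersistentVolume.
type pv struct {
	name     string
	capacity int // e.g. GiB
	bound    bool
}

// bind picks the smallest unbound volume with capacity >= the request,
// roughly what an ordered index of volumes makes efficient.
func bind(volumes []pv, request int) string {
	sort.Slice(volumes, func(i, j int) bool { return volumes[i].capacity < volumes[j].capacity })
	for i := range volumes {
		if !volumes[i].bound && volumes[i].capacity >= request {
			volumes[i].bound = true
			return volumes[i].name
		}
	}
	return "" // no match: the claim stays pending, requests can go unfulfilled
}

func main() {
	vols := []pv{{"pv-big", 100, false}, {"pv-small", 10, false}}
	fmt.Println(bind(vols, 8))  // pv-small: the closest match wins
	fmt.Println(bind(vols, 50)) // pv-big
}
```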
-
-
-##### PersistentVolume API
-
-| Action | HTTP Verb | Path | Description |
-| ---- | ---- | ---- | ---- |
-| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
-| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
-| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
-| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
-| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
-| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
-
-
-#### Request Storage
-
-Kubernetes users request persistent storage for their pod by creating a
-```PersistentVolumeClaim```. Their request for storage is described by their
-requirements for resources and mount capabilities.
-
-Requests for volumes are bound to available volumes by the volume manager, if a
-suitable match is found. Requests for resources can go unfulfilled.
-
-Users attach their claim to their pod using a new
-```PersistentVolumeClaimVolumeSource``` volume source.
-
-
-##### PersistentVolumeClaim API
-
-
-| Action | HTTP Verb | Path | Description |
-| ---- | ---- | ---- | ---- |
-| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
-| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
-| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
-| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
-| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
-| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
-
-
-
-#### Scheduling constraints
-
-Scheduling constraints are to be handled similarly to pod resource constraints.
-Pods will need to be annotated or decorated with the resources they require on
-a node. Similarly, a node will need to list how much of its storage is used and
-how much remains available.
-
-TBD
-
-
-#### Events
-
-The implementation of persistent storage will not require events to communicate
-to the user the state of their claim. The CLI for bound claims contains a
-reference to the backing persistent volume. This is always present in the API
-and CLI, making an event communicating the same information unnecessary.
-
-Events that communicate the state of a mounted volume are left to the volume
-plugins.
-
-### Example
-
-#### Admin provisions storage
-
-An administrator provisions storage by posting PVs to the API. This task can be
-scripted and automated in various ways. Dynamic provisioning is a future feature
-that can maintain levels of PVs.
-
-```yaml
-POST:
-
-kind: PersistentVolume
-apiVersion: v1
-metadata:
-  name: pv0001
-spec:
-  capacity:
-    storage: 10
-  persistentDisk:
-    pdName: "abc123"
-    fsType: "ext4"
-```
-
-```console
-$ kubectl get pv
-
-NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
-pv0001 map[] 10737418240 RWO Pending
-```
-
-#### Users request storage
-
-A user requests storage by posting a PVC to the API. Their request contains the
-AccessModes they wish their volume to have and the minimum size needed.
-
-The user must be within a namespace to create PVCs.
-
-```yaml
-POST:
-
-kind: PersistentVolumeClaim
-apiVersion: v1
-metadata:
-  name: myclaim-1
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: 3
-```
-
-```console
-$ kubectl get pvc
-
-NAME LABELS STATUS VOLUME
-myclaim-1 map[] pending
-```
-
-
-#### Matching and binding
-
-The ```PersistentVolumeClaimBinder``` attempts to find an available volume that
-most closely matches the user's request. If one exists, they are bound by
-putting a reference on the PV to the PVC. Requests can go unfulfilled if a
-suitable match is not found.
-
-```console
-$ kubectl get pv
-
-NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
-pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
-
-
-$ kubectl get pvc
-
-NAME LABELS STATUS VOLUME
-myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
-```
-
-A claim must request access modes and storage capacity. This is because internally PVs are
-indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity.
-A claim may request one or more of the following attributes to better match a PV: volume name, selectors,
-and volume class (currently implemented as an annotation).
-
-A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which
-PVC it will match.
-A PV may also define labels, annotations, and a volume class (currently implemented as an
-annotation) to better target PVCs.
-
-As of Kubernetes version 1.4, the following algorithm describes in more detail how a claim is
-matched to a PV:
-
-1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered.
-"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines
-the mode requested by the claim.
-
-1. The potential PVs above are considered in order of the closest access mode match, with the best case
-being an exact match, and a worse case being more modes than requested by the claim.
-
-1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity
-is not less than the storage being requested by the claim, then this PV will bind to the claim. Done.
-
-1. Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is
-skipped and will be handled by Dynamic Provisioning.
-
-1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a
-placeholder, then the PV is skipped.
-
-1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the
-PV is skipped. But even if a claim's selector matches a PV's labels, a match is not guaranteed,
-since capacities may differ.
-
-1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder
-for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped.
-If the annotations for the PV and PVC are empty they are treated as being equal.
-
-1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs,
-the PV with the smallest capacity that is also equal to or greater than the claim's requested storage
-is the matching PV and will be bound to the claim. Done.
In the case of two or more PVs matching all
-of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner.
-
-*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default
-`StorageClass` has been defined) then a volume will be dynamically provisioned.
-
-#### Claim usage
-
-The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim
-and mount its volume for a pod.
-
-The claim holder owns the claim and its data for as long as the claim exists.
-The pod using the claim can be deleted, but the claim remains in the user's
-namespace. It can be used again and again by many pods.
-
-```yaml
-POST:
-
-kind: Pod
-apiVersion: v1
-metadata:
-  name: mypod
-spec:
-  containers:
-    - image: nginx
-      name: myfrontend
-      volumeMounts:
-        - mountPath: "/var/www/html"
-          name: mypd
-  volumes:
-    - name: mypd
-      source:
-        persistentVolumeClaim:
-          accessMode: ReadWriteOnce
-          claimRef:
-            name: myclaim-1
-```
-
-#### Releasing a claim and Recycling a volume
-
-When a claim holder is finished with their data, they can delete their claim.
-
-```console
-$ kubectl delete pvc myclaim-1
-```
-
-The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
-reference from the PV and changing the PV's status to 'Released'.
-
-Admins can script the recycling of released volumes. Future dynamic provisioners
-will understand how a volume should be recycled.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/persistent-storage.md?pixel)]()
-
diff --git a/podaffinity.md b/podaffinity.md
deleted file mode 100644
index 9291b8b9..00000000
--- a/podaffinity.md
+++ /dev/null
@@ -1,673 +0,0 @@
-# Inter-pod topological affinity and anti-affinity
-
-## Introduction
-
-NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.
- -This document describes a proposal for specifying and implementing inter-pod -topological affinity and anti-affinity. By that we mean: rules that specify that -certain pods should be placed in the same topological domain (e.g. same node, -same rack, same zone, same power domain, etc.) as some other pods, or, -conversely, should *not* be placed in the same topological domain as some other -pods. - -Here are a few example rules; we explain how to express them using the API -described in this doc later, in the section "Examples." -* Affinity - * Co-locate the pods from a particular service or Job in the same availability -zone, without specifying which zone that should be. - * Co-locate the pods from service S1 with pods from service S2 because S1 uses -S2 and thus it is useful to minimize the network latency between them. -Co-location might mean same nodes and/or same availability zone. -* Anti-affinity - * Spread the pods of a service across nodes and/or availability zones, e.g. to -reduce correlated failures. - * Give a pod "exclusive" access to a node to guarantee resource isolation -- -it must never share the node with other pods. - * Don't schedule the pods of a particular service on the same nodes as pods of -another service that are known to interfere with the performance of the pods of -the first service. - -For both affinity and anti-affinity, there are three variants. Two variants have -the property of requiring the affinity/anti-affinity to be satisfied for the pod -to be allowed to schedule onto a node; the difference between them is that if -the condition ceases to be met later on at runtime, for one of them the system -will try to eventually evict the pod, while for the other the system may not try -to do so. The third variant simply provides scheduling-time *hints* that the -scheduler will try to satisfy but may not be able to. These three variants are -directly analogous to the three variants of [node affinity](nodeaffinity.md). 
- -Note that this proposal is only about *inter-pod* topological affinity and -anti-affinity. There are other forms of topological affinity and anti-affinity. -For example, you can use [node affinity](nodeaffinity.md) to require (prefer) -that a set of pods all be scheduled in some specific zone Z. Node affinity is -not capable of expressing inter-pod dependencies, and conversely the API we -describe in this document is not capable of expressing node affinity rules. For -simplicity, we will use the terms "affinity" and "anti-affinity" to mean -"inter-pod topological affinity" and "inter-pod topological anti-affinity," -respectively, in the remainder of this document. - -## API - -We will add one field to `PodSpec` - -```go -Affinity *Affinity `json:"affinity,omitempty"` -``` - -The `Affinity` type is defined as follows - -```go -type Affinity struct { - PodAffinity *PodAffinity `json:"podAffinity,omitempty"` - PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"` -} - -type PodAffinity struct { - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. 
due to a pod label update), the - // system may or may not try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. - PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -type PodAntiAffinity struct { - // If the anti-affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the anti-affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system will try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. 
- RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"` - // If the anti-affinity requirements specified by this field are not met at - // scheduling time, the pod will not be scheduled onto the node. - // If the anti-affinity requirements specified by this field cease to be met - // at some point during pod execution (e.g. due to a pod label update), the - // system may or may not try to eventually evict the pod from its node. - // When there are multiple elements, the lists of nodes corresponding to each - // PodAffinityTerm are intersected, i.e. all terms must be satisfied. - RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"` - // The scheduler will prefer to schedule pods to nodes that satisfy - // the anti-affinity expressions specified by this field, but it may choose - // a node that violates one or more of the expressions. The node that is - // most preferred is the one with the greatest sum of weights, i.e. - // for each node that meets all of the scheduling requirements (resource - // request, RequiredDuringScheduling anti-affinity expressions, etc.), - // compute a sum by iterating through the elements of this field and adding - // "weight" to the sum if the node matches the corresponding MatchExpressions; the - // node(s) with the highest sum are the most preferred. 
- PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"` -} - -type WeightedPodAffinityTerm struct { - // weight is in the range 1-100 - Weight int `json:"weight"` - PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"` -} - -type PodAffinityTerm struct { - LabelSelector *LabelSelector `json:"labelSelector,omitempty"` - // namespaces specifies which namespaces the LabelSelector applies to (matches against); - // nil list means "this pod's namespace," empty list means "all namespaces" - // The json tag here is not "omitempty" since we need to distinguish nil and empty. - // See https://golang.org/pkg/encoding/json/#Marshal for more details. - Namespaces []api.Namespace `json:"namespaces,omitempty"` - // empty topology key is interpreted by the scheduler as "all topologies" - TopologyKey string `json:"topologyKey,omitempty"` -} -``` - -Note that the `Namespaces` field is necessary because normal `LabelSelector` is -scoped to the pod's namespace, but we need to be able to match against all pods -globally. - -To explain how this API works, let's say that the `PodSpec` of a pod `P` has an -`Affinity` that is configured as follows (note that we've omitted and collapsed -some fields for simplicity, but this should sufficiently convey the intent of -the design): - -```go -PodAffinity { - RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}}, - PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}}, -} -PodAntiAffinity { - RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}}, - PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}} -} -``` - -Then when scheduling pod P, the scheduler: -* Can only schedule P onto nodes that are running pods that satisfy `P1`. -(Assumes all nodes have a label with key `node` and value specifying their node -name.) -* Should try to schedule P onto zones that are running pods that satisfy `P2`. 
-(Assumes all nodes have a label with key `zone` and value specifying their -zone.) -* Cannot schedule P onto any racks that are running pods that satisfy `P3`. -(Assumes all nodes have a label with key `rack` and value specifying their rack -name.) -* Should try not to schedule P onto any power domains that are running pods that -satisfy `P4`. (Assumes all nodes have a label with key `power` and value -specifying their power domain.) - -When `RequiredDuringScheduling` has multiple elements, the requirements are -ANDed. For `PreferredDuringScheduling` the weights are added for the terms that -are satisfied for each node, and the node(s) with the highest weight(s) are the -most preferred. - -In reality there are two variants of `RequiredDuringScheduling`: one suffixed -with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. -For the first variant, if the affinity/anti-affinity ceases to be met at some -point during pod execution (e.g. due to a pod label update), the system will try -to eventually evict the pod from its node. In the second variant, the system may -or may not try to eventually evict the pod from its node. - -## A comment on symmetry - -One thing that makes affinity and anti-affinity tricky is symmetry. - -Imagine a cluster that is running pods from two services, S1 and S2. Imagine -that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not -run me on nodes that are running pods from S2." It is not sufficient just to -check that there are no S2 pods on a node when you are scheduling a S1 pod. You -also need to ensure that there are no S1 pods on a node when you are scheduling -a S2 pod, *even though the S2 pod does not have any anti-affinity rules*. -Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's -RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving -S2 pod. 
More specifically, if S1 has the aforementioned RequiredDuringScheduling -anti-affinity rule, then: -* if a node is empty, you can schedule S1 or S2 onto the node -* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node - -Note that while RequiredDuringScheduling anti-affinity is symmetric, -RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 -have a RequiredDuringScheduling affinity rule "run me on nodes that are running -pods from S2," it is not required that there be S1 pods on a node in order to -schedule a S2 pod onto that node. More specifically, if S1 has the -aforementioned RequiredDuringScheduling affinity rule, then: -* if a node is empty, you can schedule S2 onto the node -* if a node is empty, you cannot schedule S1 onto the node -* if a node is running S2, you can schedule S1 onto the node -* if a node is running S1+S2 and S1 terminates, S2 continues running -* if a node is running S1+S2 and S2 terminates, the system terminates S1 -(eventually) - -However, although RequiredDuringScheduling affinity is not symmetric, there is -an implicit PreferredDuringScheduling affinity rule corresponding to every -RequiredDuringScheduling affinity rule: if the pods of S1 have a -RequiredDuringScheduling affinity rule "run me on nodes that are running pods -from S2" then it is not required that there be S1 pods on a node in order to -schedule a S2 pod onto that node, but it would be better if there are. - -PreferredDuringScheduling is symmetric. If the pods of S1 had a -PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that -are running pods from S2" then we would prefer to keep a S1 pod that we are -scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that -we are scheduling off of nodes that are running S1 pods. 
Likewise if the pods of
-S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
-are running pods from S2" then we would prefer to place a S1 pod that we are
-scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
-we are scheduling onto a node that is running a S1 pod.
-
-## Examples
-
-Here are some examples of how you would express various affinity and
-anti-affinity rules using the API we described.
-
-### Affinity
-
-In the examples below, the word "put" is intentionally ambiguous; the rules are
-the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
-put" (PreferredDuringScheduling)--all that changes is which field the rule goes
-into. Also, we only discuss scheduling-time, and ignore the execution-time.
-Finally, some of the examples use "zone" and some use "node," just to make the
-examples more interesting; any of the examples with "zone" will also work for
-"node" if you change the `TopologyKey`, and vice-versa.
-
-* **Put the pod in zone Z**:
-Tricked you! It is not possible to express this using the API described here. For
-this you should use node affinity.
-
-* **Put the pod in a zone that is running at least one pod from service S**:
-`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
-
-* **Put the pod on a node that is already running a pod that requires a license
-for software package P**: Assuming pods that require a license for software
-package P have a label `{key=license, value=P}`:
-`{LabelSelector: "license" In "P", TopologyKey: "node"}`
-
-* **Put this pod in the same zone as other pods from its same service**:
-Assuming pods from this pod's service have some label `{key=service, value=S}`:
-`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
-
-This last example illustrates a small issue with this API when it is used with a
-scheduler that processes the pending queue one pod at a time, like the current
-Kubernetes scheduler.
The RequiredDuringScheduling rule -`{LabelSelector: "service" In "S", TopologyKey: "zone"}` -only "works" once one pod from service S has been scheduled. But if all pods in -service S have this RequiredDuringScheduling rule in their PodSpec, then the -RequiredDuringScheduling rule will block the first pod of the service from ever -scheduling, since it is only allowed to run in a zone with another pod from the -same service. And of course that means none of the pods of the service will be -able to schedule. This problem *only* applies to RequiredDuringScheduling -affinity, not PreferredDuringScheduling affinity or any variant of -anti-affinity. There are at least three ways to solve this problem: -* **short-term**: have the scheduler use a rule that if the -RequiredDuringScheduling affinity requirement matches a pod's own labels, and -there are no other such pods anywhere, then disregard the requirement. This -approach has a corner case when running parallel schedulers that are allowed to -schedule pods from the same replicated set (e.g. a single PodTemplate): both -schedulers may try to schedule pods from the set at the same time and think -there are no other pods from that set scheduled yet (e.g. they are trying to -schedule the first two pods from the set), but by the time the second binding is -committed, the first one has already been committed, leaving you with two pods -running that do not respect their RequiredDuringScheduling affinity. There is no -simple way to detect this "conflict" at scheduling time given the current system -implementation. -* **longer-term**: when a controller creates pods from a PodTemplate, for -exactly *one* of those pods, it should omit any RequiredDuringScheduling -affinity rules that select the pods of that PodTemplate. -* **very long-term/speculative**: controllers could present the scheduler with a -group of pods from the same PodTemplate as a single unit. 
This is similar to the
-first approach described above but avoids the corner case. No special logic is
-needed in the controllers. Moreover, this would allow the scheduler to do proper
-[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
-it could receive an entire gang simultaneously as a single unit.
-
-### Anti-affinity
-
-As with the affinity examples, the examples here can be RequiredDuringScheduling
-or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
-"must not" or as "try not to" depending on whether the rule appears in
-`RequiredDuringScheduling` or `PreferredDuringScheduling`.
-
-* **Spread the pods of this service S across nodes and zones**:
-`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
-{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
-(note that if this is specified as a RequiredDuringScheduling anti-affinity,
-then the first clause is redundant, since the second clause will force the
-scheduler to not put more than one pod from S in the same zone, and thus by
-definition it will not put more than one pod from S on the same node, assuming
-each node is in one zone. This rule is more useful as PreferredDuringScheduling
-anti-affinity, e.g. one might expect it to be common in
-[Cluster Federation](../../docs/proposals/federation.md) clusters.)
-
-* **Don't co-locate pods of this service with pods from service "evilService"**:
-`{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}`
-
-* **Don't co-locate pods of this service with any other pods including pods of this service**:
-`{LabelSelector: empty, TopologyKey: "node"}`
-
-* **Don't co-locate pods of this service with any other pods except other pods of this service**:
-Assuming pods from the service have some label `{key=service, value=S}`:
-`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
-Note that this works because `"service" NotIn "S"` matches pods with no key
-"service" as well as pods with key "service" and a corresponding value that is
-not "S."
-
-## Algorithm
-
-An example algorithm a scheduler might use to implement affinity and
-anti-affinity rules is as follows. There are certainly more efficient ways to
-do it; this is just intended to demonstrate that the API's semantics are
-implementable.
-
-Terminology definition: We say a pod P is "feasible" on a node N if P meets all
-of the scheduler predicates for scheduling P onto N. Note that this algorithm is
-only concerned about scheduling time, thus it makes no distinction between
-RequiredDuringExecution and IgnoredDuringExecution.
-
-To make the algorithm slightly more readable, we use the term "HardPodAffinity"
-as shorthand for "RequiredDuringScheduling pod affinity" and
-"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
-Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
-
-** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
-into account; currently it assumes all terms have weight 1.
**
-
-```
-Z = the pod you are scheduling
-{N} = the set of all nodes in the system // this algorithm will reduce it to the set of all nodes feasible for Z
-// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
-X = {Z's PodSpec's HardPodAffinity}
-foreach element H of {X}
-    P = {all pods in the system that match H.LabelSelector}
-    M map[string]int // topology value -> number of pods running on nodes with that topology value
-    foreach pod Q of {P}
-        L = {labels of the node on which Q is running, represented as a map from label key to label value}
-        M[L[H.TopologyKey]]++
-    {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
-// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
-// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
-X = {Z's PodSpec's HardPodAntiAffinity}
-foreach element H of {X}
-    P = {all pods in the system that match H.LabelSelector}
-    M map[string]int // topology value -> number of pods running on nodes with that topology value
-    foreach pod Q of {P}
-        L = {labels of the node on which Q is running, represented as a map from label key to label value}
-        M[L[H.TopologyKey]]++
-    {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
-// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
-foreach node A of {N}
-    foreach pod B that is bound to A
-        if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
-// At this point, all nodes in {N} are feasible for Z.
-// Step 3a: Soft version of Step 1a -Y map[string]int // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node -Initialize the keys of Y to all of the nodes in {N}, and the values to 0 -X = {Z's PodSpec's SoftPodAffinity} -Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" -// Step 3b: Soft version of Step 1b -X = {Z's PodSpec's SoftPodAntiAffinity} -Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++" -// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft -foreach node A of {N} - foreach pod B that is bound to A - increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A -// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is -// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with -// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better). -``` - -## Special considerations for RequiredDuringScheduling anti-affinity - -In this section we discuss three issues with RequiredDuringScheduling -anti-affinity: Denial of Service (DoS), co-existing with daemons, and -determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265) -for additional discussion of these topics. - -### Denial of Service - -Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity -can intentionally or unintentionally cause various problems for other pods, due -to the symmetry property of anti-affinity. 
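The symmetry obligation at the heart of this problem is Step 2 of the algorithm above: before placing a pod, the scheduler must also re-check the anti-affinity terms of pods already bound to the node. A minimal sketch, with hypothetical simplified types (`Pod`, `Term`, `feasible`) that match on a single exact label instead of a full `LabelSelector` and ignore namespaces and topology:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the API types in this proposal.
type Pod struct {
	Labels       map[string]string
	AntiAffinity []Term // HardPodAntiAffinity terms
}

// Term selects pods by one required label (a stand-in for LabelSelector).
type Term struct {
	Key, Value string
}

func matches(t Term, labels map[string]string) bool {
	return labels[t.Key] == t.Value
}

// feasible reports whether incoming pod z may be placed on a node, honoring
// anti-affinity in BOTH directions: z's own terms against resident pods, and
// (the symmetry requirement) each resident pod's terms against z.
func feasible(z Pod, residents []Pod) bool {
	for _, r := range residents {
		for _, t := range z.AntiAffinity {
			if matches(t, r.Labels) {
				return false // z objects to r
			}
		}
		for _, t := range r.AntiAffinity {
			if matches(t, z.Labels) {
				return false // r objects to z, even though z has no rules
			}
		}
	}
	return true
}

func main() {
	s1 := Pod{
		Labels:       map[string]string{"service": "S1"},
		AntiAffinity: []Term{{"service", "S2"}}, // "do not run me with S2"
	}
	s2 := Pod{Labels: map[string]string{"service": "S2"}} // no rules of its own
	fmt.Println(feasible(s2, []Pod{s1}))                  // → false
	fmt.Println(feasible(s2, nil))                        // → true
}
```

Note how the S2 pod is rejected purely by the resident S1 pod's rule; this second loop is exactly what gives a first-arriving pod the blocking power discussed in this section.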
- -The most notable danger is the ability for a pod that arrives first to some -topology domain, to block all other pods from scheduling there by stating a -conflict with all other pods. The standard approach to preventing resource -hogging is quota, but simple resource quota cannot prevent this scenario because -the pod may request very little resources. Addressing this using quota requires -a quota scheme that charges based on "opportunity cost" rather than based simply -on requested resources. For example, when handling a pod that expresses -RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey` -(i.e. exclusive access to a node), it could charge for the resources of the -average or largest node in the cluster. Likewise if a pod expresses -RequiredDuringScheduling anti-affinity for all pods using a "cluster" -`TopologyKey`, it could charge for the resources of the entire cluster. If node -affinity is used to constrain the pod to a particular topology domain, then the -admission-time quota charging should take that into account (e.g. not charge for -the average/largest machine if the PodSpec constrains the pod to a specific -machine with a known size; instead charge for the size of the actual machine -that the pod was constrained to). In all cases once the pod is scheduled, the -quota charge should be adjusted down to the actual amount of resources allocated -(e.g. the size of the actual machine that was assigned, not the -average/largest). If a cluster administrator wants to overcommit quota, for -example to allow more than N pods across all users to request exclusive node -access in a cluster with N nodes, then a priority/preemption scheme should be -added so that the most important pods run when resource demand exceeds supply. - -An alternative approach, which is a bit of a blunt hammer, is to use a -capability mechanism to restrict use of RequiredDuringScheduling anti-affinity -to trusted users. 
A more complex capability mechanism might only restrict it -when using a non-"node" TopologyKey. - -Our initial implementation will use a variant of the capability approach, which -requires no configuration: we will simply reject ALL requests, regardless of -user, that specify "all namespaces" with non-"node" TopologyKey for -RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use -case while prohibiting the more dangerous ones. - -A weaker variant of the problem described in the previous paragraph is a pod's -ability to use anti-affinity to degrade the scheduling quality of another pod, -but not completely block it from scheduling. For example, a set of pods S1 could -use node affinity to request to schedule onto a set of nodes that some other set -of pods S2 prefers to schedule onto. If the pods in S1 have -RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for -S2, then due to the symmetry property of anti-affinity, they can prevent the -pods in S2 from scheduling onto their preferred nodes if they arrive first (for -sure in the RequiredDuringScheduling case, and with some probability that -depends on the weighting scheme for the PreferredDuringScheduling case). A very -sophisticated priority and/or quota scheme could mitigate this, or alternatively -we could eliminate the symmetry property of the implementation of -PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling -anti-affinity could affect scheduling quality of another pod, and as we -described in the previous paragraph, such pods could be charged quota for the -full topology domain, thereby reducing the potential for abuse. - -We won't try to address this issue in our initial implementation; we can -consider one of the approaches mentioned above if it turns out to be a problem -in practice. 
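The initial "reject dangerous requests at admission time" rule described above can be sketched as follows. This is an illustrative Python sketch over simplified, hypothetical term dictionaries (the real admission logic operates on the Kubernetes API objects, and the "all namespaces" convention here is an assumption of this sketch):

```python
NODE_TOPOLOGY_KEY = "node"  # hypothetical name for the node-level topology key

def admit_pod(required_anti_affinity_terms):
    """Admission check sketch: reject any RequiredDuringScheduling
    anti-affinity term that selects all namespaces with a non-"node"
    TopologyKey.

    Each term is a dict with (hypothetical) keys:
      "namespaces":   list of namespace names; an empty list means "all"
      "topology_key": the topology domain the term applies to
    """
    for term in required_anti_affinity_terms:
        selects_all_namespaces = not term.get("namespaces")
        if selects_all_namespaces and term["topology_key"] != NODE_TOPOLOGY_KEY:
            # Could block an entire rack, zone, or cluster for everyone: reject.
            return False
    return True
```

This permits the "exclusive node" use case (all namespaces with the "node" TopologyKey) while rejecting, for example, a request for exclusive access to a whole rack.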
-
-### Co-existing with daemons
-
-A cluster administrator may wish to allow pods that express anti-affinity
-against all pods, to nonetheless co-exist with system daemon pods, such as those
-run by DaemonSet. In principle, we would like the specification for
-RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
-more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263)
-for a more detailed explanation of the toleration concept).
-There are at least two ways to accomplish this:
-
-* Scheduler special-cases the namespace(s) where daemons live, in the
-  sense that it ignores pods in those namespaces when it is
-  determining feasibility for pods with anti-affinity. The name(s) of
-  the special namespace(s) could be a scheduler configuration
-  parameter, and default to `kube-system`. We could allow
-  multiple namespaces to be specified if we want cluster admins to be
-  able to give their own daemons this special power (they would add
-  their namespace to the list in the scheduler configuration). And of
-  course this would be symmetric, so daemons could schedule onto a node
-  that is already running a pod with anti-affinity.
-
-* We could add an explicit "toleration" concept/field to allow the
-  user to specify namespaces that are excluded when they use
-  RequiredDuringScheduling anti-affinity, and use an admission
-  controller/defaulter to ensure these namespaces are always listed.
-
-Our initial implementation will use the first approach.
-
-### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)
-
-Because anti-affinity is symmetric, in the case of
-RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
-determine which pod(s) to kill when a pod's labels are updated in such a way as
-to cause them to conflict with one or more other pods'
-RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. 
In the
-absence of a priority/preemption scheme, our rule will be that the pod with the
-anti-affinity rule that becomes violated should be the one killed. A pod should
-only specify constraints that apply to namespaces it trusts to not do malicious
-things. Once we have priority/preemption, we can change the rule to say that the
-lowest-priority pod(s) are killed until all
-RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
-
-## Special considerations for RequiredDuringScheduling affinity
-
-The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
-symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
-conflicting pods, and pods that conflict with P cannot schedule onto the node
-once P has been scheduled there. The design we have described says that the
-symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
-says it can only schedule onto nodes running pod Q, this does not mean Q can
-only run on a node that is running P, but the scheduler will try to schedule Q
-onto a node that is running P (i.e. treats the reverse direction as preferred).
-This raises the same scheduling quality concern as we mentioned at the end of
-the Denial of Service section above, and can be addressed in similar ways.
-
-The nature of affinity (as opposed to anti-affinity) means that there is no
-issue of determining which pod(s) to kill when a pod's labels change: it is
-obviously the pod with the affinity rule that becomes violated that must be
-killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
-"fix" violation of an anti-affinity rule.) However, affinity does have a different
-question related to killing: how long should the system wait before declaring
-that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met
-at runtime? 
For example, if a pod P has such an affinity for a pod Q and pod Q -is temporarily killed so that it can be updated to a new binary version, should -that trigger killing of P? More generally, how long should the system wait -before declaring that P's affinity is violated? (Of course affinity is expressed -in terms of label selectors, not for a specific pod, but the scenario is easier -to describe using a concrete pod.) This is closely related to the concept of -forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)). -In theory we could make this time duration be configurable by the user on a per-pod -basis, but for the first version of this feature we will make it a configurable -property of whichever component does the killing and that applies across all pods -using the feature. Making it configurable by the user would require a nontrivial -change to the API syntax (since the field would only apply to -RequiredDuringSchedulingRequiredDuringExecution affinity). - -## Implementation plan - -1. Add the `Affinity` field to PodSpec and the `PodAffinity` and -`PodAntiAffinity` types to the API along with all of their descendant types. -2. Implement a scheduler predicate that takes -`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into -account. Include a workaround for the issue described at the end of the Affinity -section of the Examples section (can't schedule first pod). -3. Implement a scheduler priority function that takes -`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity -into account. -4. Implement admission controller that rejects requests that specify "all -namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` -anti-affinity. This admission controller should be enabled by default. -5. Implement the recommended solution to the "co-existing with daemons" issue -6. At this point, the feature can be deployed. -7. 
Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
-and anti-affinity, and make sure the pieces of the system already implemented
-for `RequiredDuringSchedulingIgnoredDuringExecution` also take
-`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
-scheduler predicate, the quota mechanism, the "co-existing with daemons"
-solution).
-8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
-`TopologyKey` to Kubelet's admission decision.
-9. Implement code in Kubelet *or* the controllers that evicts a pod that no
-longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
-then only for "node" `TopologyKey`; if controller, then potentially for all
-`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
-Do so in a way that addresses the "determining which pod(s) to kill" issue.
-
-We assume Kubelet publishes labels describing the node's membership in all of
-the relevant scheduling domains (e.g. node name, rack name, availability zone
-name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
-
-## Backward compatibility
-
-Old versions of the scheduler will ignore `Affinity`.
-
-Users should not start using `Affinity` until the full implementation has been
-in Kubelet and the master for enough binary versions that we feel comfortable
-that we will not need to roll back either Kubelet or master to a version that
-does not support them. Longer-term we will use a programmatic approach to
-enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
-
-## Extensibility
-
-The design described here is the result of careful analysis of use cases, a
-decade of experience with Borg at Google, and a review of similar features in
-other open-source container orchestration systems. 
We believe that it properly
-balances the goal of expressiveness against the goals of simplicity and
-efficiency of implementation. However, we recognize that use cases may arise in
-the future that cannot be expressed using the syntax described here. Although we
-are not implementing an affinity-specific extensibility mechanism for a variety
-of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
-for Kubernetes users to get a consistent experience, etc.), the regular
-Kubernetes annotation mechanism can be used to add or replace affinity rules.
-The way this would work is:
-1. Define one or more annotations to describe the new affinity rule(s).
-1. User (or an admission controller) attaches the annotation(s) to pods to
-request the desired scheduling behavior. If the new rule(s) *replace* one or
-more fields of `Affinity` then the user would omit those fields from `Affinity`;
-if they are *additional rules*, then the user would fill in `Affinity` as well
-as the annotation(s).
-1. Scheduler takes the annotation(s) into account when scheduling.
-
-If some particular new syntax becomes popular, we would consider upstreaming it
-by integrating it into the standard `Affinity`.
-
-## Future work and non-work
-
-One can imagine that in the anti-affinity RequiredDuringScheduling case one
-might want to associate a number with the rule, for example "do not allow this
-pod to share a rack with more than three other pods (in total, or from the same
-service as the pod)." We could allow this to be specified by adding an integer
-`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
-However, this flexibility complicates the system and we do not intend to
-implement it. 
- -It is likely that the specification and implementation of pod anti-affinity -can be unified with [taints and tolerations](taint-toleration-dedicated.md), -and likewise that the specification and implementation of pod affinity -can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod -labels would be "inherited" by the node, and pods would only be able to specify -affinity and anti-affinity for a node's labels. Our main motivation for not -unifying taints and tolerations with pod anti-affinity is that we foresee taints -and tolerations as being a concept that only cluster administrators need to -understand (and indeed in some setups taints and tolerations wouldn't even be -directly manipulated by a cluster administrator, instead they would only be set -by an admission controller that is implementing the administrator's high-level -policy about different classes of special machines and the users who belong to -the groups allowed to access them). Moreover, the concept of nodes "inheriting" -labels from pods seems complicated; it seems conceptually simpler to separate -rules involving relatively static properties of nodes from rules involving which -other pods are running on the same node or larger topology domain. - -Data/storage affinity is related to pod affinity, and is likely to draw on some -of the ideas we have used for pod affinity. Today, data/storage affinity is -expressed using node affinity, on the assumption that the pod knows which -node(s) store(s) the data it wants. But a more flexible approach would allow the -pod to name the data rather than the node. - -## Related issues - -The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265). - -The topic of affinity/anti-affinity has generated a lot of discussion. 
The main -issue is [#367](https://github.com/kubernetes/kubernetes/issues/367) -but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485), -[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369), -[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707), -[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341), -[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906) -all have additional discussion and use cases. - -As the examples in this document have demonstrated, topological affinity is very -useful in clusters that are spread across availability zones, e.g. to co-locate -pods of a service in the same zone to avoid a wide-area network hop, or to -spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059), -[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063), -and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant. - -Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related. - -This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816). - -## Related work - -** TODO: cite references ** - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/podaffinity.md?pixel)]() - diff --git a/principles.md b/principles.md deleted file mode 100644 index 4e0b663c..00000000 --- a/principles.md +++ /dev/null @@ -1,101 +0,0 @@ -# Design Principles - -Principles to follow when extending Kubernetes. 
-
-## API
-
-See also the [API conventions](../devel/api-conventions.md).
-
-* All APIs should be declarative.
-* API objects should be complementary and composable, not opaque wrappers.
-* The control plane should be transparent -- there are no hidden internal APIs.
-* The cost of API operations should be proportional to the number of objects
-intentionally operated upon. Therefore, common filtered lookups must be indexed.
-Beware of patterns of multiple API calls that would incur quadratic behavior.
-* Object status must be 100% reconstructable by observation. Any history kept
-must be just an optimization and not required for correct operation.
-* Cluster-wide invariants are difficult to enforce correctly. Try not to add
-them. If you must have them, don't enforce them atomically in master components;
-that is contention-prone and doesn't provide a recovery path in the case of a
-bug allowing the invariant to be violated. Instead, provide a series of checks
-to reduce the probability of a violation, and make every component involved able
-to recover from an invariant violation.
-* Low-level APIs should be designed for control by higher-level systems.
-Higher-level APIs should be intent-oriented (think SLOs) rather than
-implementation-oriented (think control knobs).
-
-## Control logic
-
-* Functionality must be *level-based*, meaning the system must operate correctly
-given the desired state and the current/observed state, regardless of how many
-intermediate state updates may have been missed. Edge-triggered behavior must be
-just an optimization.
-* Assume an open world: continually verify assumptions and gracefully adapt to
-external events and/or actors. Example: we allow users to kill pods under
-control of a replication controller; it just replaces them.
-* Do not define comprehensive state machines for objects with behaviors
-associated with state transitions and/or "assumed" states that cannot be
-ascertained by observation. 
-* Don't assume a component's decisions will not be overridden or rejected, nor
-that the component will always understand why. For example, etcd may reject writes.
-Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry,
-but back off and/or make alternative decisions.
-* Components should be self-healing. For example, if you must keep some state
-(e.g., cache), the content needs to be periodically refreshed, so that if an item
-does get erroneously stored or a deletion event is missed, etc., it will soon be
-fixed, ideally on timescales that are shorter than what will attract attention
-from humans.
-* Component behavior should degrade gracefully. Prioritize actions so that the
-most important activities can continue to function even when overloaded and/or
-in states of partial failure.
-
-## Architecture
-
-* Only the apiserver should communicate with etcd/store, and not other
-components (scheduler, kubelet, etc.).
-* Compromising a single node shouldn't compromise the cluster.
-* Components should continue to do what they were last told in the absence of
-new instructions (e.g., due to network partition or component outage).
-* All components should keep all relevant state in memory all the time. The
-apiserver should write through to etcd/store, other components should write
-through to the apiserver, and they should watch for updates made by other
-clients.
-* Watch is preferred over polling.
-
-## Extensibility
-
-TODO: pluggability
-
-## Bootstrapping
-
-* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
-* Minimize the number of dependencies, particularly those required for
-steady-state operation.
-* Stratify the dependencies that remain via principled layering.
-* Break any circular dependencies by converting hard dependencies to soft
-dependencies. 
- * Also accept that data from other components from another source, such as -local files, which can then be manually populated at bootstrap time and then -continuously updated once those other components are available. - * State should be rediscoverable and/or reconstructable. - * Make it easy to run temporary, bootstrap instances of all components in -order to create the runtime state needed to run the components in the steady -state; use a lock (master election for distributed components, file lock for -local components like Kubelet) to coordinate handoff. We call this technique -"pivoting". - * Have a solution to restart dead components. For distributed components, -replication works well. For local components such as Kubelet, a process manager -or even a simple shell loop works. - -## Availability - -TODO - -## General principles - -* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules) - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/principles.md?pixel)]() - diff --git a/resource-qos.md b/resource-qos.md deleted file mode 100644 index cfbe4faf..00000000 --- a/resource-qos.md +++ /dev/null @@ -1,218 +0,0 @@ -# Resource Quality of Service in Kubernetes - -**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar) -**Last Updated**: 5/17/2016 - -**Status**: Implemented - -*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.* - -## Introduction - -This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*. -Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee. 
-
-Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit, which is the maximum amount that the system will allow the container to use.
-The system computes pod level requests and limits by summing up per-resource requests and limits across all containers.
-When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers.
-This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees.
-Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
-
-## Requests and Limits
-
-For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](../proposals/node-allocatable.md) & `request <= limit <= Infinity`.
-If a pod is successfully scheduled, the container is guaranteed the amount of resources requested.
-Scheduling is based on `requests` and not `limits`.
-The pod and its containers will not be allowed to exceed the specified limit.
-How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
-
-### Compressible Resource Guarantees
-
-- For now, we are only supporting CPU.
-- Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because cpu isolation is at the container level. Pod level cgroups will be introduced soon to achieve this goal.
-- Excess CPU resources will be distributed based on the amount of CPU requested. 
For example, suppose container A requests 600 milli CPUs, and container B requests 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 10 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
-- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
-
-### Incompressible Resource Guarantees
-
-- For now, we are only supporting memory.
-- Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
-- When Pods use more memory than their limit, a process that is using the most amount of memory, inside one of the pod's containers, will be killed by the kernel.
-
-### Admission/Scheduling Policy
-
-- Pods will be admitted by Kubelet & scheduled by the scheduler based on the sum of requests of their containers. The scheduler & kubelet will ensure that the sum of requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU).
-
-## QoS Classes
-
-In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
-
-The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. 
Hypothetically, users could use a (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
-
-Pods can be of one of 3 different classes:
-
-- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
-
-Examples:
-
-```yaml
-containers:
-  name: foo
-    resources:
-      limits:
-        cpu: 10m
-        memory: 1Gi
-  name: bar
-    resources:
-      limits:
-        cpu: 100m
-        memory: 100Mi
-```
-
-```yaml
-containers:
-  name: foo
-    resources:
-      limits:
-        cpu: 10m
-        memory: 1Gi
-      requests:
-        cpu: 10m
-        memory: 1Gi
-
-  name: bar
-    resources:
-      limits:
-        cpu: 100m
-        memory: 100Mi
-      requests:
-        cpu: 100m
-        memory: 100Mi
-```
-
-- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
-When `limits` are not specified, they default to the node capacity.
-
-Examples:
-
-Container `bar` has no resources specified.
-
-```yaml
-containers:
-  name: foo
-    resources:
-      limits:
-        cpu: 10m
-        memory: 1Gi
-      requests:
-        cpu: 10m
-        memory: 1Gi
-
-  name: bar
-```
-
-Containers `foo` and `bar` have limits set for different resources.
-
-```yaml
-containers:
-  name: foo
-    resources:
-      limits:
-        memory: 1Gi
-
-  name: bar
-    resources:
-      limits:
-        cpu: 100m
-```
-
-Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
-
-```yaml
-containers:
-  name: foo
-    resources:
-      requests:
-        cpu: 10m
-        memory: 1Gi
-
-  name: bar
-```
-
-- If `requests` and `limits` are not set for all of the resources, across all containers, then the pod is classified as **Best-Effort**. 
-
-Examples:
-
-```yaml
-containers:
-  name: foo
-    resources:
-  name: bar
-    resources:
-```
-
-Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU); they will be temporarily throttled.
-
-Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
-
-- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
-These containers can use any amount of free memory in the node though.
-
-- *Guaranteed* pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
-
-- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
-Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
-
-### OOM Score configuration at the Nodes
-
-Pod OOM score configuration
-- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed.
-- The base OOM score is between 0 and 1000, so if process A’s OOM_SCORE_ADJ - process B’s OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B. 
-
-The final OOM score of a process is also between 0 and 1000
-
-*Best-effort*
- - Set OOM_SCORE_ADJ: 1000
- - So processes in best-effort containers will have an OOM_SCORE of 1000
-
-*Guaranteed*
- - Set OOM_SCORE_ADJ: -998
- - So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
-
-*Burstable*
- - If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2
- - Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested)
- - This ensures that the OOM_SCORE of a burstable pod is > 1
- - If memory request is `0`, OOM_SCORE_ADJ is set to `999`.
- - So burstable pods will be killed if they conflict with guaranteed pods
- - If a burstable pod uses less memory than requested, its OOM_SCORE < 1000
- - So best-effort pods will be killed if they conflict with burstable pods using less than requested memory
- - If a process in a burstable pod's container uses more memory than what the container had requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000
- - Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
- - If a burstable pod's containers with multiple processes conflict, then the formula for OOM scores is a heuristic; it will not ensure "Request and Limit" guarantees.
-
-*Pod infra containers* or *Special Pod init process*
- - OOM_SCORE_ADJ: -998
-
-*Kubelet, Docker*
- - OOM_SCORE_ADJ: -999 (won’t be OOM killed)
- - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.
-
-## Known issues and possible improvements
-
-The above implementation provides for basic oversubscription with protection, but there are a few known limitations. 
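As a concrete restatement of the OOM_SCORE_ADJ policy above, here is an illustrative sketch (hypothetical function and parameter names; this is not the Kubelet's actual implementation):

```python
def oom_score_adj(qos_class, memory_request_bytes=0, node_memory_bytes=1):
    """Sketch of the per-QoS-class OOM_SCORE_ADJ table described above.

    qos_class is one of "guaranteed", "burstable", "best-effort".
    The parameter names are assumptions of this sketch.
    """
    if qos_class == "best-effort":
        return 1000   # first to be OOM-killed
    if qos_class == "guaranteed":
        return -998   # processes end up with an OOM_SCORE of 0 or 1
    # Burstable: scale inversely with the fraction of node memory requested.
    percent_requested = 100.0 * memory_request_bytes / node_memory_bytes
    if percent_requested > 99.8:
        return 2      # keep the score above guaranteed pods
    if memory_request_bytes == 0:
        return 999    # no request: treated almost like best-effort
    return int(1000 - 10 * percent_requested)  # > 1 by construction
```

For example, a burstable pod requesting half of the node's memory would get an OOM_SCORE_ADJ of 500, so under memory pressure its processes are killed before those of a guaranteed pod but after those of a best-effort pod.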
- -#### Support for Swap - -- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior. - -## Alternative QoS Class Policy - -An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). -A strict hierarchy of user-specified numerical priorities is not desirable because: - -1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively -2. Changes to desired priority bands would require changes to all user pod configurations. - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]() - diff --git a/resources.md b/resources.md deleted file mode 100644 index bb66885b..00000000 --- a/resources.md +++ /dev/null @@ -1,370 +0,0 @@ -**Note: this is a design doc, which describes features that have not been -completely implemented. User documentation of the current state is -[here](../user-guide/compute-resources.md). The tracking issue for -implementation of this model is [#168](http://issue.k8s.io/168). Currently, both -limits and requests of memory and cpu on containers (not pods) are supported. -"memory" is in bytes and "cpu" is in milli-cores.** - -# The Kubernetes resource model - -To do good pod placement, Kubernetes needs to know how big pods are, as well as -the sizes of the nodes onto which they are being placed. 
The definition of "how -big" is given by the Kubernetes resource model — the subject of this -document. - -The resource model aims to be: -* simple, for common cases; -* extensible, to accommodate future growth; -* regular, with few special cases; and -* precise, to avoid misunderstandings and promote pod portability. - -## The resource model - -A Kubernetes _resource_ is something that can be requested by, allocated to, or -consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, -and network bandwidth. - -Once resources on a node have been allocated to one pod, they should not be -allocated to another until that pod is removed or exits. This means that -Kubernetes schedulers should ensure that the sum of the resources allocated -(requested and granted) to its pods never exceeds the usable capacity of the -node. Testing whether a pod will fit on a node is called _feasibility checking_. - -Note that the resource model currently prohibits over-committing resources; we -will want to relax that restriction later. - -### Resource types - -All resources have a _type_ that is identified by their _typename_ (a string, -e.g., "memory"). Several resource types are predefined by Kubernetes (a full -list is below), although only two will be supported at first: CPU and memory. -Users and system administrators can define their own resource types if they wish -(e.g., Hadoop slots). - -A fully-qualified resource typename is constructed from a DNS-style _subdomain_, -followed by a slash `/`, followed by a name. -* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) -(e.g., `kubernetes.io`, `example.com`). -* The name must be not more than 63 characters, consisting of upper- or -lower-case alphanumeric characters, with the `-`, `_`, and `.` characters -allowed anywhere except the first or last character. 
-* As a shorthand, any resource typename that does not start with a subdomain and -a slash will automatically be prefixed with the built-in Kubernetes _namespace_, -`kubernetes.io/`, in order to fully-qualify it. This namespace is reserved for -code in the open source Kubernetes repository; as a result, all user typenames -MUST be fully qualified, and cannot be created in this namespace. - -Some example typenames include `memory` (which will be fully-qualified as -`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`. - -For future reference, note that some resources, such as CPU and network -bandwidth, are _compressible_, which means that their usage can potentially be -throttled in a relatively benign manner. All other resources are -_incompressible_, which means that any attempt to throttle them is likely to -cause grief. This distinction will be important if a Kubernetes implementation -supports over-committing of resources. - -### Resource quantities - -Initially, all Kubernetes resource types are _quantitative_, and have an -associated _unit_ for quantities of the associated resource (e.g., bytes for -memory, bytes per second for bandwidth, instances for software licences). The -units will always be a resource type's natural base units (e.g., bytes, not MB), -to avoid confusion between binary and decimal multipliers and the underlying -unit multiplier (e.g., is memory measured in MiB, MB, or GB?). - -Resource quantities can be added and subtracted: for example, a node has a fixed -quantity of each resource type that can be allocated to pods/containers; once -such an allocation has been made, the allocated resources cannot be made -available to other pods/containers without over-committing the resources. - -To make life easier for people, quantities can be represented externally as -unadorned integers, or as fixed-point integers with one of these SI suffixes -(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, - Ki).
For example, the following represent roughly the same value: 128974848, -"129e6", "129M", "123Mi". Small quantities can be represented directly as -decimals (e.g., 0.3), or using milli-units (e.g., "300m"). - * "Externally" means in user interfaces, reports, graphs, and in JSON or YAML -resource specifications that might be generated or read by people. - * Case is significant: "m" and "M" are not the same, so "k" is not a valid SI -suffix. There are no power-of-two equivalents for SI suffixes that represent -multipliers less than 1. - * These conventions only apply to resource quantities, not arbitrary values. - -Internally (i.e., everywhere else), Kubernetes will represent resource -quantities as integers so it can avoid problems with rounding errors, and will -not use strings to represent numeric values. To achieve this, quantities that -naturally have fractional parts (e.g., CPU seconds/second) will be scaled to -integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. -Internal APIs, data structures, and protobufs will use these scaled integer -units. Raw measurement data such as usage may still need to be tracked and -calculated using floating point values, but internally they should be rescaled -to avoid some values being in milli-units and some not. - * Note that reading in a resource quantity and writing it out again may change -the way its values are represented, and truncate precision (e.g., 1.0001 may -become 1.000), so comparison and difference operations (e.g., by an updater) -must be done on the internal representations. - * Avoiding milli-units in external representations has advantages for people -who will use Kubernetes, but runs the risk of developers forgetting to rescale -or accidentally using floating-point representations. That seems like the right -choice. We will try to reduce the risk by providing libraries that automatically -do the quantization for JSON/YAML inputs.
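As a concrete illustration of these conventions, the external representations above can be parsed with a short helper. This is a sketch only, not the real Kubernetes quantity parser (which lives in the API machinery and handles many more edge cases):

```python
import re

# Decimal (SI) and binary suffix multipliers, as listed above; note "K", not "k".
MULTIPLIERS = {
    "m": 1e-3, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15, "E": 1e18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

def parse_quantity(s):
    """Parse an external quantity such as '129M', '123Mi', or '300m' to a float."""
    m = re.fullmatch(r"([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)([KMGTPE]i?|m)?", s)
    if not m:
        raise ValueError("not a quantity: %r" % s)
    number, suffix = m.groups()
    return float(number) * (MULTIPLIERS[suffix] if suffix else 1.0)
```

For instance, "129M" and "129e6" both parse to 129000000, while "123Mi" parses to the binary value 128974848 given in the example above.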
- -### Resource specifications - -Both users and a number of system components, such as schedulers, (horizontal) -auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers -need to reason about resource requirements of workloads, resource capacities of -nodes, and resource usage. Kubernetes separates specifications of *desired state*, -aka the Spec, from representations of *current state*, aka the Status. Resource -requirements and total node capacity fall into the specification category, while -resource usage, characterizations derived from usage (e.g., maximum usage, -histograms), and other resource demand signals (e.g., CPU load) clearly fall -into the status category and are discussed in the Appendix for now. - -Resource requirements for a container or pod should have the following form: - -```yaml -resourceRequirementSpec: [ - request: [ cpu: 2.5, memory: "40Mi" ], - limit: [ cpu: 4.0, memory: "99Mi" ], -] -``` - -Where: -* _request_ [optional]: the amount of resources being requested, or that were -requested and have been allocated. Scheduler algorithms will use these -quantities to test feasibility (whether a pod will fit onto a node). -If a container (or pod) tries to use more resources than its _request_, any -associated SLOs are voided — e.g., the program it is running may be -throttled (compressible resource types), or the attempt may be denied. If -_request_ is omitted for a container, it defaults to _limit_ if that is -explicitly specified, otherwise to an implementation-defined value; this will -always be 0 for a user-defined resource type. If _request_ is omitted for a pod, -it defaults to the sum of the (explicit or implicit) _request_ values for the -containers it encloses. - -* _limit_ [optional]: an upper bound or cap on the maximum amount of resources -that will be made available to a container or pod; if a container or pod uses -more resources than its _limit_, it may be terminated.
The _limit_ defaults to -"unbounded"; in practice, this probably means the capacity of an enclosing -container, pod, or node, but may result in non-deterministic behavior, -especially for memory. - -Total capacity for a node should have a similar structure: - -```yaml -resourceCapacitySpec: [ - total: [ cpu: 12, memory: "128Gi" ] -] -``` - -Where: -* _total_: the total allocatable resources of a node. Initially, the resources -at a given scope will bound the sum of the resources of the inner scopes. - -#### Notes - - * It is an error to specify the same resource type more than once in each -list. - - * It is an error for the _request_ or _limit_ values for a pod to be less than -the sum of the (explicit or defaulted) values for the containers it encloses. -(We may relax this later.) - - * If multiple pods are running on the same node and attempting to use more -resources than they have requested, the result is implementation-defined. For -example: unallocated or unused resources might be spread equally across -claimants, or the assignment might be weighted by the size of the original -request, or as a function of limits, or priority, or the phase of the moon, -perhaps modulated by the direction of the tide. Thus, although it's not -mandatory to provide a _request_, it's probably a good idea. (Note that the -_request_ could be filled in by an automated system that is observing actual -usage and/or historical data.) - - * Internally, the Kubernetes master can decide the defaulting behavior and the -kubelet implementation may expect an absolute specification. For example, if -the master decided that "the default is unbounded" it would pass 2^64 to the -kubelet. - - -## Kubernetes-defined resource types - -The following resource types are predefined ("reserved") by Kubernetes in the -`kubernetes.io` namespace, and so cannot be used for user-defined resources.
-Note that the syntax of all resource types in the resource spec is deliberately -similar, but some resource types (e.g., CPU) may receive significantly more -support than simply tracking quantities in the schedulers and/or the Kubelet. - -### Processor cycles - - * Name: `cpu` (or `kubernetes.io/cpu`) - * Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to -a canonical "Kubernetes CPU") - * Internal representation: milli-KCUs - * Compressible? yes - * Qualities: this is a placeholder for the kind of thing that may be supported -in the future — see [#147](http://issue.k8s.io/147) - * [future] `schedulingLatency`: as per lmctfy - * [future] `cpuConversionFactor`: property of a node: the speed of a CPU -core on the node's processor divided by the speed of the canonical Kubernetes -CPU (a floating point value; default = 1.0). - -To reduce performance portability problems for pods, and to avoid worst-case -provisioning behavior, the units of CPU will be normalized to a canonical -"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be -equivalent to a single CPU hyperthreaded core for some recent x86 processor. The -normalization may be implementation-defined, although some reasonable defaults -will be provided in the open-source Kubernetes code. - -Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will -be allocated — control of aspects like this will be handled by resource -_qualities_ (a future feature). - - -### Memory - - * Name: `memory` (or `kubernetes.io/memory`) - * Units: bytes - * Compressible? no (at least initially) - -The precise meaning of "memory" is implementation-dependent, but the -basic idea is to rely on the underlying `memcg` mechanisms, support, and -definitions. - -Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory -quantities rather than decimal ones: "64MiB" rather than "64MB".
- - -## Resource metadata - -A resource type may have an associated read-only ResourceType structure that -contains metadata about the type. For example: - -```yaml -resourceTypes: [ - "kubernetes.io/memory": [ - isCompressible: false, ... - ] - "kubernetes.io/cpu": [ - isCompressible: true, - internalScaleExponent: 3, ... - ] - "kubernetes.io/disk-space": [ ... ] -] -``` - -Kubernetes will provide ResourceType metadata for its predefined types. If no -resource metadata can be found for a resource type, Kubernetes will assume that -it is a quantified, incompressible resource that is not specified in -milli-units, and has no default value. - -The defined properties are as follows: - -| field name | type | contents | -| ---------- | ---- | -------- | -| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) | -| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) | -| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". | -| isCompressible | bool, default=false | true if the resource type is compressible | -| defaultRequest | string, default=none | in the same format as a user-supplied value | -| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). | - - -# Appendix: future extensions - -The following are planned future extensions to the resource model, included here -to encourage comments.
- -## Usage data - -Because resource usage and related metrics change continuously, need to be -tracked over time (i.e., historically), can be characterized in a variety of -ways, and are fairly voluminous, we will not include usage in core API objects, -such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs -for accessing and managing that data. See the Appendix for possible -representations of usage data, but the representation we'll use is TBD. - -Singleton values for observed and predicted future usage will rapidly prove -inadequate, so we will support the following structure for extended usage -information: - -```yaml -resourceStatus: [ - usage: [ cpu: <CPU-info>, memory: <memory-info> ], - maxusage: [ cpu: <CPU-info>, memory: <memory-info> ], - predicted: [ cpu: <CPU-info>, memory: <memory-info> ], -] -``` - -where a `<CPU-info>` or `<memory-info>` structure looks like this: - -```yaml -{ - mean: <value> # arithmetic mean - max: <value> # maximum value - min: <value> # minimum value - count: <value> # number of data points - percentiles: [ # map from %iles to values - "10": <10th-percentile-value>, - "50": <50th-percentile-value>, - "99": <99th-percentile-value>, - "99.9": <99.9th-percentile-value>, - ... - ] -} -``` - -All parts of this structure are optional, although we strongly encourage -including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. -_[In practice, it will be important to include additional info such as the -length of the time window over which the averages are calculated, the -confidence level, and information-quality metrics such as the number of dropped -or discarded data points.]_ - -## Future resource types - -### _[future] Network bandwidth_ - - * Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`) - * Units: bytes per second - * Compressible? yes - -### _[future] Network operations_ - - * Name: "network-iops" (or `kubernetes.io/network-iops`) - * Units: operations (messages) per second - * Compressible?
yes - -### _[future] Storage space_ - - * Name: "storage-space" (or `kubernetes.io/storage-space`) - * Units: bytes - * Compressible? no - -The amount of secondary storage space available to a container. The main target -is local disk drives and SSDs, although this could also be used to qualify -remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a -disk array, or a file system fronting any of these, is left for future work. - -### _[future] Storage time_ - - * Name: storage-time (or `kubernetes.io/storage-time`) - * Units: seconds per second of disk time - * Internal representation: milli-units - * Compressible? yes - -This is the amount of time a container spends accessing disk, including actuator -and transfer time. A standard disk drive provides 1.0 diskTime seconds per -second. - -### _[future] Storage operations_ - - * Name: "storage-iops" (or `kubernetes.io/storage-iops`) - * Units: operations per second - * Compressible? yes - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resources.md?pixel)]() - diff --git a/scheduler_extender.md b/scheduler_extender.md deleted file mode 100644 index 1f362242..00000000 --- a/scheduler_extender.md +++ /dev/null @@ -1,105 +0,0 @@ -# Scheduler extender - -There are three ways to add new scheduling rules (predicates and priority -functions) to Kubernetes: (1) by adding these rules to the scheduler and -recompiling (described here: -https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md), -(2) implementing your own scheduler process that runs instead of, or alongside -of, the standard Kubernetes scheduler, (3) implementing a "scheduler extender" -process that the standard Kubernetes scheduler calls out to as a final pass when -making scheduling decisions. - -This document describes the third approach. 
This approach is needed for use -cases where scheduling decisions need to be made on resources not directly -managed by the standard Kubernetes scheduler. The extender helps make scheduling -decisions based on such resources. (Note that the three approaches are not -mutually exclusive.) - -When scheduling a pod, the extender allows an external process to filter and -prioritize nodes. Two separate http/https calls are issued to the extender, one -for "filter" and one for "prioritize" actions. To use the extender, you must -create a scheduler policy configuration file. The configuration specifies how to -reach the extender, whether to use http or https and the timeout. - -```go -// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty, -// it is assumed that the extender chose not to provide that extension. -type ExtenderConfig struct { - // URLPrefix at which the extender is available - URLPrefix string `json:"urlPrefix"` - // Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender. - FilterVerb string `json:"filterVerb,omitempty"` - // Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender. - PrioritizeVerb string `json:"prioritizeVerb,omitempty"` - // The numeric multiplier for the node scores that the prioritize call generates. - // The weight should be a positive integer - Weight int `json:"weight,omitempty"` - // EnableHttps specifies whether https should be used to communicate with the extender - EnableHttps bool `json:"enableHttps,omitempty"` - // TLSConfig specifies the transport layer security config - TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"` - // HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. 
Prioritize - // timeout is ignored, k8s/other extenders priorities are used to select the node. - HTTPTimeout time.Duration `json:"httpTimeout,omitempty"` -} -``` - -A sample scheduler policy file with extender configuration: - -```json -{ - "predicates": [ - { - "name": "HostName" - }, - { - "name": "MatchNodeSelector" - }, - { - "name": "PodFitsResources" - } - ], - "priorities": [ - { - "name": "LeastRequestedPriority", - "weight": 1 - } - ], - "extenders": [ - { - "urlPrefix": "http://127.0.0.1:12345/api/scheduler", - "filterVerb": "filter", - "enableHttps": false - } - ] -} -``` - -Arguments passed to the FilterVerb endpoint on the extender are the set of nodes -filtered through the k8s predicates and the pod. Arguments passed to the -PrioritizeVerb endpoint on the extender are the set of nodes filtered through -the k8s predicates and extender predicates and the pod. - -```go -// ExtenderArgs represents the arguments needed by the extender to filter/prioritize -// nodes for a pod. -type ExtenderArgs struct { - // Pod being scheduled - Pod api.Pod `json:"pod"` - // List of candidate nodes where the pod can be scheduled - Nodes api.NodeList `json:"nodes"` -} -``` - -The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call -returns priorities for each node (schedulerapi.HostPriorityList). - -The "filter" call may prune the set of nodes based on its predicates. Scores -returned by the "prioritize" call are added to the k8s scores (computed through -its priority functions) and used for final host selection. - -Multiple extenders can be configured in the scheduler policy. 
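To make the protocol concrete, here is a minimal sketch (in Python, for brevity) of what an extender's "filter" and "prioritize" handlers might compute from an ExtenderArgs payload. The toy predicate (keeping only nodes carrying a hypothetical `example.com/gpu` label) and the label-derived scores are assumptions for illustration only; a real extender would serve these handlers over HTTP at the configured urlPrefix plus verb, and the result shapes here are simplified versions of the scheduler API types:

```python
def filter_nodes(extender_args):
    """Toy 'filter' verb: keep only nodes carrying a hypothetical GPU label."""
    nodes = extender_args["nodes"]["items"]
    keep = [n for n in nodes
            if n["metadata"].get("labels", {}).get("example.com/gpu") == "true"]
    return {"nodes": {"items": keep}}

def prioritize_nodes(extender_args):
    """Toy 'prioritize' verb: score each node by a hypothetical free-GPU count."""
    result = []
    for n in extender_args["nodes"]["items"]:
        free = int(n["metadata"].get("labels", {}).get("example.com/free-gpus", 0))
        result.append({"host": n["metadata"]["name"], "score": free})
    return result
```

The scheduler would multiply each returned score by the extender's configured weight before adding it to the scores from its own priority functions.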
- - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/scheduler_extender.md?pixel)]() - diff --git a/seccomp.md b/seccomp.md deleted file mode 100644 index de00cbc0..00000000 --- a/seccomp.md +++ /dev/null @@ -1,266 +0,0 @@ -## Abstract - -A proposal for adding **alpha** support for -[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a -system call filtering facility in the Linux kernel which lets applications -define limits on system calls they may make, and what should happen when -system calls are made. Seccomp is used to reduce the attack surface available -to applications. - -## Motivation - -Applications use seccomp to restrict the set of system calls they can make. -Recently, container runtimes have begun adding features to allow the runtime -to interact with seccomp on behalf of the application, which eliminates the -need for applications to link against libseccomp directly. Adding support in -the Kubernetes API for describing seccomp profiles will allow administrators -greater control over the security of workloads running in Kubernetes. - -Goals of this design: - -1. Describe how to reference seccomp profiles in containers that use them - -## Constraints and Assumptions - -This design should: - -* build upon previous security context work -* be container-runtime agnostic -* allow use of custom profiles -* facilitate containerized applications that link directly to libseccomp - -## Use Cases - -1. As an administrator, I want to be able to grant access to a seccomp profile - to a class of users -2. As a user, I want to run an application with a seccomp profile similar to - the default one provided by my container runtime -3. As a user, I want to run an application which is already libseccomp-aware - in a container, and for my application to manage interacting with seccomp - unmediated by Kubernetes -4. 
As a user, I want to be able to use a custom seccomp profile and use - it with my containers - -### Use Case: Administrator access control - -Controlling access to seccomp profiles is a cluster administrator -concern. It should be possible for an administrator to control which users -have access to which profiles. - -The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893) -API extension governs the ability of users to make requests that affect pod -and container security contexts. The proposed design should deal with -required changes to control access to new functionality. - -### Use Case: Seccomp profiles similar to container runtime defaults - -Many users will want to use images that make assumptions about running in the -context of their chosen container runtime. Such images are likely to -frequently assume that they are running in the context of the container -runtime's default seccomp settings. Therefore, it should be possible to -express a seccomp profile similar to a container runtime's defaults. - -As an example, all dockerhub 'official' images are compatible with the Docker -default seccomp profile. So, any user who wanted to run one of these images -with seccomp would want the default profile to be accessible. - -### Use Case: Applications that link to libseccomp - -Some applications already link to libseccomp and control seccomp directly. It -should be possible to run these applications unmodified in Kubernetes; this -implies there should be a way to disable seccomp control in Kubernetes for -certain containers, or to run with a "no-op" or "unconfined" profile. - -Sometimes, applications that link to seccomp can use the default profile for a -container runtime, and restrict further on top of that. It is important to -note here that in this case, applications can only place _further_ -restrictions on themselves. It is not possible to re-grant the ability of a -process to make a system call once it has been removed with seccomp. 
- -As an example, elasticsearch manages its own seccomp filters in its code. -Currently, elasticsearch is capable of running in the context of the default -Docker profile, but if, in the future, elasticsearch needed to be able to call -`ioperm` or `iopl` (both of which are disallowed in the default profile), it -should be possible to run elasticsearch by delegating the seccomp controls to -the pod. - -### Use Case: Custom profiles - -Different applications have different requirements for seccomp profiles; it -should be possible to specify an arbitrary seccomp profile and use it in a -container. This is more of a concern for applications which need a higher -level of privilege than what is granted by the default profile for a cluster, -since applications that want to restrict privileges further can always make -additional calls in their own code. - -An example of an application that requires the use of a syscall disallowed in -the Docker default profile is Chrome, which needs `clone` to create a new user -namespace. Another example would be a program which uses `ptrace` to -implement a sandbox for user-provided code, such as -[eval.in](https://eval.in/). - -## Community Work - -### Container runtime support for seccomp - -#### Docker / opencontainers - -Docker supports the open container initiative's API for -seccomp, which is very close to the libseccomp API. It allows full -specification of seccomp filters, with arguments, operators, and actions. - -Docker allows the specification of a single seccomp filter. There are -community requests for more: - -Issues: - -* [docker/22109](https://github.com/docker/docker/issues/22109): composable - seccomp filters -* [docker/22105](https://github.com/docker/docker/issues/22105): custom - seccomp filters for builds - -#### rkt / appcontainers - -The `rkt` runtime delegates to systemd for seccomp support; there is an open -issue to add support once `appc` supports it.
The `appc` project has an open -issue to be able to describe seccomp as an isolator in an appc pod. - -The systemd seccomp facility is based on a whitelist of system calls that can -be made, rather than a full filter specification. - -Issues: - -* [appc/529](https://github.com/appc/spec/issues/529) -* [rkt/1614](https://github.com/coreos/rkt/issues/1614) - -#### HyperContainer - -[HyperContainer](https://hypercontainer.io) does not support seccomp. - -### Other platforms and seccomp-like capabilities - -FreeBSD has a seccomp/capability-like facility called -[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4). - -#### lxd - -[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile. - -Issues: - -* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp - -## Proposed Design - -### Seccomp API Resource? - -An earlier draft of this proposal described a new global API resource that -could be used to describe seccomp profiles. After some discussion, it was -determined that without a feedback signal from users indicating a need to -describe new profiles in the Kubernetes API, it is not possible to know -whether a new API resource is warranted. - -That being the case, we will not propose a new API resource at this time. If -there is strong community desire for such a resource, we may consider it in -the future. - -Instead of implementing a new API resource, we propose that pods be able to -reference seccomp profiles by name. Since this is an alpha feature, we will -use annotations instead of extending the API with new fields. - -### API changes? - -In the alpha version of this feature we will use annotations to store the -names of seccomp profiles. 
The keys will be: - -`container.seccomp.security.alpha.kubernetes.io/<container_name>` - -which will be used to set the seccomp profile of a container, and: - -`seccomp.security.alpha.kubernetes.io/pod` - -which will set the seccomp profile for the containers of an entire pod. If a -pod-level annotation is present, and a container-level annotation is present for -a container, then the container-level profile takes precedence. - -The value of these keys should be container-runtime agnostic. We will -establish a format that expresses the conventions for distinguishing between -an unconfined profile, the container runtime's default, or a custom profile. -Since the format of a profile is likely to be runtime-dependent, we will consider -profiles to be opaque to Kubernetes for now. - -The format of the value is scoped as follows: - -1. `runtime/default` - the default profile for the container runtime -2. `unconfined` - unconfined profile, i.e., no seccomp sandboxing -3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root - -Since seccomp profile schemes may vary between container runtimes, we will -treat the contents of profiles as opaque for now and avoid attempting to find -a common way to describe them. It is up to the container runtime to be -sensitive to the annotations proposed here and to interpret instructions about -local profiles. - -A new area on disk (which we will call the seccomp profile root) must be -established to hold seccomp profiles. A field will be added to the Kubelet -for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to -allow admins to set it. If unset, it should default to the `seccomp` -subdirectory of the kubelet root directory. - -### Pod Security Policy annotation - -The `PodSecurityPolicy` type should be annotated with the allowed seccomp -profiles using the key -`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this -key should be a comma-delimited list.
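The precedence rule between the pod-level and container-level annotations can be illustrated with a short sketch; the fallback to `unconfined` when no annotation is present is an assumption made here for illustration, not something the proposal above specifies:

```python
POD_KEY = "seccomp.security.alpha.kubernetes.io/pod"
CONTAINER_KEY_PREFIX = "container.seccomp.security.alpha.kubernetes.io/"

def effective_seccomp_profile(annotations, container_name):
    """A container-level annotation wins over the pod-level one."""
    container_key = CONTAINER_KEY_PREFIX + container_name
    if container_key in annotations:
        return annotations[container_key]
    # Assumed default when neither annotation is present.
    return annotations.get(POD_KEY, "unconfined")
```

So a pod annotated with both keys would run the named container under its container-level profile and all other containers under the pod-level profile.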
- -## Examples - -### Unconfined profile - -Here's an example of a pod that uses the unconfined profile: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: trustworthy-pod - annotations: - seccomp.security.alpha.kubernetes.io/pod: unconfined -spec: - containers: - - name: trustworthy-container - image: sotrustworthy:latest -``` - -### Custom profile - -Here's an example of a pod that uses a profile called `example-explorer-profile` using the container-level annotation: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: explorer - annotations: - container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile -spec: - containers: - - name: explorer - image: gcr.io/google_containers/explorer:1.0 - args: ["-port=8080"] - ports: - - containerPort: 8080 - protocol: TCP - volumeMounts: - - mountPath: "/mount/test-volume" - name: test-volume - volumes: - - name: test-volume - emptyDir: {} -``` - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]() - diff --git a/secrets.md b/secrets.md deleted file mode 100644 index 29d18411..00000000 --- a/secrets.md +++ /dev/null @@ -1,628 +0,0 @@ -## Abstract - -A proposal for the distribution of [secrets](../user-guide/secrets.md) -(passwords, keys, etc) to the Kubelet and to containers inside Kubernetes using -a custom [volume](../user-guide/volumes.md#secrets) type. See the -[secrets example](../user-guide/secrets/) for more information. - -## Motivation - -Secrets are needed in containers to access internal resources like the -Kubernetes master or external resources such as git repositories, databases, -etc. Users may also want behaviors in the kubelet that depend on secret data -(credentials for image pull from a docker registry) associated with pods. - -Goals of this design: - -1. Describe a secret resource -2. Define the various challenges attendant to managing secrets on the node -3.
Define a mechanism for consuming secrets in containers without modification - -## Constraints and Assumptions - -* This design does not prescribe a method for storing secrets; storage of -secrets should be pluggable to accommodate different use-cases -* Encryption of secret data and node security are orthogonal concerns -* It is assumed that node and master are secure and that compromising their -security could also compromise secrets: - * If a node is compromised, the only secrets that could potentially be -exposed should be the secrets belonging to containers scheduled onto it - * If the master is compromised, all secrets in the cluster may be exposed -* Secret rotation is an orthogonal concern, but it should be facilitated by -this proposal -* A user who can consume a secret in a container can know the value of the -secret; secrets must be provisioned judiciously - -## Use Cases - -1. As a user, I want to store secret artifacts for my applications and consume -them securely in containers, so that I can keep the configuration for my -applications separate from the images that use them: - 1. As a cluster operator, I want to allow a pod to access the Kubernetes -master using a custom `.kubeconfig` file, so that I can securely reach the -master - 2. As a cluster operator, I want to allow a pod to access a Docker registry -using credentials from a `.dockercfg` file, so that containers can push images - 3. As a cluster operator, I want to allow a pod to access a git repository -using SSH keys, so that I can push to and fetch from the repository -2. As a user, I want to allow containers to consume supplemental information -about services such as username and password which should be kept secret, so -that I can share secrets about a service amongst the containers in my -application securely -3. 
As a user, I want to associate a pod with a `ServiceAccount` that consumes a
-secret and have the kubelet implement some reserved behaviors based on the types
-of secrets the service account consumes:
-    1. Use credentials for a docker registry to pull the pod's docker image
-    2. Present Kubernetes auth token to the pod or transparently decorate
-traffic between the pod and master service
-4. As a user, I want to be able to indicate that a secret expires and for that
-secret's value to be rotated once it expires, so that the system can help me
-follow good practices
-
-### Use-Case: Configuration artifacts
-
-Many configuration files contain secrets intermixed with other configuration
-information. For example, a user's application may contain a properties file
-that contains database credentials, SaaS API tokens, etc. Users should be able
-to consume configuration artifacts in their containers and be able to control
-the path on the container's filesystem where the artifact will be presented.
-
-### Use-Case: Metadata about services
-
-Most pieces of information about how to use a service are secrets. For example,
-a service that provides a MySQL database needs to provide the username,
-password, and database name to consumers so that they can authenticate and use
-the correct database. Containers in pods consuming the MySQL service would also
-consume the secrets associated with the MySQL service.
-
-### Use-Case: Secrets associated with service accounts
-
-[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple
-capabilities and security contexts from individual human users. A
-`ServiceAccount` contains references to some number of secrets. A `Pod` can
-specify that it is associated with a `ServiceAccount`. Secrets should have a
-`Type` field to allow the Kubelet and other system components to take action
-based on the secret's type.
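A type-aware component could dispatch on this field. The sketch below is illustrative only: the constants mirror values suggested elsewhere in this proposal, and the described kubelet behaviors are hypothetical summaries, not real APIs.

```go
package main

import "fmt"

// SecretType mirrors the proposed Type field on Secret.
type SecretType string

const (
	SecretTypeOpaque              SecretType = "Opaque"
	SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token"
	SecretTypeDockercfg           SecretType = "kubernetes.io/dockercfg"
)

// describeKubeletAction shows how a kubelet-like component might choose a
// reserved behavior from a consumed secret's type (behaviors are illustrative).
func describeKubeletAction(t SecretType) string {
	switch t {
	case SecretTypeServiceAccountToken:
		return "expose auth token to the pod"
	case SecretTypeDockercfg:
		return "use registry credentials for image pull"
	default:
		return "mount as opaque data only"
	}
}

func main() {
	fmt.Println(describeKubeletAction(SecretTypeDockercfg)) // prints: use registry credentials for image pull
}
```

Unknown types fall through to the opaque case, so a secret with no declared type is simply mounted as data.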
- -#### Example: service account consumes auth token secret - -As an example, the service account proposal discusses service accounts consuming -secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod -associated with a service account which consumes this type of secret, the -Kubelet may take a number of actions: - -1. Expose the secret in a `.kubernetes_auth` file in a well-known location in -the container's file system -2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod -to the `kubernetes-master` service with the auth token, e. g. by adding a header -to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal) - -#### Example: service account consumes docker registry credentials - -Another example use case is where a pod is associated with a secret containing -docker registry credentials. The Kubelet could use these credentials for the -docker pull to retrieve the image. - -### Use-Case: Secret expiry and rotation - -Rotation is considered a good practice for many types of secret data. It should -be possible to express that a secret has an expiry date; this would make it -possible to implement a system component that could regenerate expired secrets. -As an example, consider a component that rotates expired secrets. The rotator -could periodically regenerate the values for expired secrets of common types and -update their expiry dates. - -## Deferral: Consuming secrets as environment variables - -Some images will expect to receive configuration items as environment variables -instead of files. We should consider what the best way to allow this is; there -are a few different options: - -1. Force the user to adapt files into environment variables. 
Users can store
-secrets that need to be presented as environment variables in a format that is
-easy to consume from a shell:
-
-    $ cat /etc/secrets/my-secret.txt
-    export MY_SECRET_ENV=MY_SECRET_VALUE
-
-    The user could `source` the file at `/etc/secrets/my-secret.txt` prior to
-executing the command for the image either inline in the command or in an init
-script.
-
-2. Give secrets an attribute that allows users to express the intent that the
-platform should generate the above syntax in the file used to present a secret.
-The user could consume these files in the same manner as the above option.
-
-3. Give secrets attributes that allow the user to express that the secret
-should be presented to the container as an environment variable. The container's
-environment would contain the desired values and the software in the container
-could use them without accommodation in the command or setup script.
-
-For our initial work, we will treat all secrets as files to narrow the problem
-space. There will be a future proposal that handles exposing secrets as
-environment variables.
-
-## Flow analysis of secret data with respect to the API server
-
-There are two fundamentally different use-cases for access to secrets:
-
-1. CRUD operations on secrets by their owners
-2. Read-only access to the secrets needed for a particular node by the kubelet
-
-### Use-Case: CRUD operations by owners
-
-In use cases for CRUD operations, the user experience for secrets should be no
-different than for other API resources.
-
-#### Data store backing the REST API
-
-The data store backing the REST API should be pluggable because different
-cluster operators will have different preferences for the central store of
-secret data. Some possibilities for storage:
-
-1. An etcd collection alongside the storage for other API resources
-2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
-3.
A secrets server like [Vault](https://www.vaultproject.io/) or -[Keywhiz](https://square.github.io/keywhiz/) -4. An external datastore such as an external etcd, RDBMS, etc. - -#### Size limit for secrets - -There should be a size limit for secrets in order to: - -1. Prevent DOS attacks against the API server -2. Allow kubelet implementations that prevent secret data from touching the -node's filesystem - -The size limit should satisfy the following conditions: - -1. Large enough to store common artifact types (encryption keypairs, -certificates, small configuration files) -2. Small enough to avoid large impact on node resource consumption (storage, -RAM for tmpfs, etc) - -To begin discussion, we propose an initial value for this size limit of **1MB**. - -#### Other limitations on secrets - -Defining a policy for limitations on how a secret may be referenced by another -API resource and how constraints should be applied throughout the cluster is -tricky due to the number of variables involved: - -1. Should there be a maximum number of secrets a pod can reference via a -volume? -2. Should there be a maximum number of secrets a service account can reference? -3. Should there be a total maximum number of secrets a pod can reference via -its own spec and its associated service account? -4. Should there be a total size limit on the amount of secret data consumed by -a pod? -5. How will cluster operators want to be able to configure these limits? -6. How will these limits impact API server validations? -7. How will these limits affect scheduling? - -For now, we will not implement validations around these limits. Cluster -operators will decide how much node storage is allocated to secrets. It will be -the operator's responsibility to ensure that the allocated storage is sufficient -for the workload scheduled onto a node. - -For now, kubelets will only attach secrets to api-sourced pods, and not file- -or http-sourced ones. 
Doing so would:
- - confuse the secrets admission controller in the case of mirror pods.
- - create an apiserver-liveness dependency -- avoiding this dependency is a
-main reason to use non-api-source pods.
-
-### Use-Case: Kubelet read of secrets for node
-
-The use-case where the kubelet reads secrets has several additional requirements:
-
-1. Kubelets should only be able to receive secret data which is required by
-pods scheduled onto the kubelet's node
-2. Kubelets should have read-only access to secret data
-3. Secret data should not be transmitted over the wire insecurely
-4. Kubelets must ensure pods do not have access to each other's secrets
-
-#### Read of secret data by the Kubelet
-
-The Kubelet should only be allowed to read secrets which are consumed by pods
-scheduled onto that Kubelet's node and their associated service accounts.
-Authorization of the Kubelet to read this data would be delegated to an
-authorization plugin and associated policy rule.
-
-#### Secret data on the node: data at rest
-
-Consideration must be given to whether secret data should be allowed to be at
-rest on the node:
-
-1. If secret data is not allowed to be at rest, the size of secret data becomes
-another draw on the node's RAM - should it affect scheduling?
-2. If secret data is allowed to be at rest, should it be encrypted?
-    1. If so, how should this be done?
-    2. If not, what threats exist? What types of secret are appropriate to
-store this way?
-
-For the sake of limiting complexity, we propose that initially secret data
-should not be allowed to be at rest on a node; secret data should be stored on a
-node-level tmpfs filesystem. This filesystem can be subdivided into directories
-for use by the kubelet and by the volume plugin.
-
-#### Secret data on the node: resource consumption
-
-The Kubelet will be responsible for creating the per-node tmpfs file system for
-secret storage.
It is hard to make a prescriptive declaration about how much -storage is appropriate to reserve for secrets because different installations -will vary widely in available resources, desired pod to node density, overcommit -policy, and other operation dimensions. That being the case, we propose for -simplicity that the amount of secret storage be controlled by a new parameter to -the kubelet with a default value of **64MB**. It is the cluster operator's -responsibility to handle choosing the right storage size for their installation -and configuring their Kubelets correctly. - -Configuring each Kubelet is not the ideal story for operator experience; it is -more intuitive that the cluster-wide storage size be readable from a central -configuration store like the one proposed in [#1553](http://issue.k8s.io/1553). -When such a store exists, the Kubelet could be modified to read this -configuration item from the store. - -When the Kubelet is modified to advertise node resources (as proposed in -[#4441](http://issue.k8s.io/4441)), the capacity calculation -for available memory should factor in the potential size of the node-level tmpfs -in order to avoid memory overcommit on the node. - -#### Secret data on the node: isolation - -Every pod will have a [security context](security_context.md). -Secret data on the node should be isolated according to the security context of -the container. The Kubelet volume plugin API will be changed so that a volume -plugin receives the security context of a volume along with the volume spec. -This will allow volume plugins to implement setting the security context of -volumes they manage. - -## Community work - -Several proposals / upstream patches are notable as background for this -proposal: - -1. [Docker vault proposal](https://github.com/docker/docker/issues/10310) -2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277) -3. 
[Kubernetes service account proposal](service_accounts.md) -4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075) -5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697) - -## Proposed Design - -We propose a new `Secret` resource which is mounted into containers with a new -volume type. Secret volumes will be handled by a volume plugin that does the -actual work of fetching the secret and storing it. Secrets contain multiple -pieces of data that are presented as different files within the secret volume -(example: SSH key pair). - -In order to remove the burden from the end user in specifying every file that a -secret consists of, it should be possible to mount all files provided by a -secret with a single `VolumeMount` entry in the container specification. - -### Secret API Resource - -A new resource for secrets will be added to the API: - -```go -type Secret struct { - TypeMeta - ObjectMeta - - // Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN. - // The serialized form of the secret data is a base64 encoded string, - // representing the arbitrary (possibly non-string) data value here. - Data map[string][]byte `json:"data,omitempty"` - - // Used to facilitate programmatic handling of secret data. - Type SecretType `json:"type,omitempty"` -} - -type SecretType string - -const ( - SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default) - SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token - SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth - SecretTypeDockerConfigJson SecretType = "kubernetes.io/dockerconfigjson" // Latest Docker registry auth - // FUTURE: other type values -) - -const MaxSecretSize = 1 * 1024 * 1024 -``` - -A Secret can declare a type in order to provide type information to system -components that work with secrets. 
The default type is `opaque`, which -represents arbitrary user-owned data. - -Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must -be valid DNS subdomains. - -A new REST API and registry interface will be added to accompany the `Secret` -resource. The default implementation of the registry will store `Secret` -information in etcd. Future registry implementations could store the `TypeMeta` -and `ObjectMeta` fields in etcd and store the secret data in another data store -entirely, or store the whole object in another data store. - -#### Other validations related to secrets - -Initially there will be no validations for the number of secrets a pod -references, or the number of secrets that can be associated with a service -account. These may be added in the future as the finer points of secrets and -resource allocation are fleshed out. - -### Secret Volume Source - -A new `SecretSource` type of volume source will be added to the `VolumeSource` -struct in the API: - -```go -type VolumeSource struct { - // Other fields omitted - - // SecretSource represents a secret that should be presented in a volume - SecretSource *SecretSource `json:"secret"` -} - -type SecretSource struct { - Target ObjectReference -} -``` - -Secret volume sources are validated to ensure that the specified object -reference actually points to an object of type `Secret`. - -In the future, the `SecretSource` will be extended to allow: - -1. Fine-grained control over which pieces of secret data are exposed in the -volume -2. The paths and filenames for how secret data are exposed - -### Secret Volume Plugin - -A new Kubelet volume plugin will be added to handle volumes with a secret -source. 
This plugin will require access to the API server to retrieve secret -data and therefore the volume `Host` interface will have to change to expose a -client interface: - -```go -type Host interface { - // Other methods omitted - - // GetKubeClient returns a client interface - GetKubeClient() client.Interface -} -``` - -The secret volume plugin will be responsible for: - -1. Returning a `volume.Mounter` implementation from `NewMounter` that: - 1. Retrieves the secret data for the volume from the API server - 2. Places the secret data onto the container's filesystem - 3. Sets the correct security attributes for the volume based on the pod's -`SecurityContext` -2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that -cleans the volume from the container's filesystem - -### Kubelet: Node-level secret storage - -The Kubelet must be modified to accept a new parameter for the secret storage -size and to create a tmpfs file system of that size to store secret data. Rough -accounting of specific changes: - -1. The Kubelet should have a new field added called `secretStorageSize`; units -are megabytes -2. `NewMainKubelet` should accept a value for secret storage size -3. The Kubelet server should have a new flag added for secret storage size -4. The Kubelet's `setupDataDirs` method should be changed to create the secret -storage - -### Kubelet: New behaviors for secrets associated with service accounts - -For use-cases where the Kubelet's behavior is affected by the secrets associated -with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example, -if secrets of type `docker-reg-auth` affect how the pod's images are pulled, the -Kubelet will need to be changed to accommodate this. Subsequent proposals can -address this on a type-by-type basis. - -## Examples - -For clarity, let's examine some detailed examples of some common use-cases in -terms of the suggested changes. 
All of these examples are assumed to be created -in a namespace called `example`. - -### Use-Case: Pod with ssh keys - -To create a pod that uses an ssh key stored as a secret, we first need to create -a secret: - -```json -{ - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "ssh-key-secret" - }, - "data": { - "id-rsa": "dmFsdWUtMg0KDQo=", - "id-rsa.pub": "dmFsdWUtMQ0K" - } -} -``` - -**Note:** The serialized JSON and YAML values of secret data are encoded as -base64 strings. Newlines are not valid within these strings and must be -omitted. - -Now we can create a pod which references the secret with the ssh key and -consumes it in a volume: - -```json -{ - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "secret-test-pod", - "labels": { - "name": "secret-test" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "ssh-key-secret" - } - } - ], - "containers": [ - { - "name": "ssh-test-container", - "image": "mySshImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } -} -``` - -When the container's command runs, the pieces of the key will be available in: - - /etc/secret-volume/id-rsa.pub - /etc/secret-volume/id-rsa - -The container is then free to use the secret data to establish an ssh -connection. - -### Use-Case: Pods with prod / test credentials - -This example illustrates a pod which consumes a secret containing prod -credentials and another pod which consumes a secret with test environment -credentials. 
- -The secrets: - -```json -{ - "apiVersion": "v1", - "kind": "List", - "items": - [{ - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "prod-db-secret" - }, - "data": { - "password": "dmFsdWUtMg0KDQo=", - "username": "dmFsdWUtMQ0K" - } - }, - { - "kind": "Secret", - "apiVersion": "v1", - "metadata": { - "name": "test-db-secret" - }, - "data": { - "password": "dmFsdWUtMg0KDQo=", - "username": "dmFsdWUtMQ0K" - } - }] -} -``` - -The pods: - -```json -{ - "apiVersion": "v1", - "kind": "List", - "items": - [{ - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "prod-db-client-pod", - "labels": { - "name": "prod-db-client" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "prod-db-secret" - } - } - ], - "containers": [ - { - "name": "db-client-container", - "image": "myClientImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } - }, - { - "kind": "Pod", - "apiVersion": "v1", - "metadata": { - "name": "test-db-client-pod", - "labels": { - "name": "test-db-client" - } - }, - "spec": { - "volumes": [ - { - "name": "secret-volume", - "secret": { - "secretName": "test-db-secret" - } - } - ], - "containers": [ - { - "name": "db-client-container", - "image": "myClientImage", - "volumeMounts": [ - { - "name": "secret-volume", - "readOnly": true, - "mountPath": "/etc/secret-volume" - } - ] - } - ] - } - }] -} -``` - -The specs for the two pods differ only in the value of the object referred to by -the secret volume source. 
Both containers will have the following files present -on their filesystems: - - /etc/secret-volume/username - /etc/secret-volume/password - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/secrets.md?pixel)]() - diff --git a/security.md b/security.md deleted file mode 100644 index b1aeacbd..00000000 --- a/security.md +++ /dev/null @@ -1,218 +0,0 @@ -# Security in Kubernetes - -Kubernetes should define a reasonable set of security best practices that allows -processes to be isolated from each other, from the cluster infrastructure, and -which preserves important boundaries between those who manage the cluster, and -those who use the cluster. - -While Kubernetes today is not primarily a multi-tenant system, the long term -evolution of Kubernetes will increasingly rely on proper boundaries between -users and administrators. The code running on the cluster must be appropriately -isolated and secured to prevent malicious parties from affecting the entire -cluster. - - -## High Level Goals - -1. Ensure a clear isolation between the container and the underlying host it -runs on -2. Limit the ability of the container to negatively impact the infrastructure -or other containers -3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - -ensure components are only authorized to perform the actions they need, and -limit the scope of a compromise by limiting the capabilities of individual -components -4. Reduce the number of systems that have to be hardened and secured by -defining clear boundaries between components -5. Allow users of the system to be cleanly separated from administrators -6. Allow administrative functions to be delegated to users where necessary -7. Allow applications to be run on the cluster that have "secret" data (keys, -certs, passwords) which is properly abstracted from "public" data. 
-
-## Use cases
-
-### Roles
-
-We define "user" as a unique identity accessing the Kubernetes API server, which
-may be a human or an automated process. Human users fall into the following
-categories:
-
-1. k8s admin - administers a Kubernetes cluster and has access to the underlying
-components of the system
-2. k8s project administrator - administers the security of a small subset of
-the cluster
-3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster
-resources
-
-Automated process users fall into the following categories:
-
-1. k8s container user - a user that processes running inside a container (on the
-cluster) can use to access other cluster resources independently of the human
-users attached to a project
-2. k8s infrastructure user - the user that Kubernetes infrastructure components
-use to perform cluster functions with clearly defined roles
-
-### Description of roles
-
-* Developers:
- * write pod specs.
- * make some of their own images, and use some "community" docker images
- * know which pods need to talk to which other pods
- * decide which pods should share files with other pods, and which should not.
- * reason about application level security, such as containing the effects of a
-local-file-read exploit in a webserver pod.
- * do not often reason about operating system or organizational security.
- * are not necessarily comfortable reasoning about the security properties of a
-system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
-
-* Project Admins:
- * allocate identity and roles within a namespace
- * reason about organizational security within a namespace
- * don't give a developer permissions that are not needed for their role.
- * protect files on shared storage from unnecessary cross-team access
- * are less focused on application security
-
-* Administrators:
- * are less focused on application security, and more focused on operating
-system security.
- * protect the node from bad actors in containers, and properly-configured
-innocent containers from bad actors in other containers.
- * comfortable reasoning about the security properties of a system at the level
-of detail of Linux Capabilities, SELinux, AppArmor, etc.
- * decides who can use which Linux Capabilities, run privileged containers, use
-hostPath, etc.
- * e.g. a team that manages Ceph or a MySQL server might be trusted to have
-raw access to storage devices in some organizations, but teams that develop the
-applications at higher layers would not.
-
-
-## Proposed Design
-
-A pod runs in a *security context* under a *service account* that is defined by
-an administrator or project administrator, and the *secrets* a pod has access to
-are limited by that *service account*.
-
-
-1. The API should authenticate and authorize user actions [authn and authz](access.md)
-2. All infrastructure components (kubelets, kube-proxies, controllers,
-scheduler) should have an infrastructure user that they can authenticate with
-and be authorized to perform only the functions they require against the API.
-3. Most infrastructure components should use the API as a way of exchanging data
-and changing the system, and only the API should have access to the underlying
-data store (etcd)
-4. When containers run on the cluster and need to talk to other containers or
-the API server, they should be identified and authorized clearly as an
-autonomous process via a [service account](service_accounts.md)
-    1. If the user who started a long-lived process is removed from access to
-the cluster, the process should be able to continue without interruption
-    2. If the users who started processes are removed from the cluster,
-administrators may wish to terminate their processes in bulk
-    3. When containers run with a service account, the user that created /
-triggered the service account behavior must be associated with the container's
-action
-5.
When container processes run on the cluster, they should run in a -[security context](security_context.md) that isolates those processes via Linux -user security, user namespaces, and permissions. - 1. Administrators should be able to configure the cluster to automatically -confine all container processes as a non-root, randomly assigned UID - 2. Administrators should be able to ensure that container processes within -the same namespace are all assigned the same unix user UID - 3. Administrators should be able to limit which developers and project -administrators have access to higher privilege actions - 4. Project administrators should be able to run pods within a namespace -under different security contexts, and developers must be able to specify which -of the available security contexts they may use - 5. Developers should be able to run their own images or images from the -community and expect those images to run correctly - 6. Developers may need to ensure their images work within higher security -requirements specified by administrators - 7. When available, Linux kernel user namespaces can be used to ensure 5.2 -and 5.4 are met. - 8. When application developers want to share filesystem data via distributed -filesystems, the Unix user ids on those filesystems must be consistent across -different container processes -6. Developers should be able to define [secrets](secrets.md) that are -automatically added to the containers when pods are run - 1. Secrets are files injected into the container whose values should not be -displayed within a pod. Examples: - 1. An SSH private key for git cloning remote data - 2. A client certificate for accessing a remote system - 3. A private key and certificate for a web server - 4. A .kubeconfig file with embedded cert / token data for accessing the -Kubernetes master - 5. A .dockercfg file for pulling images from a protected registry - 2. 
Developers should be able to define the pod spec so that a secret lands -in a specific location - 3. Project administrators should be able to limit developers within a -namespace from viewing or modifying secrets (anyone who can launch an arbitrary -pod can view secrets) - 4. Secrets are generally not copied from one namespace to another when a -developer's application definitions are copied - - -### Related design discussion - -* [Authorization and authentication](access.md) -* [Secret distribution via files](http://pr.k8s.io/2030) -* [Docker secrets](https://github.com/docker/docker/pull/6697) -* [Docker vault](https://github.com/docker/docker/issues/10310) -* [Service Accounts:](service_accounts.md) -* [Secret volumes](http://pr.k8s.io/4126) - -## Specific Design Points - -### TODO: authorization, authentication - -### Isolate the data store from the nodes and supporting infrastructure - -Access to the central data store (etcd) in Kubernetes allows an attacker to run -arbitrary containers on hosts, to gain access to any protected information -stored in either volumes or in pods (such as access tokens or shared secrets -provided as environment variables), to intercept and redirect traffic from -running services by inserting middlemen, or to simply delete the entire history -of the cluster. - -As a general principle, access to the central data store should be restricted to -the components that need full control over the system and which can apply -appropriate authorization and authentication of change requests. In the future, -etcd may offer granular access control, but that granularity will require an -administrator to understand the schema of the data to properly apply security. -An administrator must be able to properly secure Kubernetes at a policy level, -rather than at an implementation level, and schema changes over time should not -risk unintended security leaks. 
-
-Both the Kubelet and Kube Proxy need information related to their specific roles -
-for the Kubelet, the set of pods it should be running, and for the Proxy, the
-set of services and endpoints to load balance. The Kubelet also needs to provide
-information about running pods and historical termination data. The access
-pattern for both Kubelet and Proxy to load their configuration is an efficient
-"wait for changes" request over HTTP. It should be possible to limit the Kubelet
-and Proxy to only access the information they need to perform their roles and no
-more.
-
-The controller manager for Replication Controllers and other future controllers
-act on behalf of a user via delegation to perform automated maintenance on
-Kubernetes resources. Their ability to access or modify resource state should be
-strictly limited to their intended duties and they should be prevented from
-accessing information not pertinent to their role. For example, a replication
-controller needs only to create a copy of a known pod configuration, to
-determine the running state of an existing pod, or to delete an existing pod
-that it created - it does not need to know the contents or current state of a
-pod, nor have access to any data in the pod's attached volumes.
-
-The Kubernetes pod scheduler is responsible for reading data from the pod to fit
-it onto a node in the cluster. At a minimum, it needs access to view the ID of a
-pod (to craft the binding), its current state, any resource information
-necessary to identify placement, and other data relevant to concerns like
-anti-affinity, zone or region preference, or custom logic. It does not need the
-ability to modify pods or see other resources, only to create bindings. It
-should not need the ability to delete bindings unless the scheduler takes
-control of relocating components on failed hosts (which could be implemented by
-a separate component that can delete bindings but not create them).
The -scheduler may need read access to user or project-container information to -determine preferential location (underspecified at this time). - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]() - diff --git a/security_context.md b/security_context.md deleted file mode 100644 index 76bc8ee8..00000000 --- a/security_context.md +++ /dev/null @@ -1,192 +0,0 @@ -# Security Contexts - -## Abstract - -A security context is a set of constraints that are applied to a container in -order to achieve the following goals (from [security design](security.md)): - -1. Ensure a clear isolation between container and the underlying host it runs -on -2. Limit the ability of the container to negatively impact the infrastructure -or other containers - -## Background - -The problem of securing containers in Kubernetes has come up -[before](http://issue.k8s.io/398) and the potential problems with container -security are [well known](http://opensource.com/business/14/7/docker-security-selinux). -Although it is not possible to completely isolate Docker containers from their -hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) -make it possible to greatly reduce the attack surface. - -## Motivation - -### Container isolation - -In order to improve container isolation from host and other containers running -on the host, containers should only be granted the access they need to perform -their work. To this end it should be possible to take advantage of Docker -features such as the ability to -[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) -and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) -to the container process. 
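The two Docker features named above - adding/removing capabilities and assigning MCS labels - map onto fields of Docker's container `HostConfig`. The struct below is a local illustration of that subset (the field names follow Docker's API, but this is not an import of the client library), and the `label:level:` security-opt syntax shown is the Docker-era form, stated here as an assumption.

```go
package main

import "fmt"

// hostConfig mirrors the subset of Docker's HostConfig that the isolation
// features above map onto. Illustrative only.
type hostConfig struct {
	CapAdd      []string
	CapDrop     []string
	SecurityOpt []string
}

// restrict drops the given kernel capabilities and pins the container to a
// specific MCS level (e.g. "s0:c123,c456"), so that only processes with the
// same label can access its resources.
func restrict(drop []string, mcsLevel string) hostConfig {
	return hostConfig{
		CapDrop:     drop,
		SecurityOpt: []string{"label:level:" + mcsLevel},
	}
}

func main() {
	hc := restrict([]string{"CHOWN", "MKNOD"}, "s0:c123,c456")
	fmt.Println(hc.CapDrop, hc.SecurityOpt)
}
```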
-
-Support for user namespaces has recently been
-[merged](https://github.com/docker/libcontainer/pull/304) into Docker's
-libcontainer project and should soon surface in Docker itself. It will make it
-possible to assign a range of unprivileged uids and gids from the host to each
-container, improving the isolation between host and container and between
-containers.
-
-### External integration with shared storage
-
-In order to support external integration with shared storage, processes running
-in a Kubernetes cluster should be able to be uniquely identified by their Unix
-UID, such that a chain of ownership can be established. Processes in pods will
-need to have consistent UID/GID/SELinux category labels in order to access
-shared disks.
-
-## Constraints and Assumptions
-
-* It is out of the scope of this document to prescribe a specific set of
-constraints to isolate containers from their host. Different use cases need
-different settings.
-* The concept of a security context should not be tied to a particular security
-mechanism or platform (e.g. SELinux, AppArmor)
-* Applying a different security context to a scope (namespace or pod) requires
-a solution such as the one proposed for [service accounts](service_accounts.md).
-
-## Use Cases
-
-In order of increasing complexity, the following are example use cases that
-would be addressed with security contexts:
-
-1. Kubernetes is used to run a single cloud application. In order to protect
-nodes from containers:
-   * All containers run as a single non-root user
-   * Privileged containers are disabled
-   * All containers run with a particular MCS label
-   * Kernel capabilities like CHOWN and MKNOD are removed from containers
-
-2. Just like case #1, except that I have more than one application running on
-the Kubernetes cluster.
-   * Each application is run in its own namespace to avoid name collisions
-   * For each application a different uid and MCS label is used
-
-3.
-Kubernetes is used as the base for a PaaS with multiple projects, each
-project represented by a namespace.
-   * Each namespace is associated with a range of uids/gids on the node that
-are mapped to uids/gids on containers using linux user namespaces.
-   * Certain pods in each namespace have special privileges to perform system
-actions such as talking back to the server for deployment, running docker
-builds, etc.
-   * External NFS storage is assigned to each namespace and permissions set
-using the range of uids/gids assigned to that namespace.
-
-## Proposed Design
-
-### Overview
-
-A *security context* consists of a set of constraints that determine how a
-container is secured before getting created and run. A security context resides
-on the container and represents the runtime parameters that will be used to
-create and run the container via container APIs. A *security context provider*
-is passed to the Kubelet so it can have a chance to mutate Docker API calls in
-order to apply the security context.
-
-It is recommended that this design be implemented in two phases:
-
-1. Implement the security context provider extension point in the Kubelet
-so that a default security context can be applied on container run and creation.
-2. Implement a security context structure that is part of a service account. The
-default context provider can then be used to apply a security context based on
-the service account associated with the pod.
-
-### Security Context Provider
-
-The Kubelet will have an interface that points to a `SecurityContextProvider`.
-The `SecurityContextProvider` is invoked before creating and running a given
-container:
-
-```go
-type SecurityContextProvider interface {
-  // ModifyContainerConfig is called before the Docker createContainer call.
-  // The security context provider can make changes to the Config with which
-  // the container is created.
-  // An error is returned if it's not possible to secure the container as
-  // requested with a security context.
-  ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) error
-
-  // ModifyHostConfig is called before the Docker runContainer call.
-  // The security context provider can make changes to the HostConfig, affecting
-  // security options, whether the container is privileged, volume binds, etc.
-  // An error is returned if it's not possible to secure the container as requested
-  // with a security context.
-  ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) error
-}
-```
-
-If the value of the SecurityContextProvider field on the Kubelet is nil, the
-kubelet will create and run the container as it does today.
-
-### Security Context
-
-A security context resides on the container and represents the runtime
-parameters that will be used to create and run the container via container APIs.
-The following is an example of an initial implementation:
-
-```go
-type Container struct {
-  ... other fields omitted ...
-  // Optional: SecurityContext defines the security options the container should be run with
-  SecurityContext *SecurityContext
-}
-
-// SecurityContext holds security configuration that will be applied to a container. SecurityContext
-// contains duplication of some existing fields from the Container resource. These duplicate fields
-// will be populated based on the Container configuration if they are not set. Defining them on
-// both the Container AND the SecurityContext will result in an error.
-type SecurityContext struct {
-  // Capabilities are the capabilities to add/drop when running the container
-  Capabilities *Capabilities
-
-  // Run the container in privileged mode
-  Privileged *bool
-
-  // SELinuxOptions are the labels to be applied to the container
-  // and volumes
-  SELinuxOptions *SELinuxOptions
-
-  // RunAsUser is the UID to run the entrypoint of the container process.
-  RunAsUser *int64
-}
-
-// SELinuxOptions are the labels to be applied to the container.
-type SELinuxOptions struct {
-  // SELinux user label
-  User string
-
-  // SELinux role label
-  Role string
-
-  // SELinux type label
-  Type string
-
-  // SELinux level label.
-  Level string
-}
-```
-
-### Admission
-
-It is up to an admission plugin to determine if the security context is
-acceptable or not. At the time of writing, the admission control plugin for
-security contexts will only allow a context that defines capabilities or
-privileged mode. Contexts that attempt to define a UID or SELinux options will be
-denied by default. In the future the admission plugin will base this decision
-upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
-
diff --git a/selector-generation.md b/selector-generation.md
deleted file mode 100644
index efb32cf2..00000000
--- a/selector-generation.md
+++ /dev/null
@@ -1,180 +0,0 @@
-Design
-=============
-
-# Goals
-
-Make it really hard to accidentally create a job which has an overlapping
-selector, while still making it possible to choose an arbitrary selector, and
-without adding complex constraint solving to the API server.
-
-# Use Cases
-
-1. user can leave all label and selector fields blank and system will fill in
-reasonable ones: non-overlappingness guaranteed.
-2. user can put on the pod template some labels that are useful to the user,
-without reasoning about non-overlappingness. System adds an additional label to
-ensure the selector does not overlap.
-3. If user wants to reparent pods to new job (very rare case) and knows what
-they are doing, they can completely disable this behavior and specify explicit
-selector.
-4. If a controller that makes jobs, like scheduled job, wants to use different
-labels, such as the time and date of the run, it can do that.
-5.
-If User reads v1beta1 documentation or reuses v1beta1 Job definitions and
-just changes the API group, the user should not automatically be allowed to
-specify a selector, since this is very rarely what people want to do and is
-error prone.
-6. If User downloads an existing job definition, e.g. with
-`kubectl get jobs/old -o yaml` and tries to modify and post it, he should not
-create an overlapping job.
-7. If User downloads an existing job definition, e.g. with
-`kubectl get jobs/old -o yaml` and tries to modify and post it, and he
-accidentally copies the uniquifying label from the old one, then he should not
-get an error from a label-key conflict, nor get erratic behavior.
-8. If user reads swagger docs and sees the selector field, he should not be able
-to set it without realizing the risks.
-9. (Deferred requirement:) If user wants to specify a preferred name for the
-non-overlappingness key, they can pick a name.
-
-# Proposed changes
-
-## API
-
-`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as
-follows.
-
-Field `job.spec.manualSelector` is added. It controls whether selectors are
-automatically generated. In automatic mode, user cannot make the mistake of
-creating non-unique selectors. In manual mode, certain rare use cases are
-supported.
-
-Validation is not changed. A selector must be provided, and it must select the
-pod template.
-
-Defaulting changes. Defaulting happens in one of two modes:
-
-### Automatic Mode
-
-- User does not specify `job.spec.selector`.
-- User is probably unaware of the `job.spec.manualSelector` field and does not
-think about it.
-- User optionally puts labels on pod template. User does not think
-about uniqueness, just labeling for user's own reasons.
-- Defaulting logic sets `job.spec.selector` to
-`matchLabels["controller-uid"]="$UIDOFJOB"`
-- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
-  - The first label is controller-uid=$UIDOFJOB.
-  - The second label is "job-name=$NAMEOFJOB".
-
-### Manual Mode
-
-- User means User or Controller for the rest of this list.
-- User does specify `job.spec.selector`.
-- User does specify `job.spec.manualSelector=true`.
-- User puts a unique label or labels on pod template (required). User does
-think carefully about uniqueness.
-- No defaulting of pod labels or the selector happens.
-
-### Rationale
-
-UID is better than Name in that:
-- it allows cross-namespace control someday if we need it.
-- it is unique across all kinds. `controller-name=foo` does not ensure
-uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a
-problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the
-latter cannot use label `job-name=foo`, though there is a temptation to do so.
-- it uniquely identifies the controller across time. This prevents the case
-where, for example, someone deletes a job via the REST api or client
-(where cascade=false), leaving pods around. We don't want those to be picked up
-unintentionally. It also prevents the case where a user looks at an old job that
-finished but is not deleted, and tries to select its pods, and gets the wrong
-impression that it is still running.
-
-Job name is more user friendly. It is self-documenting.
-
-Commands like `kubectl get pods -l job-name=myjob` should do exactly what is
-wanted 99.9% of the time. Automated control loops should still use the
-`controller-uid` label.
-
-Using both gets the benefits of both, at the cost of some label verbosity.
-
-The field is a `*bool`. Since false is expected to be much more common,
-and since the feature is complex, it is better to leave it unspecified so that
-users looking at a stored pod spec do not need to be aware of this field.
-
-### Overriding Unique Labels
-
-If user does specify `job.spec.selector` then the user must also specify
-`job.spec.manualSelector`. This ensures the user knows that what he is doing is
-not the normal thing to do.
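The automatic- and manual-mode defaulting described above can be sketched in Go. The `jobSpec` type below is a pared-down, hypothetical stand-in for the real Job fields; this is an illustration of the proposal's logic, not the apiserver's defaulting code.

```go
package main

import "fmt"

// jobSpec is a pared-down stand-in for the Job fields discussed above.
type jobSpec struct {
	name           string
	uid            string
	manualSelector bool
	selector       map[string]string
	templateLabels map[string]string
}

// defaultSelector applies automatic-mode defaulting: the selector becomes
// controller-uid=$UIDOFJOB and the pod template gains controller-uid and
// job-name labels. A user-supplied selector without manualSelector=true is
// rejected; with it, no defaulting happens (manual mode).
func defaultSelector(j *jobSpec) error {
	if j.selector != nil {
		if !j.manualSelector {
			return fmt.Errorf("spec.selector set without spec.manualSelector=true")
		}
		return nil // manual mode: no defaulting
	}
	j.selector = map[string]string{"controller-uid": j.uid}
	if j.templateLabels == nil {
		j.templateLabels = map[string]string{}
	}
	j.templateLabels["controller-uid"] = j.uid
	j.templateLabels["job-name"] = j.name
	return nil
}

func main() {
	j := &jobSpec{name: "myjob", uid: "1234"}
	if err := defaultSelector(j); err != nil {
		panic(err)
	}
	fmt.Println(j.selector["controller-uid"], j.templateLabels["job-name"])
}
```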
-
-To prevent users from copying the `job.spec.manualSelector` flag from existing
-jobs, it will be optional and default to false. This means that when you GET an
-existing job that didn't use this feature, you don't even see the
-`job.spec.manualSelector` flag, so you are not tempted to wonder if you should
-fiddle with it.
-
-## Job Controller
-
-No changes.
-
-## Kubectl
-
-No required changes. Suggest moving SELECTOR to wide output of `kubectl get
-jobs` since users do not write the selector.
-
-## Docs
-
-Remove examples that use selector and remove labels from pod templates.
-Recommend `kubectl get jobs -l job-name=name` as the way to find pods of a job.
-
-# Conversion
-
-The following applies to Job, as well as to other types that adopt this pattern:
-
-- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`.
-- Both the internal type and the `batch/v1` type will get
-`job.spec.manualSelector`.
-- The fields `manualSelector` and `autoSelector` have opposite meanings.
-- Each field defaults to false when unset, and so v1beta1 has a different
-default than v1 and internal. This is intentional: we want new uses to default
-to the less error-prone behavior, and we do not want to change the behavior of
-v1beta1.
-
-*Note*: since the internal default is changing, client library consumers that
-create Jobs may need to add `job.spec.manualSelector=true` to keep working, or
-switch to auto selectors.
-
-Conversion is as follows:
-- `extensions/__internal` to `extensions/v1beta1`: the value of
-`__internal.Spec.ManualSelector` is defaulted to false if nil, negated,
-defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
-- `extensions/v1beta1` to `extensions/__internal`: the value of
-`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to
-nil if false, and written to `__internal.Spec.ManualSelector`.
-
-This conversion gives the following properties.
-
-1.
-Users that previously used v1beta1 do not start seeing a new field when they
-get back objects.
-2. Distinction between originally unset versus explicitly set to false is not
-preserved (would have been nice to do so, but requires a more complicated
-solution).
-3. Users who only created v1beta1 examples or v1 examples will not ever see the
-existence of either field.
-4. Since v1beta1 is convertible to/from v1, the storage location (path in etcd)
-does not need to change, allowing scriptable rollforward/rollback.
-
-# Future Work
-
-Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if
-it works well for Job.
-
-Docs will be edited to show examples without a `job.spec.selector`.
-
-We probably want as much as possible the same behavior for Job and
-ReplicationController.
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
-
diff --git a/selinux.md b/selinux.md
deleted file mode 100644
index ece83d44..00000000
--- a/selinux.md
+++ /dev/null
@@ -1,317 +0,0 @@
-## Abstract
-
-A proposal for enabling containers in a pod to share volumes using a pod level SELinux context.
-
-## Motivation
-
-Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin
-authors should not have to explicitly account for SELinux except for volume types that require
-special handling of the SELinux context during setup.
-
-Currently, each container in a pod has an SELinux context. This is not an ideal factoring for
-sharing resources using SELinux.
-
-We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a
-generic way.
-
-Goals of this design:
-
-1. Describe the problems with a container SELinux context
-2. Articulate a design for generic SELinux support for volumes using a pod level SELinux context
-   which is backward compatible with the v1.0.0 API
-
-## Constraints and Assumptions
-
-1.
-We will not support securing containers within a pod from one another
-2. Volume plugins should not have to handle setting SELinux context on volumes
-3. We will not deal with shared storage
-
-## Current State Overview
-
-### Docker
-
-Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux
-context of a container can be overridden with the `SecurityOpt` api that allows setting the different
-parts of the SELinux context individually.
-
-Docker has functionality to relabel bind-mounts with a usable SELinux context and supports two
-different use-cases:
-
-1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
-   SELinux context
-2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
-   SELinux context, but remove the MCS labels, making the volume shareable between containers
-
-We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container
-(from an SELinux standpoint) can use the volume.
-
-### rkt
-
-rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
-and allocates a unique MCS label per pod.
-
-### Kubernetes
-
-
-There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the
-EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
-patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the general
-problem of handling SELinux in Kubernetes rather than to merge this PR.
-
-A new `PodSecurityContext` type has been added that carries information about security attributes
-that apply to the entire pod and that apply to all containers in a pod. See:
-
-1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
-1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
-
-## Use Cases
-
-1.
-As a cluster operator, I want to support securing pods from one another using SELinux when
-   SELinux integration is enabled in the cluster
-2. As a user, I want volume sharing to work correctly amongst containers in pods
-
-#### SELinux context: pod- or container-level?
-
-Currently, SELinux context is specifiable only at the container level. This is an inconvenient
-factoring for sharing volumes and other SELinux-secured resources between containers because there
-is no way in SELinux to share resources between processes with different MCS labels except to
-remove MCS labels from the shared resource. This is a big security risk: _any container_ in the
-system can work with a resource that has the same SELinux context as it and no MCS labels. Since
-we are also not interested in isolating containers in a pod from one another, the SELinux context
-should be shared by all containers in a pod to facilitate isolation from the containers in other
-pods and sharing resources amongst all the containers of a pod.
-
-#### Volumes
-
-Kubernetes volumes can be divided into two broad categories:
-
-1. Unshared storage:
-   1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
-      downward API. All volumes in this category delegate to `EmptyDir` for their underlying
-      storage.
-   2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc., *when used exclusively
-      by a single pod*.
-2. Shared storage:
-   1. `hostPath` is shared storage because it is necessarily used by a container and the host
-   2. Network file systems such as NFS, Glusterfs, Cephfs, etc.
-   3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
-      they may be used simultaneously by multiple pods.
-
-For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon`
-operation on the volume directory after running the volume plugin's `Setup` function.
-For these
-volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume
-plugin code. Some volume plugins may need to use the SELinux context during a mount operation in
-certain cases. To account for this, our design must have a way for volume plugins to state that
-a particular volume should or should not receive generic label management.
-
-For shared storage, the picture is murkier. Labels for existing shared storage will be managed
-outside Kubernetes and administrators will have to set the SELinux context of pods correctly.
-The problem of solving SELinux label management for new shared storage is outside the scope of
-this proposal.
-
-## Analysis
-
-The system needs to be able to:
-
-1. Model correctly which volumes require SELinux label management
-1. Relabel volumes with the correct SELinux context when required
-
-### Modeling whether a volume requires label management
-
-#### Unshared storage: volumes derived from `EmptyDir`
-
-Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
-that the ownership and SELinux context (when relevant) are set correctly for the volume to be
-usable.
-
-#### Unshared storage: network block devices
-
-Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
-as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
-volumes, permissions and ownership can be managed on the client side by the Kubelet when used
-exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
-`ReadWriteOnce` mode, they are effectively unshared storage.
-
-When used by multiple pods, there are many additional use-cases to analyze before we can be
-confident that we can support SELinux label management robustly with these file systems.
-The right
-design is one that makes it easy to experiment and develop support for ownership management with
-volume plugins to enable developers and cluster operators to continue exploring these issues.
-
-#### Shared storage: hostPath
-
-The `hostPath` volume should only be used by effective-root users, and the permissions of paths
-exposed into containers via hostPath volumes should always be managed by the cluster operator. If
-the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
-volume could effect changes in the state of arbitrary paths within the host's filesystem. This
-would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
-never perform ownership management for.
-
-#### Shared storage: network
-
-Ownership management of shared storage is a complex topic. SELinux labels for existing shared
-storage will be managed externally from Kubernetes. For this case, our API should make it simple to
-express whether a particular volume should have these concerns managed by Kubernetes.
-
-We will not attempt to address the concerns of new shared storage in this proposal.
-
-When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
-modes, it is shared storage, and thus outside the scope of this proposal.
-
-#### API requirements
-
-From the above, we know that label management must be applied:
-
-1. To some volume types always
-2. To some volume types never
-3. To some volume types *sometimes*
-
-Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it
-is desirable for other container runtime implementations to provide similar functionality.
-
-Relabeling should be an optional aspect of a volume plugin to accommodate:
-
-1. volume types for which generalized relabeling support is not sufficient
-2.
-testing for each volume plugin individually
-
-## Proposed Design
-
-Our design should minimize the code for handling SELinux labeling required in the Kubelet and volume
-plugins.
-
-### Deferral: MCS label allocation
-
-Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the
-primitives for higher level composition; making these automatic is a longer-term goal. Allocating
-groups and MCS labels are fairly complex problems in their own right, and so our proposal will not
-encompass either of these topics. There are several problems that the solution for allocation
-depends on:
-
-1. Users and groups in Kubernetes
-2. General auth policy in Kubernetes
-3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893)
-
-### API changes
-
-The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823)
-adds a `pod.Spec.SecurityContext.SELinuxOptions` field. The change to the API in this proposal is
-the addition of the semantics to this field:
-
-* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership
-management in the Kubelet have their SELinux context set from this field.
-
-```go
-package api
-
-type PodSecurityContext struct {
-  // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
-  // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
-  //
-  // This field will be used to set the SELinux context of volumes that support SELinux label
-  // management by the kubelet.
-  SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
-}
-```
-
-The V1 API is extended with the same semantics:
-
-```go
-package v1
-
-type PodSecurityContext struct {
-  // SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
-  // SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
-  //
-  // This field will be used to set the SELinux context of volumes that support SELinux label
-  // management by the kubelet.
-  SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
-}
-```
-
-#### API backward compatibility
-
-Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
-SELinux label management for their volumes. This is acceptable since old clients won't know about
-this field and won't have any expectation of their volumes being managed this way.
-
-The existing backward compatibility semantics for SELinux do not change at all with this proposal.
-
-### Kubelet changes
-
-The Kubelet should be modified to perform SELinux label management when required for a volume. The
-criteria to activate the kubelet SELinux label management for volumes are:
-
-1. SELinux integration is enabled in the cluster
-2. SELinux is enabled on the node
-3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set
-4. The volume plugin supports SELinux label management
-
-The `volume.Mounter` interface should have a new method added that indicates whether the plugin
-supports SELinux label management:
-
-```go
-package volume
-
-type Mounter interface {
-  // other methods omitted
-  SupportsSELinux() bool
-}
-```
-
-Individual volume plugins are responsible for correctly reporting whether they support label
-management in the kubelet.
In the first round of work, only `hostPath` and `emptyDir` and its -derivations will be tested with ownership management support: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | false | -| `awsElasticBlockStore` | false | -| `nfs` | false | -| `iscsi` | false | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | false | -| `cinder` | false | -| `cephfs` | false | - -Ultimately, the matrix will theoretically look like: - -| Plugin Name | SupportsOwnershipManagement | -|-------------------------|-------------------------------| -| `hostPath` | false | -| `emptyDir` | true | -| `gitRepo` | true | -| `secret` | true | -| `downwardAPI` | true | -| `gcePersistentDisk` | true | -| `awsElasticBlockStore` | true | -| `nfs` | false | -| `iscsi` | true | -| `glusterfs` | false | -| `persistentVolumeClaim` | depends on underlying volume and PV mode | -| `rbd` | true | -| `cinder` | false | -| `cephfs` | false | - -In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a -function of the container runtime implementations. Initially, we will modify the docker runtime -implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish -generic label management for docker containers. - -Volume types that require SELinux context information at mount must be injected with and respect the -enablement setting for the labeling for the volume type. The proposed `VolumeConfig` mechanism -will be used to carry information about label management enablement to the volume plugins that have -to manage labels individually. 
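The relabeling decision and the `:Z` bind-mount handling described above can be sketched as follows. The struct and function names are illustrative assumptions; only the four criteria and the `:Z` suffix come from the proposal itself.

```go
package main

import "fmt"

// relabelDecision captures the four criteria listed above for activating
// SELinux label management for a volume. Field names are illustrative.
type relabelDecision struct {
	clusterSELinux bool // SELinux integration enabled in the cluster
	nodeSELinux    bool // SELinux enabled on the node
	podHasOptions  bool // pod.Spec.SecurityContext.SELinuxOptions is set
	pluginSupports bool // volume plugin's SupportsSELinux() result
}

// relabel is true only when every criterion holds.
func (d relabelDecision) relabel() bool {
	return d.clusterSELinux && d.nodeSELinux && d.podHasOptions && d.pluginSupports
}

// bindSpec renders a docker-style bind mount, appending the :Z suffix that
// asks Docker to relabel the mount with the container's SELinux context
// when relabeling is required.
func bindSpec(hostPath, containerPath string, d relabelDecision) string {
	b := hostPath + ":" + containerPath
	if d.relabel() {
		b += ":Z"
	}
	return b
}

func main() {
	d := relabelDecision{true, true, true, true}
	fmt.Println(bindSpec("/var/lib/kubelet/pods/x/volumes/emptydir", "/data", d))
}
```

Keeping the decision in the container runtime layer, as the proposal suggests, means volume plugins only have to answer the yes/no `SupportsSELinux()` question.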
-
-This allows the volume plugins to determine when they do and don't want this type of support from
-the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
-
-
-
-
-[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selinux.md?pixel)]()
-
diff --git a/service_accounts.md b/service_accounts.md
deleted file mode 100644
index 89a3771b..00000000
--- a/service_accounts.md
+++ /dev/null
@@ -1,210 +0,0 @@
-# Service Accounts
-
-## Motivation
-
-Processes in Pods may need to call the Kubernetes API. For example:
-  - scheduler
-  - replication controller
-  - node controller
-  - a map-reduce type framework which has a controller that then tries to make a
-dynamically determined number of workers and watch them
-  - continuous build and push system
-  - monitoring system
-
-They also may interact with services other than the Kubernetes API, such as:
-  - an image repository, such as docker -- both when the images are pulled to
-start the containers, and for writing images in the case of pods that generate
-images.
-  - accessing other cloud services, such as blob storage, in the context of a
-large, integrated cloud offering (hosted or private).
-  - accessing files in an NFS volume attached to the pod
-
-## Design Overview
-
-A service account binds together several things:
-  - a *name*, understood by users, and perhaps by peripheral systems, for an
-identity
-  - a *principal* that can be authenticated and [authorized](../admin/authorization.md)
-  - a [security context](security_context.md), which defines the Linux
-Capabilities, User IDs, Group IDs, and other capabilities and controls on
-interaction with the file system and OS.
-  - a set of [secrets](secrets.md), which a container may use to access various
-networked resources.
- -## Design Discussion - -A new object Kind is added: - -```go -type ServiceAccount struct { - TypeMeta `json:",inline" yaml:",inline"` - ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"` - - username string - securityContext ObjectReference // (reference to a securityContext object) - secrets []ObjectReference // (references to secret objects) -} -``` - -The name ServiceAccount is chosen because it is widely used already (e.g. by -Kerberos and LDAP) to refer to this type of account. Note that it has no -relation to Kubernetes Service objects. - -The ServiceAccount object does not include any information that could not be -defined separately: - - username can be defined however users are defined. - - securityContext and secrets are only referenced and are created using the -REST API. - -The purpose of the serviceAccount object is twofold: - - to bind usernames to securityContexts and secrets, so that the username can -be used as a succinct reference in contexts where explicitly naming securityContexts -and secrets would be inconvenient - - to provide an interface to simplify allocation of new securityContexts and -secrets. - -These features are explained later. - -### Names - -From the standpoint of the Kubernetes API, a `user` is any principal which can -authenticate to the Kubernetes API. This includes a human running `kubectl` on her -desktop and a container in a Pod on a Node making API calls. - -There is already a notion of a username in Kubernetes, which is populated into a -request context after authentication. However, there is no API object -representing a user. While this may evolve, it is expected that in mature -installations, the canonical storage of user identifiers will be handled by a -system external to Kubernetes. - -Kubernetes does not dictate how to divide up the space of user identifier -strings. User names can be simple Unix-style short usernames (e.g. 
`alice`), or -may be qualified to allow for federated identity (`alice@example.com` vs. -`alice@example.org`). Naming convention may distinguish service accounts from -user accounts (e.g. `alice@example.com` vs. -`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but -Kubernetes does not require this. - -Kubernetes also does not require that there be a distinction between human and -Pod users. It will be possible to set up a cluster where Alice the human talks to -the Kubernetes API as username `alice` and starts pods that also talk to the API -as user `alice` and write files to NFS as user `alice`. But this is not -recommended. - -Instead, it is recommended that Pods and Humans have distinct identities, and -reference implementations will make this distinction. - -The distinction is useful for a number of reasons: - - the requirements for humans and automated processes are different: - - Humans need a wide range of capabilities to do their daily activities. -Automated processes often have more narrowly-defined activities. - - Humans may better tolerate the exceptional conditions created by -expiration of a token. Remembering to handle this in a program is more annoying. -So, either long-lasting credentials or automated rotation of credentials is -needed. - - A Human typically keeps credentials on a machine that is not part of the -cluster and so not subject to automatic management. A VM with a -role/service-account can have its credentials automatically managed. - - the identity of a Pod cannot in general be mapped to a single human. - - If policy allows, it may be created by one human, and then updated by -another, and another, until its behavior cannot be attributed to a single human. - -**TODO**: consider getting rid of the separate serviceAccount object and just -rolling its parts into the SecurityContext or Pod Object. 
- -The `secrets` field is a list of references to /secret objects that a process -started as that service account should have access to in order to assert that -role. - -The secrets are not inline with the serviceAccount object. This way, most or -all users can have permission to `GET /serviceAccounts` so they can remind -themselves what serviceAccounts are available for use. - -Nothing will prevent creation of a serviceAccount with two secrets of type -`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and -client libraries will have some behavior, TBD, to handle the case of multiple -secrets of a given type (pick first or provide all and try each in order, etc). - -When a serviceAccount and a matching secret exist, then a `User.Info` for the -serviceAccount and a `BearerToken` from the secret are added to the map of -tokens used by the authentication process in the apiserver, and similarly for -other types. (We might have some types that do not do anything on the apiserver but -just get pushed to the kubelet.) - -### Pods - -The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If -this is unset, then a default value is chosen. If it is set, then the -corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account -Finalizer (see below). - -TBD: how policy limits which users can make pods with which service accounts. - -### Authorization - -Kubernetes API Authorization Policies refer to users. Pods created with a -`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to -authenticate to the Kubernetes APIserver as a particular user. So any policy -that is desired can be applied to them. - -A higher-level workflow is needed to coordinate creation of serviceAccounts, -secrets and relevant policy objects. Users are free to extend Kubernetes to put -this business logic wherever is convenient for them, though the Service Account -Finalizer is one place where this can happen (see below). 
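The "pick first" option mentioned above for handling multiple secrets of a given type could look roughly like the following. The `Secret` struct and `firstAuthSecret` helper are simplified, hypothetical stand-ins for illustration, not the real API types; only the type name `SecretTypeKubernetesAuth` comes from the text:

```go
package main

import (
	"errors"
	"fmt"
)

// Secret is a simplified stand-in for the real secret object.
type Secret struct {
	Name  string
	Type  string
	Token string
}

// firstAuthSecret sketches the "pick first" behavior: return the first
// secret of type SecretTypeKubernetesAuth from a serviceAccount's list.
func firstAuthSecret(secrets []Secret) (Secret, error) {
	for _, s := range secrets {
		if s.Type == "SecretTypeKubernetesAuth" {
			return s, nil
		}
	}
	return Secret{}, errors.New("no auth secret for service account")
}

func main() {
	secrets := []Secret{
		{Name: "registry-creds", Type: "SecretTypeDockercfg"},
		{Name: "api-token", Type: "SecretTypeKubernetesAuth", Token: "abc123"},
	}
	s, err := firstAuthSecret(secrets)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Name)
}
```

The alternative mentioned in the text ("provide all and try each in order") would instead return the full filtered list and let the caller fall through on authentication failure.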
- -### Kubelet - -The kubelet will treat as "not ready to run" (needing a finalizer to act on it) -any Pod which has an empty SecurityContext. - -The kubelet will set a default, restrictive, security context for any pods -created from non-Apiserver config sources (http, file). - -Kubelet watches apiserver for secrets which are needed by pods bound to it. - -**TODO**: how to only let kubelet see secrets it needs to know. - -### The service account finalizer - -There are several ways to use Pods with SecurityContexts and Secrets. - -One way is to explicitly specify the securityContext and all secrets of a Pod -when the pod is initially created, like this: - -**TODO**: example of pod with explicit refs. - -Another way is with the *Service Account Finalizer*, a plugin process which is -optional, and which handles business logic around service accounts. - -The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount -definitions. - -First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no -`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext -and secrets references for the corresponding `serviceAccount`. - -Second, if ServiceAccount definitions change, it may take some actions. - -**TODO**: decide what actions it takes when a serviceAccount definition changes. -Does it stop pods, or just allow someone to list ones that are out of spec? In -general, people may want to customize this? - -Third, if a new namespace is created, it may create a new serviceAccount for -that namespace. This may include a new username (e.g. -`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), -a new securityContext, a newly generated secret to authenticate that -serviceAccount to the Kubernetes API, and default policies for that service -account. - -**TODO**: more concrete example. What are typical default permissions for -default service account (e.g. 
readonly access to services in the same namespace -and read-write access to events in that namespace?) - -Finally, it may provide an interface to automate creation of new -serviceAccounts. In that case, the user may want to GET serviceAccounts to see -what has been created. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/service_accounts.md?pixel)]() - diff --git a/simple-rolling-update.md b/simple-rolling-update.md deleted file mode 100644 index c4a5f671..00000000 --- a/simple-rolling-update.md +++ /dev/null @@ -1,131 +0,0 @@ -## Simple rolling update - -This is a lightweight design document for simple -[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`. - -Complete execution flow can be found [here](#execution-details). See the -[example of rolling update](../user-guide/update-demo/) for more information. - -### Lightweight rollout - -Assume that we have a current replication controller named `foo` and it is -running image `image:v1`: - -`kubectl rolling-update foo [foo-v2] --image=myimage:v2` - -If the user doesn't specify a name for the 'next' replication controller, then -the 'next' replication controller is renamed to -the name of the original replication controller. - -Obviously there is a race here, where if you kill the client between deleting foo -and creating the new version of 'foo', you might be surprised about what is -there, but I think that's ok. See [Recovery](#recovery) below. - -If the user does specify a name for the 'next' replication controller, then the -'next' replication controller is retained with its existing name, and the old -'foo' replication controller is deleted. For the purposes of the rollout, we add -a unique-ifying label `kubernetes.io/deployment` to both the `foo` and -`foo-next` replication controllers. The value of that label is the hash of the -complete JSON representation of the `foo-next` or `foo` replication controller. 
-The name of this label can be overridden by the user with the -`--deployment-label-key` flag. - -#### Recovery - -If a rollout fails or is terminated in the middle, it is important that the user -be able to resume the rollout. To facilitate recovery in the case of a crash of -the updating process itself, we add the following annotations to each -replication controller in the `kubernetes.io/` annotation namespace: - * `desired-replicas` The desired number of replicas for this replication -controller (either N or zero) - * `update-partner` A pointer to the replication controller resource that is -the other half of this update (syntax `<name>`; the namespace is assumed to be -identical to the namespace of this replication controller.) - -Recovery is achieved by issuing the same command again: - -```sh -kubectl rolling-update foo [foo-v2] --image=myimage:v2 -``` - -Whenever the rolling update command executes, the kubectl client looks for -replication controllers called `foo` and `foo-next`; if they exist, an attempt -is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is -created, and the rollout is a new rollout. If `foo` doesn't exist, then it is -assumed that the rollout is nearly completed, and `foo-next` is renamed to -`foo`. Details of the execution flow are given below. - - -### Aborting a rollout - -Abort is assumed to want to reverse a rollout in progress. - -`kubectl rolling-update foo [foo-v2] --rollback` - -This is really just semantic sugar for: - -`kubectl rolling-update foo-v2 foo` - -With the added detail that it moves the `desired-replicas` annotation from -`foo-v2` to `foo`. - - -### Execution Details - -For the purposes of this example, assume that we are rolling from `foo` to -`foo-next` where the only change is an image update from `v1` to `v2`. - -If the user doesn't specify a `foo-next` name, then it is either discovered from -the `update-partner` annotation on `foo`. 
If that annotation doesn't exist, -then `foo-next` is synthesized using the pattern -`<controller-name>-<hash-of-controller-JSON>` - -#### Initialization - - * If `foo` and `foo-next` do not exist: - * Exit and indicate an error to the user that the specified controller -doesn't exist. - * If `foo` exists, but `foo-next` does not: - * Create `foo-next`, populate it with the `v2` image, and set -`desired-replicas` to `foo.Spec.Replicas` - * Goto Rollout - * If `foo-next` exists, but `foo` does not: - * Assume that we are in the rename phase. - * Goto Rename - * If both `foo` and `foo-next` exist: - * Assume that we are in a partial rollout - * If `foo-next` is missing the `desired-replicas` annotation - * Populate the `desired-replicas` annotation on `foo-next` using the -current size of `foo` - * Goto Rollout - -#### Rollout - - * While size of `foo-next` < `desired-replicas` annotation on `foo-next` - * increase size of `foo-next` - * if size of `foo` > 0 - decrease size of `foo` - * Goto Rename - -#### Rename - - * delete `foo` - * create `foo` that is identical to `foo-next` - * delete `foo-next` - -#### Abort - - * If `foo-next` doesn't exist - * Exit and indicate to the user that they may want to simply do a new -rollout with the old version - * If `foo` doesn't exist - * Exit and indicate not found to the user - * Otherwise, `foo-next` and `foo` both exist - * Set `desired-replicas` annotation on `foo` to match the annotation on -`foo-next` 
- - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/simple-rolling-update.md?pixel)]() - diff --git a/taint-toleration-dedicated.md b/taint-toleration-dedicated.md deleted file mode 100644 index c523319f..00000000 --- a/taint-toleration-dedicated.md +++ /dev/null @@ -1,291 +0,0 @@ -# Taints, Tolerations, and Dedicated Nodes - -## Introduction - -This document describes *taints* and *tolerations*, which constitute a generic -mechanism for restricting the set of pods that can use a node. We also describe -one concrete use case for the mechanism, namely to limit the set of users (or -more generally, authorization domains) who can access a set of nodes (a feature -we call *dedicated nodes*). There are many other uses--for example, a set of -nodes with a particular piece of hardware could be reserved for pods that -require that hardware, or a node could be marked as unschedulable when it is -being drained before shutdown, or a node could trigger evictions when it -experiences hardware or software problems or abnormal node configurations; see -issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and -[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion. - -## Taints, tolerations, and dedicated nodes - -A *taint* is a new type that is part of the `NodeSpec`; when present, it -prevents pods from scheduling onto the node unless the pod *tolerates* the taint -(tolerations are listed in the `PodSpec`). Note that there are actually multiple -flavors of taints: taints that prevent scheduling on a node, taints that cause -the scheduler to try to avoid scheduling on a node but do not prevent it, taints -that prevent a pod from starting on Kubelet even if the pod's `NodeName` was -written directly (i.e. pod did not go through the scheduler), and taints that -evict already-running pods. 
-[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) -has more background on these different scenarios. We will focus on the first -kind of taint in this doc, since it is the kind required for the "dedicated -nodes" use case. - -Implementing dedicated nodes using taints and tolerations is straightforward: in -essence, a node that is dedicated to group A gets taint `dedicated=A` and the -pods belonging to group A get toleration `dedicated=A`. (The exact syntax and -semantics of taints and tolerations are described later in this doc.) This keeps -all pods except those belonging to group A off of the nodes. This approach -easily generalizes to pods that are allowed to schedule into multiple dedicated -node groups, and nodes that are a member of multiple dedicated node groups. - -Note that because tolerations are at the granularity of pods, the mechanism is -very flexible -- any policy can be used to determine which tolerations should be -placed on a pod. So the "group A" mentioned above could be all pods from a -particular namespace or set of namespaces, or all pods with some other arbitrary -characteristic in common. We expect that any real-world usage of taints and -tolerations will employ an admission controller to apply the tolerations. For -example, to give all pods from namespace A access to dedicated node group A, an -admission controller would add the corresponding toleration to all pods from -namespace A. Or to give all pods that require GPUs access to GPU nodes, an -admission controller would add the toleration for GPU taints to pods that -request the GPU resource. - -Everything that can be expressed using taints and tolerations can be expressed -using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g. -in the example in the previous paragraph, you could put a label `dedicated=A` on -the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not* -belonging to group A. 
But it is cumbersome to express exclusion policies using -node affinity because every time you add a new type of restricted node, all pods -that aren't allowed to use those nodes need to start avoiding those nodes using -node affinity. This means the node affinity list can get quite long in clusters -with lots of different groups of special nodes (lots of dedicated node groups, -lots of different kinds of special hardware, etc.). Moreover, you also need to -update any Pending pods when you add new types of special nodes. In contrast, -with taints and tolerations, when you add a new type of special node, "regular" -pods are unaffected, and you just need to add the necessary toleration to the -pods you subsequently create that need to use the new type of special nodes. To -put it another way, with taints and tolerations, only pods that use a set of -special nodes need to know about those special nodes; with the node affinity -approach, pods that have no interest in those special nodes need to know about -all of the groups of special nodes. - -One final comment: in practice, it is often desirable to not only keep "regular" -pods off of special nodes, but also to keep "special" pods off of regular nodes. -An example in the dedicated nodes case is to not only keep regular users off of -dedicated nodes, but also to keep dedicated users off of non-dedicated (shared) -nodes. In this case, the "non-dedicated" nodes can be modeled as their own -dedicated node group (for example, tainted as `dedicated=shared`), and pods that -are not given access to any dedicated nodes ("regular" pods) would be given a -toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations -will be added by an admission controller.) In this case taints/tolerations are -still better than node affinity because with taints/tolerations each pod only -needs one special "marking", versus in the node affinity case where every time -you add a dedicated node group (i.e. 
a new `dedicated=` value), you need to add -a new node affinity rule to all pods (including pending pods) except the ones -allowed to use that new dedicated node group. - -## API - -```go -// The node this Taint is attached to has the effect "effect" on -// any pod that does not tolerate the Taint. -type Taint struct { - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - Value string `json:"value,omitempty"` - Effect TaintEffect `json:"effect"` -} - -type TaintEffect string - -const ( - // Do not allow new pods to schedule unless they tolerate the taint, - // but allow all pods submitted to Kubelet without going through the scheduler - // to start, and allow all already-running pods to continue running. - // Enforced by the scheduler. - TaintEffectNoSchedule TaintEffect = "NoSchedule" - // Like TaintEffectNoSchedule, but the scheduler tries not to schedule - // new pods onto the node, rather than prohibiting new pods from scheduling - // onto the node. Enforced by the scheduler. - TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule" - // Do not allow new pods to schedule unless they tolerate the taint, - // do not allow pods to start on Kubelet unless they tolerate the taint, - // but allow all already-running pods to continue running. - // Enforced by the scheduler and Kubelet. - TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit" - // Do not allow new pods to schedule unless they tolerate the taint, - // do not allow pods to start on Kubelet unless they tolerate the taint, - // and try to eventually evict any already-running pods that do not tolerate the taint. - // Enforced by the scheduler and Kubelet. - TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute" -) - -// The pod this Toleration is attached to tolerates any taint that matches -// the triple <key,value,effect> using the matching operator <operator>. 
-type Toleration struct { - Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"` - // operator represents a key's relationship to the value. - // Valid operators are Exists and Equal. Defaults to Equal. - // Exists is equivalent to wildcard for value, so that a pod can - // tolerate all taints of a particular category. - Operator TolerationOperator `json:"operator"` - Value string `json:"value,omitempty"` - Effect TaintEffect `json:"effect"` - // TODO: For forgiveness (#1574), we'd eventually add at least a grace period - // here, and possibly an occurrence threshold and period. -} - -// A toleration operator is the set of operators that can be used in a toleration. -type TolerationOperator string - -const ( - TolerationOpExists TolerationOperator = "Exists" - TolerationOpEqual TolerationOperator = "Equal" -) - -``` - -(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375) -to understand the motivation for the various taint effects.) - -We will add: - -```go - // Multiple tolerations with the same key are allowed. - Tolerations []Toleration `json:"tolerations,omitempty"` -``` - -to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type -TaintEffectPreferNoSchedule) in order to be able to schedule onto that node. - -We will add: - -```go - // Multiple taints with the same key are not allowed. - Taints []Taint `json:"taints,omitempty"` -``` - -to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union -of the taints specified by various sources. For now, the only source is -the `NodeSpec` itself, but in the future one could imagine a node inheriting -taints from pods (if we were to allow taints to be attached to pods), from -the node's startup configuration, etc. The scheduler should look at the `Taints` -in `NodeStatus`, not in `NodeSpec`. - -Taints and tolerations are not scoped to namespace. 
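The matching rules above can be sketched as a scheduler-style predicate. The types and function names below are simplified illustrations of the semantics described in this doc, not the actual scheduler code:

```go
package main

import "fmt"

// Simplified stand-ins for the API types defined above.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint:
// the Exists operator ignores the value, Equal (the default) requires an
// exact value match, and the effects must match.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Key != taint.Key || tol.Effect != taint.Effect {
		return false
	}
	if tol.Operator == "Exists" {
		return true
	}
	return tol.Value == taint.Value // "Equal" is the default operator
}

// podFits reports whether the pod tolerates every taint on the node,
// except taints of type PreferNoSchedule, per the rule stated above.
func podFits(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect == "PreferNoSchedule" {
			continue // a soft preference, handled by a priority function
		}
		ok := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "banana", Effect: "NoSchedule"}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "banana", Effect: "NoSchedule"}}
	fmt.Println(podFits(tols, taints)) // pod from the dedicated group
	fmt.Println(podFits(nil, taints))  // regular pod is kept off
}
```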
- -## Implementation plan: taints, tolerations, and dedicated nodes - -Using taints and tolerations to implement dedicated nodes requires these steps: - -1. Add the API described above. -1. Add a scheduler predicate function that respects taints and tolerations (for -TaintEffectNoSchedule) and a scheduler priority function that respects taints -and tolerations (for TaintEffectPreferNoSchedule). -1. Add code to the Kubelet to implement the "no admit" behavior of -TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute. -1. Implement code in Kubelet that evicts a pod that no longer satisfies -TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the -controllers instead, but since taints might be used to enforce security -policies, it is better to do it in the kubelet because the kubelet can respond quickly and -can guarantee the rules will be applied to all pods. Eviction may need to happen -under a variety of circumstances: when a taint is added, when an existing taint -is updated, when a toleration is removed from a pod, or when a toleration is -modified on a pod. -1. Add a new `kubectl` command that adds/removes taints to/from nodes. -1. (This is the one step that is specific to dedicated nodes.) Implement an -admission controller that adds tolerations to pods that are supposed to be -allowed to use dedicated nodes (for example, based on the pod's namespace). - -In the future one can imagine a generic policy configuration that configures an -admission controller to apply the appropriate tolerations to the desired class -of pods and taints to Nodes upon node creation. It could be used not just for -policies about dedicated nodes, but also for other uses of taints and tolerations, -e.g. nodes that are restricted due to their hardware configuration. - -The `kubectl` command to add and remove taints on nodes will be modeled after -`kubectl label`. 
Example usages: - -```sh -# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'. -# If a taint with that key already exists, its value and effect are replaced as specified. -$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute - -# Remove from node 'foo' the taint with key 'dedicated' if one exists. -$ kubectl taint nodes foo dedicated- -``` - -## Example: implementing a dedicated nodes policy - -Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available -only to pods in a particular namespace `banana`. First the administrator does: - -```sh -$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute -$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute -$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute - -``` - -(assuming they want to evict pods that are already running on those nodes if those -pods don't already tolerate the new taint) - -Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies -a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`. - -In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having -to enumerate them by name. - -## Future work - -At present, the Kubernetes security model allows any user to add and remove any -taints and tolerations. Obviously this makes it impossible to securely enforce -rules like dedicated nodes. We need some mechanism that prevents regular users -from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them -from mutating any fields of `NodeSpec`) and from mutating the `Tolerations` -field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549) -is relevant. - -Another security vulnerability arises if nodes are added to the cluster before -receiving their taint. 
Thus we need to ensure that a new node does not become -"Ready" until it has been configured with its taints. One way to do this is to -have an admission controller that adds the taint whenever a Node object is -created. - -A quota policy may want to treat nodes differently based on what taints, if any, -they have. For example, if a particular namespace is only allowed to access -dedicated nodes, then it may be convenient to give the namespace unlimited -quota. (To use finite quota, you'd have to size the namespace's quota to the sum -of the sizes of the machines in the dedicated node group, and update it when -nodes are added/removed to/from the group.) - -It's conceivable that taints and tolerations could be unified with -[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265). -We have chosen not to do this for the reasons described in the "Future work" -section of that doc. - -## Backward compatibility - -Old scheduler versions will ignore taints and tolerations. New scheduler -versions will respect them. - -Users should not start using taints and tolerations until the full -implementation has been in Kubelet and the master for enough binary versions -that we feel comfortable that we will not need to roll back either Kubelet or -master to a version that does not support them. Longer-term we will use a -programmatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)). - -## Related issues - -This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190). -There are a number of other related issues, all of which are linked to from -[#17190](https://github.com/kubernetes/kubernetes/issues/17190). - -The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574). - -The concepts of taints and tolerations were originally developed as part of the -Omega project at Google. 
- - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/taint-toleration-dedicated.md?pixel)]() - diff --git a/ubernetes-cluster-state.png b/ubernetes-cluster-state.png deleted file mode 100644 index 56ec2df8..00000000 Binary files a/ubernetes-cluster-state.png and /dev/null differ diff --git a/ubernetes-design.png b/ubernetes-design.png deleted file mode 100644 index 44924846..00000000 Binary files a/ubernetes-design.png and /dev/null differ diff --git a/ubernetes-scheduling.png b/ubernetes-scheduling.png deleted file mode 100644 index 01774882..00000000 Binary files a/ubernetes-scheduling.png and /dev/null differ diff --git a/versioning.md b/versioning.md deleted file mode 100644 index ae724b12..00000000 --- a/versioning.md +++ /dev/null @@ -1,174 +0,0 @@ -# Kubernetes API and Release Versioning - -Reference: [Semantic Versioning](http://semver.org) - -Legend: - -* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released. -This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the -major version, **Y** is the minor version, and **Z** is the patch version.) -* **API vX[betaY]** refers to the version of the HTTP API. - -## Release versioning - -### Minor version scheme and timeline - -* Kube X.Y.0-alpha.W, W > 0 (Branch: master) - * Alpha releases are released roughly every two weeks directly from the master -branch. - * No cherrypick releases. If there is a critical bugfix, a new release from -master can be created ahead of schedule. -* Kube X.Y.Z-beta.W (Branch: release-X.Y) - * When master is feature-complete for Kube X.Y, we will cut the release-X.Y -branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential -to X.Y. - * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.Y+1.0-alpha.0. - * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases, -(X.Y.0-beta.W | W > 0) as necessary. 
-* Kube X.Y.0 (Branch: release-X.Y) - * Final release, cut from the release-X.Y branch created two weeks prior. - * X.Y.1-beta.0 will be tagged at the same commit on the same branch. - * X.Y.0 occurs 3 to 4 months after X.(Y-1).0. -* Kube X.Y.Z, Z > 0 (Branch: release-X.Y) - * [Patch releases](#patch-releases) are released as we cherrypick commits into -the release-X.Y branch (which is at X.Y.Z-beta.W) as needed. - * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.Z+1-beta.0 is -tagged on the followup commit that updates pkg/version/base.go with the beta -version. -* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z) - * These are special and different in that the X.Y.Z tag is branched to isolate -the emergency/critical fix from all other changes that have landed on the -release branch since the previous tag. - * Cut release-X.Y.Z branch to hold the isolated patch release - * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1) - * Branched [patch releases](#patch-releases) are rarely needed but used for -emergency/critical fixes to the latest release. - * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed -for this kind of release to be possible. - -### Major version timeline - -There is no mandated timeline for major versions. They only occur when we need -to start the clock on deprecating features. A given major version should be the -latest major version for at least one year from its original release date. - -### CI and dev version scheme - -* Continuous integration versions also exist, and are versioned off of alpha and -beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an -additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after -X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds -built from a dirty build tree (during development, with things in the tree that -are not checked in) will have -dirty appended. 
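As a rough illustration (not an official grammar), the release and CI version forms described above — X.Y.Z, X.Y.Z-alpha.W, X.Y.Z-beta.W, the CI builds with a `+build` suffix, and the optional -dirty marker — can be captured in a single pattern; `versionRE` is a name assumed here for the sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRE is an illustrative pattern for the version forms above.
// It is a sketch of this doc's scheme, not an official validator.
var versionRE = regexp.MustCompile(
	`^\d+\.\d+\.\d+(-(alpha|beta)\.\d+(\.\d+\+[0-9a-f]+)?)?(-dirty)?$`)

func main() {
	for _, v := range []string{
		"1.2.0-alpha.3",          // alpha release from master
		"1.2.0-beta.1.15+8f9d2c", // CI build: 15 commits after 1.2.0-beta.1
		"1.2.0",                  // final release
		"1.2",                    // incomplete; should not match
	} {
		fmt.Printf("%-25s valid=%v\n", v, versionRE.MatchString(v))
	}
}
```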
- -### Supported releases and component skew - -We expect users to stay reasonably up-to-date with the versions of Kubernetes -they use in production, but understand that it may take time to upgrade, -especially for production-critical components. - -We expect users to be running approximately the latest patch release of a given -minor release; we often include critical bug fixes in -[patch releases](#patch-releases), and so encourage users to upgrade as soon as -possible. - -Different components are expected to be compatible across different amounts of -skew, all relative to the master version. Nodes may lag master components by -up to two minor versions but should be at a version no newer than the master; a -client should be skewed no more than one minor version from the master, but may -lead the master by up to one minor version. For example, a v1.3 master should -work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4 -clients. - -Furthermore, we expect to "support" three minor releases at a time. "Support" -means we expect users to be running that version in production, though we may -not port fixes back before the latest minor version. For example, when v1.3 -comes out, v1.0 will no longer be supported: basically, that means that the -reasonable response to the question "my v1.0 cluster isn't working," is, "you -should probably upgrade it (and probably should have some time ago)". With -minor releases happening approximately every three months, that means a minor -release is supported for approximately nine months. - -This policy is in line with -[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade). - -## API versioning - -### Release versions as related to API versions - -Here is an example major release cycle: - -* **Kube 1.0 should have API v1 without v1beta\* API versions** - * The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have -the stable v1 API.
This enables you to migrate all your objects off of the beta -versions of the API and allows us to remove those beta API versions in Kube -1.0 with no ill effect. There will be tooling to help you detect and migrate any -v1beta\* data versions or calls to v1 before you do the upgrade. -* **Kube 1.x may have API v2beta*** - * The first incarnation of a new (backwards-incompatible) API in HEAD is - v2beta1. By default this will be unregistered in apiserver, so it can change - freely. Once it is available by default in apiserver (which may not happen for -several minor releases), it cannot change ever again because we serialize -objects in versioned form, and we always need to be able to deserialize any -objects that are saved in etcd, even between alpha versions. If further changes -to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x -versions. -* **Kube 1.y (where y is the last version of the 1.x series) must have final -API v2** - * Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two - things: (1) users can upgrade to API v2 when running Kube 1.x and then switch - over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can - clean up and remove all API v2beta\* versions because no one should have - v2beta\* objects left in their database. As mentioned above, tooling will exist - to make sure there are no calls or references to a given API version anywhere - inside someone's kube installation before someone upgrades. - * Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. -It *may* include the v1 API as well if the burden is not high - this will be -determined on a per-major-version basis. - -#### Rationale for API v2 being complete before v2.0's release - -It may seem a bit strange to complete the v2 API before v2.0 is released, -but *adding* a v2 API is not a breaking change.
*Removing* the v2beta\* -APIs *is* a breaking change, which is what necessitates the major version bump. -There are other ways to do this, but having the major release be the fresh start -of that release's API without the baggage of its beta versions seems most -intuitive out of the available options. - -## Patch releases - -Patch releases are intended for critical bug fixes to the latest minor version, -such as addressing security vulnerabilities, fixes to problems affecting a large -number of users, severe problems with no workaround, and blockers for products -based on Kubernetes. - -They should not contain miscellaneous feature additions or improvements, and -especially no incompatibilities should be introduced between patch versions of -the same minor version (or even major version). - -Dependencies, such as Docker or Etcd, should also not be changed unless -absolutely necessary, and then only to fix critical bugs (so, at most patch -version changes, not new major or minor versions). - -## Upgrades - -* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a -rolling upgrade across their cluster. (Rolling upgrade means being able to -upgrade the master first, then one node at a time. See #4855 for details.) - * However, we do not recommend upgrading more than two minor releases at a -time (see [Supported releases](#supported-releases)), and do not recommend -running non-latest patch releases of a given minor release. -* No hard breaking changes over version boundaries. - * For example, if a user is at Kube 1.x, we may require them to upgrade to -Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across -major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as -graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone -to go from 1.x to 1.x+y before they go to 2.x. - -There is a separate question of how to track the capabilities of a kubelet to -facilitate rolling upgrades.
That is not addressed here. - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/versioning.md?pixel)]() - diff --git a/volume-snapshotting.md b/volume-snapshotting.md deleted file mode 100644 index e92ed3d1..00000000 --- a/volume-snapshotting.md +++ /dev/null @@ -1,523 +0,0 @@ -Kubernetes Snapshotting Proposal -================================ - -**Authors:** [Cindy Wang](https://github.com/ciwang) - -## Background - -Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). - -Typical existing backup solutions offer on-demand or scheduled snapshots. - -An application developer using a storage system may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume. - -Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves. - -## Objectives - -For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals. - -* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers.
- - * Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods. - -* Goal 2: Expose standardized snapshotting operations Create and List in Kubernetes REST API. - - * Nongoal: Support Delete and Restore snapshot operations in API. - -* Goal 3: Implement snapshotting interface for GCE PDs. - - * Nongoal: Implement snapshotting interface for non GCE PD volumes. - -### Feature Roadmap - -Major features, in order of priority (bold features are priorities for v1): - -* **On-demand snapshots** - - * **API to create new snapshots and list existing snapshots** - - * API to restore a disk from a snapshot and delete old snapshots - -* Scheduled snapshots - -* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node) - -## Requirements - -### Performance - -* Time SLA from issuing a snapshot to completion: - -* The period we are interested in is the time between the scheduled snapshot time and the time the snapshot finishes uploading to its storage location. - -* This should be on the order of a few minutes. - -### Reliability - -* Data corruption - - * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons: - - * GCE and Amazon can create snapshots while the application is running. - - * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code. - - * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown.
- -* Snapshot failure - - * Case: Failure during external process, such as during API call or upload - - * Log error, retry until success (indefinitely) - - * Case: Failure within Kubernetes, such as controller restarts - - * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state. - -## Solution Overview - -Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects. - -* **Create:** - - * Key: create.snapshot.volume.alpha.kubernetes.io - - * Value: [snapshot name] - -* **List:** - - * Key: snapshot.volume.alpha.kubernetes.io/[snapshot name] - - * Value: [snapshot timestamp] - -A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotation will be populated by the controller and will identify only the snapshots created for that PVC by Kubernetes. - -The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage). - -## Detailed Design - -### API - -* Create snapshot - - * Usage: - - * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter - - * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list.
- - * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually. - - * PVC control loop in master - - * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV. - - * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV. - -* List existing snapshots - - * Only displayed as annotations on PVC object. - - * Only lists unique names and timestamps of snapshots taken using the Kubernetes API. - - * Usage: - - * Get the PVC object - - * Snapshots are listed as key-value pairs within the PVC annotations - -### SnapshotController - -![Snapshot Controller Diagram](volume-snapshotting.png?raw=true "Snapshot controller diagram") - -**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests. - -**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully. - -**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful. - -The controller will have a loop that does the following: - -* Fetch State - - * Fetch all PVC objects from the API server. - -* Act - - * Trigger snapshot: - - * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation. 
- -* Persist State - - * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC Annotations and delete the create snapshot annotation in the PVC object via the API server. - -Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the reconciler should spawn separate threads for these operations via the operation executor. - -The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor. - -### Create Snapshot Logic - -To create a snapshot: - -* Acquire operation lock for the volume so that no other attach or detach operations can be started for it. - - * Abort if there is already a pending operation for the specified volume (main loop will retry, if needed). - -* Spawn a new thread: - - * Execute the volume-specific logic to create a snapshot of the persistent volume referenced by the PVC. - - * For any errors, log the error, and terminate the thread (the main controller will retry as needed). - - * Once a snapshot is created successfully: - - * Make a call to the API server to delete the create snapshot annotation in the PVC object. - - * Make a call to the API server to add the new snapshot ID/timestamp to the PVC Annotations. - -*Brainstorming notes below, read at your own risk!* - -* * * - - -Open questions: - -* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API? - - * It seems that the API route is a bit more feasible in implementation and can also be fully utilized. - - * Can the API call methods on VolumePlugins? Yeah via controller - - * The scheduler gives users functionality that doesn’t already exist, but required adding an entirely new controller - -* Should the list and restore operations be part of v1? - -* Do we call them snapshots or backups?
- - * From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice it is necessary, but not sufficient, when conducting a backup of a stateful application." - -* At what minimum granularity should snapshots be allowed? - -* How do we store information about the most recent snapshot in case the controller restarts? - -* In case of error, do we err on the side of fewer or more snapshots? - -Snapshot Scheduler - -1. PVC API Object - -A new field, backupSchedule, will be added to the PVC API Object. The value of this field must be a cron expression. - -* CRUD operations on snapshot schedules - - * Create: Specify a snapshot within a PVC spec as a [cron expression](http://crontab-generator.org/) - - * The cron expression provides flexibility to decrease the interval between snapshots in future versions - - * Read: Display snapshot schedule to user via kubectl get pvc - - * Update: Do not support changing the snapshot schedule for an existing PVC - - * Delete: Do not support deleting the snapshot schedule for an existing PVC - - * In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name - -* Validation - - * Cron expressions must have a 0 in the minutes place and use exact, not interval syntax - - * [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, GCE PD takes at most minutes. Therefore for v1, we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals). - - * If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user - - Objective - -Goal: Enable automatic periodic snapshotting (NOTE: A snapshot is a read-only copy of a disk.) for all Kubernetes volume plugins.
- -Goal: Implement snapshotting interface for GCE PDs. - -Goal: Protect against data loss by allowing users to restore snapshots of their disks. - -Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes. - -Nongoal: Use snapshotting to provide additional features such as migration. - - Background - -Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). - -Currently, no container orchestration software (i.e. Kubernetes and its competitors) provides snapshot scheduling for application storage. - -Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes a clear competitive advantage for users who want automatic snapshotting on their volumes, and particularly those who want to configure application-specific schedules. - - what is the value case? Who wants this? What do we enable by implementing this? - -I think it introduces a lot of complexity, so what is the payoff? That should be clear in the document. Do mesos, or swarm or our competition implement this? AWS? Just curious. - -Requirements - -Functionality - -Should this support PVs, direct volumes, or both? - -Should we support deletion? - -Should we support restores? - -Automated schedule -- times or intervals? Before major event? - -Performance - -Snapshots are supposed to provide timely state freezing.
What is the SLA from issuing one to it completing? - -* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting then remounted afterwards. - - * Pending = uploading to GCE - -* EBS is the same, but if the volume is the root device the instance should be stopped before snapshotting - -Reliability - -How do we ascertain that deletions happen when we want them to? - -For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes. - -We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached certain age. At this point we do not see an immediate need to implement automatic deletion (re:Saad) but may want to revisit this. - -What happens when the snapshot fails as these are async operations? - -Retry (for some time period? indefinitely?) and log the error - -Other - -What is the UI for seeing the list of snapshots? - -In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (i.e. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots. - -Overview - -There are several design options for the design of each layer of implementation as follows. - -1. **Public API:** - -Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically. 
There are several options for where this specification can happen. In order from most to least invasive: - - 1. New Volume API object - - 1. Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects and its details are lost upon destruction of its enclosing object. - - 2. We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot. - - 3. The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object. - - 4. Pros - - 1. Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule. - - 5. Cons - - 2. Heavyweight codewise; involves changing and touching a lot of existing code. - - 3. Potentially bad UX: How is the Volume API object created? - - 1. By the user independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands. - - 2. When a unique volume is specified in a pod or PV spec. - - 2. Directly in volume definition in the pod/PV object - - 6. When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. - - 7. Pros - - 4. Easy for users to implement and understand - - 8. Cons - - 5. The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots. 
If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used. - - 6. Replica sets share the same pod spec, so support needs to be added to ensure that the underlying volume does not get a separate snapshot for each member of the set. - - 3. Only in PV object - - 9. When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule. - - 10. Pros - - 7. Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property. - - 11. Cons - - 8. No support for direct volumes - - 9. Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do and the same can be achieved with a simple cron job - - 10. Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules. - - 4. Annotations: key value pairs on API object - - 12. User experience is the same as (b) - - 13. Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], “ssTimes-vol2”: [2,17]} where the values are slices of integer values representing UTC hours. - - 14. Pros - - 11. Less invasive to the codebase than (a-c) - - 15. Cons - - 12. Same problems as (b-c) with non-unique resources. The only difference here is the API object representation. - -2. **Business logic:** - - 5. Does this go on the master, node, or both? - - 16. Where the snapshot is stored - - 13. GCE, Amazon: cloud storage - - 14.
Others stored on volume itself (gluster) or external drive (iSCSI) - - 17. Requirements for snapshot operation - - 15. Application flush, sync, and fsfreeze before creating snapshot - - 6. Suggestion: - - 18. New SnapshotController on master - - 16. Controller keeps a list of active pods/volumes, schedule for each, last snapshot - - 17. If controller restarts and we miss a snapshot in the process, just skip it - - 3. Alternatively, try creating the snapshot up to the time + retryPeriod (see 5) - - 18. If snapshotting call fails, retry for an amount of time specified in retryPeriod - - 19. Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep list of snapshot times, calculate time until next snapshot, and sleep for that period - - 19. Logic to prepare the disk for snapshotting on node - - 20. Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD) - - 7. Alternatives: logic entirely on node - - 20. Problems: - - 21. If pod moves from one node to another - - 4. A different node is now in charge of snapshotting - - 5. If the volume plugin requires external memory for snapshots, we need to move the existing data - - 22. If the same pod exists on two different nodes, which node is in charge? - -3. **Volume plugin interface/internal API:** - - 8. Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin) - - 9. When logic is triggered for a snapshot by the SnapshotController, the SnapshottableVolumePlugin calls out to volume plugin API to create snapshot - - 10. Similar to volume.attach call - -4. **Other questions:** - - 11. Snapshot period - - 12. Time or period - - 13. What is our SLO around time accuracy? - - 21. Best effort, but no guarantees (depends on time or period) -- if going with time. - - 14. What if we miss a snapshot? - - 22.
We will retry (assuming this means that we failed) -- take at the nearest next opportunity - - 15. Will we know when an operation has failed? How do we report that? - - 23. Get response from volume plugin API, log in kubelet log, generate Kube event in success and failure cases - - 16. Will we be responsible for GCing old snapshots? - - 24. Maybe this can be explicit non-goal, in the future can automate garbage collection - - 17. If the pod dies do we continue creating snapshots? - - 18. How to communicate errors (PD doesn’t support snapshotting, time period unsupported) - - 19. Off schedule snapshotting like before an application upgrade - - 20. We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this? - -Options, pros, cons, suggestion/recommendation - -Example 1b - -During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod’s associated volume. - -For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2): - -apiVersion: v1 -kind: Pod -metadata: - name: test-pd -spec: - containers: - - image: gcr.io/google_containers/test-webserver - name: test-container - volumeMounts: - - mountPath: /test-pd - name: test-volume - volumes: - - name: test-volume - # This GCE PD must already exist. - gcePersistentDisk: - pdName: my-data-disk - fsType: ext4 - -Introduce a new field into the volume spec: - -apiVersion: v1 -kind: Pod -metadata: - name: test-pd -spec: - containers: - - image: gcr.io/google_containers/test-webserver - name: test-container - volumeMounts: - - mountPath: /test-pd - name: test-volume - volumes: - - name: test-volume - # This GCE PD must already exist. 
- gcePersistentDisk: - pdName: my-data-disk - fsType: ext4 - -**ssTimes: [1, 5]** - - Caveats - -* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because - - * this does not provide value to the user and only adds an extra layer of indirection/complexity. - - * ? - - Dependencies - -* Kubernetes - -* Persistent volume snapshot support through API - - * POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot - - - - -[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/volume-snapshotting.md?pixel)]() - diff --git a/volume-snapshotting.png b/volume-snapshotting.png deleted file mode 100644 index 1b1ea748..00000000 Binary files a/volume-snapshotting.png and /dev/null differ -- cgit v1.2.3